METR
2026-04-21 07:00 UTC
I. Introduction

We want to measure and understand how much AI agents can accelerate AI R&D and how this is changing over time. There are various sources of evidence we can look to here, including anecdotes about autonomous contributions (AlphaEvolve and TTT-Discover speeding up GPU kernels, autoresearch yielding speedups in nanochat), progress on benchmarks, and uplift measurement (see our recent post for a longer discussion).

One interesting source of evidence is cumulative progress on publicly tracked challenges like the NanoGPT speedrun, where we can compare agent contributions to human progress over time. Such challenges and leaderboards of cumulative progress on a task are especially useful when:

- The task maps to real AI R&D (e.g., pretraining a language model).
- Many contributors have built up a rich history of progress, giving a rough sense of how much human effort went into it (a cost curve).
- Agents can compete under comparable conditions and potentially make new contributions.

Let’s look at one such leaderboard: the NanoGPT speedrun. The goal is to train a language model to a target validation loss on FineWeb using 8×H100 GPUs as fast as possible. It’s a small-scale version of LLM pretraining with a public history of contributions, including four recent ones credited to AI agents as of April 2026. The optimization work maps to pretraining research activities such as changing architectures, writing kernels, and improving optimizers. Contributions, such as the Muon optimizer, ha…