
Scaling Laws

Empirical relationships showing that LLM performance improves predictably as a power law with increases in model parameters, training compute, and data size, enabling researchers to forecast model quality before training.

Kaplan et al. (2020) at OpenAI established the first LLM scaling laws: test loss decreases as a power law in model size (N), dataset tokens (D), and compute (C). Crucially, these relationships are smooth and hold across many orders of magnitude. This turned LLM development from empirical alchemy into an engineering problem: pick a quality target, then run the scaling law backward to find the training budget required to reach it.
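Running the law backward can be sketched in a few lines. The power-law form below follows Kaplan et al.'s loss-versus-parameters relationship, L(N) = (N_c / N)^α; the constants used here are illustrative assumptions, not the paper's exact fitted values.

```python
# Sketch of a Kaplan-style power law in model size N.
# n_c and alpha_n are illustrative constants, not authoritative fits.

def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Predicted test loss as a power law in parameter count."""
    return (n_c / n_params) ** alpha_n

def params_for_loss(target_loss, n_c=8.8e13, alpha_n=0.076):
    """Invert the law: parameters needed to reach a target loss."""
    return n_c / target_loss ** (1.0 / alpha_n)

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.3f}")
```

Because the curve is smooth, the inverse function gives a budget forecast: choose the loss you need, and `params_for_loss` returns the model size implied by the fit.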

Hoffmann et al. (2022) (the Chinchilla paper) revised the optimal compute allocation: earlier models were undertrained, spending their budgets on too many parameters and too few tokens. The Chinchilla laws prescribe roughly 20 training tokens per parameter for a compute-optimal model. This finding led to a generation of smaller, better-trained models (the Llama and Mistral families) that outperformed larger but undertrained predecessors.
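The allocation rule is simple enough to compute directly. Using the common approximation that training FLOPs C ≈ 6·N·D, and the Chinchilla ratio D ≈ 20·N, a FLOP budget splits as follows (both the 6ND approximation and the 20:1 ratio are rules of thumb, not exact):

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a training FLOP budget C into compute-optimal
    parameters N and tokens D, assuming C ~= 6 * N * D and
    D ~= tokens_per_param * N (the Chinchilla rule of thumb)."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's own budget of roughly 5.8e23 FLOPs:
n, d = chinchilla_optimal(5.8e23)
print(f"N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```

For that budget the formula recovers roughly 70B parameters and 1.4T tokens, which matches the published Chinchilla configuration.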

As models push toward AGI, there is evidence that scaling laws for next-token prediction may plateau as the irreducible entropy of text is approached. Inference-time compute scaling (chain-of-thought, extended thinking, test-time training) is emerging as the next scaling axis, with OpenAI's o3 and Claude 3.7 Sonnet's extended thinking suggesting that more reasoning at inference time can substitute for more parameters at training time.

Related terms
pre-training · emergent-abilities · mixture-of-experts · capability-overhang · inference-time-compute