In machine learning, researchers split their data into training and test sets, let model builders compete on the test set, and call it a benchmark. Statistical tradition prescribed locking test sets in a vault, but machine learning practitioners shared them freely. Benchmarking shouldn’t have worked, but it did, and the machine learning community never figured out the science behind it. How did benchmarking, despite its flaws, lead to advances in AI? In The Emerging Science of Machine Learning Benchmarks, Moritz Hardt investigates why benchmarking works, and what purpose it serves.
Hardt draws on a growing body of work that has begun to lay out the science underpinning benchmarks; what emerges is a rich landscape of theoretical and empirical observations that can inform practitioners. He begins with the foundations, both mathematical and empirical, covering enough background material to make the book self-contained. He finds that model rankings, rather than model evaluation, are the primary scientific product of machine learning benchmarks. Turning to the challenges of benchmarking large language models, Hardt explains how benchmarks influence model training, complicating direct model comparisons. As model capabilities exceed those of human evaluators, researchers are running out of ways to test new models. If benchmarks are to serve us well in the future, we must place them on solid scientific ground. With this book, Hardt lays the foundation.
Moritz Hardt is director at the Max Planck Institute for Intelligent Systems and an honorary professor at the University of Tübingen. He is the coauthor of Patterns, Predictions, and Actions: Foundations of Machine Learning (Princeton) and Fairness and Machine Learning: Limitations and Opportunities.
- Figures
- Preface
- Overview
- Who is this book for?
- Acknowledgments
- Prologue
- 1 Introduction From its roots, machine learning embraces the anything goes principle of scientific discovery. Machine learning benchmarks become the iron rule to tame the anything goes. But after decades of service, a crisis grips the benchmarking enterprise.
- The iron rule
- The ImageNet era
- The LLM era
- 2 Populations and predictions The mathematical foundations of machine learning follow the astronomical conception of society: Populations are probability distributions. Optimal predictors minimize loss functions on a probability distribution.
- 2.1 Prediction
- 2.2 Risk minimization
- 2.3 Errors and metrics
- 2.4 Model training
- Notes
- 3 Detecting differences A single statistical problem illuminates many of the mathematical tools necessary for benchmarking. The key lesson is that sample requirements grow quadratically in the inverse of the difference we try to detect.
- 3.1 Model comparisons from samples
- 3.2 Coin tossing
- 3.3 Distances between distributions
- 3.4 Concentration inequalities
- 3.5 From coin tosses back to benchmarking
- Notes
- 4 Holdout method The holdout method separates training and testing data: anything goes on the training data, iron rule on the testing data. Not all uses of the holdout method are alike.
- 4.1 Testing on the training set
- 4.2 Generalization
- 4.3 The holdout method
- 4.4 What’s the holdout method for?
- 4.5 Variants of the holdout method
- 4.6 Error bars and confidence intervals
- Notes
- 5 Test set reuse Statistics prescribes the iron vault for test data. But the empirical reality of machine learning benchmarks couldn’t be further from the prescription. Repeated adaptive testing brings theoretical risks and practical power.
- 5.1 Test set reuse in machine learning benchmarks
- 5.2 Guarantees of the holdout method under adaptivity
- 5.3 Alternatives to the holdout method
- 5.4 Freedman’s paradox
- Notes
- 6 Scientific crisis A replication crisis has long gripped the empirical sciences. Statistical practice is vulnerable for fundamental reasons. Under competition, researcher degrees of freedom outwit statistical measurement.
- 6.1 The replication crisis in the statistical sciences
- 6.2 Propensity of false positives
- 6.3 Perspectives on the crisis
- 6.4 Goodhart’s law
- Notes
- 7 Replication in machine learning The preconditions for crisis exist in machine learning, too. And yet, the situation in machine learning is different. While accuracy numbers don’t replicate, model rankings replicate to a significant degree.
- 7.1 The preconditions for crisis
- 7.2 Replication in machine learning
- 7.3 The trouble with absolute benchmark numbers
- 7.4 Model rankings in the ImageNet era
- 7.5 Measurement versus ranking
- Notes
- 8 Forces against crisis If machine learning thwarted scientific crisis, the question is why. Some powerful explanations emerge. Key are the social norms and practices of the community rather than statistical methodology.
- 8.1 Beating the previous best
- 8.2 Biases and heuristics
- 8.3 Rip Van Winkle’s replication problem
- 8.4 The touch of the Blenheim Spaniel
- 8.5 Code and collaboration
- 8.6 Kaggle versus science
- 8.7 From benchmarking to scientific progress
- Notes
- 9 Labeling and annotation If the holdout method is the greatest unsung hero, data annotation is not far behind. But conventional wisdom clouds the subtle role that annotation plays for benchmarking.
- 9.1 Annotator errors and annotator agreement
- 9.2 Labeling as prediction
- 9.3 Effects of label errors on model comparisons
- 9.4 Quantity versus quality
- 9.5 Resilience of rankings to label errors
- Notes
- 10 Generative models The ImageNet era ends as attention shifts to powerful generative models trained on the internet. The new era also marks a turning point for machine learning benchmarks.
- 10.1 Language models
- 10.2 Scaling
- 10.3 Early NLP benchmarks
- 10.4 CLIP and a final look at ImageNet
- Notes
- 11 Evaluating language models After training, alignment fits models to human preferences. Part of the post-training pipeline, alignment transforms evaluation results. How post-training makes such a difference brings new challenges for benchmarking.
- 11.1 Post-training methods
- 11.2 Generative evaluation
- 11.3 Confounded evaluations
- 11.4 Model comparisons and rankings
- Notes
- 12 The problem of aggregation Multi-task benchmarks promise a holistic evaluation of complex models. An analogy with voting systems reveals limitations in multi-task benchmarks. Greater diversity comes at the cost of greater sensitivity to artifacts.
- 12.1 Multi-task benchmarks
- 12.2 Problems of aggregation and voting systems
- 12.3 Ranked voting
- 12.4 Rated voting
- 12.5 Empirical trade-offs in multi-task benchmarks
- 12.6 Latent factors in benchmark performance
- Notes
- 13 When the model moves the data Models deployed at scale always influence future data, a phenomenon called performativity. Performativity breaks evaluation and creates the problem of data feedback loops. Dynamic benchmarks try to make a virtue out of it.
- 13.1 Morgenstern’s prophecy about prediction
- 13.2 Performative prediction
- 13.3 Repeated risk minimization
- 13.4 Data feedback loops
- 13.5 Dynamic benchmarks
- Notes
- 14 Evaluation at the frontier As models gain in capabilities, human supervision increasingly becomes a bottleneck. The hope is that models will supervise and evaluate each other, but there are limits to automatic evaluation.
- 14.1 LLM as a judge
- 14.2 Debiasing evaluations
- 14.3 Restricted model evaluation strategies
- 14.4 Evaluation in the real world
- Notes
- 15 Epilogue
- References
- Index
“In a moment of crisis for the field—where we struggle to devise tests today’s systems cannot ace—this book places benchmarks in the spotlight where they belong and asks the almost heretical question: why do they work as well as they do? Engrossing, page-turning, lucid, and crucial. Moritz Hardt has written a masterpiece.”—Brian Christian, author of The Alignment Problem: Machine Learning and Human Values
“From the calculated moves of grandmaster board games to the frontiers of statistical physics and protein chemistry, Moritz Hardt reveals the story of the benchmark, the intelligence yardstick that transformed AI from a hope and a prayer into a trillion-dollar arms race. Written with the deep insight of a leading expert, this is the definitive look at the benchmarks that define our future—and the secrets they keep.”—David L. Donoho, Stanford University
“For decades, machine learning has progressed with a predictable cycle of training and testing statistical machines on curated datasets. Moritz Hardt explains why this simple approach worked better than most of us expected, and why it faces daunting challenges in the age of pervasive, internet-scale AI. This is an essential read for anyone seeking to understand what all these benchmark scores really mean.”—Léon Bottou, Flatiron Institute
“Excellently written, Moritz Hardt provides a fascinating read on a topic of central importance to AI: the use, science, and implications of benchmarks in machine learning. Whether you are an expert, an interested student, or a member of the general public curious about AI—including how we got here and where things are going—you will learn something valuable and new from this book.”—Avrim Blum, Toyota Technological Institute at Chicago
This publication has been produced to meet accepted Accessibility standards and contains various accessibility features including concise image descriptions, a table of contents, a page list to navigate to pages corresponding to the print source version, and elements such as headings for structured navigation. Appearance of the text and page layout can be modified according to the capabilities of the reading system.
Accessibility Features
- WCAG v2.2
- WCAG level AA
- Table of contents navigation
- Single logical reading order
- Short alternative textual descriptions
- Print-equivalent page numbering
- Landmark navigation
- Index navigation
- Epub Accessibility Specification 1.1
- ARIA roles provided
- All non-decorative content supports reading without sight
- No known hazards or warnings