In machine learning, researchers split their data into training and test sets, let model builders compete on the test set, and call it a benchmark. Statistical tradition prescribed locking test sets in a vault, but machine learning practitioners shared them freely. Benchmarking shouldn’t have worked, but it did, and the machine learning community never figured out the science behind it. How did benchmarking, despite its flaws, lead to advances in AI? In The Emerging Science of Machine Learning Benchmarks, Moritz Hardt investigates why benchmarking works, and what purpose it serves.
Hardt draws on a growing body of work that has begun to lay out the science underpinning benchmarks; what emerges is a rich landscape of theoretical and empirical observations that can inform practitioners. He begins with the foundations, both mathematical and empirical, covering enough background material to make the book self-contained. He finds that model rankings, rather than model evaluation, are the primary scientific product of machine learning benchmarks. Turning to the challenges of benchmarking large language models, Hardt explains how benchmarks influence model training, complicating direct model comparisons. As model capabilities exceed those of human evaluators, researchers are running out of ways to test new models. If benchmarks are to serve us well in the future, we must place them on solid scientific ground. With this book, Hardt lays the foundation.
Moritz Hardt is director at the Max Planck Institute for Intelligent Systems and an honorary professor at the University of Tübingen. He is the coauthor of Patterns, Predictions, and Actions: Foundations of Machine Learning (Princeton) and Fairness and Machine Learning: Limitations and Opportunities.
“In a moment of crisis for the field—where we struggle to devise tests today’s systems cannot ace—this book places benchmarks in the spotlight where they belong and asks the almost heretical question: why do they work as well as they do? Engrossing, page-turning, lucid, and crucial. Moritz Hardt has written a masterpiece.”—Brian Christian, author of The Alignment Problem: Machine Learning and Human Values
“From the calculated moves of grandmaster board games to the frontiers of statistical physics and protein chemistry, Moritz Hardt reveals the story of the benchmark, the intelligence yardstick that transformed AI from a hope and a prayer into a trillion-dollar arms race. Written with the deep insight of a leading expert, this is the definitive look at the benchmarks that define our future—and the secrets they keep.”—David L. Donoho, Stanford University
“For decades, machine learning has progressed with a predictable cycle of training and testing statistical machines on curated datasets. Moritz Hardt explains why this simple approach worked better than most of us expected, and why it faces daunting challenges in the age of pervasive, internet-scale AI. This is an essential read for anyone seeking to understand what all these benchmark scores really mean.”—Léon Bottou, Flatiron Institute
“Excellently written, Moritz Hardt provides a fascinating read on a topic of central importance to AI: the use, science, and implications of benchmarks in machine learning. Whether you are an expert, an interested student, or a member of the general public curious about AI—including how we got here and where things are going—you will learn something valuable and new from this book.”—Avrim Blum, Toyota Technological Institute at Chicago