Chasing the Numbers: The Puzzle of AI Benchmarks
What is easier: to create or to measure? In both cases, we have to make choices without really knowing how. And time is not on our side.
In 2019, Canadian computer scientist Richard S. Sutton formulated the so-called bitter lesson of 70 years of artificial intelligence (AI) research: general methods that leverage computation are ultimately the most effective. Hence, instead of designing a highly specialized model for a narrow downstream task, one can leverage the available computational resources to get great results from a black-box statistical model.
Unfortunately, the compute-intensive black-box approach has two problems. First, a data bottleneck: we need more and more data, and the data should be of good quality, which is not easy to obtain or even to define in advance (what is good data, anyway?). Second, evaluation: when we build highly generic prediction models, experimentally assessing whether a model works well becomes much more difficult.
In this post, I will describe in broad strokes how evaluation in natural language processing (NLP) works, what problems we face today, and what we can do about them. It is becoming less and less obvious how to rigorously select the better model: we need a thorough evaluation methodology, and we will see that designing a good evaluation protocol is anything but easy. I also believe that these thoughts will be useful outside of the NLP field.
Recap: What Happened to NLP?
Let’s briefly recap what happened in NLP so we are all on the same page. Over the last thirty years, the field experienced multiple paradigm shifts. First, it transitioned from rule-based systems to machine-learning models in the early 90s. It worked so well that a famous phrase is attributed to the pioneer of speech recognition, Frederick Jelinek: “Every time I fire a linguist, the performance of the speech recognizer goes up.” Then, the highly efficient Word2Vec method for learning word embeddings was released in 2013, and representation learning became an essential approach for language processing. Finally, our current paradigm was defined by the release of the Transformer architecture in 2017 and the BERT model one year later, which made most prior work look not just obsolete but overly elaborate.
I see two correlates to these shifts. First, the more recent models were not possible before because we did not have accelerators like GPUs, TPUs, and ASICs. Second, the earlier models relied on very difficult-to-obtain human-annotated datasets, while the more recent models are built upon huge amounts of unstructured text crawled from the Internet (here I would kindly like to point the reader to the transcript of the data-centric part of our ICML 2023 tutorial). This means that the models have become bigger and more capable in general, echoing Sutton’s bitter lesson, yet they are much less convenient to troubleshoot and analyze.
Today, we have a flurry of different large language models (LLMs) released virtually daily. Some of them have permissive open-source licenses, some are proprietary without even disclosing the design details, and some are in between these two extremes. Many authors claim their models show excellent performance on many, many, many tasks. How do we find the best one that suits our needs and constraints? And more generally, how do we compare models to each other? Let us start by looking at some ancient, prehistoric times and then see what came next. It’s all about benchmarking.
Benchmarking in Ancient Times (Before 2018)
Earlier works in NLP, since the early 90s, focused on developing highly specialized methods for downstream tasks. These works usually resulted in manually designed pipelines of distinct steps that together attempted to solve the given problem. You had a separate pipeline for sentiment analysis, spam detection, text summarization, information retrieval (IR), and so on. Although they shared some common steps like tokenization, part-of-speech tagging, and maybe syntactic parsing, most steps and machine learning features were carefully selected and tuned for the chosen downstream application. As a result, these methods might feel plateresque to an uninitiated reader, but they eventually got the job done (as far as they could).
Once machine learning started to be used in these pipelines, a common approach to studying model performance emerged. The dataset was split into train, validation, and test subsets. Then, an evaluation criterion was selected and computed, and the obtained numbers were compared across different systems. This evaluation was rather static, and eventually the systems became overfitted to specific single-task evaluation datasets, making a fair assessment of the current state of the art harder. For example, in computer vision, there is a highly popular image classification dataset called MNIST. A good image classification system is expected to perform well on MNIST, but not vice versa: performing well on MNIST does not imply that the system performs well on anything except MNIST.
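To make this concrete, here is a minimal sketch of that single-task protocol in Python, assuming scikit-learn and its bundled 20 Newsgroups corpus as a stand-in dataset; the particular classifier does not matter, only the split, train, and score loop does.

```python
# A minimal sketch of the classic single-task protocol: split one dataset
# into train/validation/test, pick a criterion, and compare systems by the
# numbers on the held-out splits. The corpus and model are placeholders.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

data = fetch_20newsgroups(subset="all", categories=["rec.autos", "sci.space"])
X_trainval, X_test, y_trainval, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(X_train), y_train)

# The validation split is used for tuning; the test split is touched once.
val_accuracy = accuracy_score(y_val, model.predict(vectorizer.transform(X_val)))
test_accuracy = accuracy_score(y_test, model.predict(vectorizer.transform(X_test)))
print(f"validation accuracy: {val_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```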
The IR community, understanding the need for more principled evaluation, started the incredibly successful Text REtrieval Conference (TREC) initiative in 1992. At TREC, systems were evaluated differently, using shared tasks. First, the organizers selected a pool of tracks, such as query expansion or relevance feedback, and published an open call for participation in the IR systems competition. They then created the corresponding training, validation, and test datasets. Finally, they ran a competition in which the participants received the training dataset and submitted predictions for the test dataset without knowing the answers in advance. The organizers evaluated these submissions. Notably, this could be done automatically by computing some evaluation criterion, like accuracy, on the hidden dataset. But more interestingly, it was often done by inviting human experts or using crowdsourcing. Results were presented on a leaderboard. By running TREC tracks periodically, it was possible to follow the dynamics of the state of the art in IR (and related areas).
TREC got many, many followers and inspired many conferences, workshops, and other events in machine learning to co-locate topical shared tasks. The community created online platforms for hosting competitions, such as CodaLab/Codabench, TIRA, etc. Today, at every big scientific conference there is a special competition track, e.g., SemEval, KDD Cup, WSDM Cup, NeurIPS Competitions, the CLEF Initiative, and many others. Many of them offer monetary prizes to motivate the participants. It all reached an industrial scale with the launch of Kaggle, which enabled companies to outsource model selection for their problems to machine learning enthusiasts who earned money for the best solutions on the leaderboard. Kaggle was so successful that Google bought it in 2017.
Running a shared task is a fantastic and very rewarding experience, but its preparation requires a lot of effort and expertise: annotation quality, choice of criteria, task promotion, choice of hosting platform, potential data leaks, etc. At the time of writing, I have organized eight of them, and my experience tells me that it is not viable to run a shared task every time one needs to assess a model. And after the Transformer architecture appeared in 2017 and proved able to solve many different tasks very well, older single-task benchmarks stopped being representative.
Benchmark Like It’s 2018
Since the release of BERT in 2018, the NLP community has been adopting pre-trained Transformer models (Vaswani et al., 2017), which proved to work very well in practice. The performance numbers went up, the pipelines became much simpler, and the entire field changed its look and feel: no more weird custom methods, only learning algorithms, just like in computer vision, but with texts. Why build a sophisticated representation tailored for our downstream application if we can just fine-tune a pre-trained Transformer with a much smaller amount of data?
These methodological changes and increases in model capabilities meant that, on average, BERT-based methods consistently outperformed many previously state-of-the-art models on popular benchmarks. The question was: since they were good enough on these downstream applications, how could we measure their generalization abilities? As a response, the community proposed multi-task benchmarks, such as General Language Understanding Evaluation (GLUE) and SuperGLUE. The former contained nine challenging tasks and the latter contained eight. The systems were scored on a leaderboard using a single aggregated score, so a good system had to show good results on all the tasks simultaneously.
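As a toy illustration, such an aggregated leaderboard score can be as simple as an unweighted average of per-task metrics; the numbers below are made up, and the real GLUE score averages a specific set of metrics, some of which are themselves averages of sub-metrics.

```python
# A toy illustration of how a multi-task leaderboard collapses per-task
# results into one number: an unweighted macro-average over tasks.
per_task_scores = {  # hypothetical per-task results of one system
    "CoLA": 52.1,
    "SST-2": 93.5,
    "MRPC": 88.9,
    "RTE": 66.4,
}
aggregate = sum(per_task_scores.values()) / len(per_task_scores)
print(f"aggregate score: {aggregate:.1f}")  # the single number on the leaderboard
```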
The idea of multi-task benchmarks went further with such initiatives as Massive Multitask Language Understanding (MMLU, 57 tasks), EleutherAI Language Model Evaluation Harness (LM-Eval, 200+ tasks), and Beyond the Imitation Game Benchmark (BIG-bench, 200+ tasks). Note that many of these benchmarks were focused on English, and people created similar datasets for other languages or multilingual setups. So, as the models became much more capable and general, their quality was evaluated using very large multi-task benchmarks.
I cannot overstate the increasing role of ablations in evaluation. In an ablation study, one removes various components from the system being evaluated to measure their contribution to the overall performance. Even though ablations have been known in NLP since the 70s, they have now become a necessity when a new system is described in a paper, and this is very good. Before, having ablations was just a nice bonus to the main experiments.
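Here is a minimal sketch of what an ablation study looks like in code, using a toy scikit-learn text classifier; the “components” being removed (lowercasing, sublinear TF weighting, bigrams) are arbitrary choices made purely for illustration.

```python
# A minimal ablation study on a toy text classifier: train the full
# pipeline, then retrain it with one preprocessing component switched off
# at a time and watch how the test accuracy changes.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = fetch_20newsgroups(subset="all", categories=["sci.med", "sci.space"])
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

settings = {
    "full system": dict(lowercase=True, sublinear_tf=True, ngram_range=(1, 2)),
    "w/o lowercasing": dict(lowercase=False, sublinear_tf=True, ngram_range=(1, 2)),
    "w/o sublinear tf": dict(lowercase=True, sublinear_tf=False, ngram_range=(1, 2)),
    "w/o bigrams": dict(lowercase=True, sublinear_tf=True, ngram_range=(1, 1)),
}
for name, params in settings.items():
    vectorizer = TfidfVectorizer(**params)
    model = LogisticRegression(max_iter=1000)
    model.fit(vectorizer.fit_transform(X_train), y_train)
    accuracy = accuracy_score(y_test, model.predict(vectorizer.transform(X_test)))
    print(f"{name}: {accuracy:.3f}")
```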
Although multi-task benchmarks and shared tasks for downstream applications were essential for benchmarking, BERT was an encoder-only architecture, so researchers had to fine-tune it for these tasks; it was not a generative model. This gave them control over the data, except for the prohibitively expensive pre-training part. But later on, decoder-only architectures took off with flying colors. Guess what?
Benchmark Like It’s 2023
With the launch of GPT-3 in 2020, the previously not-very-interesting generative language models became a big thing (pun intended). The NLP and broader AI communities discovered an impressive ability to solve an unexpectedly large spectrum of problems by predicting the next word in a text. As a result, we got instruction-tuned LLMs in 2022 which, since ChatGPT, led to the emergence of an entire discipline called prompt engineering.
I would never have imagined that we would be writing texts in some chat window in a trial-and-error manner to perform tasks like text classification. My bet back in the day was on transfer learning and knowledge distillation. But it is the best we have right now, not just because of the high accessibility of this approach, but also because of its state-of-the-art performance on many useful everyday tasks.
The good performance demonstrated by these decoder-only models comes from the high number of trained parameters: GPT-3 had 175B parameters, and GPT-4 reportedly has more. Training decoder-only Transformers requires “only” a huge amount of natural language text without any annotation, so it scales pretty well. Instruction-tuned LLMs, in turn, additionally require more carefully crafted instruction prompts and responses (see our ICML tutorial again), but that is beyond today’s agenda. We do not usually fine-tune the model for a specific task anymore. Instead, for most downstream applications, we come up with a textual prompt that gets the job done, and the model remains unchanged all the time (although if a hosted LLM is used, its owner can use the interaction data to adjust the model).
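To illustrate the shift, here is a sketch of prompt-based classification; the `complete` helper is a hypothetical placeholder for whatever hosted or local LLM API you actually call, not a real library function.

```python
# A minimal sketch of prompt-based classification: instead of fine-tuning,
# we wrap the input into a textual prompt and parse the model's reply.
def complete(prompt: str) -> str:
    # Hypothetical stand-in: plug in your hosted or local LLM call here.
    raise NotImplementedError("call your LLM of choice here")

PROMPT_TEMPLATE = (
    "Classify the sentiment of the following movie review as "
    "'positive' or 'negative'. Answer with one word only.\n\n"
    "Review: {review}\nSentiment:"
)

def classify_review(review: str) -> str:
    reply = complete(PROMPT_TEMPLATE.format(review=review))
    return "positive" if "positive" in reply.lower() else "negative"
```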
Even though MMLU, LM-Eval, and BIG-bench are well-designed and fairly challenging benchmarks, most of the tasks they contain might already be present in the training or instruction-tuning data of the corresponding LLMs. So take these numbers with a grain of salt: being good at these benchmarks is necessary but not sufficient for a good model. Currently, I see three approaches for seeking the truth besides shared tasks: out-of-distribution tasks, human-evaluated leaderboards, and red-teaming.
Since most LLMs are trained on Web data, they have seen many popular websites from which evaluation datasets were derived. For example, evaluating LLM text classification capabilities on datasets like IMDb Movie Reviews does not make much sense, as most models have seen it all, including the held-out test subset, during the pre-training or instruction-tuning phase. What helps instead is to pick an out-of-distribution task that the models have not seen before. If that is not possible, try using examples that are not publicly available on the Internet.
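One simple (and admittedly crude) way to check whether a public dataset might be contaminated is to look for long n-gram overlaps between test examples and whatever training text is available; the sketch below uses 13-grams, in the spirit of the overlap analysis reported in the GPT-3 paper, with helper names and the zero-overlap criterion being my own illustrative choices.

```python
# A rough contamination heuristic: flag a test example if any of its long
# word n-grams also occurs in the (available) training corpus.
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(test_example: str, training_corpus: list[str], n: int = 13) -> bool:
    test_ngrams = ngrams(test_example, n)
    if not test_ngrams:
        return False  # the example is shorter than n tokens
    corpus_ngrams = set().union(*(ngrams(document, n) for document in training_corpus))
    return len(test_ngrams & corpus_ngrams) > 0
```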
There are different leaderboards, but I find the most useful ones to be those with human-scored LLM outputs: Hugging Face’s Open LLM Leaderboard and LMSYS’ Chatbot Arena Leaderboard. The former assesses only open LLMs using private crowdsourcing with known instructions on a secret set of prompts. The latter does include the proprietary models, but it assesses all the LLMs using public crowdsourcing. Anyone can join to annotate pairs of outputs, and I believe this might be a problem, just as with the OpenAssistant data. Writing and following annotation instructions is not easy, and one has to ensure that the volunteered interactions are similar to the ones used to train state-of-the-art LLMs (see OpenAI’s instructions, for example). Also, proper sampling and selection of prompts for annotation is more difficult in a public crowdsourcing setup. I highly appreciate all these open-source initiatives, and without them we would not be making much progress, but I currently find Hugging Face’s approach more sound.
For some reason, the AI community ranks the models using Elo ratings, in which the resulting rankings depend on the order and time of the comparisons. The same system is used in chess, and it makes sense in chess, but not here. I would instead use the Bradley-Terry model for ranking, but this subject requires more investigation (please reach out to me if you’d like to discuss this).
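For the curious, here is a minimal sketch of fitting the Bradley-Terry model to pairwise preference counts with the classic minorization-maximization updates; the win matrix is entirely made up, and the point is only that the estimate depends on aggregate counts rather than on the order in which the battles happened.

```python
# A minimal Bradley-Terry fit via the classic MM updates (Zermelo/Hunter):
# unlike Elo, the result depends only on aggregate win counts.
import numpy as np

# wins[i, j] = number of times model i was preferred over model j (made up)
wins = np.array([
    [0, 35, 60],
    [25, 0, 50],
    [10, 20, 0],
], dtype=float)

n_models = wins.shape[0]
strengths = np.ones(n_models)
games = wins + wins.T       # total comparisons between each pair
total_wins = wins.sum(axis=1)

for _ in range(1000):
    denominator = np.zeros(n_models)
    for i in range(n_models):
        for j in range(n_models):
            if i != j and games[i, j] > 0:
                denominator[i] += games[i, j] / (strengths[i] + strengths[j])
    strengths = total_wins / denominator
    strengths /= strengths.sum()  # fix the scale; only ratios matter

print("estimated strengths:", np.round(strengths, 3))
```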
Another way to evaluate a complex AI system is red-teaming. You ask a group of people to make the chosen model produce something incorrect, harmful, or dishonest. By running this exercise against multiple models, it becomes possible to compare them in an adversarial scenario. It’s better to spot something undesirable before it hurts the end users, isn’t it?
It seems that even at OpenAI, the evaluation problem is still unsolved. Despite having the best-performing models, they launched the open-source Evals framework and granted early access to GPT-4 to those who landed a pull request there. It is especially good that these datasets are published under the permissive MIT license, making them useful not just to OpenAI itself.
We know that all models are wrong, but some are useful. All benchmarks in AI become wrong over time, too. However, I believe that the evaluation of such complex systems as LLMs, or whatever comes after them, is even more interesting than model development per se, especially since it is not what everyone wants to do.
Let’s hope that winter won’t take us by surprise.