Jinjie Ni | Home | Twitter / X | Google Scholar | Github National University of Singapore [email protected] | [email protected]
Aug 9 2024
This blog post was inspired by the thoughts that came with the MixEval releases, and by the feedback we received from the community and reviewers. The empirical conclusions in this post may be useful to the whole AI community, not only the LLM folks.
Model evaluation is both simple and complicated. It’s simple to build an arbitrary evaluation pipeline that reflects some arbitrary abilities of models. However, creating a “correct” evaluation requires sophisticated considerations. Previous posts by Jason Wei and Clémentine Fourrier have shared some essential points on how to conduct LLM evaluation.
In this blog post, we discuss how to build a “correct” model evaluation that is useful in the long term, and we lay out the basic principles for doing so.
What is model evaluation?

Model evaluation is an efficient proxy for measuring how well a model will perform in real-world use cases before it is actually deployed. This understanding is very important, and it is the core of the remaining discussion in this post. Based on this understanding, there are two principles to follow when building a “correct” model evaluation that is useful in the long term.
The first principle, being generalizable, is a common concern when building any machine learning system. When training models, we want the trained models to be generalizable, so that they perform well on validation sets, evaluations, and real-world applications given what they learned from the training data. Similarly, when evaluating models, we need high generalizability: we want the results from our validation sets and evals to be consistent with the results from real-world feedback.
Figure 1. The pipeline of machine learning systems. When the results of a stage are generalizable to the next stages, we consider that stage fully generalizable.
Figure 1 outlines the stages of building machine learning systems. When the results of a stage are generalizable to the next stages, we consider that stage fully generalizable. When building models, we care about whether the training stage generalizes to the validation and test sets (evals); when building evals, we care about whether the validation and evaluation results generalize to real-world applications. Only when the training stage is fully generalizable, i.e., when all the stages are fully generalizable, is the system successful.
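To make this concrete, here is a minimal sketch (not from the original post) of how one might quantify an eval’s generalizability: rank-correlate the eval’s model scores against a real-world quality signal, such as human preference ratings, across the same set of models. All model names and numbers below are hypothetical; the correlation routine is SciPy’s `spearmanr`.

```python
# A minimal sketch: quantifying an eval's generalizability by
# rank-correlating its scores with a real-world quality signal
# across models. All names and numbers are hypothetical.
from scipy.stats import spearmanr

# Scores produced by our eval for a set of models (hypothetical).
eval_scores = {"model_a": 71.2, "model_b": 64.5, "model_c": 58.9, "model_d": 80.3}

# A real-world signal for the same models, e.g. averaged human
# preference ratings collected from deployment (hypothetical).
real_world_ratings = {"model_a": 7.8, "model_b": 6.9, "model_c": 6.1, "model_d": 8.4}

models = sorted(eval_scores)
rho, p_value = spearmanr(
    [eval_scores[m] for m in models],
    [real_world_ratings[m] for m in models],
)
print(f"Spearman rank correlation: {rho:.2f} (p={p_value:.3f})")
# A high rank correlation suggests the eval's model rankings generalize
# to real-world use; a low one suggests the eval measures something else.
```

Rank correlation is used here rather than raw score differences because what usually matters downstream is whether the eval orders models the same way real users would.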
From the perspective of evaluation builders, ensuring the generalizability of the whole pipeline means keeping the validation and evaluation results generalizable to real-world applications. Contamination is part of this generalizability problem: validation or test eval data leaks into the training data, making the test results non-generalizable.
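A common first-pass check for such leakage is n-gram overlap between the training corpus and the eval data. The sketch below illustrates the idea only; it is not any particular benchmark’s decontamination pipeline, and the 13-gram window and flag-on-any-overlap rule are arbitrary choices for the example.

```python
# A minimal sketch of n-gram-overlap contamination detection.
# The n-gram size and the flagging rule are illustrative, not standard;
# real pipelines add normalization, hashing, fuzzy matching, etc.

def ngrams(text: str, n: int = 13) -> set:
    """Return the set of word-level n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_index: set, test_example: str, n: int = 13) -> bool:
    """Flag a test example if any of its n-grams appears in the training data."""
    return bool(ngrams(test_example, n) & train_index)

# Usage: build the n-gram index once over the training corpus,
# then check every eval example against it.
train_corpus = ["..."]   # placeholder: training documents
eval_examples = ["..."]  # placeholder: eval questions/answers
train_index = set().union(*(ngrams(doc) for doc in train_corpus))
flagged = [ex for ex in eval_examples if is_contaminated(train_index, ex)]
```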