Jinjie Ni | National University of Singapore
Model evaluation is both simple and complicated. It is easy to throw together an arbitrary evaluation pipeline that reflects some arbitrary abilities of a model; building a “correct” evaluation, however, requires sophisticated considerations. The great posts by Jason Wei and Clémentine Fourrier have covered some essential points on how to conduct LLM evaluation.
In this blog post, we discuss how to build a “correct” model evaluation that stays useful in the long term, and lay out the basic principles for doing so.
What is model evaluation?
Model evaluation is an efficient proxy for how well a model will perform in real-world use cases before it is actually deployed. This understanding is very important, and it is the core of the remaining discussion in this post. Based on it, a “correct” model evaluation that stays useful in the long term should follow two principles: Generalizable and Efficient.
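To make the “proxy” framing concrete, below is a minimal sketch of what an evaluation pipeline reduces to: run the model on each benchmark example, score its output against a reference, and aggregate the scores into a single number. All names here (`evaluate`, `exact_match`, `echo_model`, the toy benchmark) are hypothetical illustrations for this sketch, not from any particular framework.

```python
from statistics import mean

def evaluate(model, benchmark, metric):
    """Score `model` on every example in `benchmark` with `metric`,
    returning the mean score as a proxy for real-world performance."""
    scores = []
    for example in benchmark:
        prediction = model(example["input"])  # hypothetical model call
        scores.append(metric(prediction, example["reference"]))
    return mean(scores)

# Toy usage: a hard-coded "model" and an exact-match metric on two items.
toy_benchmark = [
    {"input": "2 + 2 =", "reference": "4"},
    {"input": "Capital of France?", "reference": "Paris"},
]
exact_match = lambda pred, ref: float(pred.strip() == ref)
echo_model = lambda prompt: "4" if "2 + 2" in prompt else "Paris"

print(evaluate(echo_model, toy_benchmark, exact_match))  # -> 1.0
```

Every design decision in this loop (which examples make up the benchmark, which metric scores the outputs, how scores are aggregated) determines how faithfully the resulting number tracks real-world performance, which is exactly where the two principles below come in.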