In Defense of AI Evals, for Everyone

Trey Causey

Source: In Defense of AI Evals, for Everyone

The more interesting question, then, is not whether you do evals, but when you can afford to be less rigorous and when you cannot.

Shreya Shankar patiently -- more patiently than I would -- responds to the recent evals / no-evals discourse (happening mainly on Twitter). Rather than taking the bait, Shreya assumes good faith and says the debate is really about when it's OK to be more or less rigorous in your evaluations.

It’s OK to be less rigorous when your task(s) are already heavily baked into the foundation model’s post-training (such as with coding).

It’s also ok when you have enough domain expertise and dogfood early and often.

In my own experience with applications built on top of foundation models (with much less money, lol), evals are especially critical in complex document processing and analysis. Just because a document fits in the context window does not mean the model will complete the task correctly; we have to carefully decompose the task into smaller pieces the model can handle, and then design evals for each of those pieces.
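To make "design evals for each of those pieces" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two stages (`split_sections`, `extract_total`), the `EvalCase` type, and the exact-match scoring are hypothetical, and the stages are plain regex stand-ins for what would be model calls in a real application.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class EvalCase:
    """One labeled example for a single pipeline stage."""
    input: str
    expected: Any


def split_sections(document: str) -> list[str]:
    # Hypothetical stage 1 (stand-in for a model call): split on blank lines.
    return [part.strip() for part in re.split(r"\n\s*\n", document) if part.strip()]


def extract_total(section: str) -> float | None:
    # Hypothetical stage 2 (stand-in for a model call): read a "Total: <number>" value.
    match = re.search(r"Total:\s*\$?([0-9]+(?:\.[0-9]+)?)", section)
    return float(match.group(1)) if match else None


def pass_rate(stage: Callable[[str], Any], cases: list[EvalCase]) -> float:
    """Fraction of cases where the stage output exactly matches the expected value."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(stage(case.input) == case.expected for case in cases)
    return passed / len(cases)


# Each stage gets its own small eval set, so a failure points at one piece.
STAGE_EVALS = {
    "split_sections": (split_sections, [
        EvalCase("Header\n\nBody text", ["Header", "Body text"]),
        EvalCase("Only one section", ["Only one section"]),
    ]),
    "extract_total": (extract_total, [
        EvalCase("Total: $42.50", 42.5),
        EvalCase("No amount here", None),
    ]),
}


def run_all() -> dict[str, float]:
    """Run every stage against its own cases and report a pass rate per stage."""
    return {name: pass_rate(stage, cases) for name, (stage, cases) in STAGE_EVALS.items()}


if __name__ == "__main__":
    for name, rate in run_all().items():
        print(f"{name}: {rate:.0%}")
```

The point of scoring each stage separately is that a drop in one pass rate tells you which piece of the decomposition broke, instead of only telling you that the end-to-end answer was wrong. Exact match is the simplest possible scorer; real document tasks would likely need something looser.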

In the end, it’s always better to avoid Twitter and form your own conclusions about things instead of parroting the hot takes of the moment.
