One of the highest-signal ~hours of video I've watched recently. Hamel Husain (and his teaching partner Shreya Shankar) are now synonymous with "AI evals"; here he lays out how he thinks people get them wrong and what he'd change.
Their course is excellent and about to start again.
Plus, a bonus appearance by Bryan Bischof and the suggestion that “AI evals” should just be renamed “data science for AI”.
Here are his peeves, though I highly encourage watching the full thing:
Generic Metrics & Off-the-Shelf Evals
Completely Outsourcing Data Review & Leaving Out Domain Experts
Overeager Eval Automation
Not Looking at the Data At All
Not Thinking Deeply About Prompts
Dashboards Full of Noisy Metrics
Getting Stuck with Annotation
Endlessly Trying Tools Instead of Error Analysis
Putting LLMs in the Judge’s Seat Without Human Oversight (see the sketch after this list)
Engineers Not Using AI Enough Themselves (Lack of Intuition)
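On the judge point above: a minimal sketch, in Python, of what lightweight human oversight can look like in practice. Everything here is hypothetical and not from the talk; the labels, data, and helper names are invented purely to illustrate spot-checking an LLM judge's verdicts against a small human-labeled sample.

```python
# Minimal sketch: compare an LLM judge's verdicts against a human-labeled sample.
# All data and names here are hypothetical illustrations, not from the talk.

human_labels = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass"]
judge_labels = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

def agreement_rate(human, judge):
    """Fraction of examples where the LLM judge matches the human label."""
    assert len(human) == len(judge)
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)

def confusion(human, judge):
    """Count each (human, judge) label pair to see where the judge drifts."""
    counts = {}
    for h, j in zip(human, judge):
        counts[(h, j)] = counts.get((h, j), 0) + 1
    return counts

if __name__ == "__main__":
    print(f"agreement: {agreement_rate(human_labels, judge_labels):.0%}")
    for (h, j), n in sorted(confusion(human_labels, judge_labels).items()):
        print(f"human={h!r} judge={j!r}: {n}")
```

Worth noting: raw agreement alone can look fine while the judge fails systematically on one label, which is why the sketch also breaks out the per-pair confusion counts.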