Here's a summary of the themes from the Hacker News discussion:
## The Role of Model Switching vs. Other Improvement Strategies
A central theme is the debate over whether developers should treat switching to newer, more powerful LLMs as the primary way to improve application performance, or focus on other strategies first. Some argue that simply using the latest and greatest models can provide significant gains, especially when there are clear jumps in benchmark evaluations between generations. Others caution against defaulting to a model switch without evidence, emphasizing error analysis and refinement of prompts for the existing model first (a minimal error-analysis sketch follows the quotes below).
- "I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence. Does error analysis suggest that your model is the problem?" (Hamel, quoted by afro88)
- "If there's a clear jump in evals from one model to the next (ie Gemini 2 to 2.5, or Claude 3.7 to 4) that will level up your system pretty easily. Use the best models you can, if you can afford it." (afro88)
- "I see so many systems perform badly only to find out they're using an older generation mode and simply updating to the current mode fixes many of their issues." (smcleod)
- "The āwith evidenceā part is key as simonw said. One anecdote from evals at Cleric - itās rare to see a new model do better on our evals vs the current one. The reality is that youāll optimize prompts etc for the current model." (shrumm)
- "Quality can drop drastically even moving from Model N to N+1 from the same provider, let alone a different one." (ndr)
## The Importance of Domain-Specific Evals and Evidence
A strong consensus emerges that generic benchmarks are insufficient for determining whether a model switch or an optimization strategy will improve performance in a specific application. The discussion highlights the need for custom, evidence-based evaluation frameworks tailored to the use case, as even ostensibly better models can perform worse on particular tasks. Without these, developers risk making uninformed decisions and encountering unexpected regressions (a small domain-specific eval harness is sketched after the quotes below).
- "How do you know that their evals match behavior in your application? What if the older, "worse" model actually does some things better, but if you don't have comprehensive enough evals for your own domain, you simply don't know to check the things it's good at?" (phillipcarter)
- "I agree that in general, you should start with the most powerful model you can afford, and use that to bootstrap your evals. But I do not think you can rely on generic benchmarks and evals as a proxy for your own domain. I've run into this several times where an ostensibly better model does no better than the previous generation." (phillipcarter)
- "If you try to fix problems by switching from eg Gemini 2.5 Flash to OpenAI o3 but you don't have any evals in place how will you tell if the model switch actually helped?" (simonw)
- "I might disagree as these models are pretty inscrutable, and behavior on your specific task can be dramatically different on a new/ābetterā model. Teams would do well to have the right evals to make this decision rather than get surprised." (softwaredoug)
## The Value and Complexity of Evaluation Tools and Custom Interfaces
The conversation delves into the practicalities of building and using tools for AI evaluation. While the need for robust evaluation infrastructure is acknowledged, opinions diverge on whether to build custom tools from scratch or leverage existing open-source solutions. Some advocate for custom interfaces to streamline human review and focus effort on specific needs, while others caution against "reinventing the wheel" and point to the availability and power of established annotation platforms. The complexity of these tools and the difficulty of evaluating LLM outputs (e.g., handling uncertainty, multi-turn conversations) are also noted; a bare-bones custom review loop is sketched after the quotes below.
- "I'm biased in that I work on an open source project in this space, but I would strongly recommend starting with a free/open source platform for debugging/tracing, annotating, and building custom evals." (calebkaiser)
- "Alternatives to Opik include Braintrust (closed), Promptfoo (open, https://github.com/promptfoo/promptfoo) and Laminar (open, https://github.com/lmnr-ai/lmnr)." (mbanerjeepalmer)
- "Maybe it's obvious to some - but I was hoping that page started off by explaining what the hell an AI Eval specifically is." (andybak)
- "Great interfaces make human review fast, clear, and motivating. We recommend building your own annotation tool customized to your domain ..." (quoted from the article, discussed by ReDeiPirati)
- "Ah! This is a horrible advice. Why should you recommend reinventing the wheel where there is already great open source software available? Just use https://github.com/HumanSignal/label-studio/ or any other type of open source annotation software you want to get started." (ReDeiPirati)
- "Label Studio is great, but by trying to cover so many use cases, it becomes pretty complex. I've found it's often easier to just whip up something for my specific needs, when I need it." (jph00)
- "This reads like a collection of ad hoc advice overfitted to experience that is probably obsolete or will be tomorrow. And we donāt even know if it does fit the authorās experience. I am looking for solid evidence of the efficacy of folk theories about how to make AI perform evaluation." (satisfice)
## The Cost of Models and the Viability of AI Startups
The discussion touches on the financial aspect of LLM adoption. While acknowledging that model costs can be "non-trivial," some contributors believe that for successful AI startups, the cost of the model is secondary to cracking the core use case. They suggest that if the product is valuable, the model costs will eventually become manageable.
- "The vast majority of a i startups will fail for reasons other than model costs. If you crack your use case, model costs should fall exponentially." (lumost)
- "Also the āif you can afford itā can be fairly non trivial decision." (softwaredoug)
## The Nature of AI Evals and Their Relationship to Traditional Testing
The fundamental definition and characteristics of AI evaluations are also explored. One user questions the article's assertion that a 100% pass rate isn't always necessary in AI evals, contrasting it with traditional unit-testing expectations. Tracking regressions on a per-test basis is also mentioned as a valuable practice (both ideas are sketched after the quotes below).
- "āOn a related note, unlike traditional unit tests, you donāt necessarily need a 100% pass rate. Your pass rate is a product decision, depending on the failures you are willing to tolerate.ā Not sure how I feel about this, given expectations, culture, and tooling around CI. This suggestion seems to blur the line between a score from an eval and the usual idea of a unit test." (xpe)
- "P.S. It is also useful to track regressions on a per-test basis." (xpe)
- "People should be demanding consistency and traceability from the model vendors checked by some tool perhaps like this. This may tell you when the vendor changed something but there is otherwise no recourse?" (th0ma5)
## The Intricacy of Annotating Data for LLM Development
A recurring point of confusion and discussion is the concept and execution of "custom annotation." Users seek clarification on what exactly is being annotated and how. This leads to explanations involving collecting application traces, performing automated scoring and manual review, identifying failure modes, and creating datasets for benchmarking. The process can range from simple label capture (like a thumbs up/down) to more complex, customized tools; a sketch of simple accept/reject capture follows the quotes below.
- "I still don't understand a lot of what's being suggested. What exactly is a "custom annotation tool", for annotating what?" (davedx)
- "Typically, you would collect a ton of execution traces from your production app. Annotating them can mean a lot of different things, but often it means some mixture of automated scoring and manual review. At the earliest stages, you're usually annotating common modes of failure, so you can say like 'In 30% of failures, the retrieval component of our RAG app is grabbing irrelevant context.' or 'In 15% of cases, our chat agent misunderstood the user's query and did not ask clarifying questions.'" (calebkaiser)
- "Concrete example from my own workflows: in my IDE whenever I accept or reject a FIM completion, I capture that data (the prefix, the suffix, the completion, and the thumbs up/down signal) and put it in a database. The resultant dataset is annotated such that I can use it for analysis, debugging, finetuning, prompt mgmt, etc." (spmurrayzzz)
- "What metrics do you use (e.g., BERTScore, ROUGE, F1)? Are similarity metrics still useful?" (pamelafox)
- "How do you handle uncertainty or 'donāt know' cases? (Temperature settings?)" (pamelafox)