Scaling the Vibes: Gen AI systems evaluations

Aug 5

Takeaways

A robust AI system must be dynamic, continually evolving, and focused on experimentation. Central to this process is implementing evaluations—you can't improve what you don't measure.
Evaluations are unique to every company. The evaluation system itself should also be dynamic and iterative. Real user interaction will reveal multiple unplanned outputs and use cases.
Code-driven, human, and model/LLM evaluations have their pros and cons. A comprehensive evaluation system of the future will likely integrate all methods. I expect that for the model evaluation portion, we will have multiple sector-specific evaluator models measuring different metrics.
Given the general capabilities of Gen AI systems, the long-tail distribution, and data imbalance, evaluations are not a solved problem and remain an area of ongoing research.
The lines between evaluation, observability, and fine-tuning startups will increasingly blur.

Introduction to Gen AI systems evaluations

Investing in the AI enablement layer, I often wonder about evaluation systems. What role do evaluations play in the new "AI stack"? How big of a problem/opportunity are evaluations? Will companies build them in-house or outsource? Is there an opportunity to build a generational company around Gen AI evaluations?

Traditional software evaluation relied on binary outcomes to assess reliability, with success measured by predefined answers or completed actions. Early software development emphasized QA testing to ensure applications met requirements and operated smoothly, involving manual testing and automated scripts to detect bugs and validate functionality. Traditional software had more defined functions and metrics, making evaluations straightforward. While some traditional QA methods remain relevant, the rise of AI systems has necessitated new, complementary approaches to evaluate Gen AI applications reliably.

In the realm of generative AI, the evaluation process becomes significantly more complex. The distinctive feature of language models—their ability to generate non-deterministic, human-like outputs—introduces unique challenges. How do we measure results that vary each time? How do we assess outputs that are subjective? How do you evaluate when accuracy is not the only thing that matters? How can you have good testing coverage around millions of possible outcomes?

Evaluations have become a crucial component of the AI stack. While companies of all sizes recognize the importance of effective evaluations, most are dissatisfied with their current systems. Founders and industry conferences consistently rank evaluations among the top five challenges to AI adoption. Even major players like LinkedIn have publicly shared their struggles with evaluations.

Beyond the challenges, evaluations are pivotal in driving value and establishing competitive advantages. A cornerstone of any successful AI strategy is constant iteration and experimentation, with evaluations serving as the foundational pillar. A robust evaluation system is imperative for continuous improvement—after all, you can't improve what you don't measure.

This article will cover the characteristics of a good evaluation system. It will then deep dive into the evaluation types, and the human vs AI-based model evaluations. We will then address the big research challenges around evaluations, and finally, we will cover the strategic role evaluations have for Gen AI systems beyond reliability.

What is a good evaluation system?

An effective Gen AI evaluation system should be scalable and cost-effective. Scalability is crucial due to high error rates in generative AI, as evaluating only a subset of outputs risks missing critical errors. Cost-effectiveness is equally vital - if evaluations are too expensive, companies may forgo implementing them altogether.

Evaluations can be done both online and offline; a good system incorporates both. Online evaluations provide rapid real-time checks and allow developers to control decisions about user experience, such as not serving a chatbot response that fails an evaluation. However, these evaluations must be implemented with low latency to avoid negatively impacting the user experience. Offline evaluations, including asynchronous batch processing, offer deeper assessment and insights without time constraints and are used more frequently than online evaluations.

Another critical component of an effective evaluation system is its user interface for setup, logging, and auditing. An effective evaluation platform should present information in a user-friendly manner. For example, players like Relari offer an intuitive interface to evaluate different pipeline steps. The system should facilitate easy integration with existing systems and streamline data editing following error identification. Additionally, it should ensure traceability, transparency, and interpretability.

Evaluation should be continuous, from pre-production through post-launch (Freeplay and Galileo have good guides on what makes good evaluation systems.) This long-term process requires patience. While some companies report quick 80% improvements after implementing evaluation systems, progress from 80% to 99% is gradual. Companies should set realistic expectations when implementing these systems.

Evaluation types

Early generative AI evaluation relied primarily on public LLM benchmarks. These showcased top-performing models across various tests. The "Post Turing" paper traces benchmark evolution from basic (e.g., GLUE) to advanced (e.g., MMLU). These benchmarks popularized leaderboards, such as Hugging Face's Open LLM.

While benchmarks offer a starting point, companies quickly discover their limitations in real-world scenarios. These tests often miss nuances of specific use cases. Consequently, businesses shift focus from finding the "most powerful" model to identifying the most cost-effective and accurate one for their specific needs. Moreover, data contamination in training sets compromises benchmark accuracy in measuring true model capabilities.

Given the limitations of these benchmarks, there are three primary ways to conduct evaluations. One of the first approaches companies can take is to do code-driven evaluations, where they can have custom code functions for things like schema adherence, formatting, and string match/similarity scores, among others. The advantage they have is that you can run them quickly and freely at every inference and use that as the starting place.

You can also conduct reference-free evaluation based solely on the LLM response. One example of this is heuristic-based reference-free evaluation, such as verbosity analysis. Additionally, rubric-based evaluation using LLM judges can also be reference-free.

The third method involves creating a "golden dataset" - a curated collection of high-quality, representative examples that serve as a benchmark for AI system evaluation. This dataset should embody a representative and high-quality set of examples that accurately reflect the desired outcomes and edge cases for specific use cases.

Developing a golden dataset requires close customer interaction to understand needs, possible queries, and establish a pre-production baseline. Each company's definition of success and thus golden dataset is unique and varies widely. One important note to make is that predicting every outcome and output from the beginning is impossible. Real user interaction will reveal numerous unexpected scenarios, necessitating continuous adjustments to the evaluation system. The evaluation system itself must also adapt over time. For instance, golden datasets will evolve as customer use cases change over time, models change, and there is data drift.

Zhao Chen, AI director from Upwork, emphasized that while golden datasets are important for evaluations, the real challenge lies in how they are utilized. This has been a significant issue even before the advent of LLMs, as seen in computer vision datasets for self-driving, where golden datasets were better defined and cleaner. Ian from Freeplay cautioned that the traditional approach to golden datasets is often insufficient. Given the multi-dimensional outcomes possible with LLMs, treating a golden set as a "happy path" risks failing to test for likely failures. In practice, customers are curating multiple datasets, including golden set answers, clear failures from production to avoid replication, and "red teaming" or attack datasets.

LLM as a judge or human evaluations - Scaling the vibes?

When companies started evaluating generative AI systems, one of the earliest methods was human evaluation. Many companies conducted what is known as "vibes check." In this process, a human evaluator reviewed the model's output to determine its correctness. However it is worth noting that some companies do very structured and rigorous human evaluations with large specialized teams for this.

To date, this has proven to be the most effective method. Sharon Zhou, founder of Lamini, recently shared that a major reason why code agents have gained traction is due to the nature of software development, in which developers are actively monitoring and evaluating the outputs and suggestions of the AI agent/co-pilot.

For effective human evaluation, founders and operators should consider several key points. Evaluation criteria must be precise to minimize ambiguity and potential human bias. Developing a system that enables domain experts to participate without significant time investment is crucial for scaling the process. For example, using Slack to send a small number of daily evaluation tasks to product managers allows those with deep knowledge to provide input efficiently within their existing workflow.

Despite being the most accurate evaluation method, human evaluation fails on three crucial fronts: scalability, cost, and time efficiency. Manually evaluating thousands of daily outputs is neither physically feasible nor economically viable. The challenge with human evaluations has spurred the rise of model-based evaluators. By leveraging these models, companies can develop cheaper and faster evaluation systems. This can be achieved through the use of an LLM or a smaller, fine-tuned AI evaluator model, trained on datasets containing examples of both high-quality and low-quality content, complemented by human evaluations or ratings.

Startups offering evaluation solutions must recognize that generic approaches will likely fail. Successful providers will offer customizable evaluator models that clients can fine-tune with their own data to address specific needs. For example, Galileo's Luna, a fine-tuned evaluator model for finance and law, allows further customization with company-specific data. Crucially, these solutions must be transparent and interpretable, avoiding the scenario of a black box evaluating another black box. Clients need to understand the evaluation scoring process, emphasizing the importance of explainability. The key to effective evaluator models lies in high-quality, diverse training data that minimizes biases and overfitting.

Despite their promise, LLM and evaluator models have limitations. Maxime Labonne has effectively summarized various research papers findings on this topic. LLM evaluations face several challenges, including:

Variability in model performance
Imperfect alignment with human judgments
Vulnerabilities to biases and perturbations
Their effectiveness varies depending on the specific task and type of evaluation required

Despite these challenges, model-based evaluations remain an active area of research with significant potential. Emerging concepts such as multi-agent debate systems and replacing judges with juries are expanding the scope of AI evaluation techniques. I expect that in the future, we will not have just one master evaluator model but rather a base evaluator model complemented by smaller, industry specific evaluator models tuned for different metrics.

As we have covered both human and model evaluations have unique strengths and weaknesses, necessitating an integrated approach for robust assessment. As model-based evaluations advance, they may replace non-expert human evaluators due to efficiency and scalability. However, subject matter experts will remain essential for specialized evaluations.

This human-model collaboration is exemplified by OpenAI's CriticGPT, which assists human trainers in evaluating AI-generated code. Conversely, a key practice is having humans review random samples of production data to catch edge cases and examine production failures detected by evaluations or customer feedback. The system is then updated based on these learnings, with both evaluations and datasets being refined. An effective system, therefore, is one where humans and AI models augment and improve each other's capabilities.

Evaluations complexity

One of the biggest research problems with Gen AI evaluations is that the entire point of LLMs is that they're supposed to have more general capabilities than AI systems from before, but we fundamentally have no mathematical scaffolding to measure general capabilities.

A major challenge in AI evaluation is assessing millions of possible outputs, especially for long-tail and edge scenarios. Long-tail imbalance is high-dimensional and extremely difficult to measure or detect. Dataset imbalance poses a significant issue, where models may achieve high accuracy by simply predicting the majority class, thus failing on critical minority cases. For instance, a model always predicting "legitimate" in a dataset with 95% legitimate transactions would show 95% accuracy but fail to detect fraud (which is the critical issue). These "edge" cases are a significant reason to conduct evaluations. A correct treatment of the long-tail depends on understanding not just what data is rare, but what data is rare for the specific model we are using.

Related to this topic Yi Zhang, founder and CEO of Relari, shared a comparison with the evaluation in the self-driving car industry.

“In self-driving cars, evaluation drives the entire development process. In this industry, evaluations are conducted extensively in simulations, testing thousands of times more miles in virtual environments than in real-world settings. This allows developers to identify and optimize for rare edge cases that are difficult to encounter in real-world testing. For example, in self-driving cars, even a rare edge case can have significant consequences, so it is essential to simulate and evaluate these scenarios extensively.”

To address these issues, companies have been creating specialized datasets and fine-tuning models for evaluation, as previously mentioned. While this approach significantly improves performance, it is worth acknowledging that this does not solve the problem from its root. As Zhao Chen pointed out, "training on a hallucination dataset doesn't really imbue your model with any more reasoning. It still adheres to the old 'learn your training data' model and that will always hit a performance ceiling.".

Apart from the data challenge and mathematical measurement challenge evaluation complexity varies significantly across tasks. For instance, evaluating abstractive summaries presents unique challenges due to length, the subjective nature of relevance, and the need to verify consistency. (Eugene has written an article for those interested in this topic.) Moreover, evaluation extends beyond merely assessing the model; it encompasses the entire system. In a RAG pipeline, for example, it's crucial to determine whether issues stem from the generation or retrieval phase (see my previous deep dive on RAG). Companies like Quotient AI, whose founders led evaluations for Github Copilot, have developed sophisticated RAG evaluation systems to tackle these intricacies.

“RAG significantly increases the complexity of AI products. Developers have more levers they can pull in order to enhance performance, such as modifying embeddings and trying different chunking strategies, or tweaking the prompts on the generation side. Systematically updating and measuring the impact of these changes is crucial for building high-quality AI products,,” says Julia Neagu, CEO and Co-founder of Quotient. “On top of that, the quality of these AI systems relies heavily on the retrieved documents, making it essential to have customizable evaluation frameworks for each specific use case. We cannot use off-the-shelf benchmarks to understand how well an AI agent can respond to questions about a company's documentation, for example. Quotient's tools solve that challenge by creating evaluation benchmarks tailored to our users' specific use cases and data.”

Evaluation extends beyond single-step language generation. As multi-step agents gain traction, the challenge intensifies, and the problem with the long-tail cases becomes exponentially hader. In these scenarios, assessing each workflow step is crucial to identify failure points and prevent error propagation. The complexity of agent evaluation is exemplified by frameworks like AgentBench. This comprehensive benchmark illustrates the unique aspects of evaluating autonomous agents, including their ability to follow instructions, make decisions, plan long-term actions, use tools effectively, and self-correct. While the evaluation of agentic systems warrants a separate, in-depth discussion, this overview highlights its complexity. (Watch for my colleague Holley's upcoming deep dive on this topic.)

Data and strategic role

Evaluations have evolved beyond their traditional role, now serving as crucial strategic tools. Rather than merely highlighting system failures, they now form the foundation for system enhancement. Identifying failure points provides the base for where improvements are needed. As highlighted in the introduction, any robust AI system must continuously iterate, driven by relentless experimentation and precise measurement.

A group of top-notch practitioners recently got together and decided to write an article sharing what they've learned from a year of building with LLMs. During this collaboration, Hamel contributed a really useful diagram that shows just how important evaluations are in any good AI system ( recommend listening to their Keynote at AI engineer).

Given this strategic role of evaluation, it's important to involve areas beyond data science, especially product managers. For example, companies like Freeplay have built a platform with this cross-department collaboration in mind, giving product teams tools to experiment, test, monitor and optimize AI features.

From Ian Cairns, co-founder of Freeplay: “We’ve found that in practice, product development teams build up an eval suite over time, and it’s often a multi-disciplinary process. There’s a feedback loop on evals themselves, where teams start with vibe checks, and then realize what the important underlying criteria are that they’re assessing with those ‘vibes.’ Giving the whole team including PMs, QA and other domain experts an easy way to look at data, label it with existing criteria, and suggest new evals gets that flywheel going.” The Freeplay blog goes deeper into building practical eval suites.

As investors, when speaking with founders, we often hear that capturing user feedback and data drives system improvements, providing competitive advantages that will be hard for competitors to replicate over time. However, when I ask about their evaluation systems, many founders lack concrete ideas or haven't deeply considered evaluations. When this happens, it's a big red flag for me. Saying that your differentiation will come from feedback loops, and not having evaluations as a centerpiece is like shooting in the dark. As the diagram illustrates, evaluations are the core element of improvement. As Yi Zhang founder of Relari shared

"While often viewed as a risk management tool, evals are equally crucial for driving AI performance. This shift in perspective transforms evals from a safeguard into a strategic asset for revenue growth."

User feedback plays a crucial role in the evaluation process. Implementing easy feedback mechanisms can significantly enhance your evaluation system. However, while traditional methods like thumbs-up and thumbs-down buttons are a good starting point, they're often too simplistic to capture the nuances of user satisfaction or dissatisfaction.

Given evaluations role as a central pillar of the iterative experimentation systems. I expect the line between fine-tuning startups and evaluation startups to blur, as both begin offering overlapping services. Initial partnerships might emerge, but I expect both will start entering other fields, leading to potential consolidation long term.

From an investor's perspective, distinguishing between different players in the evaluation space can be challenging, as their core elements are often quite similar. Speed of execution and go-to-market strategy will be key factors in determining which companies emerge as leaders. One possibility is that evaluations could serve as an effective entry point/wedge for adoption and be offered as a free solution. For example, companies like Ragas have decided to open-source their evaluation framework.

Final thoughts

We have shown the depth involved in good Gen AI system evaluations; however, founders shouldn't get paralyzed by this. They should start with a basic approach and refine it over time. Initially, rely on manual assessments by human experts before integrating automated model evaluators to increase scale. The key is to begin thinking about evaluation strategies early in the development process, without aiming for immediate perfection and seeing it as an iterative process.

One important thing to note is that, while there will be common guidelines for good evaluation systems, there will be significant variation and personalization between companies, and there won’t be a one-size-fits-all model, system or metrics. I expect companies will run multiple evaluations in parallel mixing human and smaller expert model evaluators.

I am especially excited about players pushing the research boundaries to find a mathematical framework to measure the model's general capabilities and solve the evaluation challenge from the root, including conducting evaluations at the token generation level.

As always, huge credit goes to my colleagues Chip Hazard and Jeff Bussgang for their feedback. I also want to express my massive thanks to Julia Negu, Ian Cairnis, Zhao Chen, and Yi Zhang, for their insights and quotes. At Flybridge, we are excited to back visionary founders and companies at the forefront of this AI revolution. If you are a founder building in this space or an operator who wants to exchange views, send me a message at daniel@flybridge.com. You can learn more about our AI thesis and history here.

AIevaluationsartificial intelligence

Daniel Porras Reyes