Evaluating generative AI systems is challenging because they produce open-ended outputs, and it's hard to define a single correct answer. Traditional NLP approaches, such as recasting evaluation test sets as multiple-choice questions, can help but have limitations.
Benchmarks are a popular approach for evaluating generative AI systems, but they have their own limitations. Established benchmarks such as MMLU and SuperGLUE are available, but they may not be applicable to all use cases.
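To make the benchmark idea concrete, here is a minimal sketch of scoring a model on MMLU-style multiple-choice items. The item schema and `ask_model` callable are assumptions for illustration, not MMLU's actual file format or any particular SDK.

```python
from typing import Callable

def score_multiple_choice(items: list[dict], ask_model: Callable[[str], str]) -> float:
    """Each item is assumed to look like:
    {"question": str, "choices": {"A": str, "B": str, ...}, "answer": "A"}"""
    correct = 0
    for item in items:
        options = "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
        prompt = f"{item['question']}\n{options}\nAnswer with a single letter."
        # Take only the first character of the reply as the predicted letter.
        prediction = ask_model(prompt).strip().upper()[:1]
        correct += int(prediction == item["answer"])
    return correct / len(items)
```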
Human-based approaches, such as manual labeling, can be used to evaluate generative AI systems, but they can be time-consuming and expensive. It's important to have well-defined criteria for labelers to ensure consistency.
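One way to check whether the criteria are well defined is to measure agreement between labelers. Below is a minimal, self-contained sketch of Cohen's kappa over two labelers' judgments; the "good"/"bad" labels and example values are purely illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement between two labelers, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:  # both labelers used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# e.g. two labelers rating the same six outputs
print(cohens_kappa(["good", "bad", "good", "good", "bad", "good"],
                   ["good", "bad", "bad", "good", "bad", "good"]))
```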
You even see red teaming offered as a service, as well as companies building agents, or SaaS platforms built around agents, that act as red teams. So it's no longer humans that have been trained to break models; it's models that have been trained to break models.
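As a rough illustration of "models trained to break models", here is a sketch of an automated red-teaming loop. The `attacker`, `target`, and `judge` names are hypothetical callables wrapping whatever models you actually use, and the prompts are placeholders rather than a hardened attack strategy.

```python
def automated_red_team(attacker, target, judge, seed_goals, rounds=3):
    """Attacker model proposes adversarial prompts, target responds,
    judge model flags responses that violate policy."""
    findings = []
    for goal in seed_goals:
        attempt_context = goal
        for _ in range(rounds):
            attack = attacker(f"Write a prompt that tries to make a model: {attempt_context}")
            response = target(attack)
            verdict = judge(
                f"Goal: {goal}\nResponse: {response}\n"
                "Did the response violate policy? Answer yes or no."
            )
            if verdict.strip().lower().startswith("yes"):
                findings.append({"goal": goal, "attack": attack, "response": response})
                break
            # Feed the failed attempt back so the attacker can try a new angle.
            attempt_context = f"{goal} (previous attempt failed: {attack})"
    return findings
```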
That was actually going to be my third bucket: model-based evaluations, essentially the LLM-as-a-judge approach.
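A minimal LLM-as-a-judge sketch, assuming a hypothetical `call_judge_model` wrapper around whichever provider you use and an illustrative 1-5 rubric:

```python
JUDGE_PROMPT = """You are grading a model response.
Question: {question}
Response: {response}
Criteria: factual accuracy, relevance, and completeness.
Reply with a single integer score from 1 (poor) to 5 (excellent)."""

def judge_response(question: str, response: str, call_judge_model) -> int:
    """Ask a judge model to grade a response; parse the first digit it returns."""
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, response=response))
    digits = [c for c in raw if c.isdigit()]
    return int(digits[0]) if digits else 0  # fall back to 0 if unparseable
```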
One of the best practices I've seen different teams use in the industry is to maintain a golden test set, which again is very trustworthy.
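A sketch of what running a golden test set could look like, assuming trusted input/expected pairs and a simple checker. The exact-match-plus-keywords checker here is illustrative; in practice it might be task-specific logic or the LLM judge above.

```python
def run_golden_set(golden_set: list[dict], generate) -> dict:
    """golden_set items are assumed to look like:
    {"input": str, "expected": str, "must_include": [str, ...]}"""
    results = []
    for case in golden_set:
        output = generate(case["input"])  # hypothetical generation call
        exact = output.strip() == case["expected"].strip()
        keywords = case.get("must_include", [])
        keywords_ok = bool(keywords) and all(k.lower() in output.lower() for k in keywords)
        results.append({"input": case["input"], "pass": exact or keywords_ok})
    passed = sum(r["pass"] for r in results)
    return {"pass_rate": passed / len(results),
            "failures": [r for r in results if not r["pass"]]}
```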
Classic information retrieval research goes back multiple decades, so I do see some of that work being reused within retrieval systems as well...
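Those borrowed metrics are straightforward to compute; here is a self-contained sketch of recall@k and mean reciprocal rank over ranked document IDs:

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def mean_reciprocal_rank(rankings: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant document for each query."""
    total = 0.0
    for ranked_ids, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)
```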
RAG systems have so many bells and whistles you can pull, and each of those things kind of requires its own set of evals: for example, what kind of chunking strategy do you want to use...
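As one example, here is an illustrative harness for comparing chunk sizes by retrieval recall on a labeled query set. The `build_index` and `retrieve` names are hypothetical stand-ins for your embedding store, and the sketch assumes retrieval returns source-document IDs so the relevance labels stay comparable across chunk sizes.

```python
def compare_chunking(documents, eval_queries, build_index, retrieve,
                     chunk_sizes=(256, 512, 1024), k=5):
    """eval_queries: [{"query": str, "relevant_ids": set of source-document ids}, ...]"""
    scores = {}
    for size in chunk_sizes:
        index = build_index(documents, chunk_size=size)      # hypothetical indexer
        recalls = []
        for q in eval_queries:
            retrieved = retrieve(index, q["query"], k=k)      # hypothetical retriever
            recalls.append(len(set(retrieved) & q["relevant_ids"]) / len(q["relevant_ids"]))
        scores[size] = sum(recalls) / len(recalls)
    return scores  # e.g. {256: 0.61, 512: 0.74, 1024: 0.69} -- illustrative numbers
```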
I think it's one of those things people don't see as cool to do, but it is extremely essential for ensuring a good customer experience once you're in production. So with software...
You just have to be very surgical about where you really want to deploy ML and AI models. There's a lot of background work, I feel, that goes into that, where you're essentially talking with all the product people, with design, with engineers, basically the whole cross-functional team...
With agents, depending on how many steps are in your agentic flow, the same concept applies: you need to evaluate every step's output, and then you need to evaluate the whole end-to-end output as well. Depending on how many tools your agent is calling, or has access to, the evaluation complexity increases quite a bit, because with tool calling we're essentially looking at two things at the end of the day: (a) is the LLM picking the right tool for the task, and (b) are the right parameters being passed to the tool that's picked? So there are two factors you're evaluating for each tool...
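A minimal sketch of scoring those two factors per step, assuming each predicted and expected step is recorded as a dict with illustrative "tool" and "args" fields:

```python
def score_tool_calls(predicted_steps: list[dict], expected_steps: list[dict]) -> dict:
    """Score (a) tool selection and (b) parameter correctness per agent step.
    Steps are assumed to look like {"tool": "search_flights", "args": {"city": "SFO"}}."""
    tool_correct, args_correct = 0, 0
    for pred, exp in zip(predicted_steps, expected_steps):
        if pred["tool"] == exp["tool"]:
            tool_correct += 1
            # Only check parameters when the tool itself is right.
            if all(pred["args"].get(k) == v for k, v in exp["args"].items()):
                args_correct += 1
    n = max(len(expected_steps), 1)
    return {"tool_accuracy": tool_correct / n, "arg_accuracy": args_correct / n}
```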