We know that blindly trusting generative AI can be risky, and in some contexts, even dangerous. As we explored in our last Byte, a simple “gut check” can reduce this risk by revealing what an AI tool can – and can’t – do. But this isn’t always enough. Often, we need to move beyond this type of vague impression and measure something concrete.
While the technology behind large language models is complex, the way we evaluate them doesn’t have to be. In fact, it follows the same logic as a classroom exam: you can’t grade performance without an answer key or rubric. To evaluate AI quantitatively, you need three specific components: a Task, a Dataset, and a Metric.
Together, these allow you to compare performance objectively – between models, prompting strategies, or different versions of the same tool – much like comparing students based on their exam scores.
1. The Task: Define the Goal
Start by being clear about what you want the system to do. Think of this as deciding what kind of exam you’re giving and what specific skill you want to gauge. “Reading emails” is not a measurable task. “Extracting a customer’s name and order number from a support email” is. Being precise here makes meaningful evaluation possible.
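To make this concrete, here is one way the email-extraction task above could be pinned down as an explicit output contract. This is only an illustrative sketch in Python; the class and field names are assumptions, not part of any particular tool.

```python
from dataclasses import dataclass

@dataclass
class OrderExtraction:
    """The exact fields the system must return for each support email."""
    customer_name: str   # e.g., "Jane Smith"
    order_number: str    # e.g., "A-10482"
```

Writing the expected output down this precisely is what turns “reading emails” into something you can actually score.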
2. The Dataset: Create Your Answer Key
If the task defines the exam, the dataset defines the questions – and the answer key, which we call the ground truth or gold standard. You’ll need a set of example inputs (for instance, historical emails) paired with the correct output you expect the system to produce.
Make sure your evaluation data is different from what might have been used to train the model or build the tool you’re evaluating. If an exam uses the same questions as the study guide, students may perform well by memorizing answers without truly learning how to apply the material. AI can fall into the same trap, performing well on familiar examples but failing to generalize to new ones.
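Continuing the illustration, a minimal answer key for this task could be as simple as a list of input/expected-output pairs. The emails and values below are invented purely for the example:

```python
# A tiny gold-standard dataset: each entry pairs an input email with the
# output we expect a correct system to produce.
gold_dataset = [
    {
        "email": "Hi, this is Jane Smith. My order A-10482 arrived damaged.",
        "expected": {"customer_name": "Jane Smith", "order_number": "A-10482"},
    },
    {
        "email": "Order B-99031 never shipped. Please advise. - Omar Reyes",
        "expected": {"customer_name": "Omar Reyes", "order_number": "B-99031"},
    },
]
```

In practice you would want many more examples, drawn from emails the model has not already seen.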
3. The Metric: Decide How to Score It
Finally, compare the system’s output to your answer key using a clear rule – one that assigns a score, like a grade. In our email example, you might use the percentage of emails for which the extracted name and order number exactly match the answer key.
Think of the metric as the grading rubric. Depending on the task, there may not be one specific right answer – summarizing a report, for example, has many acceptable outputs – so you may need other ways to measure correctness, such as a rubric applied by human reviewers.
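When exact match is the right rubric, as in our email example, the scoring rule itself is only a few lines of code. This sketch assumes the gold_dataset from the previous example and a placeholder extract() function standing in for whatever system is being evaluated:

```python
def exact_match_rate(predictions, gold_dataset):
    """Percent of examples whose predicted fields exactly match the answer key."""
    correct = sum(
        1 for pred, example in zip(predictions, gold_dataset)
        if pred == example["expected"]
    )
    return 100.0 * correct / len(gold_dataset)

# Usage sketch (extract() is a hypothetical stand-in for the tool under test):
# predictions = [extract(example["email"]) for example in gold_dataset]
# print(f"Exact match: {exact_match_rate(predictions, gold_dataset):.1f}%")
```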
The Bottom Line
You don’t need to be a data scientist to evaluate AI effectively. Start by clearly defining your task, then look for existing data you can use as an answer key. If you can define these, you already have the foundation for a rigorous evaluation that shifts your organization from hoping the technology works to proving that it does.
Welcome to Data and AI Bytes – a series of short, snackable blog posts by experts from MANTECH’s Data and AI Practice. These posts aim to educate readers about current topics in the fast-moving field of AI.
Elena Quartararo serves as a Data Scientist for Data and AI at MANTECH. Contact her via AI@MANTECH.com.
Brian Vickers serves as Principal Data Scientist for Data and AI at MANTECH. Contact him via AI@MANTECH.com.