Learn More About Data and AI
Explore your next career challenge and learn more about the Data and AI team!
Learn More
In our last Byte, we introduced the idea of the “Answer Key” — evaluating AI by comparing its output to a known, correct result. But there is a trap hidden in this approach: the risk of failing a model that is actually quite useful because it is “right” in a way you didn’t expect.
Taken literally, “9” and “nine” are different answers, although for almost any purpose they have the same meaning (for example, as the answer to the question “How many airports are there in Albuquerque?”). If your evaluation metric looks for an “Exact Match,” a model that outputs “nine” when the answer key says “9” fails the test. This rigidity makes sense for much of traditional IT, but it does not necessarily make sense when evaluating the output of Generative AI.
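A minimal sketch of what an “Exact Match” metric does, and why it fails here (the function name is illustrative, not from any particular library):

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality: any surface difference counts as a failure."""
    return prediction.strip() == reference.strip()

# The answer key says "9"; the model says "nine".
print(exact_match("nine", "9"))  # False: a useful answer fails the test
print(exact_match("9", "9"))     # True
```

The metric has no notion of meaning, only of characters, which is exactly the trap described above.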
For many tasks where people commonly use Generative AI, such as retrieving information, assisting with coding, translating between languages, or summarizing documents, there are many “correct” responses to a prompt: responses that convey the same meaning but differ in the form used to express it.
As an example, consider asking a model to translate the French expression “en faire tout un fromage” into English. Any English phrase that conveys the meaning of “To make a big deal out of nothing” — “to overreact,” “to make a mountain out of a molehill,” “to exaggerate the response,” etc. — should be counted as correct.
For these kinds of tasks, we must shift from “Strict Matching” to “Flexible Semantic Evaluation.”
Semantics refers to meaning, so a semantic evaluation reflects, to the extent feasible, whether an output captures the intended meaning, context, and underlying message, even when the phrasing or word order differs.
We recently employed one example of semantic matching in evaluating the effectiveness of AI models on a “Text to SQL” task: generating SQL database queries that correspond to English information requests. A tool built around such a model can let non-technical users query databases in plain English. To evaluate approaches to the task, we took the “meaning” of a SQL statement to be the data it retrieved when run against a target database.
We didn’t judge the AI on whether it wrote the SQL code exactly as a senior data engineer would. Why? Just as many English statements can have the same meaning, many SQL statements can deliver the same results.
Instead, we used Execution Accuracy: we ran the AI’s generated code against the database to see if it actually retrieved the correct data.
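The idea can be sketched in a few lines (a minimal illustration using SQLite and an invented airports table, not the actual evaluation harness): two queries count as equivalent if they return the same rows, compared order-insensitively.

```python
import sqlite3

def execution_accuracy(generated_sql: str, reference_sql: str, conn) -> bool:
    """Run both queries and compare the retrieved rows as sorted lists,
    so differences in row order do not cause a false failure."""
    got = sorted(conn.execute(generated_sql).fetchall())
    want = sorted(conn.execute(reference_sql).fetchall())
    return got == want

# Toy in-memory database (hypothetical schema for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airports (name TEXT, city TEXT)")
conn.executemany("INSERT INTO airports VALUES (?, ?)",
                 [("Sunport", "Albuquerque"),
                  ("Double Eagle II", "Albuquerque"),
                  ("LaGuardia", "New York")])

# Differently written SQL, identical meaning.
reference = "SELECT name FROM airports WHERE city = 'Albuquerque'"
generated = "SELECT name FROM airports WHERE city LIKE 'Albuquerque' ORDER BY name"
print(execution_accuracy(generated, reference, conn))  # True
```

An exact-match comparison of the two SQL strings would fail here; comparing what the queries actually retrieve passes them both.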
In this instance, meaning was indicated by something that can be compared almost exactly (although for many queries, the results may come back in different orders). For other tasks, such as summarizing documents, meaning must be assessed by subtler and more varied measures of a response’s utility, a topic for another time.
Don’t let rigid metrics kill good innovation. Perfection is often the enemy of progress.
When evaluating AI, you must move beyond the “Exact Match” mindset. Instead of asking “Did it word it exactly how I expected?” ask the question that actually matters to your mission: “Did it solve the problem?”
Welcome to Data and AI Bytes – a series of short, snackable blog posts by experts from MANTECH’s Data and AI Practice. These posts aim to educate readers about current topics in the fast-moving field of AI.
Elinna Shek serves as a Data Engineer within MANTECH’s Data and AI practice. Contact her via AI@MANTECH.com.