Learn More About Data and AI
Explore your next career challenge and learn more about the Data and AI team!
Learn More
In our last Byte, we introduced the idea of the “Answer Key” — evaluating AI by comparing its output to a known, correct result. But there is a trap hidden in this approach: the risk of failing a model that is actually quite useful because it is “right” in a way you didn’t expect.
Taken literally, “9” and “nine” are different answers, although for almost any purpose they have the same meaning (for example, as the answer to the question “How many airports are there in Albuquerque?”). If your evaluation metric looks for an “Exact Match,” a model that outputs “nine” when the answer key says “9” fails the test. This rigidity makes sense for much of traditional IT, but it does not necessarily make sense when evaluating the output of Generative AI.
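A minimal sketch of what an “Exact Match” metric does, and why it fails here (the function name is illustrative, not from any particular library):

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality: any surface difference counts as a failure."""
    return prediction.strip() == reference.strip()

# The answer key says "9"; the model says "nine".
print(exact_match("nine", "9"))  # False: a useful answer fails the test
print(exact_match("9", "9"))     # True
```

The metric has no notion of meaning, only of characters, which is exactly the trap described above.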
For many tasks where people commonly use Generative AI, such as retrieving information, assisting with coding, translating between languages, or summarizing documents, there are many “correct” responses to a prompt: responses that convey the same meaning but differ in the form used to express it.
As an example, consider asking a model to translate the French expression “en faire tout un fromage” into English. Any English phrase that conveys the meaning of “To make a big deal out of nothing” — “to overreact,” “to make a mountain out of a molehill,” “to exaggerate the response,” etc. — should be counted as correct.
For these kinds of tasks, we must shift from “Strict Matching” to “Flexible Semantic Evaluation.”
Semantics refers to meaning, so a semantic evaluation reflects, to the extent feasible, whether an output captures the intended meaning, context, and underlying message, even when the phrasing or word order differs.
We recently employed one example of semantic matching in evaluating the effectiveness of AI models on a “Text to SQL” task: generating SQL database queries that correspond to English information requests. A tool built around such a model can let non-technical users query databases in plain English. To evaluate approaches to the task, we took the “meaning” of a SQL statement to be the data it retrieved when run against a target database.
We didn’t judge the AI on whether it wrote the SQL code exactly as a senior data engineer would. Why? Just as many English statements can have the same meaning, many SQL statements can deliver the same results.
Instead, we used Execution Accuracy: we ran the AI’s generated code against the database to see if it actually retrieved the correct data.
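The idea can be sketched in a few lines (a minimal illustration using SQLite and an invented airports table, not the actual evaluation harness): two queries count as equivalent if they return the same rows, compared order-insensitively.

```python
import sqlite3

def execution_accuracy(generated_sql: str, reference_sql: str, conn) -> bool:
    """Run both queries and compare the retrieved rows as sorted lists,
    so differences in row order do not cause a false failure."""
    got = sorted(conn.execute(generated_sql).fetchall())
    want = sorted(conn.execute(reference_sql).fetchall())
    return got == want

# Toy in-memory database (hypothetical schema for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airports (name TEXT, city TEXT)")
conn.executemany("INSERT INTO airports VALUES (?, ?)",
                 [("Sunport", "Albuquerque"),
                  ("Double Eagle II", "Albuquerque"),
                  ("LaGuardia", "New York")])

# Differently written SQL, identical meaning.
reference = "SELECT name FROM airports WHERE city = 'Albuquerque'"
generated = "SELECT name FROM airports WHERE city LIKE 'Albuquerque' ORDER BY name"
print(execution_accuracy(generated, reference, conn))  # True
```

An exact-match comparison of the two SQL strings would fail here; comparing what the queries actually retrieve passes them both.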
In this instance, meaning was indicated by something that can be compared almost exactly (although for many queries, the results may come back in different orders). For other tasks, such as summarizing documents, meaning must be assessed by subtler and more varied measures of a response’s utility, a topic for another time.
Don’t let rigid metrics kill good innovation. Perfection is often the enemy of progress.
When evaluating AI, you must move beyond the “Exact Match” mindset. Instead of asking “Did it word it exactly how I expected?” ask the question that actually matters to your mission: “Did it solve the problem?”
Welcome to Data and AI Bytes – a series of short, snackable blog posts by experts from MANTECH’s Data and AI Practice. These posts aim to educate readers about current topics in the fast-moving field of AI.
Elinna Shek serves as a Data Engineer within MANTECH’s Data and AI practice. Contact her via AI@MANTECH.com.