![[Blog] AI Eval](https://www.testmo.com/wp-content/uploads/2025/11/Blog-AI-Eval.jpg)
As we’ve discussed in previous articles (Essential AI Testing Practices and How to Test AI), AI systems don’t behave like traditional code. The same input can yield different outputs depending on model version, temperature, or even subtle prompt wording.
For QA, that breaks the usual “expected vs actual” pattern. You can’t just assert equality — you need to measure quality.
That’s where structured AI evaluations (“evals”) come in. An AI eval is, in its simplest form, a structured test case for an AI output: scored against repeatable criteria, versioned, and traceable over time.
This guide covers the basic principles of evals, then walks through practical examples of custom eval templates in a test management tool like Testmo. You’ll see how to plan testing for new AI features in your product, collaborate on testing and record results, and build traceability throughout the development of AI functionality.
Evals give QA teams a repeatable way to assess AI outputs across defined dimensions like accuracy, completeness, readability, or factual precision — the same principles we’ve always used in QA, just applied to a fuzzier target.
Evals act as your compass, guiding iterative improvements, validating design choices, and ensuring — and documenting — that your AI aligns with its intended purpose.
By managing evals in Testmo — as structured cases, results, and runs — you gain the consistency and visibility you need to treat AI testing like any other serious QA process.
An eval is just a human judgment, captured systematically. You might ask:
Each question gets a score. Those scores turn subjective impressions into measurable, comparable data — something QA teams are built for.
In Testmo, you can model each eval as a test case, execute it, and collect those results as part of a formal run.
Say your team built an AI tool that converts user stories into Gherkin scenarios. You want to know: how good are these test cases, really?
You decide to score them across a few key dimensions:
| Dimension | Definition |
|---|---|
| Accuracy | Does the generated case match the intent of the requirement? |
| Completeness | Are all acceptance criteria covered? |
| Readability | Is the case clear and easy for others to understand? |
| Reusability | Could it be reused across projects or test suites? |
You could track this in Excel, sure. But in Testmo, you can make it repeatable and auditable: every eval run, across a variety of inputs and versions of your AI model, is stored as a versioned, traceable record.
This same approach works for summarization, classification, and recommendation models. Once you define your scoring dimensions, the rest is process.
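To make the idea of scoring dimensions concrete, here is a minimal Python sketch of one evaluator's result for one AI output. The dimension names come from the table above; the `EvalResult` class, the `EVAL-1` identifier, and the example scores (assumed to be on a 0-10 scale) are hypothetical and only illustrate the shape of the data.

```python
from dataclasses import dataclass, field

# Dimension names taken from the table above; everything else is illustrative.
DIMENSIONS = ("accuracy", "completeness", "readability", "reusability")


@dataclass
class EvalResult:
    """One evaluator's scores for one AI output."""
    case_id: str
    scores: dict[str, int] = field(default_factory=dict)
    notes: str = ""

    def missing_dimensions(self) -> list[str]:
        # Dimensions the evaluator has not scored yet.
        return [d for d in DIMENSIONS if d not in self.scores]

    def mean_score(self) -> float:
        # Simple unweighted average across the scored dimensions.
        if not self.scores:
            raise ValueError("no scores recorded")
        return sum(self.scores.values()) / len(self.scores)


result = EvalResult(
    case_id="EVAL-1",  # made-up identifier
    scores={"accuracy": 8, "completeness": 6, "readability": 9},
)
print(result.missing_dimensions())
print(round(result.mean_score(), 2))
```

An unweighted average is only one possible choice. If accuracy matters more than reusability for your feature, a weighted score may suit you better.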
To accomplish this in Testmo, create a custom test case template and then use it for your AI system Evals. Here’s how to set that up, and then how to use the finished template to run Evals in the context of a project.
Before you build the template, set up the fields you’ll need.
Go to Testmo Administration → Fields → Add Case Field and create the following:
| Field | Type | Example Entries |
|---|---|---|
| Evaluation Criteria | Multi-select | Accuracy, Completeness, Readability, Reusability, Style Adherence |
| Input Type | Dropdown | User Story, Gherkin Scenario, Acceptance Criteria List |
| Expected Output | Text | What the AI should generate (if known in advance) |
| AI Output (Captured) | Text | The model’s actual output |
These fields define the data each Eval case will capture.
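As a rough illustration, the case fields above could be represented like this in Python. The `EvalCase` class and the sample Gherkin text are hypothetical. The `similarity` helper uses the standard library's `difflib` as a quick pre-screen only. Text similarity is not a quality measure, and the human scores remain the real eval.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class EvalCase:
    # Attribute names mirror the case fields in the table above.
    evaluation_criteria: list[str]
    input_type: str
    expected_output: str
    ai_output: str

    def similarity(self) -> float:
        """Rough 0.0-1.0 text similarity between expected and captured output."""
        return SequenceMatcher(None, self.expected_output, self.ai_output).ratio()


# Made-up example data.
case = EvalCase(
    evaluation_criteria=["Accuracy", "Completeness"],
    input_type="User Story",
    expected_output="Given a registered user\nWhen they log in\nThen they see the dashboard",
    ai_output="Given a registered user\nWhen they sign in\nThen they see the dashboard",
)
print(f"{case.similarity():.2f}")
```

A low similarity can prompt a closer look, but two differently worded scenarios can both be correct, which is why the scored dimensions still matter.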

Now configure how results are recorded whenever an Eval is executed.
| Field | Type | Example |
|---|---|---|
| Accuracy Score | Integer | 0–10 |
| Completeness Score | Integer | 0–10 |
| Readability Score | Integer | 0–10 |
| Reusability Score | Integer | 0–10 |
| Evaluator Notes | Text | Context and rationale |
#Tip: Use consistent numeric scales across all Evals for easier reporting.
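If you ever export or script against these result fields, a small check like the following sketch can catch scores that drift off the shared scale. The field names and the 0-10 range come from the table above; the `validate_scores` function and the plain-dict result format are assumptions for illustration, not a Testmo API.

```python
SCORE_MIN, SCORE_MAX = 0, 10  # the 0-10 range from the table above
SCORE_FIELDS = (
    "Accuracy Score",
    "Completeness Score",
    "Readability Score",
    "Reusability Score",
)


def validate_scores(result: dict) -> list[str]:
    """Return a list of problems; an empty list means the result is valid."""
    problems = []
    for name in SCORE_FIELDS:
        value = result.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"{name}: expected an integer, got {value!r}")
        elif not SCORE_MIN <= value <= SCORE_MAX:
            problems.append(f"{name}: {value} is outside {SCORE_MIN}-{SCORE_MAX}")
    return problems


# Made-up result with one out-of-range score and one missing score.
example = {"Accuracy Score": 8, "Completeness Score": 11, "Readability Score": 7}
for problem in validate_scores(example):
    print(problem)
```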

You’ll need to capture an overall verdict for your Eval. Testmo allows you to create Custom Statuses, so let’s use those.
Go to Testmo Administration → Statuses and edit an existing unused status for each verdict, configuring the following:
| Status Name | Success or Failure |
|---|---|
| Pass — Meets all acceptance criteria | Considered a success |
| Minor Issues — Acceptable with minor edits | Considered a success (or a failure, depending on your own project thresholds) |
| Needs Review — Significant issues or omissions | Considered a failure |
| Fail — Unusable or incorrect output | Considered a failure |
#Tip: On the Edit status → Projects tab, make sure the custom statuses are available in the project(s) where your Evals will be executed.
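To keep verdicts consistent between evaluators, it can help to write down how scores map to the four statuses. The sketch below is one possible rule, assuming 0-10 scores. The cut-offs (8, 6, 3), the choice to let the lowest dimension decide, and treating Minor Issues as a success are all illustrative assumptions, not Testmo defaults.

```python
# Which statuses count as a success is a project decision; this is one option.
SUCCESS_STATUSES = {"Pass", "Minor Issues"}


def suggest_status(scores: dict[str, int]) -> str:
    """Map 0-10 dimension scores to one of the four verdicts.

    The weakest dimension decides, so one poor score cannot be averaged away.
    """
    if not scores:
        raise ValueError("no scores recorded")
    lowest = min(scores.values())
    if lowest >= 8:
        return "Pass"
    if lowest >= 6:
        return "Minor Issues"
    if lowest >= 3:
        return "Needs Review"
    return "Fail"


# Made-up scores for a single Eval result.
status = suggest_status({"accuracy": 9, "completeness": 6, "readability": 8})
print(status, status in SUCCESS_STATUSES)
```

The evaluator still makes the final call. A rule like this just gives the team a shared starting point.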

Once all your fields exist, go to Project Settings → Templates → Add Template.
Name it AI Evaluation, then attach the custom fields you just created.
#Tip: Make sure your custom template is available to the project(s) you want to use it in, by selecting it on the Edit template → Projects tab.

Configurations capture the evaluation context — model version, dataset, or prompt setup. Add them to Testmo via the Administration → Configurations tab as needed, so you can utilise them in the finished Eval specifications and executions.
Examples below:
| Group | Platforms |
|---|---|
| Model Version | GPT-4-Turbo, Llama 3, Claude 3 |
| Prompt Template | “Strict” / “Conversational” |
| Temperature | 0.2 (More deterministic), 0.7 (Creative) |
| Dataset | Internal QA stories, Anonymized customer feedback |
This lets you compare results like “accuracy improved 15% from v1.1 to v1.2.”
#Note: Testmo configurations are combinations of Groups and Platforms. A configuration from the table above may therefore look like: Llama 3, Strict, 0.7 (Creative), Internal QA Stories.
#Tip: Make sure your configuration is available to the project(s) you want to use it in, by selecting it on the Edit configuration → Projects tab.
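Because a configuration is a combination of one entry per group, the number of possible configurations grows quickly. This short sketch enumerates the combinations for the example groups above (temperature labels shortened to the bare values), which can help you decide which ones are worth creating. It only models the idea in plain Python and does not call Testmo.

```python
from itertools import product

# Groups and entries copied from the example table above.
groups = {
    "Model Version": ["GPT-4-Turbo", "Llama 3", "Claude 3"],
    "Prompt Template": ["Strict", "Conversational"],
    "Temperature": ["0.2", "0.7"],
    "Dataset": ["Internal QA stories", "Anonymized customer feedback"],
}

# One configuration = one entry from each group, in group order.
configurations = [", ".join(combo) for combo in product(*groups.values())]

print(len(configurations))
print(configurations[0])
```

In practice you would rarely evaluate every combination. Picking the handful that matches the models and prompts you actually plan to ship keeps runs manageable.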

Once you’ve set up your Eval template, fields, and configurations, you’re ready to create Evals in your Testmo project Repository using the template you just configured.
#Tip: Use tags to filter and report on Evals more easily:


You’ll now have a timestamped, auditable record of every AI evaluation — no spreadsheet cleanup required.
Most AI evaluations are manual, but Testmo ties neatly into your automated runs too.
For example:
Together, they create a feedback loop from AI generation → execution → release confidence.
It’s all linked in one system, so you can trace issues end-to-end without switching tools.
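As one possible way to feed automated checks into that loop, the sketch below writes results as JUnit-style XML, a widely used format for automated test results. It assumes your pipeline and test management setup can consume such a file, which you should verify for your own tooling. The check names and the `to_junit_xml` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET


def to_junit_xml(suite_name: str, checks: list[tuple[str, bool, str]]) -> str:
    """Build a JUnit-style XML report from (name, passed, message) tuples."""
    failures = sum(1 for _, passed, _ in checks if not passed)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(checks)),
        failures=str(failures),
    )
    for name, passed, message in checks:
        case = ET.SubElement(suite, "testcase", name=name, classname=suite_name)
        if not passed:
            # A <failure> child marks the test case as failed.
            failure = ET.SubElement(case, "failure", message=message)
            failure.text = message
    return ET.tostring(suite, encoding="unicode")


# Made-up automated checks on AI-generated Gherkin scenarios.
checks = [
    ("scenario has Given/When/Then", True, ""),
    ("scenario parses as valid Gherkin", False, "missing Then step"),
]
report = to_junit_xml("ai-gherkin-checks", checks)
print(report)
```

Automated checks like these cover the mechanical side (does the output parse, is it complete), leaving the manual Evals to judge quality.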
Once you’ve logged a few Evals, the patterns start to show.
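For instance, a comparison between two model versions boils down to averaging a score per version and computing the relative change. The sketch below uses made-up accuracy scores on the 0-10 scale; the version labels and numbers are illustrative only, not real results.

```python
from collections import defaultdict
from statistics import mean

# Made-up logged results: (model version, accuracy score on the 0-10 scale).
logged = [
    ("v1.1", 6), ("v1.1", 7), ("v1.1", 5),
    ("v1.2", 7), ("v1.2", 8), ("v1.2", 6),
]

# Group the scores by model version.
by_version = defaultdict(list)
for version, accuracy in logged:
    by_version[version].append(accuracy)

averages = {version: mean(scores) for version, scores in by_version.items()}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, as a percentage."""
    return (new - old) / old * 100


change = percent_change(averages["v1.1"], averages["v1.2"])
print(averages)
print(f"accuracy change: {change:.1f}%")
```

With only a few Evals per version, a difference like this can easily be noise, so treat early trends as a prompt to look closer, not as proof.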

Use filters and dashboards to:
| Common Pitfall | How to Avoid It |
|---|---|
| Inconsistent scoring | Define a clear rubric (e.g., what does “5/10 accuracy” mean?). |
| No baseline data | Store “gold standard” outputs in Expected Output. |
| Unclear evaluator notes | Encourage rationale for any low score. |
| Version drift | Use Configurations to separate model versions. |
| Unlinked evaluations | Add requirement or feature IDs for traceability. |
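The first pitfall, inconsistent scoring, is the easiest to check with a little code. The sketch below pairs a written rubric for the accuracy scale with a helper that flags cases where evaluators disagree by more than a tolerance. The rubric wording, the band boundaries, and the tolerance of 2 points are hypothetical examples to adapt, not recommended values.

```python
# Hypothetical rubric: (minimum score, meaning), highest band first.
ACCURACY_RUBRIC = [
    (9, "Matches the requirement's intent with no corrections needed"),
    (6, "Matches the intent but needs small corrections"),
    (3, "Partly matches the intent; significant rework needed"),
    (0, "Does not match the intent"),
]


def describe(score: int) -> str:
    """Return the rubric meaning for a 0-10 accuracy score."""
    for floor, meaning in ACCURACY_RUBRIC:
        if score >= floor:
            return meaning
    raise ValueError(f"score {score} is below the rubric's range")


def disagreements(scores_by_case: dict[str, list[int]], tolerance: int = 2) -> list[str]:
    """Cases where evaluators' scores differ by more than the tolerance."""
    return [
        case
        for case, scores in scores_by_case.items()
        if max(scores) - min(scores) > tolerance
    ]


# Made-up accuracy scores from two evaluators per case.
print(describe(5))
print(disagreements({"EVAL-1": [7, 8], "EVAL-2": [3, 8]}))
```

Cases that get flagged are good candidates for a short team review, which usually sharpens the rubric as well.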
With AI evaluations structured in Testmo, you get:
Every Eval becomes part of your testing story — not an isolated experiment in someone’s drive.
Start small.
Create one AI Evaluation template. Evaluate a handful of model outputs. Review the scores as a team.
Then refine — adjust fields, build dashboards, connect automation, and make it part of your regular QA process.