How to Test AI Features Using AI Evaluation Templates in Testmo

By Simon Knight · Nov 10, 2025 · 7 min read

As we’ve discussed in previous articles (Essential AI Testing Practices and How to Test AI), AI systems don’t behave like traditional code. The same input can yield different outputs depending on model version, temperature, or even subtle prompt wording.

For QA, that breaks the usual “expected vs actual” pattern. You can’t just assert equality — you need to measure quality.

That’s where structured AI evaluations (“evals”) come in. An AI eval is, in its simplest form, a structured test case for an AI output: scored against repeatable criteria, versioned, and traceable over time.

This guide covers the basic principles of evals, then walks through practical examples of custom eval templates in a test management tool like Testmo. You’ll see how to plan testing for new AI features in your product, collaborate on testing and record results, and build traceability throughout the development of AI functionality.

Why Structured AI Evaluations Matter for QA

Evals give QA teams a repeatable way to assess AI outputs across defined dimensions like accuracy, completeness, readability, or factual precision — the same principles we’ve always used in QA, just applied to a fuzzier target.

Evals act as your compass, guiding iterative improvements, validating design choices, and ensuring — and documenting — that your AI aligns with its intended purpose.

By managing evals in Testmo — as structured cases, results, and runs — you gain the consistency and visibility you need to treat AI testing like any other serious QA process.

What an AI Eval Actually Looks Like

An eval is just a human judgment, captured systematically. You might ask:

  • Does this output align with the intent of the input?
  • Is it complete?
  • Is it phrased clearly?
  • Would it make sense to someone new to the project?

Each question gets a score. Those scores turn subjective impressions into measurable, comparable data — something QA teams are built for.

In Testmo, you can model each eval as a test case, execute it, and collect those results as part of a formal run.

Example: Evaluating an AI Test Case Generator

Say your team built an AI tool that converts user stories into Gherkin scenarios. You want to know: how good are these test cases, really?

You decide to score them across a few key dimensions:

Dimension | Definition
Accuracy | Does the generated case match the intent of the requirement?
Completeness | Are all acceptance criteria covered?
Readability | Is the case clear and easy for others to understand?
Reusability | Could it be reused across projects or test suites?

You could track this in Excel, sure. But in Testmo, you can make it repeatable and auditable: every eval run, across a variety of inputs and versions of your AI model, is stored as a versioned, traceable record.

This same approach works for summarization, classification, and recommendation models. Once you define your scoring dimensions, the rest is process.

Building AI Evaluation Templates in Testmo

To do this in Testmo, we first create a custom test case template, then use it for our AI evals. Here’s how to set it up and then use the finished template to run Evals in a project.

Step 1: Create Custom Fields for Your Evaluation Template

Before you build the template, set up the fields you’ll need.

Go to Testmo Administration → Fields → Add Case Field and create the following:

Field | Type | Example Entries
Evaluation Criteria | Multi-select | Accuracy, Completeness, Readability, Reusability, Style Adherence
Input Type | Dropdown | User Story, Gherkin Scenario, Acceptance Criteria List
Expected Output | Text | What the AI should generate (if known in advance)
AI Output (Captured) | Text | The model’s actual output

These fields define the data each Eval case will capture.
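
If you also handle Eval data outside Testmo (say, in a script that collects model outputs before someone enters them), a small record type can mirror these fields. The following is a minimal Python sketch, not a Testmo API: the `EvalCase` class and its attribute names are hypothetical, and the allowed values are simply the example entries from the table above.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values copied from the example entries in the fields table.
CRITERIA = {"Accuracy", "Completeness", "Readability", "Reusability", "Style Adherence"}
INPUT_TYPES = {"User Story", "Gherkin Scenario", "Acceptance Criteria List"}


@dataclass
class EvalCase:
    """Hypothetical local mirror of one Eval case (not a Testmo object)."""

    title: str
    input_type: str
    evaluation_criteria: list[str]
    ai_output: str
    expected_output: Optional[str] = None  # only filled in if known in advance

    def __post_init__(self) -> None:
        # Reject values that would not match the dropdown / multi-select options.
        if self.input_type not in INPUT_TYPES:
            raise ValueError(f"unknown input type: {self.input_type!r}")
        unknown = set(self.evaluation_criteria) - CRITERIA
        if unknown:
            raise ValueError(f"unknown criteria: {sorted(unknown)}")
```

Validating on construction catches typos such as "Readibility" before they fragment your reports.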

Figure: Adding the Input Type Case Field in Testmo

Step 2: Add Custom Result Fields for Scoring

Now configure how results are recorded whenever an Eval is executed.

Field | Type | Example
Accuracy Score | Integer | 0–10
Completeness Score | Integer | 0–10
Readability Score | Integer | 0–10
Reusability Score | Integer | 0–10
Evaluator Notes | Text | Context and rationale

#Tip: Use consistent numeric scales across all Evals for easier reporting.
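
To illustrate why a single scale helps, here is a minimal Python sketch of one recorded result. It assumes the four integer score fields and the 0–10 range from the table above; the `EvalResult` class and its method names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from statistics import mean

SCALE_MIN, SCALE_MAX = 0, 10  # the shared scale used by every score field


@dataclass
class EvalResult:
    """Hypothetical local mirror of one Eval result (not a Testmo object)."""

    accuracy: int
    completeness: int
    readability: int
    reusability: int
    notes: str = ""

    def scores(self) -> dict[str, int]:
        return {
            "accuracy": self.accuracy,
            "completeness": self.completeness,
            "readability": self.readability,
            "reusability": self.reusability,
        }

    def __post_init__(self) -> None:
        # Every score must be an integer on the shared scale.
        for name, value in self.scores().items():
            if isinstance(value, bool) or not isinstance(value, int):
                raise ValueError(f"{name} must be an integer")
            if not SCALE_MIN <= value <= SCALE_MAX:
                raise ValueError(f"{name}={value} is outside {SCALE_MIN}..{SCALE_MAX}")

    def mean_score(self) -> float:
        # Averaging is only meaningful because all fields share one scale.
        return mean(list(self.scores().values()))
```

For example, `EvalResult(8, 7, 9, 6).mean_score()` gives 7.5. If one field used a 1–5 scale instead, that average would be misleading.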


Step 3: Create Custom Statuses

You’ll need to capture an overall verdict for your Eval. Testmo allows you to create Custom Statuses, so let’s use those.

Go to Testmo Administration → Statuses and edit an existing unused status, configuring the following:

Status Name | Meaning | Success or Failure
Pass | Meets all acceptance criteria | Considered a success
Minor Issues | Acceptable with minor edits | Considered a success (or a failure, depending on your own project thresholds)
Needs Review | Significant issues or omissions | Considered a failure
Fail | Unusable or incorrect output | Considered a failure

#Tip: On the Edit status → Projects tab, make sure the custom statuses are available in the project(s) where your Evals will be executed.
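
Evaluators choose the status by hand, but agreeing on a rule of thumb keeps verdicts consistent. The Python sketch below shows one possible rule based on the lowest score across dimensions. The `verdict` function and its thresholds (8, 6, and 3) are assumptions made for illustration, not Testmo behaviour; set them to match your own project thresholds.

```python
# Statuses treated as a success; move "Minor Issues" out if your project is stricter.
SUCCESS_STATUSES = {"Pass", "Minor Issues"}


def verdict(scores: dict[str, int], pass_at: int = 8, minor_at: int = 6, review_at: int = 3) -> str:
    """Suggest a status from 0-10 scores, driven by the weakest dimension.

    The threshold defaults are hypothetical examples.
    """
    lowest = min(scores.values())
    if lowest >= pass_at:
        return "Pass"
    if lowest >= minor_at:
        return "Minor Issues"
    if lowest >= review_at:
        return "Needs Review"
    return "Fail"


def is_success(status: str) -> bool:
    return status in SUCCESS_STATUSES
```

Using the lowest score rather than the average means a single badly failing dimension cannot be hidden by strong scores elsewhere.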

Figure: Configuring the Result field in Testmo

Step 4: Create the AI Evaluation Template

Once all your fields exist, go to Project Settings → Templates → Add Template.

Name it AI Evaluation, then attach the custom fields you just created.

#Tip: Make sure your custom template is available to the project(s) you want to use it in, by selecting it on the Edit template → Projects tab.

Figure: Adding the Custom Template in Testmo

Step 5: Add Configurations or Platforms

Configurations capture the evaluation context — model version, dataset, or prompt setup. Add them to Testmo via the Administration → Configurations tab as needed, so you can utilise them in the finished Eval specifications and executions.

Examples below:

Group | Platforms
Model Version | GPT-4-Turbo, Llama 3, Claude 3
Prompt Template | “Strict” / “Conversational”
Temperature | 0.2 (Deterministic), 0.7 (Creative)
Dataset | Internal QA stories, Anonymized customer feedback

This lets you compare results like “accuracy improved 15% from v1.1 to v1.2.”

#Note: Testmo configurations are combinations of Groups and Platforms. A configuration from the table above may therefore look like: Llama 3, Strict, 0.7 (Creative), Internal QA stories.

#Tip: Make sure your configuration is available to the project(s) you want to use it in, by selecting it on the Edit configuration → Projects tab.
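
To get a feel for how quickly combinations multiply, the Python sketch below enumerates every configuration that the example groups could produce. The `groups` dictionary just restates the table above; nothing here calls Testmo, and in practice you would likely create only the subset of combinations you plan to evaluate.

```python
from itertools import product

# Groups and platforms restated from the example table.
groups = {
    "Model Version": ["GPT-4-Turbo", "Llama 3", "Claude 3"],
    "Prompt Template": ["Strict", "Conversational"],
    "Temperature": ["0.2 (Deterministic)", "0.7 (Creative)"],
    "Dataset": ["Internal QA stories", "Anonymized customer feedback"],
}

# One configuration = one platform picked from each group.
configurations = [", ".join(combo) for combo in product(*groups.values())]

print(len(configurations))  # 3 * 2 * 2 * 2 = 24 combinations
print(configurations[0])    # GPT-4-Turbo, Strict, 0.2 (Deterministic), Internal QA stories
```

Twenty-four combinations from four small groups is a good reason to decide up front which ones matter for a given run.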

Figure: Configurations are made up of groups and platforms

Step 6: Organize with Tags

Once your Eval template, fields, and configurations are set up, you’re ready to create Evals in Testmo. Create them in your Testmo project Repository, using the template you just configured.

#Tip: Use tags to filter and report on Evals more easily:

  • Feature area (#PromptEval, #Regression)
  • Evaluation type (#LLM, #Generator)
  • Dataset or customer segment

Figure: Eval test cases in Testmo

Step 7: Run and Record Evaluations

  1. Create a Manual Run using your AI Evaluation template.
  2. Assign cases (AI outputs) to evaluators.
  3. Score each criterion and record notes.
  4. Mark the result complete.

Figure: A completed Eval with results

You’ll now have a timestamped, auditable record of every AI evaluation — no spreadsheet cleanup required.

Connecting Automation and Traceability

Most AI evaluations are manual, but Testmo ties neatly into your automated runs too.

For example:

  • A BDD suite checks that generated test cases actually execute.
  • Your Eval checks whether those test cases were any good in the first place.

Together, they create a feedback loop from AI generation → execution → release confidence.

It’s all linked in one system, so you can trace issues end-to-end without switching tools.

Analysing Results and Iterating

Once you’ve logged a few Evals, the patterns start to show.

Figure: An evals test run in Testmo

Use filters and dashboards to:

  • Compare scores across models or datasets
  • Track quality trends over time
  • Spot inconsistencies between evaluators
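
The same comparisons can be sketched in a few lines of Python if you work with the scores outside Testmo. Everything below is an assumption made for illustration: the `results` rows are made-up sample data in a hypothetical (model version, case, evaluator, accuracy score) layout, and the disagreement threshold of 2 points is arbitrary.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample rows: (model version, case, evaluator, accuracy score 0-10).
results = [
    ("v1.1", "case-1", "alice", 6),
    ("v1.1", "case-1", "bob", 7),
    ("v1.1", "case-2", "alice", 5),
    ("v1.1", "case-2", "bob", 6),
    ("v1.2", "case-1", "alice", 8),
    ("v1.2", "case-1", "bob", 4),
    ("v1.2", "case-2", "alice", 7),
    ("v1.2", "case-2", "bob", 7),
]


def mean_by_version(rows):
    """Average score per model version."""
    grouped = defaultdict(list)
    for version, _case, _evaluator, score in rows:
        grouped[version].append(score)
    return {version: mean(scores) for version, scores in grouped.items()}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


def disagreements(rows, max_spread: int = 2):
    """Cases where evaluators' scores differ by more than max_spread points."""
    grouped = defaultdict(list)
    for version, case, _evaluator, score in rows:
        grouped[(version, case)].append(score)
    return [key for key, scores in grouped.items() if max(scores) - min(scores) > max_spread]


averages = mean_by_version(results)
change = percent_change(averages["v1.1"], averages["v1.2"])
flagged = disagreements(results)
```

With this sample data, the averages are 6 for v1.1 and 6.5 for v1.2 (roughly an 8.3% improvement), and the only flagged case is case-1 on v1.2, where the two evaluators are 4 points apart. A gap like that usually points to an unclear rubric rather than a model problem.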

Best Practices for Reliable AI Evaluations

Common Pitfall | How to Avoid It
Inconsistent scoring | Define a clear rubric (e.g., what does “5/10 accuracy” mean?).
No baseline data | Store “gold standard” outputs in Expected Output.
Unclear evaluator notes | Encourage rationale for any low score.
Version drift | Use Configurations to separate model versions.
Unlinked evaluations | Add requirement or feature IDs for traceability.
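
As an example of what a clear rubric can look like, here is a minimal Python sketch that ties score bands to plain-language meanings for the Accuracy dimension. The bands and their wording are hypothetical; the point is only that every score from 0 to 10 maps to exactly one agreed description.

```python
# Hypothetical rubric for Accuracy: (lowest score, highest score, meaning).
ACCURACY_RUBRIC = [
    (0, 2, "Contradicts or ignores the requirement"),
    (3, 5, "Captures part of the intent but gets key behaviour wrong"),
    (6, 8, "Matches the intent with minor inaccuracies"),
    (9, 10, "Matches the intent of the requirement exactly"),
]


def describe(score: int) -> str:
    """Return the agreed meaning of an accuracy score."""
    for low, high, meaning in ACCURACY_RUBRIC:
        if low <= score <= high:
            return meaning
    raise ValueError(f"score {score} is outside the 0 to 10 scale")
```

Under this example rubric, “5/10 accuracy” means the output captures part of the intent but gets key behaviour wrong. Whatever wording you settle on, keep it somewhere evaluators see while they score.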

Putting It All Together

With AI evaluations structured in Testmo, you get:

  • Consistent, reusable criteria
  • Versioned, auditable results
  • Collaborative workflows between QA and engineering
  • A clean replacement for brittle spreadsheets

Every Eval becomes part of your testing story — not an isolated experiment in someone’s drive.

Where to Go Next

Start small.

Create one AI Evaluation template. Evaluate a handful of model outputs. Review the scores as a team.

Then refine — adjust fields, build dashboards, connect automation, and make it part of your regular QA process.
