
This article is the second in a series of AI related articles we’ll be publishing as we go through our journey of implementing AI in Testmo, sharing our learnings as we go. If you want to keep up with our journey, be sure to subscribe to our newsletter. Additionally, you can respond to our survey here: AI Test Case Generation in Testmo
So you’ve read about the ten essential practices for testing AI systems. You’ve nodded along to TRiSM frameworks, red teaming methodologies, and continuous monitoring approaches. Maybe you’ve even bookmarked some of those tools – SHAP, LIME, WhyLabs, the whole arsenal.
But now you’re staring at your feature testing queue, watching AI features creep into the product like Borg nanoprobes, and thinking: “Where do I actually start?”
In this article, you’ll learn a few best practices when building AI testing in your QA process and four steps you can take to start testing new AI features embedded in your application or product:
Let’s be honest about something: a lot advice about how to test AI falls into two camps. Either it’s so high-level it’s useless (“you should care about fairness!”) or so technical it assumes you have a PhD in machine learning (“implement cosine similarity metrics for your semantic evaluation pipeline“!)
Neither helps when you’re a working QA professional trying to figure out how to test the new AI chatbot feature that goes live next sprint.
The ten practices we outlined in our previous article represent our best assessment of current, genuine state-of-the-art AI testing approaches. They align with regulatory expectations, reflect what companies like OpenAI and Anthropic actually do, and go well beyond traditional software testing paradigms. But “state-of-the-art” doesn’t mean “ready for product delivery teams without significant adaptation.”
Here’s an uncomfortable truth: implementing TRiSM frameworks or comprehensive red teaming requires expertise, tooling, and organisational commitment that most delivery teams don’t have yet. That doesn’t make these practices wrong. It just means they’re likely aspirational for those of us working in the trenches on day-to-day product delivery.
The gap between “what we should do” and “what we can do with current resources” is where many teams get stuck. They read about advanced AI testing methodologies and either give up (too complex) or jump in without a proper foundation (chaos ensues).
Here’s a better approach: start where you are, with what you have, and build toward those advanced practices incrementally.
Before we talk about practical implementation, let’s acknowledge why AI testing melts the traditional QA brain. Understanding AI weirdness isn’t academic – it’s essential for knowing what battles to fight and which ones to postpone.
Traditional testing assumes same input equals same output. AI systems mock that assumption. Temperature settings, context windows, model versioning, and data drift can all influence responses in ways that would never occur in traditional software. Run the same prompt through an LLM twice and you might get subtly different answers – not because of a bug, but because that’s how these systems are designed to work. Try writing a test assertion for probabilistic outputs that are correct when they vary and broken when they don’t.
When a REST API fails, you get a 500 error. When an AI fails, it keeps smiling while confidently telling you that Paris is the capital of Italy. Unless you’ve built specific detection mechanisms, these failures won’t show up in your logs, monitoring dashboards, or error tracking systems.
Your beautiful test suite might pass today, degrade silently next month, and collapse entirely in six months – not because anyone deployed broken code, but because the world changed around your model. User behaviour shifted, data distributions drifted, or edge cases you never tested started appearing in production.
In traditional systems, you validate data as it flows through. The purpose of the system is either to display or manipulate that data in predictable way, which can be validated given a set of known inputs. In AI systems, the training data is the system. If your data skews Western, corporate, or male-coded, that bias becomes baked into every prediction. No amount of functional testing will catch that.
This isn’t just a learning curve. It’s a paradigm shift that mandates rethinking some core assumptions about what testing means.
The path from “we should implement comprehensive TRiSM” to “here’s what we’re actually going to test this sprint” requires translation. You need to take those advanced practices and find their practical entry points – the places where you can start building capability without needing a machine learning PhD or a six-figure tool budget.
TRiSM (Trust, Risk, and Security Management) sounds intimidating until you realise it’s just systematic risk-based testing applied to AI. You don’t need specialised frameworks to begin. You need to ask better questions:
These aren’t philosophical questions. They’re AI product design constraints that should shape your testing strategy. Start by cataloging the specific risks your AI features introduce, then work backwards to testing approaches that can surface those risks early.
Red teaming sounds like something that requires elite hacker skills and specialised tools. In reality, it starts with curiosity and systematic exploration. Begin with basic adversarial thinking:
You don’t need specialised security analysis or testing tools. Start with session based exploration, document what you find, and build systematic test cases around the vulnerabilities that matter most to your specific context.
Data privacy in AI isn’t just about GDPR checkboxes and legal Ts & Cs. It’s about understanding how your specific system might ingest or leak information in non-obvious ways. Instead of trying to implement comprehensive privacy testing frameworks, start with targeted exploration:
Build these questions into your exploratory testing cycles. Privacy testing doesn’t require specialized tools – it requires systematic thinking about the potential for information leakage.
You don’t need SHAP or LIME implementations to begin working on explainability. Start by validating that your AI systems can provide reasoning for their decisions when asked. Test the quality and consistency of those explanations:
This human-centered approach to explainability testing is often more valuable than sophisticated interpretability metrics, especially in customer-facing applications.
Here’s a four-step model for how you can begin applying AI testing practices from the ground up.
Before you can test AI features effectively, you need to understand what you’re dealing with. This isn’t about model architecture – it’s about system behaviour and integration points.
This mapping exercise may take a few days, but it provides the foundation for the rest of your testing strategy.
Don’t lose sleep over trying to get a complete understanding though. Remember: “All models are wrong, some are useful.” You just need something useful enough to start driving the right questions from.
Start with simple, systematic test cases that establish baseline behaviour for your AI features. Think of this as your “smoke test” equivalent for AI systems.
These basic test cases won’t catch sophisticated failures (they’re the AI equivalent of smoke tests), but they establish a foundation and help you understand normal system behaviour before you start looking for abnormal patterns.
Once again, you don’t need specialist tools to start monitoring for model drift. Begin with simple metrics you can track over time:
Start with manual sampling and tracking. Once you understand what signals matter for your specific system, you can invest in automated monitoring tools.
AI testing can’t happen in isolation. As a testing leader you should make efforts to establish relationships and processes that will make more advanced testing practices possible:
These relationships are what transform basic AI testing into the comprehensive TRiSM approaches that represent true state-of-the-art practice.
Once you’ve established these foundational practices, you can begin implementing more sophisticated approaches systematically:
The key is building capability incrementally rather than trying to implement comprehensive frameworks all at once.
Ready for some good news? Experienced testers already have the core skills needed for AI quality assurance. You don’t need to become a machine learning expert. You need to apply your existing testing mindset to a new class of problems.
The state-of-the-art practices we outlined in our previous article aren’t foreign concepts requiring complete retraining. They’re an evolution of testing approaches you already know, adapted for AI systems that learn and change over time.
Let’s be realistic about what you can achieve in the short term. You’re probably not going to implement comprehensive TRiSM frameworks in a month, or during the course of a single release. You’re not going to become a red teaming expert overnight. You’re not going to solve AI bias with exploratory testing.
What you can do immediately is to start building the testing capability that will matter as AI becomes more central to your product. You can establish the foundational practices that make advanced AI testing techniques possible. You can become the person on your team who understands AI testing well enough to ask the right questions and design meaningful test strategies.
Most importantly, you can start protecting your users from the kinds of AI failures that aren’t obvious until they’re already causing problems. That’s not state-of-the-art; it’s just good testing. But in the brave new AI world, good testing might be the most advanced practice of all.
Winning at AI testing isn’t about implementing every advanced technique. It’s about systematically reducing the risk that your AI features will surprise users, violate their trust, or cause unintended harm. Start there. Everything else builds on that foundation.
The future of quality assurance is being written right now, in products shipping AI features without adequate testing, in teams learning these practices through trial and error, in testers who are figuring out how to adapt their skills to this new reality.
You don’t need permission to start. You just need to begin.