Testing AI-Generated Code: A Practical Guide
Learn how to test code produced by AI assistants with practical strategies for test rules, TDD workflows, mocking conventions, and coverage expectations.
AI writes code fast. Testing keeps it honest.
AI coding assistants can produce a working function in seconds. But "working" and "correct" are different things. The function might handle the happy path perfectly while silently dropping edge cases, ignoring error states, or breaking assumptions that the rest of your codebase depends on.
Testing is how you close that gap. And when your code is produced by an AI, the testing strategy needs to account for the specific ways AI-generated code tends to fail: plausible-looking logic that misses boundary conditions, overly optimistic error handling, and subtle misunderstandings of your business domain.
This guide covers how to build a testing workflow around AI-generated code, from writing test rules that your AI assistant follows automatically to running TDD loops where the AI writes code against your tests.
Why AI-generated code needs different testing attention
Code written by a human developer reflects their mental model of the system. When they write a function, they're thinking about how it connects to what they built yesterday and what they'll build tomorrow.
AI assistants don't have that continuity. Each generation is a fresh attempt based on whatever context is available. This creates specific failure patterns:
Missing edge cases. The AI writes the obvious path but skips the null check, the empty array case, or the timeout scenario. It produces code that works for the example you described but breaks on real-world input.
Shallow error handling. A common pattern: the AI wraps everything in try/catch and returns a generic error message. The error is "handled" in the syntactic sense but not in any useful way. No logging, no specific error codes, no recovery logic.
Incorrect assumptions about state. The AI may assume a variable is always defined, a list is always sorted, or a user is always authenticated. These assumptions are invisible in the generated code and only surface at runtime.
Over-reliance on mocks. When asked to write tests, AI assistants tend to mock aggressively, sometimes mocking the very thing the test should be verifying. You end up with tests that pass but prove nothing.
None of these are bugs in the AI. They're predictable consequences of generating code from limited context. Testing is your safety net.
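The shallow error handling pattern is easiest to see side by side. Here is a sketch under stated assumptions: `fetchUser`, the error class, and the log call are illustrative names, not from any real codebase.

```typescript
// Hypothetical stand-in for any call that can fail.
function fetchUser(id: string): { id: string; name: string } {
  if (id === "missing") throw new Error("404: user not found");
  return { id, name: "Alice" };
}

// Typical AI output: the error is "handled" syntactically, but every
// useful signal is thrown away.
function getUserNameShallow(id: string): string {
  try {
    return fetchUser(id).name;
  } catch {
    return "Something went wrong"; // no logging, no error type, no recovery
  }
}

// More useful: classify the failure, log it, and surface a typed error
// the caller can branch on.
class UserNotFoundError extends Error {
  constructor(public readonly userId: string) {
    super(`User ${userId} not found`);
    this.name = "UserNotFoundError";
  }
}

function getUserName(id: string): string {
  try {
    return fetchUser(id).name;
  } catch (err) {
    if (err instanceof Error && err.message.startsWith("404")) {
      console.error("user lookup failed", { id, message: err.message });
      throw new UserNotFoundError(id);
    }
    throw err; // unknown failures propagate instead of being swallowed
  }
}
```

The shallow version passes a happy-path test just as well as the careful one, which is exactly why the test suite needs to probe the error path.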
Write test rules into your project instructions
The single most effective thing you can do is tell your AI assistant how you expect tests to be written. Without explicit instructions, the AI falls back to generic patterns that may not match your framework, your conventions, or your quality bar.
Add a testing section to your project instructions file (.cursorrules, CLAUDE.md, or the equivalent for your tool):
## Testing conventions
- Framework: Vitest + React Testing Library
- Test files live next to source files: `Button.tsx` -> `Button.test.tsx`
- Use `describe` blocks grouped by function/component name
- Use `it` (not `test`) for individual assertions
- Every async function must have tests for both success and error paths
- Mock external dependencies only. Never mock the function under test.
- Use `vi.fn()` for function mocks, `vi.spyOn()` for method spies
- Prefer `userEvent` over `fireEvent` for React component tests
- Always clean up: use `afterEach` to restore mocks with `vi.restoreAllMocks()`
## Do NOT in tests
- Do not use snapshot tests for component behavior (snapshots are only for visual regression)
- Do not test implementation details (internal state, private methods)
- Do not write tests that depend on execution order
- Do not mock database calls in integration tests (use the test database)
These rules eliminate entire categories of AI testing mistakes. The AI stops generating Jest syntax when you use Vitest. It stops mocking everything when you say "mock external dependencies only." It stops writing brittle snapshot tests when you explicitly say not to.
For more on structuring these rules effectively, see best practices for AI coding rules.
TDD with AI: let the AI write code against your tests
Test-Driven Development pairs naturally with AI code generation. The workflow inverts the usual AI interaction: instead of generating code and then writing tests, you write the tests first and let the AI produce code that passes them.
Here's the loop:
- You write the test. Define what the function should do, including edge cases.
- The AI writes the implementation. Point it at the failing test and say "make this pass."
- You review. Check whether the implementation is correct or whether it just games the test.
- Iterate. Add more test cases for scenarios you want covered. Have the AI update the implementation.
This workflow has a key advantage: the AI can't skip edge cases you've already defined. If your test checks for null input, the AI must handle null input. The test is the specification.
A practical example:
// You write this test first
import { describe, expect, it } from "vitest";
import { parseUserInput } from "./parseUserInput"; // adjust to your file layout

describe("parseUserInput", () => {
  it("parses a valid email", () => {
    expect(parseUserInput("alice@example.com")).toEqual({
      type: "email",
      value: "alice@example.com",
    });
  });

  it("rejects an empty string", () => {
    expect(parseUserInput("")).toEqual({
      type: "error",
      message: "Input cannot be empty",
    });
  });

  it("trims whitespace before parsing", () => {
    expect(parseUserInput(" alice@example.com ")).toEqual({
      type: "email",
      value: "alice@example.com",
    });
  });

  it("rejects strings longer than 254 characters", () => {
    const long = "a".repeat(255) + "@example.com";
    expect(parseUserInput(long)).toEqual({
      type: "error",
      message: "Input exceeds maximum length",
    });
  });
});
Then tell the AI: "Write the parseUserInput function that passes all these tests." The AI produces an implementation constrained by your specification, not by its own assumptions about what matters.
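One implementation the assistant might come back with, sketched here so the loop is concrete. The return shape is dictated by the tests above; the email regex and the fallback error message are assumptions, since no test pins them down.

```typescript
type ParseResult =
  | { type: "email"; value: string }
  | { type: "error"; message: string };

function parseUserInput(raw: string): ParseResult {
  const input = raw.trim(); // satisfies the whitespace test
  if (input.length === 0) {
    return { type: "error", message: "Input cannot be empty" };
  }
  if (input.length > 254) {
    return { type: "error", message: "Input exceeds maximum length" };
  }
  // Deliberately minimal email check: the tests, not this regex,
  // define what counts as valid.
  if (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(input)) {
    return { type: "email", value: input };
  }
  // No test covers this branch yet; the message is a placeholder
  // until you add a test that specifies it.
  return { type: "error", message: "Unrecognized input format" };
}
```

Notice that every branch exists because a test demanded it. If you want different behavior for the unspecified branch, you add a test first and regenerate.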
TDD with AI is especially useful for utility functions, data transformations, and validation logic: anywhere the inputs and outputs are well-defined.
Testing patterns that work well with AI assistants
The arrange-act-assert structure
AI assistants produce more consistent tests when you tell them to follow a specific structure. Arrange-Act-Assert (AAA) works well because each section has a clear purpose:
it("returns 404 when user is not found", async () => {
  // Arrange
  const mockDb = { findUser: vi.fn().mockResolvedValue(null) };
  const handler = createHandler(mockDb);

  // Act
  const response = await handler({ params: { id: "nonexistent" } });

  // Assert
  expect(response.status).toBe(404);
  expect(await response.json()).toEqual({ error: "User not found" });
});
Add this to your testing rules: "Structure all tests as Arrange-Act-Assert with comments marking each section." The AI will follow it consistently once instructed.
Test the contract, not the implementation
AI-generated tests often test how something is done rather than what it does. This creates brittle tests that break whenever the implementation changes, even if the behavior is still correct.
Encode this in your rules:
## Test philosophy
Test the public interface and observable behavior. Do not test:
- The order of internal function calls
- Which private methods were invoked
- The internal structure of intermediate variables
A test should only break when the behavior changes, not when the implementation is refactored.
This rule alone prevents a large class of fragile tests that AI assistants tend to write.
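The distinction shows up even in plain assertions. A minimal sketch with a hypothetical `Cache` class:

```typescript
// A small cache whose public contract is get/set/has. The Map inside
// is an implementation detail: it could become an LRU list tomorrow.
class Cache<V> {
  private store = new Map<string, V>();
  set(key: string, value: V): void {
    this.store.set(key, value);
  }
  get(key: string): V | undefined {
    return this.store.get(key);
  }
  has(key: string): boolean {
    return this.store.has(key);
  }
}

const cache = new Cache<number>();
cache.set("a", 1);

// GOOD: asserts observable behavior; survives any internal refactor.
console.assert(cache.get("a") === 1);
console.assert(!cache.has("b"));

// BAD: asserts internal structure; breaks the moment `store` is
// renamed or replaced, even though behavior is identical.
// expect((cache as any).store).toBeInstanceOf(Map);
```

The bad assertion is the kind AI assistants reach for when they can see the implementation: it mirrors the code instead of specifying it.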
Error path testing
Tell the AI to always test error paths. A rule like this works well:
## Error path coverage
For every function that can fail (throws, returns an error, rejects a promise):
- Test with invalid input
- Test with missing required fields
- Test with network/database failures (mock the failure)
- Verify the error message and error type, not just that "it threw"
Without this rule, AI assistants will write a happy-path test and move on. With it, they produce tests that verify your error handling actually works the way you intend.
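As a concrete instance of the last bullet, here is the gap between "it threw" and verifying the error. The `validateAge` function and `ValidationError` class are hypothetical:

```typescript
class ValidationError extends Error {
  constructor(public readonly field: string, message: string) {
    super(message);
    this.name = "ValidationError";
  }
}

// Hypothetical function under test.
function validateAge(input: unknown): number {
  if (typeof input !== "number" || Number.isNaN(input)) {
    throw new ValidationError("age", "Age must be a number");
  }
  if (input < 0 || input > 150) {
    throw new ValidationError("age", "Age must be between 0 and 150");
  }
  return input;
}

// Weak check: only proves *something* was thrown. A typo that makes
// every input throw would still pass.
let threw = false;
try {
  validateAge(-1);
} catch {
  threw = true;
}
console.assert(threw);

// Stronger check: verifies the error type, field, and message, so a
// regression that throws the wrong error still fails.
try {
  validateAge(-1);
} catch (err) {
  console.assert(err instanceof ValidationError);
  console.assert((err as ValidationError).field === "age");
  console.assert((err as ValidationError).message === "Age must be between 0 and 150");
}
```

In Vitest the stronger form is typically written as `expect(() => validateAge(-1)).toThrow(ValidationError)` combined with a message assertion.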
Mocking conventions: where AI goes wrong
Mocking is where AI-generated tests break down most often. The AI mocks too much, mocks the wrong things, or creates mocks that don't reflect real behavior.
Set explicit mocking rules:
## Mocking rules
Mock these: network calls, file system access, third-party APIs, timers/dates.
Do NOT mock these: your own utility functions, the function under test, database
queries in integration tests.
When creating mocks:
- Mock at the boundary (the HTTP client, not the function that calls it)
- Return realistic data shapes, not empty objects
- Include error mocks (network timeout, 500 response, malformed JSON)
The "mock at the boundary" rule is critical. AI assistants frequently mock intermediate layers, which means the test only verifies that function A calls function B with the right arguments. It doesn't verify that the actual logic works end-to-end.
Here's an example of a common AI mistake and the correction:
// BAD: mocking the function under test
vi.mock("./calculateDiscount", () => ({
  calculateDiscount: vi.fn().mockReturnValue(0.15),
}));

it("applies discount", () => {
  // This tests nothing. You mocked the answer.
  expect(calculateDiscount(100, "VIP")).toBe(0.15);
});

// GOOD: mock only the external dependency
vi.mock("./pricingApi", () => ({
  fetchDiscountRate: vi.fn().mockResolvedValue({ rate: 0.15 }),
}));

it("calculates discount from API rate", async () => {
  const result = await calculateDiscount(100, "VIP");
  expect(result).toBe(85); // 100 - (100 * 0.15)
});
Including examples like this in your rules file gives the AI a concrete reference point. It stops generating the bad pattern because it can see the contrast.
Coverage expectations: what to aim for
AI assistants can write a lot of tests quickly, but volume is not the goal. Tell the AI what coverage means for your project:
## Coverage expectations
- Business logic: 90%+ line coverage, 80%+ branch coverage
- API routes: every route has at least success, validation error, and auth error tests
- React components: test user interactions and conditional rendering, not styling
- Utility functions: 100% coverage including edge cases
- Generated/scaffolded code: no tests needed for trivial getters/setters
These expectations are more useful than a blanket "90% coverage" rule because they communicate where testing effort should be concentrated. The AI will write thorough tests for business logic and skip trivial coverage-padding tests for code that doesn't need them.
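If you want these numbers enforced mechanically rather than by convention, Vitest can fail the run when thresholds are missed. A sketch of the relevant config; the numbers mirror the rules above, and you should check your Vitest version's docs for the exact shape of the `thresholds` option (per-glob thresholds in particular are a newer addition):

```typescript
// vitest.config.ts
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    coverage: {
      provider: "v8",
      thresholds: {
        lines: 90,
        branches: 80,
        // Hold utilities to the stricter 100% bar from the rules above.
        "src/utils/**": { lines: 100, branches: 100 },
      },
    },
  },
});
```

Enforced thresholds turn the coverage expectations from advice into a gate: the AI-generated code either ships with adequate tests or the CI run fails.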
A note on coverage as a metric: when AI writes both the code and the tests, high coverage numbers can be misleading. The tests might cover every line without actually verifying meaningful behavior. This is why the "test the contract, not the implementation" rule matters so much. Coverage tells you what code was executed during tests. It doesn't tell you whether the assertions are meaningful.
Building a testing skill for your team
Once you've dialed in your testing conventions, the next step is making them portable. If five engineers on your team are using AI assistants, all five need the same testing rules, or you'll end up with five different testing styles in the same codebase.
This is exactly what agent skills were designed for. Publish your testing conventions as a skill:
localskills publish
Teammates install it with one command:
localskills install your-team/testing-standards --target cursor claude
Every AI assistant on the team now follows the same mocking conventions, coverage expectations, and test structure. When you update the rules (adding a new pattern, tightening a convention), teammates pull the update and their AI tools adapt immediately.
This also ties into enforcing coding standards with AI: testing rules are coding standards. They deserve the same versioning, sharing, and maintenance as any other team convention.
Integrating AI testing into your review process
Even with good rules in place, review AI-generated tests with the same scrutiny you apply to AI-generated code. A few things to check in every review:
Are the assertions meaningful? A test that calls a function and asserts expect(result).toBeDefined() is barely a test. Look for specific value assertions that would fail if the behavior changed.
Are the mocks realistic? Mock data should look like real data. If your API returns objects with 12 fields and the mock has 2, the test might miss issues with how the code handles the full shape.
Does the test name describe the behavior? "it should work" tells you nothing. "it returns 404 when the user ID does not exist in the database" tells you exactly what breaks if this test fails.
Is there a missing test case? AI assistants write tests for the scenarios they think of, which correlates with what's common in training data. Your domain-specific edge cases (timezone-dependent logic, currency rounding, permission hierarchies) need your judgment.
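A cheap way to keep mocks realistic is a test-data factory that always returns the full shape, with overrides for only the field a given test cares about. The `User` shape here is hypothetical:

```typescript
interface User {
  id: string;
  email: string;
  name: string;
  role: "admin" | "member";
  createdAt: string;
  deletedAt: string | null;
  settings: { locale: string; newsletter: boolean };
}

// One factory, full shape every time. Tests override only what they
// are actually about, so the rest of the object stays realistic.
function makeUser(overrides: Partial<User> = {}): User {
  return {
    id: "usr_123",
    email: "alice@example.com",
    name: "Alice",
    role: "member",
    createdAt: "2024-01-01T00:00:00Z",
    deletedAt: null,
    settings: { locale: "en-US", newsletter: false },
    ...overrides,
  };
}

// A test about permissions changes only the role.
const admin = makeUser({ role: "admin" });
```

Add the factory pattern to your rules file ("use the factories in test/factories for mock data") and AI-generated tests will stop inventing two-field stub objects.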
For more on setting up review workflows that catch these issues, check out our Claude Code tips and AI code generation best practices.
Quick reference: testing checklist for AI-generated code
Before merging any AI-generated code:
- Tests exist for both success and error paths
- Mocks are at the boundary, not on internal functions
- Assertions check specific values, not just truthiness
- Edge cases are covered (empty input, null, max length, concurrent access)
- Test names describe the expected behavior
- No snapshot tests for behavioral verification
- Coverage meets your team's threshold for that code category
- You can explain what each test verifies without reading the implementation
For ongoing improvement:
- Testing conventions are documented in your project instructions file
- The same conventions are shared across all AI tools your team uses
- When a bug slips through, add a test rule that would have caught it
- Review AI-written tests as carefully as AI-written code
Testing AI-generated code is not about trusting the AI less. It's about building the same safety nets you'd build for any code, with rules tuned to the specific ways AI output can go wrong. Write your testing conventions once, share them with your team, and let the AI generate code you can actually trust.
Sign up at localskills.sh to publish your testing standards as a shared skill and keep every AI tool on your team aligned.
npm install -g @localskills/cli
localskills login
localskills publish