LLM-generated tests perform well on the original code but fail to adapt when the program changes, suggesting the models learn superficial patterns rather than genuine program semantics: a critical weakness for real-world software maintenance.
This study asks whether LLMs actually understand program behavior when generating unit tests, or merely memorize patterns. The researchers applied mutations to 22,374 programs and found that while LLMs generate strong tests for the original code (79% coverage), those tests fail once the code changes: they miss 34% of seeded bugs and degrade even under refactorings that preserve functionality.
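The mutation setup described above can be illustrated with a minimal sketch (all function names here are hypothetical, not from the study): a mutant with a seeded boundary bug survives a test that only mirrors the original code's obvious cases, but is caught by a test that probes program semantics.

```python
# Illustrative sketch of mutation analysis (names are hypothetical).

def is_adult(age):
    """Original program."""
    return age >= 18

def is_adult_mutant(age):
    """Mutant: relational operator flipped (>= to >), seeding a boundary bug."""
    return age > 18

def superficial_test(fn):
    # Mirrors only the original code's obvious cases; a pattern-matched test.
    return fn(30) is True and fn(5) is False

def boundary_test(fn):
    # Exercises the exact boundary, where original and mutant diverge.
    return fn(18) is True

# The superficial test cannot distinguish the mutant (the mutant "survives"),
# while the boundary test detects ("kills") it.
print(superficial_test(is_adult), superficial_test(is_adult_mutant))  # True True
print(boundary_test(is_adult), boundary_test(is_adult_mutant))        # True False
```

A test suite that only "survives" mutants in this way inflates coverage numbers while missing real bugs, which is the failure mode the study measures at scale.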