LLM-generated tests perform well on the original code but fail to adapt when the program changes, suggesting the models learn superficial patterns rather than genuine program semantics: a critical weakness for real-world software maintenance.
This study asks whether LLMs actually understand program behavior when generating unit tests, or merely memorize patterns. The researchers applied mutations to 22,374 programs and found that while LLMs generate strong tests for the original code (79% coverage), those tests fail once the code changes: they miss 34% of seeded bugs and degrade even under refactorings that preserve functionality.
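The mutation setup described above can be illustrated with a minimal sketch (all function names here are hypothetical, not from the study): a mutant with a seeded boundary bug survives a test that only mirrors the original code's obvious cases, but is caught by a test that probes program semantics.

```python
# Illustrative sketch of mutation analysis (names are hypothetical).

def is_adult(age):
    """Original program."""
    return age >= 18

def is_adult_mutant(age):
    """Mutant: relational operator flipped (>= to >), seeding a boundary bug."""
    return age > 18

def superficial_test(fn):
    # Mirrors only the original code's obvious cases; a pattern-matched test.
    return fn(30) is True and fn(5) is False

def boundary_test(fn):
    # Exercises the exact boundary, where original and mutant diverge.
    return fn(18) is True

# The superficial test cannot distinguish the mutant (the mutant "survives"),
# while the boundary test detects ("kills") it.
print(superficial_test(is_adult), superficial_test(is_adult_mutant))  # True True
print(boundary_test(is_adult), boundary_test(is_adult_mutant))        # True False
```

A test suite that only "survives" mutants in this way inflates coverage numbers while missing real bugs, which is the failure mode the study measures at scale.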