This paper tests whether AI agents give consistent answers when the same problem is rephrased in different ways. Surprisingly, larger models proved less stable than smaller ones: Qwen3-30B, for example, outperformed much larger models at maintaining consistent reasoning across rephrasings. The result challenges assumptions about model scaling and suggests that scale alone won't solve reliability issues for deployed AI agents.
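The core evaluation idea, querying a model with several paraphrases of one problem and scoring how often the answers agree, can be sketched as follows. This is a minimal illustration, not the paper's actual harness: `ask_model` is a hypothetical stand-in for a real LLM call, and the pairwise-agreement metric is one simple way to quantify consistency.

```python
from itertools import combinations

def consistency_rate(answers):
    """Fraction of paraphrase pairs that received the same answer."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical stand-in for a real model call; swap in an actual API.
# It deliberately "fails" on one phrasing to mimic instability.
def ask_model(prompt):
    if "2 + 2" in prompt or "two plus two" in prompt:
        return "4"
    return "unknown"

paraphrases = [
    "What is 2 + 2?",
    "Compute two plus two.",
    "If you add 2 and 2, what do you get?",
]
answers = [ask_model(p) for p in paraphrases]
print(consistency_rate(answers))  # → 0.3333333333333333
```

A stable model would score near 1.0 here; the mock model answers two of the three paraphrases identically and misses the third, so only one of three pairs agrees.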