This paper tests whether AI agents give consistent answers when the same problem is rephrased in different ways. Surprisingly, larger models proved less stable than smaller ones: Qwen3-30B, for example, outperformed much larger models at maintaining consistent reasoning across rephrasings. The result challenges assumptions about model scaling and suggests that scale alone won't solve reliability issues for deployed AI agents.
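The core evaluation idea, querying a model with several paraphrases of one problem and scoring how often the answers agree, can be sketched as follows. This is a minimal illustration, not the paper's actual harness: `ask_model` is a hypothetical stand-in for a real LLM call, and the pairwise-agreement metric is one simple way to quantify consistency.

```python
from itertools import combinations

def consistency_rate(answers):
    """Fraction of paraphrase pairs that received the same answer."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical stand-in for a real model call; swap in an actual API.
# It deliberately "fails" on one phrasing to mimic instability.
def ask_model(prompt):
    if "2 + 2" in prompt or "two plus two" in prompt:
        return "4"
    return "unknown"

paraphrases = [
    "What is 2 + 2?",
    "Compute two plus two.",
    "If you add 2 and 2, what do you get?",
]
answers = [ask_model(p) for p in paraphrases]
print(consistency_rate(answers))  # → 0.3333333333333333
```

A stable model would score near 1.0 here; the mock model answers two of the three paraphrases identically and misses the third, so only one of three pairs agrees.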