This paper shows that large language models fail to produce consistent outputs when a task is reformulated in contextually equivalent ways. Because outputs are unstable across such reformulations, benchmark results may not reflect how models actually behave in real applications, a critical issue for bias testing and other high-stakes uses.
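To make "contextually equivalent formulations" concrete, here is a minimal sketch, not the paper's protocol, of how one might probe a model with paraphrases of the same question and score agreement across its answers; `ask`, `probe_paraphrase_stability`, and the example prompts are illustrative assumptions rather than anything defined in the paper.

```python
from collections import Counter
from typing import Callable, List

def consistency_rate(outputs: List[str]) -> float:
    """Fraction of outputs matching the most common answer (majority agreement)."""
    if not outputs:
        return 0.0
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / len(outputs)

def probe_paraphrase_stability(ask: Callable[[str], str],
                               paraphrases: List[str]) -> float:
    """Query the model once per paraphrase and score agreement across answers."""
    answers = [ask(p).strip().lower() for p in paraphrases]
    return consistency_rate(answers)

if __name__ == "__main__":
    # Semantically equivalent rewordings of one task (hypothetical examples).
    paraphrases = [
        "Is 17 a prime number? Answer yes or no.",
        "Answer yes or no: is the number 17 prime?",
        "Tell me whether 17 is prime (yes/no).",
    ]
    # Stand-in for an LLM call; in practice `ask` would wrap an API request.
    fake_llm = lambda prompt: "yes"
    print(probe_paraphrase_stability(fake_llm, paraphrases))  # 1.0 = fully stable
```

A score of 1.0 means the model gave the same answer under every rewording; the instability described above corresponds to scores well below 1.0 on tasks where the paraphrases are contextually equivalent.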