This paper shows that large language models fail to produce consistent outputs when a task is reformulated in contextually equivalent ways. Because outputs are unstable across such reformulations, benchmark results may not reflect how models actually behave in real applications, a critical issue for bias testing and other high-stakes uses.
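To make "contextually equivalent formulations" concrete, here is a minimal sketch, not the paper's protocol, of how one might probe a model with paraphrases of the same question and score agreement across its answers; `ask`, `probe_paraphrase_stability`, and the example prompts are illustrative assumptions rather than anything defined in the paper.

```python
from collections import Counter
from typing import Callable, List

def consistency_rate(outputs: List[str]) -> float:
    """Fraction of outputs matching the most common answer (majority agreement)."""
    if not outputs:
        return 0.0
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / len(outputs)

def probe_paraphrase_stability(ask: Callable[[str], str],
                               paraphrases: List[str]) -> float:
    """Query the model once per paraphrase and score agreement across answers."""
    answers = [ask(p).strip().lower() for p in paraphrases]
    return consistency_rate(answers)

if __name__ == "__main__":
    # Semantically equivalent rewordings of one task (hypothetical examples).
    paraphrases = [
        "Is 17 a prime number? Answer yes or no.",
        "Answer yes or no: is the number 17 prime?",
        "Tell me whether 17 is prime (yes/no).",
    ]
    # Stand-in for an LLM call; in practice `ask` would wrap an API request.
    fake_llm = lambda prompt: "yes"
    print(probe_paraphrase_stability(fake_llm, paraphrases))  # 1.0 = fully stable
```

A score of 1.0 means the model gave the same answer under every rewording; the instability described above corresponds to scores well below 1.0 on tasks where the paraphrases are contextually equivalent.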