Published faithfulness scores for AI reasoning are not comparable across studies: different evaluation methods measure different aspects of the same behavior at different levels of strictness, so always check the methodology, not just the number.
This paper shows that measuring whether AI models are 'faithful' (honestly relying on their stated reasoning) is not objective: different evaluation methods applied to the same data produce widely divergent results, ranging from 69.7% to 82.6% faithfulness for identical models.