Published faithfulness scores for AI reasoning are not comparable across studies: different evaluation methods measure different aspects of the same behavior at different levels of strictness, so always check the methodology, not just the number.
This paper shows that measuring whether AI models are 'faithful' (honestly relying on their stated reasoning) is not objective: different evaluation methods applied to the same data produce widely divergent results, ranging from 69.7% to 82.6% faithfulness for identical models.