Explanation consistency matters as much as accuracy in medical AI: models can achieve high classification scores while applying different reasoning strategies to similar cases, and C-Score can detect this instability before the model fails.
This paper introduces C-Score, a new metric that measures whether AI models use consistent visual reasoning across different medical images of the same disease, rather than just checking if explanations match radiologist annotations.