Standard metrics for evaluating learned representations are often misspecified and can mislead you about whether your model actually learned interpretable features.
This paper shows that popular metrics for checking whether AI models learn meaningful, interpretable features are unreliable. Each metric is valid only under specific assumptions, and when those assumptions are violated it gives false results: it may report that a model learned good features when it did not, or that it failed when it actually succeeded. The authors provide tools for properly testing these metrics.
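To make the failure mode concrete, here is a minimal, hypothetical sketch (not the paper's actual tooling): a correlation-based matching score, similar in spirit to the mean correlation coefficient used in identifiability work, implicitly assumes each learned dimension is linearly related to one true factor. The function name and setup below are illustrative assumptions, not definitions from the paper.

```python
# Minimal sketch (not the paper's tooling): a correlation-based matching
# score that assumes each learned dimension is linearly related to one
# true factor. An invertible nonlinear transform preserves all information
# but still lowers the score -- a false negative caused by misspecification.
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_score(z_true, z_learned):
    """Mean |Pearson correlation| under the best one-to-one matching
    of learned dimensions to true factors (illustrative, assumed name)."""
    d = z_true.shape[1]
    # Cross-correlation block between true factors and learned dimensions.
    corr = np.abs(np.corrcoef(z_true.T, z_learned.T)[:d, d:])
    rows, cols = linear_sum_assignment(-corr)  # maximize total correlation
    return corr[rows, cols].mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(10_000, 4))                  # ground-truth factors

linear_rep = z * np.array([2.0, -1.0, 3.0, 0.5])  # per-factor rescaling
nonlinear_rep = z ** 3                            # invertible: no information lost

print(matching_score(z, linear_rep))     # ~1.0: metric reports success
print(matching_score(z, nonlinear_rep))  # ~0.77: metric reports failure
```

Both representations are invertible functions of the true factors, so each preserves exactly the same information; the score gap is an artifact of the metric's linearity assumption. This is the kind of condition violation the paper argues such metrics should be tested against.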