Sparse autoencoders fail at compositional generalization because they learn poor concept dictionaries during training, not because of their amortized inference approach; fixing dictionary learning, rather than inference, is the key to interpretable AI.
This paper examines why sparse autoencoders (SAEs) and linear probes fail to capture compositional concepts in neural networks. The core issue is not the inference method; it is that SAEs learn dictionaries (concept representations) that point in the wrong directions.
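To make the distinction concrete, here is a minimal sketch (not the paper's code; all names, sizes, and hyperparameters are illustrative) separating an SAE's two ingredients: the dictionary, whose columns are the learned concept directions, and the inference procedure that maps an activation to a sparse code. It runs both a one-pass amortized encoder and iterative sparse coding (ISTA) over the same dictionary, which is the sense in which the dictionary, not the inference scheme, determines what the SAE can represent.

```python
# Minimal illustrative sketch: dictionary vs. inference in a sparse autoencoder.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents = 64, 256

# Dictionary: one unit-norm direction (column) per learned "concept".
D = rng.standard_normal((d_model, n_latents))
D /= np.linalg.norm(D, axis=0, keepdims=True)

# Amortized inference: a single learned encoder pass (what standard SAEs use).
W_enc = D.T.copy()            # tied-weights encoder, a common initialization
b_enc = np.zeros(n_latents)

def amortized_codes(x):
    # One forward pass; ReLU yields non-negative sparse codes.
    return np.maximum(W_enc @ x + b_enc, 0.0)

# Iterative inference: sparse coding (ISTA) over the *same* dictionary.
def ista_codes(x, lam=0.1, lr=0.1, steps=100):
    z = np.zeros(n_latents)
    for _ in range(steps):
        grad = D.T @ (D @ z - x)                                  # reconstruction gradient
        z = z - lr * grad
        z = np.sign(z) * np.maximum(np.abs(z) - lr * lam, 0.0)    # soft threshold
    return z

# A synthetic activation built from a few dictionary directions.
mask = rng.random(n_latents) < 0.02
x = D @ (rng.random(n_latents) * mask)

for name, z in [("amortized", amortized_codes(x)), ("ISTA", ista_codes(x))]:
    err = np.linalg.norm(x - D @ z) / np.linalg.norm(x)
    print(f"{name:9s} rel. reconstruction error: {err:.3f}, nonzeros: {int((z > 1e-6).sum())}")
```

Either inference routine can only combine the directions the dictionary already contains, so if those directions do not align with the true compositional concepts, swapping amortized encoding for slower iterative inference cannot fix the interpretation.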