Instead of picking training data based only on external metrics, you can use SAEs to decode what the model actually learns internally, then use those signals to organize the data. This makes training more efficient without changing the model architecture.
This paper shows how to improve LLM training by using Sparse Autoencoders (SAEs) to read the model's internal representations and guide data selection. The method does three things with the training data: it clusters examples for diversity, orders them by difficulty, and filters out low-quality examples. The reported result is a 3% improvement in math performance and a 20% cut in training time on smaller models.
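To make the three steps concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the paper's implementation: the summary does not say how quality or difficulty are scored, so `quality` and `difficulty` are hypothetical per-example scores passed in by the caller, and grouping examples by their most active SAE feature is a simple stand-in for whatever clustering the authors use. The names `sae_encode`, `organize_data`, `min_quality`, and `per_cluster` are likewise invented for this example.

```python
import numpy as np


def sae_encode(activations, W_enc, b_enc):
    """Encoder half of a ReLU sparse autoencoder: relu(x @ W_enc + b_enc)."""
    return np.maximum(np.asarray(activations) @ W_enc + b_enc, 0.0)


def organize_data(features, quality, difficulty, min_quality, per_cluster):
    """Return example indices after filtering, diversifying, and ordering.

    features:   (n_examples, n_features) SAE feature activations
    quality:    hypothetical per-example quality score (higher is better)
    difficulty: hypothetical per-example difficulty score (higher is harder)
    """
    features = np.asarray(features, dtype=float)
    quality = np.asarray(quality, dtype=float)
    difficulty = np.asarray(difficulty, dtype=float)

    # 1. Filter: drop examples whose quality score is below the threshold.
    kept = np.flatnonzero(quality >= min_quality)

    # 2. Cluster (assumption): group examples by their most active SAE feature.
    cluster_of = features[kept].argmax(axis=1)

    # 3. Diversity: keep at most `per_cluster` best-quality examples per cluster.
    selected = []
    for c in np.unique(cluster_of):
        members = kept[cluster_of == c]
        best_first = members[np.argsort(-quality[members], kind="stable")]
        selected.extend(best_first[:per_cluster].tolist())

    # 4. Curriculum: order the selected examples from easiest to hardest.
    selected = np.array(selected, dtype=int)
    ordered = selected[np.argsort(difficulty[selected], kind="stable")]
    return ordered.tolist()


# Toy data: five examples, three SAE features.
toy_features = [
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
    [0.0, 0.7, 0.0],
    [0.0, 0.0, 0.5],
    [0.6, 0.0, 0.0],
]
toy_quality = [0.9, 0.8, 0.7, 0.2, 0.95]
toy_difficulty = [3, 1, 2, 0, 5]

toy_order = organize_data(
    toy_features, toy_quality, toy_difficulty, min_quality=0.5, per_cluster=2
)
print(toy_order)  # [2, 0, 4]
```

In the toy run, example 3 is filtered out for low quality, example 1 is dropped because its cluster already has two higher-quality members, and the remaining three are emitted easiest first. A real pipeline would compute `features` by running `sae_encode` on activations captured from the model, and would need its own definitions of the two scores.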