Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao, Lei Hou et al. | May 26, 2026 | arXiv

Key Takeaway

Instead of selecting training data based only on external metrics, you can use SAEs to decode what the model represents internally, then use those signals to organize the data. This makes training more efficient without changing the model architecture.

Summary

This paper shows how to improve LLM post-training by using Sparse Autoencoders (SAEs) to read the model's internal representations and guide data selection. The method clusters training data for diversity, orders it by difficulty, and filters low-quality examples. The reported gains are a 3% improvement in math performance and a 20% reduction in training time on smaller models.
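The three steps named above (filter, cluster, order) can be illustrated with a minimal sketch. It is not the paper's algorithm: the function name `organize_by_sae_features`, the use of the strongest feature as a cluster label, and the count of active features as a proxy for both quality and difficulty are all assumptions made for illustration. The input is assumed to be a matrix of non-negative SAE feature activations, already pooled to one row per training example.

```python
import numpy as np

def organize_by_sae_features(acts, min_active=1, threshold=0.0):
    """Hypothetical sketch: filter, cluster, and order examples by SAE features.

    acts: (n_examples, n_features) array of pooled SAE activations (assumed input).
    Returns (order, clusters): example indices sorted easy-to-hard, and a dict
    mapping each kept example index to a cluster label.
    """
    acts = np.asarray(acts, dtype=float)
    n_active = (acts > threshold).sum(axis=1)

    # Filter: drop examples that fire too few features
    # (an assumed stand-in for a low-quality signal).
    keep = np.flatnonzero(n_active >= min_active)

    # Cluster: label each kept example by its strongest feature
    # (a crude stand-in for a real clustering step).
    labels = acts[keep].argmax(axis=1)
    clusters = dict(zip(keep.tolist(), labels.tolist()))

    # Order: fewer active features first, as an assumed difficulty proxy.
    # A stable sort keeps the original order among ties.
    order = keep[np.argsort(n_active[keep], kind="stable")]
    return order.tolist(), clusters

# Toy activations: 4 examples, 3 SAE features.
toy = [[0, 0, 0], [1, 0, 0], [1, 2, 3], [0, 5, 1]]
order, clusters = organize_by_sae_features(toy)
print(order, clusters)
```

In this toy run, example 0 fires no features and is dropped; the remaining examples are ordered by how many features they activate. The cluster labels could then drive balanced sampling across groups, which is one plausible way to read "clusters training data for diversity".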

training data reasoning

Key Terms

sparse-autoencoder mechanistic-interpretability curriculum-learning reinforcement-learning-from-human-feedback data-engineering