Removing redundant or low-frequency facts from training data helps models memorize factual knowledge better, letting smaller models achieve the same fact accuracy as much larger ones.
This paper shows that LLMs struggle to memorize facts when the training data contains too many facts or when fact frequencies are heavily skewed. The researchers propose a data pruning method that selects which facts to include in training, enabling smaller models to memorize significantly more facts. In their experiments, a GPT2-Small model trained on pruned data matched the fact accuracy of a 10x larger model trained on the full data.
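
To make the idea of fact-level pruning concrete, here is a minimal Python sketch. It is not the paper's method: the summary does not state the selection criterion, so the sketch assumes a simple frequency-based rule that keeps a limited number of facts and caps how often each one repeats, which flattens a skewed distribution. The function name `prune_fact_examples` and the parameters `max_facts` and `max_repeats` are hypothetical.

```python
from collections import Counter


def prune_fact_examples(fact_ids, max_facts, max_repeats):
    """Return the indices of training examples to keep.

    fact_ids[i] is a (hypothetical) identifier for the fact that
    example i expresses. The selection rule is an assumption made
    for illustration, not the criterion used in the paper.
    """
    counts = Counter(fact_ids)

    # Limit the number of distinct facts: keep the most frequent ones.
    kept_facts = {fact for fact, _ in counts.most_common(max_facts)}

    # Flatten the frequency distribution: cap repeats per kept fact.
    seen = Counter()
    kept_indices = []
    for i, fact in enumerate(fact_ids):
        if fact in kept_facts and seen[fact] < max_repeats:
            seen[fact] += 1
            kept_indices.append(i)
    return kept_indices


# Toy corpus: "a" dominates, "c" is rare.
example_ids = ["a", "a", "a", "a", "b", "b", "c"]
kept = prune_fact_examples(example_ids, max_facts=2, max_repeats=2)
print(kept)
```

In this toy corpus, the rare fact `c` is dropped and the dominant fact `a` is cut to two occurrences, so the two retained facts end up equally represented.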