You can train better domain-specific models by mathematically optimizing how many tokens to spend on general pretraining versus specialized training, rather than following a fixed recipe for the two stages.
This paper shows how to efficiently train multiple specialized language models by splitting compute between general pretraining and domain-specific training. Using scaling laws, the authors predict the optimal token allocation for each stage, improving performance on reasoning and knowledge tasks across different model sizes.
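To make the idea concrete, here is a minimal sketch of how a scaling-law-based allocation could work. It assumes a simple additive power-law form for the predicted domain loss and made-up fitted constants; the paper's actual functional form, fitting procedure, and constants may differ. The names `predicted_domain_loss` and `optimal_split` are illustrative, and `scipy` is assumed to be available.

```python
from scipy.optimize import minimize_scalar

# Hypothetical fitted scaling-law constants (illustrative only; not the
# paper's actual values or functional form).
E = 1.7          # irreducible loss
A, ALPHA = 4.0e2, 0.32   # general-pretraining term
B, BETA = 1.2e2, 0.28    # domain-training term


def predicted_domain_loss(general_tokens: float, domain_tokens: float) -> float:
    """Toy two-term scaling law: predicted domain loss falls with tokens
    spent in each stage, with diminishing returns set by ALPHA and BETA."""
    return E + A / (general_tokens ** ALPHA) + B / (domain_tokens ** BETA)


def optimal_split(total_tokens: float) -> tuple[float, float]:
    """Choose the fraction of a fixed token budget spent on general
    pretraining that minimizes the predicted domain loss; the remainder
    goes to domain-specific training."""
    def loss_of_fraction(frac: float) -> float:
        general = frac * total_tokens
        domain = (1.0 - frac) * total_tokens
        return predicted_domain_loss(general, domain)

    result = minimize_scalar(loss_of_fraction, bounds=(0.01, 0.99), method="bounded")
    frac = result.x
    return frac * total_tokens, (1.0 - frac) * total_tokens


if __name__ == "__main__":
    general, domain = optimal_split(total_tokens=1e11)  # e.g. a 100B-token budget
    print(f"general pretraining: {general:.3e} tokens, domain training: {domain:.3e} tokens")
```

Under this kind of model, the same fitted law can be reused to pick a split for each specialized model and each model size, which is where the compute savings over a one-size-fits-all recipe would come from.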