You can train better domain-specific models by mathematically optimizing how many tokens to spend on general pretraining versus specialized training, rather than following a fixed recipe for the two stages.
This paper shows how to efficiently train multiple specialized language models by splitting compute between general pretraining and domain-specific training. Using scaling laws, the authors predict the optimal token allocation for each stage, improving performance on reasoning and knowledge tasks across different model sizes.
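To make the idea concrete, here is a minimal sketch of how a scaling-law-based allocation could work. It assumes a simple additive power-law form for the predicted domain loss and made-up fitted constants; the paper's actual functional form, fitting procedure, and constants may differ. The names `predicted_domain_loss` and `optimal_split` are illustrative, and `scipy` is assumed to be available.

```python
from scipy.optimize import minimize_scalar

# Hypothetical fitted scaling-law constants (illustrative only; not the
# paper's actual values or functional form).
E = 1.7          # irreducible loss
A, ALPHA = 4.0e2, 0.32   # general-pretraining term
B, BETA = 1.2e2, 0.28    # domain-training term


def predicted_domain_loss(general_tokens: float, domain_tokens: float) -> float:
    """Toy two-term scaling law: predicted domain loss falls with tokens
    spent in each stage, with diminishing returns set by ALPHA and BETA."""
    return E + A / (general_tokens ** ALPHA) + B / (domain_tokens ** BETA)


def optimal_split(total_tokens: float) -> tuple[float, float]:
    """Choose the fraction of a fixed token budget spent on general
    pretraining that minimizes the predicted domain loss; the remainder
    goes to domain-specific training."""
    def loss_of_fraction(frac: float) -> float:
        general = frac * total_tokens
        domain = (1.0 - frac) * total_tokens
        return predicted_domain_loss(general, domain)

    result = minimize_scalar(loss_of_fraction, bounds=(0.01, 0.99), method="bounded")
    frac = result.x
    return frac * total_tokens, (1.0 - frac) * total_tokens


if __name__ == "__main__":
    general, domain = optimal_split(total_tokens=1e11)  # e.g. a 100B-token budget
    print(f"general pretraining: {general:.3e} tokens, domain training: {domain:.3e} tokens")
```

Under this kind of model, the same fitted law can be reused to pick a split for each specialized model and each model size, which is where the compute savings over a one-size-fits-all recipe would come from.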