VEPO combines a variable entropy schedule with constrained reinforcement learning to improve low-resource language models, enforcing linguistic well-formedness during training while preserving exploration, and it achieves better tokenization and translation quality across 90 language pairs.
This paper introduces VEPO, a training method that improves language models for low-resource languages by using reinforcement learning to enforce structural constraints, such as proper formatting and sequence-length limits, while dynamically balancing exploration and exploitation.
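The core idea of constrained RL with a variable entropy term can be sketched as a shaped reward: a task reward, minus a penalty for structural-constraint violations, plus an entropy bonus whose coefficient is annealed over training. The function names, the exponential-decay schedule, and all coefficient values below are illustrative assumptions, not VEPO's actual formulation:

```python
import math

def entropy_coefficient(step, beta0=0.1, beta_min=0.01, decay=1e-4):
    # Variable entropy: start with a large exploration bonus and anneal it
    # toward a floor as training progresses (assumed exponential schedule).
    return beta_min + (beta0 - beta_min) * math.exp(-decay * step)

def shaped_reward(task_reward, violations, entropy, step, lam=0.5):
    # Constrained-RL-style shaping (illustrative): subtract a penalty
    # proportional to the number of structural violations (e.g. bad
    # formatting, over-long sequences) and add the annealed entropy bonus.
    return task_reward - lam * violations + entropy_coefficient(step) * entropy
```

Under this sketch, early steps weight the policy's entropy heavily (favoring exploration), while late steps let the constraint penalty and task reward dominate (favoring exploitation of well-formed outputs).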