VEPO combines a variable entropy schedule with constrained reinforcement learning to improve low-resource language models, enforcing linguistic well-formedness during training while preserving exploration, and it achieves better tokenization and translation quality across 90 language pairs.
This paper introduces VEPO, a training method that improves language models for low-resource languages by using reinforcement learning to enforce structural constraints, such as proper formatting and sequence-length limits, while dynamically balancing exploration and exploitation.
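The core idea of constrained RL with a variable entropy term can be sketched as a shaped reward: a task reward, minus a penalty for structural-constraint violations, plus an entropy bonus whose coefficient is annealed over training. The function names, the exponential-decay schedule, and all coefficient values below are illustrative assumptions, not VEPO's actual formulation:

```python
import math

def entropy_coefficient(step, beta0=0.1, beta_min=0.01, decay=1e-4):
    # Variable entropy: start with a large exploration bonus and anneal it
    # toward a floor as training progresses (assumed exponential schedule).
    return beta_min + (beta0 - beta_min) * math.exp(-decay * step)

def shaped_reward(task_reward, violations, entropy, step, lam=0.5):
    # Constrained-RL-style shaping (illustrative): subtract a penalty
    # proportional to the number of structural violations (e.g. bad
    # formatting, over-long sequences) and add the annealed entropy bonus.
    return task_reward - lam * violations + entropy_coefficient(step) * entropy
```

Under this sketch, early steps weight the policy's entropy heavily (favoring exploration), while late steps let the constraint penalty and task reward dominate (favoring exploitation of well-formed outputs).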