You can reduce optimizer memory by 8x using low-rank decomposition without sacrificing model quality, making it easier to train larger models on l...
This paper makes training large language models cheaper by redesigning how optimizers store momentum information. Instead of keeping full-sized momentum matrices in memory, the authors compress them into smaller low-rank approximations, using roughly 1/8 the memory while maintaining or improving training quality.
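The core idea can be sketched in a few lines. This is a minimal illustration, not the paper's actual algorithm: it assumes an SVD-based projection onto a rank-`r` subspace (here `r = m/8`), keeps the momentum state in that compressed space, and projects back up only when forming the weight update. The function and variable names are illustrative.

```python
import numpy as np

def lowrank_projector(grad: np.ndarray, rank: int) -> np.ndarray:
    """Orthonormal basis for the top-`rank` left-singular subspace of the gradient.

    Illustrative only; the paper's projection method may differ.
    """
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]  # shape (m, rank)

rng = np.random.default_rng(0)
m, n, rank = 1024, 1024, 128  # rank = m / 8

grad = rng.standard_normal((m, n))
P = lowrank_projector(grad, rank)

# Momentum lives in the compressed space: (rank, n) instead of (m, n).
momentum = np.zeros((rank, n))
beta = 0.9
momentum = beta * momentum + (1 - beta) * (P.T @ grad)

# Project back to full size only when applying the update.
update = P @ momentum  # shape (m, n)

full_bytes = m * n * grad.itemsize
compressed_bytes = momentum.size * momentum.itemsize
print(full_bytes / compressed_bytes)  # → 8.0
```

Note that the projector `P` itself occupies `m * rank` entries, so real methods amortize it (e.g., refreshing it only every few hundred steps); the ratio printed here counts only the momentum state.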