You can reduce optimizer memory by 8x using low-rank decomposition without sacrificing model quality, making it easier to train larger models on l...
This paper makes training large language models cheaper by redesigning how optimizers store momentum information. Instead of keeping full-sized momentum matrices in memory, the authors compress them into smaller low-rank approximations, using roughly 1/8 the memory while maintaining or improving training quality.
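The core idea can be sketched in a few lines. This is a minimal illustration, not the paper's actual algorithm: it assumes an SVD-based projection onto a rank-`r` subspace (here `r = m/8`), keeps the momentum state in that compressed space, and projects back up only when forming the weight update. The function and variable names are illustrative.

```python
import numpy as np

def lowrank_projector(grad: np.ndarray, rank: int) -> np.ndarray:
    """Orthonormal basis for the top-`rank` left-singular subspace of the gradient.

    Illustrative only; the paper's projection method may differ.
    """
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]  # shape (m, rank)

rng = np.random.default_rng(0)
m, n, rank = 1024, 1024, 128  # rank = m / 8

grad = rng.standard_normal((m, n))
P = lowrank_projector(grad, rank)

# Momentum lives in the compressed space: (rank, n) instead of (m, n).
momentum = np.zeros((rank, n))
beta = 0.9
momentum = beta * momentum + (1 - beta) * (P.T @ grad)

# Project back to full size only when applying the update.
update = P @ momentum  # shape (m, n)

full_bytes = m * n * grad.itemsize
compressed_bytes = momentum.size * momentum.itemsize
print(full_bytes / compressed_bytes)  # → 8.0
```

Note that the projector `P` itself occupies `m * rank` entries, so real methods amortize it (e.g., refreshing it only every few hundred steps); the ratio printed here counts only the momentum state.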