MUD delivers 1.3-3x higher token throughput than Muon with similar final performance, making it a practical drop-in replacement that speeds up transformer training without sacrificing convergence.
MUD is a faster alternative to Muon, an optimizer that speeds up transformer training. Instead of using expensive iterative matrix operations to smooth momentum updates, MUD uses a simpler triangular approach inspired by classical numerical methods. This cuts optimizer overhead by 30-70% while matching Muon's per-step convergence, making transformers train 10-50% faster in wall-clock time.
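To make the contrast concrete, here is a minimal NumPy sketch under stated assumptions: Muon is known to orthogonalize the momentum matrix with Newton-Schulz iterations (shown below with the simple cubic iteration rather than Muon's tuned coefficients), while `cholesky_orth` is a hypothetical illustration of a "triangular approach," using one Cholesky factorization plus a triangular solve to orthonormalize the rows in a single pass. It is not MUD's actual algorithm, only a plausible instance of the idea that a triangular factorization can replace repeated matrix multiplies.

```python
import numpy as np

def newton_schulz_orth(M, steps=5):
    # Muon-style smoothing: iteratively push the singular values of M
    # toward 1. Each step costs several full matrix multiplies.
    # (Illustrative cubic iteration X <- 1.5 X - 0.5 X X^T X;
    # Muon itself uses tuned polynomial coefficients.)
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def cholesky_orth(M, eps=1e-7):
    # Hypothetical triangular alternative: one Gram matrix, one Cholesky
    # factorization, one solve. Since M M^T = L L^T, the rows of
    # L^{-1} M are orthonormal (a QR-via-Cholesky trick from classical
    # numerical linear algebra).
    G = M @ M.T + eps * np.eye(M.shape[0])  # eps guards rank deficiency
    L = np.linalg.cholesky(G)
    # np.linalg.solve does not exploit triangularity; a production
    # kernel would use a dedicated triangular solve here.
    return np.linalg.solve(L, M)

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 8))
U = cholesky_orth(M)
print(np.allclose(U @ U.T, np.eye(4), atol=1e-5))  # rows are orthonormal
```

The cost asymmetry is the point of the sketch: the iterative route repeats O(mn·m) multiplies per step, while the triangular route does the factorization work once, which is consistent with the 30-70% overhead reduction claimed above.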