MUD delivers 1.3-3x higher token throughput than Muon with similar final performance, making it a practical drop-in replacement that speeds up transformer training without sacrificing convergence.
MUD is a faster alternative to Muon, an optimizer that speeds up transformer training. Instead of using expensive iterative matrix operations to smooth momentum updates, MUD uses a simpler triangular approach inspired by classical numerical methods. This cuts optimizer overhead by 30-70% while matching Muon's per-step convergence, making transformers train 10-50% faster in wall-clock time.
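To make the contrast concrete, here is a minimal NumPy sketch under stated assumptions: Muon is known to orthogonalize the momentum matrix with Newton-Schulz iterations (shown below with the simple cubic iteration rather than Muon's tuned coefficients), while `cholesky_orth` is a hypothetical illustration of a "triangular approach," using one Cholesky factorization plus a triangular solve to orthonormalize the rows in a single pass. It is not MUD's actual algorithm, only a plausible instance of the idea that a triangular factorization can replace repeated matrix multiplies.

```python
import numpy as np

def newton_schulz_orth(M, steps=5):
    # Muon-style smoothing: iteratively push the singular values of M
    # toward 1. Each step costs several full matrix multiplies.
    # (Illustrative cubic iteration X <- 1.5 X - 0.5 X X^T X;
    # Muon itself uses tuned polynomial coefficients.)
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def cholesky_orth(M, eps=1e-7):
    # Hypothetical triangular alternative: one Gram matrix, one Cholesky
    # factorization, one solve. Since M M^T = L L^T, the rows of
    # L^{-1} M are orthonormal (a QR-via-Cholesky trick from classical
    # numerical linear algebra).
    G = M @ M.T + eps * np.eye(M.shape[0])  # eps guards rank deficiency
    L = np.linalg.cholesky(G)
    # np.linalg.solve does not exploit triangularity; a production
    # kernel would use a dedicated triangular solve here.
    return np.linalg.solve(L, M)

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 8))
U = cholesky_orth(M)
print(np.allclose(U @ U.T, np.eye(4), atol=1e-5))  # rows are orthonormal
```

The cost asymmetry is the point of the sketch: the iterative route repeats O(mn·m) multiplies per step, while the triangular route does the factorization work once, which is consistent with the 30-70% overhead reduction claimed above.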