By dynamically quantizing less important experts and strategically prefetching expert weights, DyMoE achieves 3-22x faster inference on edge devices with no loss of accuracy, making large MoE models practical for real-time edge deployment.
DyMoE optimizes Mixture-of-Experts (MoE) models for edge devices by dynamically adjusting precision during inference. It builds on the observation that some experts contribute more to model output than others: less critical experts are quantized to lower precision while important ones are kept at higher precision, and memory prefetching hides the latency of loading expert weights.
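The importance-based precision assignment can be sketched as follows. This is a minimal illustration, not DyMoE's actual algorithm: the importance metric (average routing weight per expert), the high/low bit-widths, and the fraction of experts kept at high precision are all assumptions chosen for demonstration.

```python
import numpy as np

def assign_expert_precision(gate_scores, high_bits=8, low_bits=4, keep_ratio=0.25):
    """Rank experts by average routing weight and keep the top fraction
    at higher precision (hypothetical importance metric and thresholds)."""
    importance = gate_scores.mean(axis=0)           # avg gate weight per expert
    n_high = max(1, int(len(importance) * keep_ratio))
    top = np.argsort(importance)[::-1][:n_high]     # most important experts
    bits = np.full(len(importance), low_bits)
    bits[top] = high_bits
    return bits

def quantize(weights, bits):
    """Uniform symmetric quantization to the given bit-width."""
    scale = np.abs(weights).max() / (2 ** (bits - 1) - 1)
    return np.round(weights / scale) * scale

rng = np.random.default_rng(0)
gates = rng.dirichlet(np.ones(8), size=128)         # routing weights: 128 tokens, 8 experts
bits_per_expert = assign_expert_precision(gates)    # e.g. 2 experts at 8-bit, 6 at 4-bit
```

In a real deployment the bit assignment would feed a quantized-kernel dispatcher, and a prefetcher would load the next layer's likely experts (predicted from gate scores) into fast memory while the current layer computes.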