You can now distill transformer-based LLMs into more efficient xLSTM architectures without significant performance degradation, making it practical to deploy smaller, cheaper models that match their larger teachers.
This paper shows how to compress large language models into smaller xLSTM models with minimal performance loss. The researchers developed a distillation pipeline that combines multiple specialized experts into a single efficient model, and used it to distill models from the Llama, Qwen, and Olmo families while closely matching the teachers' performance.
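The paper's exact pipeline isn't reproduced here, but as a rough illustration of what distilling a transformer teacher into an xLSTM student involves, the sketch below shows a standard logit-matching distillation objective in PyTorch. The function name, temperature, and loss weighting are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of logit-based knowledge distillation (assumed setup, not the
# paper's exact recipe): blend hard-label cross-entropy with a soft KL term
# that pulls the student's token distribution toward the teacher's.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine soft-target KL (teacher matching) with hard-target cross-entropy."""
    # Soft targets: match the teacher's distribution at temperature T;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary next-token cross-entropy on the ground-truth labels.
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft_loss + (1 - alpha) * hard_loss


# Hypothetical usage: `teacher` is the frozen transformer, `student` the xLSTM model.
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# student_logits = student(input_ids).logits
# loss = distillation_loss(student_logits, teacher_logits, labels)
```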