You can run reasoning-capable LLMs on mobile devices by using LoRA adapters trained with reinforcement learning to shorten reasoning traces, parallel decoding to reduce latency, and smart KV-cache management, achieving near-full-model accuracy with a fraction of the memory.
This paper makes LLM reasoning practical for mobile devices by combining lightweight LoRA adapters with techniques like budget forcing (to shorten responses), parallel decoding (to speed up generation), and dynamic adapter switching (to activate reasoning only when needed). The result is accurate chain-of-thought reasoning on edge devices without the memory overhead of full models.
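The budget-forcing idea mentioned above can be sketched in a few lines: cap the chain-of-thought at a fixed token budget, then append a phrase that pushes the model straight to its final answer. This is a minimal illustration, not the paper's implementation; the function name, the terminator phrase, and the budget value are all assumptions, and the token stream here is a stub standing in for an on-device decoder.

```python
FORCE_ANSWER = "Final answer:"  # assumed terminator phrase, for illustration

def generate_with_budget(token_stream, budget):
    """Consume reasoning tokens from a (stubbed) decoder until the
    budget is exhausted, then force the answer phrase."""
    trace = []
    for tok in token_stream:
        if len(trace) >= budget:
            trace.append(FORCE_ANSWER)  # cut the reasoning short
            break
        trace.append(tok)
    return trace

# Stand-in for a decoder emitting a long reasoning trace: gets truncated.
long_trace = (f"step{i}" for i in range(20))
print(generate_with_budget(long_trace, budget=8))

# A short trace that fits within the budget passes through untouched.
print(generate_with_budget(iter(["a", "b"]), budget=8))
```

In a real pipeline the truncated trace plus the terminator phrase would be fed back to the model so it decodes only the answer, which is what bounds latency and KV-cache growth on a memory-constrained device.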