You can run reasoning-capable LLMs on mobile devices by using LoRA adapters trained with reinforcement learning to shorten reasoning traces, parallel decoding to reduce latency, and smart KV-cache management, achieving near-full-model accuracy with a fraction of the memory.
This paper makes LLM reasoning practical for mobile devices by combining lightweight LoRA adapters with techniques like budget forcing (to shorten responses), parallel decoding (to speed up generation), and dynamic adapter switching (to activate reasoning only when needed). The result is accurate chain-of-thought reasoning on edge devices without the memory overhead of full models.
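The budget-forcing idea mentioned above can be sketched in a few lines: cap the chain-of-thought at a fixed token budget, then append a phrase that pushes the model straight to its final answer. This is a minimal illustration, not the paper's implementation; the function name, the terminator phrase, and the budget value are all assumptions, and the token stream here is a stub standing in for an on-device decoder.

```python
FORCE_ANSWER = "Final answer:"  # assumed terminator phrase, for illustration

def generate_with_budget(token_stream, budget):
    """Consume reasoning tokens from a (stubbed) decoder until the
    budget is exhausted, then force the answer phrase."""
    trace = []
    for tok in token_stream:
        if len(trace) >= budget:
            trace.append(FORCE_ANSWER)  # cut the reasoning short
            break
        trace.append(tok)
    return trace

# Stand-in for a decoder emitting a long reasoning trace: gets truncated.
long_trace = (f"step{i}" for i in range(20))
print(generate_with_budget(long_trace, budget=8))

# A short trace that fits within the budget passes through untouched.
print(generate_with_budget(iter(["a", "b"]), budget=8))
```

In a real pipeline the truncated trace plus the terminator phrase would be fed back to the model so it decodes only the answer, which is what bounds latency and KV-cache growth on a memory-constrained device.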