You can scale LLM reasoning at inference time without exploding memory costs by combining efficient attention architectures with parameter sharing—YOCO-U shows this works better than either approach alone.
Universal YOCO combines YOCO's decoder-decoder architecture, in which a shallow self-decoder produces a single global KV cache that every cross-decoder layer reuses, with Universal-Transformer-style recursive computation to enable efficient test-time scaling in language models. By reusing one set of parameters across multiple iterations of the shallow layers while keeping the KV cache at a constant size, it scales reasoning depth at inference time without the memory and compute overhead that typically comes with scaling inference-time compute.
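To make the two ingredients concrete, here is a minimal NumPy sketch of the idea, not the actual implementation: one set of shallow-layer weights is applied recursively (so effective depth grows with iterations while parameter count stays fixed), and the resulting hidden states are cached once as a global KV cache that all subsequent cross-attention layers share (so cache size is independent of both recursion depth and the number of cross layers). All names (`self_decoder`, `build_kv_cache`, `cross_decoder`, `n_recursions`) and the toy layer math are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size
T = 5  # toy sequence length

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# ONE set of shallow-layer weights, reused every iteration
# (Universal-Transformer-style parameter sharing).
shared_W = rng.standard_normal((D, D)) / np.sqrt(D)

def self_decoder(x, n_recursions):
    # Apply the SAME weights n_recursions times: effective depth scales
    # with n_recursions, parameter count does not.
    for _ in range(n_recursions):
        x = np.tanh(x @ shared_W)
    return x

def build_kv_cache(hidden):
    # YOCO-style: the cache is produced once from the self-decoder output;
    # its size depends only on sequence length, not on depth.
    return {"k": hidden.copy(), "v": hidden.copy()}

# Cross-decoder layers have their own weights but NO per-layer KV cache:
# they all attend to the single shared cache.
cross_Ws = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3)]

def cross_decoder(x, kv):
    for W in cross_Ws:
        attn = softmax(x @ kv["k"].T) @ kv["v"]  # attend to the shared cache
        x = np.tanh((x + attn) @ W)
    return x

x = rng.standard_normal((T, D))
kv = build_kv_cache(self_decoder(x, n_recursions=4))
y = cross_decoder(x, kv)

print(y.shape)         # (5, 8)
# Doubling recursion depth leaves the cache footprint unchanged:
kv_deep = build_kv_cache(self_decoder(x, n_recursions=8))
print(kv_deep["k"].shape == kv["k"].shape)  # True
```

The last two lines show the key property: spending more inference-time compute (more recursive iterations) changes neither the parameter count nor the KV-cache size, which is what makes test-time scaling cheap in this design.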