You can scale LLM reasoning at inference time without exploding memory costs by combining efficient attention architectures with parameter sharing—YOCO-U shows this works better than either approach alone.
Universal YOCO combines YOCO's decoder-decoder architecture, in which a shallow self-decoder produces a single global KV cache that every cross-decoder layer reuses, with Universal-Transformer-style recursive computation to enable efficient test-time scaling in language models. By reusing one set of parameters across multiple iterations of the shallow layers while keeping the KV cache at a constant size, it scales reasoning depth at inference time without the memory and compute overhead that typically comes with scaling inference-time compute.
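To make the two ingredients concrete, here is a minimal NumPy sketch of the idea, not the actual implementation: one set of shallow-layer weights is applied recursively (so effective depth grows with iterations while parameter count stays fixed), and the resulting hidden states are cached once as a global KV cache that all subsequent cross-attention layers share (so cache size is independent of both recursion depth and the number of cross layers). All names (`self_decoder`, `build_kv_cache`, `cross_decoder`, `n_recursions`) and the toy layer math are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size
T = 5  # toy sequence length

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# ONE set of shallow-layer weights, reused every iteration
# (Universal-Transformer-style parameter sharing).
shared_W = rng.standard_normal((D, D)) / np.sqrt(D)

def self_decoder(x, n_recursions):
    # Apply the SAME weights n_recursions times: effective depth scales
    # with n_recursions, parameter count does not.
    for _ in range(n_recursions):
        x = np.tanh(x @ shared_W)
    return x

def build_kv_cache(hidden):
    # YOCO-style: the cache is produced once from the self-decoder output;
    # its size depends only on sequence length, not on depth.
    return {"k": hidden.copy(), "v": hidden.copy()}

# Cross-decoder layers have their own weights but NO per-layer KV cache:
# they all attend to the single shared cache.
cross_Ws = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3)]

def cross_decoder(x, kv):
    for W in cross_Ws:
        attn = softmax(x @ kv["k"].T) @ kv["v"]  # attend to the shared cache
        x = np.tanh((x + attn) @ W)
    return x

x = rng.standard_normal((T, D))
kv = build_kv_cache(self_decoder(x, n_recursions=4))
y = cross_decoder(x, kv)

print(y.shape)         # (5, 8)
# Doubling recursion depth leaves the cache footprint unchanged:
kv_deep = build_kv_cache(self_decoder(x, n_recursions=8))
print(kv_deep["k"].shape == kv["k"].shape)  # True
```

The last two lines show the key property: spending more inference-time compute (more recursive iterations) changes neither the parameter count nor the KV-cache size, which is what makes test-time scaling cheap in this design.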