Reorganizing how you compress the KV cache to match how GPU hardware operates can yield significant speed gains without accuracy loss.
InnerQ compresses the key-value (KV) cache in large language models to speed up text generation without losing accuracy. It uses a grouping strategy that aligns with how GPUs actually compute, reducing memory accesses and enabling faster decoding, up to 22% faster than previous compression methods.
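To make the idea concrete, here is a minimal sketch of group-wise KV-cache quantization, where values are grouped along the inner (head) dimension so each group maps to a contiguous memory stripe a GPU can load in one pass. The group size of 32, the 4-bit target, and the function names are illustrative assumptions, not InnerQ's published configuration:

```python
import numpy as np

def quantize_kv_groups(kv: np.ndarray, group_size: int = 32, bits: int = 4):
    """Per-group symmetric quantization along the inner dimension.

    Hypothetical sketch: grouping along the last axis mirrors the idea of
    aligning compression groups with contiguous GPU memory accesses.
    """
    orig_shape = kv.shape
    groups = kv.reshape(-1, group_size)          # contiguous groups
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for 4-bit
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # guard against all-zero groups
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales, orig_shape

def dequantize_kv_groups(q, scales, orig_shape):
    """Reconstruct an approximate float cache from quantized groups."""
    return (q.astype(np.float32) * scales).reshape(orig_shape)

# Usage: a toy key cache of shape (seq_len, head_dim)
kv = np.random.randn(16, 128).astype(np.float32)
q, s, shp = quantize_kv_groups(kv)
recon = dequantize_kv_groups(q, s, shp)
```

Because each group's scale is stored alongside a stripe of contiguous values, dequantization during decoding touches one scale per memory load instead of scattering accesses across the cache, which is where the decoding speedup comes from.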