Reorganizing how you compress the KV cache to match how GPU hardware operates can yield significant speed gains without accuracy loss.
InnerQ compresses the key-value (KV) cache in large language models to speed up text generation without losing accuracy. It uses a grouping strategy that aligns with how GPUs actually compute, reducing memory accesses and enabling faster decoding, up to 22% faster than previous compression methods.
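To make the idea concrete, here is a minimal sketch of group-wise KV-cache quantization, where values are grouped along the inner (head) dimension so each group maps to a contiguous memory stripe a GPU can load in one pass. The group size of 32, the 4-bit target, and the function names are illustrative assumptions, not InnerQ's published configuration:

```python
import numpy as np

def quantize_kv_groups(kv: np.ndarray, group_size: int = 32, bits: int = 4):
    """Per-group symmetric quantization along the inner dimension.

    Hypothetical sketch: grouping along the last axis mirrors the idea of
    aligning compression groups with contiguous GPU memory accesses.
    """
    orig_shape = kv.shape
    groups = kv.reshape(-1, group_size)          # contiguous groups
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for 4-bit
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # guard against all-zero groups
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales, orig_shape

def dequantize_kv_groups(q, scales, orig_shape):
    """Reconstruct an approximate float cache from quantized groups."""
    return (q.astype(np.float32) * scales).reshape(orig_shape)

# Usage: a toy key cache of shape (seq_len, head_dim)
kv = np.random.randn(16, 128).astype(np.float32)
q, s, shp = quantize_kv_groups(kv)
recon = dequantize_kv_groups(q, s, shp)
```

Because each group's scale is stored alongside a stripe of contiguous values, dequantization during decoding touches one scale per memory load instead of scattering accesses across the cache, which is where the decoding speedup comes from.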