You can make sparse attention 1.8× faster during prefill by reusing token-selection indices across layers: most layers don't need their own indexer, since they select roughly the same tokens as nearby layers.
IndexCache speeds up sparse attention in large language models by reusing token-selection indices across layers instead of recomputing them at every layer. Because consecutive layers tend to select similar tokens, the method caches the selections made at a few 'Full' layers and reuses them in the surrounding 'Shared' layers, cutting indexer computation by 75% with minimal quality loss.
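The caching scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the per-layer indexer scores, and the assumption of one 'Full' layer in every four (which yields the 75% reduction in indexer calls) are all illustrative.

```python
import numpy as np

def topk_indices(scores, k):
    # Indices of the k highest-scoring tokens, sorted for stable gathering.
    return np.sort(np.argpartition(scores, -k)[-k:])

def select_with_index_cache(scores_per_layer, k, full_every=4):
    """Pick sparse-attention token indices for each layer, running the
    indexer only at 'Full' layers (every `full_every`-th layer) and reusing
    the cached indices at the intervening 'Shared' layers.

    scores_per_layer: array of shape (num_layers, num_tokens) of indexer
    scores (hypothetical stand-in for whatever relevance signal the real
    indexer computes). Returns the per-layer index sets and the number of
    indexer invocations actually performed.
    """
    cached = None
    selections = []
    indexer_calls = 0
    for layer, scores in enumerate(scores_per_layer):
        if layer % full_every == 0:
            # Full layer: run the indexer and refresh the cache.
            cached = topk_indices(scores, k)
            indexer_calls += 1
        # Shared layers fall through and reuse `cached` unchanged.
        selections.append(cached)
    return selections, indexer_calls
```

With 32 layers and `full_every=4`, the indexer runs 8 times instead of 32, matching the stated 75% cut in indexer computation; the Shared layers pay only the cost of gathering the cached indices.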