You can make sparse attention 1.8× faster during prefill by reusing token-selection indices across layers: most layers don't need their own indexer, since they select roughly the same tokens as nearby layers.
IndexCache speeds up sparse attention in large language models by reusing token-selection indices across layers instead of recomputing them at every layer. Because consecutive layers tend to select similar tokens, the method caches the selections made at a few 'Full' layers and reuses them in the surrounding 'Shared' layers, cutting indexer computation by 75% with minimal quality loss.
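The caching scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the per-layer indexer scores, and the assumption of one 'Full' layer in every four (which yields the 75% reduction in indexer calls) are all illustrative.

```python
import numpy as np

def topk_indices(scores, k):
    # Indices of the k highest-scoring tokens, sorted for stable gathering.
    return np.sort(np.argpartition(scores, -k)[-k:])

def select_with_index_cache(scores_per_layer, k, full_every=4):
    """Pick sparse-attention token indices for each layer, running the
    indexer only at 'Full' layers (every `full_every`-th layer) and reusing
    the cached indices at the intervening 'Shared' layers.

    scores_per_layer: array of shape (num_layers, num_tokens) of indexer
    scores (hypothetical stand-in for whatever relevance signal the real
    indexer computes). Returns the per-layer index sets and the number of
    indexer invocations actually performed.
    """
    cached = None
    selections = []
    indexer_calls = 0
    for layer, scores in enumerate(scores_per_layer):
        if layer % full_every == 0:
            # Full layer: run the indexer and refresh the cache.
            cached = topk_indices(scores, k)
            indexer_calls += 1
        # Shared layers fall through and reuse `cached` unchanged.
        selections.append(cached)
    return selections, indexer_calls
```

With 32 layers and `full_every=4`, the indexer runs 8 times instead of 32, matching the stated 75% cut in indexer computation; the Shared layers pay only the cost of gathering the cached indices.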