Memory Caching lets RNNs scale their memory capacity with sequence length while staying faster than Transformers.
This paper addresses a major weakness of fast RNN models: their fixed-size memory forces them to forget information as sequences grow. The authors introduce Memory Caching, which lets an RNN save snapshots of its memory state while processing longer sequences. These snapshots let the model retain more information without incurring the full cost of Transformer attention, creating a sweet spot between speed and accuracy.
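The idea can be illustrated with a toy sketch (not the paper's actual architecture): a fixed-size recurrent state is updated per token, snapshots of that state are cached at intervals so total memory grows with sequence length, and a simple attention read over the cache recovers older information. All functions and parameters here (`step`, `run`, `read_cache`, `cache_every`) are illustrative assumptions.

```python
import math

def step(state, x):
    # Decay-and-mix recurrence: a fixed-size state gradually forgets old inputs.
    return [0.9 * s + 0.1 * math.tanh(x + i) for i, s in enumerate(state)]

def run(tokens, d=4, cache_every=3):
    # Process the sequence, periodically snapshotting the memory state.
    state, cache = [0.0] * d, []
    for t, x in enumerate(tokens, start=1):
        state = step(state, x)
        if t % cache_every == 0:
            cache.append(list(state))  # cache grows with sequence length
    return state, cache

def read_cache(query, cache):
    # Softmax dot-product attention over cached snapshots: unlike the
    # fixed-size state alone, this can surface information from far back.
    scores = [sum(q * c for q, c in zip(query, snap)) for snap in cache]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * snap[i] for w, snap in zip(weights, cache)) / z
            for i in range(len(query))]

state, cache = run(list(range(12)))   # 12 tokens -> 4 snapshots
readout = read_cache(state, cache)
```

Attending over a handful of snapshots is much cheaper than full token-level attention, which is the intuition behind the speed/accuracy middle ground described above.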