MoDA lets deep language models selectively attend to earlier layers, preventing information loss as models get deeper while adding only 3.7% computational overhead.
This paper introduces Mixture-of-Depths Attention (MoDA), a mechanism that lets attention heads skip layers by attending over key-value pairs from both the current layer and earlier layers. This mitigates a known failure mode of very deep language models: useful information from early layers gets diluted as representations pass through many subsequent layers.
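The core idea can be sketched as attention over a concatenation of the current layer's key-value pairs with cached pairs from an earlier layer. The sketch below is a toy, single-head illustration in NumPy; the function name, shapes, and the choice of which earlier layer to cache are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_layer_attention(q, kv_current, kv_earlier):
    """Toy sketch: one head attends jointly over the current layer's
    KV pairs and cached KV pairs from an earlier layer.

    q          : (seq, d) queries at the current layer
    kv_current : (K, V) tuple, each (seq, d), from the current layer
    kv_earlier : (K, V) tuple, each (seq, d), cached from an earlier layer
    """
    # Concatenate along the sequence axis so each query can attend
    # to positions from either layer.
    k = np.concatenate([kv_current[0], kv_earlier[0]], axis=0)
    v = np.concatenate([kv_current[1], kv_earlier[1]], axis=0)
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)      # (seq, 2*seq)
    weights = softmax(scores, axis=-1)  # rows sum to 1 over both layers
    return weights @ v                  # (seq, d)

rng = np.random.default_rng(0)
seq, d = 4, 8
q = rng.standard_normal((seq, d))
kv_cur = (rng.standard_normal((seq, d)), rng.standard_normal((seq, d)))
kv_early = (rng.standard_normal((seq, d)), rng.standard_normal((seq, d)))
out = cross_layer_attention(q, kv_cur, kv_early)
print(out.shape)
```

Because the earlier layer's K/V tensors are already computed during the forward pass, reusing them adds only the cost of the wider attention matrix, which is consistent with the small overhead the paper reports.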