A transformer-based model architecture designed to handle very long text sequences efficiently by using sparse attention patterns instead of computing attention over every pair of tokens, avoiding the quadratic cost of full attention.
Key benefit: it retains performance over long documents and extended conversations.
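The idea of restricting which token pairs are scored can be sketched with a sliding-window mask, one common sparse attention pattern. This is a minimal illustration, not any specific model's implementation; the function name and window size are made up for the example.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: True where token i is allowed to attend to token j.

    Each token attends only to neighbors within `window` positions,
    so the number of scored pairs grows linearly with sequence length
    rather than quadratically as in full attention.
    """
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(8, 2)
# Full attention would score 8 * 8 = 64 pairs; the sparse mask keeps only 34.
print(mask.sum(), "of", mask.size, "pairs computed")
```

In practice the mask is applied before the softmax (disallowed pairs are set to negative infinity), and sparse models typically mix local windows with a few global tokens so distant information can still flow.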