You can make streaming speech-to-text models faster and more accurate by processing audio in fixed chunks instead of one token at a time.
This paper introduces CHAT, an improved RNN-T model for converting speech to text in real time. By processing audio in small chunks and using a smarter attention mechanism, CHAT runs 1.7x faster at inference, uses 46% less memory during training, and produces more accurate transcriptions, especially for translating speech between languages.
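The chunk-wise idea can be illustrated with a block-causal attention mask, where each audio frame attends to every frame in its own chunk and all earlier chunks, so attention is computed once per chunk rather than once per frame. This is a generic sketch of chunked streaming attention, not the paper's exact mechanism; all function and parameter names here are illustrative assumptions:

```python
import numpy as np

def chunked_causal_mask(num_frames: int, chunk_size: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if frame i may attend to frame j.

    Frames see their whole chunk plus all preceding chunks (block-causal),
    instead of a strict frame-by-frame causal mask. Illustrative sketch,
    not taken from the CHAT paper.
    """
    chunk_idx = np.arange(num_frames) // chunk_size
    # Frame j is visible to frame i when j's chunk is not later than i's chunk.
    return chunk_idx[None, :] <= chunk_idx[:, None]

mask = chunked_causal_mask(num_frames=8, chunk_size=4)
print(mask[0, 3])  # True: frame 3 is in frame 0's own chunk
print(mask[3, 4])  # False: frame 4 belongs to a future chunk
```

Compared with a strict causal mask, this lets the encoder batch attention over `chunk_size` frames at once, which is one common way chunked streaming models trade a small, fixed latency for throughput.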