Encoder-only models can be extended to handle long documents through positional embedding adaptation and continued pre-training, offering a parameter-efficient alternative to decoder-only LLMs for document understanding tasks.
This paper introduces Polish language models based on an encoder-only architecture that can process documents up to 8192 tokens, far beyond the 512-token limit of traditional BERT models. The researchers used a two-stage training approach combining positional embedding adaptation with continued pre-training, and also created smaller distilled versions of the models.
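As a rough illustration of the positional-embedding adaptation step, the sketch below enlarges the learned position-embedding table of a Hugging Face BERT-style checkpoint from 512 to 8192 slots by tiling the original embeddings, then updates the model's config and buffers so longer inputs are accepted. The checkpoint name (`allegro/herbert-base-cased`) and the copy-by-tiling strategy are illustrative assumptions, not the paper's exact procedure; continued pre-training on long documents would still be needed to adapt the weights.

```python
import torch
from transformers import AutoModelForMaskedLM

# Load a BERT-style Polish checkpoint (model name is an assumption for illustration).
model = AutoModelForMaskedLM.from_pretrained("allegro/herbert-base-cased")

old_len = model.config.max_position_embeddings  # typically 512
new_len = 8192

# Learned position-embedding matrix of shape (old_len, hidden_size).
old_emb = model.bert.embeddings.position_embeddings.weight.data

# One simple adaptation baseline: tile the original 512 embeddings so every
# new position reuses a pattern the model has already seen during pre-training.
# (The paper's exact scheme may differ; interpolation is another option.)
new_emb = old_emb.new_empty((new_len, old_emb.size(1)))
for start in range(0, new_len, old_len):
    end = min(start + old_len, new_len)
    new_emb[start:end] = old_emb[: end - start]

# Swap in the enlarged embedding table.
model.bert.embeddings.position_embeddings = torch.nn.Embedding.from_pretrained(
    new_emb, freeze=False
)

# Resize the cached index buffers so forward passes accept 8192-token inputs.
model.bert.embeddings.register_buffer(
    "position_ids", torch.arange(new_len).unsqueeze(0), persistent=False
)
model.bert.embeddings.register_buffer(
    "token_type_ids", torch.zeros((1, new_len), dtype=torch.long), persistent=False
)
model.config.max_position_embeddings = new_len

# The model now ingests 8192-token sequences; continued pre-training
# (e.g., masked language modeling on long documents) adapts the weights.
```

Note that full self-attention still scales quadratically with sequence length, so extended-context training like this is often paired with an efficient attention variant; the snippet only covers the embedding-table adaptation.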