End-to-end speech-to-entity models substantially outperform cascaded ASR+NER pipelines for Arabic, and multilingual pretraining transfers better than Arabic-specific pretraining for this low-resource task.
This paper introduces CV-18 NER, the first dataset for extracting named entities directly from Arabic speech. The authors annotated the Arabic Common Voice corpus with 21 entity types, then compared end-to-end speech models (Whisper, AraBEST-RQ) against traditional cascaded pipelines that first transcribe the speech and then extract entities from the transcript.
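The difference between the two approaches can be illustrated with a minimal sketch. It is not the paper's implementation: the inline tag format (`<TYPE>text</TYPE>`), the function names, and the callable signatures are all assumptions made for illustration, and the models themselves are passed in as opaque callables.

```python
import re
from typing import Callable, List, Tuple

# (entity_type, surface_text); the type labels used here are illustrative only.
Entity = Tuple[str, str]

# ASSUMPTION: the end-to-end model emits a transcript with inline tags such as
# "<PER>...</PER>". This format is hypothetical, not taken from the paper.
TAG_PATTERN = re.compile(r"<(?P<type>[A-Z_]+)>(?P<text>.*?)</(?P=type)>")


def parse_tagged_transcript(tagged: str) -> List[Entity]:
    """Extract (type, text) pairs from a transcript with inline entity tags."""
    return [
        (match.group("type"), match.group("text").strip())
        for match in TAG_PATTERN.finditer(tagged)
    ]


def cascaded(
    audio: bytes,
    asr: Callable[[bytes], str],
    ner: Callable[[str], List[Entity]],
) -> List[Entity]:
    """Cascaded pipeline: transcribe first, then tag the transcript.

    Any ASR error is passed on to the NER step, which never sees the audio.
    """
    transcript = asr(audio)
    return ner(transcript)


def end_to_end(
    audio: bytes,
    speech_model: Callable[[bytes], str],
) -> List[Entity]:
    """End-to-end: a single speech model emits an entity-tagged transcript."""
    return parse_tagged_transcript(speech_model(audio))
```

In this sketch, `asr`, `ner`, and `speech_model` are placeholders for whatever systems are being compared. The structural point is that the cascaded path has two separately trained stages joined by plain text, while the end-to-end path has one model whose output already carries the entity markup.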