CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech

Youssef Saidi, Haroun Elleuch, Fethi Bougares | April 2, 2026 | arXiv

Key Takeaway

End-to-end speech-to-entity models substantially outperform cascaded ASR+NER pipelines for Arabic, and multilingual pretraining transfers better than Arabic-specific pretraining for this low-resource task.

Summary

This paper introduces CV-18 NER, the first dataset for extracting named entities directly from Arabic speech. The researchers annotated the Arabic Common Voice corpus with 21 entity types, then compared end-to-end speech models (Whisper, AraBEST-RQ) against cascaded pipelines that first transcribe the speech and then extract entities from the transcript.
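The difference between the two architectures can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy stubs, the `PER`/`LOC` labels, and the inline `<TYPE>...</TYPE>` markup are hypothetical and are not taken from the paper, which may represent entities differently. The stubs stand in for real models and return fixed English strings so the example runs without any audio or model weights.

```python
import re
from typing import Callable, List, Tuple

Entity = Tuple[str, str]  # (entity text, entity type)


def cascaded(audio: bytes,
             asr: Callable[[bytes], str],
             ner: Callable[[str], List[Entity]]) -> List[Entity]:
    """Cascaded pipeline: transcribe first, then tag the transcript.

    Any ASR error is passed on to the text NER step unchanged.
    """
    transcript = asr(audio)
    return ner(transcript)


# Hypothetical inline markup: <TYPE>entity text</TYPE>
_TAG = re.compile(r"<(?P<type>[A-Z]+)>(?P<text>.*?)</(?P=type)>")


def parse_tagged(tagged: str) -> Tuple[str, List[Entity]]:
    """Split an entity-tagged transcript into plain text and entities."""
    entities = [(m.group("text"), m.group("type")) for m in _TAG.finditer(tagged)]
    plain = _TAG.sub(lambda m: m.group("text"), tagged)
    return plain, entities


def end_to_end(audio: bytes,
               speech_to_tagged: Callable[[bytes], str]) -> List[Entity]:
    """End-to-end: a single model maps audio to a tagged transcript."""
    _, entities = parse_tagged(speech_to_tagged(audio))
    return entities


# Toy stand-ins for real models (assumptions, not real APIs).
def toy_asr(audio: bytes) -> str:
    return "Fatima lives in Tunis"


def toy_ner(text: str) -> List[Entity]:
    gazetteer = {"Fatima": "PER", "Tunis": "LOC"}
    return [(tok, gazetteer[tok]) for tok in text.split() if tok in gazetteer]


def toy_e2e(audio: bytes) -> str:
    return "<PER>Fatima</PER> lives in <LOC>Tunis</LOC>"


if __name__ == "__main__":
    print(cascaded(b"", toy_asr, toy_ner))
    print(end_to_end(b"", toy_e2e))
```

In the cascaded version the NER step only ever sees the transcript, so a misrecognized name cannot be recovered downstream. The end-to-end version produces entities and text jointly from the audio.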

Key Terms

named-entity-recognition, end-to-end-learning, cascaded-pipeline, self-supervised-pretraining