Training multimodal models on scientific documents requires balancing synthetic data quality against real-world document complexity. This dataset strikes that balance by synthesizing faithful QA pairs and then re-embedding them into full papers.
This paper introduces SciMDR, a dataset of 300K question-answer pairs spanning 20K scientific papers, designed to train AI models to understand complex scientific documents that combine text and images. The dataset is built in two stages: first, focused QA pairs with reasoning chains are generated from individual passages; then, those pairs are embedded back into the full documents so that models must answer them amid realistic document complexity.
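The two-stage structure can be illustrated with a minimal sketch. All names here (`QAPair`, `embed_in_document`, the field layout) are hypothetical illustrations, not the paper's actual pipeline or schema; the point is only to show stage 1 producing a passage-grounded QA pair with a reasoning chain, and stage 2 re-attaching it to the whole paper as context.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    """Stage 1 output: a QA pair grounded in one source passage (hypothetical schema)."""
    question: str
    answer: str
    reasoning_chain: list[str]  # step-by-step rationale produced alongside the answer
    source_passage: str         # the excerpt the pair was generated from

def embed_in_document(pair: QAPair, full_paper: str) -> dict:
    """Stage 2 (sketch): pair the QA item with the *entire* paper as context,
    so the model must locate the relevant passage among realistic complexity."""
    # sanity check: the pair must actually be grounded in this paper
    assert pair.source_passage in full_paper, "passage must occur in the paper"
    return {
        "context": full_paper,  # whole document, not just the source excerpt
        "question": pair.question,
        "answer": pair.answer,
        "rationale": " ".join(pair.reasoning_chain),
    }

# Toy example with dummy text
paper = "Intro ... We report an accuracy of 92% on the held-out set ... Conclusion."
pair = QAPair(
    question="What accuracy is reported?",
    answer="92%",
    reasoning_chain=["The results passage states an accuracy of 92% on the held-out set."],
    source_passage="accuracy of 92%",
)
example = embed_in_document(pair, paper)
```

The design choice the sketch highlights is that grounding (stage 1) and context difficulty (stage 2) are decoupled: answers stay faithful to a specific passage, while the training context is the full document.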