Training multimodal models on scientific documents requires balancing synthetic data quality against real-world document complexity. This dataset strikes that balance by synthesizing faithful QA pairs and then re-embedding them into full papers.
This paper introduces SciMDR, a dataset of 300K question-answer pairs spanning 20K scientific papers, designed to train AI models to understand complex scientific documents that combine text and images. The dataset is built in two stages: first, focused QA pairs with reasoning chains are generated from individual passages; then, those pairs are embedded back into the full documents so that models must answer them amid realistic document complexity.
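The two-stage structure can be illustrated with a minimal sketch. All names here (`QAPair`, `embed_in_document`, the field layout) are hypothetical illustrations, not the paper's actual pipeline or schema; the point is only to show stage 1 producing a passage-grounded QA pair with a reasoning chain, and stage 2 re-attaching it to the whole paper as context.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    """Stage 1 output: a QA pair grounded in one source passage (hypothetical schema)."""
    question: str
    answer: str
    reasoning_chain: list[str]  # step-by-step rationale produced alongside the answer
    source_passage: str         # the excerpt the pair was generated from

def embed_in_document(pair: QAPair, full_paper: str) -> dict:
    """Stage 2 (sketch): pair the QA item with the *entire* paper as context,
    so the model must locate the relevant passage among realistic complexity."""
    # sanity check: the pair must actually be grounded in this paper
    assert pair.source_passage in full_paper, "passage must occur in the paper"
    return {
        "context": full_paper,  # whole document, not just the source excerpt
        "question": pair.question,
        "answer": pair.answer,
        "rationale": " ".join(pair.reasoning_chain),
    }

# Toy example with dummy text
paper = "Intro ... We report an accuracy of 92% on the held-out set ... Conclusion."
pair = QAPair(
    question="What accuracy is reported?",
    answer="92%",
    reasoning_chain=["The results passage states an accuracy of 92% on the held-out set."],
    source_passage="accuracy of 92%",
)
example = embed_in_document(pair, paper)
```

The design choice the sketch highlights is that grounding (stage 1) and context difficulty (stage 2) are decoupled: answers stay faithful to a specific passage, while the training context is the full document.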