When training on synthetic code data, filtering by reverse semantic coherence (can the answer predict the question?) removes noise more effectively than forward-only metrics, letting you train on 75% less data without losing model quality.
This paper introduces QAQ, a method for filtering noisy synthetic code training data by measuring bidirectional semantic coherence: not just whether a model can generate the answer from the question, but also whether the question can be recovered from the answer. Selecting only the top-scoring 25% of the data matches full-dataset performance while cutting computational cost.
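The selection step can be sketched as follows. This is a hedged illustration, not the paper's implementation: the paper scores pairs with a language model (how well the answer predicts the question back), whereas `score_reverse_coherence` here is a stand-in token-overlap proxy so the example stays self-contained; the function names and the sample pairs are all hypothetical.

```python
def score_reverse_coherence(question: str, answer: str) -> float:
    """Proxy score: fraction of question tokens recoverable from the answer.

    The real method would use a language model's likelihood of the
    question given the answer; token overlap is only a runnable stand-in.
    """
    q_tokens = set(question.lower().split())
    a_tokens = set(answer.lower().split())
    return len(q_tokens & a_tokens) / max(len(q_tokens), 1)


def filter_top_fraction(pairs, keep_fraction=0.25):
    """Keep the top `keep_fraction` of (question, answer) pairs by score."""
    scored = sorted(pairs, key=lambda p: score_reverse_coherence(*p), reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]


# Hypothetical synthetic QA pairs: two coherent, two noisy.
pairs = [
    ("how do I reverse a list in python",
     "use list.reverse() or slicing to reverse a python list"),
    ("what is two plus two",
     "the capital of France is Paris"),          # noisy: answer unrelated
    ("how to sort a dict by value",
     "sort a dict by value with sorted() and a key function"),
    ("why is the sky blue",
     "banana"),                                  # noisy: answer unrelated
]
kept = filter_top_fraction(pairs, keep_fraction=0.5)
```

With a 50% keep fraction on this toy set, the two noisy pairs score lowest on reverse coherence and are dropped, which is the intuition behind the paper's 25% selection on real data.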