When training on synthetic code data, filtering by reverse semantic coherence (can the answer predict the question?) removes noise more effectively than forward-only metrics, letting you train on 75% less data without losing model quality.
This paper introduces QAQ, a method for filtering noisy synthetic code training data by measuring bidirectional semantic coherence: not just whether a model can generate the answer from the question, but also whether the question can be recovered from the answer. Selecting only the top-scoring 25% of the data matches full-dataset performance while cutting computational cost.
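The selection step can be sketched as follows. This is a hedged illustration, not the paper's implementation: the paper scores pairs with a language model (how well the answer predicts the question back), whereas `score_reverse_coherence` here is a stand-in token-overlap proxy so the example stays self-contained; the function names and the sample pairs are all hypothetical.

```python
def score_reverse_coherence(question: str, answer: str) -> float:
    """Proxy score: fraction of question tokens recoverable from the answer.

    The real method would use a language model's likelihood of the
    question given the answer; token overlap is only a runnable stand-in.
    """
    q_tokens = set(question.lower().split())
    a_tokens = set(answer.lower().split())
    return len(q_tokens & a_tokens) / max(len(q_tokens), 1)


def filter_top_fraction(pairs, keep_fraction=0.25):
    """Keep the top `keep_fraction` of (question, answer) pairs by score."""
    scored = sorted(pairs, key=lambda p: score_reverse_coherence(*p), reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]


# Hypothetical synthetic QA pairs: two coherent, two noisy.
pairs = [
    ("how do I reverse a list in python",
     "use list.reverse() or slicing to reverse a python list"),
    ("what is two plus two",
     "the capital of France is Paris"),          # noisy: answer unrelated
    ("how to sort a dict by value",
     "sort a dict by value with sorted() and a key function"),
    ("why is the sky blue",
     "banana"),                                  # noisy: answer unrelated
]
kept = filter_top_fraction(pairs, keep_fraction=0.5)
```

With a 50% keep fraction on this toy set, the two noisy pairs score lowest on reverse coherence and are dropped, which is the intuition behind the paper's 25% selection on real data.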