A training method that improves model reasoning by comparing outputs and rewarding better explanations.
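The description does not name a specific loss. One common way to turn "this explanation is better than that one" into a training signal is a Bradley-Terry style pairwise loss on scalar scores. The sketch below assumes each candidate explanation has already been given a score by some model; that scoring step and the function names `pairwise_preference_loss` and `mean_pairwise_loss` are hypothetical illustrations, not the method the definition refers to.

```python
import math
from typing import Sequence


def pairwise_preference_loss(score_better: float, score_worse: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_better - score_worse)).

    Hypothetical helper (assumption: explanations are already scored).
    The loss is small when the better explanation scores well above the
    worse one, and large when the ordering is reversed.
    """
    margin = score_better - score_worse
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign of m so that
    # exp() never receives a large positive argument and overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(pairs: Sequence[tuple[float, float]]) -> float:
    """Average the loss over (score_better, score_worse) pairs."""
    return sum(pairwise_preference_loss(b, w) for b, w in pairs) / len(pairs)


if __name__ == "__main__":
    # Equal scores: the comparison carries no preference yet (loss = log 2).
    print(pairwise_preference_loss(0.0, 0.0))
    # Correct ordering gives a small loss; reversed ordering a large one.
    print(pairwise_preference_loss(2.0, 0.0))
    print(pairwise_preference_loss(0.0, 2.0))
```

Minimizing this loss pushes the score of the preferred explanation above the score of the rejected one. That score can then serve as the reward when training the model that generates explanations.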