A training method that learns from pairwise comparisons between solutions rather than explicit reward signals.
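As a rough illustration of learning from pairwise comparisons, the sketch below fits one scalar score per solution so that the preferred solution in each pair ends up scoring higher. It assumes a Bradley-Terry style logistic loss, which the definition above does not specify. The names `pairwise_preference_loss` and `fit_scores` and the toy comparison data are hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred solution wins.

    Assumes a Bradley-Terry model: P(preferred wins) = sigmoid(margin).
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def fit_scores(comparisons, num_items, lr=0.1, epochs=200):
    """Learn one score per item from (winner, loser) index pairs.

    No explicit reward is given; the only training signal is which item
    of each pair was preferred.
    """
    scores = [0.0] * num_items
    for _ in range(epochs):
        for winner, loser in comparisons:
            margin = scores[winner] - scores[loser]
            # Derivative of -log(sigmoid(margin)) with respect to margin.
            grad = -1.0 / (1.0 + math.exp(margin))
            scores[winner] -= lr * grad  # pushes the winner's score up
            scores[loser] += lr * grad   # pushes the loser's score down
    return scores


if __name__ == "__main__":
    # Hypothetical data: solution 0 beats 1, 1 beats 2, and 0 beats 2.
    toy_comparisons = [(0, 1), (1, 2), (0, 2)]
    print(fit_scores(toy_comparisons, num_items=3))
```

In practice the per-item scores would usually be replaced by the output of a model evaluated on each solution, but the loss on each pair would keep the same shape.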
Adhering to complex, structured, or constrained instructions.