DPO

Direct preference optimization (DPO), the Bradley-Terry preference model it builds on, and why it can replace PPO-based RLHF.
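As a rough illustration of how the pieces fit together, here is a minimal sketch of the Bradley-Terry preference probability and the per-example DPO loss, using only the Python standard library. The function names, the scalar inputs (summed sequence log-probabilities for one preference pair), and the default `beta=0.1` are assumptions for illustration, not taken from any particular library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry: P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)."""
    return math.exp(log_sigmoid(reward_chosen - reward_rejected))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift from the reference
) -> float:
    """Per-example DPO loss for one (chosen, rejected) pair.

    The implicit reward of a response is beta * (log pi(y|x) - log pi_ref(y|x)).
    Plugging those rewards into the Bradley-Terry model and taking the negative
    log-likelihood of the observed preference gives the loss below.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, so the loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy shifted toward the chosen response: the loss drops below log(2).
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The point of the sketch is that the loss depends only on log-probabilities from the policy and a frozen reference model, so no separate reward model or on-policy sampling loop is needed, which is the usual argument for using DPO in place of PPO.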