RLHF (Reinforcement Learning from Human Feedback)

Topics: reward modeling, PPO, the KL penalty, and collecting human preference data at scale.
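
As a rough illustration of two of these pieces, the sketch below shows a pairwise preference loss of the kind often used to train a reward model (a Bradley-Terry style negative log-likelihood) and a KL-penalized reward of the kind often fed to PPO. This is a minimal sketch under assumptions, not a description of any particular system: the function names, the scalar inputs, and the default beta = 0.1 are all hypothetical and chosen only for illustration.

```python
import math


def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    r_chosen and r_rejected are scalar reward-model scores for the response
    a human preferred and the one they rejected (hypothetical inputs).
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_penalized_reward(
    reward: float,
    logp_policy: float,
    logp_ref: float,
    beta: float = 0.1,  # assumed penalty coefficient, not a recommended value
) -> float:
    """Reward-model score minus beta times the log-probability ratio.

    logp_policy - logp_ref is a single-sample estimate of the KL divergence
    between the current policy and the frozen reference model.
    """
    return reward - beta * (logp_policy - logp_ref)


if __name__ == "__main__":
    # The loss shrinks as the chosen response outscores the rejected one.
    print(pairwise_preference_loss(2.0, 0.0))
    print(pairwise_preference_loss(0.0, 2.0))
    # The penalty lowers the reward when the policy drifts from the reference.
    print(kl_penalized_reward(1.0, logp_policy=-1.0, logp_ref=-2.0))
```

In this formulation the reward model is trained on preference pairs, and the KL term discourages the policy from moving far from the reference model while it optimizes that learned reward.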