RLHF: How Human Preferences Shape AI

Large language models trained with next-token prediction are remarkably capable, but they don't inherently know what humans want. They can generate toxic content, hallucinate confidently, or give unhelpful responses. Reinforcement learning from human feedback (RLHF) bridges this gap by training models on human preferences.

The Three-Stage Pipeline

Stage 1: Supervised Fine-Tuning (SFT) — Fine-tune the base model on high-quality demonstration data. Human annotators write ideal responses, and the model learns to imitate them.

Stage 2: Reward Model Training — Train a model to predict human preferences. Given a prompt and two responses, humans indicate which is better. The reward model learns to assign matching scores.

Stage 3: RL Optimization — Use the reward model as a scoring function and optimize with PPO to maximize predicted reward.
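Stage 1 is ordinary supervised learning: next-token cross-entropy on the demonstration data. A minimal sketch in PyTorch (shapes and the prompt-masking convention are illustrative, not tied to any specific library):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, ignore_index=-100):
    """Next-token cross-entropy: predict token t+1 from positions <= t.

    logits: (batch, seq_len, vocab), labels: (batch, seq_len).
    Prompt tokens can be excluded by setting their label to ignore_index,
    so the model is only trained on the annotator-written response.
    """
    # Shift so that position t predicts the token at position t+1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
    )
```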

Reward Modeling

The reward model is trained on pairwise comparisons using the Bradley-Terry model:

L_reward = -E[log σ(r_θ(x, y_w) - r_θ(x, y_l))]

where y_w is the preferred (winning) response, y_l the rejected one, and σ is the sigmoid.

Implementation:

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.backbone = base_model
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(input_ids, attention_mask=attention_mask)
        # Score the last *non-padded* token; position -1 may be padding
        last_token = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        last_hidden = outputs.last_hidden_state[batch_idx, last_token]
        reward = self.reward_head(last_hidden)
        return reward.squeeze(-1)

def reward_loss(model, chosen_ids, rejected_ids, chosen_mask, rejected_mask):
    r_chosen = model(chosen_ids, chosen_mask)
    r_rejected = model(rejected_ids, rejected_mask)
    # logsigmoid is numerically stable; log(sigmoid(x)) underflows for very negative x
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
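As a quick sanity check on this loss (illustrative numbers, not real model scores): it depends only on the margin r_chosen - r_rejected. At zero margin the loss is log 2 ≈ 0.693, and it falls toward zero as the chosen response's score pulls ahead:

```python
import torch
import torch.nn.functional as F

def bt_loss(margin):
    # Bradley-Terry pairwise loss as a function of the score margin
    return -F.logsigmoid(margin).mean()

print(bt_loss(torch.tensor([0.0])))   # ≈ 0.6931 (log 2): model is indifferent
print(bt_loss(torch.tensor([2.0])))   # ≈ 0.1269: chosen clearly preferred
print(bt_loss(torch.tensor([-2.0])))  # ≈ 2.1269: preference inverted, large penalty
```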

PPO Optimization

The RL stage maximizes expected reward while staying close to the SFT model:

maximize E[r_φ(x, y) - β * KL(π_θ(y|x) || π_SFT(y|x))]

The KL penalty is critical. Without it, the policy quickly learns to exploit quirks in the reward model — generating outputs that score high but are degenerate.
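One common way to apply the penalty (as in InstructGPT-style pipelines) is to fold a per-token KL estimate into the reward signal that PPO sees. A sketch, assuming per-token log-probs of the sampled tokens under the policy and the frozen SFT reference have already been gathered (function and argument names are illustrative):

```python
import torch

def shaped_rewards(reward_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token rewards for PPO: a -beta * KL estimate at every token,
    with the scalar reward model score added at the final token.

    reward_score: (batch,) scores from the reward model.
    policy_logprobs, ref_logprobs: (batch, seq_len) log-probs of sampled tokens.
    """
    # log(pi_theta / pi_sft) at each sampled token is a single-sample KL estimate
    kl = policy_logprobs - ref_logprobs
    rewards = -beta * kl
    rewards[:, -1] += reward_score  # reward model score lands on the last token
    return rewards
```

Penalizing every token (rather than one sequence-level term) gives PPO a denser signal about where the policy drifts from the reference.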

DPO: Direct Preference Optimization

Rafailov et al. (2023) showed you can skip the reward model entirely. DPO derives a classification loss directly on preference pairs:

L_DPO = -E[log σ(β * (log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x)))]

Implementation (here get_log_probs is assumed to return the summed log-probability of the response tokens under the given model):

import torch
import torch.nn.functional as F

def dpo_loss(policy, ref_model, chosen, rejected, beta=0.1):
    pi_chosen = get_log_probs(policy, chosen)
    pi_rejected = get_log_probs(policy, rejected)

    # The reference model is frozen; no gradients flow through it
    with torch.no_grad():
        ref_chosen = get_log_probs(ref_model, chosen)
        ref_rejected = get_log_probs(ref_model, rejected)

    # Implicit rewards: beta-scaled log-ratios against the reference
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

DPO vs. RLHF

  • Simplicity: DPO eliminates the reward model, RL loop, and PPO infrastructure
  • Stability: No reward hacking, no value function estimation
  • Limitation: DPO optimizes only over the fixed preference pairs in its dataset (offline), whereas RLHF samples fresh responses from the current policy during training and can reach behaviors outside that dataset

Open Problems

  • Reward overoptimization: Goodhart's Law — optimizing too aggressively against a proxy
  • Data quality: Human annotators disagree and have biases
  • Multi-objective alignment: Helpfulness, harmlessness, and honesty can conflict
  • Scalable oversight: Humans struggle to judge complex outputs as models improve

Variants like KTO, IPO, and ORPO continue pushing boundaries — each addressing different limitations of the original formulations.