Learning from Preferences

This section focuses on learning from pairwise feedback, where preference data indicates which of two completions is better for a given prompt. This approach is ideal for tasks lacking simple programmatic correctness criteria.

Preference Data Sources

Preferences can be collected from human evaluators comparing model outputs, or generated by a stronger model acting as a judge. Both approaches enable training models to align with subjective quality criteria like helpfulness, clarity, and style.
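
In either case, the data reduces to records that pair a prompt with a preferred and a dispreferred completion. Below is a minimal sketch of one such record; the field names are illustrative, not a fixed Bios schema.

```python
# One pairwise preference record (field names are illustrative).
preference_record = {
    "prompt": "Explain the difference between a list and a tuple in Python.",
    "chosen": "A list is mutable, so you can add or remove elements after "
              "creation; a tuple is immutable and its contents are fixed.",
    "rejected": "They are basically the same thing.",
    "source": "human",  # or "model_judge" when a stronger model ranks the outputs
}
```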

Two Approaches to Preference Learning

When working with pairwise preference data, Bios supports two primary methodologies:

DPO

Direct Preference Optimization directly updates the policy to prefer chosen responses over rejected ones, without needing a separate reward model.

Advantages:

  • Simpler implementation (one-stage)
  • Computationally cheaper
  • No reward model required
  • Direct policy optimization
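
For intuition, here is a minimal sketch of the DPO loss for a batch of preference pairs. It assumes each argument is a tensor of summed token log-probabilities for the chosen or rejected completion under the trained policy and a frozen reference model; the function name and `beta` default are illustrative, not a fixed Bios API.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: how much more the policy prefers each completion
    # than the frozen reference model does, scaled by beta.
    chosen_margin = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic loss pushes the chosen margin above the rejected margin,
    # so the policy learns to prefer chosen over rejected responses.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```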

RLHF

Reinforcement Learning from Human Feedback trains a reward model on preference data, then uses RL to optimize the policy against this reward model.

Advantages:

  • More flexible (two-stage)
  • Reward model reusable
  • Fine-grained control via RL
  • Better for complex objectives
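
As a sketch of the two stages: the reward model is typically trained with a pairwise (Bradley-Terry style) loss on the same preference records, and the RL stage then optimizes the policy against the frozen reward model, usually with a KL penalty toward a reference model. The function and variable names below are illustrative, not a fixed Bios API.

```python
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    # Stage 1 (Bradley-Terry pairwise loss): push the scalar score of the
    # chosen completion above the score of the rejected completion.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

def shaped_reward(rm_score, policy_logp, ref_logp, kl_coef=0.1):
    # Stage 2: the RL reward combines the reward-model score with a KL
    # penalty that keeps the policy close to the reference model.
    return rm_score - kl_coef * (policy_logp - ref_logp)
```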

DPO vs RLHF: Which to Choose?

Aspect             | DPO                                   | RLHF
------------------ | ------------------------------------- | -----------------------------------------
Training Stages    | One-stage (direct)                    | Two-stage (reward model + RL)
Computational Cost | Lower (cheaper)                       | Higher (more expensive)
Complexity         | Simpler                               | More complex
Flexibility        | Limited                               | High (reusable reward model)
Best For           | Quick alignment, resource-constrained | Complex objectives, iterative refinement

Choosing Your Approach

Choose DPO When:

  • You have clear pairwise preferences
  • Your computational budget is limited
  • You value simplicity over flexibility
  • One-time alignment is sufficient
  • You are working with smaller datasets

Choose RLHF When:

  • You need a reusable reward model
  • Your objectives are complex and multi-faceted
  • You plan iterative refinement
  • You need fine-grained control via RL
  • You are working with large-scale data