Learning from Preferences

This section focuses on learning from pairwise feedback, where preference data indicates which of two completions is better for a given prompt. This approach is ideal for tasks lacking simple programmatic correctness criteria.

Preference Data Sources

Preferences can be collected from human evaluators comparing model outputs, or generated by a stronger model acting as a judge. Both approaches enable training models to align with subjective quality criteria like helpfulness, clarity, and style.
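
In either case, the data reduces to records that pair a prompt with a preferred and a dispreferred completion. Below is a minimal sketch of one such record; the field names are illustrative, not a fixed Bios schema.

```python
# One pairwise preference record (field names are illustrative).
preference_record = {
    "prompt": "Explain the difference between a list and a tuple in Python.",
    "chosen": "A list is mutable, so you can add or remove elements after "
              "creation; a tuple is immutable and its contents are fixed.",
    "rejected": "They are basically the same thing.",
    "source": "human",  # or "model_judge" when a stronger model ranks the outputs
}
```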

Two Approaches to Preference Learning

When working with pairwise preference data, Bios supports two primary methodologies:

DPO

Direct Preference Optimization directly updates the policy to prefer chosen responses over rejected ones, without needing a separate reward model.

Advantages:

  • Simpler implementation (one-stage)
  • Computationally cheaper
  • No reward model required
  • Direct policy optimization
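
For intuition, here is a minimal sketch of the DPO loss for a batch of preference pairs. It assumes each argument is a tensor of summed token log-probabilities for the chosen or rejected completion under the trained policy and a frozen reference model; the function name and `beta` default are illustrative, not a fixed Bios API.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: how much more the policy prefers each completion
    # than the frozen reference model does, scaled by beta.
    chosen_margin = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic loss pushes the chosen margin above the rejected margin,
    # so the policy learns to prefer chosen over rejected responses.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```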

RLHF

Reinforcement Learning from Human Feedback trains a reward model on preference data, then uses RL to optimize the policy against this reward model.

Advantages:

  • More flexible (two-stage)
  • Reward model reusable
  • Fine-grained control via RL
  • Better for complex objectives
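
As a sketch of the two stages: the reward model is typically trained with a pairwise (Bradley-Terry style) loss on the same preference records, and the RL stage then optimizes the policy against the frozen reward model, usually with a KL penalty toward a reference model. The function and variable names below are illustrative, not a fixed Bios API.

```python
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    # Stage 1 (Bradley-Terry pairwise loss): push the scalar score of the
    # chosen completion above the score of the rejected completion.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

def shaped_reward(rm_score, policy_logp, ref_logp, kl_coef=0.1):
    # Stage 2: the RL reward combines the reward-model score with a KL
    # penalty that keeps the policy close to the reference model.
    return rm_score - kl_coef * (policy_logp - ref_logp)
```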

DPO vs RLHF: Which to Choose?

Aspect             | DPO                                   | RLHF
------------------ | ------------------------------------- | -----------------------------------------
Training Stages    | One-stage (direct)                    | Two-stage (reward model + RL)
Computational Cost | Lower (cheaper)                       | Higher (more expensive)
Complexity         | Simpler                               | More complex
Flexibility        | Limited                               | High (reusable reward model)
Best For           | Quick alignment, resource-constrained | Complex objectives, iterative refinement

Choosing Your Approach

Choose DPO When:

  • You have clear pairwise preferences
  • Your computational budget is limited
  • You value simplicity over flexibility
  • One-time alignment is sufficient
  • You are working with smaller datasets

Choose RLHF When:

  • You need a reusable reward model
  • Your objectives are complex and multi-faceted
  • You plan iterative refinement
  • You need fine-grained control via RL
  • You are working with large-scale data