Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) is a method for training language models to align with human preferences without requiring a separate reward model. Instead of using reinforcement learning with human feedback (RLHF), DPO directly optimizes the model to prefer chosen responses over rejected ones using a simple classification loss.
DPO vs RLHF
DPO eliminates the need for a separate reward model by directly optimizing the policy to prefer chosen responses. This makes training simpler and computationally cheaper than classical RLHF.
DPO Algorithm Details
The core DPO loss trains the policy to maximize the log-likelihood of preferring chosen responses over rejected ones:
Mathematical Formulation
L_DPO(θ) = -E_{(x, y_chosen, y_rejected) ~ D} [
    log σ(
        β log( π_θ(y_chosen|x) / π_ref(y_chosen|x) )
      - β log( π_θ(y_rejected|x) / π_ref(y_rejected|x) )
    )
]
Where:
- π_θ - Current policy being trained
- π_ref - Reference model (typically the initial model before DPO)
- β - DPO beta parameter (controls preference learning strength)
- D - Dataset of (prompt x, chosen y_chosen, rejected y_rejected)
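To make the formula concrete, here is a minimal PyTorch sketch of the loss. It assumes the per-sequence log-probabilities under the policy and the frozen reference model have already been summed over tokens; the function and tensor names are illustrative, not the actual train_dpo.py code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log π_θ(y_chosen|x), shape [batch]
    policy_rejected_logps: torch.Tensor,  # log π_θ(y_rejected|x), shape [batch]
    ref_chosen_logps: torch.Tensor,       # log π_ref(y_chosen|x), shape [batch]
    ref_rejected_logps: torch.Tensor,     # log π_ref(y_rejected|x), shape [batch]
    beta: float = 0.1,
):
    # Implicit rewards: beta-scaled log-ratios between policy and reference.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)

    # -log sigmoid of the reward margin, averaged over the batch.
    loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    return loss, chosen_reward, rejected_reward
```

The reference log-probabilities come from the frozen initial model and need no gradients, so they can be computed once per batch under torch.no_grad().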
Key Insight
DPO optimizes the same KL-constrained objective as classical RLHF: the reference model penalizes deviation from the initial distribution, which prevents the policy from drifting too far from reasonable behavior.
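Written out, that underlying objective (standard in the RLHF literature; here r is the reward function and β the same coefficient as in the DPO loss above) is:

```latex
\max_{\pi_\theta} \;
  \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot\mid x)}\big[\, r(x, y) \,\big]
  \;-\; \beta \, \mathrm{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(\cdot\mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot\mid x) \,\big]
```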
Running DPO Training
The implementation is in train_dpo.py with a CLI interface. Run from the command line:
python -m bios_cookbook.recipes.preference.train \
    log_path=/tmp/dpo-experiment \
    model_name=ultrasafe/usf-conversation \
    dataset=helpsteer3 \
    renderer_name=ultrasafe \
    learning_rate=1e-5 \
    dpo_beta=0.1

Key Parameters
| Parameter | Description | Example | 
|---|---|---|
| log_path | Directory for results and checkpoints | /tmp/dpo-exp | 
| model_name | Base model for initialization and reference policy | ultrasafe/usf-mini | 
| dataset | Preference dataset name | helpsteer3 | 
| renderer_name | Conversation formatting (see Rendering guide) | ultrasafe | 
| learning_rate | Learning rate for optimization | 1e-5 | 
| dpo_beta | DPO beta parameter (preference strength) | 0.1 | 
Available Preference Datasets
Bios provides several pre-configured preference datasets. These are implemented as DPODatasetBuilder classes:
Helpful-Harmless-Honest
Anthropic's dataset focusing on helpfulness, harmlessness, and honesty
dataset=hhh

HelpSteer3
NVIDIA's HelpSteer3 preference dataset for helpful AI assistants
dataset=helpsteer3

UltraFeedback
UltraFeedback binarized preferences dataset for general alignment
dataset=ultrafeedback

Custom Datasets
You can implement custom dataset builders following the bios_cookbook.preference.preference_datasets interface. Each dataset builder should provide pairwise preferences with prompts and chosen/rejected completions.
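As an illustration only, a custom dataset source might look roughly like the sketch below, yielding prompt/chosen/rejected triples from a JSONL file. The class, method, and field names here are hypothetical; the real builders follow the DPODatasetBuilder interface in bios_cookbook.preference.preference_datasets, which may differ in its exact signatures.

```python
import json
from dataclasses import dataclass
from typing import Iterator

# NOTE: hypothetical sketch. Only the data shape (a prompt plus
# chosen/rejected completions per pair) is taken from this guide.

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

class JsonlPreferenceDataset:
    """Reads pairwise preferences from a JSONL file whose lines contain
    'prompt', 'chosen', and 'rejected' fields."""

    def __init__(self, path: str):
        self.path = path

    def __iter__(self) -> Iterator[PreferencePair]:
        with open(self.path) as f:
            for line in f:
                row = json.loads(line)
                yield PreferencePair(
                    prompt=row["prompt"],
                    chosen=row["chosen"],
                    rejected=row["rejected"],
                )
```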
Training Process and Metrics
During training, you'll see detailed metrics showing DPO progress:
Step 50
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Metric          ┃ Value     ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ accuracy        │  0.568627 │
│ batch_time      │ 27.953704 │
│ chosen_reward   │  0.053621 │
│ dpo_loss        │  0.683825 │
│ learning_rate   │  0.000009 │
│ margin          │  0.002147 │
│ num_pairs       │       255 │
│ num_tokens      │    112638 │
│ progress        │  0.081210 │
│ rejected_reward │  0.032152 │
│ test/nll        │  1.871778 │
└─────────────────┴───────────┘
| Metric | Meaning | 
|---|---|
| dpo_loss | The DPO classification loss (should decrease) | 
| accuracy | Accuracy of implicit reward model on preference dataset | 
| margin | Average difference between chosen and rejected rewards | 
| chosen_reward | Average reward for chosen/preferred responses | 
| rejected_reward | Average reward for rejected/dispreferred responses | 
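These preference metrics follow directly from the implicit rewards defined by the loss; a sketch of how they can be computed from the quantities returned by the dpo_loss example above (the actual train_dpo.py implementation may differ in detail):

```python
import torch

def preference_metrics(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor):
    """chosen_reward / rejected_reward are the beta-scaled log-ratios
    returned by the dpo_loss sketch above, shape [batch]."""
    return {
        # Fraction of pairs where the implicit reward ranks chosen above rejected.
        "accuracy": (chosen_reward > rejected_reward).float().mean().item(),
        # Average gap between chosen and rejected implicit rewards.
        "margin": (chosen_reward - rejected_reward).mean().item(),
        "chosen_reward": chosen_reward.mean().item(),
        "rejected_reward": rejected_reward.mean().item(),
    }
```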
Evaluating DPO Models
After training, evaluate your DPO model on benchmarks to measure preference optimization impact:
MODEL_PATH=bios://YOUR_MODEL_PATH_HERE
python -m bios_cookbook.eval.run_inspect_evals \
    model_path=$MODEL_PATH \
    model_name=ultrasafe/usf-conversation \
    tasks=inspect_evals/ifeval \
    renderer_name=ultrasafe

This evaluates the model on various benchmarks to quantify the impact of preference optimization on instruction-following, helpfulness, and other quality dimensions.
Tips for DPO Training
Beta Parameter
Start with dpo_beta=0.1 and adjust based on your dataset. Higher beta keeps the policy closer to the reference model (smaller, more conservative updates), while lower beta allows larger deviations from the reference.
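If you are unsure which value suits your dataset, a small sweep over the CLI shown earlier is an easy way to compare runs. A sketch (the module path and flags are the ones from this guide; the sweep values and log directories are just examples):

```python
import subprocess

# Launch one training run per beta value using the CLI from this guide;
# each run logs to its own directory so the metrics can be compared.
for beta in (0.05, 0.1, 0.3):
    subprocess.run(
        [
            "python", "-m", "bios_cookbook.recipes.preference.train",
            f"log_path=/tmp/dpo-beta-{beta}",
            "model_name=ultrasafe/usf-conversation",
            "dataset=helpsteer3",
            "renderer_name=ultrasafe",
            "learning_rate=1e-5",
            f"dpo_beta={beta}",
        ],
        check=True,
    )
```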
Learning Rate
Use a lower learning rate than for supervised fine-tuning, typically 1e-5 to 1e-6. DPO is sensitive to the learning rate; too high a value can cause instability.
Base Model Selection
The base model should already be in-distribution with respect to the preference data: either start with a light SFT phase or collect on-policy preferences. A sharp distribution mismatch can produce strange behaviors.
Next Steps
Explore related preference learning techniques: