Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) is a method for training language models to align with human preferences without requiring a separate reward model. Instead of using reinforcement learning with human feedback (RLHF), DPO directly optimizes the model to prefer chosen responses over rejected ones using a simple classification loss.
DPO vs RLHF
DPO eliminates the need for a separate reward model by directly optimizing the policy to prefer chosen responses. This makes training simpler and computationally cheaper than classical RLHF.
DPO Algorithm Details
The core DPO loss trains the policy to maximize the log-likelihood of preferring chosen responses over rejected ones:
Mathematical Formulation
L_DPO(θ) = -E_{(x, y_chosen, y_rejected) ~ D} [
    log σ(
        β log( π_θ(y_chosen|x) / π_ref(y_chosen|x) )
      - β log( π_θ(y_rejected|x) / π_ref(y_rejected|x) )
    )
]
Where:
- π_θ - Current policy being trained
- π_ref - Reference model (typically the initial model before DPO)
- β - DPO beta parameter (controls preference learning strength)
- D - Dataset of (prompt x, chosen y_chosen, rejected y_rejected)
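To make the formula concrete, here is a minimal PyTorch sketch of the loss. It assumes the per-sequence log-probabilities under the policy and the frozen reference model have already been summed over tokens; the function and tensor names are illustrative, not the actual train_dpo.py code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log π_θ(y_chosen|x), shape [batch]
    policy_rejected_logps: torch.Tensor,  # log π_θ(y_rejected|x), shape [batch]
    ref_chosen_logps: torch.Tensor,       # log π_ref(y_chosen|x), shape [batch]
    ref_rejected_logps: torch.Tensor,     # log π_ref(y_rejected|x), shape [batch]
    beta: float = 0.1,
):
    # Implicit rewards: beta-scaled log-ratios between policy and reference.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)

    # -log sigmoid of the reward margin, averaged over the batch.
    loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    return loss, chosen_reward, rejected_reward
```

The reference log-probabilities come from the frozen initial model and need no gradients, so they can be computed once per batch under torch.no_grad().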
Key Insight
DPO optimizes the same KL-constrained objective as classical RLHF: the reference model penalizes deviation from the initial distribution, which prevents the policy from drifting too far from reasonable behavior.
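Written out, that underlying objective (standard in the RLHF literature; here r is the reward function and β the same coefficient as in the DPO loss above) is:

```latex
\max_{\pi_\theta} \;
  \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot\mid x)}\big[\, r(x, y) \,\big]
  \;-\; \beta \, \mathrm{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(\cdot\mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot\mid x) \,\big]
```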
Running DPO Training
The implementation is in train_dpo.py with a CLI interface. Run from the command line:
python -m bios_cookbook.recipes.preference.train \
    log_path=/tmp/dpo-experiment \
    model_name=ultrasafe/usf-conversation \
    dataset=helpsteer3 \
    renderer_name=ultrasafe \
    learning_rate=1e-5 \
    dpo_beta=0.1

Key Parameters
| Parameter | Description | Example | 
|---|---|---|
| log_path | Directory for results and checkpoints | /tmp/dpo-exp | 
| model_name | Base model for initialization and reference policy | ultrasafe/usf-mini | 
| dataset | Preference dataset name | helpsteer3 | 
| renderer_name | Conversation formatting (see Rendering guide) | ultrasafe | 
| learning_rate | Learning rate for optimization | 1e-5 | 
| dpo_beta | DPO beta parameter (preference strength) | 0.1 | 
Available Preference Datasets
Bios provides several pre-configured preference datasets. These are implemented as DPODatasetBuilder classes:
Helpful-Harmless-Honest
Anthropic's dataset focusing on helpfulness, harmlessness, and honesty
dataset=hhh

HelpSteer3
NVIDIA's HelpSteer3 preference dataset for helpful AI assistants
dataset=helpsteer3

UltraFeedback
UltraFeedback binarized preferences dataset for general alignment
dataset=ultrafeedback

Custom Datasets
You can implement custom dataset builders following the bios_cookbook.preference.preference_datasets interface. Each dataset builder should provide pairwise preferences with prompts and chosen/rejected completions.
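As an illustration only, a custom dataset source might look roughly like the sketch below, yielding prompt/chosen/rejected triples from a JSONL file. The class, method, and field names here are hypothetical; the real builders follow the DPODatasetBuilder interface in bios_cookbook.preference.preference_datasets, which may differ in its exact signatures.

```python
import json
from dataclasses import dataclass
from typing import Iterator

# NOTE: hypothetical sketch. Only the data shape (a prompt plus
# chosen/rejected completions per pair) is taken from this guide.

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

class JsonlPreferenceDataset:
    """Reads pairwise preferences from a JSONL file whose lines contain
    'prompt', 'chosen', and 'rejected' fields."""

    def __init__(self, path: str):
        self.path = path

    def __iter__(self) -> Iterator[PreferencePair]:
        with open(self.path) as f:
            for line in f:
                row = json.loads(line)
                yield PreferencePair(
                    prompt=row["prompt"],
                    chosen=row["chosen"],
                    rejected=row["rejected"],
                )
```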
Training Process and Metrics
During training, you'll see detailed metrics showing DPO progress:
Step 50
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Metric          ┃ Value     ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ accuracy        │  0.568627 │
│ batch_time      │ 27.953704 │
│ chosen_reward   │  0.053621 │
│ dpo_loss        │  0.683825 │
│ learning_rate   │  0.000009 │
│ margin          │  0.002147 │
│ num_pairs       │       255 │
│ num_tokens      │    112638 │
│ progress        │  0.081210 │
│ rejected_reward │  0.032152 │
│ test/nll        │  1.871778 │
└─────────────────┴───────────┘
| Metric | Meaning | 
|---|---|
| dpo_loss | The DPO classification loss (should decrease) | 
| accuracy | Accuracy of implicit reward model on preference dataset | 
| margin | Average difference between chosen and rejected rewards | 
| chosen_reward | Average reward for chosen/preferred responses | 
| rejected_reward | Average reward for rejected/dispreferred responses | 
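These preference metrics follow directly from the implicit rewards defined by the loss; a sketch of how they can be computed from the quantities returned by the dpo_loss example above (the actual train_dpo.py implementation may differ in detail):

```python
import torch

def preference_metrics(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor):
    """chosen_reward / rejected_reward are the beta-scaled log-ratios
    returned by the dpo_loss sketch above, shape [batch]."""
    return {
        # Fraction of pairs where the implicit reward ranks chosen above rejected.
        "accuracy": (chosen_reward > rejected_reward).float().mean().item(),
        # Average gap between chosen and rejected implicit rewards.
        "margin": (chosen_reward - rejected_reward).mean().item(),
        "chosen_reward": chosen_reward.mean().item(),
        "rejected_reward": rejected_reward.mean().item(),
    }
```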
Evaluating DPO Models
After training, evaluate your DPO model on benchmarks to measure preference optimization impact:
MODEL_PATH=bios://YOUR_MODEL_PATH_HERE
python -m bios_cookbook.eval.run_inspect_evals \
    model_path=$MODEL_PATH \
    model_name=ultrasafe/usf-conversation \
    tasks=inspect_evals/ifeval \
    renderer_name=ultrasafe

This evaluates the model on various benchmarks to quantify the impact of preference optimization on instruction-following, helpfulness, and other quality dimensions.
Tips for DPO Training
Beta Parameter
Start with dpo_beta=0.1 and adjust based on your dataset. Higher beta keeps the policy closer to the reference model (smaller, more conservative updates), while lower beta allows larger deviations from the reference.
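If you are unsure which value suits your dataset, a small sweep over the CLI shown earlier is an easy way to compare runs. A sketch (the module path and flags are the ones from this guide; the sweep values and log directories are just examples):

```python
import subprocess

# Launch one training run per beta value using the CLI from this guide;
# each run logs to its own directory so the metrics can be compared.
for beta in (0.05, 0.1, 0.3):
    subprocess.run(
        [
            "python", "-m", "bios_cookbook.recipes.preference.train",
            f"log_path=/tmp/dpo-beta-{beta}",
            "model_name=ultrasafe/usf-conversation",
            "dataset=helpsteer3",
            "renderer_name=ultrasafe",
            "learning_rate=1e-5",
            f"dpo_beta={beta}",
        ],
        check=True,
    )
```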
Learning Rate
Use a lower learning rate than for supervised fine-tuning, typically 1e-5 to 1e-6. DPO is sensitive to the learning rate; too high a value can cause instability.
Base Model Selection
The base model should already be in-distribution with respect to the preference data: either start with a light SFT phase or collect on-policy preferences. A sharp distribution mismatch can produce strange behaviors.
Next Steps
Explore related preference learning techniques: