Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) is a method for training language models to align with human preferences without requiring a separate reward model. Instead of using reinforcement learning with human feedback (RLHF), DPO directly optimizes the model to prefer chosen responses over rejected ones using a simple classification loss.

DPO vs RLHF

DPO eliminates the need for a separate reward model by directly optimizing the policy to prefer chosen responses. This makes training simpler and computationally cheaper than classical RLHF.

DPO Algorithm Details

The core DPO loss optimizes the policy to maximize the log-odds of choosing preferred responses:

Mathematical Formulation

\mathcal{L}_\text{DPO}(\theta) =
  -\mathbb{E}_{(x,\, y_\text{chosen},\, y_\text{rejected}) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_\text{chosen} \mid x)}{\pi_\text{ref}(y_\text{chosen} \mid x)}
      - \beta \log \frac{\pi_\theta(y_\text{rejected} \mid x)}{\pi_\text{ref}(y_\text{rejected} \mid x)}
  \right) \right]

Where:

  • πθ - Current policy being trained
  • πref - Reference model (typically the initial model before DPO)
  • β - DPO beta parameter (controls preference learning strength)
  • D - Dataset of (prompt x, chosen y_chosen, rejected y_rejected)
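
To make this concrete, the snippet below is a minimal PyTorch sketch of the loss above, assuming you already have the summed per-sequence log-probabilities of each completion under the policy (πθ) and the frozen reference model (πref). The function and argument names are illustrative, not the cookbook's actual API.

DPO Loss (Illustrative Sketch)
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x)), shape (batch,)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigma(margin): a logistic loss on the reward difference
    margins = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(margins).mean()
    # Fraction of pairs where the chosen completion outranks the rejected one
    accuracy = (margins > 0).float().mean()
    return loss, accuracy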

Key Insight

DPO implicitly optimizes the classical KL-constrained RLHF objective, in which a KL penalty against the reference model limits how far the policy can deviate from its initial distribution. This prevents the policy from drifting too far from reasonable behavior.
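
Concretely, the DPO paper (Rafailov et al., 2023) derives the loss above from the usual KL-regularized reward-maximization problem:

\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
  \;-\; \beta\, \mathrm{KL}\!\left[ \pi_\theta(\cdot \mid x) \,\|\, \pi_\text{ref}(\cdot \mid x) \right]

Here β is the same coefficient as in the DPO loss: it weights the KL penalty, so larger values keep πθ closer to πref.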

Running DPO Training

The implementation lives in train_dpo.py and exposes a CLI. Run it from the command line:

DPO Training Command
python -m bios_cookbook.recipes.preference.train \
    log_path=/tmp/dpo-experiment \
    model_name=ultrasafe/usf-conversation \
    dataset=helpsteer3 \
    renderer_name=ultrasafe \
    learning_rate=1e-5 \
    dpo_beta=0.1

Key Parameters

  • log_path - Directory for results and checkpoints (e.g. /tmp/dpo-exp)
  • model_name - Base model for initialization and reference policy (e.g. ultrasafe/usf-mini)
  • dataset - Preference dataset name (e.g. helpsteer3)
  • renderer_name - Conversation formatting; see the Rendering guide (e.g. ultrasafe)
  • learning_rate - Learning rate for optimization (e.g. 1e-5)
  • dpo_beta - DPO beta parameter, i.e. preference strength (e.g. 0.1)

Available Preference Datasets

Bios provides several pre-configured preference datasets. These are implemented as DPODatasetBuilder classes:

  • HHH (Helpful-Harmless-Honest) - Anthropic's dataset focusing on helpfulness, harmlessness, and honesty. Select it with dataset=hhh.
  • HelpSteer3 (HS3) - NVIDIA's HelpSteer3 preference dataset for helpful AI assistants. Select it with dataset=helpsteer3.
  • UltraFeedback (UFB) - UltraFeedback binarized preferences dataset for general alignment. Select it with dataset=ultrafeedback.

Custom Datasets

You can implement custom dataset builders following the bios_cookbook.preference.preference_datasets interface. Each dataset builder should provide pairwise preferences with prompts and chosen/rejected completions.
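
The exact DPODatasetBuilder interface is defined in bios_cookbook.preference.preference_datasets, so treat the sketch below as an illustration of the general shape only: the PreferenceExample record and the build() method are made-up names, not the cookbook's real signatures.

Custom Dataset Builder (Illustrative Sketch)
import json
from dataclasses import dataclass

@dataclass
class PreferenceExample:
    # One pairwise preference: a prompt with chosen and rejected completions
    prompt: str
    chosen: str
    rejected: str

class MyCustomPreferenceDataset:
    """Illustrative builder that reads preference pairs from a JSONL file."""

    def __init__(self, path: str):
        self.path = path

    def build(self) -> list[PreferenceExample]:
        examples = []
        with open(self.path) as f:
            for line in f:
                record = json.loads(line)
                examples.append(PreferenceExample(
                    prompt=record["prompt"],
                    chosen=record["chosen"],
                    rejected=record["rejected"],
                ))
        return examples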

Training Process and Metrics

During training, you'll see detailed metrics showing DPO progress:

                   Step 50                    
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Metric                         ┃ Value     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ accuracy                       │ 0.568627  │
│ batch_time                     │ 27.953704 │
│ chosen_reward                  │ 0.053621  │
│ dpo_loss                       │ 0.683825  │
│ learning_rate                  │ 0.000009  │
│ margin                         │ 0.002147  │
│ num_pairs                      │ 255       │
│ num_tokens                     │ 112638    │
│ progress                       │ 0.081210  │
│ rejected_reward                │ 0.032152  │
│ test/nll                       │ 1.871778  │
└────────────────────────────────┴───────────┘

DPO Metrics Explained

  • dpo_loss - The DPO classification loss (should decrease)
  • accuracy - Accuracy of the implicit reward model on the preference dataset
  • margin - Average difference between chosen and rejected rewards
  • chosen_reward - Average reward for chosen/preferred responses
  • rejected_reward - Average reward for rejected/dispreferred responses
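
Because DPO never trains an explicit reward model, chosen_reward and rejected_reward are implicit rewards: in the standard DPO formulation they come from the β-scaled log-ratio between the policy and the reference model (up to a prompt-dependent constant that cancels in pairwise comparisons):

\hat{r}_\theta(x, y) \;=\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}

margin tracks the gap between chosen and rejected rewards, and accuracy reports how often the implicit reward ranks the chosen response above the rejected one.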

Evaluating DPO Models

After training, evaluate your DPO model on benchmarks to measure the impact of preference optimization:

Model Evaluation
MODEL_PATH=bios://YOUR_MODEL_PATH_HERE
python -m bios_cookbook.eval.run_inspect_evals \
    model_path=$MODEL_PATH \
    model_name=ultrasafe/usf-conversation \
    tasks=inspect_evals/ifeval \
    renderer_name=ultrasafe

This evaluates the model on various benchmarks to quantify the impact of preference optimization on instruction-following, helpfulness, and other quality dimensions.

Tips for DPO Training

Beta Parameter

Start with dpo_beta=0.1 and adjust based on your dataset. Higher beta keeps the policy closer to the reference model (more conservative updates), while lower beta lets the policy move further to fit the preference signal.

Learning Rate

Use a lower learning rate than for supervised fine-tuning, typically in the 1e-6 to 1e-5 range. DPO is sensitive to the learning rate; values that are too high can cause instability.

Base Model Selection

The base model should already be in-distribution with respect to the preference data: either start with a light SFT phase or collect on-policy preferences. A sharp distribution mismatch can produce strange behavior.
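
As a rough pre-flight check of that match, you can measure the base model's NLL on the chosen completions before starting DPO. The sketch below uses HuggingFace transformers purely for illustration (it is not part of the cookbook's tooling) and ignores chat formatting for simplicity.

Distribution-Match Check (Illustrative Sketch)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_chosen_nll(model_name: str, pairs: list[dict]) -> float:
    """Average per-token NLL of prompt + chosen text under the base model."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for pair in pairs:
            text = pair["prompt"] + pair["chosen"]
            ids = tok(text, return_tensors="pt").input_ids
            # Passing labels makes the model return the mean token NLL as .loss
            out = model(input_ids=ids, labels=ids)
            n_predicted = ids.shape[1] - 1  # labels are shifted inside the model
            total_nll += out.loss.item() * n_predicted
            total_tokens += n_predicted
    return total_nll / total_tokens

A noticeably higher NLL than the model normally achieves on its own domain suggests a distribution mismatch; in that case, run a light SFT pass before DPO.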