Complete RLHF Pipeline

This guide demonstrates the complete Reinforcement Learning from Human Feedback (RLHF) pipeline. The Bios Cookbook provides a script implementing the standard three-stage RLHF process.

Three-Stage RLHF Pipeline

1. Supervised Fine-Tuning: Train initial policy on instruction-following data
2. Preference Model Training: Train reward model on pairwise preference comparisons
3. RL Policy Optimization: Use preference model as reward for policy gradient training

Run the Complete Pipeline

Execute all three stages with a single command:

Run RLHF Pipeline
python -m bios_cookbook.recipes.preference.rlhf.rlhf_pipeline

This script orchestrates all three training stages sequentially, managing checkpoints and data flow between stages automatically.
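
For intuition, the following is a minimal sketch of this kind of sequential orchestration. The stage callables, their signatures, and the directory layout are illustrative assumptions rather than the actual bios_cookbook interfaces.

Orchestration Sketch (illustrative)
from pathlib import Path
from typing import Callable

# Each stage is modeled as a callable that writes its artifacts under a given
# directory and returns the path of the checkpoint it produced. These
# signatures are hypothetical, chosen only to show the data flow between stages.
Stage = Callable[[Path], Path]


def run_pipeline(
    train_sft: Stage,
    train_preference_model: Stage,
    train_rl_policy: Callable[[Path, Path, Path], Path],
    output_dir: str = "rlhf_run",
) -> Path:
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    sft_ckpt = train_sft(out / "sft")                        # Stage 1
    pref_ckpt = train_preference_model(out / "preference")   # Stage 2
    # Stage 3 consumes both earlier checkpoints and produces the final model.
    return train_rl_policy(sft_ckpt, pref_ckpt, out / "rl")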

Stage 1: Training the Initial Policy (SFT)

First, train the policy via supervised learning on high-quality instruction-following data. This establishes a strong foundation for subsequent RL training.

Configuration

Dataset: Curated instruction-following dataset with human-written responses
Objective: Cross-entropy loss on high-quality demonstrations (see the sketch after this list)
Purpose: Establish instruction-following capability and a response-quality baseline
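
To make the objective concrete, here is a short sketch of a response-only cross-entropy loss. It assumes a HuggingFace-style causal language model whose forward pass returns .logits; the function name and masking convention are illustrative assumptions, not the cookbook's actual implementation.

SFT Objective Sketch (PyTorch, illustrative)
import torch
import torch.nn.functional as F


def sft_loss(model, input_ids: torch.Tensor, prompt_lens: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy on the response portion of each demonstration.

    input_ids:   (batch, seq_len) token ids for prompt + human-written response
    prompt_lens: (batch,) number of prompt tokens to exclude from the loss
    """
    logits = model(input_ids).logits          # (batch, seq_len, vocab_size)
    logits = logits[:, :-1, :]                # position t predicts token t + 1
    targets = input_ids[:, 1:]

    # Only response tokens contribute to the loss; prompt tokens are masked out.
    positions = torch.arange(targets.shape[1], device=input_ids.device)
    response_mask = positions[None, :] >= (prompt_lens[:, None] - 1)

    token_loss = F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (token_loss * response_mask).sum() / response_mask.sum()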

Why Start with SFT?

Supervised fine-tuning provides the model with basic instruction-following ability and domain knowledge. This significantly accelerates RL convergence and discourages the policy from drifting into undesirable behaviors during exploration.

Stage 2: Training the Preference Model

Train a preference model (reward model) on pairwise comparison data. The model learns to predict which of two completions for the same prompt is preferred.

Configuration

Dataset: Pairwise preference comparisons (e.g., HHH from Anthropic)
Input: A pair of completions (A and B) for the same prompt
Output: A prediction of which completion is preferred
Loss: Bradley-Terry ranking loss on pairwise comparisons (see the sketch after this list)
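
As a rough illustration of that loss, the sketch below assumes the preference model assigns a scalar score to each completion, so the Bradley-Terry objective simply pushes the preferred completion's score above the rejected one's. The function and variable names are illustrative, not the cookbook's actual API.

Bradley-Terry Loss Sketch (PyTorch, illustrative)
import torch
import torch.nn.functional as F


def bradley_terry_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log sigmoid(score_chosen - score_rejected).

    chosen_scores / rejected_scores: (batch,) scalar preference-model scores
    for the preferred and dispreferred completion of each comparison pair.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()


# Example: the loss is small when preferred completions already score higher.
loss = bradley_terry_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, -0.9]))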

Preference Model Purpose

The trained preference model acts as a reward function in Stage 3. It scores how well completions align with human preferences, enabling RL optimization without requiring human evaluation during training.

Stage 3: Training the Policy via RL

Use the preference model as a reward function to optimize the policy through reinforcement learning. This stage employs self-play with pairwise comparisons.

RL Configuration

Initial Policy: Checkpoint from Stage 1 (the SFT model)
Reward Function: Preference model from Stage 2
Self-Play: Sample multiple completions per prompt and use the preference model to grade all pairs
Reward Signal: Win fraction from pairwise comparisons (see the sketch after the self-play description below)
Algorithm: PPO for stable policy updates

Self-Play Mechanism

For each prompt, the policy generates multiple completions. The preference model grades all pairs, creating a tournament-style evaluation. The policy receives reward based on its win fraction, incentivizing generation of responses that the preference model ranks highly.
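
Below is a minimal sketch of this reward computation. It assumes a helper prefers(a, b) that returns the preference model's probability that completion a beats completion b for the shared prompt; the helper and function names are hypothetical. The resulting per-completion rewards then feed the PPO update described in the RL configuration above.

Win-Fraction Reward Sketch (illustrative)
from typing import Callable, Sequence


def win_fraction_rewards(
    completions: Sequence[str],
    prefers: Callable[[str, str], float],
) -> list[float]:
    """Tournament-style reward: each completion's average win probability
    against every other completion sampled for the same prompt."""
    n = len(completions)  # assumes n >= 2 completions per prompt
    rewards = []
    for i in range(n):
        wins = sum(
            prefers(completions[i], completions[j])
            for j in range(n)
            if j != i
        )
        rewards.append(wins / (n - 1))  # n - 1 matchups per completion
    return rewards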

Pipeline Flow Diagram

1. Supervised Fine-Tuning: Train on instruction data → Save SFT checkpoint
2. Preference Model Training: Train on pairwise preferences → Save preference model
3. RL Policy Optimization: Load SFT checkpoint + preference model → RL training with self-play → Final aligned model