Complete RLHF Pipeline

This guide demonstrates the complete Reinforcement Learning from Human Feedback (RLHF) pipeline. The Bios Cookbook provides a script implementing the standard three-stage RLHF process.

Three-Stage RLHF Pipeline

1. Supervised Fine-Tuning: Train initial policy on instruction-following data
2. Preference Model Training: Train reward model on pairwise preference comparisons
3. RL Policy Optimization: Use preference model as reward for policy gradient training

Run the Complete Pipeline

Execute all three stages with a single command:

Run RLHF Pipeline
python -m bios_cookbook.recipes.preference.rlhf.rlhf_pipeline

This script orchestrates all three training stages sequentially, managing checkpoints and data flow between stages automatically.
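
For intuition, the following is a minimal sketch of this kind of sequential orchestration. The stage callables, their signatures, and the directory layout are illustrative assumptions rather than the actual bios_cookbook interfaces.

Orchestration Sketch (illustrative)
from pathlib import Path
from typing import Callable

# Each stage is modeled as a callable that writes its artifacts under a given
# directory and returns the path of the checkpoint it produced. These
# signatures are hypothetical, chosen only to show the data flow between stages.
Stage = Callable[[Path], Path]


def run_pipeline(
    train_sft: Stage,
    train_preference_model: Stage,
    train_rl_policy: Callable[[Path, Path, Path], Path],
    output_dir: str = "rlhf_run",
) -> Path:
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    sft_ckpt = train_sft(out / "sft")                        # Stage 1
    pref_ckpt = train_preference_model(out / "preference")   # Stage 2
    # Stage 3 consumes both earlier checkpoints and produces the final model.
    return train_rl_policy(sft_ckpt, pref_ckpt, out / "rl")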

Stage 1: Training the Initial Policy (SFT)

First, train the policy via supervised learning on high-quality instruction-following data. This establishes a strong foundation for subsequent RL training.

Configuration

Dataset: Curated instruction-following dataset with human-written responses
Objective: Cross-entropy loss on high-quality demonstrations (see the sketch after this list)
Purpose: Establish instruction-following capability and a response-quality baseline
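
To make the objective concrete, here is a short sketch of a response-only cross-entropy loss. It assumes a HuggingFace-style causal language model whose forward pass returns .logits; the function name and masking convention are illustrative assumptions, not the cookbook's actual implementation.

SFT Objective Sketch (PyTorch, illustrative)
import torch
import torch.nn.functional as F


def sft_loss(model, input_ids: torch.Tensor, prompt_lens: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy on the response portion of each demonstration.

    input_ids:   (batch, seq_len) token ids for prompt + human-written response
    prompt_lens: (batch,) number of prompt tokens to exclude from the loss
    """
    logits = model(input_ids).logits          # (batch, seq_len, vocab_size)
    logits = logits[:, :-1, :]                # position t predicts token t + 1
    targets = input_ids[:, 1:]

    # Only response tokens contribute to the loss; prompt tokens are masked out.
    positions = torch.arange(targets.shape[1], device=input_ids.device)
    response_mask = positions[None, :] >= (prompt_lens[:, None] - 1)

    token_loss = F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (token_loss * response_mask).sum() / response_mask.sum()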

Why Start with SFT?

Supervised fine-tuning provides the model with basic instruction-following ability and domain knowledge. This significantly accelerates RL convergence and discourages the policy from drifting into undesirable behaviors during exploration.

Stage 2: Training the Preference Model

Train a preference model (reward model) on pairwise comparison data. The model learns to predict which of two completions for the same prompt is preferred.

Configuration

Dataset: Pairwise preference comparisons (e.g., HHH from Anthropic)
Input: A pair of completions (A and B) for the same prompt
Output: A prediction of which completion is preferred
Loss: Bradley-Terry ranking loss on pairwise comparisons (see the sketch after this list)
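
As a rough illustration of that loss, the sketch below assumes the preference model assigns a scalar score to each completion, so the Bradley-Terry objective simply pushes the preferred completion's score above the rejected one's. The function and variable names are illustrative, not the cookbook's actual API.

Bradley-Terry Loss Sketch (PyTorch, illustrative)
import torch
import torch.nn.functional as F


def bradley_terry_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log sigmoid(score_chosen - score_rejected).

    chosen_scores / rejected_scores: (batch,) scalar preference-model scores
    for the preferred and dispreferred completion of each comparison pair.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()


# Example: the loss is small when preferred completions already score higher.
loss = bradley_terry_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, -0.9]))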

Preference Model Purpose

The trained preference model acts as a reward function in Stage 3. It scores how well completions align with human preferences, enabling RL optimization without requiring human evaluation during training.

Stage 3: Training the Policy via RL

Use the preference model as a reward function to optimize the policy through reinforcement learning. This stage employs self-play with pairwise comparisons.

RL Configuration

Initial Policy: Checkpoint from Stage 1 (the SFT model)
Reward Function: Preference model from Stage 2
Self-Play: Sample multiple completions per prompt and use the preference model to grade all pairs
Reward Signal: Win fraction from pairwise comparisons (see the sketch after the self-play description below)
Algorithm: PPO for stable policy updates

Self-Play Mechanism

For each prompt, the policy generates multiple completions. The preference model grades all pairs, creating a tournament-style evaluation. The policy receives reward based on its win fraction, incentivizing generation of responses that the preference model ranks highly.
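
Below is a minimal sketch of this reward computation. It assumes a helper prefers(a, b) that returns the preference model's probability that completion a beats completion b for the shared prompt; the helper and function names are hypothetical. The resulting per-completion rewards then feed the PPO update described in the RL configuration above.

Win-Fraction Reward Sketch (illustrative)
from typing import Callable, Sequence


def win_fraction_rewards(
    completions: Sequence[str],
    prefers: Callable[[str, str], float],
) -> list[float]:
    """Tournament-style reward: each completion's average win probability
    against every other completion sampled for the same prompt."""
    n = len(completions)  # assumes n >= 2 completions per prompt
    rewards = []
    for i in range(n):
        wins = sum(
            prefers(completions[i], completions[j])
            for j in range(n)
            if j != i
        )
        rewards.append(wins / (n - 1))  # n - 1 matchups per completion
    return rewards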

Pipeline Flow Diagram

1. Supervised Fine-Tuning: Train on instruction data → Save SFT checkpoint
2. Preference Model Training: Train on pairwise preferences → Save preference model
3. RL Policy Optimization: Load SFT checkpoint + preference model → RL training with self-play → Final aligned model