RL Hyperparameters

This guide covers key hyperparameters for reinforcement learning training, from core settings to advanced configurations for distributed RL with Bios.

Core Hyperparameters

Learning Rate

As in supervised learning (SL), the learning rate is the most critical hyperparameter. We recommend using the guidance from the SL hyperparameters guide as a starting point for RL experiments.

Get RL Learning Rate
from bios_cookbook.hyperparam_utils import get_lr

# Same utility works for RL
model_name = "ultrasafe/usf-code"
rl_lr = get_lr(model_name)

print(f"Recommended RL LR: {rl_lr}")
# RL typically uses same or slightly lower LR than SL

RL vs SL Learning Rates

For RL, you can reuse the validated LR formulas from SL. The same LR, or a slightly lower one (0.5-1.0x the SL LR), usually works well. LoRA rank independence still applies: use lower ranks (8-16) for RL efficiency.
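If you prefer to start below the SL value, here is a minimal sketch of applying a factor in the 0.5-1.0x range; the 0.75 used below is an illustrative assumption, not a derived recommendation:

from bios_cookbook.hyperparam_utils import get_lr

sl_lr = get_lr("ultrasafe/usf-code")

# 0.75 is an arbitrary illustrative factor within the 0.5-1.0 range.
rl_lr = 0.75 * sl_lr

print(f"SL LR: {sl_lr}, RL LR: {rl_lr}")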

Batch and Group Sizes

RL training uses two key parameters for data collection:

batch_size

Number of unique environments or problems used for training

Controls diversity of training data. More unique problems = better generalization.

group_size

Number of rollouts performed per unique environment

Controls exploration per problem. More rollouts = better reward estimation per environment.

Usage Guidelines

  • Limited environments? Increase group_size to generate more training data from available problems
  • Total rollouts: batch_size × group_size determines total trajectories per iteration
  • LR scaling: Scale the learning rate as LR ∝ √batch_size when you change batch_size

Configure Batch and Group Sizes
from bios.rlhf import PPOTrainer

trainer = PPOTrainer(
    api_key="YOUR_API_KEY",
    model="ultrasafe/usf-code"
)

session = trainer.create_session({
    "batch_size": 32,    # 32 unique problems
    "group_size": 4,     # 4 rollouts per problem
    "learning_rate": 2e-5,
    "lora_rank": 8
})

# Total rollouts per iteration = 32 × 4 = 128
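To make the LR ∝ √batch_size guideline concrete, here is a small sketch. The assumption that the LR returned by get_lr corresponds to a reference batch size of 32 is illustrative, not something Bios specifies:

from bios_cookbook.hyperparam_utils import get_lr

base_batch_size = 32                      # assumed reference batch size (illustrative)
base_lr = get_lr("ultrasafe/usf-code")    # LR treated as validated at the reference batch size

def scaled_lr(batch_size: int) -> float:
    # LR ∝ √batch_size, relative to the reference point
    return base_lr * (batch_size / base_batch_size) ** 0.5

print(scaled_lr(64))   # larger batch  -> ~1.41x base_lr
print(scaled_lr(16))   # smaller batch -> ~0.71x base_lr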

Multiple Updates per Sampling Iteration

The num_substeps parameter controls how many policy weight updates are performed on data sampled from the last policy iteration, similar to PPO and GRPO.

How It Works

num_substeps = 1 (default)

Each batch of collected trajectories is used for exactly one optimizer update

num_substeps > 1

Batch is split into mini-batches, with one update per mini-batch. Rollouts for the same environment are kept together in the same mini-batch. Still takes only one epoch through the data.
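A toy sketch of the splitting behavior described above: a batch of groups is divided into num_substeps mini-batches, all rollouts for the same environment stay in one mini-batch, and the data is traversed exactly once. This is illustrative only, not the Bios implementation:

# Hypothetical batch: 8 environments (groups), 4 rollouts each.
batch = {f"env_{i}": [f"rollout_{i}_{j}" for j in range(4)] for i in range(8)}
num_substeps = 4

groups = list(batch.items())
assert len(groups) % num_substeps == 0, "batch_size must be divisible by num_substeps"
groups_per_minibatch = len(groups) // num_substeps

# One optimizer update per mini-batch; every environment's rollouts stay together,
# and the batch is consumed in a single epoch.
for step in range(num_substeps):
    minibatch = dict(groups[step * groups_per_minibatch:(step + 1) * groups_per_minibatch])
    print(f"substep {step}: update on {list(minibatch)}")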

⚠️ Usage Guidelines

  • Divisibility: batch_size must be divisible by num_substeps
  • Start low: num_substeps = 1 gives decent performance; if experimenting, try 2-4 with PPO
  • High values risk: Updates become too out-of-distribution for the policy
  • Consider LR reduction: When using multiple substeps, decrease LR to maintain stability

Advanced Training Configurations

⚠️ Experimental Features

The following features are experimental and may be subject to instabilities. They are currently disabled by default. Use with caution and close monitoring.

Streaming Minibatch Training

Overlap trajectory sampling and model training to improve throughput. Submit training requests as soon as rollouts complete, without waiting for all sampling jobs to finish.

StreamMinibatchConfig
from bios.rlhf import StreamMinibatchConfig

config = StreamMinibatchConfig(
    groups_per_batch=32,    # Same as batch_size
    num_minibatches=8       # Split into 8 training requests
)

# Improves pipeline efficiency by overlapping sampling and training
# Remains strictly on-policy

Key Point: This is a pipeline efficiency improvement only. Training remains strictly on-policy—no off-policy bias is introduced.
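To make the overlap concrete, here is a toy producer/consumer sketch in plain Python threading, not the Bios pipeline: rollout workers push completed groups onto a queue, and a training request is issued as soon as one mini-batch's worth of groups is available.

import queue
import random
import threading
import time

# Illustrative stand-ins mirroring the StreamMinibatchConfig values above (not Bios internals).
GROUPS_PER_BATCH = 32
NUM_MINIBATCHES = 8
GROUPS_PER_MINIBATCH = GROUPS_PER_BATCH // NUM_MINIBATCHES  # 4 groups per training request

completed_groups = queue.Queue()

def rollout_worker(env_id: int) -> None:
    # Simulate a group of rollouts for one environment finishing at its own pace.
    time.sleep(random.uniform(0.01, 0.05))
    completed_groups.put({"env_id": env_id, "rollouts": [f"traj_{env_id}_{i}" for i in range(4)]})

def trainer_loop() -> None:
    # Issue a training request as soon as a mini-batch's worth of groups is ready,
    # rather than waiting for all GROUPS_PER_BATCH groups to finish.
    for minibatch_idx in range(NUM_MINIBATCHES):
        minibatch = [completed_groups.get() for _ in range(GROUPS_PER_MINIBATCH)]
        print(f"minibatch {minibatch_idx}: training on envs {[g['env_id'] for g in minibatch]}")

trainer = threading.Thread(target=trainer_loop)
workers = [threading.Thread(target=rollout_worker, args=(i,)) for i in range(GROUPS_PER_BATCH)]
trainer.start()
for w in workers:
    w.start()
for w in workers:
    w.join()
trainer.join()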

Async Off-Policy Training

Async training allows the model to train on trajectories from slightly older model versions, enabling higher throughput at the cost of some off-policy bias. Bios supports the "off-by-K" async RL approach.

AsyncConfig
from bios.rlhf import AsyncConfig

config = AsyncConfig(
    max_steps_off_policy=3,  # Max age of trajectories (in steps)
    groups_per_batch=32      # New groups to accumulate before update
)

# Trajectories older than max_steps_off_policy are discarded
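A toy sketch of the off-by-K filtering idea: trajectory groups sampled more than max_steps_off_policy optimizer steps ago are dropped before an update. The data structures below are assumptions for illustration, not Bios internals:

from dataclasses import dataclass

@dataclass
class TrajectoryGroup:
    env_id: int
    policy_version: int  # optimizer step at which this group's rollouts were sampled

max_steps_off_policy = 3   # matches the AsyncConfig above
current_step = 10          # hypothetical current optimizer step

buffer = [
    TrajectoryGroup(env_id=0, policy_version=9),   # 1 step old  -> kept
    TrajectoryGroup(env_id=1, policy_version=6),   # 4 steps old -> discarded
    TrajectoryGroup(env_id=2, policy_version=10),  # on-policy   -> kept
]

# Keep only trajectories that are at most max_steps_off_policy steps old.
fresh = [g for g in buffer if current_step - g.policy_version <= max_steps_off_policy]
print([g.env_id for g in fresh])  # [0, 2]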

Async RL Guidelines

  • Use case: Long heterogeneous rollouts (long CoT, multi-hop tool use, agentic workflows)
  • Start small: max_steps_off_policy < 5 initially
  • Monitor closely: Off-policy data can degrade performance or crash policy
  • Separate from batch_size: groups_per_batch is distinct from dataset construction batch size

Monitoring and Run Health

Using policy-gradient algorithms with off-policy data requires careful monitoring. KL divergence is the primary indicator of training health.

KL Divergence Monitoring

Bios logs the KL divergence between the data-generation (sampler) policy and the current learner, D_KL[π_sampler(·|x) || π_θ(·|x)], using two estimators:

kl_sample_train_v1

First KL estimator

kl_sample_train_v2

Second KL estimator

Important Notes on KL Divergence

  • Non-zero even on-policy: KL divergence won't be exactly zero due to implementation details and numerical precision, even with fully on-policy training
  • Stability threshold: Training is typically stable with KL divergence < 0.01
  • Warning sign: If KL divergence exceeds this threshold, it usually indicates numerical instability or a training issue
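The exact formulas behind kl_sample_train_v1 and kl_sample_train_v2 are not documented here, but as a rough illustration, two standard estimators of D_KL[π_sampler || π_θ] can be computed from per-token log-probabilities of the sampled tokens and checked against the 0.01 threshold. The estimator choices and logprob values below are assumptions, not necessarily what Bios logs:

import math

def kl_estimates(sampler_logprobs, learner_logprobs):
    # Per-token log ratio: log π_sampler(token) - log π_θ(token) for sampled tokens.
    log_ratios = [s - l for s, l in zip(sampler_logprobs, learner_logprobs)]
    k1 = sum(log_ratios) / len(log_ratios)                                # naive estimator (can be negative)
    k3 = sum(math.exp(-r) - 1 + r for r in log_ratios) / len(log_ratios)  # lower-variance, non-negative estimator
    return k1, k3

# Hypothetical per-token logprobs from the sampler and the current learner.
sampler_lp = [-1.20, -0.85, -2.10]
learner_lp = [-1.18, -0.90, -2.05]

for name, value in zip(["kl_estimate_v1", "kl_estimate_v2"], kl_estimates(sampler_lp, learner_lp)):
    status = "ok" if value < 0.01 else "warning: possible instability"
    print(f"{name}: {value:.4f} ({status})")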

Quick Reference: RL Hyperparameters

Parameter        Recommended      Notes
Learning Rate    get_lr(model)    Use SL formula, or 0.5-1.0x for RL
LoRA Rank        8-16             Low rank sufficient for RL
batch_size       16-64            Unique environments per iteration
group_size       4-8              Rollouts per environment
num_substeps     1                Policy updates per iteration
KL Threshold     < 0.01           Maintain for stable training