RL Hyperparameters
This guide covers key hyperparameters for reinforcement learning training, from core settings to advanced configurations for distributed RL with Bios.
Core Hyperparameters
Learning Rate
Similar to supervised learning, the learning rate is the most critical hyperparameter. We recommend using the guidance from SL hyperparameters as a starting point for RL experiments.
from bios_cookbook.hyperparam_utils import get_lr

# Same utility works for RL
model_name = "ultrasafe/usf-code"
rl_lr = get_lr(model_name)

print(f"Recommended RL LR: {rl_lr}")
# RL typically uses the same or slightly lower LR than SL
RL vs SL Learning Rates
For RL, you can use the same validated LR formulas from SL. RL often works well with the same LR or slightly lower (0.5-1.0x the SL LR). The LoRA rank independence still applies—use lower ranks (8-16) for RL efficiency.
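A minimal sketch of that rule of thumb, reusing get_lr from above (the 0.75 multiplier is just an illustrative point in the 0.5-1.0x range, not a Bios default):
from bios_cookbook.hyperparam_utils import get_lr

# Start from the SL-validated LR, then scale into the 0.5-1.0x range for RL
sl_lr = get_lr("ultrasafe/usf-code")
rl_lr = 0.75 * sl_lr  # illustrative midpoint; 1.0x is also a reasonable default

print(f"SL LR: {sl_lr}, RL LR: {rl_lr}")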
Batch and Group Sizes
RL training uses two key parameters for data collection:
batch_size
Number of unique environments or problems used for training
Controls diversity of training data. More unique problems = better generalization.
group_size
Number of rollouts performed per unique environment
Controls exploration per problem. More rollouts = better reward estimation per environment.
Usage Guidelines
- Limited environments? Increase group_size to generate more training data from the available problems
- Total rollouts: batch_size × group_size determines total trajectories per iteration
- LR scaling: Scale the learning rate as LR ∝ √batch_size when changing the batch size (see the sketch below)
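A minimal sketch of that scaling rule (the helper name and baseline values below are illustrative, not part of the Bios API):
import math

# Illustrative helper: rescale the LR when changing batch_size, assuming LR ∝ √batch_size
def scale_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    return base_lr * math.sqrt(new_batch_size / base_batch_size)

# e.g. doubling batch_size from 32 to 64 scales the LR by √2 ≈ 1.41
print(scale_lr(2e-5, 32, 64))  # ≈ 2.83e-5
The session example below then shows batch_size and group_size in a full configuration.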
from bios.rlhf import PPOTrainer

trainer = PPOTrainer(
    api_key="YOUR_API_KEY",
    model="ultrasafe/usf-code"
)

session = trainer.create_session({
    "batch_size": 32,    # 32 unique problems
    "group_size": 4,     # 4 rollouts per problem
    "learning_rate": 2e-5,
    "lora_rank": 8
})

# Total rollouts per iteration = 32 × 4 = 128
Multiple Updates per Sampling Iteration
The num_substeps parameter controls how many policy weight updates are performed on data sampled from the last policy iteration, similar to PPO and GRPO.
How It Works
num_substeps = 1 (default)
Each batch of collected trajectories is used for exactly one optimizer update
num_substeps > 1
Batch is split into mini-batches, with one update per mini-batch. Rollouts for the same environment are kept together in the same mini-batch. Still takes only one epoch through the data.
⚠️ Usage Guidelines
- Divisibility: batch_size must be divisible by num_substeps
- Start low: num_substeps = 1 gives decent performance; if experimenting, try 2-4 with PPO (see the sketch after this list)
- High values are risky: later updates train on data that is increasingly out-of-distribution for the current policy
- Consider LR reduction: When using multiple substeps, decrease the LR to maintain stability
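A minimal sketch, assuming num_substeps is passed through the same session config used earlier (that placement, and the specific values, are assumptions rather than confirmed Bios API):
from bios.rlhf import PPOTrainer

trainer = PPOTrainer(
    api_key="YOUR_API_KEY",
    model="ultrasafe/usf-code"
)

# Assumed placement of num_substeps: 32 groups split into 4 mini-batches of 8,
# giving 4 optimizer updates per sampling iteration (still one epoch over the data)
session = trainer.create_session({
    "batch_size": 32,        # must be divisible by num_substeps
    "group_size": 4,
    "num_substeps": 4,       # 4 policy updates per sampling iteration
    "learning_rate": 1e-5,   # reduced from 2e-5 to compensate for the extra updates
    "lora_rank": 8
})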
Advanced Training Configurations
⚠️ Experimental Features
The following features are experimental and may be subject to instabilities. They are currently disabled by default. Use with caution and close monitoring.
Streaming Minibatch Training
Overlap trajectory sampling and model training to improve throughput. Submit training requests as soon as rollouts complete, without waiting for all sampling jobs to finish.
from bios.rlhf import StreamMinibatchConfig

config = StreamMinibatchConfig(
    groups_per_batch=32,    # Same as batch_size
    num_minibatches=8       # Split into 8 training requests
)

# Improves pipeline efficiency by overlapping sampling and training
# Remains strictly on-policy
Key Point: This is a pipeline efficiency improvement only. Training remains strictly on-policy; no off-policy bias is introduced.
Async Off-Policy Training
Async training allows the model to train on trajectories from slightly older model versions, enabling higher throughput at the cost of some off-policy bias. Bios supports the "off-by-K" async RL approach.
from bios.rlhf import AsyncConfig

config = AsyncConfig(
    max_steps_off_policy=3,  # Max age of trajectories (in steps)
    groups_per_batch=32      # New groups to accumulate before update
)

# Trajectories older than max_steps_off_policy are discarded
Async RL Guidelines
- Use case: Long heterogeneous rollouts (long CoT, multi-hop tool use, agentic workflows)
- Start small: max_steps_off_policy < 5 initially (see the sketch below)
- Monitor closely: Off-policy data can degrade performance or destabilize the policy
- Separate from batch_size: groups_per_batch is distinct from the dataset construction batch size
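A conceptual sketch of the off-by-K discard rule (Bios handles this internally; the function and trajectory fields below are illustrative assumptions):
# Illustrative only: keep trajectories generated within max_steps_off_policy
# steps of the current learner version, discard the rest
def filter_rollouts(rollouts, current_step, max_steps_off_policy=3):
    return [
        r for r in rollouts
        if current_step - r["policy_step"] <= max_steps_off_policy
    ]

# Example: at learner step 10, trajectories from steps 7-10 are kept, older ones discarded
kept = filter_rollouts(
    [{"policy_step": 5}, {"policy_step": 8}, {"policy_step": 10}],
    current_step=10,
)
print(len(kept))  # 2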
Monitoring and Run Health
Using policy-gradient algorithms with off-policy data requires careful monitoring. KL divergence is the primary indicator of training health.
KL Divergence Monitoring
Bios logs KL divergence between the data generation policy and current learner using two estimators:
- kl_sample_train_v1: first KL estimator
- kl_sample_train_v2: second KL estimator
Important Notes on KL Divergence
1. Non-zero even on-policy: KL divergence won't be exactly zero due to implementation details and numerical precision, even with fully on-policy training
2. Stability threshold: Training is typically stable with KL divergence < 0.01
3. Warning sign: If KL divergence exceeds this threshold, it indicates numerical instability or another training issue (see the monitoring sketch below)
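A minimal monitoring sketch; only the metric names and the 0.01 threshold come from this guide, while the way metrics are fetched into a dict is an assumption:
KL_THRESHOLD = 0.01

def check_run_health(metrics: dict) -> None:
    # Warn when either logged KL estimator exceeds the stability threshold
    for key in ("kl_sample_train_v1", "kl_sample_train_v2"):
        kl = metrics.get(key)
        if kl is not None and kl > KL_THRESHOLD:
            print(f"WARNING: {key} = {kl:.4f} exceeds {KL_THRESHOLD}; "
                  "check for off-policy drift or numerical instability")

# Example with a logged metrics snapshot (values are made up)
check_run_health({"kl_sample_train_v1": 0.004, "kl_sample_train_v2": 0.018})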
Quick Reference: RL Hyperparameters
| Parameter | Recommended | Notes | 
|---|---|---|
| Learning Rate | get_lr(model) | Use SL formula, or 0.5-1.0x for RL | 
| LoRA Rank | 8-16 | Low rank sufficient for RL | 
| batch_size | 16-64 | Unique environments per iteration | 
| group_size | 4-8 | Rollouts per environment | 
| num_substeps | 1 | Policy updates per iteration | 
| KL Threshold | < 0.01 | Maintain for stable training | 
Next Steps
Apply these RL hyperparameter insights to your training runs.