Supervised Learning Hyperparameters
Successful LLM fine-tuning requires careful hyperparameter tuning. While exhaustive sweeps provide the most accurate results, they're time-consuming and expensive. This guide provides research-backed starting recommendations for the most critical hyperparameters.
Validated Recommendations
These hyperparameter recommendations are validated across diverse supervised fine-tuning experiments, achieving <0.5% regret compared to exhaustive hyperparameter sweeps on UltraSafe expert models.
Learning Rate: The Critical Hyperparameter
The learning rate (LR) is generally the most important hyperparameter in ML experiments. Bios provides a research-validated formula for optimal learning rate selection based on model architecture.
Optimal Learning Rate Formula
For a model m, the recommended learning rate is:

lr(m) = lr_base × M_LoRA × (2048 / H_m)^(P_m)

Where:
- lr_base = Base learning rate constant (5e-5)
- M_LoRA = LoRA multiplier (10 for LoRA, 1 for full fine-tuning)
- H_m = Hidden size of model m, normalized by a reference hidden size of 2048
- P_m = Model-specific exponent adjustment
Current Best Estimates:
- lr_base = 5e-5
- M_LoRA = 10
- P_m = 0.781 for UltraSafe models (varies by architecture)
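As a quick sketch, the formula can be evaluated directly in Python. The normalization by a reference hidden size of 2048 and the example width of 4096 below are illustrative assumptions, not measured values for any particular model:

```python
# Constants from this guide; the 2048 reference hidden size is an
# assumption made for illustration.
LR_BASE = 5e-5   # lr_base
M_LORA = 10      # LoRA multiplier (1 for full fine-tuning)
P_M = 0.781      # model-specific exponent adjustment

def recommended_lr(hidden_size: int, lora: bool = True) -> float:
    """Evaluate lr(m) = lr_base * M_LoRA * (2048 / H_m) ** P_m."""
    multiplier = M_LORA if lora else 1
    return LR_BASE * multiplier * (2048 / hidden_size) ** P_M

print(recommended_lr(4096))               # LoRA fine-tuning
print(recommended_lr(4096, lora=False))   # full fine-tuning
```

In production, prefer `get_lr(model_name)` over hand-computing this; the sketch only makes the scaling behavior concrete.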
✓ Key Insight: LoRA Rank Independence
This formula is independent of LoRA rank. You can use the same learning rate across different ranks and get identical learning curves for initial training steps. Only the final convergence capacity varies with rank.
Get Recommended Learning Rate
Use the Bios utility to automatically calculate the optimal LR for any UltraSafe model:
```python
from bios_cookbook.hyperparam_utils import get_lr

# Get recommended LR for UltraSafe model
model_name = "ultrasafe/usf-finance"
recommended_lr = get_lr(model_name)

print(f"Recommended LR for {model_name}: {recommended_lr}")

# Example output:
# Recommended LR for ultrasafe/usf-finance: 0.00032

# Use in training
import bios
from bios import types

service_client = bios.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model=model_name,
    rank=32
)

# Apply recommended LR
training_client.optim_step(
    types.AdamParams(learning_rate=recommended_lr)
)
```

Formula Validation
The learning rate formula has been validated across diverse supervised fine-tuning experiments with varying:
- Datasets: multiple domains
- Dataset sizes: 100 to 100K examples
- Batch sizes: 16 to 512
- LoRA ranks: 8 to 128
Regret Analysis
We define regret as the performance gap between using our recommended LR and the optimal LR found via exhaustive search:

regret = (perf(lr_optimal) − perf(lr_recommended)) / perf(lr_optimal)

Our formula achieves <0.5% regret across tested scenarios, meaning the recommended LR performs within 0.5% of the empirically optimal LR without expensive hyperparameter sweeps.
Practical Implication
You can use get_lr(model_name) as your starting point and achieve near-optimal performance without manual tuning. This saves significant time and computational resources in production training pipelines.
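For concreteness, relative regret can be computed from two evaluation scores. The accuracy numbers below are made up for illustration:

```python
def regret(perf_recommended: float, perf_optimal: float) -> float:
    """Relative gap between recommended-LR and sweep-optimal performance."""
    return (perf_optimal - perf_recommended) / perf_optimal

# Hypothetical eval accuracies for the two learning rates
r = regret(perf_recommended=0.912, perf_optimal=0.915)
print(f"regret = {r:.2%}")   # well under the 0.5% target
```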
Batch Size Optimization
Batch size is the second-most important hyperparameter, significantly affecting both training efficiency and final performance. The relationship between batch size and learning rate has important implications.
Perfect Scaling Regime
For small batch sizes, there is a regime of perfect scaling in which LR and batch size should be varied together: the learning curve depends only on the ratio LR/√B, so doubling the batch size while multiplying the LR by √2 leaves the learning curve unchanged. See Shallue et al. (2018) for theoretical foundations in the training-from-scratch setting.
Note: When fine-tuning LLMs, we're often outside the perfect scaling regime. Smaller batch sizes frequently give better final performance, at the cost of longer training time.
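A minimal sketch of square-root LR scaling, assuming you have a tuned reference pair (the reference LR and batch size below are hypothetical):

```python
import math

def scaled_lr(lr_ref: float, batch_ref: int, batch_new: int) -> float:
    """Scale the LR so that LR / sqrt(batch size) stays constant."""
    return lr_ref * math.sqrt(batch_new / batch_ref)

# Hypothetical reference: LR tuned at batch size 128
print(scaled_lr(3.2e-4, 128, 512))   # 4x the batch -> 2x the LR
```

Because LLM fine-tuning often sits outside the perfect scaling regime, treat this as a starting point rather than a guarantee.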
Batch Size Recommendations
For supervised learning fine-tuning with Bios:
- Standard recommendation (batch size 128): best performance vs training time trade-off
- High performance (batch size 64): maximum quality, slower training
- Fast training (batch size 256): faster convergence, may sacrifice quality
⚠️ Batch Size Guidance
Batch size recommendations are based on preliminary findings and ongoing research. We're continuing to refine these guidelines. For best results, consider testing a few batch sizes (64, 128, 256) on your specific use case.
Minimum Training Steps
Regardless of batch size, aim for adequate training duration:
- Minimum recommended: ~1,000 training steps, sufficient for basic adaptation and simple tasks
- Optimal results: more training steps than the minimum, scaled with dataset size; best performance for complex tasks and larger datasets
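One way to sanity-check the step budget is to derive it from dataset size and epoch count. The helper and numbers below are illustrative, not part of the Bios API:

```python
import math

def training_steps(dataset_size: int, batch_size: int, epochs: int) -> int:
    """Total optimizer steps for a given dataset, batch size, and epoch count."""
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    return steps_per_epoch * epochs

# Hypothetical dataset: 10K examples at batch size 128
print(training_steps(10_000, 128, epochs=13))  # clears the ~1,000-step minimum
```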
Complete Hyperparameter Configuration
Putting it all together—a production configuration using validated hyperparameters:
```python
import asyncio

import bios
from bios import types
from bios_cookbook.hyperparam_utils import get_lr

async def optimized_training(
    base_model: str,
    training_data: list,
    batch_size: int = 128,
    lora_rank: int = 32,
    num_steps: int = 1000
):
    """
    Production SL training with optimized hyperparameters
    """
    # Initialize
    service_client = bios.ServiceClient()
    training_client = await service_client.create_lora_training_client_async(
        base_model=base_model,
        rank=lora_rank
    )

    # Get validated learning rate
    learning_rate = get_lr(base_model)
    print(f"Model: {base_model}")
    print(f"LoRA Rank: {lora_rank}")
    print(f"Batch Size: {batch_size}")
    print(f"Learning Rate: {learning_rate}")
    print(f"Target Steps: {num_steps}")
    print("-" * 50)

    # Training loop
    for step in range(num_steps):
        # Get batch (get_training_batch is your own data-sampling helper)
        batch = get_training_batch(training_data, batch_size)

        # Training step with validated LR
        fwd_future = await training_client.forward_backward_async(
            batch, "cross_entropy"
        )
        opt_future = await training_client.optim_step_async(
            types.AdamParams(learning_rate=learning_rate)
        )

        # Get results
        fwd_result = await fwd_future
        await opt_future

        if step % 50 == 0:
            print(f"Step {step}/{num_steps}: Loss = {fwd_result.loss:.4f}")

        # Checkpoint every 250 steps
        if step % 250 == 0 and step > 0:
            checkpoint = training_client.save_state(
                name=f"step_{step}"
            ).result()
            print(f"Saved checkpoint: {checkpoint.path}")

    # Final save
    final_path = training_client.save_state(name="final_model").result().path
    print(f"Training complete! Final model: {final_path}")
    return training_client

# Run with validated hyperparameters (code_examples is your prepared dataset)
asyncio.run(optimized_training(
    base_model="ultrasafe/usf-code",
    training_data=code_examples,
    batch_size=128,
    lora_rank=32,
    num_steps=1000
))
```

When to Tune Hyperparameters
While our recommended values work well for most cases, certain scenarios benefit from manual tuning:
Learning Rate Tuning
Consider manual LR tuning when:
- Training diverges or shows instability
- Loss plateaus earlier than expected
- Using non-standard model architectures
- Combining SL with other training objectives

Tuning range: try [0.5x, 1.0x, 2.0x] of the recommended LR
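The bracketing sweep can be generated mechanically; the example LR below is hypothetical:

```python
def lr_sweep(recommended: float, factors=(0.5, 1.0, 2.0)) -> list[float]:
    """Candidate learning rates bracketing the recommended value."""
    return [recommended * f for f in factors]

print(lr_sweep(3.2e-4))   # three runs around the get_lr() value
```

Run one training job per candidate and keep the LR with the best validation loss.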
Batch Size Tuning
Experiment with batch size when:
- You need faster iteration (increase batch size)
- Final performance is more important than speed (decrease batch size)
- Memory constraints limit your options
- Dataset is extremely small or large

Tuning range: try [64, 128, 256] and measure the quality vs speed trade-off
LoRA Rank Tuning
Adjust LoRA rank based on:
- Dataset size (larger datasets benefit from higher rank)
- Task complexity (complex reasoning may need higher rank)
- Training type (RL: 8-16, SL: 32-128)
- Memory constraints (lower rank = less memory)

Remember: the optimal LR is independent of rank, so use the same LR for all rank experiments
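These factors can be folded into a simple starting-point heuristic. The dataset-size thresholds below are illustrative assumptions, not validated cutoffs; only the 8-16 (RL) and 32-128 (SL) ranges come from this guide:

```python
def suggest_lora_rank(training_type: str, dataset_size: int) -> int:
    """Heuristic starting rank; size thresholds are illustrative assumptions."""
    if training_type == "rl":
        # RL range from this guide: 8-16
        return 8 if dataset_size < 10_000 else 16
    # Supervised learning range from this guide: 32-128,
    # scaled up with dataset size
    if dataset_size < 10_000:
        return 32
    if dataset_size < 100_000:
        return 64
    return 128

print(suggest_lora_rank("sl", 50_000))   # mid-sized SL dataset
```

Since the optimal LR does not depend on rank, comparing a few ranks at the same LR is cheap.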
Quick Reference: Hyperparameter Defaults
| Hyperparameter | Recommended Value | How to Get | When to Tune | 
|---|---|---|---|
| Learning Rate | get_lr(model) | get_lr("ultrasafe/usf-finance") | Training diverges or plateaus early | 
| Batch Size | 128 | Fixed value | Balance performance vs speed | 
| LoRA Rank | 32 (SL), 8-16 (RL) | Based on task type | Large datasets or complex tasks | 
| Training Steps | 1000+ | Based on dataset size | Monitor validation loss | 
Next Steps
Apply these hyperparameter insights to your training: