Supervised Learning Hyperparameters

Successful LLM fine-tuning requires careful hyperparameter tuning. While exhaustive sweeps provide the most accurate results, they're time-consuming and expensive. This guide provides research-backed starting recommendations for the most critical hyperparameters.

Validated Recommendations

These hyperparameter recommendations are validated across diverse supervised fine-tuning experiments, achieving <0.5% regret compared to exhaustive hyperparameter sweeps on UltraSafe expert models.

Learning Rate: The Critical Hyperparameter

The learning rate (LR) is generally the most important hyperparameter in ML experiments. Bios provides a research-validated formula for optimal learning rate selection based on model architecture.

Optimal Learning Rate Formula

For a model m, the recommended learning rate is:

LR(m) = lr_base · M_LoRA · (2000 / H_m)^P_m

Where:

  • lr_base = Base learning rate constant (5e-5)
  • M_LoRA = LoRA multiplier (10 for LoRA, 1 for full fine-tuning)
  • H_m = Hidden size of model m
  • P_m = Model-specific exponent adjustment

Current Best Estimates:

  • lr_base = 5e-5
  • M_LoRA = 10
  • P_m = 0.781 for UltraSafe models (varies by architecture)
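As a minimal sketch, the formula can be evaluated directly from the constants above. The hidden size used below is illustrative, not an actual UltraSafe model dimension; for real models, use `get_lr` as shown later in this guide.

```python
def recommended_lr(hidden_size: int, lora: bool = True,
                   lr_base: float = 5e-5, p_m: float = 0.781) -> float:
    """Evaluate LR(m) = lr_base * M_LoRA * (2000 / H_m)^P_m."""
    m_lora = 10 if lora else 1
    return lr_base * m_lora * (2000 / hidden_size) ** p_m

# Illustrative hidden size of 2048
print(f"{recommended_lr(2048, lora=True):.2e}")
```

Note that when H_m = 2000, the correction term vanishes and the formula reduces to lr_base · M_LoRA.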

✓ Key Insight: LoRA Rank Independence

This formula is independent of LoRA rank. You can use the same learning rate across different ranks and obtain nearly identical learning curves during the initial training steps; only the final convergence capacity varies with rank.

Get Recommended Learning Rate

Use the Bios utility to automatically calculate the optimal LR for any UltraSafe model:

Calculate Optimal Learning Rate
from bios_cookbook.hyperparam_utils import get_lr

# Get recommended LR for UltraSafe model
model_name = "ultrasafe/usf-finance"
recommended_lr = get_lr(model_name)

print(f"Recommended LR for {model_name}: {recommended_lr}")

# Example output:
# Recommended LR for ultrasafe/usf-finance: 0.00032

# Use in training
import bios
from bios import types

service_client = bios.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model=model_name,
    rank=32
)

# Apply recommended LR
training_client.optim_step(
    types.AdamParams(learning_rate=recommended_lr)
)

Formula Validation

The learning rate formula has been validated across diverse supervised fine-tuning experiments with varying:

  • 📊 Datasets: multiple domains
  • 📈 Dataset sizes: 100 to 100K examples
  • 🔢 Batch sizes: 16 to 512
  • ⚙️ LoRA ranks: 8 to 128

Regret Analysis

We define regret as the performance gap between using our recommended LR and the optimal LR found via exhaustive search:

regret(lr') = [loss(lr') − min_lr loss(lr)] / min_lr loss(lr)

Our formula achieves <0.5% regret across tested scenarios, meaning the recommended LR performs within 0.5% of the theoretically optimal LR without expensive hyperparameter sweeps.
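A minimal sketch of the regret calculation; the sweep losses below are made-up numbers for illustration, not measured results.

```python
def regret(loss_at_lr: float, sweep_losses: list[float]) -> float:
    """regret(lr') = (loss(lr') - min over lr of loss) / min over lr of loss."""
    best = min(sweep_losses)
    return (loss_at_lr - best) / best

# Illustrative sweep: losses at several candidate LRs,
# with the recommended LR reaching 1.503
sweep = [1.62, 1.55, 1.50, 1.51, 1.58]
print(f"regret = {regret(1.503, sweep):.4%}")  # well under the 0.5% threshold
```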

Practical Implication

You can use get_lr(model_name) as your starting point and achieve near-optimal performance without manual tuning. This saves significant time and computational resources in production training pipelines.

Batch Size Optimization

Batch size is the second-most important hyperparameter, significantly affecting both training efficiency and final performance. The relationship between batch size and learning rate has important implications.

Perfect Scaling Regime

For small batch sizes, there's a phenomenon of perfect scaling where LR and batch size should be varied together:

LR ∝ √B

In this regime, the learning curve depends only on LR/√B. See Shallue et al. (2018) for theoretical foundations in the training-from-scratch setting.

Note: When fine-tuning LLMs, we're often outside the perfect scaling regime. Smaller batch sizes frequently give better final performance, at the cost of longer training time.
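In the perfect scaling regime, LR/√B is held constant when the batch size changes. A minimal sketch, with an illustrative reference LR and batch size:

```python
import math

def scaled_lr(ref_lr: float, ref_batch: int, new_batch: int) -> float:
    """Rescale LR with sqrt of batch size so that LR / sqrt(B) stays constant."""
    return ref_lr * math.sqrt(new_batch / ref_batch)

# Illustrative: LR tuned at batch size 128, moving to batch size 512
print(scaled_lr(3.2e-4, 128, 512))  # quadrupling B doubles the LR to 6.4e-4
```

Remember the caveat above: LLM fine-tuning often sits outside this regime, so treat sqrt-scaled values as starting points rather than guarantees.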

Batch Size Recommendations

For supervised learning fine-tuning with Bios:

Recommended Batch Sizes

  • Standard Recommendation: 128 — best performance vs training time trade-off
  • High Performance: 32-64 — maximum quality, slower training
  • Fast Training: 256-512 — faster convergence, may sacrifice quality

⚠️ Batch Size Guidance

Batch size recommendations are based on preliminary findings and ongoing research. We're continuing to refine these guidelines. For best results, consider testing a few batch sizes (64, 128, 256) on your specific use case.
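The suggested mini-sweep can be sketched as a small helper. `train_and_eval` is a hypothetical callback that trains with the given batch size and returns a validation loss; it stands in for your own training pipeline.

```python
def pick_batch_size(train_and_eval, candidates=(64, 128, 256)) -> int:
    """Train once per candidate batch size and return the one with lowest val loss."""
    results = {b: train_and_eval(batch_size=b) for b in candidates}
    for b, loss in results.items():
        print(f"batch={b}: val_loss={loss:.4f}")
    return min(results, key=results.get)
```

This picks purely on quality; in practice you would also weigh wall-clock time per step, since larger batches typically train faster.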

Minimum Training Steps

Regardless of batch size, aim for adequate training duration:

  • Minimum Recommended: 100 training steps — sufficient for basic adaptation and simple tasks
  • Optimal Results: 1000+ training steps — best performance for complex tasks and larger datasets
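Step count, dataset size, and batch size are linked: steps per epoch = ceil(N / B). A small sketch with illustrative numbers, useful for checking whether a dataset and batch size combination clears the thresholds above:

```python
import math

def training_steps(num_examples: int, batch_size: int, epochs: int = 1) -> int:
    """Steps needed to cover the dataset `epochs` times at a given batch size."""
    return math.ceil(num_examples / batch_size) * epochs

# Illustrative: 10,000 examples at batch size 128 for 4 epochs
print(training_steps(10_000, 128, epochs=4))  # 316 steps
```

With very small datasets, hitting the 100-step minimum may require multiple epochs or a smaller batch size.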

Complete Hyperparameter Configuration

Putting it all together—a production configuration using validated hyperparameters:

Optimized Training Configuration
import bios
from bios import types
from bios_cookbook.hyperparam_utils import get_lr
import asyncio

async def optimized_training(
    base_model: str,
    training_data: list,
    batch_size: int = 128,
    lora_rank: int = 32,
    num_steps: int = 1000
):
    """
    Production SL training with optimized hyperparameters
    """
    # Initialize
    service_client = bios.ServiceClient()
    training_client = await service_client.create_lora_training_client_async(
        base_model=base_model,
        rank=lora_rank
    )

    # Get validated learning rate
    learning_rate = get_lr(base_model)
    print(f"Model: {base_model}")
    print(f"LoRA Rank: {lora_rank}")
    print(f"Batch Size: {batch_size}")
    print(f"Learning Rate: {learning_rate}")
    print(f"Target Steps: {num_steps}")
    print("-" * 50)

    # Training loop
    for step in range(num_steps):
        # Get batch (get_training_batch is a helper defined elsewhere that
        # samples `batch_size` examples from the training data)
        batch = get_training_batch(training_data, batch_size)

        # Training step with validated LR
        fwd_future = await training_client.forward_backward_async(
            batch, "cross_entropy"
        )
        opt_future = await training_client.optim_step_async(
            types.AdamParams(learning_rate=learning_rate)
        )

        # Get results
        fwd_result = await fwd_future
        await opt_future

        if step % 50 == 0:
            print(f"Step {step}/{num_steps}: Loss = {fwd_result.loss:.4f}")

        # Checkpoint every 250 steps
        if step % 250 == 0 and step > 0:
            checkpoint = training_client.save_state(
                name=f"step_{step}"
            ).result()
            print(f"Saved checkpoint: {checkpoint.path}")

    # Final save
    final_path = training_client.save_state(name="final_model").result().path
    print(f"Training complete! Final model: {final_path}")
    return training_client

# Run with validated hyperparameters (code_examples is your prepared dataset)
asyncio.run(optimized_training(
    base_model="ultrasafe/usf-code",
    training_data=code_examples,
    batch_size=128,
    lora_rank=32,
    num_steps=1000
))

When to Tune Hyperparameters

While our recommended values work well for most cases, certain scenarios benefit from manual tuning:

Learning Rate Tuning

Consider manual LR tuning when:

  • Training diverges or shows instability
  • Loss plateaus earlier than expected
  • Using non-standard model architectures
  • Combining SL with other training objectives

Tuning range: Try [0.5x, 1.0x, 2.0x] of recommended LR
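The suggested tuning range can be generated programmatically; the recommended LR used below is illustrative.

```python
def lr_candidates(recommended: float,
                  multipliers=(0.5, 1.0, 2.0)) -> list[float]:
    """Build the suggested LR sweep around the recommended value."""
    return [recommended * m for m in multipliers]

# Illustrative recommended LR of 3.2e-4
print(lr_candidates(3.2e-4))  # candidates: 1.6e-4, 3.2e-4, 6.4e-4
```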

Batch Size Tuning

Experiment with batch size when:

  • You need faster iteration (increase batch size)
  • Final performance is more important than speed (decrease batch size)
  • Memory constraints limit your options
  • Dataset is extremely small or large

Tuning range: Try [64, 128, 256] and measure quality vs speed trade-off

LoRA Rank Tuning

Adjust LoRA rank based on:

  • Dataset size (larger datasets benefit from higher rank)
  • Task complexity (complex reasoning may need higher rank)
  • Training type (RL: 8-16, SL: 32-128)
  • Memory constraints (lower rank = less memory)

Remember: the optimal LR is independent of rank — use the same LR for all rank experiments.

Quick Reference: Hyperparameter Defaults

| Hyperparameter | Recommended Value | How to Get | When to Tune |
|---|---|---|---|
| Learning Rate | get_lr(model) | get_lr("ultrasafe/usf-finance") | Training diverges or plateaus early |
| Batch Size | 128 | Fixed value | Balance performance vs speed |
| LoRA Rank | 32 (SL), 8-16 (RL) | Based on task type | Large datasets or complex tasks |
| Training Steps | 1000+ | Based on dataset size | Monitor validation loss |