LoRA Configuration
Bios supports LoRA (Low-Rank Adaptation) fine-tuning, which trains a small set of added parameters instead of updating every weight as full fine-tuning does. This approach dramatically reduces computational requirements while maintaining performance for most enterprise use cases.
LoRA vs Full Fine-Tuning
LoRA provides parameter-efficient fine-tuning by training low-rank adapter matrices instead of updating all model weights. This reduces memory requirements and training time while achieving comparable performance to full fine-tuning for most scenarios.
📚 Research: See LoRA Without Regret for detailed experimental results and theoretical justification
LoRA Performance Characteristics
Understanding when LoRA performs equivalently to full fine-tuning helps you choose the right approach:
✓ LoRA Excels At:
- Reinforcement Learning: Equivalent performance to full fine-tuning even with small ranks. RL requires very low capacity.
- Small-to-Medium SL Datasets: Instruction-tuning and reasoning datasets where LoRA matches full fine-tuning performance.
- Domain Adaptation: Fine-tuning UltraSafe expert models on enterprise-specific data and terminology.
⚠️ LoRA Limitations:
- Large Dataset SL: When dataset size exceeds LoRA capacity, training efficiency degrades. This is not a hard floor; performance declines gradually.
- Large Batch Sizes: LoRA is less tolerant of very large batches than full fine-tuning. This is inherent to the product-of-matrices parametrization.
- Attention-Only LoRA: Applying LoRA only to attention layers underperforms. Apply it to all weight matrices (attention + MLP) for best results.
Best Practice
Apply LoRA to all weight matrices (attention layers + MLP layers + MoE layers where applicable), not just attention. This provides better performance even when matching total parameter count.
What is LoRA Exactly?
LoRA (Low-Rank Adaptation) modifies weight matrices using low-rank decomposition. Instead of updating the full weight matrix, LoRA adds a low-rank update that captures the essential adaptations needed for your task.
Mathematical Formulation
Given an original weight matrix W, LoRA replaces it with:

W' = W + BA

where B and A are low-rank matrices. If W is an n × n matrix, then:

- B is an n × r matrix
- A is an r × n matrix
- r is the rank (default: 32 in Bios)

The rank r controls the capacity of the adaptation. Only the low-rank matrices B and A are trained; the original weights W remain frozen.
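To make the parameter savings concrete, here is a minimal NumPy sketch of the update. The dimensions are illustrative only; real hidden sizes are larger:

```python
import numpy as np

n, r = 1024, 32  # illustrative hidden size; r = 32 is the Bios default rank

W = np.random.randn(n, n)           # frozen pretrained weight (never updated)
B = np.zeros((n, r))                # trained; zero-initialized so W' = W at step 0
A = np.random.randn(r, n) * 0.01    # trained; small random initialization

W_prime = W + B @ A                 # the adapted weight used in the forward pass

full_params = n * n                 # parameters a full fine-tune would update
lora_params = n * r + r * n         # parameters LoRA actually trains
print(f"Full: {full_params:,}  LoRA: {lora_params:,}  "
      f"reduction: {full_params / lora_params:.0f}x")
# Full: 1,048,576  LoRA: 65,536  reduction: 16x
```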
Conceptual Understanding
While LoRA performs a low-rank approximation mathematically, it is often more useful to think of it as a random projection of the parameter space that happens to be efficient to implement. When training with RL or small SL datasets, you are learning only a limited amount of information, so this reduced parameter set is sufficient.
Critical Hyperparameter: Learning Rate
The learning rate is typically the most important hyperparameter in ML experiments. For LoRA, learning rate selection requires special attention.
⚠️ Common Mistake
LoRA requires a much larger learning rate than full fine-tuning, typically 20-100x larger. Many practitioners mistakenly use their full fine-tuning LR when switching to LoRA, leading to poor performance and incorrect conclusions about LoRA's effectiveness.
Calculate the Correct LoRA Learning Rate
Bios provides a utility to calculate the LR scaling factor for LoRA:
```python
from bios_cookbook.hyperparam_utils import get_lora_lr_over_full_finetune_lr

# Get LR scaling factor for UltraSafe model
model_name = "ultrasafe/usf-finance"
lr_scale_factor = get_lora_lr_over_full_finetune_lr(model_name)

print(f"LoRA LR scale factor: {lr_scale_factor}x")

# Example: if your full fine-tuning LR is 1e-5
full_ft_lr = 1e-5
lora_lr = full_ft_lr * lr_scale_factor

print(f"Full FT LR: {full_ft_lr}")
print(f"LoRA LR: {lora_lr}")
```

LR Scaling by Model
The scaling factor varies by model architecture and capacity; the utility above returns the exact value for your base model.
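To tabulate the factor for the UltraSafe models used in this guide, a short loop works (model list drawn from the examples in this document):

```python
from bios_cookbook.hyperparam_utils import get_lora_lr_over_full_finetune_lr

# Print the LoRA LR scale factor for each model used in this guide
for model in [
    "ultrasafe/usf-finance",
    "ultrasafe/usf-healthcare",
    "ultrasafe/usf-code",
    "ultrasafe/usf-conversation",
]:
    print(f"{model}: {get_lora_lr_over_full_finetune_lr(model)}x")
```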
What Rank to Use?
The default rank used by Bios is 32, which works well for most use cases. However, for supervised learning on large datasets, you should consider using a larger rank.
Rank Selection Guidelines
Reinforcement Learning
Small ranks give equivalent performance to larger ranks and full fine-tuning.
Recommended: rank = 8 to 16
Supervised Learning
Rank should scale with dataset size. Ensure LoRA parameters ≥ completion tokens for best results.
Recommended: rank = 32 to 128
Calculate LoRA Parameters
Use Bios utilities to calculate the number of trainable parameters for a given rank:
```python
from bios_cookbook.hyperparam_utils import get_lora_param_count

# Calculate LoRA parameters for UltraSafe model
model_name = "ultrasafe/usf-healthcare"
lora_rank = 32

param_count = get_lora_param_count(model_name, lora_rank=lora_rank)
print(f"LoRA parameters (rank={lora_rank}): {param_count:,}")

# Compare different ranks
for rank in [8, 16, 32, 64, 128]:
    params = get_lora_param_count(model_name, lora_rank=rank)
    print(f"Rank {rank:3d}: {params:,} parameters")
```

Rule of Thumb for SL
As a rough approximation, LoRA gives good supervised learning results when the number of LoRA parameters is at least the number of completion tokens (i.e., tokens with weight = 1). This ensures sufficient capacity to learn your dataset; a quick capacity check is sketched below.
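As an illustration, here is a minimal capacity check. The `examples` structure is hypothetical (a list of tokenized examples with per-token loss weights); adapt the counting to however your data is actually stored:

```python
from bios_cookbook.hyperparam_utils import get_lora_param_count

# Hypothetical dataset: each example carries token IDs and per-token weights
examples = [
    {"tokens": [101, 2023, 2003, 102], "weights": [0, 0, 1, 1]},
    # ... the rest of your dataset
]

# Count completion tokens (tokens with weight = 1)
completion_tokens = sum(sum(ex["weights"]) for ex in examples)

# Find the smallest rank satisfying: LoRA parameters >= completion tokens
for rank in [8, 16, 32, 64, 128]:
    params = get_lora_param_count("ultrasafe/usf-healthcare", lora_rank=rank)
    verdict = "OK" if params >= completion_tokens else "too small"
    print(f"Rank {rank:3d}: {params:,} params vs {completion_tokens:,} tokens -> {verdict}")
```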
Configuring LoRA in Bios
Set LoRA parameters when creating your training client:
```python
import bios

service_client = bios.ServiceClient()

# Configure LoRA parameters
training_client = service_client.create_lora_training_client(
    base_model="ultrasafe/usf-code",
    rank=32,  # LoRA rank (r)
    alpha=64,  # LoRA alpha (scaling factor)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    dropout=0.05,
)

# Training proceeds normally
for batch in dataloader:
    training_client.forward_backward(batch, "cross_entropy")
    training_client.optim_step()
```

LoRA Configuration Parameters
| Parameter | Default | Description | 
|---|---|---|
| rank | 32 | Rank of low-rank matrices. Controls adapter capacity. | 
| alpha | 64 | Scaling factor for LoRA weights. Typically set to 2 × rank. | 
| target_modules | all layers | Which weight matrices to apply LoRA to. Use all for best performance. | 
| dropout | 0.05 | Dropout rate for LoRA layers. Prevents overfitting. | 
Advanced LoRA Configuration
Fine-tune LoRA settings for specific use cases and model architectures:
```python
import bios

service_client = bios.ServiceClient()

# Healthcare: Higher rank for complex medical reasoning
healthcare_client = service_client.create_lora_training_client(
    base_model="ultrasafe/usf-healthcare",
    rank=64,  # Higher capacity for medical knowledge
    alpha=128,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["lm_head"],  # Also fine-tune the output layer
    dropout=0.1,
)

# Finance: Standard rank for financial analysis
finance_client = service_client.create_lora_training_client(
    base_model="ultrasafe/usf-finance",
    rank=32,
    alpha=64,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    dropout=0.05,
)

# Code: Lower rank sufficient for syntax patterns
code_client = service_client.create_lora_training_client(
    base_model="ultrasafe/usf-code",
    rank=16,
    alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    dropout=0.03,
)

# RL: Very low rank works well
rl_client = service_client.create_lora_training_client(
    base_model="ultrasafe/usf-conversation",
    rank=8,  # RL requires minimal capacity
    alpha=16,
    target_modules=["q_proj", "v_proj"],  # Can even use attention-only for RL
    dropout=0.0,  # No dropout for RL
)
```

Important Observations
Learning Rate Independence from Rank
The optimal learning rate does not depend on the LoRA rank. You can train with different ranks using the same LR and get identical learning curves for the initial training steps. Only the final convergence capacity differs.
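To illustrate, here is a minimal rank-sweep sketch using the client API shown above. The fixed LR value and model choice are illustrative, not recommendations; keeping alpha = 2 × rank holds the effective scale alpha/r constant across ranks:

```python
import bios
from bios import types

service_client = bios.ServiceClient()
lora_lr = 4e-4  # one illustrative LR shared across every rank

# One training client per rank, all otherwise identical
clients = {
    rank: service_client.create_lora_training_client(
        base_model="ultrasafe/usf-finance",
        rank=rank,
        alpha=2 * rank,
    )
    for rank in [8, 32, 128]
}

# `dataloader` is assumed defined as in the earlier examples.
# Early loss curves should overlap; only final capacity differs.
for batch in dataloader:
    for rank, client in clients.items():
        client.forward_backward(batch, "cross_entropy")
        client.optim_step(types.AdamParams(learning_rate=lora_lr))
```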
Apply LoRA to All Layers
Even when matching parameter counts, applying LoRA to all weight matrices (attention + MLP + MoE) outperforms attention-only LoRA. The additional layers capture important feature transformations.
Batch Size Sensitivity
LoRA is more sensitive to very large batch sizes than full fine-tuning. This is a property of the product-of-matrices parametrization. If you encounter training instability, try reducing batch size rather than increasing rank.
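If your pipeline delivers large batches, one simple mitigation is to split each batch into smaller chunks and take an optimizer step per chunk, as in this sketch. The `split_batch` helper and chunk count are hypothetical; Bios may offer its own batching controls:

```python
from bios import types

# Hypothetical helper: split one oversized batch into n smaller batches
def split_batch(batch, n_chunks):
    chunk_size = len(batch) // n_chunks
    return [batch[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]

# `training_client`, `dataloader`, and `lora_lr` are as in the examples above.
for big_batch in dataloader:
    # One optimizer step per smaller chunk instead of one very large step
    for chunk in split_batch(big_batch, n_chunks=4):
        training_client.forward_backward(chunk, "cross_entropy")
        training_client.optim_step(types.AdamParams(learning_rate=lora_lr))
```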
Complete LoRA Configuration Example
Here's a complete example showing proper LoRA configuration with LR scaling:
```python
import bios
from bios import types
from bios_cookbook.hyperparam_utils import (
    get_lora_lr_over_full_finetune_lr,
    get_lora_param_count,
)

# Model selection
base_model = "ultrasafe/usf-finance"
lora_rank = 32

# Calculate appropriate learning rate
# If your full fine-tuning LR was 1e-5:
full_ft_lr = 1e-5
lr_scale = get_lora_lr_over_full_finetune_lr(base_model)
lora_lr = full_ft_lr * lr_scale

print(f"Full FT LR: {full_ft_lr}")
print(f"LoRA LR scale: {lr_scale}x")
print(f"LoRA LR: {lora_lr}")

# Calculate parameter count
param_count = get_lora_param_count(base_model, lora_rank=lora_rank)
print(f"Trainable parameters: {param_count:,}")

# Create training client with optimized config
service_client = bios.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model=base_model,
    rank=lora_rank,
    alpha=lora_rank * 2,  # Common practice: alpha = 2 * rank
    target_modules=[
        # Attention layers
        "q_proj", "v_proj", "k_proj", "o_proj",
        # MLP layers
        "gate_proj", "up_proj", "down_proj",
    ],
    dropout=0.05,
)

# Training loop with correct LR
for step, batch in enumerate(dataloader):
    training_client.forward_backward(batch, "cross_entropy")
    training_client.optim_step(
        types.AdamParams(learning_rate=lora_lr)
    )

print("Training complete with optimal LoRA configuration!")
```

Next Steps
Apply your LoRA knowledge to advanced training techniques.