LoRA Configuration

Bios supports LoRA (Low-Rank Adaptation) fine-tuning, which trains a small set of adapter parameters instead of updating every model weight as full fine-tuning does. This approach dramatically reduces computational requirements while maintaining performance for most enterprise use cases.

LoRA vs Full Fine-Tuning

LoRA provides parameter-efficient fine-tuning by training low-rank adapter matrices instead of updating all model weights. This reduces memory requirements and training time while achieving comparable performance to full fine-tuning for most scenarios.

📚 Research: See LoRA Without Regret for detailed experimental results and theoretical justification

LoRA Performance Characteristics

Understanding when LoRA performs equivalently to full fine-tuning helps you choose the right approach:

✓ LoRA Excels At:

  • Reinforcement Learning: Equivalent performance to full fine-tuning even with small ranks. RL requires very low capacity.
  • Small-to-Medium SL Datasets: Instruction-tuning and reasoning datasets where LoRA matches full fine-tuning performance.
  • Domain Adaptation: Fine-tuning UltraSafe expert models on enterprise-specific data and terminology.

⚠️ LoRA Limitations:

  • Large Dataset SL: When the dataset size exceeds LoRA capacity, training efficiency degrades. This is not a hard cutoff; performance falls off gradually as the dataset outgrows the adapter.
  • Large Batch Sizes: LoRA is less tolerant of very large batches compared to full fine-tuning. This is inherent to the product-of-matrices parametrization.
  • Attention-Only LoRA: Applying LoRA only to attention layers underperforms. Apply to all weight matrices (attention + MLP) for best results.

Best Practice

Apply LoRA to all weight matrices (attention layers + MLP layers + MoE layers where applicable), not just attention. This provides better performance even when matching total parameter count.

What is LoRA Exactly?

LoRA (Low-Rank Adaptation) modifies weight matrices using low-rank decomposition. Instead of updating the full weight matrix, LoRA adds a low-rank update that captures the essential adaptations needed for your task.

Mathematical Formulation

Given an original weight matrix W, LoRA replaces it with:

W′ = W + BA

Where B and A are low-rank matrices. If W is an n × n matrix, then:

  • B is an n × r matrix
  • A is an r × n matrix
  • r is the rank (default: 32 in Bios)

The rank r controls the capacity of the adaptation. Only the low-rank matrices B and A are trained—the original weights W remain frozen.
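
To make the shapes concrete, here is a minimal NumPy sketch of the low-rank update (illustrative only, not the Bios API); the hidden size n and rank r are example values:

Low-Rank Update Sketch
import numpy as np

n, r = 1024, 32  # hidden size and LoRA rank (illustrative values)

W = np.random.randn(n, n)          # frozen pretrained weight; never updated
A = np.random.randn(r, n) * 0.01   # LoRA "A" matrix, small random init
B = np.zeros((n, r))               # LoRA "B" matrix, zero init so BA starts at 0

# Effective weight used in the forward pass: W' = W + BA
W_prime = W + B @ A

# Only B and A are trained: 2 * n * r parameters instead of n * n
print(f"Trainable LoRA parameters: {A.size + B.size:,} vs full matrix: {W.size:,}")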

Conceptual Understanding

While LoRA uses low-rank approximation mathematically, it's more useful to think of it as a random projection of the parameter space that's efficient to implement. When training with RL or small SL datasets, you're only learning limited information, and this reduced parameter set is sufficient.

Critical Hyperparameter: Learning Rate

The learning rate is typically the most important hyperparameter in ML experiments. For LoRA, learning rate selection requires special attention.

⚠️ Common Mistake

LoRA requires a much larger learning rate than full fine-tuning, typically 20-100x larger. Many practitioners mistakenly use their full fine-tuning LR when switching to LoRA, leading to poor performance and incorrect conclusions about LoRA's effectiveness.

Calculate the Correct LoRA Learning Rate

Bios provides a utility to calculate the LR scaling factor for LoRA:

Calculate LoRA Learning Rate
from bios_cookbook.hyperparam_utils import get_lora_lr_over_full_finetune_lr

# Get LR scaling factor for UltraSafe model
model_name = "ultrasafe/usf-finance"
lr_scale_factor = get_lora_lr_over_full_finetune_lr(model_name)

print(f"LoRA LR scale factor: {lr_scale_factor}x")

# Example: If your full fine-tuning LR is 1e-5
full_ft_lr = 1e-5
lora_lr = full_ft_lr * lr_scale_factor

print(f"Full FT LR: {full_ft_lr}")
print(f"LoRA LR: {lora_lr}")

LR Scaling by Model

The scaling factor varies by model architecture and capacity:

Model class                  LR scale factor
UltraSafe Mini models        ~32x
UltraSafe Expert models      ~64-96x
Large capacity models        ~128x
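
As a rough sketch, you can query the scale factor for each model you plan to use and compare (the model identifiers below are the ones used elsewhere in this guide; substitute your own):

Compare LR Scale Factors
from bios_cookbook.hyperparam_utils import get_lora_lr_over_full_finetune_lr

# Illustrative model identifiers; substitute the models you actually use
models = ["ultrasafe/usf-finance", "ultrasafe/usf-healthcare", "ultrasafe/usf-code"]

for name in models:
    scale = get_lora_lr_over_full_finetune_lr(name)
    print(f"{name}: LoRA LR = full fine-tuning LR x {scale}")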

What Rank to Use?

The default rank used by Bios is 32, which works well for most use cases. However, for supervised learning on large datasets, you should consider using a larger rank.

Rank Selection Guidelines

Reinforcement Learning

Small ranks give equivalent performance to larger ranks and full fine-tuning.

Recommended: rank = 8 to 16

Supervised Learning

Rank should scale with dataset size. Ensure LoRA parameters ≥ completion tokens for best results.

Recommended: rank = 32 to 128

Calculate LoRA Parameters

Use Bios utilities to calculate the number of trainable parameters for a given rank:

Calculate Parameter Count
from bios_cookbook.hyperparam_utils import get_lora_param_count

# Calculate LoRA parameters for UltraSafe model
model_name = "ultrasafe/usf-healthcare"
lora_rank = 32

param_count = get_lora_param_count(model_name, lora_rank=lora_rank)
print(f"LoRA parameters (rank={lora_rank}): {param_count:,}")

# Compare different ranks
for rank in [8, 16, 32, 64, 128]:
    params = get_lora_param_count(model_name, lora_rank=rank)
    print(f"Rank {rank:3d}: {params:,} parameters")

Rule of Thumb for SL

As a rough approximation, LoRA gives good supervised learning results when the number of LoRA parameters is at least the number of completion tokens (i.e., tokens with weight=1). This ensures the adapter has enough capacity to learn your dataset; the sketch below shows one way to check this before training.
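
As an illustrative sketch built on the utility above (the completion-token count is a placeholder; substitute the weight=1 token count of your dataset):

Check Rank Against Dataset Size
from bios_cookbook.hyperparam_utils import get_lora_param_count

base_model = "ultrasafe/usf-finance"
completion_tokens = 40_000_000  # placeholder: your dataset's weight=1 token count

# Find ranks whose LoRA parameter count covers the completion tokens
for rank in [8, 16, 32, 64, 128, 256]:
    params = get_lora_param_count(base_model, lora_rank=rank)
    status = "sufficient" if params >= completion_tokens else "likely too small"
    print(f"rank={rank:3d}: {params:,} LoRA parameters ({status})")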

Configuring LoRA in Bios

Set LoRA parameters when creating your training client:

Basic LoRA Configuration
import bios

service_client = bios.ServiceClient()

# Configure LoRA parameters
training_client = service_client.create_lora_training_client(
    base_model="ultrasafe/usf-code",
    rank=32,  # LoRA rank (r)
    alpha=64,  # LoRA alpha (scaling factor)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    dropout=0.05
)

# Training proceeds normally
for batch in dataloader:
    training_client.forward_backward(batch, "cross_entropy")
    training_client.optim_step()

LoRA Configuration Parameters

Parameter        Default      Description
rank             32           Rank of the low-rank matrices. Controls adapter capacity.
alpha            64           Scaling factor for LoRA weights. Typically set to 2 × rank.
target_modules   all layers   Which weight matrices to apply LoRA to. Use all for best performance.
dropout          0.05         Dropout rate for LoRA layers. Prevents overfitting.

Advanced LoRA Configuration

Fine-tune LoRA settings for specific use cases and model architectures:

Domain-Specific LoRA Configuration
import bios

service_client = bios.ServiceClient()

# Healthcare: Higher rank for complex medical reasoning
healthcare_client = service_client.create_lora_training_client(
    base_model="ultrasafe/usf-healthcare",
    rank=64,  # Higher capacity for medical knowledge
    alpha=128,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["lm_head"],  # Also fine-tune output layer
    dropout=0.1
)

# Finance: Standard rank for financial analysis
finance_client = service_client.create_lora_training_client(
    base_model="ultrasafe/usf-finance",
    rank=32,
    alpha=64,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    dropout=0.05
)

# Code: Lower rank sufficient for syntax patterns
code_client = service_client.create_lora_training_client(
    base_model="ultrasafe/usf-code",
    rank=16,
    alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    dropout=0.03
)

# RL: Very low rank works well
rl_client = service_client.create_lora_training_client(
    base_model="ultrasafe/usf-conversation",
    rank=8,  # RL requires minimal capacity
    alpha=16,
    target_modules=["q_proj", "v_proj"],  # Can even use attention-only for RL
    dropout=0.0  # No dropout for RL
)

Important Observations

Learning Rate Independence from Rank

The optimal learning rate does not depend on the LoRA rank. You can train with different ranks using the same LR and get identical learning curves for the initial training steps. Only the final convergence capacity differs.
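
A minimal sketch of how you might verify this on your own data, using the same client API as above (the learning rate and dataloader are placeholders): train clients at several ranks with one shared LoRA LR and compare the early loss curves.

Rank Sweep at a Fixed LR
import bios
from bios import types

service_client = bios.ServiceClient()
lora_lr = 3e-4  # placeholder: one LoRA LR shared across every rank

# Same learning rate for every rank; only adapter capacity changes
clients = {
    rank: service_client.create_lora_training_client(
        base_model="ultrasafe/usf-finance",
        rank=rank,
        alpha=rank * 2,  # keep alpha = 2 x rank, as elsewhere in this guide
    )
    for rank in [8, 32, 128]
}

# Early-step losses should be nearly indistinguishable across ranks
for rank, client in clients.items():
    for batch in dataloader:  # your own dataloader
        client.forward_backward(batch, "cross_entropy")
        client.optim_step(types.AdamParams(learning_rate=lora_lr))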

Apply LoRA to All Layers

Even when matching parameter counts, applying LoRA to all weight matrices (attention + MLP + MoE) outperforms attention-only LoRA. The additional layers capture important feature transformations.

Batch Size Sensitivity

LoRA is more sensitive to very large batch sizes than full fine-tuning. This is a property of the product-of-matrices parametrization. If you encounter training instability, try reducing batch size rather than increasing rank.

Complete LoRA Configuration Example

Here's a complete example showing proper LoRA configuration with LR scaling:

Production LoRA Setup
import bios
from bios import types
from bios_cookbook.hyperparam_utils import (
    get_lora_lr_over_full_finetune_lr,
    get_lora_param_count
)

# Model selection
base_model = "ultrasafe/usf-finance"
lora_rank = 32

# Calculate appropriate learning rate
# If your full fine-tuning LR was 1e-5:
full_ft_lr = 1e-5
lr_scale = get_lora_lr_over_full_finetune_lr(base_model)
lora_lr = full_ft_lr * lr_scale

print(f"Full FT LR: {full_ft_lr}")
print(f"LoRA LR scale: {lr_scale}x")
print(f"LoRA LR: {lora_lr}")

# Calculate parameter count
param_count = get_lora_param_count(base_model, lora_rank=lora_rank)
print(f"Trainable parameters: {param_count:,}")

# Create training client with optimized config
service_client = bios.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model=base_model,
    rank=lora_rank,
    alpha=lora_rank * 2,  # Common practice: alpha = 2 * rank
    target_modules=[
        # Attention layers
        "q_proj", "v_proj", "k_proj", "o_proj",
        # MLP layers
        "gate_proj", "up_proj", "down_proj"
    ],
    dropout=0.05
)

# Training loop with correct LR (batches come from your own dataloader)
for batch in dataloader:
    training_client.forward_backward(batch, "cross_entropy")
    training_client.optim_step(
        types.AdamParams(learning_rate=lora_lr)
    )

print("Training complete with optimal LoRA configuration!")