Prompt Distillation
Prompt distillation is a training technique that optimizes a model to behave as though it had been provided with a long, complex prompt—without requiring that prompt during inference. This dramatically reduces token overhead while preserving the behavioral guidance encoded in detailed instructions.
Two-Step Process
Prompt distillation proceeds in two steps: first, the teacher model generates responses to a set of queries while conditioned on the full prompt; second, the student model is fine-tuned on the resulting query-response pairs with the prompt removed.
Mathematical Overview
Let f_T and f_S denote the teacher and student models, respectively. Given an instruction prompt P and a query q_i, the teacher generates a response:
Distillation Formulation
Teacher generation:
    r_i ~ f_T( · | P ⊕ q_i)
Distillation dataset (query-response pairs, excluding the original prompt P):
    D = { (q_i, r_i) : i = 1, …, N }
Student training objective (the student minimizes cross-entropy loss to match the teacher outputs):
    L(θ_S) = − Σ_{i=1}^{N} log f_S(r_i | q_i; θ_S)
Key Insight
The prompt P is concatenated with each query for teacher generation, but excluded from the training dataset. The student learns to implicitly reproduce the prompt's behavioral guidance through supervised learning on teacher outputs.
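As a minimal illustration (content abridged and purely illustrative), one query yields a teacher input that contains P and a student training example that does not:

# Illustrative sketch of the asymmetry between teacher input and student target
P = "You are an expert financial analyst. For every query, provide ..."  # the long prompt
query = "Analyze Tesla's Q3 2024 earnings report"

# The teacher is conditioned on the full prompt plus the query
teacher_input = [
    {"role": "system", "content": P},
    {"role": "user", "content": query},
]
teacher_response = "..."  # whatever the teacher generates from teacher_input

# The student trains on the query and the teacher's response only; P is dropped
student_example = {
    "messages": [
        {"role": "user", "content": query},
        {"role": "assistant", "content": teacher_response},
    ]
}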
Example: Financial Analysis Distillation
The Bios Cookbook provides a prompt distillation recipe. We'll demonstrate with a financial analysis task, distilling a detailed analyst prompt into model weights.
Step 1: Generate Training Data
Create distillation data using the teacher model with a detailed financial analysis prompt:
# Run the data generation script
python -m bios_cookbook.recipes.prompt_distillation.create_data \
  output_file=/tmp/bios-datasets/financial_analysis_distilled.jsonl \
  teacher_model=ultrasafe/usf-finance \
  num_examples=1000

What This Command Does:
- Uses the configured teacher model (ultrasafe/usf-finance) with the detailed analyst prompt
- Generates financial analysis examples on diverse queries
- Saves the distilled dataset to the specified output file (JSONL format; see the example check below)
- Creates training examples suitable for student model fine-tuning
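Before training, it can be worth sanity-checking the first record of the generated file; this assumes the recipe writes the same chat-style messages format used in the complete pipeline later in this section:

# Inspect the first generated training example (path from the command above)
import json

with open("/tmp/bios-datasets/financial_analysis_distilled.jsonl") as f:
    first = json.loads(f.readline())

# Expected shape: {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
print(first["messages"][0]["content"])   # the query
print(first["messages"][1]["content"])   # the teacher's distilled response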
Step 2: Train the Student Model
Fine-tune a student model on the distillation data to internalize the prompt guidance:
# Fine-tune student model
python -m bios_cookbook.recipes.prompt_distillation.train \
  data_file=/tmp/bios-datasets/financial_analysis_distilled.jsonl \
  student_model=ultrasafe/usf-finance \
  output_dir=/tmp/bios-models/distilled_analyst

Training Process:
- Loads the generated distillation dataset
- Applies optimized training configurations (validated LR, batch size)
- Fine-tunes the student model using cross-entropy loss
- Saves checkpoints and metrics for evaluation
Step 3: Test Your Distilled Model
Verify the distilled model's performance by sampling without the original prompt:
import bios
from bios import types

# Load distilled student model
service_client = bios.ServiceClient()
sampling_client = service_client.create_sampling_client(
    model_path="bios://distilled_analyst/final"
)

# Tokenizer for encoding the query and decoding the output
tokenizer = sampling_client.get_tokenizer()

# Test with minimal prompt (no lengthy instructions needed!)
test_query = "Analyze Tesla's Q3 2024 earnings report"
prompt = types.ModelInput.from_ints(
    tokenizer.encode(test_query)
)

# Sample from distilled model
result = sampling_client.sample(
    prompt,
    sampling_params=types.SamplingParams(max_tokens=512, temperature=0.3)
).result()

# Model provides detailed analysis without lengthy prompt
print(tokenizer.decode(result.sequences[0].tokens))

✓ Distillation Success
The student model now provides comprehensive financial analysis (with risk metrics, compliance notes, etc.) using just the query—no 500+ token system prompt required. The behavioral guidelines have been internalized into the model weights.
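To verify what distillation bought you, it can help to compare against the base model given the same query, with the base model still relying on the full analyst prompt (TEACHER_PROMPT, defined in the complete pipeline below). A minimal sketch reusing the clients from the test above:

from bios_cookbook import renderers

# Baseline for comparison: the base model, prompted with the full analyst prompt
base_client = service_client.create_sampling_client(base_model="ultrasafe/usf-finance")
base_tokenizer = base_client.get_tokenizer()
renderer = renderers.get_renderer('ultrasafe', base_tokenizer)

baseline_prompt = renderer.build_generation_prompt([
    {"role": "system", "content": TEACHER_PROMPT},
    {"role": "user", "content": test_query},
])
baseline_result = base_client.sample(
    baseline_prompt,
    sampling_params=types.SamplingParams(
        max_tokens=512, temperature=0.3, stop=renderer.get_stop_sequences()
    ),
).result()

# The baseline needs the 500+ token system prompt to produce the same style of
# analysis that the distilled model produced above from the bare query.
print(base_tokenizer.decode(baseline_result.sequences[0].tokens))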
Teacher-Student Model Configuration
The teacher and student models can be identical or different, depending on your requirements:
Same Model (Self-Distillation)
Use the same UltraSafe model as both teacher and student. The student learns to internalize complex prompt instructions into its base weights.
Example:
Teacher: ultrasafe/usf-finance
Student: ultrasafe/usf-finance
Different Models (Cross-Distillation)
Use a larger/more capable model as teacher to generate training data for a smaller/faster student model.
Example:
Teacher: ultrasafe/usf-finance
Student: ultrasafe/usf-mini
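As a sketch, cross-distillation just points the two recipe commands above at different models; the output paths here are illustrative:

# Generate data with the larger teacher model
python -m bios_cookbook.recipes.prompt_distillation.create_data \
  output_file=/tmp/bios-datasets/financial_analysis_distilled.jsonl \
  teacher_model=ultrasafe/usf-finance \
  num_examples=1000

# Train the smaller, faster student on the teacher's outputs
python -m bios_cookbook.recipes.prompt_distillation.train \
  data_file=/tmp/bios-datasets/financial_analysis_distilled.jsonl \
  student_model=ultrasafe/usf-mini \
  output_dir=/tmp/bios-models/distilled_analyst_mini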
Complete Distillation Pipeline
Here's a complete implementation of prompt distillation for financial analysis:
import asyncio
import json

import bios
from bios import types
from bios_cookbook import renderers

# Define detailed teacher prompt (500+ tokens)
TEACHER_PROMPT = """
You are an expert financial analyst. For every query, provide:

1. RISK ASSESSMENT
   - Quantify risk scores (0-100) with confidence intervals
   - Identify systematic and idiosyncratic risks
   - Reference relevant market conditions

2. COMPLIANCE ANALYSIS
   - Cite applicable SEC regulations
   - Note disclosure requirements
   - Flag potential compliance issues

3. QUANTITATIVE METRICS
   - Calculate key financial ratios
   - Provide historical comparisons (3-5 year trends)
   - Include data sources and methodologies

4. EXECUTIVE SUMMARY
   - Lead with 2-3 sentence overview
   - Highlight critical findings
   - Provide clear actionable insights

Use professional terminology and cite sources.
"""

async def generate_distillation_data(
    teacher_model: str,
    queries: list[str],
    output_file: str
):
    """Generate distillation dataset using teacher model"""
    service_client = bios.ServiceClient()

    # Create sampling client for teacher
    teacher_client = service_client.create_sampling_client(
        base_model=teacher_model
    )

    tokenizer = teacher_client.get_tokenizer()
    renderer = renderers.get_renderer('ultrasafe', tokenizer)

    distillation_data = []

    for query in queries:
        # Build teacher prompt with detailed instructions
        teacher_messages = [
            {"role": "system", "content": TEACHER_PROMPT},
            {"role": "user", "content": query}
        ]

        teacher_prompt = renderer.build_generation_prompt(teacher_messages)
        stop_sequences = renderer.get_stop_sequences()

        # Generate teacher response
        teacher_output = await teacher_client.sample_async(
            teacher_prompt,
            sampling_params=types.SamplingParams(
                max_tokens=800,
                temperature=0.3,
                stop=stop_sequences
            ),
            num_samples=1
        )
        teacher_response, _ = renderer.parse_response(
            teacher_output.sequences[0].tokens
        )

        # Create student training example (WITHOUT teacher prompt)
        student_example = {
            "messages": [
                {"role": "user", "content": query},
                {"role": "assistant", "content": teacher_response["content"]}
            ]
        }
        distillation_data.append(student_example)

        print(f"Generated example {len(distillation_data)}/{len(queries)}")

    # Save distillation dataset as JSONL
    with open(output_file, 'w') as f:
        for example in distillation_data:
            f.write(json.dumps(example) + '\n')

    print(f"Saved {len(distillation_data)} examples to {output_file}")

# Generate data
financial_queries = [
    "Analyze Amazon's cloud computing revenue trends",
    "Evaluate Microsoft's AI investment strategy",
    # ... more queries
]

asyncio.run(generate_distillation_data(
    teacher_model="ultrasafe/usf-finance",
    queries=financial_queries,
    output_file="/tmp/financial_distilled.jsonl"
))

Step 2: Train Student Model
# Train student model on distilled data
python -m bios_cookbook.recipes.prompt_distillation.train \
  data_file=/tmp/financial_distilled.jsonl \
  student_model=ultrasafe/usf-finance \
  lora_rank=32 \
  num_epochs=3 \
  log_path=/tmp/distillation_logs

Advanced Configuration
Customize the distillation recipe for different scenarios:
Teacher Model Selection
Choose different base models based on capacity requirements. Larger teacher models can provide higher-quality responses but are more expensive to run.
Sampling Strategies
Adjust temperature and other generation parameters. Lower temperature (0.3-0.5) provides more consistent teacher outputs suitable for distillation.
Data Volume
Scale the number of generated examples based on task complexity. Simple tasks: 100-500 examples. Complex reasoning: 1000-5000 examples.
Training Hyperparameters
Fine-tune learning rate, batch size, and other settings. Use get_lr() for validated starting points.
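A minimal sketch combining these knobs; the get_lr import path and signature are assumptions rather than a documented API, and the sampling parameters mirror those used in the pipeline above:

from bios import types
from bios_cookbook.hyperparam_utils import get_lr  # hypothetical import path

# Low-temperature generation keeps teacher outputs consistent for distillation
teacher_sampling = types.SamplingParams(max_tokens=800, temperature=0.3)

# Validated starting learning rate for the chosen student model
# (assumed signature: model name -> recommended LR)
suggested_lr = get_lr("ultrasafe/usf-finance")
print(f"Suggested starting LR: {suggested_lr}")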
Prompt Distillation Benefits
Token Reduction
Eliminate 500+ token system prompts, reducing each request to a minimal task description
Faster Inference
Shorter prompts mean significantly lower latency per request
Cost Savings
Reduced token consumption translates to lower API costs
Next Steps
Apply prompt distillation to your own use case by swapping your own teacher prompt, queries, and student model into the recipe commands above.