Model Evaluations
Bios training scripts automatically print training and test loss. Beyond basic loss metrics, this guide covers two common evaluation workflows: inline evaluations during training and offline evaluations on saved checkpoints.
Inline Evaluations During Training
Add inline evaluations to your training runs by configuring evaluator builders. These run periodically during training to monitor model quality on held-out tasks.
Supervised Fine-Tuning Evals
Configure evaluation frequency in your SFT config:
# Add to supervised training config
config = {
    # Regular evaluations
    "evaluator_builders": [eval_builder_1, eval_builder_2],
    "eval_every": 100,  # Run every 100 steps

    # Infrequent (expensive) evaluations
    "infrequent_evaluator_builders": [expensive_eval],
    "infrequent_eval_every": 500  # Run every 500 steps
}
RL Training Evals
Configure periodic evaluations for RL training:
# Add to RL training config
config = {
    "evaluator_builders": [sampling_eval_1, sampling_eval_2],
    "eval_every": 50  # Run every 50 RL iterations
}
Evaluator Types
- EvaluatorBuilder for SFT
- SamplingClientEvaluator for RL (see the builder sketch below)
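As a rough illustration of how a builder plugs into the configs above, here is a hedged sketch that assumes a builder is a zero-argument callable returning an evaluator instance; check bios_cookbook.evaluators for the exact builder protocol in your version. CustomEvaluator, QA_DATASET, and grader_fn are the illustrative objects defined later in this guide:

# Hedged sketch: assumes evaluator builders are zero-argument callables that
# return an evaluator instance. Verify the builder protocol in
# bios_cookbook.evaluators before relying on this shape.
def accuracy_eval_builder():
    # CustomEvaluator / QA_DATASET / grader_fn are defined later in this guide
    # and stand in for whatever evaluator you want to run during training.
    return CustomEvaluator(
        dataset=QA_DATASET,
        grader_fn=grader_fn,
        model_name="ultrasafe/usf-mini",
        renderer_name="ultrasafe",
    )

config = {
    "evaluator_builders": [accuracy_eval_builder],
    "eval_every": 50,
}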
Offline Evaluations with Inspect AI
Run standard benchmarks on your trained models using the Inspect AI library. Bios provides a script that uses internal sampling functionality.
MODEL_PATH=bios://YOUR_MODEL_PATH_HERE
python -m bios_cookbook.eval.run_inspect_evals \
    model_path=$MODEL_PATH \
    model_name=ultrasafe/usf-mini \
    tasks=inspect_evals/ifeval,inspect_evals/mmlu_0_shot \
    renderer_name=ultrasafe
Supported Evaluation Tasks
Standard benchmarks available through Inspect AI:
- ifeval: Instruction following
- mmlu_0_shot: Multitask understanding
- gsm8k: Math reasoning
See the full list of supported evaluations in the Bios Cookbook documentation.
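For example, to run just the gsm8k benchmark from the list above, reuse the same script invocation (the model path placeholder is illustrative):

MODEL_PATH=bios://YOUR_MODEL_PATH_HERE
python -m bios_cookbook.eval.run_inspect_evals \
    model_path=$MODEL_PATH \
    model_name=ultrasafe/usf-mini \
    tasks=inspect_evals/gsm8k \
    renderer_name=ultrasafe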
Creating Custom Evaluations
Bios recommends two approaches for custom evaluations:
1. Inspect AI Tasks
Create custom tasks using the Inspect AI framework, then run them with the evaluation script above.
2. SamplingClientEvaluator
A lower-level abstraction with fine-grained control over datasets and metrics.
LLM-as-Judge with Inspect AI
The example below creates an evaluation in which an LLM grades the model's responses:
import bios
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate
from bios_cookbook.eval.inspect_utils import InspectAPIFromBiosSampling

# Define QA dataset
QA_DATASET = MemoryDataset(
    name="qa_dataset",
    samples=[
        Sample(input="What is the capital of France?", target="Paris"),
        Sample(input="What is the capital of Italy?", target="Rome"),
    ]
)

# Create Bios sampling client
service_client = bios.ServiceClient()
sampling_client = service_client.create_sampling_client(
    base_model="ultrasafe/usf-mini"
)

# Wrap the sampling client so Inspect AI can use it as a model backend
api = InspectAPIFromBiosSampling(
    renderer_name="ultrasafe",
    model_name="ultrasafe/usf-mini",
    sampling_client=sampling_client,
    verbose=False
)

@task
def example_lm_as_judge() -> Task:
    return Task(
        name="llm_as_judge",
        dataset=QA_DATASET,
        solver=generate(),
        scorer=model_graded_qa(
            instructions="Grade strictly. Respond 'GRADE: C' if correct or 'GRADE: I' otherwise.",
            partial_credit=False
        )
    )
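The api wrapper above is what lets Inspect AI sample from the Bios-backed model. Below is a hedged sketch of running the task with it, assuming InspectAPIFromBiosSampling implements Inspect AI's ModelAPI interface; check bios_cookbook.eval.inspect_utils for the entry point your version actually supports:

from inspect_ai import eval as inspect_eval
from inspect_ai.model import GenerateConfig, Model

# Run the task defined above against the Bios-backed model API.
# Wrapping the api in an Inspect AI Model like this is an assumption;
# your cookbook version may provide a helper that does it for you.
logs = inspect_eval(
    example_lm_as_judge(),
    model=Model(api=api, config=GenerateConfig()),
)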
Custom SamplingClientEvaluator
For fine-grained control, implement your own evaluator class:
from typing import Any, Callable
import bios
from bios import types
from bios_cookbook import renderers
from bios_cookbook.evaluators import SamplingClientEvaluator
from bios_cookbook.tokenizer_utils import get_tokenizer

class CustomEvaluator(SamplingClientEvaluator):
    def __init__(
        self,
        dataset: list[dict[str, Any]],
        grader_fn: Callable[[str, str], bool],
        model_name: str,
        renderer_name: str,
    ):
        self.dataset = dataset
        self.grader_fn = grader_fn
        tokenizer = get_tokenizer(model_name)
        self.renderer = renderers.get_renderer(name=renderer_name, tokenizer=tokenizer)

    async def __call__(self, sampling_client) -> dict[str, float]:
        num_correct = 0
        sampling_params = types.SamplingParams(
            max_tokens=100,
            temperature=0.7,
            stop=self.renderer.get_stop_sequences()
        )

        for datum in self.dataset:
            # Render the user turn into a generation prompt
            model_input = self.renderer.build_generation_prompt(
                [{"role": "user", "content": datum["input"]}]
            )
            result = await sampling_client.sample_async(
                prompt=model_input,
                num_samples=1,
                sampling_params=sampling_params
            )
            # Decode the sampled tokens back into an assistant message
            tokens = result.sequences[0].tokens
            response = self.renderer.parse_response(tokens)[0]

            if self.grader_fn(response["content"], datum["output"]):
                num_correct += 1

        return {"accuracy": num_correct / len(self.dataset)}
Using Custom Evaluators
Example of running a custom evaluator on a model checkpoint:
import asyncio
import bios

# Define test dataset
QA_DATASET = [
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "What is the capital of Germany?", "output": "Berlin"},
    {"input": "What is the capital of Italy?", "output": "Rome"},
]

# Simple grader function: substring match against the target
def grader_fn(response: str, target: str) -> bool:
    return target.lower() in response.lower()

# Create evaluator (CustomEvaluator is defined in the previous section)
evaluator = CustomEvaluator(
    dataset=QA_DATASET,
    grader_fn=grader_fn,
    model_name="ultrasafe/usf-mini",
    renderer_name="ultrasafe"
)

# Run evaluation
service_client = bios.ServiceClient()
sampling_client = service_client.create_sampling_client(
    base_model="ultrasafe/usf-mini"
)

async def main():
    result = await evaluator(sampling_client)
    print(f"Evaluation Results: {result}")
    # Output: {'accuracy': 1.0}

asyncio.run(main())
Evaluation Best Practices
✓ Do
- Run inline evals to catch issues early
- Use multiple diverse evaluation tasks
- Save evaluation results with checkpoints (see the sketch after these lists)
- Test on held-out data, not the training set
- Use Inspect AI for standard benchmarks
- Create domain-specific custom evaluators
✗ Don't
- Run expensive evals too frequently (it slows training)
- Skip evaluation on important checkpoints
- Use only training loss as a quality metric
- Evaluate on training data (an overfitting blind spot)
- Ignore eval failures or anomalies
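One way to follow the "save evaluation results with checkpoints" advice is to append each result, keyed by the checkpoint it came from, to a small local log. A minimal sketch; the log directory and file layout here are hypothetical, not part of Bios:

import json
import time
from pathlib import Path

def save_eval_results(checkpoint_path: str, metrics: dict[str, float],
                      log_dir: str = "eval_logs") -> Path:
    # Append eval metrics, keyed by checkpoint, to a local JSONL log.
    # The directory and layout are illustrative; adapt them to however you
    # track checkpoints in your own workflow.
    out_dir = Path(log_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "eval_results.jsonl"
    record = {
        "checkpoint": checkpoint_path,  # e.g. a bios:// path
        "timestamp": time.time(),
        "metrics": metrics,
    }
    with out_file.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return out_file

# Example: record the accuracy computed by the custom evaluator above
save_eval_results("bios://YOUR_MODEL_PATH_HERE", {"accuracy": 1.0})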
Next Steps
Integrate evaluations into your training workflow: