# Evaluation Metrics
OpenEvals implements standard evaluation metrics for LLM benchmarking.

## Evaluation Framework

The evaluation framework consists of:

- **Task-specific metrics**: accuracy, Pass@k, exact match
- **Efficiency metrics**: latency, throughput, memory
- **Statistical analysis**: confidence intervals, significance tests

## Standard Metrics

### Accuracy

For multiple choice tasks (MMLU, ARC, TruthfulQA):

```python
from openevals.utils.metrics import calculate_accuracy

accuracy = calculate_accuracy(correct=650, total=1000)
# Returns: 0.65
```

### Exact Match

For generation tasks (GSM8K, MATH):

```python
from openevals.utils.metrics import exact_match

score = exact_match(prediction="42", target="42")
# Returns: 1.0
```

### Pass@k

For code generation tasks (HumanEval, MBPP):

```python
from openevals.utils.metrics import calculate_pass_at_k

pass_at_1 = calculate_pass_at_k(n_samples=10, n_correct=3, k=1)
pass_at_10 = calculate_pass_at_k(n_samples=10, n_correct=3, k=10)
```
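
For reference, the standard unbiased pass@k estimator from the HumanEval paper is `1 - C(n - c, k) / C(n, k)`, where `n` is the number of samples drawn and `c` the number that pass. The sketch below is presumably equivalent to what `calculate_pass_at_k` computes, assuming it follows that convention:

```python
import math

def pass_at_k(n_samples: int, n_correct: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if n_samples - n_correct < k:
        # Every size-k draw must contain at least one passing sample.
        return 1.0
    return 1.0 - math.comb(n_samples - n_correct, k) / math.comb(n_samples, k)

pass_at_k(n_samples=10, n_correct=3, k=1)   # 0.3
pass_at_k(n_samples=10, n_correct=3, k=10)  # 1.0
```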

## Safety Metrics

### TruthfulQA Scoring

```python
from openevals.utils.metrics import truthfulqa_score

scores = truthfulqa_score(
    predictions,
    mc1_targets,
    mc2_targets
)
# Returns: {"mc1": 0.45, "mc2": 0.52}
```
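
MC1 and MC2 follow the TruthfulQA multiple-choice convention: MC1 checks whether the model's single highest-scoring choice is a true answer, and MC2 is the normalized probability mass assigned to the set of true answers. A per-question sketch of those quantities, assuming per-choice log-likelihood scores and boolean truth labels (not necessarily the exact internals of `truthfulqa_score`):

```python
import numpy as np

def mc1(choice_scores: np.ndarray, is_true: np.ndarray) -> float:
    """MC1: 1.0 if the single highest-scoring choice is a true answer, else 0.0."""
    return float(is_true[int(np.argmax(choice_scores))])

def mc2(choice_scores: np.ndarray, is_true: np.ndarray) -> float:
    """MC2: normalized probability mass assigned to the true answer choices."""
    probs = np.exp(choice_scores - choice_scores.max())  # softmax over per-choice log-likelihoods
    probs = probs / probs.sum()
    return float(probs[is_true.astype(bool)].sum())
```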

## Custom Scoring Components

Create custom metrics by subclassing `MetricBase`:

```python
from openevals.utils.metrics import MetricBase

class CustomMetric(MetricBase):
    def compute(self, predictions, targets):
        # Implement custom scoring logic
        score = self._calculate_score(predictions, targets)
        return {"custom_score": score}
```
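
A usage sketch, assuming `compute` can be called directly with parallel prediction/target lists; `ContainsTarget` and its scoring rule are invented for illustration, and `MetricBase` may require additional methods not shown here:

```python
from openevals.utils.metrics import MetricBase

class ContainsTarget(MetricBase):
    """Hypothetical metric: fraction of predictions that contain the target string."""
    def compute(self, predictions, targets):
        hits = sum(target in prediction for prediction, target in zip(predictions, targets))
        return {"custom_score": hits / max(len(targets), 1)}

metric = ContainsTarget()
metric.compute(["the answer is 42", "not sure"], ["42", "7"])
# {"custom_score": 0.5}
```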

## Statistical Analysis

### Confidence Intervals

Calculate confidence intervals for accuracy:

```python
from openevals.utils.metrics import calculate_confidence_interval

ci_lower, ci_upper = calculate_confidence_interval(
    accuracy=0.65,
    n_samples=1000,
    confidence=0.95
)
# Returns: (0.62, 0.68)
```
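
The returned interval matches the normal-approximation (Wald) formula `p ± z * sqrt(p * (1 - p) / n)`: with p = 0.65, n = 1000, and z ≈ 1.96, the half-width is about 0.03. A sketch of that calculation, assuming `calculate_confidence_interval` uses the normal approximation rather than an exact binomial method:

```python
import math

def wald_interval(p: float, n: int, z: float = 1.96):
    """Normal-approximation (Wald) interval: p +/- z * sqrt(p * (1 - p) / n)."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

wald_interval(0.65, 1000)  # ~(0.620, 0.680)
```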

### Significance Testing

Compare the performance of two models:

```python
from openevals.utils.metrics import paired_bootstrap_test

p_value = paired_bootstrap_test(
    model_a_scores,
    model_b_scores,
    n_bootstrap=10000
)
```
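
A paired bootstrap test resamples per-example score differences with replacement and checks how often model A's observed advantage disappears. A minimal one-sided sketch, which may differ in detail from `paired_bootstrap_test` (e.g. one- vs. two-sided handling):

```python
import numpy as np

def paired_bootstrap_p(model_a_scores, model_b_scores, n_bootstrap=10_000, seed=0):
    """Fraction of bootstrap resamples in which model A's mean advantage over B vanishes."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(model_a_scores, dtype=float) - np.asarray(model_b_scores, dtype=float)
    resampled = rng.choice(diffs, size=(n_bootstrap, diffs.size), replace=True)
    return float((resampled.mean(axis=1) <= 0).mean())
```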

### Aggregating Results

Aggregate results across multiple runs:

```python
from openevals.utils.metrics import aggregate_results

aggregated = aggregate_results(
    results_list,
    method="mean"  # or "median"
)
```
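
Assuming each entry in `results_list` is a `{metric_name: value}` dict (an assumption about the data shape), aggregation amounts to a key-by-key mean or median:

```python
from statistics import mean, median

def aggregate(results_list, method="mean"):
    """Aggregate a list of {metric_name: value} dicts key by key."""
    reduce = mean if method == "mean" else median
    return {key: reduce(run[key] for run in results_list) for key in results_list[0]}

aggregate([{"accuracy": 0.64}, {"accuracy": 0.66}, {"accuracy": 0.65}])
# ~{"accuracy": 0.65}
```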

## Efficiency Metrics

### Latency

Time to generate responses:

```python
from openevals.utils.metrics import measure_latency

latency = measure_latency(model, prompts, output_length=128)
# Returns: {"mean": 0.5, "std": 0.1, "p50": 0.45, "p95": 0.7}
```
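
Latency measurement boils down to timing each generation call and summarizing the distribution. A sketch using a hypothetical `generate_fn(prompt, max_new_tokens=...)` callable, not the actual interface `measure_latency` expects:

```python
import statistics
import time

def latency_stats(generate_fn, prompts, output_length=128):
    """Wall-clock seconds per prompt for an arbitrary generation callable."""
    times = []
    for prompt in prompts:
        start = time.perf_counter()
        generate_fn(prompt, max_new_tokens=output_length)  # hypothetical generation call
        times.append(time.perf_counter() - start)
    times.sort()
    return {
        "mean": statistics.mean(times),
        "std": statistics.stdev(times) if len(times) > 1 else 0.0,
        "p50": times[len(times) // 2],
        "p95": times[min(len(times) - 1, int(0.95 * len(times)))],
    }
```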

### Throughput

Tokens generated per second:

```python
from openevals.utils.metrics import measure_throughput

throughput = measure_throughput(model, prompts)
# Returns: {"tokens_per_second": 128.5}
```

### Memory Utilization

GPU memory usage:

```python
from openevals.utils.metrics import measure_memory

memory = measure_memory(model)
# Returns: {"allocated_gb": 4.2, "reserved_gb": 6.0}
```
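
The same numbers are available directly from PyTorch; this assumes a CUDA device and mirrors the key names above:

```python
import torch

def gpu_memory_gb():
    """Currently allocated vs. reserved CUDA memory, in GiB."""
    gib = 1024 ** 3
    return {
        "allocated_gb": torch.cuda.memory_allocated() / gib,
        "reserved_gb": torch.cuda.memory_reserved() / gib,
    }
```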

## Per-Task Metrics

| Task | Primary Metric | Additional Metrics |
|---|---|---|
| MMLU | Accuracy | Per-subject accuracy |
| GSM8K | Exact Match | Chain-of-thought analysis |
| MATH | Accuracy | Per-difficulty scores |
| HumanEval | Pass@1, Pass@10 | Per-problem pass rates |
| ARC | Accuracy | Easy/Challenge splits |
| TruthfulQA | MC1, MC2 | Truthfulness scores |