OpenEvals Documentation

OpenEvals is an open-source evaluation framework for large language models, providing systematic benchmarking across standardized academic tasks.

Note

This project is developed as part of Google Summer of Code 2025 with Google DeepMind.

Overview

OpenEvals provides infrastructure for:

  • Evaluating open-weight models on established benchmarks (MMLU, GSM8K, MATH, HumanEval, ARC, TruthfulQA, and more)

  • Comparing performance across model families (Gemma, Llama, Mistral, Qwen, DeepSeek, and arbitrary HuggingFace models)

  • Measuring computational efficiency metrics (latency, throughput, memory utilization)

  • Generating statistical analyses with confidence intervals (see the sketch after this list)

  • Producing publication-ready visualizations and reports
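
The confidence intervals mentioned above let score differences between models be judged against sampling noise. The snippet below is a minimal sketch of one standard way to compute them, a normal-approximation (Wald) interval over per-example 0/1 scores; the counts are purely illustrative, and OpenEvals' own statistics code may use a different method (for example, bootstrapping).

# Minimal sketch: 95% confidence interval for benchmark accuracy from
# per-example 0/1 scores, using a normal-approximation (Wald) interval.
# Illustrative only; not OpenEvals' internal statistics code.
import math

def accuracy_confidence_interval(scores, z=1.96):
    """Return (accuracy, lower, upper) for a list of 0/1 per-example scores."""
    n = len(scores)
    acc = sum(scores) / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)

# Hypothetical run: 1,187 correct answers on the 1,319 GSM8K test problems.
scores = [1] * 1187 + [0] * (1319 - 1187)
acc, low, high = accuracy_confidence_interval(scores)
print(f"accuracy = {acc:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")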

Supported Models

| Family      | Variants                | Sizes               |
|-------------|-------------------------|---------------------|
| Gemma       | Gemma, Gemma 2, Gemma 3 | 1B - 27B            |
| Llama 3     | Llama 3, 3.1, 3.2       | 1B - 405B           |
| Mistral     | Mistral, Mixtral        | 7B, 8x7B, 8x22B     |
| Qwen        | Qwen 2, Qwen 2.5        | 0.5B - 72B          |
| DeepSeek    | DeepSeek, DeepSeek-R1   | 1.5B - 671B         |
| Phi         | Phi-3                   | Mini, Small, Medium |
| HuggingFace | Any model on Hub        | Custom              |
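
The last row means that, beyond the named families, any causal language model hosted on the Hugging Face Hub can be evaluated. The sketch below shows how such a model can be loaded with the standard transformers API and how the efficiency metrics from the overview (latency, throughput, peak memory) can be measured; it is a generic illustration, not the framework's own loader or instrumentation, and the model ID is only an example.

# Generic sketch: load an arbitrary Hub model and record latency, throughput,
# and peak GPU memory. Assumes a CUDA GPU; gated models need HF_TOKEN set.
# Illustrative only; not OpenEvals' own instrumentation.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # example; any causal LM on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Q: What is 17 * 24?\nA:", return_tensors="pt").to(model.device)

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=64)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"latency: {elapsed:.2f} s")
print(f"throughput: {new_tokens / elapsed:.1f} tokens/s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")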

Supported Benchmarks

| Benchmark  | Category        | Description                                            |
|------------|-----------------|--------------------------------------------------------|
| MMLU       | Knowledge       | 57 subjects spanning STEM, humanities, social sciences |
| GSM8K      | Mathematical    | Grade school math word problems                        |
| MATH       | Mathematical    | Competition math (AMC, AIME, Olympiad)                 |
| HumanEval  | Code Generation | Python function completion tasks                       |
| ARC        | Reasoning       | Science questions (Easy and Challenge splits)          |
| TruthfulQA | Truthfulness    | Questions probing common misconceptions                |
| BBH        | Reasoning       | BIG-Bench Hard: 23 challenging tasks                   |
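
To make the task format concrete, the snippet below loads the GSM8K test split from the Hugging Face Hub and scores a hypothetical model answer by exact match on the final number, which is the usual GSM8K metric. It is a minimal sketch, not OpenEvals' benchmark harness.

# Minimal sketch of GSM8K-style exact-match scoring using the datasets library.
# The model answer is hypothetical; this is not OpenEvals' own harness code.
from datasets import load_dataset

def extract_final_answer(solution: str) -> str:
    # GSM8K reference solutions end with "#### <final answer>".
    return solution.split("####")[-1].strip().replace(",", "")

dataset = load_dataset("gsm8k", "main", split="test")
example = dataset[0]
reference = extract_final_answer(example["answer"])

model_answer = "18"  # hypothetical model output for this question
print("question:", example["question"][:80], "...")
print("reference:", reference, "| correct:", model_answer == reference)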

Quick Start

# Install
git clone https://github.com/heilcheng/openevals.git
cd openevals && pip install -r requirements.txt

# Set Hugging Face authentication (required for gated models such as Gemma and Llama)
export HF_TOKEN=your_token_here

# Run evaluation
python -m openevals.scripts.run_benchmark \
  --config configs/benchmark_config.yaml \
  --models llama3-8b \
  --tasks mmlu gsm8k \
  --visualize

Getting Help

Questions, bug reports, and feature requests can be filed as issues on the GitHub repository.

Contributing

Contributions are welcome. Please see the Contributing guide for details.

Citation

@software{openevals,
  author = {Cheng Hei Lam},
  title = {OpenEvals: An Open-Source Evaluation Framework for Large Language Models},
  year = {2025},
  url = {https://github.com/heilcheng/openevals}
}