OpenEvals Documentation

OpenEvals is an open-source evaluation framework for large language models, providing systematic benchmarking across standardized academic tasks.

Note

This project is developed as part of Google Summer of Code 2025 with Google DeepMind.

Overview

OpenEvals provides infrastructure for:

  • Evaluating open-weight models on established benchmarks (MMLU, GSM8K, MATH, HumanEval, ARC, TruthfulQA, and more)

  • Comparing performance across model families (Gemma, Llama, Mistral, Qwen, DeepSeek, and arbitrary HuggingFace models)

  • Measuring computational efficiency metrics (latency, throughput, memory utilization)

  • Generating statistical analyses with confidence intervals (see the sketch after this list)

  • Producing publication-ready visualizations and reports
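
The confidence intervals mentioned above let score differences between models be judged against sampling noise. The snippet below is a minimal sketch of one standard way to compute them, a normal-approximation (Wald) interval over per-example 0/1 scores; the counts are purely illustrative, and OpenEvals' own statistics code may use a different method (for example, bootstrapping).

# Minimal sketch: 95% confidence interval for benchmark accuracy from
# per-example 0/1 scores, using a normal-approximation (Wald) interval.
# Illustrative only; not OpenEvals' internal statistics code.
import math

def accuracy_confidence_interval(scores, z=1.96):
    """Return (accuracy, lower, upper) for a list of 0/1 per-example scores."""
    n = len(scores)
    acc = sum(scores) / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)

# Hypothetical run: 1,187 correct answers on the 1,319 GSM8K test problems.
scores = [1] * 1187 + [0] * (1319 - 1187)
acc, low, high = accuracy_confidence_interval(scores)
print(f"accuracy = {acc:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")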

Supported Models

| Family      | Variants                | Sizes               |
|-------------|-------------------------|---------------------|
| Gemma       | Gemma, Gemma 2, Gemma 3 | 1B - 27B            |
| Llama 3     | Llama 3, 3.1, 3.2       | 1B - 405B           |
| Mistral     | Mistral, Mixtral        | 7B, 8x7B, 8x22B     |
| Qwen        | Qwen 2, Qwen 2.5        | 0.5B - 72B          |
| DeepSeek    | DeepSeek, DeepSeek-R1   | 1.5B - 671B         |
| Phi         | Phi-3                   | Mini, Small, Medium |
| HuggingFace | Any model on Hub        | Custom              |
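
The last row means that, beyond the named families, any causal language model hosted on the Hugging Face Hub can be evaluated. The sketch below shows how such a model can be loaded with the standard transformers API and how the efficiency metrics from the overview (latency, throughput, peak memory) can be measured; it is a generic illustration, not the framework's own loader or instrumentation, and the model ID is only an example.

# Generic sketch: load an arbitrary Hub model and record latency, throughput,
# and peak GPU memory. Assumes a CUDA GPU; gated models need HF_TOKEN set.
# Illustrative only; not OpenEvals' own instrumentation.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # example; any causal LM on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Q: What is 17 * 24?\nA:", return_tensors="pt").to(model.device)

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=64)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"latency: {elapsed:.2f} s")
print(f"throughput: {new_tokens / elapsed:.1f} tokens/s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")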

Supported Benchmarks

| Benchmark  | Category        | Description                                            |
|------------|-----------------|--------------------------------------------------------|
| MMLU       | Knowledge       | 57 subjects spanning STEM, humanities, social sciences |
| GSM8K      | Mathematical    | Grade school math word problems                        |
| MATH       | Mathematical    | Competition math (AMC, AIME, Olympiad)                 |
| HumanEval  | Code Generation | Python function completion tasks                       |
| ARC        | Reasoning       | Science questions (Easy and Challenge splits)          |
| TruthfulQA | Truthfulness    | Questions probing common misconceptions                |
| BBH        | Reasoning       | BIG-Bench Hard: 23 challenging tasks                   |
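
To make the task format concrete, the snippet below loads the GSM8K test split from the Hugging Face Hub and scores a hypothetical model answer by exact match on the final number, which is the usual GSM8K metric. It is a minimal sketch, not OpenEvals' benchmark harness.

# Minimal sketch of GSM8K-style exact-match scoring using the datasets library.
# The model answer is hypothetical; this is not OpenEvals' own harness code.
from datasets import load_dataset

def extract_final_answer(solution: str) -> str:
    # GSM8K reference solutions end with "#### <final answer>".
    return solution.split("####")[-1].strip().replace(",", "")

dataset = load_dataset("gsm8k", "main", split="test")
example = dataset[0]
reference = extract_final_answer(example["answer"])

model_answer = "18"  # hypothetical model output for this question
print("question:", example["question"][:80], "...")
print("reference:", reference, "| correct:", model_answer == reference)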

Quick Start

# Install
git clone https://github.com/heilcheng/openevals.git
cd openevals && pip install -r requirements.txt

# Set Hugging Face authentication (required for gated models such as Gemma and Llama)
export HF_TOKEN=your_token_here

# Run evaluation
python -m openevals.scripts.run_benchmark \
  --config configs/benchmark_config.yaml \
  --models llama3-8b \
  --tasks mmlu gsm8k \
  --visualize

Getting Help

Questions, bug reports, and feature requests can be filed as issues on the GitHub repository.

Contributing

Contributions are welcome. Please see the Contributing guide for details.

Citation

@software{openevals,
  author = {Cheng Hei Lam},
  title = {OpenEvals: An Open-Source Evaluation Framework for Large Language Models},
  year = {2025},
  url = {https://github.com/heilcheng/openevals}
}