OpenEvals Documentation
=======================

**OpenEvals** is an open-source evaluation framework for large language models, providing systematic benchmarking across standardized academic tasks.

.. note::

   This project is developed as part of Google Summer of Code 2025 with Google DeepMind.

Overview
--------

OpenEvals provides infrastructure for:

- Evaluating open-weight models on established benchmarks (MMLU, GSM8K, MATH, HumanEval, ARC, TruthfulQA, and more)
- Comparing performance across model families (Gemma, Llama, Mistral, Qwen, DeepSeek, and arbitrary HuggingFace models)
- Measuring computational efficiency metrics (latency, throughput, memory utilization)
- Generating statistical analyses with confidence intervals
- Producing publication-ready visualizations and reports

Supported Models
----------------

.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Family
     - Variants
     - Sizes
   * - Gemma
     - Gemma, Gemma 2, Gemma 3
     - 1B - 27B
   * - Llama 3
     - Llama 3, 3.1, 3.2
     - 1B - 405B
   * - Mistral
     - Mistral, Mixtral
     - 7B, 8x7B, 8x22B
   * - Qwen
     - Qwen 2, Qwen 2.5
     - 0.5B - 72B
   * - DeepSeek
     - DeepSeek, DeepSeek-R1
     - 1.5B - 671B
   * - Phi
     - Phi-3
     - Mini, Small, Medium
   * - HuggingFace
     - Any model on the Hub
     - Custom

Supported Benchmarks
--------------------

.. list-table::
   :header-rows: 1
   :widths: 15 20 65

   * - Benchmark
     - Category
     - Description
   * - MMLU
     - Knowledge
     - 57 subjects spanning STEM, humanities, and social sciences
   * - GSM8K
     - Mathematical
     - Grade school math word problems
   * - MATH
     - Mathematical
     - Competition math (AMC, AIME, Olympiad)
   * - HumanEval
     - Code Generation
     - Python function completion tasks
   * - ARC
     - Reasoning
     - Science questions (Easy and Challenge splits)
   * - TruthfulQA
     - Truthfulness
     - Questions probing common misconceptions
   * - BBH
     - Reasoning
     - BIG-Bench Hard: 23 challenging tasks

Quick Start
-----------

.. code-block:: bash

   # Install
   git clone https://github.com/heilcheng/openevals.git
   cd openevals && pip install -r requirements.txt

   # Set authentication
   export HF_TOKEN=your_token_here

   # Run evaluation
   python -m openevals.scripts.run_benchmark \
       --config configs/benchmark_config.yaml \
       --models llama3-8b \
       --tasks mmlu gsm8k \
       --visualize

Getting Help
------------

- GitHub Issues: https://github.com/heilcheng/openevals/issues
- Documentation: https://heilcheng.github.io/openevals/

Contributing
------------

Contributions are welcome. Please see the :doc:`contributing` guide for details.

Citation
--------

.. code-block:: bibtex

   @software{openevals,
     author = {Cheng Hei Lam},
     title  = {OpenEvals: An Open-Source Evaluation Framework for Large Language Models},
     year   = {2025},
     url    = {https://github.com/heilcheng/openevals}
   }

.. toctree::
   :maxdepth: 2
   :caption: Getting Started

   installation
   quickstart

.. toctree::
   :maxdepth: 2
   :caption: Core Functionality

   data_loading
   evaluation_metrics
   leaderboard

.. toctree::
   :maxdepth: 2
   :caption: API Reference

   api/index
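
Example: Multi-Model Comparison
-------------------------------

The Quick Start above evaluates a single model. The sketch below illustrates how the same command might be used to compare two models on the mathematical benchmarks. It is an illustrative sketch rather than a verified invocation: the ``gemma-2-9b`` identifier, the ``math`` task name, and passing multiple values to ``--models`` are assumptions based on the naming patterns in the Quick Start and the tables above. Consult ``configs/benchmark_config.yaml`` and the CLI help for the exact names and syntax your installation accepts.

.. code-block:: bash

   # Illustrative sketch only: the second model identifier and the "math"
   # task name are assumptions; check the benchmark config and CLI help
   # for the identifiers your installation actually accepts.
   python -m openevals.scripts.run_benchmark \
       --config configs/benchmark_config.yaml \
       --models llama3-8b gemma-2-9b \
       --tasks gsm8k math \
       --visualize

Running both models in one invocation keeps the evaluation settings identical, which is what makes the resulting efficiency metrics and confidence intervals directly comparable across model families.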