Public Leaderboard

OpenEvals includes a leaderboard system for tracking and comparing model performance.

Overview

The leaderboard provides:

  • Ranked model comparisons across benchmarks

  • Historical performance tracking

  • Public and private leaderboard modes

  • Export capabilities for publications

Using the CLI

View Current Leaderboard

python -m openevals.scripts.leaderboard --task mmlu

Submit Results

Submission files may be YAML or JSON (see Input Formats below):

python -m openevals.scripts.leaderboard --submit results.yaml

Filter by Model Family

python -m openevals.scripts.leaderboard --family gemma --task gsm8k

Input Formats

YAML Format

submission:
  model_name: "gemma-2b-it"
  model_family: "gemma"
  model_size: "2b"

results:
  mmlu:
    overall: 0.65
    mathematics: 0.58
    computer_science: 0.72
  gsm8k:
    overall: 0.45

JSON Format

{
  "submission": {
    "model_name": "gemma-2b-it",
    "model_family": "gemma",
    "model_size": "2b"
  },
  "results": {
    "mmlu": {"overall": 0.65},
    "gsm8k": {"overall": 0.45}
  }
}
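Both formats carry the same structure, so a file can be sanity-checked before submitting. Below is a minimal sketch; the helper is illustrative, not part of the OpenEvals API. It uses PyYAML, whose safe_load also parses the JSON form, since JSON is a subset of YAML.

import yaml

def check_submission(path):
    """Sanity-check a leaderboard submission file before submitting it."""
    with open(path) as f:
        data = yaml.safe_load(f)  # parses both the YAML and JSON forms
    for key in ("submission", "results"):
        if key not in data:
            raise ValueError(f"missing top-level key: {key}")
    for field in ("model_name", "model_family", "model_size"):
        if field not in data["submission"]:
            raise ValueError(f"missing submission field: {field}")
    for task, scores in data["results"].items():
        overall = scores.get("overall")
        if not isinstance(overall, (int, float)) or not 0.0 <= overall <= 1.0:
            raise ValueError(f"{task}: 'overall' must be a number in [0, 1]")

check_submission("results.yaml")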

Customization

Custom Ranking

Configure how much each benchmark contributes to a model's overall ranking score:

from openevals.leaderboard import Leaderboard

lb = Leaderboard()
lb.set_ranking_weights({
    "mmlu": 0.3,
    "gsm8k": 0.2,
    "humaneval": 0.3,
    "arc": 0.2
})
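With weights like these, a model's overall score is presumably the weighted average of its per-benchmark scores. The arithmetic below is a sketch of that interpretation, not a call into the OpenEvals API; the scores are taken from the example table at the end of this page.

weights = {"mmlu": 0.3, "gsm8k": 0.2, "humaneval": 0.3, "arc": 0.2}
scores = {"mmlu": 0.82, "gsm8k": 0.74, "humaneval": 0.68, "arc": 0.85}

# Overall score = sum of weight * score across the ranked benchmarks.
overall = sum(weights[task] * scores[task] for task in weights)
print(round(overall, 3))  # 0.768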

Filtering Options

# Filter by model size
lb.filter(min_size="7b", max_size="70b")

# Filter by date
lb.filter(after="2025-01-01")

# Filter by benchmark score
lb.filter(min_mmlu=0.5)
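Since each criterion is a keyword argument, combining them in a single call is presumably possible; the combined form below is an assumption, not documented behavior.

# Combined: 7b-70b models evaluated in 2025 with MMLU >= 0.5 (assumed API)
lb.filter(min_size="7b", max_size="70b", after="2025-01-01", min_mmlu=0.5)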

Deployment

Web Interface

The web platform includes an interactive leaderboard:

cd web/backend && uvicorn app.main:app --port 8000

Access at http://localhost:8000/api/v1/benchmarks/leaderboard.
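With the server running, the leaderboard can also be fetched over HTTP. A minimal sketch using the third-party requests package; the query parameter and the response shape are assumptions.

import requests

resp = requests.get(
    "http://localhost:8000/api/v1/benchmarks/leaderboard",
    params={"task": "mmlu"},  # assumed to mirror the CLI's --task flag
)
resp.raise_for_status()
print(resp.json())  # assumed: JSON rows of ranked models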

API Endpoints

Endpoint                      Description
GET /leaderboard              Retrieve current rankings
POST /leaderboard/submit      Submit new results
GET /leaderboard/history      View historical rankings
GET /leaderboard/export       Export as CSV/JSON
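Submissions can likewise be made programmatically by POSTing the structure shown under Input Formats. A sketch, assuming the endpoints above are mounted under the /api/v1/benchmarks prefix used in the previous section:

import requests

payload = {
    "submission": {
        "model_name": "gemma-2b-it",
        "model_family": "gemma",
        "model_size": "2b",
    },
    "results": {
        "mmlu": {"overall": 0.65},
        "gsm8k": {"overall": 0.45},
    },
}
resp = requests.post(
    "http://localhost:8000/api/v1/benchmarks/leaderboard/submit",
    json=payload,  # requests serializes the dict as a JSON body
)
resp.raise_for_status()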

Export Options

Export for Publications

python -m openevals.scripts.leaderboard --export latex --output table.tex

Available formats:

  • LaTeX table

  • Markdown table

  • CSV

  • JSON
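The other formats follow the same pattern, for example (the exact format identifiers are assumptions based on the list above):

python -m openevals.scripts.leaderboard --export markdown --output table.md
python -m openevals.scripts.leaderboard --export csv --output table.csv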

Example LaTeX Output

\begin{table}[h]
\centering
\begin{tabular}{lcccc}
\toprule
Model & MMLU & GSM8K & HumanEval & ARC \\
\midrule
Llama 3 70B & 0.82 & 0.74 & 0.68 & 0.85 \\
Gemma 27B & 0.75 & 0.68 & 0.62 & 0.78 \\
\bottomrule
\end{tabular}
\caption{Model performance comparison}
\end{table}
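Note that \toprule, \midrule, and \bottomrule come from the booktabs package, so add \usepackage{booktabs} to the document preamble when using this output.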