Data Loading and Processing

OpenEvals provides utilities for loading and processing benchmark datasets.

Supported Datasets

Dataset      Source        Description
MMLU         HuggingFace   57 subjects, multiple choice
GSM8K        HuggingFace   8,500 math word problems
MATH         HuggingFace   Competition mathematics
HumanEval    OpenAI        164 Python problems
ARC          AI2           Science questions
TruthfulQA   HuggingFace   Truthfulness evaluation

Downloading Data

Download all datasets:

python -m openevals.scripts.download_data --all

Download specific datasets:

python -m openevals.scripts.download_data --datasets mmlu gsm8k

Data Validation

Validate downloaded data:

from openevals.utils.data_loader import DataValidator

validator = DataValidator()
is_valid = validator.validate_dataset("mmlu")
print(f"MMLU valid: {is_valid}")

Dataset Statistics

Get statistics for a dataset:

from openevals.utils.data_loader import get_dataset_stats

stats = get_dataset_stats("mmlu")
print(f"Total samples: {stats['total']}")
print(f"Subjects: {len(stats['subjects'])}")

Custom Data Loading

Load a custom dataset from a JSON file:

from openevals.utils.data_loader import CustomDataLoader

loader = CustomDataLoader()
dataset = loader.load_from_json("custom_data.json")

Expected format for custom datasets:

{
  "samples": [
    {
      "question": "What is 2 + 2?",
      "choices": ["3", "4", "5", "6"],
      "answer": "B"
    }
  ]
}
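
Putting the two together, here is a minimal end-to-end sketch that writes a file in the expected format and loads it back. Only CustomDataLoader.load_from_json comes from the docs above; the rest is standard library:

import json
from openevals.utils.data_loader import CustomDataLoader

# Build a dataset in the expected format and write it to disk
data = {
    "samples": [
        {
            "question": "What is 2 + 2?",
            "choices": ["3", "4", "5", "6"],
            "answer": "B",
        }
    ]
}
with open("custom_data.json", "w") as f:
    json.dump(data, f, indent=2)

loader = CustomDataLoader()
dataset = loader.load_from_json("custom_data.json")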

Data Preprocessing

Tokenization

Tokenize a dataset for a given model and context length:

from openevals.utils.preprocessing import tokenize_dataset

tokenized = tokenize_dataset(
    dataset,
    tokenizer,
    max_length=2048
)
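
The snippet above assumes a tokenizer is already in scope. A self-contained version, reusing the custom dataset from the previous section and assuming tokenize_dataset accepts a Hugging Face tokenizer object (an assumption; check your installed version), could look like:

from transformers import AutoTokenizer
from openevals.utils.data_loader import CustomDataLoader
from openevals.utils.preprocessing import tokenize_dataset

# Assumption: tokenize_dataset accepts any Hugging Face tokenizer object
tokenizer = AutoTokenizer.from_pretrained("gpt2")

loader = CustomDataLoader()
dataset = loader.load_from_json("custom_data.json")

tokenized = tokenize_dataset(
    dataset,
    tokenizer,
    max_length=2048,  # truncate to the model's context budget
)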

Prompt Formatting

Format prompts for evaluation:

from openevals.utils.preprocessing import format_prompt

prompt = format_prompt(
    question=sample["question"],
    choices=sample["choices"],
    shot_examples=few_shot_examples
)
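
For a concrete picture, here is a sketch with an inline sample and no few-shot examples. The exact output layout of format_prompt is not documented here, so treat the printed result as illustrative:

from openevals.utils.preprocessing import format_prompt

sample = {
    "question": "What is 2 + 2?",
    "choices": ["3", "4", "5", "6"],
}

# shot_examples takes the few-shot examples to prepend; an empty list
# (assumed to mean zero-shot) keeps the sketch minimal.
prompt = format_prompt(
    question=sample["question"],
    choices=sample["choices"],
    shot_examples=[],
)
print(prompt)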

Caching

Downloaded datasets are cached locally:

~/.cache/openevals/
+-- mmlu/
+-- gsm8k/
+-- humaneval/

Clear the cache for one or more datasets:

python -m openevals.scripts.clear_cache --datasets mmlu
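
If you want to see how much disk space the cache is using before clearing it, a stdlib-only sketch (no OpenEvals APIs involved) works:

from pathlib import Path

cache_root = Path.home() / ".cache" / "openevals"

# Sum file sizes per dataset directory under the cache root
for dataset_dir in sorted(cache_root.iterdir()):
    if dataset_dir.is_dir():
        size_mb = sum(
            f.stat().st_size for f in dataset_dir.rglob("*") if f.is_file()
        ) / 1e6
        print(f"{dataset_dir.name}: {size_mb:.1f} MB")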