# Data Loading and Processing

OpenEvals provides utilities for loading and processing benchmark datasets.
## Supported Datasets

| Dataset | Source | Description |
|---|---|---|
| MMLU | HuggingFace | 57 subjects, multiple choice |
| GSM8K | HuggingFace | 8,500 math word problems |
| MATH | HuggingFace | Competition mathematics |
| HumanEval | OpenAI | 164 Python problems |
| ARC | AI2 | Science questions |
| TruthfulQA | HuggingFace | Truthfulness evaluation |
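For reference, the HuggingFace-hosted benchmarks can be inspected at their upstream source with the `datasets` library. A minimal sketch, assuming the commonly used `cais/mmlu` mirror; the repo ID and field names are properties of that mirror, not of the OpenEvals API:

```python
# Illustration only: look at a benchmark at its upstream HuggingFace source.
# The "cais/mmlu" repo ID is an assumption, not part of the OpenEvals API.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")
print(mmlu[0]["question"])
print(mmlu[0]["choices"], mmlu[0]["answer"])
```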
## Downloading Data

Download all datasets:

```bash
python -m openevals.scripts.download_data --all
```

Download specific datasets:

```bash
python -m openevals.scripts.download_data --datasets mmlu gsm8k
```
## Data Validation

Validate downloaded data:

```python
from openevals.utils.data_loader import DataValidator

validator = DataValidator()
is_valid = validator.validate_dataset("mmlu")
print(f"MMLU valid: {is_valid}")
```
## Dataset Statistics

Get statistics for a dataset:

```python
from openevals.utils.data_loader import get_dataset_stats

stats = get_dataset_stats("mmlu")
print(f"Total samples: {stats['total']}")
print(f"Subjects: {len(stats['subjects'])}")
```
## Custom Data Loading

Load custom datasets:

```python
from openevals.utils.data_loader import CustomDataLoader

loader = CustomDataLoader()
dataset = loader.load_from_json("custom_data.json")
```

Expected format for custom datasets:

```json
{
  "samples": [
    {
      "question": "What is 2 + 2?",
      "choices": ["3", "4", "5", "6"],
      "answer": "B"
    }
  ]
}
```
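As a worked example, the following sketch writes a file in that format with Python's standard `json` module and loads it back; only `load_from_json` from the snippet above is assumed:

```python
import json

from openevals.utils.data_loader import CustomDataLoader

# Build a dataset matching the expected schema and write it to disk.
data = {
    "samples": [
        {
            "question": "What is 2 + 2?",
            "choices": ["3", "4", "5", "6"],
            "answer": "B",  # letter index into choices ("B" -> "4")
        }
    ]
}
with open("custom_data.json", "w") as f:
    json.dump(data, f, indent=2)

loader = CustomDataLoader()
dataset = loader.load_from_json("custom_data.json")
```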
## Data Preprocessing

### Tokenization

```python
from openevals.utils.preprocessing import tokenize_dataset

tokenized = tokenize_dataset(
    dataset,
    tokenizer,
    max_length=2048,
)
```
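The `tokenizer` argument is not constructed above. Assuming `tokenize_dataset` accepts a HuggingFace tokenizer (an assumption, not confirmed on this page), one can be created like this:

```python
# Assumption: tokenize_dataset accepts a HuggingFace tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
```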
### Prompt Formatting

Format prompts for evaluation:

```python
from openevals.utils.preprocessing import format_prompt

prompt = format_prompt(
    question=sample["question"],
    choices=sample["choices"],
    shot_examples=few_shot_examples,
)
```
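Put together with the custom-data schema above, a few-shot prompt might be assembled like this. A sketch; passing sample dicts as `shot_examples` is an assumption about the expected shape:

```python
from openevals.utils.preprocessing import format_prompt

# Assumption: shot_examples takes sample dicts shaped like the custom schema.
few_shot_examples = [
    {"question": "What is 1 + 1?", "choices": ["1", "2", "3", "4"], "answer": "B"},
]
sample = {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"]}

prompt = format_prompt(
    question=sample["question"],
    choices=sample["choices"],
    shot_examples=few_shot_examples,
)
print(prompt)
```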
## Caching

Downloaded datasets are cached locally:

```
~/.cache/openevals/
├── mmlu/
├── gsm8k/
└── humaneval/
```

Clear cache:

```bash
python -m openevals.scripts.clear_cache --datasets mmlu
```
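To see how much disk each dataset is using before clearing it, a quick standard-library sketch relying only on the cache path documented above:

```python
from pathlib import Path

# Cache location documented above.
cache_root = Path.home() / ".cache" / "openevals"
for dataset_dir in sorted(cache_root.iterdir()):
    if dataset_dir.is_dir():
        size = sum(f.stat().st_size for f in dataset_dir.rglob("*") if f.is_file())
        print(f"{dataset_dir.name}: {size / 1e6:.1f} MB")
```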