Data Loading and Processing
MedExplain-Evals provides comprehensive data loading functionality for popular medical datasets, with automatic complexity stratification and standardized conversion to the MedExplain-Evals format.
Supported Datasets
The following medical datasets are currently supported:
MedQA-USMLE: Medical question answering based on USMLE exam format
iCliniq: Real clinical questions from patients with professional answers
Cochrane Reviews: Evidence-based systematic reviews and meta-analyses
Each dataset loader handles the specific format and field mappings of its source data, converting everything to standardized MedExplainItem objects.
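The exact definition of MedExplainItem lives in the source package; as a point of reference, a minimal sketch consistent with the fields used elsewhere on this page (question, answer, medical content, complexity level, source) could look like this:
from dataclasses import dataclass

# Illustrative sketch only -- see the package source for the actual MedExplainItem.
# Field names are inferred from the field mappings and statistics shown on this page.
@dataclass
class MedExplainItem:
    id: str                # unique item identifier
    question: str          # patient- or exam-style question
    answer: str            # reference answer
    medical_content: str   # source medical text to be explained
    complexity_level: str  # "basic", "intermediate", or "advanced"
    source: str            # originating dataset, e.g. "medqa_usmle"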
Basic Usage
Load Individual Datasets
from src.data_loaders import load_medqa_usmle, load_icliniq, load_cochrane_reviews
# Load MedQA-USMLE dataset with automatic complexity stratification
medqa_items = load_medqa_usmle(
'data/medqa_usmle.json',
max_items=300,
auto_complexity=True
)
# Load iCliniq dataset
icliniq_items = load_icliniq(
'data/icliniq.json',
max_items=400,
auto_complexity=True
)
# Load Cochrane Reviews
cochrane_items = load_cochrane_reviews(
'data/cochrane.json',
max_items=300,
auto_complexity=True
)
Combine Multiple Datasets
from src.data_loaders import save_benchmark_items
# Combine all datasets
all_items = medqa_items + icliniq_items + cochrane_items
# Save as unified benchmark dataset
save_benchmark_items(all_items, 'data/benchmark_items.json')
Complexity Stratification
MedExplain-Evals automatically categorizes content complexity using Flesch-Kincaid Grade Level scores:
Basic: FK score ≤ 8 (elementary/middle school level)
Intermediate: FK score 9-12 (high school level)
Advanced: FK score > 12 (college/professional level)
from src.data_loaders import calculate_complexity_level
# Calculate complexity for any text
text = "Hypertension is high blood pressure that can damage your heart."
complexity = calculate_complexity_level(text)
print(complexity) # "basic"
# More complex medical text
complex_text = ("Pharmacokinetic interactions involving cytochrome P450 enzymes "
"can significantly alter therapeutic drug concentrations.")
complexity = calculate_complexity_level(complex_text)
print(complexity) # "advanced"
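For reference, the grade-to-level mapping above can be reproduced with a few lines of textstat. This is a simplified sketch, not necessarily the exact implementation in src.data_loaders:
import textstat

# Simplified sketch of the Flesch-Kincaid mapping described above.
def fk_complexity(text: str) -> str:
    grade = textstat.flesch_kincaid_grade(text)
    if grade <= 8:
        return "basic"
    elif grade <= 12:
        return "intermediate"
    return "advanced"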
Fallback Complexity Calculation
When the textstat library is unavailable, MedExplain-Evals uses a fallback method based on:
Average sentence length
Average syllables per word
Medical terminology density
# The fallback method is automatically used when textstat is not available
# No changes needed in your code - it's handled transparently
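As a rough illustration of such a heuristic, the sketch below combines the three signals listed above. The weights, thresholds, and term list are placeholders, not the values MedExplain-Evals actually uses:
import re

# Placeholder term list for illustration only.
MEDICAL_TERMS = {"hypertension", "pharmacokinetic", "cytochrome", "etiology", "prognosis"}

def estimate_syllables(word: str) -> int:
    # crude vowel-group count as a syllable proxy
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fallback_complexity(text: str) -> str:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return "basic"
    avg_sentence_len = len(words) / len(sentences)
    avg_syllables = sum(estimate_syllables(w) for w in words) / len(words)
    term_density = sum(w.lower() in MEDICAL_TERMS for w in words) / len(words)
    # Illustrative weights and cutoffs; the real fallback may combine these signals differently.
    score = 0.4 * avg_sentence_len + 4.0 * avg_syllables + 20.0 * term_density
    if score <= 13:
        return "basic"
    elif score <= 17:
        return "intermediate"
    return "advanced"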
Data Processing Script
MedExplain-Evals includes a comprehensive command-line script for processing and combining datasets:
Basic Usage
# Process all three datasets with default settings
python scripts/process_datasets.py \
--medqa data/medqa_usmle.json \
--icliniq data/icliniq.json \
--cochrane data/cochrane.json \
--output data/benchmark_items.json
Advanced Options
# Custom item limits per dataset
python scripts/process_datasets.py \
--medqa data/medqa_usmle.json \
--icliniq data/icliniq.json \
--cochrane data/cochrane.json \
--output data/benchmark_items.json \
--max-items 1500 \
--medqa-items 600 \
--icliniq-items 500 \
--cochrane-items 400 \
--balance-complexity \
--validate \
--stats \
--verbose
Script Options
| Option | Description |
|---|---|
| --medqa | Path to MedQA-USMLE JSON file |
| --icliniq | Path to iCliniq JSON file |
| --cochrane | Path to Cochrane Reviews JSON file |
| --output | Output path for combined dataset (default: data/benchmark_items.json) |
| --max-items | Maximum total items in final dataset (default: 1000) |
| --medqa-items | Maximum items from MedQA-USMLE |
| --icliniq-items | Maximum items from iCliniq |
| --cochrane-items | Maximum items from Cochrane Reviews |
| --auto-complexity | Enable automatic complexity calculation (default: True) |
| --no-auto-complexity | Disable automatic complexity calculation |
| --balance-complexity | Balance dataset across complexity levels (default: True) |
| --validate | Validate final dataset and show report |
| --stats | Show detailed statistics about created dataset |
| --seed | Random seed for reproducible dataset creation (default: 42) |
| --verbose | Enable verbose logging |
Dataset Validation
The data processing includes comprehensive validation:
from scripts.process_datasets import validate_dataset
# Validate any list of MedExplainItem objects
validation_report = validate_dataset(items)
if validation_report['valid']:
    print("✅ Dataset validation passed")
else:
    print("❌ Dataset validation failed")
    for issue in validation_report['issues']:
        print(f"  Issue: {issue}")
    for warning in validation_report['warnings']:
        print(f"  Warning: {warning}")
Validation Checks
Duplicate IDs: Ensures all item IDs are unique
Content Length: Validates minimum content length (≥20 characters)
Complexity Distribution: Warns if not all complexity levels are represented
Data Integrity: Checks for valid field types and required fields
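The sketch below expresses these checks in a few lines. It follows the illustrative MedExplainItem fields used earlier on this page and is not the exact implementation in scripts/process_datasets.py:
from collections import Counter

def validate_items(items):
    report = {"valid": True, "issues": [], "warnings": []}
    # Duplicate IDs
    duplicates = [item_id for item_id, n in Counter(item.id for item in items).items() if n > 1]
    if duplicates:
        report["issues"].append(f"Duplicate IDs: {duplicates}")
    # Content length
    too_short = [item.id for item in items if len(item.medical_content) < 20]
    if too_short:
        report["issues"].append(f"Content shorter than 20 characters: {too_short}")
    # Complexity distribution
    missing = {"basic", "intermediate", "advanced"} - {item.complexity_level for item in items}
    if missing:
        report["warnings"].append(f"No items at complexity level(s): {sorted(missing)}")
    report["valid"] = not report["issues"]
    return report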
Dataset Statistics
Generate comprehensive statistics about your dataset:
from scripts.process_datasets import print_dataset_statistics
# Print detailed statistics
print_dataset_statistics(items)
The statistics include:
Total item count
Complexity level distribution (percentages)
Source dataset distribution
Content length statistics (min, max, average)
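A minimal sketch of how these numbers can be computed, again assuming the illustrative MedExplainItem fields used earlier on this page:
from collections import Counter
from statistics import mean

def summarize(items):
    if not items:
        print("No items loaded")
        return
    lengths = [len(item.medical_content) for item in items]
    print(f"Total items: {len(items)}")
    for level, n in Counter(item.complexity_level for item in items).items():
        print(f"  {level}: {n} ({100 * n / len(items):.1f}%)")
    for source, n in Counter(item.source for item in items).items():
        print(f"  {source}: {n}")
    print(f"Content length: min={min(lengths)}, max={max(lengths)}, avg={mean(lengths):.1f}")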
Custom Dataset Loading
For datasets not directly supported, use the custom loader:
from src.data_loaders import load_custom_dataset
# Define field mapping for your dataset
field_mapping = {
'q': 'question', # Your field -> standard field
'a': 'answer',
'medical_text': 'medical_content',
'item_id': 'id'
}
items = load_custom_dataset(
'path/to/your/dataset.json',
field_mapping=field_mapping,
max_items=500,
complexity_level='intermediate' # Or use auto_complexity=True
)
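Conceptually, the mapping renames the keys of each raw record to the standard field names. The loader may apply additional normalization, but the core translation of the field_mapping above looks roughly like this:
# Illustrative translation using the field_mapping defined above.
raw_record = {"q": "What causes anemia?", "a": "...", "medical_text": "...", "item_id": "x1"}
standardized = {field_mapping[key]: value for key, value in raw_record.items()}
# -> {"question": "What causes anemia?", "answer": "...", "medical_content": "...", "id": "x1"}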
Error Handling
All data loaders include comprehensive error handling:
import json
from src.data_loaders import load_medqa_usmle

try:
    items = load_medqa_usmle('data/medqa.json')
except FileNotFoundError:
    print("Dataset file not found")
except json.JSONDecodeError:
    print("Invalid JSON format")
except ValueError as e:
    print(f"Data validation error: {e}")
The loaders will:
Skip invalid items with detailed logging
Continue processing when individual items fail
Provide informative error messages
Return partial results when possible
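This skip-and-continue behavior follows a common pattern. The sketch below, with a hypothetical required-field check, illustrates the idea and is not the exact loader implementation:
import json
import logging

logger = logging.getLogger(__name__)

REQUIRED_FIELDS = ("id", "question", "answer")  # illustrative required fields

def parse_record(record: dict) -> dict:
    # Hypothetical parser: raise on missing fields so the caller can skip the record.
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record

def load_items_leniently(path):
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    items = []
    for i, record in enumerate(records):
        try:
            items.append(parse_record(record))
        except (KeyError, ValueError) as exc:
            logger.warning("Skipping item %d: %s", i, exc)  # skip invalid items with logging
    return items  # partial results when some records fail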
Performance Considerations
For large datasets:
Use max_items to limit memory usage during development
Enable auto_complexity only when needed (it adds processing time)
Consider processing datasets separately and combining them later
Use the --verbose flag to monitor progress
# Process a large dataset in chunks
chunk_size = 1000
total_items = 10_000  # total number of records in the source file (adjust to your data)
all_items = []
for i in range(0, total_items, chunk_size):
    chunk_items = load_medqa_usmle(
        'large_dataset.json',
        max_items=chunk_size,
        offset=i  # only if your loader supports an offset parameter
    )
    all_items.extend(chunk_items)
Best Practices
Reproducible Datasets: Always use the same random seed for consistent results
python scripts/process_datasets.py --seed 42 [other options]
Validation: Always validate your final dataset
python scripts/process_datasets.py --validate [other options]
Backup: Keep backup copies of your original datasets
Documentation: Document your dataset processing pipeline
# Document your processing steps
processing_config = {
    'medqa_items': 300,
    'icliniq_items': 400,
    'cochrane_items': 300,
    'complexity_stratification': True,
    'balance_complexity': True,
    'seed': 42
}
Version Control: Track your dataset versions and processing scripts