Data Loading and Processing

MedExplain-Evals provides loaders for three medical datasets, with automatic complexity stratification and conversion to the standard MedExplain-Evals format.

Supported Datasets

The following medical datasets are currently supported:

  • MedQA-USMLE: Medical question answering based on USMLE exam format

  • iCliniq: Real clinical questions from patients with professional answers

  • Cochrane Reviews: Evidence-based systematic reviews and meta-analyses

Each dataset loader handles the specific format and field mappings of its source data, converting everything to standardized MedExplainItem objects.

Basic Usage

Load Individual Datasets

from src.data_loaders import load_medqa_usmle, load_icliniq, load_cochrane_reviews

# Load MedQA-USMLE dataset with automatic complexity stratification
medqa_items = load_medqa_usmle(
    'data/medqa_usmle.json',
    max_items=300,
    auto_complexity=True
)

# Load iCliniq dataset
icliniq_items = load_icliniq(
    'data/icliniq.json',
    max_items=400,
    auto_complexity=True
)

# Load Cochrane Reviews
cochrane_items = load_cochrane_reviews(
    'data/cochrane.json',
    max_items=300,
    auto_complexity=True
)

Combine Multiple Datasets

from src.data_loaders import save_benchmark_items

# Combine all datasets
all_items = medqa_items + icliniq_items + cochrane_items

# Save as unified benchmark dataset
save_benchmark_items(all_items, 'data/benchmark_items.json')

Complexity Stratification

MedExplain-Evals automatically categorizes content complexity using Flesch-Kincaid Grade Level scores:

  • Basic: FK score ≤ 8 (elementary/middle school level)

  • Intermediate: FK score 9-12 (high school level)

  • Advanced: FK score > 12 (college/professional level)

from src.data_loaders import calculate_complexity_level

# Calculate complexity for any text
text = "Hypertension is high blood pressure that can damage your heart."
complexity = calculate_complexity_level(text)
print(complexity)  # "basic"

# More complex medical text
complex_text = ("Pharmacokinetic interactions involving cytochrome P450 enzymes "
               "can significantly alter therapeutic drug concentrations.")
complexity = calculate_complexity_level(complex_text)
print(complexity)  # "advanced"
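The thresholds listed above can be read as a mapping from a Flesch-Kincaid grade to a level name. The following is a minimal sketch of that mapping, not the library's implementation: level_from_fk_grade is a hypothetical helper, and placing fractional grades between 8 and 9 in the intermediate band is an assumption, since the list leaves that range unspecified.

```python
def level_from_fk_grade(fk_grade: float) -> str:
    """Map a Flesch-Kincaid grade to a complexity label (illustrative only)."""
    if fk_grade <= 8:
        return "basic"
    if fk_grade <= 12:
        # Assumption: fractional grades in (8, 9) count as intermediate.
        return "intermediate"
    return "advanced"


# Show how a few sample grades are labeled.
for grade in (5.2, 8.0, 8.5, 12.0, 14.3):
    print(grade, level_from_fk_grade(grade))
```

If your pipeline depends on exact boundary behavior, check how calculate_complexity_level treats fractional scores before relying on this reading.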

Fallback Complexity Calculation

When the textstat library is unavailable, MedExplain-Evals uses a fallback method based on:

  • Average sentence length

  • Average syllables per word

  • Medical terminology density

# The fallback method is automatically used when textstat is not available
# No changes needed in your code - it's handled transparently
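To make the three fallback signals concrete, here is a rough sketch of how they could be estimated with the standard library alone. It is not the MedExplain-Evals fallback: the function names, the vowel-group syllable heuristic, and the short suffix list used to flag medical terms are all assumptions made for illustration. The grade estimate uses the standard Flesch-Kincaid Grade Level formula.

```python
import re

# Assumption: a tiny, hypothetical suffix list stands in for a real medical lexicon.
MEDICAL_SUFFIXES = ("itis", "osis", "emia", "ectomy", "pathy", "ology", "kinetic")


def count_syllables(word: str) -> int:
    """Rough estimate: one syllable per run of consecutive vowels (minimum 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fallback_features(text: str) -> dict:
    """Compute the three signals listed above for a piece of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return {"avg_sentence_length": 0.0,
                "avg_syllables_per_word": 0.0,
                "medical_density": 0.0}
    medical_words = [w for w in words if w.lower().endswith(MEDICAL_SUFFIXES)]
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_syllables_per_word": sum(count_syllables(w) for w in words) / len(words),
        "medical_density": len(medical_words) / len(words),
    }


def approximate_fk_grade(text: str) -> float:
    """Apply the standard Flesch-Kincaid Grade Level formula to the estimates."""
    f = fallback_features(text)
    return 0.39 * f["avg_sentence_length"] + 11.8 * f["avg_syllables_per_word"] - 15.59


print(fallback_features("Pharmacokinetic interactions can alter drug concentrations."))
```

The vowel-group heuristic tends to overcount syllables in words with silent vowels, so grades from this sketch will not match textstat exactly.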

Data Processing Script

MedExplain-Evals includes a command-line script, scripts/process_datasets.py, for processing and combining datasets:

Basic Usage

# Process all three datasets with default settings
python scripts/process_datasets.py \
    --medqa data/medqa_usmle.json \
    --icliniq data/icliniq.json \
    --cochrane data/cochrane.json \
    --output data/benchmark_items.json

Advanced Options

# Custom item limits per dataset, plus balancing, validation, statistics, and verbose logging
python scripts/process_datasets.py \
    --medqa data/medqa_usmle.json \
    --icliniq data/icliniq.json \
    --cochrane data/cochrane.json \
    --output data/benchmark_items.json \
    --max-items 1500 \
    --medqa-items 600 \
    --icliniq-items 500 \
    --cochrane-items 400 \
    --balance-complexity \
    --validate \
    --stats \
    --verbose

Script Options

Command Line Options

Option                  Description
--medqa PATH            Path to MedQA-USMLE JSON file
--icliniq PATH          Path to iCliniq JSON file
--cochrane PATH         Path to Cochrane Reviews JSON file
--output PATH           Output path for combined dataset (default: data/benchmark_items.json)
--max-items N           Maximum total items in final dataset (default: 1000)
--medqa-items N         Maximum items from MedQA-USMLE
--icliniq-items N       Maximum items from iCliniq
--cochrane-items N      Maximum items from Cochrane Reviews
--auto-complexity       Enable automatic complexity calculation (default: True)
--no-auto-complexity    Disable automatic complexity calculation
--balance-complexity    Balance dataset across complexity levels (default: True)
--validate              Validate final dataset and show report
--stats                 Show detailed statistics about created dataset
--seed N                Random seed for reproducible dataset creation (default: 42)
--verbose               Enable verbose logging
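The --balance-complexity and --seed options interact: balancing involves choosing which items to keep, and the seed makes that choice repeatable. The sketch below shows one plausible strategy, downsampling every level to the size of the smallest one. How the script actually balances is not documented here, so balance_by_complexity and the 'complexity_level' key are assumptions, and items are modeled as plain dictionaries.

```python
import random
from collections import defaultdict


def balance_by_complexity(items, seed=42):
    """Downsample so each complexity level keeps the same number of items."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    by_level = defaultdict(list)
    for item in items:
        by_level[item["complexity_level"]].append(item)
    if not by_level:
        return []
    # Every level is cut down to the size of the smallest level.
    target = min(len(group) for group in by_level.values())
    balanced = []
    for level in sorted(by_level):  # fixed order keeps sampling deterministic
        balanced.extend(rng.sample(by_level[level], target))
    rng.shuffle(balanced)
    return balanced


levels = ["basic"] * 5 + ["intermediate"] * 3 + ["advanced"] * 2
items = [{"id": str(i), "complexity_level": lvl} for i, lvl in enumerate(levels)]
print([item["id"] for item in balance_by_complexity(items, seed=42)])
```

Running the function twice with the same seed and input returns the same selection, which is the property the --seed option is meant to provide.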

Dataset Validation

The processing script exposes a validate_dataset function that returns a report with 'valid', 'issues', and 'warnings' entries:

from scripts.process_datasets import validate_dataset

# Validate any list of MedExplainItem objects
validation_report = validate_dataset(items)

if validation_report['valid']:
    print("✅ Dataset validation passed")
else:
    print("❌ Dataset validation failed")
    for issue in validation_report['issues']:
        print(f"  Issue: {issue}")

for warning in validation_report['warnings']:
    print(f"  Warning: {warning}")

Validation Checks

  • Duplicate IDs: Ensures all item IDs are unique

  • Content Length: Validates minimum content length (≥20 characters)

  • Complexity Distribution: Warns if not all complexity levels are represented

  • Data Integrity: Checks for valid field types and required fields
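The checks listed above can be sketched as a single pass over the items that collects issues and warnings into a report of the same shape as the one shown earlier. This is an illustration rather than the code behind validate_dataset: items are modeled as dictionaries, and the field names 'id', 'medical_content', and 'complexity_level' are assumptions.

```python
from collections import Counter

REQUIRED_FIELDS = ("id", "medical_content", "complexity_level")  # assumed field names
LEVELS = ("basic", "intermediate", "advanced")
MIN_CONTENT_LENGTH = 20


def validate_items(items):
    """Return a report dict with 'valid', 'issues', and 'warnings' keys."""
    issues, warnings = [], []

    for index, item in enumerate(items):
        # Data integrity: every required field must be present and a string.
        bad = [f for f in REQUIRED_FIELDS if not isinstance(item.get(f), str)]
        if bad:
            issues.append(f"item {index}: missing or non-string fields {bad}")
            continue
        # Content length: at least MIN_CONTENT_LENGTH characters.
        if len(item["medical_content"]) < MIN_CONTENT_LENGTH:
            issues.append(f"item {item['id']}: content shorter than {MIN_CONTENT_LENGTH} characters")

    # Duplicate IDs: each ID may appear only once.
    id_counts = Counter(item.get("id") for item in items if isinstance(item.get("id"), str))
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    if duplicates:
        issues.append(f"duplicate IDs: {duplicates}")

    # Complexity distribution: a missing level is a warning, not a failure.
    present = {item.get("complexity_level") for item in items}
    absent = [level for level in LEVELS if level not in present]
    if absent:
        warnings.append(f"no items at complexity levels: {absent}")

    return {"valid": not issues, "issues": issues, "warnings": warnings}


sample = [
    {"id": "a", "medical_content": "x" * 25, "complexity_level": "basic"},
    {"id": "a", "medical_content": "too short", "complexity_level": "advanced"},
]
print(validate_items(sample))
```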

Dataset Statistics

Print summary statistics for a dataset:

from scripts.process_datasets import print_dataset_statistics

# Print detailed statistics
print_dataset_statistics(items)

The statistics include:

  • Total item count

  • Complexity level distribution (percentages)

  • Source dataset distribution

  • Content length statistics (min, max, average)
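For readers who want the same numbers as data rather than printed output, the figures listed above are straightforward to compute. The sketch below is independent of print_dataset_statistics: summarize is a hypothetical helper, items are modeled as dictionaries, and the keys 'medical_content', 'complexity_level', and 'source_dataset' are assumed names.

```python
from collections import Counter


def summarize(items):
    """Compute counts, percentages, and length statistics for a list of items."""
    total = len(items)
    if total == 0:
        return {"total": 0, "complexity_pct": {}, "sources": {}, "length": {}}
    lengths = [len(item["medical_content"]) for item in items]
    levels = Counter(item["complexity_level"] for item in items)
    sources = Counter(item["source_dataset"] for item in items)
    return {
        "total": total,
        "complexity_pct": {level: 100.0 * n / total for level, n in levels.items()},
        "sources": dict(sources),
        "length": {"min": min(lengths), "max": max(lengths), "avg": sum(lengths) / total},
    }


sample = [
    {"medical_content": "a" * 40, "complexity_level": "basic", "source_dataset": "MedQA-USMLE"},
    {"medical_content": "b" * 60, "complexity_level": "advanced", "source_dataset": "iCliniq"},
]
print(summarize(sample))
```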

Custom Dataset Loading

For datasets not directly supported, use the custom loader:

from src.data_loaders import load_custom_dataset

# Define field mapping for your dataset
field_mapping = {
    'q': 'question',          # Your field -> standard field
    'a': 'answer',
    'medical_text': 'medical_content',
    'item_id': 'id'
}

items = load_custom_dataset(
    'path/to/your/dataset.json',
    field_mapping=field_mapping,
    max_items=500,
    complexity_level='intermediate'  # Or use auto_complexity=True
)
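Conceptually, a field mapping of this shape renames the keys of each raw record before the item is built. The sketch below shows that renaming step in isolation. apply_field_mapping is a hypothetical helper, not a function in src.data_loaders, and dropping unmapped fields is an assumption about the desired behavior.

```python
def apply_field_mapping(record, field_mapping):
    """Rename keys of a raw record according to {source_field: standard_field}."""
    mapped = {}
    for source_field, standard_field in field_mapping.items():
        if source_field in record:
            mapped[standard_field] = record[source_field]
    # Assumption: fields that are not in the mapping are dropped.
    return mapped


field_mapping = {
    'q': 'question',
    'a': 'answer',
    'medical_text': 'medical_content',
    'item_id': 'id',
}
raw = {'q': 'What is hypertension?', 'a': 'High blood pressure.', 'item_id': 'x1', 'extra': 1}
print(apply_field_mapping(raw, field_mapping))
```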

Error Handling

All data loaders include comprehensive error handling:

import json

from src.data_loaders import load_medqa_usmle

try:
    items = load_medqa_usmle('data/medqa_usmle.json')
except FileNotFoundError:
    print("Dataset file not found")
except json.JSONDecodeError:
    print("Invalid JSON format")
except ValueError as e:
    print(f"Data validation error: {e}")

The loaders will:

  • Skip invalid items with detailed logging

  • Continue processing when individual items fail

  • Provide informative error messages

  • Return partial results when possible
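The skip-and-continue behavior described above follows a common pattern: file-level problems raise, while record-level problems are logged and skipped. Here is a minimal sketch of that pattern under stated assumptions. parse_items and the required field names are hypothetical, and the real loaders may log and validate differently.

```python
import json
import logging

logger = logging.getLogger("data_loading_sketch")


def parse_items(raw_json, required=("id", "question")):
    """Parse a JSON array, keeping valid records and logging the ones skipped."""
    records = json.loads(raw_json)  # malformed input raises json.JSONDecodeError
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    items = []
    for index, record in enumerate(records):
        try:
            if not isinstance(record, dict):
                raise ValueError("record is not a JSON object")
            missing = [field for field in required if field not in record]
            if missing:
                raise ValueError(f"missing fields: {missing}")
            items.append(record)
        except ValueError as exc:
            # One bad record does not abort the whole load.
            logger.warning("Skipping item %d: %s", index, exc)
    return items


raw = '[{"id": "1", "question": "Q1"}, {"id": "2"}, "not an object"]'
print(parse_items(raw))
```

Because json.JSONDecodeError is a subclass of ValueError, callers that want to tell the two apart need to catch json.JSONDecodeError first, as the example above does.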

Performance Considerations

For large datasets:

  • Use max_items to limit memory usage during development

  • Enable auto_complexity only when needed (adds processing time)

  • Consider processing datasets separately and combining later

  • Use the --verbose flag to monitor progress

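from src.data_loaders import load_medqa_usmle

# Process large dataset in chunks
chunk_size = 1000
total_items = 5000  # Placeholder: set this to the number of records in your file
all_items = []

for i in range(0, total_items, chunk_size):
    chunk_items = load_medqa_usmle(
        'large_dataset.json',
        max_items=chunk_size,
        offset=i  # Only valid if your loader supports an offset parameter
    )
    all_items.extend(chunk_items)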

Best Practices

  1. Reproducible Datasets: Always use the same random seed for consistent results

    python scripts/process_datasets.py --seed 42 [other options]
    
  2. Validation: Always validate your final dataset

    python scripts/process_datasets.py --validate [other options]
    
  3. Backup: Keep backup copies of your original datasets

  4. Documentation: Document your dataset processing pipeline

    # Document your processing steps
    processing_config = {
        'medqa_items': 300,
        'icliniq_items': 400,
        'cochrane_items': 300,
        'complexity_stratification': True,
        'balance_complexity': True,
        'seed': 42
    }
    
  5. Version Control: Track your dataset versions and processing scripts
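One way to combine practices 4 and 5 is to store the processing configuration together with a checksum of the dataset it produced, so a later run can confirm it reproduced the same file. The sketch below is only a suggestion: build_manifest and the manifest layout are hypothetical and are not part of MedExplain-Evals.

```python
import hashlib
import json


def build_manifest(processing_config, dataset_bytes):
    """Pair a processing config with the SHA-256 digest of the dataset it produced."""
    return {
        "config": processing_config,
        "sha256": hashlib.sha256(dataset_bytes).hexdigest(),
    }


processing_config = {
    'medqa_items': 300,
    'icliniq_items': 400,
    'cochrane_items': 300,
    'complexity_stratification': True,
    'balance_complexity': True,
    'seed': 42,
}
# In practice, read these bytes from the generated dataset file.
dataset_bytes = json.dumps([{"id": "example"}]).encode("utf-8")
print(json.dumps(build_manifest(processing_config, dataset_bytes), indent=2))
```

Committing the manifest next to the processing script gives version control a small, diffable record of each dataset build.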

API Reference