MedExplain-Evals Documentation
==============================

Welcome to MedExplain-Evals, a resource-efficient benchmark for evaluating audience-adaptive explanation quality in medical Large Language Models.

This project is developed as part of `Google Summer of Code 2025 <https://summerofcode.withgoogle.com/>`_ and mentored by `Google DeepMind <https://deepmind.google/>`_.

.. toctree::
   :maxdepth: 2
   :caption: Getting Started:

   installation
   quickstart

.. toctree::
   :maxdepth: 2
   :caption: Core Functionality:

   data_loading
   evaluation_metrics
   leaderboard

.. toctree::
   :maxdepth: 2
   :caption: API Reference:

   api/index

Overview
--------

MedExplain-Evals addresses a critical gap in medical AI evaluation: it is the first benchmark specifically designed to assess an LLM's ability to generate audience-adaptive medical explanations for four key stakeholders:

- **Physicians** - Technical, evidence-based explanations
- **Nurses** - Practical care implications and monitoring guidance
- **Patients** - Simple, empathetic, jargon-free language
- **Caregivers** - Concrete tasks and warning signs

Key Features
------------

- Novel evaluation framework for audience-adaptive medical explanations
- Support for the MedQA-USMLE, iCliniq, and Cochrane Reviews datasets
- Safety metrics, including contradiction and hallucination detection
- Automated complexity stratification using the Flesch-Kincaid Grade Level (a minimal sketch appears under *Illustrative Sketches* below)
- Interactive HTML leaderboards for result visualization
- Multi-dimensional scoring with the LLM-as-a-judge paradigm
- Optimized for open-weight models on consumer hardware

Quick Start
-----------

.. code-block:: bash

   pip install -r requirements.txt

.. code-block:: python

   from src.benchmark import MedExplain
   from src.evaluator import MedExplainEvaluator

   # Initialize the benchmark
   bench = MedExplain()

   # Generate audience-adaptive explanations
   explanations = bench.generate_explanations(medical_content, model)

   # Evaluate the explanations
   evaluator = MedExplainEvaluator()
   scores = evaluator.evaluate_all_audiences(explanations)

Architecture
------------

MedExplain-Evals is built on SOLID principles:

- Strategy Pattern for audience-specific scoring (sketched under *Illustrative Sketches* below)
- Dependency Injection for flexible component management
- Configuration-driven design based on YAML files (see the configuration sketch below)
- Comprehensive logging for debugging and monitoring

Getting Help
------------

**Documentation**

- Primary documentation: this guide covers installation, usage, and advanced topics
- API Reference: detailed function and class documentation with examples
- Quickstart Guide: :doc:`quickstart`
- Installation Guide: :doc:`installation`

**Support Channels**

- Bug Reports: `GitHub Issues <https://github.com/heilcheng/medexplain-evals/issues>`_
- Questions: `GitHub Discussions <https://github.com/heilcheng/medexplain-evals/discussions>`_

**Troubleshooting**

.. code-block:: bash

   # Verify the installation
   python -c "import src; print('MedExplain-Evals is working')"

   # Run a basic test
   python run_benchmark.py --model_name dummy --max_items 2

Contributing
------------

We welcome contributions:

- Code contributions via Pull Requests
- Bug reports and feature requests via Issues
- Documentation improvements
- Research collaborations

See the Contributing Guidelines in the `project repository <https://github.com/heilcheng/medexplain-evals>`_.

Citation
--------

.. code-block:: bibtex

   @software{medexplain-evals-2025,
     title={MedExplain-Evals: A Resource-Efficient Benchmark for Evaluating Audience-Adaptive Explanation Quality in Medical Large Language Models},
     author={Cheng Hei Lam},
     year={2025},
     url={https://github.com/heilcheng/medexplain-evals}
   }

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
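
Illustrative Sketches
=====================

The complexity-stratification feature above relies on the Flesch-Kincaid Grade Level. The snippet below is a minimal sketch of that standard formula using a rough vowel-group syllable heuristic; the benchmark itself may compute it differently (for example via a readability library), and ``count_syllables`` / ``flesch_kincaid_grade`` are illustrative names, not part of the MedExplain-Evals API.

.. code-block:: python

   import re

   def count_syllables(word: str) -> int:
       """Rough syllable count: number of vowel groups, minimum 1."""
       return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

   def flesch_kincaid_grade(text: str) -> float:
       """Flesch-Kincaid Grade Level:
       0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
       """
       sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
       words = re.findall(r"[A-Za-z']+", text)
       if not sentences or not words:
           return 0.0
       syllables = sum(count_syllables(w) for w in words)
       return (0.39 * (len(words) / len(sentences))
               + 11.8 * (syllables / len(words)) - 15.59)

   # A patient-facing explanation should land at a much lower grade
   # level than a physician-facing one.
   print(flesch_kincaid_grade("Take the tablet with food. Call us if you feel dizzy."))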
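
The Strategy Pattern mentioned under *Architecture* can be pictured as one scoring interface with an implementation per audience. This is an illustrative sketch only: ``AudienceScoringStrategy``, ``PatientStrategy``, and ``PhysicianStrategy`` are hypothetical names, and the project's real scorers are far richer than these keyword heuristics.

.. code-block:: python

   from abc import ABC, abstractmethod

   class AudienceScoringStrategy(ABC):
       """Interface every audience-specific scorer implements."""

       @abstractmethod
       def score(self, explanation: str) -> float:
           ...

   class PatientStrategy(AudienceScoringStrategy):
       def score(self, explanation: str) -> float:
           # Reward plain language: penalize very long words as a
           # crude stand-in for jargon.
           words = explanation.split()
           long_words = sum(1 for w in words if len(w) > 12)
           return max(0.0, 1.0 - long_words / max(1, len(words)))

   class PhysicianStrategy(AudienceScoringStrategy):
       def score(self, explanation: str) -> float:
           # Reward evidence markers such as study or guideline mentions.
           markers = ("randomized", "guideline", "meta-analysis", "et al.")
           hits = sum(explanation.lower().count(m) for m in markers)
           return min(1.0, hits / 3)

   STRATEGIES = {"patient": PatientStrategy(), "physician": PhysicianStrategy()}

   def evaluate(explanations: dict[str, str]) -> dict[str, float]:
       """Dispatch each explanation to the strategy for its audience."""
       return {aud: STRATEGIES[aud].score(text)
               for aud, text in explanations.items()}

The payoff of this shape is that adding a new audience means adding one class, and Dependency Injection can swap strategies in and out without touching the evaluation loop.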
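
Configuration-driven design here means that run parameters come from a YAML file rather than from code. A minimal sketch of that idea follows (requires PyYAML); the keys shown (``model_name``, ``max_items``, ``audiences``) are assumptions for illustration, not the project's actual schema.

.. code-block:: python

   import yaml  # PyYAML

   # Hypothetical configuration; the real schema may differ.
   CONFIG_TEXT = """
   model_name: dummy
   max_items: 2
   audiences: [physician, nurse, patient, caregiver]
   """

   config = yaml.safe_load(CONFIG_TEXT)

   # Components read their settings from the parsed config instead of
   # hard-coded values, so a run can be changed without touching code.
   print(config["model_name"], config["audiences"])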