A powerful open-source medical foundation model that unifies visual grounding, clinical reasoning, and language understanding across diverse medical imaging modalities.
Mohamed bin Zayed University of Artificial Intelligence
MedMO is a medical foundation model built for reliable multimodal understanding and grounded reasoning across clinical imaging. It supports a wide range of tasks, including Visual QA, Text-based QA, Radiology Report Generation, Report Summarization, Diagnostic Classification, and Clinical Reasoning.
MedMO also delivers strong spatial intelligence with Disease Localization using Bounding Boxes, Anatomical Grounding, and Spatial Object Detection across radiology, pathology, ophthalmology, and microscopy. The model is trained on large-scale, domain-specific data with multi-stage alignment and grounding objectives, enabling accurate interpretation and clinically faithful outputs across modalities.
In benchmarks, MedMO consistently outperforms prior open-source medical MLLMs on VQA, text QA, and report generation, while achieving large gains in grounding and localization accuracy. This makes MedMO a strong, unified model for real-world medical imaging workflows.
MedMO achieves state-of-the-art results across diverse medical imaging tasks
Addressing critical limitations in existing medical MLLMs
Most existing models rely on data distilled from proprietary models, and such data often lacks the accurate domain grounding required for fine-grained clinical reasoning.
Distillation pipelines without structured supervision amplify hallucination risks and inconsistencies in medical outputs.
Current models focus on individual tasks or narrow modality subsets rather than achieving unified, cross-modal generalization.
Progressive post-training for comprehensive medical image understanding
Align heterogeneous visual encoders with a medical language backbone using a DeepStack fusion mechanism.
Training spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes.
A novel bounding-box GIoU reward combined with factuality checks for enhanced spatial grounding (illustrated in the sketch after this list).
Built upon Qwen3-VL with a modular design enabling future expansion across additional modalities.
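To make the GIoU reward concrete, here is a minimal sketch. The `grounding_reward` wrapper and its `factually_consistent` flag are hypothetical illustrations of how a GIoU term could be gated by a factuality check; the exact reward shaping used in MedMO's training may differ.

```python
def giou(box_a, box_b):
    """Generalized IoU between two axis-aligned boxes (x1, y1, x2, y2).

    Returns a value in [-1, 1]: 1 for a perfect match, and negative
    values when the boxes are far apart (unlike plain IoU, which is
    flat at 0 for all non-overlapping boxes).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area.
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union area.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0

    # Smallest enclosing box C; GIoU = IoU - |C \ (A ∪ B)| / |C|.
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    area_c = (cx2 - cx1) * (cy2 - cy1)
    if area_c <= 0:
        return iou
    return iou - (area_c - union) / area_c


def grounding_reward(pred_box, gt_box, factually_consistent):
    """Hypothetical reward: a GIoU term gated by a factuality check."""
    if not factually_consistent:  # e.g. finding name contradicts the report
        return -1.0
    return giou(pred_box, gt_box)
```

Unlike plain IoU, GIoU remains informative when predicted and ground-truth boxes do not overlap, so the reward can still distinguish near misses from far misses during training.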
State-of-the-art performance across medical VQA, Text QA, and Grounding tasks
MedMO achieves the best overall balance among open-source models, outperforming both Lingshu-7B and Fleming-VL-8B with the strongest Text-QA results (+14.5 points over Fleming-VL) while maintaining competitive VQA performance within 1.9 points of SOTA.
| Model | MMMU-Med | VQA-RAD | SLAKE | PathVQA | PMC-VQA | OmniMedVQA | MedXQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 75.2 | 65.0 | 72.2 | 55.5 | 55.2 | 75.5 | 45.2 | 63.4 |
| Claude Sonnet 4 | 74.6 | 67.6 | 70.6 | 54.2 | 54.4 | 65.5 | 43.3 | 61.5 |
| Gemini-2.5-Flash | 76.9 | 68.5 | 75.8 | 55.4 | 55.4 | 71.0 | 52.8 | 65.1 |
| BiomedGPT | 24.9 | 16.6 | 13.6 | 11.3 | 27.6 | 27.9 | – | – |
| Med-R1-2B | 34.8 | 39.0 | 54.5 | 15.3 | 47.4 | – | 21.1 | – |
| MedVLM-R1-2B | 35.2 | 48.6 | 56.0 | 32.5 | 47.6 | 77.7 | 20.4 | 45.4 |
| MedGemma-4B-IT | 43.7 | 72.5 | 76.4 | 48.8 | 49.9 | 69.8 | 22.3 | 54.8 |
| LLaVA-Med-7B | 29.3 | 53.7 | 48.0 | 38.8 | 30.5 | 44.3 | 20.3 | 37.8 |
| HuatuoGPT-V-7B | 47.3 | 67.0 | 67.8 | 48.0 | 53.3 | 74.2 | 21.6 | 54.2 |
| BioMediX2-8B | 39.8 | 49.2 | 57.7 | 37.0 | 43.5 | 63.3 | 21.8 | 44.6 |
| Qwen2.5VL-7B | 50.6 | 64.5 | 67.2 | 44.1 | 51.9 | 63.6 | 22.3 | 52.0 |
| InternVL2.5-8B | 53.5 | 59.4 | 69.0 | 42.1 | 51.3 | 81.3 | 21.7 | 54.0 |
| InternVL3-8B | 59.2 | 65.4 | 72.8 | 48.6 | 53.8 | 79.1 | 22.4 | 57.3 |
| Lingshu-7B | 54.0 | 67.9 | 83.1 | 61.9 | 56.3 | 82.9 | 26.1 | 61.8 |
| Fleming-VL-8B | 63.3 | 66.1 | 86.5 | 62.9 | 64.3 | 86.7 | 21.6 | 64.4 |
| Qwen3VL-8B (Baseline) | 61.4 | 64.1 | 47.3 | 14.6 | 52.3 | 77.2 | 24.8 | 48.8 |
| MedMO-4B (Ours) | 54.6 | 50.9 | 41.0 | 62.4 | 50.6 | 79.7 | 24.8 | 52.0↑+3.2 |
| MedMO-8B (Ours) | 64.6 | 64.7 | 81.6 | 56.3 | 59.4 | 84.8 | 26.2 | 62.5↑+13.7 |
| Model | MMLU-Med | PubMedQA | MedMCQA | MedQA | Medbullets | MedXQA | SGPQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 89.6 | 75.6 | 77.7 | 89.1 | 77.0 | 30.9 | 49.9 | 70.0 |
| Claude Sonnet 4 | 91.3 | 78.6 | 79.3 | 92.1 | 80.2 | 33.6 | 56.3 | 73.1 |
| Gemini-2.5-Flash | 84.2 | 73.8 | 73.6 | 91.2 | 77.6 | 35.6 | 53.3 | 69.9 |
| Med-R1-2B | 51.5 | 66.2 | 39.1 | 39.9 | 33.6 | 11.2 | 17.9 | 37.0 |
| MedVLM-R1-2B | 51.8 | 66.4 | 39.7 | 42.3 | 33.8 | 11.8 | 19.1 | 37.8 |
| MedGemma-4B-IT | 66.7 | 72.2 | 52.2 | 56.2 | 45.6 | 12.8 | 21.6 | 46.8 |
| LLaVA-Med-7B | 50.6 | 26.4 | 39.4 | 42.0 | 34.4 | 9.9 | 16.1 | 31.3 |
| HuatuoGPT-V-7B | 69.3 | 72.8 | 51.2 | 52.9 | 40.9 | 10.1 | 21.9 | 45.6 |
| BioMediX2-8B | 68.6 | 75.2 | 52.9 | 58.9 | 45.9 | 13.4 | 25.2 | 48.6 |
| Qwen2.5VL-7B | 73.4 | 76.4 | 52.6 | 57.3 | 42.1 | 12.8 | 26.3 | 48.7 |
| InternVL2.5-8B | 74.2 | 76.4 | 52.4 | 53.7 | 42.4 | 11.6 | 26.1 | 48.1 |
| InternVL3-8B | 77.5 | 75.4 | 57.7 | 62.1 | 48.5 | 13.1 | 31.2 | 52.2 |
| Lingshu-7B | 74.5 | 76.6 | 55.9 | 63.3 | 56.2 | 16.5 | 26.3 | 52.8 |
| Fleming-VL-8B | 71.8 | 74.0 | 51.8 | 53.7 | 40.5 | 12.1 | 24.9 | 46.9 |
| Qwen3VL-8B (Baseline) | 79.3 | 70.4 | 60.0 | 66.1 | 56.1 | 15.1 | 34.7 | 54.5 |
| MedMO-4B (Ours) | 75.7 | 78.0 | 58.0 | 78.5 | 57.5 | 16.4 | 29.4 | 56.2↑+1.7 |
| MedMO-8B (Ours) | 81.0 | 77.6 | 65.0 | 90.4 | 60.2 | 19.9 | 36.0 | 61.4↑+6.9 |
Report generation, evaluated with semantic (ROUGE-L, CIDEr) and model-based (RaTE, Semb) metrics on MIMIC-CXR, CheXpert Plus, IU-Xray, and Med-Trinity; columns are grouped per dataset in that order (a worked ROUGE-L example follows the table).

| Model | R-L | CIDEr | RaTE | Semb | R-L | CIDEr | RaTE | Semb | R-L | CIDEr | RaTE | Semb | R-L | CIDEr | RaTE | Semb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 9.0 | 82.8 | 51.3 | 23.9 | 24.5 | 78.8 | 45.5 | 23.2 | 30.2 | 124.6 | 51.3 | 47.5 | – | – | – | – |
| Claude Sonnet 4 | 20.0 | 56.6 | 45.6 | 19.7 | 22.0 | 59.5 | 43.5 | 18.9 | 25.4 | 88.3 | 55.4 | 41.0 | – | – | – | – |
| Gemini-2.5-Flash | 25.4 | 80.7 | 50.3 | 29.7 | 23.6 | 72.2 | 44.3 | 27.4 | 33.5 | 129.3 | 55.6 | 50.9 | – | – | – | – |
| Med-R1-2B | 19.3 | 35.4 | 40.6 | 14.8 | 18.6 | 37.1 | 38.5 | 17.8 | 16.1 | 38.3 | 41.4 | 12.5 | – | – | – | – |
| MedVLM-R1-2B | 20.3 | 40.1 | 41.6 | 14.2 | 20.9 | 43.5 | 38.9 | 15.5 | 22.7 | 61.1 | 46.1 | 22.7 | – | – | – | – |
| MedGemma-4B-IT | 25.6 | 81.0 | 52.4 | 29.2 | 27.1 | 79.0 | 47.2 | 29.3 | 30.8 | 103.6 | 57.0 | 46.8 | – | – | – | – |
| LLaVA-Med-7B | 15.0 | 43.4 | 12.8 | 18.3 | 18.4 | 45.5 | 38.8 | 23.5 | 18.8 | 68.2 | 40.9 | 16.0 | – | – | – | – |
| HuatuoGPT-V-7B | 23.4 | 69.5 | 48.9 | 20.0 | 21.3 | 64.7 | 44.2 | 19.3 | 29.6 | 104.3 | 52.9 | 40.7 | – | – | – | – |
| BioMediX2-8B | 20.0 | 52.8 | 44.4 | 17.7 | 18.1 | 47.9 | 40.8 | 21.6 | 19.6 | 58.8 | 40.1 | 11.6 | – | – | – | – |
| Qwen2.5VL-7B | 24.1 | 63.7 | 47.0 | 18.4 | 22.2 | 62.0 | 41.0 | 17.2 | 26.5 | 78.1 | 48.4 | 36.3 | 23.5 | 81.5 | 44.9 | 38.3 |
| InternVL2.5-8B | 23.2 | 61.8 | 47.0 | 21.0 | 20.6 | 58.5 | 43.1 | 19.7 | 24.8 | 75.4 | 51.1 | 36.7 | 13.5 | 47.1 | 42.5 | 12.8 |
| InternVL3-8B | 22.9 | 66.2 | 48.2 | 21.5 | 20.9 | 65.4 | 44.3 | 25.2 | 22.9 | 76.2 | 51.2 | 31.3 | 12.9 | 46.6 | 42.2 | 3.7 |
| Lingshu-7B | 30.8 | 109.4 | 52.1 | 30.0 | 26.5 | 79.0 | 45.4 | 26.8 | 41.2 | 180.7 | 57.6 | 48.4 | 16.0 | 74.5 | 44.4 | 24.0 |
| Fleming-VL-8B | 35.7 | 132.5 | 56.7 | 33.6 | 26.1 | 82.2 | 47.1 | 40.1 | 44.9 | 198.6 | 66.0 | 51.3 | 13.1 | 35.8 | 41.9 | 18.1 |
| Qwen3VL-8B (Baseline) | 25.1 | 77.9 | 50.3 | 33.4 | 21.9 | 67.4 | 44.4 | 37.9 | 25.0 | 91.4 | 52.5 | 42.9 | 20.2 | 69.9 | 45.9 | 33.6 |
| MedMO-4B (Ours) | 26.0 | 92.6 | 49.8 | 31.6 | 15.1 | 62.3 | 36.6 | 34.2 | 26.6 | 94.0 | 42.1 | 41.3 | 22.5 | 152.6 | 47.8 | 34.3 |
| MedMO-8B (Ours) | 31.7 | 140.0 | 57.1 | 50.0 | 23.6 | 87.5 | 47.3 | 42.2 | 31.1 | 169.7 | 45.3 | 41.3 | 37.0 | 270.4 | 53.0 | 39.2 |
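ROUGE-L, one of the semantic metrics above, scores the longest common subsequence (LCS) between a generated report and its reference. Below is a minimal sketch in pure Python, assuming simple whitespace tokenization and an F-measure with β = 1; the official evaluation toolkit may differ in tokenization, stemming, and choice of β.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if tok_a == tok_b
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure between a candidate and a reference report."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


print(rouge_l("no acute cardiopulmonary abnormality",
              "no acute cardiopulmonary process"))  # 0.75
```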
| Model | NIH Chest | DeepLesion | Bacteria | MedSG (multi-view) | MedSG (tracking) | Avg. |
|---|---|---|---|---|---|---|
| InternVL3-8B | 10.1 | 0.0 | 0.7 | 6.3 | 13.0 | 5.6 |
| Fleming-VL-8B | 0.0 | 0.0 | 8.3 | 42.0 | 36.7 | 17.2 |
| Lingshu-7B | 5.3 | 0.7 | 0.0 | 28.3 | 38.7 | 13.9 |
| Qwen3VL-8B (Baseline) | 16.4 | 0.0 | 9.16 | 8.4 | 17.8 | 13.7 |
| MedMO (Ours) | 8.83 | 38.5 | 54.6 | 75.8 | 77.2 | 54.2↑+40.4 |
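The localization scores above are accuracies over predicted bounding boxes. As a reference point, the sketch below scores predictions with the common Acc@0.5 criterion; the 0.5 IoU threshold and the helper names (`iou`, `accuracy_at_iou`) are assumptions for illustration, not the benchmarks' documented protocol.

```python
def iou(box_a, box_b):
    """Plain IoU between two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of examples whose predicted box matches the ground
    truth at the given IoU threshold (Acc@0.5 by default)."""
    hits = sum(iou(p, g) >= threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)
```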
A powerful open-source, post-trained vision-language model (VLM) designed for comprehensive medical image understanding and grounding, available in 4B and 8B variants.
Curated 26M+ multimodal medical samples from 45 datasets, used in a multi-stage post-training pipeline that progressively enhances cross-modal alignment.
Constructed a dedicated Cell dataset from open-source microscopy images, containing cells of varying sizes, shapes, and densities, for evaluating VLM detection capabilities.
Extensive experiments across data and methodology dimensions, providing an open benchmark for future multimodal medical LLM research.
MedMO demonstrates superior diagnostic accuracy and clinical reasoning