A powerful open-source medical foundation model that unifies visual grounding, clinical reasoning, and language understanding across diverse medical imaging modalities.
Mohamed bin Zayed University of Artificial Intelligence
Multimodal large language models have advanced rapidly, but their adoption in medicine is constrained by limited domain coverage, imperfect modality alignment, and insufficient grounded reasoning. We introduce MedMO, a medical multimodal foundation model built on a general MLLM architecture and trained exclusively on large-scale domain-specific data.
MedMO uses a multi-stage training recipe: (1) cross-modal pretraining that aligns heterogeneous visual encoders with a medical language backbone; (2) instruction tuning with multi-task supervision spanning captioning, VQA, report generation, retrieval, and bounding-box disease localization; and (3) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU signal to improve spatial grounding and step-by-step reasoning in challenging clinical settings.
Across modalities and tasks, MedMO surpasses strong open-source medical baselines. MedMO-8B-Next achieves consistent gains on VQA benchmarks, improving by 6.6% on average over Fleming-VL-8B, including gains of 6.0% on MMMU-Med, 9.8% on PMC-VQA, and 21.3% on MedXpertQA. On text-based QA, it improves by 14.4% over Fleming-VL-8B, driven by gains of 8.4% on MMLU-Med and 30.1% on MedQA. For medical report generation, it improves by 6.7% on MIMIC-CXR. MedMO-8B-Next also demonstrates strong grounding performance, reaching 56.1 IoU on Bacteria, which is a 47.8 IoU gain over Fleming-VL-8B. At smaller scale, MedMO-4B-Next remains competitive and exceeds Fleming-VL-8B across VQA, QA, and report generation. Evaluations spanning radiology, ophthalmology, and pathology microscopy further confirm broad cross-modality generalization.
MedMO achieves state-of-the-art results across diverse medical imaging tasks
Addressing critical limitations in existing medical MLLMs
Most existing models rely on distilled data from proprietary models, which often lack accurate domain grounding for fine-grained clinical reasoning.
Distillation pipelines without structured supervision amplify hallucination risks and inconsistencies in medical outputs.
Current models focus on individual tasks or narrow modality subsets rather than achieving unified, cross-modal generalization.
Progressive post-training for comprehensive medical image understanding
Align heterogeneous visual encoders with a medical language backbone using a DeepStack fusion mechanism.
Training spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes.
Novel bounding-box GIoU reward combined with factuality checks for enhanced spatial grounding.
Built upon Qwen3-VL with a modular design enabling future expansion across additional modalities.
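The box-level GIoU reward above can be illustrated with a minimal sketch. This is our reading of the recipe, not MedMO's actual code; the `giou` helper and the `(x1, y1, x2, y2)` box format are illustrative assumptions.

```python
def giou(box_a, box_b):
    """Generalized IoU between two boxes given as (x1, y1, x2, y2); range [-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union of the two box areas.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest axis-aligned box enclosing both inputs.
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    enclose = (cx2 - cx1) * (cy2 - cy1)
    # GIoU penalizes the enclosing area not covered by the union.
    return iou - (enclose - union) / enclose if enclose > 0 else iou
```

During RLVR, such a score could be combined with a factuality check (e.g. a weighted sum) to form the verifiable reward; the exact weighting used in MedMO is not specified here.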
State-of-the-art performance across medical VQA, Text QA, and Grounding tasks
MedMO-8B-Next achieves a new state of the art among open-source models, surpassing Fleming-VL-8B by +6.6% on VQA and +14.4% on Text QA, with gains of 6.0% on MMMU-Med, 9.8% on PMC-VQA, and 21.3% on MedXpertQA, plus a 47.8 IoU improvement on Bacteria grounding. MedMO-4B-Next also exceeds Fleming-VL-8B across VQA, QA, and report generation at a smaller scale.
| Model | MMMU-Med | VQA-RAD (closed/all) | SLAKE (closed/all) | PathVQA (all) | PMC-VQA | OMVQA | MedXQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 75.2 | 65.0 | 72.2 | 55.5 | 55.2 | 75.5 | 45.2 | 63.4 |
| Claude Sonnet 4 | 74.6 | 67.6 | 70.6 | 54.2 | 54.4 | 65.5 | 43.3 | 61.5 |
| Gemini-2.5-Flash | 76.9 | 68.5 | 75.8 | 55.4 | 55.4 | 71.0 | 52.8 | 65.1 |
| BiomedGPT | 24.9 | 16.6 | 13.6 | 11.3 | 27.6 | 27.9 | – | – |
| Med-R1-2B | 34.8 | 39.0 | 54.5 | 15.3 | 47.4 | – | 21.1 | – |
| MedVLM-R1-2B | 35.2 | 48.6 | 56.0 | 32.5 | 47.6 | 77.7 | 20.4 | 45.4 |
| MedGemma-4B-IT | 43.7 | 72.5 | 76.4 | 48.8 | 49.9 | 69.8 | 22.3 | 54.8 |
| LLaVA-Med-7B | 29.3 | 53.7 | 48.0 | 38.8 | 30.5 | 44.3 | 20.3 | 37.8 |
| HuatuoGPT-V-7B | 47.3 | 67.0 | 67.8 | 48.0 | 53.3 | 74.2 | 21.6 | 54.2 |
| BioMediX2-8B | 39.8 | 49.2 | 57.7 | 37.0 | 43.5 | 63.3 | 21.8 | 44.6 |
| Qwen2.5VL-7B | 50.6 | 64.5 | 67.2 | 44.1 | 51.9 | 63.6 | 22.3 | 52.0 |
| InternVL2.5-8B | 53.5 | 59.4 | 69.0 | 42.1 | 51.3 | 81.3 | 21.7 | 54.0 |
| InternVL3-8B | 59.2 | 76.4/52.9 | 72.1/62.4 | 39.0 | 53.8 | 79.1 | 22.4 | 57.4 |
| Lingshu-7B | 54.0 | 77.2/43.0 | 82.4/33.2 | 41.9 | 54.2 | 82.9 | 26.9 | 55.1 |
| Fleming-VL-8B | 63.3 | 78.4/56.4 | 86.9/80.0 | 56.5 | 64.3 | 88.2 | 21.6 | 66.1 |
| Qwen3VL-8B (Baseline) | 61.4 | 54.1/31.2 | 34.3/15.0 | 14.6 | 52.3 | 77.2 | 24.8 | 40.5 |
| MedMO-4B (Ours) | 54.6 | 50.9/35.0 | 41.0/30.0 | 42.4 | 50.6 | 79.7 | 24.8 | 45.4 |
| MedMO-4B-Next (Ours) | 58.7 | 79.7/59.6 | 78.0/74.0 | 73.3 | 75.7 | 90.6 | 27.0 | 68.5 |
| MedMO-8B (Ours) | 64.6 | 72.3/64.7 | 70.6/70.0 | 56.3 | 59.4 | 84.8 | 26.2 | 63.2 |
| MedMO-8B-Next (Ours) | 69.3 | 86.4/68.0 | 83.0/81.6 | 56.3 | 74.1 | 93.3 | 42.9 | 72.7↑+6.6 |
| Model | MMLU-Med | PubMedQA | MedMCQA | MedQA | Medbullets (op4/op5) | MedXQA | SGPQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 89.6 | 75.6 | 77.7 | 89.1 | 77.0 | 30.9 | 49.9 | 70.0 |
| Claude Sonnet 4 | 91.3 | 78.6 | 79.3 | 92.1 | 80.2 | 33.6 | 56.3 | 73.1 |
| Gemini-2.5-Flash | 84.2 | 73.8 | 73.6 | 91.2 | 77.6 | 35.6 | 53.3 | 69.9 |
| BiomedGPT | – | – | – | – | – | – | – | – |
| Med-R1-2B | 51.5 | 66.2 | 39.1 | 39.9 | 33.6 | 11.2 | 17.9 | 37.0 |
| MedVLM-R1-2B | 51.8 | 66.4 | 39.7 | 42.3 | 33.8 | 11.8 | 19.1 | 37.8 |
| MedGemma-4B-IT | 66.7 | 72.2 | 52.2 | 56.2 | 45.6 | 12.8 | 21.6 | 46.8 |
| LLaVA-Med-7B | 50.6 | 26.4 | 39.4 | 42.0 | 34.4 | 9.9 | 16.1 | 31.3 |
| HuatuoGPT-V-7B | 69.3 | 72.8 | 51.2 | 52.9 | 40.9 | 10.1 | 21.9 | 45.6 |
| BioMediX2-8B | 68.6 | 75.2 | 52.9 | 58.9 | 45.9 | 13.4 | 25.2 | 48.6 |
| Qwen2.5VL-7B | 73.4 | 76.4 | 52.6 | 57.3 | 42.1 | 12.8 | 26.3 | 48.7 |
| InternVL2.5-8B | 74.2 | 76.4 | 52.4 | 53.7 | 42.4 | 11.6 | 26.1 | 48.1 |
| InternVL3-8B | 77.5 | 75.4 | 57.7 | 62.1 | 50.2/42.8 | 13.1 | 31.2 | 51.2 |
| Lingshu-7B | 69.6 | 75.8 | 56.3 | 63.5 | 62.0/53.8 | 16.4 | 27.5 | 53.1 |
| Fleming-VL-8B | 71.8 | 74.0 | 51.8 | 53.7 | 40.5/37.3 | 12.1 | 24.9 | 45.7 |
| Qwen3VL-8B (Baseline) | 79.3 | 70.4 | 60.0 | 66.1 | 56.1/47.7 | 15.1 | 34.7 | 53.6 |
| MedMO-4B (Ours) | 75.7 | 78.0 | 58.0 | 78.5 | 57.5/47.7 | 16.4 | 29.4 | 55.1 |
| MedMO-4B-Next (Ours) | 74.8 | 78.2 | 58.1 | 78.3 | 57.4/47.6 | 16.5 | 29.5 | 55.0 |
| MedMO-8B (Ours) | 81.0 | 77.6 | 65.0 | 84.3 | 66.5/60.2 | 19.9 | 36.0 | 61.3 |
| MedMO-8B-Next (Ours) | 80.2 | 75.6 | 62.0 | 83.8 | 65.2/57.8 | 20.9 | 35.5 | 60.1↑+14.4 |
Semantic (ROUGE-L, CIDEr) and model-based (RaTE, Semb) metrics
Columns are grouped by dataset, left to right: MIMIC-CXR, CheXpert Plus, IU-Xray, and Med-Trinity; each group reports R-L (ROUGE-L), CIDEr, RaTE, and Semb.

| Model | R-L | CIDEr | RaTE | Semb | R-L | CIDEr | RaTE | Semb | R-L | CIDEr | RaTE | Semb | R-L | CIDEr | RaTE | Semb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 9.0 | 82.8 | 51.3 | 23.9 | 24.5 | 78.8 | 45.5 | 23.2 | 30.2 | 124.6 | 51.3 | 47.5 | – | – | – | – |
| Claude Sonnet 4 | 20.0 | 56.6 | 45.6 | 19.7 | 22.0 | 59.5 | 43.5 | 18.9 | 25.4 | 88.3 | 55.4 | 41.0 | – | – | – | – |
| Gemini-2.5-Flash | 25.4 | 80.7 | 50.3 | 29.7 | 23.6 | 72.2 | 44.3 | 27.4 | 33.5 | 129.3 | 55.6 | 50.9 | – | – | – | – |
| Med-R1-2B | 19.3 | 35.4 | 40.6 | 14.8 | 18.6 | 37.1 | 38.5 | 17.8 | 16.1 | 38.3 | 41.4 | 12.5 | – | – | – | – |
| MedVLM-R1-2B | 20.3 | 40.1 | 41.6 | 14.2 | 20.9 | 43.5 | 38.9 | 15.5 | 22.7 | 61.1 | 46.1 | 22.7 | – | – | – | – |
| MedGemma-4B-IT | 25.6 | 81.0 | 52.4 | 29.2 | 27.1 | 79.0 | 47.2 | 29.3 | 30.8 | 103.6 | 57.0 | 46.8 | – | – | – | – |
| LLaVA-Med-7B | 15.0 | 43.4 | 12.8 | 18.3 | 18.4 | 45.5 | 38.8 | 23.5 | 18.8 | 68.2 | 40.9 | 16.0 | – | – | – | – |
| HuatuoGPT-V-7B | 23.4 | 69.5 | 48.9 | 20.0 | 21.3 | 64.7 | 44.2 | 19.3 | 29.6 | 104.3 | 52.9 | 40.7 | – | – | – | – |
| BioMediX2-8B | 20.0 | 52.8 | 44.4 | 17.7 | 18.1 | 47.9 | 40.8 | 21.6 | 19.6 | 58.8 | 40.1 | 11.6 | – | – | – | – |
| Qwen2.5VL-7B | 24.1 | 63.7 | 47.0 | 18.4 | 22.2 | 62.0 | 41.0 | 17.2 | 26.5 | 78.1 | 48.4 | 36.3 | 23.5 | 81.5 | 44.9 | 38.3 |
| InternVL2.5-8B | 23.2 | 61.8 | 47.0 | 21.0 | 20.6 | 58.5 | 43.1 | 19.7 | 24.8 | 75.4 | 51.1 | 36.7 | 13.5 | 47.1 | 42.5 | 12.8 |
| InternVL3-8B | 22.9 | 66.2 | 48.2 | 21.5 | 20.9 | 65.4 | 44.3 | 25.2 | 22.9 | 76.2 | 51.2 | 31.3 | 12.9 | 46.6 | 42.2 | 3.7 |
| Lingshu-7B | 30.8 | 109.4 | 52.1 | 30.0 | 26.5 | 79.0 | 45.4 | 26.8 | 41.2 | 180.7 | 57.6 | 48.4 | 16.0 | 74.5 | 44.4 | 24.0 |
| Fleming-VL-8B | 35.7 | 132.5 | 56.7 | 33.6 | 26.1 | 82.2 | 47.1 | 40.1 | 44.9 | 198.6 | 66.0 | 51.3 | 13.1 | 35.8 | 41.9 | 18.1 |
| Qwen3VL-8B (Baseline) | 25.1 | 77.9 | 50.3 | 33.4 | 21.9 | 67.4 | 44.4 | 37.9 | 25.0 | 91.4 | 52.5 | 42.9 | 20.2 | 69.9 | 45.9 | 33.6 |
| MedMO-4B (Ours) | 26.0 | 92.6 | 49.8 | 31.6 | 15.1 | 62.3 | 36.6 | 34.2 | 26.6 | 94.0 | 42.1 | 41.3 | 22.5 | 152.6 | 47.8 | 34.3 |
| MedMO-4B-Next (Ours) | 28.3 | 96.7 | 52.0 | 34.3 | 23.5 | 74.5 | 42.6 | 38.7 | 38.0 | 147.8 | 62.0 | 49.4 | 26.3 | 183.8 | 49.5 | 38.6 |
| MedMO-8B (Ours) | 31.7 | 140.0 | 57.1 | 50.0 | 23.6 | 87.5 | 47.3 | 42.2 | 31.1 | 169.7 | 45.3 | 41.3 | 37.0 | 270.4 | 53.0 | 39.2 |
| MedMO-8B-Next (Ours) | 32.6 | 143.4 | 57.7 | 51.5 | 25.7 | 88.3 | 48.1 | 43.8 | 31.8 | 171.9 | 56.0 | 43.1 | 38.5 | 272.1 | 53.8 | 40.7 |
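For reference, the R-L column is ROUGE-L, an F-measure over the longest common subsequence of candidate and reference reports. A minimal sketch follows; it assumes whitespace tokenization and the conventional `beta = 1.2` recall weighting, whereas real evaluations typically apply proper tokenization and stemming.

```python
def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """ROUGE-L: LCS-based F-measure between two whitespace-tokenized strings."""
    c, r = candidate.split(), reference.split()
    # Longest common subsequence length via dynamic programming.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c, 1):
        for j, rw in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cw == rw else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    # F-measure weighted toward recall, following the original ROUGE definition.
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```

CIDEr is similarly n-gram based (TF-IDF-weighted), while RaTE and Semb score clinical-entity agreement and embedding similarity with learned models.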
| Model | NIH | DeepLesion | Bacteria | MedSG (multi_view) | MedSG (object_tracking) | MedSG (referring) | Avg. |
|---|---|---|---|---|---|---|---|
| InternVL3-8B | 10.1 | 0.0 | 0.7 | 6.3 | 13.0 | 3.3 | 5.6 |
| Fleming-VL-8B | 0.0 | 0.0 | 8.3 | 42.0 | 36.7 | 16.6 | 17.2 |
| Lingshu-7B | 5.3 | 0.7 | 10.8 | 28.3 | 38.7 | 10.4 | 13.9 |
| Qwen3VL-8B | 16.4 | 0.0 | 9.2 | 8.4 | 17.8 | 31.4 | 13.8 |
| MedSG-Bench | – | – | – | 55.0 | 62.1 | 60.4 | – |
| MedMO-8B (Ours) | 8.8 | 38.5 | 54.6 | 75.8 | 77.2 | 70.1 | 54.2 |
| MedMO-8B-Next (Ours) | 15.9 | 40.5 | 56.1 | 77.5 | 78.8 | 71.9 | 56.8↑+39.6 |
A powerful open-source, post-trained multimodal vision-language model designed for comprehensive medical image understanding and grounding, available in 4B and 8B variants.
Curated 26M+ multimodal medical samples from 45 datasets, paired with a multi-stage post-training pipeline that progressively enhances cross-modal alignment.
Constructed a dedicated Cell dataset from open-source microscopy images containing cells of varying sizes, shapes, and densities, for evaluating VLM detection capabilities.
Extensive experiments across data and methodology dimensions, providing an open benchmark for future multimodal medical LLM research.
MedMO demonstrates superior diagnostic accuracy and clinical reasoning