Medical Foundation Model

MedMO: Grounding and Understanding
Multimodal LLMs for Medical Images

A powerful open-source medical foundation model that unifies visual grounding, clinical reasoning, and language understanding across diverse medical imaging modalities.

Ankan Deria · Komal Kumar · Adinath Madhavrao Dukre · Eran Segal · Salman Khan · Imran Razzak

Mohamed bin Zayed University of Artificial Intelligence

26M+ Training Samples · 45 Medical Datasets · +13.7% VQA Improvement · +43.8 Grounding IoU Gain · 4 Training Stages

Abstract

MedMO is a medical foundation model built for reliable multimodal understanding and grounded reasoning across clinical imaging. It supports a wide range of tasks, including Visual QA, Text-based QA, Radiology Report Generation, Report Summarization, Diagnostic Classification, and Clinical Reasoning.

MedMO also delivers strong spatial intelligence with Disease Localization using Bounding Boxes, Anatomical Grounding, and Spatial Object Detection across radiology, pathology, ophthalmology, and microscopy. The model is trained on large-scale, domain-specific data with multi-stage alignment and grounding objectives, enabling accurate interpretation and clinically faithful outputs across modalities.

In benchmarks, MedMO consistently outperforms prior open-source medical MLLMs on VQA, text QA, and report generation, while achieving large gains in grounding and localization accuracy. This makes MedMO a strong, unified model for real-world medical imaging workflows.

Performance Comparison

MedMO achieves state-of-the-art results across diverse medical imaging tasks

Figure. MedMO benchmark performance across multimodal medical tasks, including VQA, text-based QA, report generation, and spatial grounding.

Motivation & Challenges

Addressing critical limitations in existing medical MLLMs

Reliance on Distilled Data

Most existing models rely on data distilled from proprietary models, which often lacks the accurate domain grounding needed for fine-grained clinical reasoning.

Hallucination Risks

Distillation pipelines without structured supervision amplify hallucination risks and inconsistencies in medical outputs.

Narrow Modality Coverage

Current models focus on individual tasks or narrow modality subsets rather than achieving unified, cross-modal generalization.

Multi-Stage Training Pipeline

Progressive post-training for comprehensive medical image understanding

Figure. MedMO training pipeline and capabilities.

1. General Medical SFT: large-scale training for foundational understanding (18.5M samples, 768×768 resolution).
2. High-Resolution SFT: spatial localization and fine-grained grounding (3M samples, 1280×1280 resolution).
3. Instruction Tuning: human-style medical instruction following (4.3M samples, multi-task).
4. Reinforcement Learning: GRPO with verifiable rewards scored by bounding-box IoU (300K samples); a minimal reward sketch follows this list.
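For concreteness, here is a minimal sketch of what a verifiable bounding-box reward for the GRPO stage could look like: the standard GIoU score of the predicted box against the annotated box, rescaled to [0, 1], with an unparseable answer earning zero. MedMO's actual reward and output format are not spelled out on this page; the [x1, y1, x2, y2] convention in 0-999 coordinates and the helper names (parse_box, grounding_reward) are illustrative assumptions.

```python
import re

def box_area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def giou(a, b):
    """Generalized IoU in [-1, 1] for [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    iou = inter / union if union > 0 else 0.0
    # area of the smallest box enclosing both a and b
    c_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return iou - (c_area - union) / c_area if c_area > 0 else iou

def parse_box(text):
    """Pull the first [x1, y1, x2, y2] box out of a model response, if any."""
    m = re.search(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", text)
    return [int(g) for g in m.groups()] if m else None

def grounding_reward(response, gt_box):
    """Verifiable reward: 0 for an unparseable answer, else GIoU rescaled to [0, 1]."""
    pred = parse_box(response)
    if pred is None:
        return 0.0
    return (giou(pred, gt_box) + 1.0) / 2.0

# Example: a slightly shifted prediction earns a high but not perfect reward.
print(grounding_reward("The lesion is at [120, 80, 340, 260].", [110, 90, 330, 250]))
```

Because GIoU remains informative even when the predicted and reference boxes do not overlap, it gives a smoother training signal than raw IoU, which is zero for any non-overlapping pair.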
01

Cross-Modal Pretraining

Align heterogeneous visual encoders with a medical language backbone using the DeepStack fusion mechanism (a minimal fusion sketch follows this list of capabilities).

02

Multi-Task Supervision

Training spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes.

03

Verifiable Rewards

A novel bounding-box GIoU reward, combined with factuality checks, for enhanced spatial grounding (the reward sketch after the training-pipeline stages above illustrates the GIoU term).

04

Scalable Architecture

Built upon Qwen3-VL with a modular design enabling future expansion across additional modalities.
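As a rough illustration of the DeepStack-style fusion mentioned under Cross-Modal Pretraining, the sketch below injects visual tokens from several encoder levels into successive early decoder layers instead of concatenating everything at the input. This is a toy stand-in under stated assumptions, not MedMO's architecture: the DeepStackFusion class, the layer counts, and the use of generic Transformer layers are all illustrative.

```python
import torch
import torch.nn as nn

class DeepStackFusion(nn.Module):
    """Toy decoder stack that receives extra visual tokens in its early layers."""

    def __init__(self, d_model=512, n_layers=6, n_inject=3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        ])
        self.n_inject = n_inject  # how many early layers receive residual visual tokens

    def forward(self, text_tokens, vis_levels):
        """
        text_tokens: (B, T, D) embedded text tokens.
        vis_levels:  list of (B, V, D) visual features, one per encoder level;
                     level 0 enters the input sequence, deeper levels are added
                     residually onto the visual-token positions of later layers.
        """
        num_vis = vis_levels[0].size(1)
        h = torch.cat([vis_levels[0], text_tokens], dim=1)  # visual tokens first
        for i, layer in enumerate(self.layers):
            if 1 <= i <= self.n_inject and i < len(vis_levels):
                # residual injection of the i-th feature level at the visual-token slots
                h = torch.cat([h[:, :num_vis] + vis_levels[i], h[:, num_vis:]], dim=1)
            h = layer(h)
        return h

# Example: three feature levels of 16 visual tokens each, plus 32 text tokens.
model = DeepStackFusion()
vis_levels = [torch.randn(2, 16, 512) for _ in range(3)]
text = torch.randn(2, 32, 512)
print(model(text, vis_levels).shape)  # (2, 48, 512)
```

Adding the deeper levels residually keeps the sequence length fixed while still exposing fine-grained, high-resolution visual detail to the language stack.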

Benchmark Results

State-of-the-art performance across medical VQA, Text QA, and Grounding tasks

Key Performance Insights

BASELINE (Qwen3-VL-8B)
VQA: 48.8% · Text QA: 54.5%
CURRENT SOTA (Fleming-VL-8B)
Best VQA Avg: 64.4% · Text QA: 46.9%
MedMO-4B (OURS)
VQA: 52.0% · Text QA: 56.2%
🏆 MedMO-8B (OURS) — NEW SOTA
VQA: 62.5% · Text QA: 61.4%

MedMO achieves the best overall balance among open-source models, outperforming both Lingshu-7B and Fleming-VL-8B: it posts the strongest Text-QA results (+14.5 points over Fleming-VL) while staying within 1.9 points of the best VQA average.

Medical VQA Benchmarks

Model | MMMU-Med | VQA-RAD | SLAKE | PathVQA | PMC-VQA | OmniMedVQA | MedXQA | Avg.
GPT-4.1 | 75.2 | 65.0 | 72.2 | 55.5 | 55.2 | 75.5 | 45.2 | 63.4
Claude Sonnet 4 | 74.6 | 67.6 | 70.6 | 54.2 | 54.4 | 65.5 | 43.3 | 61.5
Gemini-2.5-Flash | 76.9 | 68.5 | 75.8 | 55.4 | 55.4 | 71.0 | 52.8 | 65.1
BiomedGPT | 24.9 | 16.6 | 13.6 | 11.3 | 27.6 | 27.9 | – | –
Med-R1-2B | 34.8 | 39.0 | 54.5 | 15.3 | 47.4 | – | 21.1 | –
MedVLM-R1-2B | 35.2 | 48.6 | 56.0 | 32.5 | 47.6 | 77.7 | 20.4 | 45.4
MedGemma-4B-IT | 43.7 | 72.5 | 76.4 | 48.8 | 49.9 | 69.8 | 22.3 | 54.8
LLaVA-Med-7B | 29.3 | 53.7 | 48.0 | 38.8 | 30.5 | 44.3 | 20.3 | 37.8
HuatuoGPT-V-7B | 47.3 | 67.0 | 67.8 | 48.0 | 53.3 | 74.2 | 21.6 | 54.2
BioMediX2-8B | 39.8 | 49.2 | 57.7 | 37.0 | 43.5 | 63.3 | 21.8 | 44.6
Qwen2.5VL-7B | 50.6 | 64.5 | 67.2 | 44.1 | 51.9 | 63.6 | 22.3 | 52.0
InternVL2.5-8B | 53.5 | 59.4 | 69.0 | 42.1 | 51.3 | 81.3 | 21.7 | 54.0
InternVL3-8B | 59.2 | 65.4 | 72.8 | 48.6 | 53.8 | 79.1 | 22.4 | 57.3
Lingshu-7B | 54.0 | 67.9 | 83.1 | 61.9 | 56.3 | 82.9 | 26.1 | 61.8
Fleming-VL-8B | 63.3 | 66.1 | 86.5 | 62.9 | 64.3 | 86.7 | 21.6 | 64.4
Qwen3VL-8B (Baseline) | 61.4 | 64.1 | 47.3 | 14.6 | 52.3 | 77.2 | 24.8 | 48.8
MedMO-4B (Ours) | 54.6 | 50.9 | 41.0 | 62.4 | 50.6 | 79.7 | 24.8 | 52.0 (↑+3.2)
MedMO-8B (Ours) | 64.6 | 64.7 | 81.6 | 56.3 | 59.4 | 84.8 | 26.2 | 62.5 (↑+13.7)

Medical Text QA Benchmarks

Model | MMLU-Med | PubMedQA | MedMCQA | MedQA | Medbullets | MedXQA | SGPQA | Avg.
GPT-4.1 | 89.6 | 75.6 | 77.7 | 89.1 | 77.0 | 30.9 | 49.9 | 70.0
Claude Sonnet 4 | 91.3 | 78.6 | 79.3 | 92.1 | 80.2 | 33.6 | 56.3 | 73.1
Gemini-2.5-Flash | 84.2 | 73.8 | 73.6 | 91.2 | 77.6 | 35.6 | 53.3 | 69.9
Med-R1-2B | 51.5 | 66.2 | 39.1 | 39.9 | 33.6 | 11.2 | 17.9 | 37.0
MedVLM-R1-2B | 51.8 | 66.4 | 39.7 | 42.3 | 33.8 | 11.8 | 19.1 | 37.8
MedGemma-4B-IT | 66.7 | 72.2 | 52.2 | 56.2 | 45.6 | 12.8 | 21.6 | 46.8
LLaVA-Med-7B | 50.6 | 26.4 | 39.4 | 42.0 | 34.4 | 9.9 | 16.1 | 31.3
HuatuoGPT-V-7B | 69.3 | 72.8 | 51.2 | 52.9 | 40.9 | 10.1 | 21.9 | 45.6
BioMediX2-8B | 68.6 | 75.2 | 52.9 | 58.9 | 45.9 | 13.4 | 25.2 | 48.6
Qwen2.5VL-7B | 73.4 | 76.4 | 52.6 | 57.3 | 42.1 | 12.8 | 26.3 | 48.7
InternVL2.5-8B | 74.2 | 76.4 | 52.4 | 53.7 | 42.4 | 11.6 | 26.1 | 48.1
InternVL3-8B | 77.5 | 75.4 | 57.7 | 62.1 | 48.5 | 13.1 | 31.2 | 52.2
Lingshu-7B | 74.5 | 76.6 | 55.9 | 63.3 | 56.2 | 16.5 | 26.3 | 52.8
Fleming-VL-8B | 71.8 | 74.0 | 51.8 | 53.7 | 40.5 | 12.1 | 24.9 | 46.9
Qwen3VL-8B (Baseline) | 79.3 | 70.4 | 60.0 | 66.1 | 56.1 | 15.1 | 34.7 | 54.5
MedMO-4B (Ours) | 75.7 | 78.0 | 58.0 | 78.5 | 57.5 | 16.4 | 29.4 | 56.2 (↑+1.7)
MedMO-8B (Ours) | 81.0 | 77.6 | 65.0 | 90.4 | 60.2 | 19.9 | 36.0 | 61.4 (↑+6.9)

Medical Report Generation

Semantic (ROUGE-L, CIDEr) and model-based (RaTE, Semb) metrics

Each dataset column lists R-L / CIDEr / RaTE / Semb.

Model | MIMIC-CXR | CheXpert Plus | IU-Xray | Med-Trinity
GPT-4.1 | 9.0 / 82.8 / 51.3 / 23.9 | 24.5 / 78.8 / 45.5 / 23.2 | 30.2 / 124.6 / 51.3 / 47.5 | –
Claude Sonnet 4 | 20.0 / 56.6 / 45.6 / 19.7 | 22.0 / 59.5 / 43.5 / 18.9 | 25.4 / 88.3 / 55.4 / 41.0 | –
Gemini-2.5-Flash | 25.4 / 80.7 / 50.3 / 29.7 | 23.6 / 72.2 / 44.3 / 27.4 | 33.5 / 129.3 / 55.6 / 50.9 | –
Med-R1-2B | 19.3 / 35.4 / 40.6 / 14.8 | 18.6 / 37.1 / 38.5 / 17.8 | 16.1 / 38.3 / 41.4 / 12.5 | –
MedVLM-R1-2B | 20.3 / 40.1 / 41.6 / 14.2 | 20.9 / 43.5 / 38.9 / 15.5 | 22.7 / 61.1 / 46.1 / 22.7 | –
MedGemma-4B-IT | 25.6 / 81.0 / 52.4 / 29.2 | 27.1 / 79.0 / 47.2 / 29.3 | 30.8 / 103.6 / 57.0 / 46.8 | –
LLaVA-Med-7B | 15.0 / 43.4 / 12.8 / 18.3 | 18.4 / 45.5 / 38.8 / 23.5 | 18.8 / 68.2 / 40.9 / 16.0 | –
HuatuoGPT-V-7B | 23.4 / 69.5 / 48.9 / 20.0 | 21.3 / 64.7 / 44.2 / 19.3 | 29.6 / 104.3 / 52.9 / 40.7 | –
BioMediX2-8B | 20.0 / 52.8 / 44.4 / 17.7 | 18.1 / 47.9 / 40.8 / 21.6 | 19.6 / 58.8 / 40.1 / 11.6 | –
Qwen2.5VL-7B | 24.1 / 63.7 / 47.0 / 18.4 | 22.2 / 62.0 / 41.0 / 17.2 | 26.5 / 78.1 / 48.4 / 36.3 | 23.5 / 81.5 / 44.9 / 38.3
InternVL2.5-8B | 23.2 / 61.8 / 47.0 / 21.0 | 20.6 / 58.5 / 43.1 / 19.7 | 24.8 / 75.4 / 51.1 / 36.7 | 13.5 / 47.1 / 42.5 / 12.8
InternVL3-8B | 22.9 / 66.2 / 48.2 / 21.5 | 20.9 / 65.4 / 44.3 / 25.2 | 22.9 / 76.2 / 51.2 / 31.3 | 12.9 / 46.6 / 42.2 / 3.7
Lingshu-7B | 30.8 / 109.4 / 52.1 / 30.0 | 26.5 / 79.0 / 45.4 / 26.8 | 41.2 / 180.7 / 57.6 / 48.4 | 16.0 / 74.5 / 44.4 / 24.0
Fleming-VL-8B | 35.7 / 132.5 / 56.7 / 33.6 | 26.1 / 82.2 / 47.1 / 40.1 | 44.9 / 198.6 / 66.0 / 51.3 | 13.1 / 35.8 / 41.9 / 18.1
Qwen3VL-8B (Baseline) | 25.1 / 77.9 / 50.3 / 33.4 | 21.9 / 67.4 / 44.4 / 37.9 | 25.0 / 91.4 / 52.5 / 42.9 | 20.2 / 69.9 / 45.9 / 33.6
MedMO-4B (Ours) | 26.0 / 92.6 / 49.8 / 31.6 | 15.1 / 62.3 / 36.6 / 34.2 | 26.6 / 94.0 / 42.1 / 41.3 | 22.5 / 152.6 / 47.8 / 34.3
MedMO-8B (Ours) | 31.7 / 140.0 / 57.1 / 50.0 | 23.6 / 87.5 / 47.3 / 42.2 | 31.1 / 169.7 / 45.3 / 41.3 | 37.0 / 270.4 / 53.0 / 39.2
Key Result: MedMO-8B achieves CIDEr 140.0 and Semb 50.0 on MIMIC-CXR, the best semantic coherence and clinical accuracy. On Med-Trinity (diverse modalities), it dramatically outperforms all baselines with CIDEr 270.4 (vs. 81.5 for the next best).
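For reference, ROUGE-L, one of the report-generation metrics above, scores the longest common subsequence between the generated and reference reports. The sketch below is a minimal, self-contained version with naive whitespace tokenization; the evaluation stack used for the table may use a different implementation and preprocessing.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score between a generated report and a reference report."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    prec, rec = lcs / len(c), lcs / len(r)
    if prec == 0 or rec == 0:
        return 0.0
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

print(rouge_l("no acute cardiopulmonary abnormality",
              "no acute cardiopulmonary process identified"))
```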

Medical Grounding Benchmarks (IoU %)

Model | NIH Chest | DeepLesion | Bacteria | MedSG (multi-view) | MedSG (tracking) | Avg.
InternVL3-8B | 10.1 | 0.0 | 0.7 | 6.3 | 13.0 | 5.6
Fleming-VL-8B | 0.0 | 0.0 | 8.3 | 42.0 | 36.7 | 17.2
Lingshu-7B | 5.3 | 0.7 | 0.0 | 28.3 | 38.7 | 13.9
Qwen3VL-8B | 16.4 | 0.0 | 9.16 | 8.4 | 17.8 | 13.7
MedMO (Ours) | 8.83 | 38.5 | 54.6 | 75.8 | 77.2 | 54.2 (↑+40.4)

Key Contributions

01

Open-Source Foundation Model

A powerful open-source, post-trained multimodal vision-language model designed for comprehensive medical image understanding and grounding, available in 4B and 8B variants.

02

Scalable Training Pipeline

Curated 26M+ multimodal medical samples from 45 datasets with a multi-stage post-training pipeline that progressively enhances cross-modal alignment.

03

Novel Evaluation Benchmark

Constructed a dedicated Cell dataset from open-source microscopy images with varying sizes, shapes, and densities for evaluating VLM detection capabilities.

04

Comprehensive Analysis

Extensive experiments across data and methodology dimensions, providing an open benchmark for future multimodal medical LLM research.

Unified Multimodal Medical Dataset

Dataset composition covering imaging modalities and biological systems
Composition of the unified multimodal medical dataset comprising diverse imaging modalities (X-ray, CT, MRI, Ultrasound, Nuclear Medicine, Optical, Pathology) and biological systems (Respiratory, Cardiovascular, Nervous, Digestive, Urinary, Musculoskeletal, and more).

Qualitative Results

MedMO demonstrates superior diagnostic accuracy and clinical reasoning

🔬 Dermatology Diagnosis
Question
What is the name of the skin abnormality in this image?
Options: A. Eczema, B. Squamous cell carcinoma, C. Malignant melanoma, D. Melanoma
Other Models (Fleming-VL, Qwen3-VL, Lingshu)
B. Psoriasis ❌
✓ MedMO
B. Squamous cell carcinoma
🦠 Cell Detection & Grounding
Question
Detect and localize all cells in the image.
Ground Truth
[[54,545,63,554]]
Other Models
Fleming-VL: [0,0,999,999] ❌
Qwen3-VL: [31,21,965,957] ❌
✓ MedMO
[53,548,62,557] ✓ (Near-perfect localization)
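As a worked illustration, the snippet below parses the bracketed boxes from this example and computes their plain IoU, the same quantity behind the IoU% grounding metric above. The [x1, y1, x2, y2] convention in 0-999 coordinates is an assumption based on the examples shown.

```python
import re

def parse_boxes(text):
    """Extract every [x1, y1, x2, y2] box from a prediction string."""
    return [[int(v) for v in m.groups()]
            for m in re.finditer(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", text)]

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

gt = parse_boxes("[[54,545,63,554]]")[0]
pred = parse_boxes("[53,548,62,557]")[0]
print(f"IoU(MedMO, GT) = {iou(pred, gt):.2f}")
```

Because these boxes span only about 9×9 units of the 0-999 grid, an offset of even a few units removes a large share of the overlap, so small absolute localization errors still translate into moderate IoU values.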