Medical Foundation Model

MedMO: Grounding and Understanding
Multimodal LLMs for Medical Images

A powerful open-source medical foundation model that unifies visual grounding, clinical reasoning, and language understanding across diverse medical imaging modalities.

Ankan Deria · Komal Kumar · Adinath Madhavrao Dukre · Eran Segal · Salman Khan · Imran Razzak

Mohamed bin Zayed University of Artificial Intelligence

26M+ Training Samples
45 Medical Datasets
+6.6% VQA over Fleming-VL
+14.4% QA over Fleming-VL
+47.8 Bacteria IoU Gain
4 Training Stages

Abstract

Multimodal large language models have advanced rapidly, but their adoption in medicine is constrained by limited domain coverage, imperfect modality alignment, and insufficient grounded reasoning. We introduce MedMO, a medical multimodal foundation model built on a general MLLM architecture and trained exclusively on large-scale domain-specific data.

MedMO uses a multi-stage training recipe. Cross-modal pretraining aligns heterogeneous visual encoders with a medical language backbone. Instruction tuning applies multi-task supervision spanning captioning, VQA, report generation, retrieval, and bounding-box disease localization. Finally, reinforcement learning with verifiable rewards combines factuality checks with a box-level GIoU signal to improve spatial grounding and step-by-step reasoning in challenging clinical settings.

Across modalities and tasks, MedMO surpasses strong open-source medical baselines. MedMO-8B-Next achieves consistent gains on VQA benchmarks, improving by 6.6% on average over Fleming-VL-8B, including gains of 6.0% on MMMU-Med, 9.8% on PMC-VQA, and 21.3% on MedXpertQA. On text-based QA, it improves by 14.4% over Fleming-VL-8B, driven by gains of 8.4% on MMLU-Med and 30.1% on MedQA. For medical report generation, it improves by 6.7% on MIMIC-CXR. MedMO-8B-Next also demonstrates strong grounding performance, reaching 56.1 IoU on Bacteria, which is a 47.8 IoU gain over Fleming-VL-8B. At smaller scale, MedMO-4B-Next remains competitive and exceeds Fleming-VL-8B across VQA, QA, and report generation. Evaluations spanning radiology, ophthalmology, and pathology microscopy further confirm broad cross-modality generalization.

Performance Comparison

MedMO achieves state-of-the-art results across diverse medical imaging tasks

MedMO Benchmark Performance
Figure. MedMO benchmark performance across multimodal medical tasks, including VQA, text-based QA, report generation, and spatial grounding.

Motivation & Challenges

Addressing critical limitations in existing medical MLLMs

Reliance on Distilled Data

Most existing models rely on distilled data from proprietary models, which often lack accurate domain grounding for fine-grained clinical reasoning.

Hallucination Risks

Distillation pipelines without structured supervision amplify hallucination risks and inconsistencies in medical outputs.

Narrow Modality Coverage

Current models focus on individual tasks or narrow modality subsets rather than achieving unified, cross-modal generalization.

Multi-Stage Training Pipeline

Progressive post-training for comprehensive medical image understanding

MedMO Training Pipeline and Capabilities
1. General Medical SFT: large-scale training for foundational understanding (18.5M samples, 768×768)
2. High-Resolution SFT: spatial localization and fine-grained grounding (3M samples, 1280×1280)
3. Instruction Tuning: human-style medical instruction following (4.3M samples, multi-task)
4. Reinforcement Learning: GRPO with verifiable rewards (300K samples, BBox IoU)
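The reinforcement-learning stage uses GRPO with verifiable rewards. As a rough illustrative sketch (not the actual training code), GRPO samples a group of responses per prompt, scores each with a verifiable reward, and normalizes rewards against the group's own statistics to form advantages:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled response's
    verifiable reward by the group's mean and standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

# Hypothetical example: 4 rollouts for one prompt; each reward could combine
# a factuality check with a box-IoU term, as described for MedMO's RL stage.
rewards = [0.9, 0.2, 0.5, 0.4]
advs = grpo_advantages(rewards)
```

Responses scoring above the group mean get positive advantages and are reinforced; the advantages of a group always sum to zero.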
01

Cross-Modal Pretraining

Align heterogeneous visual encoders with a medical language backbone using DeepStack fusion mechanism.

02

Multi-Task Supervision

Training spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes.

03

Verifiable Rewards

Novel bounding-box GIoU reward combined with factuality checks for enhanced spatial grounding.
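Generalized IoU extends plain IoU with a penalty based on the smallest box enclosing both prediction and target, so non-overlapping boxes still receive a useful gradient signal. A minimal sketch for axis-aligned boxes follows; the exact reward shaping used in MedMO's training is not specified here:

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes (x1, y1, x2, y2).
    Returns a value in (-1, 1]; equals plain IoU when the enclosing
    box coincides with the union of the two boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    enclose = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (enclose - union) / enclose
```

Unlike IoU, which is zero for any disjoint pair, GIoU decreases as boxes drift further apart, which is what makes it usable as a verifiable reward for localization.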

04

Scalable Architecture

Built upon Qwen3-VL with a modular design enabling future expansion across additional modalities.

Benchmark Results

State-of-the-art performance across medical VQA, Text QA, and Grounding tasks

Key Performance Insights

BASELINE (Qwen3-VL-8B)
VQA: 40.5% · Text QA: 53.6%
PREV. SOTA (Fleming-VL-8B)
VQA: 66.1% · Text QA: 45.7%
MedMO-4B (OURS)
VQA: 45.4% · Text QA: 55.1%
MedMO-4B-Next (OURS)
VQA: 68.5% · Text QA: 55.0%
MedMO-8B (OURS)
VQA: 63.2% · Text QA: 61.3%
🏆 MedMO-8B-Next (OURS) — NEW SOTA
VQA: 72.7% · Text QA: 60.1%

MedMO-8B-Next achieves new state-of-the-art among open-source models, surpassing Fleming-VL-8B by +6.6% on VQA and +14.4% on Text QA, with gains of 6.0% on MMMU-Med, 9.8% on PMC-VQA, 21.3% on MedXpertQA, and 47.8 IoU on Bacteria grounding. MedMO-4B-Next also exceeds Fleming-VL-8B across VQA, QA, and report generation at smaller scale.

Medical VQA Benchmarks

| Model | MMMU-Med | VQA-RAD (closed/all) | SLAKE (closed/all) | PathVQA (all) | PMC-VQA | OMVQA | MedXQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 75.2 | 65.0 | 72.2 | 55.5 | 55.2 | 75.5 | 45.2 | 63.4 |
| Claude Sonnet 4 | 74.6 | 67.6 | 70.6 | 54.2 | 54.4 | 65.5 | 43.3 | 61.5 |
| Gemini-2.5-Flash | 76.9 | 68.5 | 75.8 | 55.4 | 55.4 | 71.0 | 52.8 | 65.1 |
| BiomedGPT | 24.9 | 16.6 | 13.6 | 11.3 | 27.6 | 27.9 | – | – |
| Med-R1-2B | 34.8 | 39.0 | 54.5 | 15.3 | 47.4 | – | 21.1 | – |
| MedVLM-R1-2B | 35.2 | 48.6 | 56.0 | 32.5 | 47.6 | 77.7 | 20.4 | 45.4 |
| MedGemma-4B-IT | 43.7 | 72.5 | 76.4 | 48.8 | 49.9 | 69.8 | 22.3 | 54.8 |
| LLaVA-Med-7B | 29.3 | 53.7 | 48.0 | 38.8 | 30.5 | 44.3 | 20.3 | 37.8 |
| HuatuoGPT-V-7B | 47.3 | 67.0 | 67.8 | 48.0 | 53.3 | 74.2 | 21.6 | 54.2 |
| BioMediX2-8B | 39.8 | 49.2 | 57.7 | 37.0 | 43.5 | 63.3 | 21.8 | 44.6 |
| Qwen2.5VL-7B | 50.6 | 64.5 | 67.2 | 44.1 | 51.9 | 63.6 | 22.3 | 52.0 |
| InternVL2.5-8B | 53.5 | 59.4 | 69.0 | 42.1 | 51.3 | 81.3 | 21.7 | 54.0 |
| InternVL3-8B | 59.2 | 76.4/52.9 | 72.1/62.4 | 39.0 | 53.8 | 79.1 | 22.4 | 57.4 |
| Lingshu-7B | 54.0 | 77.2/43.0 | 82.4/33.2 | 41.9 | 54.2 | 82.9 | 26.9 | 55.1 |
| Fleming-VL-8B | 63.3 | 78.4/56.4 | 86.9/80.0 | 56.5 | 64.3 | 88.2 | 21.6 | 66.1 |
| Qwen3VL-8B (Baseline) | 61.4 | 54.1/31.2 | 34.3/15.0 | 14.6 | 52.3 | 77.2 | 24.8 | 40.5 |
| MedMO-4B (Ours) | 54.6 | 50.9/35.0 | 41.0/30.0 | 42.4 | 50.6 | 79.7 | 24.8 | 45.4 |
| MedMO-4B-Next (Ours) | 58.7 | 79.7/59.6 | 78.0/74.0 | 73.3 | 75.7 | 90.6 | 27.0 | 68.5 |
| MedMO-8B (Ours) | 64.6 | 72.3/64.7 | 70.6/70.0 | 56.3 | 59.4 | 84.8 | 26.2 | 63.2 |
| MedMO-8B-Next (Ours) | 69.3 | 86.4/68.0 | 83.0/81.6 | 56.3 | 74.1 | 93.3 | 42.9 | 72.7 (↑+6.6) |

Medical Text QA Benchmarks

| Model | MMLU-Med | PubMedQA | MedMCQA | MedQA | Medbullets (op4/op5) | MedXQA | SGPQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 89.6 | 75.6 | 77.7 | 89.1 | 77.0 | 30.9 | 49.9 | 70.0 |
| Claude Sonnet 4 | 91.3 | 78.6 | 79.3 | 92.1 | 80.2 | 33.6 | 56.3 | 73.1 |
| Gemini-2.5-Flash | 84.2 | 73.8 | 73.6 | 91.2 | 77.6 | 35.6 | 53.3 | 69.9 |
| BiomedGPT | – | – | – | – | – | – | – | – |
| Med-R1-2B | 51.5 | 66.2 | 39.1 | 39.9 | 33.6 | 11.2 | 17.9 | 37.0 |
| MedVLM-R1-2B | 51.8 | 66.4 | 39.7 | 42.3 | 33.8 | 11.8 | 19.1 | 37.8 |
| MedGemma-4B-IT | 66.7 | 72.2 | 52.2 | 56.2 | 45.6 | 12.8 | 21.6 | 46.8 |
| LLaVA-Med-7B | 50.6 | 26.4 | 39.4 | 42.0 | 34.4 | 9.9 | 16.1 | 31.3 |
| HuatuoGPT-V-7B | 69.3 | 72.8 | 51.2 | 52.9 | 40.9 | 10.1 | 21.9 | 45.6 |
| BioMediX2-8B | 68.6 | 75.2 | 52.9 | 58.9 | 45.9 | 13.4 | 25.2 | 48.6 |
| Qwen2.5VL-7B | 73.4 | 76.4 | 52.6 | 57.3 | 42.1 | 12.8 | 26.3 | 48.7 |
| InternVL2.5-8B | 74.2 | 76.4 | 52.4 | 53.7 | 42.4 | 11.6 | 26.1 | 48.1 |
| InternVL3-8B | 77.5 | 75.4 | 57.7 | 62.1 | 50.2/42.8 | 13.1 | 31.2 | 51.2 |
| Lingshu-7B | 69.6 | 75.8 | 56.3 | 63.5 | 62.0/53.8 | 16.4 | 27.5 | 53.1 |
| Fleming-VL-8B | 71.8 | 74.0 | 51.8 | 53.7 | 40.5/37.3 | 12.1 | 24.9 | 45.7 |
| Qwen3VL-8B (Baseline) | 79.3 | 70.4 | 60.0 | 66.1 | 56.1/47.7 | 15.1 | 34.7 | 53.6 |
| MedMO-4B (Ours) | 75.7 | 78.0 | 58.0 | 78.5 | 57.5/47.7 | 16.4 | 29.4 | 55.1 |
| MedMO-4B-Next (Ours) | 74.8 | 78.2 | 58.1 | 78.3 | 57.4/47.6 | 16.5 | 29.5 | 55.0 |
| MedMO-8B (Ours) | 81.0 | 77.6 | 65.0 | 84.3 | 66.5/60.2 | 19.9 | 36.0 | 61.3 |
| MedMO-8B-Next (Ours) | 80.2 | 75.6 | 62.0 | 83.8 | 65.2/57.8 | 20.9 | 35.5 | 60.1 (↑+14.4) |

Medical Report Generation

N-gram overlap (ROUGE-L, CIDEr) and model-based (RaTE, Semb) metrics

| Model | MIMIC-CXR (R-L / CIDEr / RaTE / Semb) | CheXpert Plus (R-L / CIDEr / RaTE / Semb) | IU-Xray (R-L / CIDEr / RaTE / Semb) | Med-Trinity (R-L / CIDEr / RaTE / Semb) |
|---|---|---|---|---|
| GPT-4.1 | 9.0 / 82.8 / 51.3 / 23.9 | 24.5 / 78.8 / 45.5 / 23.2 | 30.2 / 124.6 / 51.3 / 47.5 | – |
| Claude Sonnet 4 | 20.0 / 56.6 / 45.6 / 19.7 | 22.0 / 59.5 / 43.5 / 18.9 | 25.4 / 88.3 / 55.4 / 41.0 | – |
| Gemini-2.5-Flash | 25.4 / 80.7 / 50.3 / 29.7 | 23.6 / 72.2 / 44.3 / 27.4 | 33.5 / 129.3 / 55.6 / 50.9 | – |
| Med-R1-2B | 19.3 / 35.4 / 40.6 / 14.8 | 18.6 / 37.1 / 38.5 / 17.8 | 16.1 / 38.3 / 41.4 / 12.5 | – |
| MedVLM-R1-2B | 20.3 / 40.1 / 41.6 / 14.2 | 20.9 / 43.5 / 38.9 / 15.5 | 22.7 / 61.1 / 46.1 / 22.7 | – |
| MedGemma-4B-IT | 25.6 / 81.0 / 52.4 / 29.2 | 27.1 / 79.0 / 47.2 / 29.3 | 30.8 / 103.6 / 57.0 / 46.8 | – |
| LLaVA-Med-7B | 15.0 / 43.4 / 12.8 / 18.3 | 18.4 / 45.5 / 38.8 / 23.5 | 18.8 / 68.2 / 40.9 / 16.0 | – |
| HuatuoGPT-V-7B | 23.4 / 69.5 / 48.9 / 20.0 | 21.3 / 64.7 / 44.2 / 19.3 | 29.6 / 104.3 / 52.9 / 40.7 | – |
| BioMediX2-8B | 20.0 / 52.8 / 44.4 / 17.7 | 18.1 / 47.9 / 40.8 / 21.6 | 19.6 / 58.8 / 40.1 / 11.6 | – |
| Qwen2.5VL-7B | 24.1 / 63.7 / 47.0 / 18.4 | 22.2 / 62.0 / 41.0 / 17.2 | 26.5 / 78.1 / 48.4 / 36.3 | 23.5 / 81.5 / 44.9 / 38.3 |
| InternVL2.5-8B | 23.2 / 61.8 / 47.0 / 21.0 | 20.6 / 58.5 / 43.1 / 19.7 | 24.8 / 75.4 / 51.1 / 36.7 | 13.5 / 47.1 / 42.5 / 12.8 |
| InternVL3-8B | 22.9 / 66.2 / 48.2 / 21.5 | 20.9 / 65.4 / 44.3 / 25.2 | 22.9 / 76.2 / 51.2 / 31.3 | 12.9 / 46.6 / 42.2 / 3.7 |
| Lingshu-7B | 30.8 / 109.4 / 52.1 / 30.0 | 26.5 / 79.0 / 45.4 / 26.8 | 41.2 / 180.7 / 57.6 / 48.4 | 16.0 / 74.5 / 44.4 / 24.0 |
| Fleming-VL-8B | 35.7 / 132.5 / 56.7 / 33.6 | 26.1 / 82.2 / 47.1 / 40.1 | 44.9 / 198.6 / 66.0 / 51.3 | 13.1 / 35.8 / 41.9 / 18.1 |
| Qwen3VL-8B (Baseline) | 25.1 / 77.9 / 50.3 / 33.4 | 21.9 / 67.4 / 44.4 / 37.9 | 25.0 / 91.4 / 52.5 / 42.9 | 20.2 / 69.9 / 45.9 / 33.6 |
| MedMO-4B (Ours) | 26.0 / 92.6 / 49.8 / 31.6 | 15.1 / 62.3 / 36.6 / 34.2 | 26.6 / 94.0 / 42.1 / 41.3 | 22.5 / 152.6 / 47.8 / 34.3 |
| MedMO-4B-Next (Ours) | 28.3 / 96.7 / 52.0 / 34.3 | 23.5 / 74.5 / 42.6 / 38.7 | 38.0 / 147.8 / 62.0 / 49.4 | 26.3 / 183.8 / 49.5 / 38.6 |
| MedMO-8B (Ours) | 31.7 / 140.0 / 57.1 / 50.0 | 23.6 / 87.5 / 47.3 / 42.2 | 31.1 / 169.7 / 45.3 / 41.3 | 37.0 / 270.4 / 53.0 / 39.2 |
| MedMO-8B-Next (Ours) | 32.6 / 143.4 / 57.7 / 51.5 | 25.7 / 88.3 / 48.1 / 43.8 | 31.8 / 171.9 / 56.0 / 43.1 | 38.5 / 272.1 / 53.8 / 40.7 |
Key Result: MedMO-8B-Next achieves CIDEr 143.4 and Semb 51.5 on MIMIC-CXR — best semantic coherence and clinical accuracy. On Med-Trinity (diverse modalities), MedMO-8B-Next dramatically outperforms with CIDEr 272.1 (vs 81.5 for next best open-source).
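Of the reported metrics, ROUGE-L is the simplest to state precisely: it is an F-measure over the longest common subsequence (LCS) between candidate and reference token sequences. A minimal self-contained sketch:

```python
def rouge_l(candidate, reference):
    """ROUGE-L F1 between two token lists via longest common subsequence."""
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            # Extend the LCS on a match, otherwise carry the best prefix score
            dp[i + 1][j + 1] = dp[i][j] + 1 if candidate[i] == reference[j] \
                else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)

# Hypothetical report snippets, whitespace-tokenized for illustration
score = rouge_l("no acute findings".split(),
                "no acute cardiopulmonary findings".split())
```

Because LCS does not require contiguity, ROUGE-L credits clinically meaningful phrases even when a generated report inserts extra words between them.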

Medical Grounding Benchmarks (IoU %)

| Model | NIH | DeepLesion | Bacteria | MedSG (multi_view) | MedSG (object_tracking) | MedSG (referring) | Avg. |
|---|---|---|---|---|---|---|---|
| InternVL3-8B | 10.1 | 0.00 | 0.7 | 6.3 | 13.0 | 3.3 | 5.6 |
| Fleming-VL-8B | 0.00 | 0.00 | 8.3 | 42.0 | 36.7 | 16.6 | 17.2 |
| Lingshu-7B | 5.3 | 0.7 | 10.8 | 28.3 | 38.7 | 10.4 | 13.9 |
| Qwen3VL-8B | 16.4 | 0.00 | 9.16 | 8.4 | 17.8 | 31.4 | 13.8 |
| MedSG-Bench | – | – | – | 55.0 | 62.1 | 60.4 | – |
| MedMO-8B (Ours) | 8.83 | 38.5 | 54.6 | 75.8 | 77.2 | 70.1 | 54.2 |
| MedMO-8B-Next (Ours) | 15.9 | 40.5 | 56.1 | 77.5 | 78.8 | 71.9 | 56.8 (↑+39.6) |

Key Contributions

01

Open-Source Foundation Model

A powerful open-source, post-trained multimodal vision-language model designed for comprehensive medical image understanding and grounding, available in 4B and 8B variants.

02

Scalable Training Pipeline

Curated 26M+ multimodal medical samples from 45 datasets with a multi-stage post-training pipeline that progressively enhances cross-modal alignment.

03

Novel Evaluation Benchmark

Constructed a dedicated Cell dataset from open-source microscopy images with varying sizes, shapes, and densities for evaluating VLM detection capabilities.

04

Comprehensive Analysis

Extensive experiments across data and methodology dimensions, providing an open benchmark for future multimodal medical LLM research.

Unified Multimodal Medical Dataset

Dataset composition covering imaging modalities and biological systems
Composition of the unified multimodal medical dataset comprising diverse imaging modalities (X-ray, CT, MRI, Ultrasound, Nuclear Medicine, Optical, Pathology) and biological systems (Respiratory, Cardiovascular, Nervous, Digestive, Urinary, Musculoskeletal, and more).

Qualitative Results

MedMO demonstrates superior diagnostic accuracy and clinical reasoning

🔬 Dermatology Diagnosis
Question
What is the name of the skin abnormality in this image?
Options: A. Eczema, B. Squamous cell carcinoma, C. Malignant melanoma, D. Melanoma
Other Models (Fleming-VL, Qwen3-VL, Lingshu)
B. Psoriasis ❌
✓ MedMO
B. Squamous cell carcinoma
🦠 Cell Detection & Grounding
Question
Detect and localize all cells in the image.
Ground Truth
[[54,545,63,554]]
Other Models
Fleming-VL: [0,0,999,999] ❌
Qwen3-VL: [31,21,965,957] ❌
✓ MedMO
[53,548,62,557] ✓ (Near-perfect localization)
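Assuming the boxes above are [x1, y1, x2, y2] in normalized 0–999 coordinates (an assumption; the page does not state the convention), the contrast can be checked with plain IoU:

```python
def iou(a, b):
    """Plain IoU for axis-aligned (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

gt    = (54, 545, 63, 554)   # ground-truth cell box from the demo
medmo = (53, 548, 62, 557)   # MedMO prediction, off by a few units
qwen  = (31, 21, 965, 957)   # near-full-image box from the baseline
```

For a cell only ~9 units wide, even a prediction shifted by a couple of units yields a modest absolute IoU, while a near-full-image box earns essentially zero; this is why tiny-object grounding is the harder regime the Bacteria and Cell benchmarks probe.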