Medical Foundation Model

MedMO: Grounding and Understanding
Multimodal LLMs for Medical Images

A powerful open-source medical foundation model that unifies visual grounding, clinical reasoning, and language understanding across diverse medical imaging modalities.

Ankan Deria · Komal Kumar · Adinath Madhavrao Dukre · Eran Segal · Salman Khan · Imran Razzak

Mohamed bin Zayed University of Artificial Intelligence

26M+ Training Samples
45 Medical Datasets
+6.6% VQA over Fleming-VL
+14.4% QA over Fleming-VL
+47.8 Bacteria IoU Gain
4 Training Stages

Abstract

Multimodal large language models have advanced rapidly, but their adoption in medicine is constrained by limited domain coverage, imperfect modality alignment, and insufficient grounded reasoning. We introduce MedMO, a medical multimodal foundation model built on a general MLLM architecture and trained exclusively on large-scale domain-specific data.

MedMO uses a multi-stage training recipe. Cross-modal pretraining aligns heterogeneous visual encoders with a medical language backbone. Instruction tuning applies multi-task supervision spanning captioning, VQA, report generation, retrieval, and bounding-box disease localization. Finally, reinforcement learning with verifiable rewards combines factuality checks with a box-level GIoU signal to improve spatial grounding and step-by-step reasoning in challenging clinical settings.

Across modalities and tasks, MedMO surpasses strong open-source medical baselines. MedMO-8B-Next achieves consistent gains on VQA benchmarks, improving by 6.6% on average over Fleming-VL-8B, including gains of 6.0% on MMMU-Med, 9.8% on PMC-VQA, and 21.3% on MedXpertQA. On text-based QA, it improves by 14.4% over Fleming-VL-8B, driven by gains of 8.4% on MMLU-Med and 30.1% on MedQA. For medical report generation, it improves by 6.7% on MIMIC-CXR. MedMO-8B-Next also demonstrates strong grounding performance, reaching 56.1 IoU on Bacteria, which is a 47.8 IoU gain over Fleming-VL-8B. At smaller scale, MedMO-4B-Next remains competitive and exceeds Fleming-VL-8B across VQA, QA, and report generation. Evaluations spanning radiology, ophthalmology, and pathology microscopy further confirm broad cross-modality generalization.

Performance Comparison

MedMO achieves state-of-the-art results across diverse medical imaging tasks

MedMO Benchmark Performance
Figure. MedMO benchmark performance across multimodal medical tasks, including VQA, text-based QA, report generation, and spatial grounding.

Motivation & Challenges

Addressing critical limitations in existing medical MLLMs

Reliance on Distilled Data

Most existing models rely on distilled data from proprietary models, which often lack accurate domain grounding for fine-grained clinical reasoning.

Hallucination Risks

Distillation pipelines without structured supervision amplify hallucination risks and inconsistencies in medical outputs.

Narrow Modality Coverage

Current models focus on individual tasks or narrow modality subsets rather than achieving unified, cross-modal generalization.

Multi-Stage Training Pipeline

Progressive post-training for comprehensive medical image understanding

MedMO Training Pipeline and Capabilities
1. General Medical SFT: large-scale training for foundational understanding (18.5M samples, 768×768)
2. High-Resolution SFT: spatial localization and fine-grained grounding (3M samples, 1280×1280)
3. Instruction Tuning: human-style medical instruction following (4.3M samples, multi-task)
4. Reinforcement Learning: GRPO with verifiable rewards (300K samples, BBox IoU)
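The reinforcement-learning stage uses GRPO with verifiable rewards. As a rough illustrative sketch (not the actual training code), GRPO samples a group of responses per prompt, scores each with a verifiable reward, and normalizes rewards against the group's own statistics to form advantages:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled response's
    verifiable reward by the group's mean and standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

# Hypothetical example: 4 rollouts for one prompt; each reward could combine
# a factuality check with a box-IoU term, as described for MedMO's RL stage.
rewards = [0.9, 0.2, 0.5, 0.4]
advs = grpo_advantages(rewards)
```

Responses scoring above the group mean get positive advantages and are reinforced; the advantages of a group always sum to zero.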
01

Cross-Modal Pretraining

Align heterogeneous visual encoders with a medical language backbone using DeepStack fusion mechanism.

02

Multi-Task Supervision

Training spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes.

03

Verifiable Rewards

Novel bounding-box GIoU reward combined with factuality checks for enhanced spatial grounding.
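Generalized IoU extends plain IoU with a penalty based on the smallest box enclosing both prediction and target, so non-overlapping boxes still receive a useful gradient signal. A minimal sketch for axis-aligned boxes follows; the exact reward shaping used in MedMO's training is not specified here:

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes (x1, y1, x2, y2).
    Returns a value in (-1, 1]; equals plain IoU when the enclosing
    box coincides with the union of the two boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    enclose = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (enclose - union) / enclose
```

Unlike IoU, which is zero for any disjoint pair, GIoU decreases as boxes drift further apart, which is what makes it usable as a verifiable reward for localization.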

04

Scalable Architecture

Built upon Qwen3-VL with a modular design enabling future expansion across additional modalities.

Benchmark Results

State-of-the-art performance across medical VQA, Text QA, and Grounding tasks

Key Performance Insights

BASELINE (Qwen3-VL-8B)
VQA: 40.5% · Text QA: 53.6%
PREV. SOTA (Fleming-VL-8B)
VQA: 66.1% · Text QA: 45.7%
MedMO-4B (OURS)
VQA: 45.4% · Text QA: 55.1%
MedMO-4B-Next (OURS)
VQA: 68.5% · Text QA: 55.0%
MedMO-8B (OURS)
VQA: 63.2% · Text QA: 61.3%
🏆 MedMO-8B-Next (OURS) — NEW SOTA
VQA: 72.7% · Text QA: 60.1%

MedMO-8B-Next achieves new state-of-the-art among open-source models, surpassing Fleming-VL-8B by +6.6% on VQA and +14.4% on Text QA, with gains of 6.0% on MMMU-Med, 9.8% on PMC-VQA, 21.3% on MedXpertQA, and 47.8 IoU on Bacteria grounding. MedMO-4B-Next also exceeds Fleming-VL-8B across VQA, QA, and report generation at smaller scale.

Medical VQA Benchmarks

| Model | MMMU-Med | VQA-RAD (closed/all) | SLAKE (closed/all) | PathVQA (all) | PMC-VQA | OMVQA | MedXQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 75.2 | 65.0 | 72.2 | 55.5 | 55.2 | 75.5 | 45.2 | 63.4 |
| Claude Sonnet 4 | 74.6 | 67.6 | 70.6 | 54.2 | 54.4 | 65.5 | 43.3 | 61.5 |
| Gemini-2.5-Flash | 76.9 | 68.5 | 75.8 | 55.4 | 55.4 | 71.0 | 52.8 | 65.1 |
| BiomedGPT | 24.9 | 16.6 | 13.6 | 11.3 | 27.6 | 27.9 | – | – |
| Med-R1-2B | 34.8 | 39.0 | 54.5 | 15.3 | 47.4 | – | 21.1 | – |
| MedVLM-R1-2B | 35.2 | 48.6 | 56.0 | 32.5 | 47.6 | 77.7 | 20.4 | 45.4 |
| MedGemma-4B-IT | 43.7 | 72.5 | 76.4 | 48.8 | 49.9 | 69.8 | 22.3 | 54.8 |
| LLaVA-Med-7B | 29.3 | 53.7 | 48.0 | 38.8 | 30.5 | 44.3 | 20.3 | 37.8 |
| HuatuoGPT-V-7B | 47.3 | 67.0 | 67.8 | 48.0 | 53.3 | 74.2 | 21.6 | 54.2 |
| BioMediX2-8B | 39.8 | 49.2 | 57.7 | 37.0 | 43.5 | 63.3 | 21.8 | 44.6 |
| Qwen2.5VL-7B | 50.6 | 64.5 | 67.2 | 44.1 | 51.9 | 63.6 | 22.3 | 52.0 |
| InternVL2.5-8B | 53.5 | 59.4 | 69.0 | 42.1 | 51.3 | 81.3 | 21.7 | 54.0 |
| InternVL3-8B | 59.2 | 76.4/52.9 | 72.1/62.4 | 39.0 | 53.8 | 79.1 | 22.4 | 57.4 |
| Lingshu-7B | 54.0 | 77.2/43.0 | 82.4/33.2 | 41.9 | 54.2 | 82.9 | 26.9 | 55.1 |
| Fleming-VL-8B | 63.3 | 78.4/56.4 | 86.9/80.0 | 56.5 | 64.3 | 88.2 | 21.6 | 66.1 |
| Qwen3VL-8B (Baseline) | 61.4 | 54.1/31.2 | 34.3/15.0 | 14.6 | 52.3 | 77.2 | 24.8 | 40.5 |
| MedMO-4B (Ours) | 54.6 | 50.9/35.0 | 41.0/30.0 | 42.4 | 50.6 | 79.7 | 24.8 | 45.4 |
| MedMO-4B-Next (Ours) | 58.7 | 79.7/59.6 | 78.0/74.0 | 73.3 | 75.7 | 90.6 | 27.0 | 68.5 |
| MedMO-8B (Ours) | 64.6 | 72.3/64.7 | 70.6/70.0 | 56.3 | 59.4 | 84.8 | 26.2 | 63.2 |
| MedMO-8B-Next (Ours) | 69.3 | 86.4/68.0 | 83.0/81.6 | 56.3 | 74.1 | 93.3 | 42.9 | 72.7 (↑+6.6) |

Medical Text QA Benchmarks

| Model | MMLU-Med | PubMedQA | MedMCQA | MedQA | Medbullets (op4/op5) | MedXQA | SGPQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 89.6 | 75.6 | 77.7 | 89.1 | 77.0 | 30.9 | 49.9 | 70.0 |
| Claude Sonnet 4 | 91.3 | 78.6 | 79.3 | 92.1 | 80.2 | 33.6 | 56.3 | 73.1 |
| Gemini-2.5-Flash | 84.2 | 73.8 | 73.6 | 91.2 | 77.6 | 35.6 | 53.3 | 69.9 |
| BiomedGPT | – | – | – | – | – | – | – | – |
| Med-R1-2B | 51.5 | 66.2 | 39.1 | 39.9 | 33.6 | 11.2 | 17.9 | 37.0 |
| MedVLM-R1-2B | 51.8 | 66.4 | 39.7 | 42.3 | 33.8 | 11.8 | 19.1 | 37.8 |
| MedGemma-4B-IT | 66.7 | 72.2 | 52.2 | 56.2 | 45.6 | 12.8 | 21.6 | 46.8 |
| LLaVA-Med-7B | 50.6 | 26.4 | 39.4 | 42.0 | 34.4 | 9.9 | 16.1 | 31.3 |
| HuatuoGPT-V-7B | 69.3 | 72.8 | 51.2 | 52.9 | 40.9 | 10.1 | 21.9 | 45.6 |
| BioMediX2-8B | 68.6 | 75.2 | 52.9 | 58.9 | 45.9 | 13.4 | 25.2 | 48.6 |
| Qwen2.5VL-7B | 73.4 | 76.4 | 52.6 | 57.3 | 42.1 | 12.8 | 26.3 | 48.7 |
| InternVL2.5-8B | 74.2 | 76.4 | 52.4 | 53.7 | 42.4 | 11.6 | 26.1 | 48.1 |
| InternVL3-8B | 77.5 | 75.4 | 57.7 | 62.1 | 50.2/42.8 | 13.1 | 31.2 | 51.2 |
| Lingshu-7B | 69.6 | 75.8 | 56.3 | 63.5 | 62.0/53.8 | 16.4 | 27.5 | 53.1 |
| Fleming-VL-8B | 71.8 | 74.0 | 51.8 | 53.7 | 40.5/37.3 | 12.1 | 24.9 | 45.7 |
| Qwen3VL-8B (Baseline) | 79.3 | 70.4 | 60.0 | 66.1 | 56.1/47.7 | 15.1 | 34.7 | 53.6 |
| MedMO-4B (Ours) | 75.7 | 78.0 | 58.0 | 78.5 | 57.5/47.7 | 16.4 | 29.4 | 55.1 |
| MedMO-4B-Next (Ours) | 74.8 | 78.2 | 58.1 | 78.3 | 57.4/47.6 | 16.5 | 29.5 | 55.0 |
| MedMO-8B (Ours) | 81.0 | 77.6 | 65.0 | 84.3 | 66.5/60.2 | 19.9 | 36.0 | 61.3 |
| MedMO-8B-Next (Ours) | 80.2 | 75.6 | 62.0 | 83.8 | 65.2/57.8 | 20.9 | 35.5 | 60.1 (↑+14.4) |

Medical Report Generation

N-gram overlap (ROUGE-L, CIDEr) and model-based (RaTE, Semb) metrics

| Model | MIMIC-CXR (R-L / CIDEr / RaTE / Semb) | CheXpert Plus (R-L / CIDEr / RaTE / Semb) | IU-Xray (R-L / CIDEr / RaTE / Semb) | Med-Trinity (R-L / CIDEr / RaTE / Semb) |
|---|---|---|---|---|
| GPT-4.1 | 9.0 / 82.8 / 51.3 / 23.9 | 24.5 / 78.8 / 45.5 / 23.2 | 30.2 / 124.6 / 51.3 / 47.5 | – |
| Claude Sonnet 4 | 20.0 / 56.6 / 45.6 / 19.7 | 22.0 / 59.5 / 43.5 / 18.9 | 25.4 / 88.3 / 55.4 / 41.0 | – |
| Gemini-2.5-Flash | 25.4 / 80.7 / 50.3 / 29.7 | 23.6 / 72.2 / 44.3 / 27.4 | 33.5 / 129.3 / 55.6 / 50.9 | – |
| Med-R1-2B | 19.3 / 35.4 / 40.6 / 14.8 | 18.6 / 37.1 / 38.5 / 17.8 | 16.1 / 38.3 / 41.4 / 12.5 | – |
| MedVLM-R1-2B | 20.3 / 40.1 / 41.6 / 14.2 | 20.9 / 43.5 / 38.9 / 15.5 | 22.7 / 61.1 / 46.1 / 22.7 | – |
| MedGemma-4B-IT | 25.6 / 81.0 / 52.4 / 29.2 | 27.1 / 79.0 / 47.2 / 29.3 | 30.8 / 103.6 / 57.0 / 46.8 | – |
| LLaVA-Med-7B | 15.0 / 43.4 / 12.8 / 18.3 | 18.4 / 45.5 / 38.8 / 23.5 | 18.8 / 68.2 / 40.9 / 16.0 | – |
| HuatuoGPT-V-7B | 23.4 / 69.5 / 48.9 / 20.0 | 21.3 / 64.7 / 44.2 / 19.3 | 29.6 / 104.3 / 52.9 / 40.7 | – |
| BioMediX2-8B | 20.0 / 52.8 / 44.4 / 17.7 | 18.1 / 47.9 / 40.8 / 21.6 | 19.6 / 58.8 / 40.1 / 11.6 | – |
| Qwen2.5VL-7B | 24.1 / 63.7 / 47.0 / 18.4 | 22.2 / 62.0 / 41.0 / 17.2 | 26.5 / 78.1 / 48.4 / 36.3 | 23.5 / 81.5 / 44.9 / 38.3 |
| InternVL2.5-8B | 23.2 / 61.8 / 47.0 / 21.0 | 20.6 / 58.5 / 43.1 / 19.7 | 24.8 / 75.4 / 51.1 / 36.7 | 13.5 / 47.1 / 42.5 / 12.8 |
| InternVL3-8B | 22.9 / 66.2 / 48.2 / 21.5 | 20.9 / 65.4 / 44.3 / 25.2 | 22.9 / 76.2 / 51.2 / 31.3 | 12.9 / 46.6 / 42.2 / 3.7 |
| Lingshu-7B | 30.8 / 109.4 / 52.1 / 30.0 | 26.5 / 79.0 / 45.4 / 26.8 | 41.2 / 180.7 / 57.6 / 48.4 | 16.0 / 74.5 / 44.4 / 24.0 |
| Fleming-VL-8B | 35.7 / 132.5 / 56.7 / 33.6 | 26.1 / 82.2 / 47.1 / 40.1 | 44.9 / 198.6 / 66.0 / 51.3 | 13.1 / 35.8 / 41.9 / 18.1 |
| Qwen3VL-8B (Baseline) | 25.1 / 77.9 / 50.3 / 33.4 | 21.9 / 67.4 / 44.4 / 37.9 | 25.0 / 91.4 / 52.5 / 42.9 | 20.2 / 69.9 / 45.9 / 33.6 |
| MedMO-4B (Ours) | 26.0 / 92.6 / 49.8 / 31.6 | 15.1 / 62.3 / 36.6 / 34.2 | 26.6 / 94.0 / 42.1 / 41.3 | 22.5 / 152.6 / 47.8 / 34.3 |
| MedMO-4B-Next (Ours) | 28.3 / 96.7 / 52.0 / 34.3 | 23.5 / 74.5 / 42.6 / 38.7 | 38.0 / 147.8 / 62.0 / 49.4 | 26.3 / 183.8 / 49.5 / 38.6 |
| MedMO-8B (Ours) | 31.7 / 140.0 / 57.1 / 50.0 | 23.6 / 87.5 / 47.3 / 42.2 | 31.1 / 169.7 / 45.3 / 41.3 | 37.0 / 270.4 / 53.0 / 39.2 |
| MedMO-8B-Next (Ours) | 32.6 / 143.4 / 57.7 / 51.5 | 25.7 / 88.3 / 48.1 / 43.8 | 31.8 / 171.9 / 56.0 / 43.1 | 38.5 / 272.1 / 53.8 / 40.7 |
Key Result: MedMO-8B-Next achieves CIDEr 143.4 and Semb 51.5 on MIMIC-CXR — best semantic coherence and clinical accuracy. On Med-Trinity (diverse modalities), MedMO-8B-Next dramatically outperforms with CIDEr 272.1 (vs 81.5 for next best open-source).
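Of the reported metrics, ROUGE-L is the simplest to state precisely: it is an F-measure over the longest common subsequence (LCS) between candidate and reference token sequences. A minimal self-contained sketch:

```python
def rouge_l(candidate, reference):
    """ROUGE-L F1 between two token lists via longest common subsequence."""
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            # Extend the LCS on a match, otherwise carry the best prefix score
            dp[i + 1][j + 1] = dp[i][j] + 1 if candidate[i] == reference[j] \
                else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)

# Hypothetical report snippets, whitespace-tokenized for illustration
score = rouge_l("no acute findings".split(),
                "no acute cardiopulmonary findings".split())
```

Because LCS does not require contiguity, ROUGE-L credits clinically meaningful phrases even when a generated report inserts extra words between them.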

Medical Grounding Benchmarks (IoU %)

| Model | NIH | DeepLesion | Bacteria | MedSG (multi_view) | MedSG (object_tracking) | MedSG (referring) | Avg. |
|---|---|---|---|---|---|---|---|
| InternVL3-8B | 10.1 | 0.00 | 0.7 | 6.3 | 13.0 | 3.3 | 5.6 |
| Fleming-VL-8B | 0.00 | 0.00 | 8.3 | 42.0 | 36.7 | 16.6 | 17.2 |
| Lingshu-7B | 5.3 | 0.7 | 10.8 | 28.3 | 38.7 | 10.4 | 13.9 |
| Qwen3VL-8B | 16.4 | 0.00 | 9.16 | 8.4 | 17.8 | 31.4 | 13.8 |
| MedSG-Bench | – | – | – | 55.0 | 62.1 | 60.4 | – |
| MedMO-8B (Ours) | 8.83 | 38.5 | 54.6 | 75.8 | 77.2 | 70.1 | 54.2 |
| MedMO-8B-Next (Ours) | 15.9 | 40.5 | 56.1 | 77.5 | 78.8 | 71.9 | 56.8 (↑+39.6) |

Key Contributions

01

Open-Source Foundation Model

A powerful open-source, post-trained multimodal vision-language model designed for comprehensive medical image understanding and grounding, available in 4B and 8B variants.

02

Scalable Training Pipeline

Curated 26M+ multimodal medical samples from 45 datasets with a multi-stage post-training pipeline that progressively enhances cross-modal alignment.

03

Novel Evaluation Benchmark

Constructed a dedicated Cell dataset from open-source microscopy images with varying sizes, shapes, and densities for evaluating VLM detection capabilities.

04

Comprehensive Analysis

Extensive experiments across data and methodology dimensions, providing an open benchmark for future multimodal medical LLM research.

Unified Multimodal Medical Dataset

Dataset composition covering imaging modalities and biological systems
Composition of the unified multimodal medical dataset comprising diverse imaging modalities (X-ray, CT, MRI, Ultrasound, Nuclear Medicine, Optical, Pathology) and biological systems (Respiratory, Cardiovascular, Nervous, Digestive, Urinary, Musculoskeletal, and more).

Qualitative Results

MedMO demonstrates superior diagnostic accuracy and clinical reasoning

🔬 Dermatology Diagnosis
Question
What is the name of the skin abnormality in this image?
Options: A. Eczema, B. Squamous cell carcinoma, C. Malignant melanoma, D. Melanoma
Other Models (Fleming-VL, Qwen3-VL, Lingshu)
B. Psoriasis ❌
✓ MedMO
B. Squamous cell carcinoma
🦠 Cell Detection & Grounding
Question
Detect and localize all cells in the image.
Ground Truth
[[54,545,63,554]]
Other Models
Fleming-VL: [0,0,999,999] ❌
Qwen3-VL: [31,21,965,957] ❌
✓ MedMO
[53,548,62,557] ✓ (Near-perfect localization)
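Assuming the boxes above are [x1, y1, x2, y2] in normalized 0–999 coordinates (an assumption; the page does not state the convention), the contrast can be checked with plain IoU:

```python
def iou(a, b):
    """Plain IoU for axis-aligned (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

gt    = (54, 545, 63, 554)   # ground-truth cell box from the demo
medmo = (53, 548, 62, 557)   # MedMO prediction, off by a few units
qwen  = (31, 21, 965, 957)   # near-full-image box from the baseline
```

For a cell only ~9 units wide, even a prediction shifted by a couple of units yields a modest absolute IoU, while a near-full-image box earns essentially zero; this is why tiny-object grounding is the harder regime the Bacteria and Cell benchmarks probe.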