MAmmoTH-VL:
Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Carnegie Mellon University, M-A-P, Nanyang Technological University, University of Waterloo, The University of Manchester
†Corresponding to: xyue2@andrew.cmu.edu

Abstract

Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simple tasks and provide only phrase-level answers without intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit chain-of-thought (CoT) reasoning. Using only open models, we create a dataset of 12M instruction-response pairs covering diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.


Figure 1: Top: overview of our simple but scalable visual instruction data rewriting pipeline, comprising three main steps. First, we manually collect and identify potential data sources. Second, we rewrite the original instruction data using open MLLMs and LLMs. Finally, we employ the same MLLM as a judge to filter the data. Bottom: examples comparing pre- and post-rewriting results in two categories, Math and Caption, demonstrating how the pipeline transforms basic questions into detailed, step-by-step responses.

Data Generation Pipeline

While previous efforts have highlighted the potential of visual instruction tuning, many rely on resource-intensive methods such as human annotations or proprietary models. These approaches limit scalability and accessibility, particularly in open-source contexts. To address these challenges, we introduce a simple, scalable, and cost-effective data generation pipeline that produces 12 million high-quality samples. Our pipeline involves three key steps, with a minimal code sketch after the list:

(1) open-source data collection and categorization

(2) task-specific data augmentation and rewriting using open models

(3) quality filtering to remove hallucinated or irrelevant content
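To make steps (2) and (3) concrete, the following is a minimal Python sketch of the rewriting and self-filtering loop. It assumes the open MLLM is served behind an OpenAI-compatible endpoint (e.g., a local vLLM server); the model name, prompts, and judging criterion are illustrative placeholders rather than the exact ones used to build MAmmoTH-VL-Instruct.

```python
# Minimal sketch of steps (2) and (3): rewrite an instruction-response pair with an
# open MLLM, then use the same model as a judge to filter out unfaithful rewrites.
# Assumes an OpenAI-compatible server (e.g., vLLM) hosting an open multimodal model;
# the model name, prompts, and endpoint below are illustrative, not the paper's exact setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen/Qwen2-VL-72B-Instruct"  # placeholder open MLLM

def rewrite(question: str, short_answer: str, image_url: str) -> str:
    """Expand a phrase-level QA pair into a detailed, step-by-step (CoT) response."""
    prompt = (
        "Rewrite the answer to the question below as a detailed, step-by-step "
        "explanation grounded in the image. Keep the final answer unchanged.\n"
        f"Question: {question}\nOriginal answer: {short_answer}"
    )
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        temperature=0.7,
    )
    return resp.choices[0].message.content

def keep(question: str, rewritten: str, image_url: str) -> bool:
    """Self-filter: ask the same MLLM to judge whether the rewrite is faithful."""
    judge_prompt = (
        "Does the response below correctly and faithfully answer the question about "
        "the image, without hallucinated or irrelevant content? Reply YES or NO.\n"
        f"Question: {question}\nResponse: {rewritten}"
    )
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": judge_prompt},
            ],
        }],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```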


Figure 2: The data distribution of MAmmoTH-VL-Instruct (12M). Left: Category distribution. Right: Details of data sources.

Distribution Comparison

To analyze the distributional differences between the original and rewritten data, we randomly sampled 80,000 examples from the dataset before and after rewriting and used t-SNE to project the instructions onto a two-dimensional plot (Figure 3). The figure reveals two key takeaways:

(1) The rewritten data overlaps substantially with the original data, indicating that it retains the core characteristics and foundational structure of the original distribution.

(2) The rewritten data also extends beyond the boundaries of the original distribution, introducing new variations; rewriting thus broadens the dataset's scope while preserving its original essence.

Based on this observation, during the experimental validation phase, we utilize a mixed dataset consisting of 70% rewritten data and 30% original data to train the model.
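The distribution comparison above can be reproduced with a short script. The sketch below is a minimal version, assuming the instructions are embedded with an off-the-shelf sentence encoder; the specific encoder, sampling details, and plot styling are illustrative choices, not necessarily those used to produce Figure 3.

```python
# Sketch of the distribution comparison: embed sampled instructions from the
# original and rewritten data, project with t-SNE, and plot both sets together.
# The sentence encoder and sampling details are illustrative assumptions.
import random

import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

def tsne_compare(original: list[str], rewritten: list[str], n: int = 80_000) -> None:
    orig = random.sample(original, min(n, len(original)))
    rewr = random.sample(rewritten, min(n, len(rewritten)))

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder
    emb = encoder.encode(orig + rewr, batch_size=256, show_progress_bar=True)

    xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(emb)
    k = len(orig)
    plt.scatter(xy[:k, 0], xy[:k, 1], s=1, alpha=0.3, label="original")
    plt.scatter(xy[k:, 0], xy[k:, 1], s=1, alpha=0.3, label="rewritten")
    plt.legend()
    plt.savefig("tsne_original_vs_rewritten.png", dpi=200)
```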


Figure 3: The t-SNE data distribution plot demonstrates how the rewritten data expands beyond the original dataset, increasing topic diversity and enhancing coverage of complex queries and reasoning.

Overall Results

To demonstrate the generality and effectiveness of the model, we comprehensively evaluate it across different scenarios, including single-image, multi-image, and video benchmarks. Detailed results are presented in Table 1, Table 2, and Table 3, respectively. We denote the checkpoints after the single-image stage and the one-vision stage as MAmmoTH-VL-8B (SI) and MAmmoTH-VL-8B, respectively. We conduct standardized, reproducible evaluations of our model across all 23 benchmarks using LMMs-Eval. To ensure a fair comparison with other MLLMs, we primarily report results from the original papers; when results are unavailable, we onboard the models in LMMs-Eval and evaluate them under consistent settings. All results are obtained with greedy decoding in a zero-shot setting unless otherwise specified.

| Model | MMStar (test) | MMMU (val) | MMMU-Pro (vision) | SeedBench (test) | MMBench (en-test) | MMVet (test) | MathVerse (mini-vision) | MathVista (testmini) |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 64.7 | 69.1 | 49.7 | 76.2 | 82.1 | 76.2 | 50.2 | 63.8 |
| Gemini-1.5-Pro | 59.1 | 65.8 | 44.4 | 76.0 | 73.9 | 64.0 | - | 63.9 |
| Claude-3.5-Sonnet | 62.2 | 68.3 | 48.0 | 72.2 | 79.7 | 75.4 | - | 67.7 |
| InternVL2-LLaMa3-76B | 67.1 | 58.2 | 38.0 | 77.6 | 86.5 | 64.4 | - | 65.5 |
| Qwen2-VL-72B-Ins | 68.6 | 64.5 | 37.1 | 77.9 | 86.9 | 73.9 | 37.3 | 70.5 |
| LLaVA-OV-72B (SI) | 65.2 | 57.4 | 26.0 | 77.6 | 86.6 | 60.0 | 37.7 | 66.5 |
| LLaVA-OV-72B | 66.1 | 56.8 | 24.0 | 78.0 | 85.9 | 63.7 | 39.1 | 67.5 |
| MiniCPM-V-2.6-7B | 57.5 | 49.8 | 21.7 | 74.0 | 81.5 | 60.0 | - | 60.6 |
| InternLM-XComp-2.5-7B | 59.9 | 42.9 | - | 75.4 | 74.4 | 51.7 | 20.0 | 59.6 |
| Llama-3.2-11B-Vision-Ins | 49.8 | 50.7 | 23.7 | 72.7 | 73.2 | 57.6 | 23.6 | 51.5 |
| InternVL-2-8B | 59.4 | 49.3 | 25.4 | 76.0 | 81.7 | 60.0 | 27.5 | 58.3 |
| Qwen2-VL-7B-Ins | 60.7 | 52.1 | 26.9 | 74.3 | 83.0 | 62.0 | 28.2 | 58.2 |
| Cambrian-1-8B | - | 42.7 | 14.7 | 73.3 | 74.6 | 48.0 | - | 49.0 |
| Llava-CoT-11B | 57.6 | 48.9 | 18.5 | 75.2 | 75.0 | 60.3 | 24.2 | 54.8 |
| Molmo-7B-D | 50.5 | 45.3 | 18.9 | 74.1 | 73.6 | 58.0 | 21.5 | 51.6 |
| LLaVA-OV-7B (SI) | 60.9 | 47.3 | 16.8 | 74.8 | 80.5 | 58.8 | 26.9 | 56.1 |
| LLaVA-OV-7B | 61.7 | 48.8 | 18.7 | 75.4 | 80.8 | 58.6 | 26.2 | 63.2 |
| MAmmoTH-VL-8B (SI) | 55.4 | 49.4 | 26.0 | 73.3 | 83.0 | 60.6 | 35.0 | 67.6 |
| MAmmoTH-VL-8B | 63.0 | 50.8 | 25.3 | 76.0 | 83.4 | 62.3 | 34.2 | 67.6 |
| ∆ Over Best Open-Source (~10B Scale) | +1.3 | +1.9 | +7.1 | +0.6 | +2.6 | +2.0 | +8.1 | +4.4 |
Table 1: Main results on Multi-discipline Knowledge and Mathematical Reasoning benchmarks. Results are taken from official papers or blogs when available; otherwise, we use lmms-eval for evaluation. Models highlighted in gray are closed-source; those in purple-blue release model weights but not their training data or code; and those in green are fully open-source, with weights, data, and code all available.
| Model | AI2D (test) | ChartQA (test) | InfoVQA (test) | DocVQA (test) | RealWorldQA (test) | WildVision (0617) | L-Wilder (small) |
|---|---|---|---|---|---|---|---|
| GPT-4o | 94.2 | 85.7 | 79.2 | 92.8 | 76.5 | 89.4 | 85.9 |
| Gemini-1.5-Pro | 94.4 | 87.2 | 81.0 | 93.1 | 70.4 | - | - |
| Claude-3.5-Sonnet | 94.7 | 90.8 | 49.7 | 95.2 | 59.9 | 50.0 | 83.1 |
| InternVL2-LLaMa3-76B | 88.4 | 88.4 | 82.0 | 94.1 | 72.7 | - | - |
| Qwen2-VL-72B-Ins | 88.1 | 88.3 | 84.5 | 96.5 | 77.8 | - | - |
| LLaVA-OV-72B (SI) | 85.1 | 84.9 | 74.6 | 91.8 | 73.8 | 49.5 | 72.9 |
| LLaVA-OV-72B | 85.6 | 83.7 | 74.9 | 91.3 | 71.9 | 52.3 | 72.0 |
| MiniCPM-V-2.6-7B | 82.1 | 82.4 | - | 90.8 | 65.0 | 11.7 | - |
| InternLM-XComp-2.5-7B | 81.5 | 82.2 | 70.0 | 90.9 | 67.8 | - | 61.4 |
| Llama-3.2-11B-Vision-Ins | 77.3 | 83.4 | 65.0 | 88.4 | 63.3 | 49.7 | 62.0 |
| InternVL-2-8B | 83.8 | 83.3 | 74.8 | 91.6 | 64.4 | 51.5 | 62.5 |
| Qwen2-VL-7B-Ins | 83.0 | 83.0 | 76.5 | 94.5 | 70.1 | 44.0 | 66.3 |
| Cambrian-1-8B | 73.3 | 73.3 | 41.6 | 77.8 | 64.2 | - | - |
| Llava-CoT-11B | - | 67.0 | 44.8 | - | - | - | 65.3 |
| Molmo-7B-D | 81.0 | 84.1 | 72.6 | 92.2 | 70.7 | 40.0 | - |
| LLaVA-OV-7B (SI) | 81.6 | 78.8 | 65.3 | 86.9 | 65.5 | 39.2 | 69.1 |
| LLaVA-OV-7B | 81.4 | 80.0 | 68.8 | 87.5 | 66.3 | 53.8 | 67.8 |
| MAmmoTH-VL-8B (SI) | 83.4 | 85.9 | 74.8 | 93.8 | 71.3 | 51.9 | 71.3 |
| MAmmoTH-VL-8B | 84.0 | 86.2 | 73.1 | 93.7 | 69.9 | 51.1 | 70.8 |
| ∆ Over Best Open-Source (~10B Scale) | +2.4 | +2.1 | +2.2 | +1.6 | +0.6 | -1.9 | +2.2 |
Table 2: Main results on Chart, Diagram, and Document Understanding, and Real-world Multimodal Interactions and Human Preferences benchmarks.
| Model | MuirBench (test) | MEGABench (test) | EgoSchema (test) | PerceptionTest (test) | SeedBench (video) | MLVU (dev) | MVBench (test) | VideoMME (w/o subs) |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 68.0 | 54.2 | - | - | - | 64.6 | - | 71.9 |
| GPT-4v | 62.3 | - | - | - | 60.5 | 49.2 | 43.5 | 59.9 |
| LLaVA-OV-72B (SI) | 33.2 | - | 58.6 | 62.3 | 60.9 | 60.9 | 57.1 | 64.8 |
| LLaVA-OV-72B | 54.8 | 33.8 | 62.0 | 66.9 | 62.1 | 66.4 | 59.4 | 66.2 |
| InternVL-2-8B | 59.4 | 27.7 | 54.2 | 57.4 | 54.9 | 30.2 | 66.4 | 54.0 |
| Qwen2-VL-7B-Ins | 41.6 | 36.0 | 66.7 | 62.3 | 55.3 | 58.6 | 67.0 | 63.3 |
| LLaVA-OV-7B (SI) | 32.7 | 22.1 | 52.9 | 54.9 | 51.1 | 60.2 | 51.2 | 55.0 |
| LLaVA-OV-7B | 41.8 | 23.9 | 60.1 | 57.1 | 56.9 | 64.7 | 56.7 | 58.2 |
| MAmmoTH-VL-8B | 55.1 | 28.2 | 58.5 | 59.3 | 57.1 | 64.7 | 59.1 | 58.8 |
| ∆ Over Best Open-Source (~10B Scale) | +13.3 | +4.3 | -1.6 | +2.2 | +0.2 | +0.0 | +2.4 | +0.6 |
Table 3: Main results on Multi-Image and Video benchmarks.
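For reference, the "∆ Over Best Open-Source (~10B Scale)" rows report the gap between the best MAmmoTH-VL-8B checkpoint and the strongest fully open-source ~10B-scale baseline on each benchmark, which is consistent with the numbers in the tables. The sketch below illustrates that computation on the MathVerse (mini-vision) column of Table 1; the baseline subset shown is illustrative.

```python
# Sketch of how the "∆ Over Best Open-Source (~10B Scale)" rows are derived:
# take the best MAmmoTH-VL-8B score and subtract the best score among the
# fully open-source ~10B-scale baselines. Numbers are the MathVerse
# (mini-vision) column of Table 1; the baseline set shown is illustrative.
open_source_baselines = {
    "Cambrian-1-8B": None,       # not reported
    "Llava-CoT-11B": 24.2,
    "Molmo-7B-D": 21.5,
    "LLaVA-OV-7B (SI)": 26.9,
    "LLaVA-OV-7B": 26.2,
}
ours = {"MAmmoTH-VL-8B (SI)": 35.0, "MAmmoTH-VL-8B": 34.2}

best_baseline = max(v for v in open_source_baselines.values() if v is not None)
best_ours = max(ours.values())
delta = best_ours - best_baseline
print(f"Delta on MathVerse: +{delta:.1f}")  # +8.1, matching Table 1
```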

Effect of Data Scale

As shown in Figure 4, we track performance across benchmarks as the training dataset grows in 2-million-sample increments, comparing against three strong models: LLaVA-OneVision-7B, LLaVA-OneVision-72B, and LLaVA-CoT. The results show a positive correlation between training data scale and performance, indicating that diverse instruction data improves the model's ability to handle complex tasks.
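A minimal sketch of how such a scaling curve can be produced is shown below, assuming nested random subsets at 2M-sample intervals; the `train` and `evaluate` helpers in the commented loop are hypothetical placeholders, not part of the released code.

```python
# Sketch of the data-scaling study: build nested subsets of the instruction data at
# 2M-sample intervals and train/evaluate one checkpoint per subset. Nested random
# subsets and the train/evaluate helpers are assumptions for illustration.
import random

def scaling_subsets(dataset: list, step: int = 2_000_000) -> list[list]:
    shuffled = dataset[:]
    random.shuffle(shuffled)
    return [shuffled[:n] for n in range(step, len(shuffled) + 1, step)]

# for subset in scaling_subsets(instruction_data):
#     checkpoint = train(subset)                 # hypothetical training helper
#     scores = evaluate(checkpoint, benchmarks)  # hypothetical LMMs-Eval wrapper
```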


Figure 4: Scaling effects of MAmmoTH-VL-8B on eight multimodal evaluation datasets. A simple rewriting approach using open models improves the quality of visual instruction data by eliciting chain-of-thought (CoT) reasoning, and training on this rewritten data yields significant performance gains as the data scale increases. LLaVA-OneVision-7B, LLaVA-OneVision-72B, and LLaVA-CoT are included as references.

Effect of Rewrite Model

To assess the effect of model size on the quality of rewritten data, we conduct experiments with four models, each trained on 500K samples. The first model is trained on the original dataset. The second is trained on data rewritten by InternVL2-Llama3-76B and Meta-Llama-3-70B-Instruct; the third on data rewritten by Qwen2-VL-7B-Instruct and Qwen2.5-7B-Instruct; and the fourth on data rewritten by InternVL2-8B and InternLM2.5-7B. Among these, Qwen2.5-72B-Instruct, Qwen2.5-7B-Instruct, and InternLM2.5-7B are employed solely for rewriting caption data, while InternVL2-Llama3-76B, Qwen2-VL-7B-Instruct, and InternVL2-8B are used for data filtering.

As shown in Figure 5, our analysis reveals distinct patterns across task categories. For knowledge and reasoning tasks, models trained on data rewritten by smaller models (roughly 7B parameters) achieve performance comparable to those trained on larger-model rewrites. However, the impact of rewriting varies significantly by task type: for chart- and document-related tasks, rewriting with smaller models actually degrades performance, while larger models provide modest improvements, suggesting that the stronger visual understanding of larger models is crucial for rewriting such data effectively. In contrast, multimodal interaction and preference tasks show a clear correlation with rewriter scale, with larger models excelling at these scenarios, which demand subtle understanding and nuanced preference modeling.


Figure 5: Performance of data rewritten by different models on three benchmark subsets.

Reference

Please kindly cite our paper if you use our code, data, models or results:


@article{guo2024mammothvlelicitingmultimodalreasoning,
    title={MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale}, 
    author={Jarvis Guo and Tuney Zheng and Yuelin Bai and Bo Li and Yubo Wang and King Zhu and Yizhi Li and Graham Neubig and Wenhu Chen and Xiang Yue},
    year={2024},
    eprint={2412.05237},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2412.05237}, 
}