Awesome Data-Model Co-Development of MLLMs
Welcome to the “Awesome List” for data-model co-development of Multi-Modal Large Language Models (MLLMs), a continually updated resource tailored for the open-source community. This compilation features cutting-edge research and insightful articles on improving MLLMs through data-model co-development, with each entry tagged according to the taxonomy proposed in our data-model co-development survey, as illustrated below.
Due to the rapid development in the field, this repository and our paper are continuously being updated and synchronized with each other. Please feel free to make pull requests or open issues to contribute to this list and add more related resources!
News
🎉 [2025-06-04] Our Data-Model Co-development Survey has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)! Welcome to explore and contribute to this awesome list.
[2025-05-25] We added 20 academic papers related to this survey.
[2024-10-23] We built a dynamic table based on the paper list that supports filtering and searching.
[2024-10-22] We restructured our paper list to provide more streamlined information.
Paper List
Below is the paper list summarized from our survey. Additionally, we provide a dynamic table that supports filtering and searching, whose data source is the same as the list below.
| Title | Tags |
|---|---|
| No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance | |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | |
| Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain | |
| Probing Heterogeneous Pretraining Datasets with Small Curated Datasets | |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | |
| VideoChat: Chat-Centric Video Understanding | |
| Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex | |
| 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding | |
| GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting | |
| Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | |
| Audio Retrieval with WavText5K and CLAP Training | |
| The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering | |
| Demystifying CLIP Data | |
| Learning Transferable Visual Models From Natural Language Supervision | |
| DataComp: In search of the next generation of multimodal datasets | |
| Beyond neural scaling laws: beating power law scaling via data pruning | |
| Flamingo: a visual language model for few-shot learning | |
| Quality not quantity: On the interaction between dataset design and robustness of CLIP | |
| VBench: Comprehensive Benchmark Suite for Video Generative Models | |
| EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | |
| Training Compute-Optimal Large Language Models | |
| NExT-GPT: Any-to-Any Multimodal LLM | |
| ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization | |
| ChartReformer: Natural Language-Driven Chart Image Editing | |
| GroundingGPT: Language Enhanced Multi-modal Grounding Model | |
| Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic | |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | |
| Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters | |
| Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training | |
| Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation | |
| 3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset | |
| Structured Packing in LLM Training Improves Long Context Utilization | |
| Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models | |
| MoDE: CLIP Data Experts via Clustering | |
| Efficient Multimodal Learning from Data-centric Perspective | |
| Improved Baselines for Data-efficient Perceptual Augmentation of LLMs | |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | |
| FunQA: Towards Surprising Video Comprehension | |
| OneChart: Purify the Chart Structural Extraction via One Auxiliary Token | |
| ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning | |
| StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding | |
| MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | |
| ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning | |
| WorldGPT: Empowering LLM as Multimodal World Model | |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | |
| TextSquare: Scaling up Text-Centric Visual Instruction Tuning | |
| ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction | |
| How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning? | |
| Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want | |
| Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | |
| Fewer Truncations Improve Language Modeling | |
| MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale | |
| AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception | |
| UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark | |
| Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives | |
| Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | |
| TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models | |
| The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative | |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | |
| MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria | |
| MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models | |
| Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models | |
| M3DBench: Let’s Instruct Large Models with Multi-modal 3D Prompts | |
| MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model | |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | |
| mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding | |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | |
| Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent Transportation | |
| On the Adversarial Robustness of Multi-Modal Foundation Models | |
| What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models | |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | |
| PaLM-E: An Embodied Multimodal Language Model | |
| Multimodal Data Curation via Object Detection and Filter Ensembles | |
| Sieve: Multimodal Dataset Pruning Using Image Captioning Models | |
| Towards a statistical theory of data selection under weak supervision | |
| D² Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning | |
| UIClip: A Data-driven Model for Assessing User Interface Design | |
| CapsFusion: Rethinking Image-Text Data at Scale | |
| Improving CLIP Training with Language Rewrites | |
| OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation | |
| A Decade’s Battle on Dataset Bias: Are We There Yet? | |
| Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | |
| Data Filtering Networks | |
| T-MARS: Improving Visual Representations by Circumventing Text Feature Learning | |
| InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 | |
| Align and Attend: Multimodal Summarization with Dual Contrastive Losses | |
| MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? | |
| Text-centric Alignment for Multi-Modality Learning | |
| Noisy Correspondence Learning with Meta Similarity Correction | |
| Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | |
| Language-Image Models with 3D Understanding | |
| Scaling Laws for Generative Mixed-Modal Language Models | |
| BLINK: Multimodal Large Language Models Can See but Not Perceive | |
| Visual Hallucinations of Multi-modal Large Language Models | |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | |
| Visual Instruction Tuning | |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | |
| Time-LLM: Time Series Forecasting by Reprogramming Large Language Models | |
| On the De-duplication of LAION-2B | |
| Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | |
| LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | |
| Data Augmentation for Text-based Person Retrieval Using Large Language Models | |
| Aligning Actions and Walking to LLM-Generated Textual Descriptions | |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | |
| Probing Multimodal LLMs as World Models for Driving | |
| Unified Hallucination Detection for Multimodal Large Language Models | |
| Semdedup: Data-efficient learning at web-scale through semantic deduplication | |
| Automated Multi-level Preference for MLLMs | |
| Silkie: Preference distillation for large visual language models | |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | |
| M3it: A large-scale dataset towards multi-modal multilingual instruction tuning | |
| Aligning Large Multimodal Models with Factually Augmented RLHF | |
| DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback | |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | |
| MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark | |
| MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI | |
| M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | |
| ImgTrojan: Jailbreaking Vision-Language Models with ONE Image | |
| VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models | |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | |
| Improving Multimodal Datasets with Image Captioning | |
| Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System | |
| PDFChatAnnotator: A Human-LLM Collaborative Multi-Modal Data Annotation Tool for PDF-Format Catalogs | |
| CiT: Curation in Training for Effective Vision-Language Data | |
| InstructPix2Pix: Learning to Follow Image Editing Instructions | |
| Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study | |
| ModelGo: A Practical Tool for Machine Learning License Analysis | |
| Scaling Laws of Synthetic Images for Model Training … for Now | |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | |
| Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V | |
| Segment Anything | |
| AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning | |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | |
| All in an Aggregated Image for In-Image Learning | |
| Panda-70m: Captioning 70m videos with multiple cross-modality teachers | |
| Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text | |
| ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | |
| Imagebind: One embedding space to bind them all | |
| UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All | |
| FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion | |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | |
| Binding Touch to Everything: Learning Unified Multimodal Tactile Representations | |
| Genixer: Empowering Multimodal Large Language Model as a Powerful Data Generator | |
| ZooProbe: A Data Engine for Evaluating, Exploring, and Evolving Large-scale Training Data for Multimodal LLMs | |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | |
| MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct | |
| World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering | |
| Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs | |
| FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering | |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | |
| Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception | |
| REFINESUMM: Self-Refining MLLM for Generating a Multimodal Summarization Dataset | |
| FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models | |
| A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment | |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | |
| LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models | |
| Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation | |
Contribution to This Survey
Due to the rapid development in the field, this repository and our paper are continuously being updated and synchronized with each other. Please feel free to make pull requests or open issues to contribute to this list and add more related resources! You can add the titles of relevant papers to the table above and, optionally, suggest tags along with the corresponding sections. We will complete the remaining information and periodically update our survey based on the updated content of this document.
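For illustration only, a newly contributed entry might look like the following row appended to the table above; the title here is a placeholder and the tag is just a suggested section label, so please adjust the format to match the existing rows in the repository:

```markdown
| An Example Paper Title on MLLM Data-Model Co-Development | Suggested tag(s), e.g., Sec. 3.1.1 Data Acquisition |
```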
References
If you find our work useful for your research or development, please kindly cite the following paper.
@article{qin2024synergy,
title={The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective},
author={Qin, Zhen and Chen, Daoyuan and Zhang, Wenhao and Yao, Liuyi and Huang, Yilun and Ding, Bolin and Li, Yaliang and Deng, Shuiguang},
journal={arXiv preprint arXiv:2407.08583},
year={2024}
}
“Section - Mentioned Papers” Retrieval List
We provide a collapsible list of back references, allowing readers to see which (sub)sections mention each paper from the table above. This back-reference list will be periodically updated based on the content of the table and our paper.
Sec. 3.1 For Scaling of MLLMs: Larger Datasets
No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Training Compute-Optimal Large Language Models
Sec. 3.1.1 Data Acquisition
No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
Audio Retrieval with WavText5K and CLAP Training
DataComp: In search of the next generation of multimodal datasets
Learning Transferable Visual Models From Natural Language Supervision
NExT-GPT: Any-to-Any Multimodal LLM
ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization
ChartReformer: Natural Language-Driven Chart Image Editing
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
UIClip: A Data-driven Model for Assessing User Interface Design
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Visual Instruction Tuning
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
Probing Multimodal LLMs as World Models for Driving
Genixer: Empowering Multimodal Large Language Model as a Powerful Data Generator
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
Sec. 3.1.2 Data Augmentation
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
CapsFusion: Rethinking Image-Text Data at Scale
Improving CLIP Training with Language Rewrites
Data Augmentation for Text-based Person Retrieval Using Large Language Models
Aligning Actions and Walking to LLM-Generated Textual Descriptions
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
Sec. 3.1.3 Data Diversity
No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Audio Retrieval with WavText5K and CLAP Training
DataComp: In search of the next generation of multimodal datasets
Flamingo: a visual language model for few-shot learning
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation
ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning
Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models
PaLM-E: An Embodied Multimodal Language Model
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Sec. 3.2 For Scaling Effectiveness of MLLMs: Better Subsets
No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
DataComp: In search of the next generation of multimodal datasets
Sec. 3.2.1 Data Condensation
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering
DataComp: In search of the next generation of multimodal datasets
Beyond neural scaling laws: beating power law scaling via data pruning
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
Efficient Multimodal Learning from Data-centric Perspective
Multimodal Data Curation via Object Detection and Filter Ensembles
Sieve: Multimodal Dataset Pruning Using Image Captioning Models
Towards a statistical theory of data selection under weak supervision
Data Filtering Networks
T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4
Semdedup: Data-efficient learning at web-scale through semantic deduplication
On the De-duplication of LAION-2B
Improving Multimodal Datasets with Image Captioning
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Sec. 3.2.2 Data Mixture
Learning Transferable Visual Models From Natural Language Supervision
Flamingo: a visual language model for few-shot learning
Quality not quantity: On the interaction between dataset design and robustness of CLIP
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
A Decade’s Battle on Dataset Bias: Are We There Yet?
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
Demystifying CLIP Data
Sec. 3.2.3 Data Packing
Structured Packing in LLM Training Improves Long Context Utilization
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
MoDE: CLIP Data Experts via Clustering
Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
Fewer Truncations Improve Language Modeling
Sec. 3.2.4 Cross-Modal Alignment
No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
DataComp: In search of the next generation of multimodal datasets
Multimodal Data Curation via Object Detection and Filter Ensembles
Sieve: Multimodal Dataset Pruning Using Image Captioning Models
ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization
Data Filtering Networks
T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
Text-centric Alignment for Multi-Modality Learning
Noisy Correspondence Learning with Meta Similarity Correction
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
Semdedup: Data-efficient learning at web-scale through semantic deduplication
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
Improving Multimodal Datasets with Image Captioning
Imagebind: One embedding space to bind them all
UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
Sec. 4.1 For Instruction Responsiveness of MLLMs
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
Sec. 4.1.1 Prompt Design
Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic
Kosmos-2: Grounding Multimodal Large Language Models to the World
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
Time-LLM: Time Series Forecasting by Reprogramming Large Language Models
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Scaling Laws of Synthetic Images for Model Training … for Now
Sec. 4.1.2 ICL Data
GroundingGPT: Language Enhanced Multi-modal Grounding Model
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models
All in an Aggregated Image for In-Image Learning
AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Sec. 4.1.3 Human-Behavior Alignment Data
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation
MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Automated Multi-level Preference for MLLMs
Silkie: Preference distillation for large visual language models
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Aligning Large Multimodal Models with Factually Augmented RLHF
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Sec. 4.2 For Reasoning Ability of MLLMs
Sec. 4.2.1 Data for Single-Hop Reasoning
FunQA: Towards Surprising Video Comprehension
StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding
What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models
Sec. 4.2.2 Data for Multi-Hop Reasoning
MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale
Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos
Language-Image Models with 3D Understanding
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models
Sec. 4.3 For Ethics of MLLMs
Sec. 4.3.1 Data Toxicity
Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain
Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
ImgTrojan: Jailbreaking Vision-Language Models with ONE Image
VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
On the Adversarial Robustness of Multi-Modal Foundation Models
Sec. 4.3.2 Data Privacy and Intellectual Property
ModelGo: A Practical Tool for Machine Learning License Analysis
Sec. 4.4 For Evaluation of MLLMs
Sec. 4.4.1 Benchmarks for Multi-Modal Understanding
DataComp: In search of the next generation of multimodal datasets
3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
OneChart: Purify the Chart Structural Extraction via One Auxiliary Token
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction
UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark
M3DBench: Let’s Instruct Large Models with Multi-modal 3D Prompts
Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent Transportation
BLINK: Multimodal Large Language Models Can See but Not Perceive
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Sec. 4.4.2 Benchmarks for Multi-Modal Generation
VBench: Comprehensive Benchmark Suite for Video Generative Models
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
WorldGPT: Empowering LLM as Multimodal World Model
MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation
Visual Hallucinations of Multi-modal Large Language Models
Unified Hallucination Detection for Multimodal Large Language Models
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
Sec. 4.4.3 Benchmarks for Multi-Modal Retrieval
Audio Retrieval with WavText5K and CLAP Training
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent Transportation
Sec. 4.4.4 Benchmarks for Multi-Modal Reasoning
FunQA: Towards Surprising Video Comprehension
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning
Probing Multimodal LLMs as World Models for Driving
M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought
Sec. 5.1 Model as a Data Creator
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
ChartLlama: A Multimodal LLM for Chart Understanding and Generation
VideoChat: Chat-Centric Video Understanding
3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation
OneChart: Purify the Chart Structural Extraction via One Auxiliary Token
ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark
Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives
What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
InstructPix2Pix: Learning to Follow Image Editing Instructions
Genixer: Empowering Multimodal Large Language Model as a Powerful Data Generator
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
Sec. 5.2 Model as a Data Mapper
VideoChat: Chat-Centric Video Understanding
Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex
GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting
MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model
Improving CLIP Training with Language Rewrites
Data Augmentation for Text-based Person Retrieval Using Large Language Models
Aligning Actions and Walking to LLM-Generated Textual Descriptions
Unified Hallucination Detection for Multimodal Large Language Models
PDFChatAnnotator: A Human-LLM Collaborative Multi-Modal Data Annotation Tool for PDF-Format Catalogs
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
REFINESUMM: Self-Refining MLLM for Generating a Multimodal Summarization Dataset
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
Sec. 5.3 Model as a Data Filter
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
DataComp: In search of the next generation of multimodal datasets
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models
Towards a statistical theory of data selection under weak supervision
Visual Hallucinations of Multi-modal Large Language Models
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering
Sec. 5.4 Model as a Data Evaluator
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
ImgTrojan: Jailbreaking Vision-Language Models with ONE Image
ZooProbe: A Data Engine for Evaluating, Exploring, and Evolving Large-scale Training Data for Multimodal LLMs
FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models
A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment
Sec. 6.1 Model as a Data Navigator
How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?
Sec. 6.2 Model as a Data Extractor
No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition
Unified Hallucination Detection for Multimodal Large Language Models
Sec. 6.3 Model as a Data Analyzer
ChartLlama: A Multimodal LLM for Chart Understanding and Generation
OneChart: Purify the Chart Structural Extraction via One Auxiliary Token
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding
ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model
Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System
Sec. 6.4 Model as a Data Visualizer
ChartLlama: A Multimodal LLM for Chart Understanding and Generation
ChartReformer: Natural Language-Driven Chart Image Editing
Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study
Sec. 8.1 Data-Model Co-Development Infrastructures
DataComp: In search of the next generation of multimodal datasets
Sec. 8.2 Externally-Boosted MLLM Development
Sec. 8.2.1 MLLM-Based Data Discovery
No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
ModelGo: A Practical Tool for Machine Learning License Analysis
Sec. 8.2.2 Modality-Compatibility Detection with MLLMs
Improving Multimodal Datasets with Image Captioning
Sec. 8.2.3 Automatic Knowledge Transfer for MLLMs
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation
MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria
Sec. 8.3 Self-Boosted MLLM Development
Sec. 8.3.1 Self Data Scaling with MLLMs
Sieve: Multimodal Dataset Pruning Using Image Captioning Models
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
Segment Anything
Sec. 8.3.2 Self Data Condensation with MLLMs
Sec. 8.3.3 RL from Self Feedback of MLLMs
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering
DataComp: In search of the next generation of multimodal datasets
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Multimodal Data Curation via Object Detection and Filter Ensembles
Sieve: Multimodal Dataset Pruning Using Image Captioning Models
Data Filtering Networks
T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
Semdedup: Data-efficient learning at web-scale through semantic deduplication
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
CiT: Curation in Training for Effective Vision-Language Data
Improving Multimodal Datasets with Image Captioning