Awesome Data-Model Co-Development of MLLMs

Welcome to the “Awesome List” for data-model co-development of Multi-Modal Large Language Models (MLLMs), a continually updated resource tailored for the open-source community. This compilation features cutting-edge research and insightful articles on improving MLLMs through data-model co-development, with each entry tagged according to the taxonomy proposed in our data-model co-development survey, as illustrated below.

Overview of Our Taxonomy

Due to the rapid development in the field, this repository and our paper are continuously being updated and synchronized with each other. Please feel free to make pull requests or open issues to contribute to this list and add more related resources!

News

  • 🎉 [2025-06-04] Our Data-Model Co-development Survey has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)! We welcome you to explore and contribute to this awesome list.

  • [2025-05-25] We added 20 academic papers related to this survey.

  • [2024-10-23] We built a dynamic table based on the paper list that supports filtering and searching.

  • [2024-10-22] We restructured our paper list to provide more streamlined information.

Candidate Co-Development Tags

These tags correspond to the taxonomy in our paper, and each work may be assigned more than one tag.

Data4Model: Scaling

For Scaling Up of MLLMs: Larger Datasets

  • Data Acquisition

  • Data Augmentation

  • Data Diversity

For Scaling Effectiveness of MLLMs: Better Subsets

  • Data Condensation

  • Data Mixture

  • Data Packing

  • Cross-Modal Alignment

Data4Model: Usability

For Instruction Responsiveness of MLLMs

  • Prompt Design

  • ICL Data

  • Human-Behavior Alignment Data

For Reasoning Ability of MLLMs

  • Data for Single-Hop Reasoning

  • Data for Multi-Hop Reasoning

For Ethics of MLLMs

  • Data Toxicity

  • Data Privacy and Intellectual Property

For Evaluation of MLLMs

  • Benchmarks for Multi-Modal Understanding

  • Benchmarks for Multi-Modal Generation

  • Benchmarks for Multi-Modal Retrieval

  • Benchmarks for Multi-Modal Reasoning

Model4Data: Synthesis

  • Model as a Data Creator

  • Model as a Data Mapper

  • Model as a Data Filter

  • Model as a Data Evaluator

Model4Data: Insights

  • Model as a Data Navigator

  • Model as a Data Extractor

  • Model as a Data Analyzer

  • Model as a Data Visualizer

Paper List

Below is a paper list summarized based on our survey. Additionally, we provide a dynamic table that supports filtering and searching, which draws from the same data source as the list below.
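
As a rough illustration of how such tag-based filtering works, here is a minimal Python sketch; the field names and in-memory representation are assumptions made for illustration, not the repository's actual data schema. Each entry is treated as a title carrying a list of taxonomy tags, and filtering keeps only the entries with a given tag.

papers = [
    {
        "title": "DataComp: In search of the next generation of multimodal datasets",
        "tags": ["Data Acquisition", "Data Condensation", "Cross-Modal Alignment"],
    },
    {
        "title": "Visual Instruction Tuning",
        "tags": ["Data Acquisition"],
    },
]

def filter_by_tag(entries, tag):
    # Keep every paper whose tag list contains the requested taxonomy tag.
    return [entry["title"] for entry in entries if tag in entry["tags"]]

print(filter_by_tag(papers, "Data Condensation"))
# ['DataComp: In search of the next generation of multimodal datasets']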

No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning

Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain

Probing Heterogeneous Pretraining Datasets with Small Curated Datasets

ChartLlama: A Multimodal LLM for Chart Understanding and Generation

VideoChat: Chat-Centric Video Understanding

Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex

3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding

GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

Audio Retrieval with WavText5K and CLAP Training

The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering

Demystifying CLIP Data

Learning Transferable Visual Models From Natural Language Supervision

DataComp: In search of the next generation of multimodal datasets

Beyond neural scaling laws: beating power law scaling via data pruning

Flamingo: a visual language model for few-shot learning

Quality not quantity: On the interaction between dataset design and robustness of clip

VBench: Comprehensive Benchmark Suite for Video Generative Models

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Training Compute-Optimal Large Language Models

NExT-GPT: Any-to-Any Multimodal LLM

ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization

ChartReformer: Natural Language-Driven Chart Image Editing

GroundingGPT: Language Enhanced Multi-modal Grounding Model

Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic

Kosmos-2: Grounding Multimodal Large Language Models to the World

Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters

Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training

Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset

Structured Packing in LLM Training Improves Long Context Utilization

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

MoDE: CLIP Data Experts via Clustering

Efficient Multimodal Learning from Data-centric Perspective

Improved Baselines for Data-efficient Perceptual Augmentation of LLMs

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Perception Test: A Diagnostic Benchmark for Multimodal Video Models

FunQA: Towards Surprising Video Comprehension

OneChart: Purify the Chart Structural Extraction via One Auxiliary Token

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding

MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning

ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning

WorldGPT: Empowering LLM as Multimodal World Model

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction

How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

Fewer Truncations Improve Language Modeling

MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale

AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models

M3DBench: Let’s Instruct Large Models with Multi-modal 3D Prompts

MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent Transportation

On the Adversarial Robustness of Multi-Modal Foundation Models

What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

PaLM-E: An Embodied Multimodal Language Model

Multimodal Data Curation via Object Detection and Filter Ensembles

Sieve: Multimodal Dataset Pruning Using Image Captioning Models

Towards a statistical theory of data selection under weak supervision

D2 Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning

UIClip: A Data-driven Model for Assessing User Interface Design

CapsFusion: Rethinking Image-Text Data at Scale

Improving CLIP Training with Language Rewrites

OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation

A Decade’s Battle on Dataset Bias: Are We There Yet?

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Data Filtering Networks

T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4

Align and Attend: Multimodal Summarization with Dual Contrastive Losses

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Text-centric Alignment for Multi-Modality Learning

Noisy Correspondence Learning with Meta Similarity Correction

Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos

Language-Image Models with 3D Understanding

Scaling Laws for Generative Mixed-Modal Language Models

BLINK: Multimodal Large Language Models Can See but Not Perceive

Visual Hallucinations of Multi-modal Large Language Models

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Visual Instruction Tuning

ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model

Time-LLM: Time Series Forecasting by Reprogramming Large Language Models

On the De-duplication of LAION-2B

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition

Data Augmentation for Text-based Person Retrieval Using Large Language Models

Aligning Actions and Walking to LLM-Generated Textual Descriptions

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Probing Multimodal LLMs as World Models for Driving

Unified Hallucination Detection for Multimodal Large Language Models

Semdedup: Data-efficient learning at web-scale through semantic deduplication

Automated Multi-level Preference for MLLMs

Silkie: Preference distillation for large visual language models

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

M3it: A large-scale dataset towards multi-modal multilingual instruction tuning

Aligning Large Multimodal Models with Factually Augmented RLHF

DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought

ImgTrojan: Jailbreaking Vision-Language Models with ONE Image

VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models

Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts

Improving Multimodal Datasets with Image Captioning

Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System

PDFChatAnnotator: A Human-LLM Collaborative Multi-Modal Data Annotation Tool for PDF-Format Catalogs

CiT: Curation in Training for Effective Vision-Language Data

InstructPix2Pix: Learning to Follow Image Editing Instructions

Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study

ModelGo: A Practical Tool for Machine Learning License Analysis

Scaling Laws of Synthetic Images for Model Training … for Now

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Segment Anything

AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

All in an Aggregated Image for In-Image Learning

Panda-70m: Captioning 70m videos with multiple cross-modality teachers

Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text

ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning

Imagebind: One embedding space to bind them all

UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

Genixer: Empowering Multimodal Large Language Model as a Powerful Data Generator

ZooProbe: A Data Engine for Evaluating, Exploring, and Evolving Large-scale Training Data for Multimodal LLMs

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs

FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

REFINESUMM: Self-Refining MLLM for Generating a Multimodal Summarization Dataset

FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models

A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

Contribution to This Survey

Due to the rapid development in the field, this repository and our paper are continuously being updated and synchronized with each other. Please feel free to make pull requests or open issues to contribute to this list and add more related resources! You can add the titles of relevant papers to the table above and, optionally, suggest tags along with the corresponding sections (an illustrative entry sketch is given below). We will attempt to complete the remaining information and periodically update our survey based on the updated content of this document.
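
Purely as an illustration of the information a suggested addition could carry, here is a small sketch; the structure and field names below are hypothetical and not a required format, and plain titles added to the table are already sufficient.

suggested_entry = {
    "title": "<paper title>",
    # Optional: tags drawn from the candidate co-development tags above.
    "suggested_tags": ["Data Acquisition", "Data Diversity"],
    # Optional: the survey (sub)sections the suggested tags correspond to.
    "corresponding_sections": ["Sec. 3.1.1 Data Acquisition", "Sec. 3.1.3 Data Diversity"],
}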

References

If you find our work useful for your research or development, please kindly cite the following paper.

@article{qin2024synergy,
  title={The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective},
  author={Qin, Zhen and Chen, Daoyuan and Zhang, Wenhao and Yao, Liuyi and Huang, Yilun and Ding, Bolin and Li, Yaliang and Deng, Shuiguang},
  journal={arXiv preprint arXiv:2407.08583},
  year={2024}
}

“Section - Mentioned Papers” Retrieval List

We provide a collapsible list of back references, allowing readers to see which (sub)sections mention each paper from the table above. This list will be periodically updated based on the content of the table and our paper.
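
A minimal Python sketch of how such back references can be derived (the mapping below is a tiny assumed excerpt for illustration, not the full data): invert a section-to-papers mapping into a paper-to-sections mapping, so each paper points back to the (sub)sections that mention it.

from collections import defaultdict

# Assumed excerpt of the section-to-papers mapping, for illustration only.
section_to_papers = {
    "Sec. 3.1.1 Data Acquisition": ["Visual Instruction Tuning"],
    "Sec. 3.2.1 Data Condensation": ["Data Filtering Networks"],
    "Sec. 3.2.4 Cross-Modal Alignment": ["Data Filtering Networks"],
}

paper_to_sections = defaultdict(list)
for section, titles in section_to_papers.items():
    for title in titles:
        paper_to_sections[title].append(section)

# 'Data Filtering Networks' now points back to both sections that mention it.
print(dict(paper_to_sections))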

Sec. 3.1 For Scaling Up of MLLMs: Larger Datasets
  • No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

  • Training Compute-Optimal Large Language Models

Sec. 3.1.1 Data Acquisition

  • No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

  • GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting

  • Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

  • Audio Retrieval with WavText5K and CLAP Training

  • DataComp: In search of the next generation of multimodal datasets

  • Learning Transferable Visual Models From Natural Language Supervision

  • NExT-GPT: Any-to-Any Multimodal LLM

  • ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization

  • ChartReformer: Natural Language-Driven Chart Image Editing

  • Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

  • StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding

  • MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning

  • List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

  • TextSquare: Scaling up Text-Centric Visual Instruction Tuning

  • ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction

  • TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

  • BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

  • ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

  • UIClip: A Data-driven Model for Assessing User Interface Design

  • EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

  • Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

  • Visual Instruction Tuning

  • ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model

  • Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

  • Probing Multimodal LLMs as World Models for Driving

  • Genixer: Empowering Multimodal Large Language Model as a Powerful Data Generator

  • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

  • MiniCPM-V: A GPT-4V Level MLLM on Your Phone

  • Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

  • Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

  • LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Sec. 3.1.2 Data Augmentation

  • Improved Baselines for Data-efficient Perceptual Augmentation of LLMs

  • mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

  • Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

  • mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

  • CapsFusion: Rethinking Image-Text Data at Scale

  • Improving CLIP Training with Language Rewrites

  • Data Augmentation for Text-based Person Retrieval Using Large Language Models

  • Aligning Actions and Walking to LLM-Generated Textual Descriptions

  • GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

Sec. 3.1.3 Data Diversity

  • No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

  • Audio Retrieval with WavText5K and CLAP Training

  • DataComp: In search of the next generation of multimodal datasets

  • Flamingo: a visual language model for few-shot learning

  • Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

  • ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning

  • Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models

  • PaLM-E: An Embodied Multimodal Language Model

  • SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

  • Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

  • MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

Sec. 3.2 For Scaling Effectiveness of MLLMs: Better Subsets
  • No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

  • DataComp: In search of the next generation of multimodal datasets

Sec. 3.2.1 Data Condensation

  • The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering

  • DataComp: In search of the next generation of multimodal datasets

  • Beyond neural scaling laws: beating power law scaling via data pruning

  • Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters

  • Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training

  • Efficient Multimodal Learning from Data-centric Perspective

  • Multimodal Data Curation via Object Detection and Filter Ensembles

  • Sieve: Multimodal Dataset Pruning Using Image Captioning Models

  • Towards a statistical theory of data selection under weak supervision

  • Data Filtering Networks

  • T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

  • InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4

  • Semdedup: Data-efficient learning at web-scale through semantic deduplication

  • On the De-duplication of LAION-2B

  • Improving Multimodal Datasets with Image Captioning

  • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Sec. 3.2.2 Data Mixture

  • Learning Transferable Visual Models From Natural Language Supervision

  • Flamingo: a visual language model for few-shot learning

  • Quality not quantity: On the interaction between dataset design and robustness of clip

  • List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

  • A Decade’s Battle on Dataset Bias: Are We There Yet?

  • Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

  • Demystifying CLIP Data

Sec. 3.2.3 Data Packing

  • Structured Packing in LLM Training Improves Long Context Utilization

  • Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

  • MoDE: CLIP Data Experts via Clustering

  • Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

  • Fewer Truncations Improve Language Modeling

Sec. 3.2.4 Cross-Modal Alignment

  • No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

  • DataComp: In search of the next generation of multimodal datasets

  • Multimodal Data Curation via Object Detection and Filter Ensembles

  • Sieve: Multimodal Dataset Pruning Using Image Captioning Models

  • ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization

  • Data Filtering Networks

  • T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

  • Text-centric Alignment for Multi-Modality Learning

  • Noisy Correspondence Learning with Meta Similarity Correction

  • ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model

  • Semdedup: Data-efficient learning at web-scale through semantic deduplication

  • Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

  • AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

  • Improving Multimodal Datasets with Image Captioning

  • Imagebind: One embedding space to bind them all

  • UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

  • FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

  • LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

  • Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

Sec. 4.1 For Instruction Responsiveness of MLLMs
  • ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model

Sec. 4.1.1 Prompt Design

  • Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic

  • Kosmos-2: Grounding Multimodal Large Language Models to the World

  • Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

  • Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

  • ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model

  • Time-LLM: Time Series Forecasting by Reprogramming Large Language Models

  • Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

  • Scaling Laws of Synthetic Images for Model Training … for Now

Sec. 4.1.2 ICL Data

  • GroundingGPT: Language Enhanced Multi-modal Grounding Model

  • List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

  • Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models

  • All in an Aggregated Image for In-Image Learning

  • AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning

  • MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Sec. 4.1.3 Human-Behavior Alignment Data

  • Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

  • MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

  • ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model

  • LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

  • Automated Multi-level Preference for MLLMs

  • Silkie: Preference distillation for large visual language models

  • Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

  • Aligning Large Multimodal Models with Factually Augmented RLHF

  • DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback

  • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

Sec. 4.2 For Reasoning Ability of MLLMs
Sec. 4.2.1 Data for Single-Hop Reasoning

  • FunQA: Towards Surprising Video Comprehension

  • StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding

  • What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models

Sec. 4.2.2 Data for Multi-Hop Reasoning

  • MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale

  • Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos

  • Language-Image Models with 3D Understanding

  • DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

  • EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

  • Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

  • Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models

Sec. 4.3 For Ethics of MLLMs
Sec. 4.3.1 Data Toxicity

  • Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain

  • Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

  • The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative

  • MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

  • ImgTrojan: Jailbreaking Vision-Language Models with ONE Image

  • VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models

  • Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts

  • On the Adversarial Robustness of Multi-Modal Foundation Models

Sec. 4.3.2 Data Privacy and Intellectual Property

  • ModelGo: A Practical Tool for Machine Learning License Analysis

Sec. 4.4 For Evaluation of MLLMs
Sec. 4.4.1 Benchmarks for Multi-Modal Understanding

  • DataComp: In search of the next generation of multimodal datasets

  • 3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset

  • MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

  • SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

  • OneChart: Purify the Chart Structural Extraction via One Auxiliary Token

  • MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning

  • ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction

  • UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

  • M3DBench: Let’s Instruct Large Models with Multi-modal 3D Prompts

  • Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent Transportation

  • BLINK: Multimodal Large Language Models Can See but Not Perceive

  • LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

Sec. 4.4.2 Benchmarks for Multi-Modal Generation

  • VBench: Comprehensive Benchmark Suite for Video Generative Models

  • EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

  • Perception Test: A Diagnostic Benchmark for Multimodal Video Models

  • WorldGPT: Empowering LLM as Multimodal World Model

  • MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

  • MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

  • OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation

  • Visual Hallucinations of Multi-modal Large Language Models

  • Unified Hallucination Detection for Multimodal Large Language Models

  • MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

Sec. 4.4.3 Benchmarks for Multi-Modal Retrieval

  • Audio Retrieval with WavText5K and CLAP Training

  • MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

  • Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent Transportation

Sec. 4.4.4 Benchmarks for Multi-Modal Reasoning

  • FunQA: Towards Surprising Video Comprehension

  • ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

  • ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning

  • Probing Multimodal LLMs as World Models for Driving

  • M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought

Sec. 5.1 Model as a Data Creator

  • What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning

  • ChartLlama: A Multimodal LLM for Chart Understanding and Generation

  • VideoChat: Chat-Centric Video Understanding

  • 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding

  • Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters

  • Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

  • OneChart: Purify the Chart Structural Extraction via One Auxiliary Token

  • ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning

  • TextSquare: Scaling up Text-Centric Visual Instruction Tuning

  • UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

  • Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

  • What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models

  • EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

  • AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

  • InstructPix2Pix: Learning to Follow Image Editing Instructions

  • Genixer: Empowering Multimodal Large Language Model as a Powerful Data Generator

  • World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

  • Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

Sec. 5.2 Model as a Data Mapper

  • VideoChat: Chat-Centric Video Understanding

  • Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex

  • GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting

  • MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale

  • AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception

  • BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

  • MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model

  • Improving CLIP Training with Language Rewrites

  • Data Augmentation for Text-based Person Retrieval Using Large Language Models

  • Aligning Actions and Walking to LLM-Generated Textual Descriptions

  • Unified Hallucination Detection for Multimodal Large Language Models

  • PDFChatAnnotator: A Human-LLM Collaborative Multi-Modal Data Annotation Tool for PDF-Format Catalogs

  • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

  • MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

  • Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs

  • MiniCPM-V: A GPT-4V Level MLLM on Your Phone

  • Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

  • REFINESUMM: Self-Refining MLLM for Generating a Multimodal Summarization Dataset

  • Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

  • LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Sec. 5.3 Model as a Data Filter

  • The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering

  • Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters

  • DataComp: In search of the next generation of multimodal datasets

  • TextSquare: Scaling up Text-Centric Visual Instruction Tuning

  • What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models

  • Towards a statistical theory of data selection under weak supervision

  • Visual Hallucinations of Multi-modal Large Language Models

  • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

  • FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering

Sec. 5.4 Model as a Data Evaluator

  • Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

  • TextSquare: Scaling up Text-Centric Visual Instruction Tuning

  • Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

  • MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

  • MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

  • ImgTrojan: Jailbreaking Vision-Language Models with ONE Image

  • ZooProbe: A Data Engine for Evaluating, Exploring, and Evolving Large-scale Training Data for Multimodal LLMs

  • FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models

  • A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

Sec. 6.1 Model as a Data Navigator

  • How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?

Sec. 6.2 Model as a Data Extractor

  • No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

  • LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition

  • Unified Hallucination Detection for Multimodal Large Language Models

Sec. 6.3 Model as a Data Analyzer

  • ChartLlama: A Multimodal LLM for Chart Understanding and Generation

  • OneChart: Purify the Chart Structural Extraction via One Auxiliary Token

  • ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

  • StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding

  • ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning

  • mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

  • mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

  • mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

  • Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System

Sec. 6.4 Model as a Data Visualizer

  • ChartLlama: A Multimodal LLM for Chart Understanding and Generation

  • ChartReformer: Natural Language-Driven Chart Image Editing

  • Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study

Sec. 8.1 Data-Model Co-Development Infrastructures
  • DataComp: In search of the next generation of multimodal datasets

Sec. 8.2 Externally-Boosted MLLM Development
Sec. 8.2.1 MLLM-Based Data Discovery
  • No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

  • ModelGo: A Practical Tool for Machine Learning License Analysis

Sec. 8.2.2 Modality-Compatibility Detection with MLLMs
  • Improving Multimodal Datasets with Image Captioning

Sec. 8.2.3 Automatic Knowledge Transfer for MLLMs

  • Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

  • MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

Sec. 8.3 Self-Boosted MLLM Development
Sec. 8.3.1 Self Data Scaling with MLLMs
  • Sieve: Multimodal Dataset Pruning Using Image Captioning Models

  • ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model

  • Segment Anything

Sec. 8.3.2 Self Data Condensation with MLLMs
Sec. 8.3.3 RL from Self Feedback of MLLMs
  • The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering

  • DataComp: In search of the next generation of multimodal datasets

  • Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters

  • Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training

  • Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

  • TextSquare: Scaling up Text-Centric Visual Instruction Tuning

  • Multimodal Data Curation via Object Detection and Filter Ensembles

  • Sieve: Multimodal Dataset Pruning Using Image Captioning Models

  • Data Filtering Networks

  • T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

  • Semdedup: Data-efficient learning at web-scale through semantic deduplication

  • MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

  • CiT: Curation in Training for Effective Vision-Language Data

  • Improving Multimodal Datasets with Image Captioning

Tab. 2

  • No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

  • DataComp: In search of the next generation of multimodal datasets

  • TextSquare: Scaling up Text-Centric Visual Instruction Tuning

  • MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

  • Align and Attend: Multimodal Summarization with Dual Contrastive Losses

  • MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

  • BLINK: Multimodal Large Language Models Can See but Not Perceive

  • Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

  • ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model

  • LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

  • Unified Hallucination Detection for Multimodal Large Language Models

  • Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

  • M3it: A large-scale dataset towards multi-modal multilingual instruction tuning

  • MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

  • MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

  • M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought

  • Panda-70m: Captioning 70m videos with multiple cross-modality teachers

  • Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text

  • ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning