# Awesome Data-Model Co-Development of MLLMs

Welcome to the "Awesome List" for data-model co-development of Multi-Modal Large Language Models (MLLMs), a continually updated resource tailored for the open-source community. This compilation features cutting-edge research and insightful articles on improving MLLMs through data-model co-development, each tagged according to the taxonomy proposed in our data-model co-development survey, as illustrated below.

## Overview of Our Taxonomy

Given the rapid progress in this field, this repository and our paper are continuously updated and kept in sync with each other. Please feel free to open pull requests or issues to contribute to this list and add more related resources!

## Detailed Paper List

| Title | Tags |
| --- | --- |
| No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance | Data4Model->Scaling Up->Acquisition, Data4Model->Scaling Effectiveness->CrossModalAlignment, Model4Data->Synthesis->Evaluator |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Model4Data->Synthesis->Creator |
| Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain | Data4Model->Usability->Ethic->Toxicity |
| Probing Heterogeneous Pretraining Datasets with Small Curated Datasets | Data4Model->Scaling Effectiveness->Condensation |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | Model4Data->Synthesis->Creator, Model4Data->Insights->Visualizer |
| VideoChat: Chat-Centric Video Understanding | Model4Data->Synthesis->Creator, Model4Data->Synthesis->Mapper |
| Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex | Model4Data->Synthesis->Mapper |
| 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding | Model4Data->Synthesis->Creator |
| GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting | Data4Model->Scaling Up->Acquisition, Model4Data->Synthesis->Mapper |
| Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | Data4Model->Scaling Up->Acquisition |
| Audio Retrieval with WavText5K and CLAP Training | Data4Model->Scaling Up->Diversity, Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Eval->Retrieval |
| The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering | Data4Model->Scaling Effectiveness->Condensation |
| Demystifying CLIP Data | Data4Model->Scaling Effectiveness->Mixture |
| Learning Transferable Visual Models From Natural Language Supervision | Data4Model->Scaling Up->Acquisition |
| DataComp: In search of the next generation of multimodal datasets | Data4Model->Scaling Effectiveness->Condensation, Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Eval->Generation, Model4Data->Synthesis->Filter |
| Beyond neural scaling laws: beating power law scaling via data pruning | Data4Model->Scaling Effectiveness->Condensation |
| Flamingo: a visual language model for few-shot learning | Data4Model->Scaling Effectiveness->Mixture |
| Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP | Data4Model->Scaling Effectiveness->Condensation, Data4Model->Scaling Effectiveness->Mixture |
| VBench: Comprehensive Benchmark Suite for Video Generative Models | Data4Model->Usability->Eval->Generation |
| EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | Data4Model->Usability->Eval->Generation |
| Training Compute-Optimal Large Language Models | Data4Model->Scaling Up->Acquisition |
| NExT-GPT: Any-to-Any Multimodal LLM | Data4Model->Scaling Up->Acquisition |
| ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization | Data4Model->Scaling Up->Acquisition, Data4Model->Scaling Effectiveness->CrossModalAlignment |
| ChartReformer: Natural Language-Driven Chart Image Editing | Data4Model->Scaling Up->Acquisition, Model4Data->Insights->Visualizer |
| GroundingGPT: Language Enhanced Multi-modal Grounding Model | Data4Model->Usability->Responsiveness->ICL |
| Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic | Data4Model->Usability->Responsiveness->Prompt |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Data4Model->Usability->Responsiveness->Prompt |
| Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters | Model4Data->Synthesis->Filter, Model4Data->Synthesis->Creator |
| Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training | Data4Model->Scaling Effectiveness->Condensation |
| Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation | Model4Data->Synthesis->Creator, Data4Model->Scaling Up->Acquisition, Data4Model->Scaling Up->Diversity, Data4Model->Usability->Responsiveness->HumanBehavior |
| 3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset | Data4Model->Usability->Eval->Understanding |
| Structured Packing in LLM Training Improves Long Context Utilization | Data4Model->Scaling Effectiveness->Packing |
| Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models | Data4Model->Scaling Effectiveness->Packing |
| MoDE: CLIP Data Experts via Clustering | Data4Model->Scaling Effectiveness->Packing |
| Efficient Multimodal Learning from Data-centric Perspective | Data4Model->Scaling Effectiveness->Condensation |
| Improved Baselines for Data-efficient Perceptual Augmentation of LLMs | Data4Model->Scaling Up->Augmentation |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Data4Model->Usability->Eval->Understanding |
| SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Data4Model->Usability->Eval->Understanding |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Data4Model->Scaling Up->Acquisition |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | Data4Model->Usability->Eval->Understanding |
| FunQA: Towards Surprising Video Comprehension | Data4Model->Usability->Eval->Reasoning |
| OneChart: Purify the Chart Structural Extraction via One Auxiliary Token | Data4Model->Usability->Eval->Understanding, Model4Data->Synthesis->Creator |
| ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning | Data4Model->Usability->Eval->Reasoning |
| StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding | Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Reasoning->SingleHop |
| MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Eval->Understanding |
| ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning | Data4Model->Usability->Eval->Understanding, Model4Data->Synthesis->Creator, Data4Model->Scaling Up->Diversity |
| WorldGPT: Empowering LLM as Multimodal World Model | Data4Model->Usability->Eval->Generation |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Data4Model->Usability->Responsiveness->Prompt, Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Responsiveness->ICL |
| TextSquare: Scaling up Text-Centric Visual Instruction Tuning | Data4Model->Scaling Up->Acquisition, Model4Data->Synthesis->Creator, Model4Data->Synthesis->Filter, Model4Data->Synthesis->Evaluator |
| ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction | Data4Model->Usability->Eval->Understanding, Data4Model->Scaling Up->Acquisition |
| How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning? | Data4Model->Usability->Responsiveness->ICL, Model4Data->Insights->Navigator |
| Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want | Data4Model->Usability->Responsiveness->HumanBehavior |
| Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | Data4Model->Scaling Effectiveness->Packing |
| Fewer Truncations Improve Language Modeling | Data4Model->Scaling Effectiveness->Packing |
| MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale | Data4Model->Usability->Reasoning->MultiHop, Model4Data->Synthesis->Mapper |
| AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception | Data4Model->Scaling Up->Acquisition, Model4Data->Synthesis->Mapper |
| UNIAA: A Unified Multi-modal Image Aesthetic Data Augmentation and Assessment Baseline and Benchmark | Data4Model->Usability->Eval->Understanding, Model4Data->Synthesis->Creator |
| Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives | Data4Model->Scaling Up->Augmentation, Model4Data->Synthesis->Creator |
| Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | Data4Model->Usability->Responsiveness->Prompt, Data4Model->Usability->Ethic->Toxicity, Model4Data->Synthesis->Evaluator |
| TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models | Data4Model->Scaling Up->Acquisition |
| The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative | Data4Model->Usability->Ethic->Toxicity |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Model4Data->Synthesis->Mapper, Data4Model->Scaling Up->Acquisition |
| MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria | Data4Model->Usability->Eval->Understanding, Model4Data->Synthesis->Evaluator |
| MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models | Data4Model->Usability->Eval->Generation, Data4Model->Usability->Ethic->Toxicity |
| Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models | Data4Model->Usability->Responsiveness->ICL, Data4Model->Usability->Reasoning->MultiHop, Data4Model->Scaling Up->Diversity |
| M3DBench: Let’s Instruct Large Models with Multi-modal 3D Prompts | Data4Model->Usability->Eval->Understanding |
| MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model | Model4Data->Insights->Analyzer, Model4Data->Synthesis->Mapper |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Model4Data->Insights->Analyzer |
| mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding | Model4Data->Insights->Analyzer |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Data4Model->Scaling Up->Augmentation |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | Model4Data->Insights->Analyzer |
| Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent Transportation | Data4Model->Usability->Eval->Understanding, Data4Model->Usability->Eval->Retrieval |
| On the Adversarial Robustness of Multi-Modal Foundation Models | Data4Model->Usability->Ethic->Toxicity |
| What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models | Data4Model->Usability->Reasoning->SingleHop, Model4Data->Synthesis->Filter, Model4Data->Synthesis->Creator |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Data4Model->Scaling Up->Acquisition |
| PaLM-E: An Embodied Multimodal Language Model | Data4Model->Scaling Up->Diversity |
| Multimodal Data Curation via Object Detection and Filter Ensembles | Data4Model->Scaling Effectiveness->Condensation |
| Sieve: Multimodal Dataset Pruning Using Image Captioning Models | Data4Model->Scaling Effectiveness->Condensation |
| Towards a statistical theory of data selection under weak supervision | Data4Model->Scaling Effectiveness->Condensation |
| D2 Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning | Data4Model->Scaling Up->Diversity, Data4Model->Scaling Effectiveness->Condensation |
| UIClip: A Data-driven Model for Assessing User Interface Design | Data4Model->Scaling Up->Acquisition |
| CapsFusion: Rethinking Image-Text Data at Scale | Data4Model->Scaling Up->Augmentation |
| Improving CLIP Training with Language Rewrites | Model4Data->Synthesis->Mapper, Data4Model->Scaling Up->Augmentation |
| OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation | Data4Model->Usability->Eval->Generation |
| A Decade's Battle on Dataset Bias: Are We There Yet? | Data4Model->Scaling Effectiveness->Mixture |
| Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | Data4Model->Scaling Up->Acquisition, Data4Model->Scaling Effectiveness->CrossModalAlignment |
| Data Filtering Networks | Data4Model->Scaling Effectiveness->Condensation |
| T-MARS: Improving Visual Representations by Circumventing Text Feature Learning | Data4Model->Scaling Effectiveness->Condensation |
| InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 | Data4Model->Scaling Effectiveness->Condensation |
| Align and Attend: Multimodal Summarization with Dual Contrastive Losses | Data4Model->Scaling Effectiveness->CrossModalAlignment |
| MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? | Data4Model->Usability->Reasoning->SingleHop, Data4Model->Usability->Reasoning->MultiHop, Data4Model->Usability->Eval->Reasoning |
| Text-centric Alignment for Multi-Modality Learning | Model4Data->Synthesis->Mapper |
| Noisy Correspondence Learning with Meta Similarity Correction | Data4Model->Scaling Effectiveness->CrossModalAlignment |
| Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | Data4Model->Usability->Reasoning->MultiHop |
| Language-Image Models with 3D Understanding | Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Reasoning->SingleHop, Data4Model->Usability->Reasoning->MultiHop |
| Scaling Laws for Generative Mixed-Modal Language Models | Data4Model->Scaling Up->Acquisition |
| BLINK: Multimodal Large Language Models Can See but Not Perceive | Data4Model->Usability->Eval->Understanding |
| Visual Hallucinations of Multi-modal Large Language Models | Data4Model->Usability->Eval->Generation |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | Data4Model->Usability->Responsiveness->Prompt, Data4Model->Usability->Reasoning->MultiHop |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Reasoning->MultiHop, Model4Data->Synthesis->Creator |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Reasoning->MultiHop |
| Visual Instruction Tuning | Data4Model->Scaling Up->Acquisition, Model4Data->Synthesis->Creator, Model4Data->Synthesis->Mapper |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | Data4Model->Scaling Up->Acquisition, Data4Model->Scaling Effectiveness->CrossModalAlignment, Data4Model->Usability->Responsiveness->HumanBehavior |
| Time-LLM: Time Series Forecasting by Reprogramming Large Language Models | Data4Model->Usability->Responsiveness->Prompt |
| On the De-duplication of LAION-2B | Data4Model->Scaling Effectiveness->Condensation |
| Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | Data4Model->Scaling Up->Acquisition, Data4Model->Scaling Effectiveness->Mixture |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Data4Model->Usability->Eval->Understanding |
| LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Data4Model->Usability->Responsiveness->Prompt, Model4Data->Insights->Extractor |
| Data Augmentation for Text-based Person Retrieval Using Large Language Models | Data4Model->Scaling Up->Augmentation, Data4Model->Scaling Effectiveness->Mixture, Model4Data->Synthesis->Mapper |
| Aligning Actions and Walking to LLM-Generated Textual Descriptions | Data4Model->Scaling Up->Augmentation, Model4Data->Synthesis->Mapper |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Data4Model->Scaling Up->Augmentation |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Data4Model->Scaling Up->Diversity |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | Data4Model->Scaling Effectiveness->CrossModalAlignment |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | Model4Data->Synthesis->Creator |
| Probing Multimodal LLMs as World Models for Driving | Data4Model->Usability->Eval->Understanding, Data4Model->Usability->Eval->Reasoning |
| Unified Hallucination Detection for Multimodal Large Language Models | Data4Model->Usability->Eval->Generation, Model4Data->Insights->Extractor, Model4Data->Synthesis->Mapper |
| SemDeDup: Data-efficient learning at web-scale through semantic deduplication | Data4Model->Scaling Effectiveness->Condensation |
| Automated Multi-level Preference for MLLMs | Data4Model->Usability->Responsiveness->HumanBehavior |
| Silkie: Preference distillation for large visual language models | Data4Model->Usability->Responsiveness->HumanBehavior |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Data4Model->Usability->Responsiveness->HumanBehavior |
| M3IT: A Large-Scale Dataset Towards Multi-Modal Multilingual Instruction Tuning | Data4Model->Usability->Responsiveness->HumanBehavior |
| Aligning Large Multimodal Models with Factually Augmented RLHF | Data4Model->Usability->Responsiveness->HumanBehavior |
| DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback | Data4Model->Usability->Responsiveness->HumanBehavior |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | Data4Model->Scaling Effectiveness->CrossModalAlignment |
| MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark | Data4Model->Usability->Eval->Generation, Model4Data->Synthesis->Evaluator |
| MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI | Data4Model->Usability->Eval->Understanding, Data4Model->Usability->Eval->Retrieval |
| M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | Data4Model->Usability->Eval->Reasoning |
| ImgTrojan: Jailbreaking Vision-Language Models with ONE Image | Data4Model->Usability->Ethic->Toxicity, Model4Data->Synthesis->Evaluator, Model4Data->Synthesis->Creator |
| VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models | Data4Model->Usability->Ethic->Toxicity |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | Data4Model->Usability->Ethic->Toxicity |
| Improving Multimodal Datasets with Image Captioning | Data4Model->Scaling Effectiveness->Condensation |
| Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System | Model4Data->Insights->Analyzer |
| PDFChatAnnotator: A Human-LLM Collaborative Multi-Modal Data Annotation Tool for PDF-Format Catalogs | Model4Data->Insights->Extractor, Model4Data->Synthesis->Mapper |
| CiT: Curation in Training for Effective Vision-Language Data | Data4Model->Scaling Effectiveness->Condensation, Data4Model->Scaling Effectiveness->Mixture |
| InstructPix2Pix: Learning to Follow Image Editing Instructions | Model4Data->Synthesis->Creator |
| Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study | Model4Data->Insights->Visualizer |
| ModelGo: A Practical Tool for Machine Learning License Analysis | Data4Model->Usability->Ethic->Privacy&IP |
| Scaling Laws of Synthetic Images for Model Training ... for Now | Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Responsiveness->Prompt |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Data4Model->Scaling Up->Diversity |
| Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V | Data4Model->Usability->Responsiveness->Prompt |
| Segment Anything | Data4Model->Scaling Up->Acquisition |
| AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning | Data4Model->Usability->Responsiveness->ICL |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Data4Model->Usability->Responsiveness->ICL |
| All in an Aggregated Image for In-Image Learning | Data4Model->Usability->Responsiveness->ICL |
| Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers | Data4Model->Scaling Up->Acquisition |
| Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text | Data4Model->Scaling Up->Acquisition |
| ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | Data4Model->Scaling Up->Acquisition |