# Awesome Data-Model Co-Development of MLLMs

Welcome to the "Awesome List" for data-model co-development of Multi-Modal Large Language Models (MLLMs), a continually updated resource tailored for the open-source community. This compilation features cutting-edge research and insightful articles on improving MLLMs through data-model co-development, each tagged according to the taxonomy proposed in our data-model co-development survey, as illustrated below.

## Overview of Our Taxonomy

Given the rapid progress in this field, this repository and our paper are continuously updated and kept in sync with each other. Please feel free to open pull requests or issues to contribute to this list and add more related resources!

## Detailed Paper List

| Title | Tags |
| --- | --- |
| No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance | Data4Model->Scaling Up->Acquisition, Data4Model->Scaling Effectiveness->CrossModalAlignment, Model4Data->Synthesis->Evaluator |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Model4Data->Synthesis->Creator |
| Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain | Data4Model->Usability->Ethic->Toxicity |
| Probing Heterogeneous Pretraining Datasets with Small Curated Datasets | Data4Model->Scaling Effectiveness->Condensation |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | Model4Data->Synthesis->Creator, Model4Data->Insights->Visualizer |
| VideoChat: Chat-Centric Video Understanding | Model4Data->Synthesis->Creator, Model4Data->Synthesis->Mapper |
| Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex | Model4Data->Synthesis->Mapper |
| 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding | Model4Data->Synthesis->Creator |
| GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting | Data4Model->Scaling Up->Acquisition, Model4Data->Synthesis->Mapper |
| Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | Data4Model->Scaling Up->Acquisition |
| Audio Retrieval with WavText5K and CLAP Training | Data4Model->Scaling Up->Diversity, Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Eval->Retrieval |
| The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering | Data4Model->Scaling Effectiveness->Condensation |
| Demystifying CLIP Data | Data4Model->Scaling Effectiveness->Mixture |
| Learning Transferable Visual Models From Natural Language Supervision | Data4Model->Scaling Up->Acquisition |
| DataComp: In search of the next generation of multimodal datasets | Data4Model->Scaling Effectiveness->Condensation, Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Eval->Generation, Model4Data->Synthesis->Filter |
| Beyond neural scaling laws: beating power law scaling via data pruning | Data4Model->Scaling Effectiveness->Condensation |
| Flamingo: a visual language model for few-shot learning | Data4Model->Scaling Effectiveness->Mixture |
| Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP | Data4Model->Scaling Effectiveness->Condensation, Data4Model->Scaling Effectiveness->Mixture |
| VBench: Comprehensive Benchmark Suite for Video Generative Models | Data4Model->Usability->Eval->Generation |
| EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | Data4Model->Usability->Eval->Generation |
| Training Compute-Optimal Large Language Models | Data4Model->Scaling Up->Acquisition |
| NExT-GPT: Any-to-Any Multimodal LLM | Data4Model->Scaling Up->Acquisition |
| ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization | Data4Model->Scaling Up->Acquisition, Data4Model->Scaling Effectiveness->CrossModalAlignment |
| ChartReformer: Natural Language-Driven Chart Image Editing | Data4Model->Scaling Up->Acquisition, Model4Data->Insights->Visualizer |
| GroundingGPT: Language Enhanced Multi-modal Grounding Model | Data4Model->Usability->Responsiveness->ICL |
| Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic | Data4Model->Usability->Responsiveness->Prompt |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Data4Model->Usability->Responsiveness->Prompt |
| Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters | Model4Data->Synthesis->Filter, Model4Data->Synthesis->Creator |
| Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training | Data4Model->Scaling Effectiveness->Condensation |
| Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation | Model4Data->Synthesis->Creator, Data4Model->Scaling Up->Acquisition, Data4Model->Scaling Up->Diversity, Data4Model->Usability->Responsiveness->HumanBehavior |
| 3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset | Data4Model->Usability->Eval->Understanding |
| Structured Packing in LLM Training Improves Long Context Utilization | Data4Model->Scaling Effectiveness->Packing |
| Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models | Data4Model->Scaling Effectiveness->Packing |
| MoDE: CLIP Data Experts via Clustering | Data4Model->Scaling Effectiveness->Packing |
| Efficient Multimodal Learning from Data-centric Perspective | Data4Model->Scaling Effectiveness->Condensation |
| Improved Baselines for Data-efficient Perceptual Augmentation of LLMs | Data4Model->Scaling Up->Augmentation |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Data4Model->Usability->Eval->Understanding |
| SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Data4Model->Usability->Eval->Understanding |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Data4Model->Scaling Up->Acquisition |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | Data4Model->Usability->Eval->Understanding |
| FunQA: Towards Surprising Video Comprehension | Data4Model->Usability->Eval->Reasoning |
| OneChart: Purify the Chart Structural Extraction via One Auxiliary Token | Data4Model->Usability->Eval->Understanding, Model4Data->Synthesis->Creator |
| ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning | Data4Model->Usability->Eval->Reasoning |
| StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding | Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Reasoning->SingleHop |
| MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Eval->Understanding |
| ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning | Data4Model->Usability->Eval->Understanding, Model4Data->Synthesis->Creator, Data4Model->Scaling Up->Diversity |
| WorldGPT: Empowering LLM as Multimodal World Model | Data4Model->Usability->Eval->Generation |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Data4Model->Usability->Responsiveness->Prompt, Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Responsiveness->ICL |
| TextSquare: Scaling up Text-Centric Visual Instruction Tuning | Data4Model->Scaling Up->Acquisition, Model4Data->Synthesis->Creator, Model4Data->Synthesis->Filter, Model4Data->Synthesis->Evaluator |
| ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction | Data4Model->Usability->Eval->Understanding, Data4Model->Scaling Up->Acquisition |
| How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning? | Data4Model->Usability->Responsiveness->ICL, Model4Data->Insights->Navigator |
| Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want | Data4Model->Usability->Responsiveness->HumanBehavior |
| Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | Data4Model->Scaling Effectiveness->Packing |
| Fewer Truncations Improve Language Modeling | Data4Model->Scaling Effectiveness->Packing |
| MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale | Data4Model->Usability->Reasoning->MultiHop, Model4Data->Synthesis->Mapper |
| AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception | Data4Model->Scaling Up->Acquisition, Model4Data->Synthesis->Mapper |
| UNIAA: A Unified Multi-modal Image Aesthetic Data Augmentation and Assessment Baseline and Benchmark | Data4Model->Usability->Eval->Understanding, Model4Data->Synthesis->Creator |
| Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives | Data4Model->Scaling Up->Augmentation, Model4Data->Synthesis->Creator |
| Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | Data4Model->Usability->Responsiveness->Prompt, Data4Model->Usability->Ethic->Toxicity, Model4Data->Synthesis->Evaluator |
| TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models | Data4Model->Scaling Up->Acquisition |
| The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative | Data4Model->Usability->Ethic->Toxicity |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Model4Data->Synthesis->Mapper, Data4Model->Scaling Up->Acquisition |
| MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria | Data4Model->Usability->Eval->Understanding, Model4Data->Synthesis->Evaluator |
| MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models | Data4Model->Usability->Eval->Generation, Data4Model->Usability->Ethic->Toxicity |
| Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models | Data4Model->Usability->Responsiveness->ICL, Data4Model->Usability->Reasoning->MultiHop, Data4Model->Scaling Up->Diversity |
| M3DBench: Let’s Instruct Large Models with Multi-modal 3D Prompts | Data4Model->Usability->Eval->Understanding |
| MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model | Model4Data->Insights->Analyzer, Model4Data->Synthesis->Mapper |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Model4Data->Insights->Analyzer |
| mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding | Model4Data->Insights->Analyzer |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Data4Model->Scaling Up->Augmentation |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | Model4Data->Insights->Analyzer |
| Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent Transportation | Data4Model->Usability->Eval->Understanding, Data4Model->Usability->Eval->Retrieval |
| On the Adversarial Robustness of Multi-Modal Foundation Models | Data4Model->Usability->Ethic->Toxicity |
| What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models | Data4Model->Usability->Reasoning->SingleHop, Model4Data->Synthesis->Filter, Model4Data->Synthesis->Creator |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Data4Model->Scaling Up->Acquisition |
| PaLM-E: An Embodied Multimodal Language Model | Data4Model->Scaling Up->Diversity |
| Multimodal Data Curation via Object Detection and Filter Ensembles | Data4Model->Scaling Effectiveness->Condensation |
| Sieve: Multimodal Dataset Pruning Using Image Captioning Models | Data4Model->Scaling Effectiveness->Condensation |
| Towards a statistical theory of data selection under weak supervision | Data4Model->Scaling Effectiveness->Condensation |
| D2 Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning | Data4Model->Scaling Up->Diversity, Data4Model->Scaling Effectiveness->Condensation |
| UIClip: A Data-driven Model for Assessing User Interface Design | Data4Model->Scaling Up->Acquisition |
| CapsFusion: Rethinking Image-Text Data at Scale | Data4Model->Scaling Up->Augmentation |
| Improving CLIP Training with Language Rewrites | Model4Data->Synthesis->Mapper, Data4Model->Scaling Up->Augmentation |
| OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation | Data4Model->Usability->Eval->Generation |
| A Decade's Battle on Dataset Bias: Are We There Yet? | Data4Model->Scaling Effectiveness->Mixture |
| Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | Data4Model->Scaling Up->Acquisition, Data4Model->Scaling Effectiveness->CrossModalAlignment |
| Data Filtering Networks | Data4Model->Scaling Effectiveness->Condensation |
| T-MARS: Improving Visual Representations by Circumventing Text Feature Learning | Data4Model->Scaling Effectiveness->Condensation |
| InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 | Data4Model->Scaling Effectiveness->Condensation |
| Align and Attend: Multimodal Summarization with Dual Contrastive Losses | Data4Model->Scaling Effectiveness->CrossModalAlignment |
| MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? | Data4Model->Usability->Reasoning->SingleHop, Data4Model->Usability->Reasoning->MultiHop, Data4Model->Usability->Eval->Reasoning |
| Text-centric Alignment for Multi-Modality Learning | Model4Data->Synthesis->Mapper |
| Noisy Correspondence Learning with Meta Similarity Correction | Data4Model->Scaling Effectiveness->CrossModalAlignment |
| Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | Data4Model->Usability->Reasoning->MultiHop |
| Language-Image Models with 3D Understanding | Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Reasoning->SingleHop, Data4Model->Usability->Reasoning->MultiHop |
| Scaling Laws for Generative Mixed-Modal Language Models | Data4Model->Scaling Up->Acquisition |
| BLINK: Multimodal Large Language Models Can See but Not Perceive | Data4Model->Usability->Eval->Understanding |
| Visual Hallucinations of Multi-modal Large Language Models | Data4Model->Usability->Eval->Generation |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | Data4Model->Usability->Responsiveness->Prompt, Data4Model->Usability->Reasoning->MultiHop |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Reasoning->MultiHop, Model4Data->Synthesis->Creator |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Reasoning->MultiHop |
| Visual Instruction Tuning | Data4Model->Scaling Up->Acquisition, Model4Data->Synthesis->Creator, Model4Data->Synthesis->Mapper |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | Data4Model->Scaling Up->Acquisition, Data4Model->Scaling Effectiveness->CrossModalAlignment, Data4Model->Usability->Responsiveness->HumanBehavior |
| Time-LLM: Time Series Forecasting by Reprogramming Large Language Models | Data4Model->Usability->Responsiveness->Prompt |
| On the De-duplication of LAION-2B | Data4Model->Scaling Effectiveness->Condensation |
| Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | Data4Model->Scaling Up->Acquisition, Data4Model->Scaling Effectiveness->Mixture |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Data4Model->Usability->Eval->Understanding |
| LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Data4Model->Usability->Responsiveness->Prompt, Model4Data->Insights->Extractor |
| Data Augmentation for Text-based Person Retrieval Using Large Language Models | Data4Model->Scaling Up->Augmentation, Data4Model->Scaling Effectiveness->Mixture, Model4Data->Synthesis->Mapper |
| Aligning Actions and Walking to LLM-Generated Textual Descriptions | Data4Model->Scaling Up->Augmentation, Model4Data->Synthesis->Mapper |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Data4Model->Scaling Up->Augmentation |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Data4Model->Scaling Up->Diversity |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | Data4Model->Scaling Effectiveness->CrossModalAlignment |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | Model4Data->Synthesis->Creator |
| Probing Multimodal LLMs as World Models for Driving | Data4Model->Usability->Eval->Understanding, Data4Model->Usability->Eval->Reasoning |
| Unified Hallucination Detection for Multimodal Large Language Models | Data4Model->Usability->Eval->Generation, Model4Data->Insights->Extractor, Model4Data->Synthesis->Mapper |
| SemDeDup: Data-efficient learning at web-scale through semantic deduplication | Data4Model->Scaling Effectiveness->Condensation |
| Automated Multi-level Preference for MLLMs | Data4Model->Usability->Responsiveness->HumanBehavior |
| Silkie: Preference distillation for large visual language models | Data4Model->Usability->Responsiveness->HumanBehavior |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Data4Model->Usability->Responsiveness->HumanBehavior |
| M3IT: A Large-Scale Dataset Towards Multi-Modal Multilingual Instruction Tuning | Data4Model->Usability->Responsiveness->HumanBehavior |
| Aligning Large Multimodal Models with Factually Augmented RLHF | Data4Model->Usability->Responsiveness->HumanBehavior |
| DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback | Data4Model->Usability->Responsiveness->HumanBehavior |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | Data4Model->Scaling Effectiveness->CrossModalAlignment |
| MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark | Data4Model->Usability->Eval->Generation, Model4Data->Synthesis->Evaluator |
| MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI | Data4Model->Usability->Eval->Understanding, Data4Model->Usability->Eval->Retrieval |
| M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | Data4Model->Usability->Eval->Reasoning |
| ImgTrojan: Jailbreaking Vision-Language Models with ONE Image | Data4Model->Usability->Ethic->Toxicity, Model4Data->Synthesis->Evaluator, Model4Data->Synthesis->Creator |
| VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models | Data4Model->Usability->Ethic->Toxicity |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | Data4Model->Usability->Ethic->Toxicity |
| Improving Multimodal Datasets with Image Captioning | Data4Model->Scaling Effectiveness->Condensation |
| Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System | Model4Data->Insights->Analyzer |
| PDFChatAnnotator: A Human-LLM Collaborative Multi-Modal Data Annotation Tool for PDF-Format Catalogs | Model4Data->Insights->Extractor, Model4Data->Synthesis->Mapper |
| CiT: Curation in Training for Effective Vision-Language Data | Data4Model->Scaling Effectiveness->Condensation, Data4Model->Scaling Effectiveness->Mixture |
| InstructPix2Pix: Learning to Follow Image Editing Instructions | Model4Data->Synthesis->Creator |
| Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study | Model4Data->Insights->Visualizer |
| ModelGo: A Practical Tool for Machine Learning License Analysis | Data4Model->Usability->Ethic->Privacy&IP |
| Scaling Laws of Synthetic Images for Model Training ... for Now | Data4Model->Scaling Up->Acquisition, Data4Model->Usability->Responsiveness->Prompt |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Data4Model->Scaling Up->Diversity |
| Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V | Data4Model->Usability->Responsiveness->Prompt |
| Segment Anything | Data4Model->Scaling Up->Acquisition |
| AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning | Data4Model->Usability->Responsiveness->ICL |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Data4Model->Usability->Responsiveness->ICL |
| All in an Aggregated Image for In-Image Learning | Data4Model->Usability->Responsiveness->ICL |
| Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers | Data4Model->Scaling Up->Acquisition |
| Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text | Data4Model->Scaling Up->Acquisition |
| ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | Data4Model->Scaling Up->Acquisition |