DJ-Cookbook
Curated Resources
Coding with Data-Juicer (DJ)
Use Cases & Data Recipes
Data-Juicer Minimal Example Recipe
Reproducing Open Source Text Datasets
Improving Open Source Pre-training Text Datasets
Improving Open Source Post-tuning Text Datasets
Synthetic Contrastive Learning Image-text Datasets
Improving Open Source Image-text Datasets
Basic Example Recipes for Video Data
Synthesizing Human-centric Video Benchmarks
Improving Existing Open Source Video Datasets
Data-Juicer-related Competitions
Better Synth: explore the impact of synthetic data from large models on image understanding ability, using DJ-Sandbox Lab and multimodal large models
Modelscope-Sora Challenge: based on Data-Juicer and the EasyAnimate framework, optimize data and train small SORA-like models to generate better videos
Better Mixture: adjust only the data mixing and sampling strategies for several given candidate datasets
FT-Data Ranker (1B Track, 7B Track): for a specified candidate dataset, adjust only the data filtering and enhancement strategies
Kolors-LoRA Stylized Story Challenge: based on Data-Juicer and the DiffSynth-Studio framework, explore diffusion model fine-tuning
Based on Data-Juicer and the AgentScope framework, use agents to call DJ Filters and DJ Mappers
Interactive Examples
Introduction to Data-Juicer [ModelScope] [HuggingFace]
Data Visualization:
Basic Statistics [ModelScope] [HuggingFace]
Lexical Diversity [ModelScope] [HuggingFace]
Operator Insight (Single OP) [ModelScope] [HuggingFace]
Operator Effect (Multiple OPs) [ModelScope] [HuggingFace]
Data Processing:
Scientific Literature (e.g. arXiv) [ModelScope] [HuggingFace]
Programming Code (e.g. TheStack) [ModelScope] [HuggingFace]
Chinese Instruction Data (e.g. Alpaca-CoT) [ModelScope] [HuggingFace]
Tool Pool:
Dataset Splitting by Language [ModelScope] [HuggingFace]
Quality Classifier for CommonCrawl [ModelScope] [HuggingFace]
Auto Evaluation on HELM [ModelScope] [HuggingFace]
Data Sampling and Mixture [ModelScope] [HuggingFace]
Data Processing Loop [ModelScope] [HuggingFace]