DJ-Cookbook

Curated Resources

Coding with Data-Juicer (DJ)

Use Cases & Data Recipes

  • Data Recipe Gallery

    • Data-Juicer Minimal Example Recipe

    • Reproducing Open Source Text Datasets

    • Improving Open Source Pre-training Text Datasets

    • Improving Open Source Post-tuning Text Datasets

    • Synthetic Contrastive Learning Image-text Datasets

    • Improving Open Source Image-text Datasets

    • Basic Example Recipes for Video Data

    • Synthesizing Human-centric Video Benchmarks

    • Improving Existing Open Source Video Datasets

  • Data-Juicer related Competitions

    • Better Synth: explore how synthetic data from large models affects image understanding ability, using DJ-Sandbox Lab and multimodal large models

    • Modelscope-Sora Challenge: using Data-Juicer and the EasyAnimate framework, optimize data and train small SORA-like models to generate better videos

    • Better Mixture: adjust only the data mixing and sampling strategies for a given set of candidate datasets

    • FT-Data Ranker (1B Track, 7B Track): for a specified candidate dataset, adjust only the data filtering and enhancement strategies

    • Kolors-LoRA Stylized Story Challenge: using Data-Juicer and the DiffSynth-Studio framework, explore diffusion model fine-tuning

  • DJ-SORA

  • Based on Data-Juicer and the AgentScope framework, leverage agents to call DJ Filters and DJ Mappers

Interactive Examples