# Data Recipe Gallery

- The recipe [folder](../configs) contains a rich set of sample configuration files for Data-Juicer data recipes, helping users easily understand, reuse, and extend configurations across various functional scenarios.
- 📣📣📣 Community contributors can submit PRs to add customized data recipes, promoting dissemination, reuse, and the evolution of related techniques. We welcome co-construction and will highlight contributions in the [acknowledgements](https://github.com/modelscope/data-juicer?tab=readme-ov-file#acknowledgement)!

Table of Contents
- [1. Data-Juicer Minimal Example Recipe](#1-data-juicer-minimal-example-recipe)
- [2. Reproduce Open Source Text Datasets](#2-reproduce-open-source-text-datasets)
- [3. Improved Open Source Pre-training Text Datasets](#3-improved-open-source-pre-training-text-datasets)
- [4. Improved Open Source Post-tuning Text Dataset](#4-improved-open-source-post-tuning-text-dataset)
- [5. Synthetic Contrastive Learning Image-text datasets](#5-synthetic-contrastive-learning-image-text-datasets)
- [6. Improved Open Source Image-text datasets](#6-improved-open-source-image-text-datasets)
  - [6.1. Evaluation and Verification](#61-evaluation-and-verification)
- [7. Basic Example Recipes for Video Data](#7-basic-example-recipes-for-video-data)
- [8. Synthesize Human-centric Video Benchmarks](#8-synthesize-human-centric-video-benchmarks)
- [9. Improve Existing Open Source Video Datasets](#9-improve-existing-open-source-video-datasets)
  - [9.1. Evaluation and Verification](#91-evaluation-and-verification)

## 1. Data-Juicer Minimal Example Recipe

Some basic configuration files are placed in the [demo](../configs/demo/) folder to help users quickly get familiar with the basic functions of Data-Juicer. Please refer to that folder for detailed descriptions.

## 2. Reproduce Open Source Text Datasets

- We reproduced the processing flow of part of the Redpajama dataset. Please refer to the [reproduced_redpajama](../configs/reproduced_redpajama) folder for a detailed description.
- We reproduced the processing flow of part of the BLOOM dataset. Please refer to the [reproduced_bloom](../configs/reproduced_bloom) folder for a detailed description.

## 3. Improved Open Source Pre-training Text Datasets

We found that existing processed datasets (such as Redpajama, The Pile, etc.) still contain some "bad" data samples, so we use Data-Juicer to refine these datasets and feed the refined data to LLMs for better performance. We use a simple 3-σ rule to set the hyperparameters of the operators in each data processing recipe: for a given per-sample statistic, samples whose values fall outside mean ± 3 × standard deviation are filtered out.
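As a minimal sketch of how such 3-σ-derived thresholds appear in a recipe: `words_num_filter` and `perplexity_filter` are built-in Data-Juicer operators, but the concrete numbers below are illustrative assumptions, not the tuned values from the released configs.

```yaml
# Illustrative excerpt of a refine recipe; threshold values are hypothetical
# and would be derived from the subset's own statistics via the 3-σ rule.
process:
  - words_num_filter:     # keep samples with a "normal" word count
      min_num: 20         # ≈ max(0, mean - 3 * std) of words per sample
      max_num: 100000     # ≈ mean + 3 * std of words per sample
  - perplexity_filter:    # drop samples with abnormally high perplexity
      lang: en
      max_ppl: 5000       # ≈ mean + 3 * std of per-sample perplexity
```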
| Data subset | Number of samples before refinement | Number of samples after refinement | Sample retention rate | Config link | Data link | Source |
|---|:---:|:---:|:---:|---|---|---|
| arXiv | 1,724,497 | 1,655,259 | 95.99% | [redpajama-arxiv-refine.yaml](../configs/data_juicer_recipes/redpajama-arxiv-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-arxiv-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-arxiv-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer) | Redpajama |
| Books | 205,182 | 195,983 | 95.51% | [redpajama-book-refine.yaml](../configs/data_juicer_recipes/redpajama-book-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-book-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-book-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-book-refined-by-data-juicer) | Redpajama |
| Wikipedia | 29,834,171 | 26,990,659 | 90.47% | [redpajama-wiki-refine.yaml](../configs/data_juicer_recipes/redpajama-wiki-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-wiki-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-wiki-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-wiki-refined-by-data-juicer) | Redpajama |
| C4 | 364,868,892 | 344,491,171 | 94.42% | [redpajama-c4-refine.yaml](../configs/data_juicer_recipes/redpajama-c4-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-c4-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-c4-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer) | Redpajama |
| Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | [redpajama-cc-2019-30-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2019-30-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2019-30-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2019-30-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer) | Redpajama |
| Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | [redpajama-cc-2020-05-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2020-05-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2020-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2020-05-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2020-05-refined-by-data-juicer) | Redpajama |
| Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | [redpajama-cc-2021-04-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2021-04-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2021-04-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2021-04-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer) | Redpajama |
| Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | [redpajama-cc-2022-05-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2022-05-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2022-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2022-05-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer) | Redpajama |
| Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | [redpajama-cc-2023-06-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2023-06-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2023-06-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2023-06-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2023-06-refined-by-data-juicer) | Redpajama |
| Github Code | 73,208,524 <br> + 21,387,703 | 49,279,344 | 52.09% | [redpajama-code-refine.yaml](../configs/data_juicer_recipes/github_code/redpajama-code-refine.yaml) <br> [stack-code-refine.yaml](../configs/data_juicer_recipes/github_code/stack-code-refine.yaml) <br> [redpajama-stack-code-deduplicate.yaml](../configs/data_juicer_recipes/github_code/redpajama-stack-code-deduplicate.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-stack-code-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-stack-code-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-stack-code-refined-by-data-juicer) | Redpajama <br> The Stack |
| StackExchange | 45,447,328 | 26,309,203 | 57.89% | [redpajama-pile-stackexchange-refine.yaml](../configs/data_juicer_recipes/redpajama-pile-stackexchange-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-pile-stackexchange-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-pile-stackexchange-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-pile-stackexchange-refined-by-data-juicer) | Redpajama <br> The Pile |
| EuroParl | 69,814 | 61,601 | 88.23% | [pile-europarl-refine.yaml](../configs/data_juicer_recipes/pile-europarl-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-europarl-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-europarl-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-europarl-refined-by-data-juicer) | The Pile |
| FreeLaw | 3,562,015 | 2,942,612 | 82.61% | [pile-freelaw-refine.yaml](../configs/data_juicer_recipes/pile-freelaw-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-freelaw-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-freelaw-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-freelaw-refined-by-data-juicer) | The Pile |
| HackerNews | 373,027 | 371,331 | 99.55% | [pile-hackernews-refine.yaml](../configs/data_juicer_recipes/pile-hackernews-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hackernews-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-hackernews-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-hackernews-refined-by-data-juicer) | The Pile |
| NIH ExPorter | 939,661 | 858,492 | 91.36% | [pile-nih-refine.yaml](../configs/data_juicer_recipes/pile-nih-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hin-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-nih-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-nih-refined-by-data-juicer) | The Pile |
| PhilPapers | 32,782 | 29,117 | 88.82% | [pile-philpaper-refine.yaml](../configs/data_juicer_recipes/pile-philpaper-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-philpaper-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-philpaper-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-philpaper-refined-by-data-juicer) | The Pile |
| PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | [pile-pubmed-abstract-refine.yaml](../configs/data_juicer_recipes/pile-pubmed-abstract-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-abstract-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-abstracts-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-abstracts-refined-by-data-juicer) | The Pile |
| PubMed Central | 3,098,930 | 2,694,860 | 86.96% | [pile-pubmed-central-refine.yaml](../configs/data_juicer_recipes/pile-pubmed-central-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-central-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-central-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-central-refined-by-data-juicer) | The Pile |
| USPTO | 5,883,024 | 4,516,283 | 76.77% | [pile-uspto-refine.yaml](../configs/data_juicer_recipes/pile-uspto-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-uspto-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-uspto-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-uspto-refined-by-data-juicer) | The Pile |

## 4. Improved Open Source Post-tuning Text Dataset

Take the Alpaca-CoT dataset as an example:

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|:---:|:---:|:---:|---|---|---|
| Alpaca-CoT EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](../configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer) | [39 subsets from Alpaca-CoT](../configs/data_juicer_recipes/alpaca_cot/README.md) |
| Alpaca-CoT ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](../configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer) | [28 subsets from Alpaca-CoT](../configs/data_juicer_recipes/alpaca_cot/README.md) |

## 5. Synthetic Contrastive Learning Image-text datasets

Data-Juicer provides rich built-in operators that support image-text multimodal data synthesis, such as the Img-Diff dataset. This synthetic data brings a 12-point performance improvement on the MMVP benchmark. For more details, see the Img-Diff [paper](https://arxiv.org/abs/2408.04594); the corresponding recipe implementation is available in the [ImgDiff-Dev](https://github.com/modelscope/data-juicer/tree/ImgDiff) branch.

## 6. Improved Open Source Image-text datasets

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|:---:|:---:|:---:|---|---|---|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](../configs/data_juicer_recipes/llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | [data-juicer-sandbox-optimal.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | [data-juicer-sandbox-self-evolution.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

### 6.1. Evaluation and Verification

- LLaVA pretrain (LCS-558k): The model pre-trained with **the improved pre-training dataset** and fine-tuned with the original instruction dataset outperforms the baseline model LLaVA-1.5-13B on 10 of the 12 evaluation sets.
| Models | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-13B (Baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B (Rectified Pretraining Dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |

- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): Trained on the **refined datasets**, both models outperform the baseline [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). Here, T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). For details, please refer to the [Sandbox Laboratory](./Sandbox.md).

| model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
| Data-Juicer (DJ, 228k) | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |

| model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |

## 7. Basic Example Recipes for Video Data

We provide a sample video dataset processing recipe to help users make better use of video-related operators: [general-video-refine-example.yaml](../configs/data_juicer_recipes/general-video-refine-example.yaml). It applies three types of operators:

- Text-only: improve dataset quality based on the video descriptions
- Video-only: improve dataset quality based on video properties
- Text-video: improve dataset quality based on the alignment between text and video

Users can start their own video dataset processing workflow from this recipe; a minimal sketch of such a configuration is shown below.
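The sketch uses real Data-Juicer video operators (one of each type above), but the parameter values are illustrative assumptions rather than the tuned settings of the linked recipe:

```yaml
# Illustrative video-refine recipe; values are placeholders, not tuned settings.
process:
  - language_id_score_filter:               # text-only: keep confident English captions
      lang: en
      min_score: 0.8
  - video_duration_filter:                  # video-only: drop overly short/long clips
      min_duration: 2                       # seconds
      max_duration: 60
  - video_frames_text_similarity_filter:    # text-video: keep well-aligned pairs
      hf_clip: openai/clip-vit-base-patch32
      min_score: 0.25
```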
## 8. Synthesize Human-centric Video Benchmarks

Data-Juicer also supports video benchmark synthesis, e.g., [HumanVBench](https://arxiv.org/abs/2412.17574), which converts in-the-wild videos into human-centric video benchmarks. The corresponding data recipes and construction process can be found in the [HumanVBench-dev](https://github.com/modelscope/data-juicer/tree/HumanVBench) branch.

## 9. Improve Existing Open Source Video Datasets

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|:---:|:---:|:---:|---|---|---|
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | [data-juicer-sandbox-optimal.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | [data-juicer-sandbox-self-evolution.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

### 9.1. Evaluation and Verification

- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): Trained on the **refined datasets**, both models fully surpass the baseline [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). Here, T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). For details, please refer to the [Sandbox Laboratory](./Sandbox.md).

| model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
| Data-Juicer (DJ, 228k) | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |

| model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |
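The refined datasets above are released as ready-to-download files (see the Aliyun/ModelScope/HuggingFace links in each table), and they can be fed directly into a new Data-Juicer recipe for further processing. A minimal sketch, assuming a locally downloaded jsonl file; the paths and the follow-up operator below are illustrative placeholders:

```yaml
# Minimal sketch: feed a downloaded refined subset into a new recipe.
dataset_path: ./the-pile-europarl-refine-result.jsonl  # local copy of a refined subset
export_path: ./my-further-refined.jsonl
np: 4                                                  # number of worker processes

process:
  - document_deduplicator:     # e.g., one extra exact-dedup pass on top
      lowercase: true
      ignore_non_character: true
```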