# Data Recipe Gallery

- The recipe [folder](../configs) contains a rich set of sample configuration files for Data-Juicer data recipes, helping users easily understand, reuse, and extend configurations across various functional scenarios.
- 📣📣📣 Community contributors can submit PRs to add customized data recipes, promoting dissemination, reuse, and the evolution of related techniques. We welcome co-construction and will highlight contributions in the [acknowledgements](https://github.com/modelscope/data-juicer?tab=readme-ov-file#acknowledgement)!

Table of Contents
- [1. Data-Juicer Minimal Example Recipe](#1-data-juicer-minimal-example-recipe)
- [2. Reproduce Open Source Text Datasets](#2-reproduce-open-source-text-datasets)
- [3. Improved Open Source Pre-training Text Datasets](#3-improved-open-source-pre-training-text-datasets)
- [4. Improved Open Source Post-tuning Text Dataset](#4-improved-open-source-post-tuning-text-dataset)
- [5. Synthetic Contrastive Learning Image-text datasets](#5-synthetic-contrastive-learning-image-text-datasets)
- [6. Improved Open Source Image-text datasets](#6-improved-open-source-image-text-datasets)
  - [6.1. Evaluation and Verification](#61-evaluation-and-verification)
- [7. Basic Example Recipes for Video Data](#7-basic-example-recipes-for-video-data)
- [8. Synthesize Human-centric Video Benchmarks](#8-synthesize-human-centric-video-benchmarks)
- [9. Improve Existing Open Source Video Datasets](#9-improve-existing-open-source-video-datasets)
  - [9.1. Evaluation and Verification](#91-evaluation-and-verification)

## 1. Data-Juicer Minimal Example Recipe

Some basic configuration files are placed in the [demo](../configs/demo/) folder to help users quickly get familiar with the basic functions of Data-Juicer. Please refer to that folder for detailed descriptions.

## 2. Reproduce Open Source Text Datasets

- We reproduced the processing flow of part of the Redpajama dataset. Please refer to the [reproduced_redpajama](../configs/reproduced_redpajama) folder for a detailed description.
- We reproduced the processing flow of part of the BLOOM dataset. Please refer to the [reproduced_bloom](../configs/reproduced_bloom) folder for a detailed description.

## 3. Improved Open Source Pre-training Text Datasets

We found that existing processed datasets (such as Redpajama, The Pile, etc.) still contain some "bad" data samples, so we use Data-Juicer to refine these datasets and feed the refined data to LLMs for better performance. We use a simple 3-σ rule to set the hyperparameters of the operators in each data processing recipe: for a given per-sample statistic, samples whose values fall outside mean ± 3 × standard deviation are filtered out.
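As a minimal sketch of how such 3-σ-derived thresholds appear in a recipe: `words_num_filter` and `perplexity_filter` are built-in Data-Juicer operators, but the concrete numbers below are illustrative assumptions, not the tuned values from the released configs.

```yaml
# Illustrative excerpt of a refine recipe; threshold values are hypothetical
# and would be derived from the subset's own statistics via the 3-σ rule.
process:
  - words_num_filter:     # keep samples with a "normal" word count
      min_num: 20         # ≈ max(0, mean - 3 * std) of words per sample
      max_num: 100000     # ≈ mean + 3 * std of words per sample
  - perplexity_filter:    # drop samples with abnormally high perplexity
      lang: en
      max_ppl: 5000       # ≈ mean + 3 * std of per-sample perplexity
```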
| Data subset | Number of samples before refinement | Number of samples after refinement | Sample retention rate | Config link | Data link | Source |
|---|:---:|:---:|:---:|---|---|---|
| arXiv | 1,724,497 | 1,655,259 | 95.99% | [redpajama-arxiv-refine.yaml](../configs/data_juicer_recipes/redpajama-arxiv-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-arxiv-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-arxiv-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer) | Redpajama |
| Books | 205,182 | 195,983 | 95.51% | [redpajama-book-refine.yaml](../configs/data_juicer_recipes/redpajama-book-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-book-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-book-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-book-refined-by-data-juicer) | Redpajama |
| Wikipedia | 29,834,171 | 26,990,659 | 90.47% | [redpajama-wiki-refine.yaml](../configs/data_juicer_recipes/redpajama-wiki-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-wiki-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-wiki-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-wiki-refined-by-data-juicer) | Redpajama |
| C4 | 364,868,892 | 344,491,171 | 94.42% | [redpajama-c4-refine.yaml](../configs/data_juicer_recipes/redpajama-c4-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-c4-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-c4-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer) | Redpajama |
| Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | [redpajama-cc-2019-30-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2019-30-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2019-30-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2019-30-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer) | Redpajama |
| Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | [redpajama-cc-2020-05-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2020-05-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2020-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2020-05-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2020-05-refined-by-data-juicer) | Redpajama |
| Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | [redpajama-cc-2021-04-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2021-04-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2021-04-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2021-04-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer) | Redpajama |
| Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | [redpajama-cc-2022-05-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2022-05-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2022-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2022-05-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer) | Redpajama |
| Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | [redpajama-cc-2023-06-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2023-06-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2023-06-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2023-06-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2023-06-refined-by-data-juicer) | Redpajama |
| Github Code | 73,208,524 <br> + 21,387,703 | 49,279,344 | 52.09% | [redpajama-code-refine.yaml](../configs/data_juicer_recipes/github_code/redpajama-code-refine.yaml) <br> [stack-code-refine.yaml](../configs/data_juicer_recipes/github_code/stack-code-refine.yaml) <br> [redpajama-stack-code-deduplicate.yaml](../configs/data_juicer_recipes/github_code/redpajama-stack-code-deduplicate.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-stack-code-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-stack-code-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-stack-code-refined-by-data-juicer) | Redpajama <br> The Stack |
| StackExchange | 45,447,328 | 26,309,203 | 57.89% | [redpajama-pile-stackexchange-refine.yaml](../configs/data_juicer_recipes/redpajama-pile-stackexchange-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-pile-stackexchange-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-pile-stackexchange-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-pile-stackexchange-refined-by-data-juicer) | Redpajama <br> The Pile |
| EuroParl | 69,814 | 61,601 | 88.23% | [pile-europarl-refine.yaml](../configs/data_juicer_recipes/pile-europarl-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-europarl-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-europarl-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-europarl-refined-by-data-juicer) | The Pile |
| FreeLaw | 3,562,015 | 2,942,612 | 82.61% | [pile-freelaw-refine.yaml](../configs/data_juicer_recipes/pile-freelaw-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-freelaw-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-freelaw-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-freelaw-refined-by-data-juicer) | The Pile |
| HackerNews | 373,027 | 371,331 | 99.55% | [pile-hackernews-refine.yaml](../configs/data_juicer_recipes/pile-hackernews-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hackernews-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-hackernews-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-hackernews-refined-by-data-juicer) | The Pile |
| NIH ExPorter | 939,661 | 858,492 | 91.36% | [pile-nih-refine.yaml](../configs/data_juicer_recipes/pile-nih-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hin-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-nih-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-nih-refined-by-data-juicer) | The Pile |
| PhilPapers | 32,782 | 29,117 | 88.82% | [pile-philpaper-refine.yaml](../configs/data_juicer_recipes/pile-philpaper-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-philpaper-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-philpaper-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-philpaper-refined-by-data-juicer) | The Pile |
| PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | [pile-pubmed-abstract-refine.yaml](../configs/data_juicer_recipes/pile-pubmed-abstract-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-abstract-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-abstracts-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-abstracts-refined-by-data-juicer) | The Pile |
| PubMed Central | 3,098,930 | 2,694,860 | 86.96% | [pile-pubmed-central-refine.yaml](../configs/data_juicer_recipes/pile-pubmed-central-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-central-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-central-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-central-refined-by-data-juicer) | The Pile |
| USPTO | 5,883,024 | 4,516,283 | 76.77% | [pile-uspto-refine.yaml](../configs/data_juicer_recipes/pile-uspto-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-uspto-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-uspto-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-uspto-refined-by-data-juicer) | The Pile |

## 4. Improved Open Source Post-tuning Text Dataset

Take the Alpaca-CoT dataset as an example:

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|:---:|:---:|:---:|---|---|---|
| Alpaca-CoT EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](../configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer) | [39 subsets from Alpaca-CoT](../configs/data_juicer_recipes/alpaca_cot/README.md) |
| Alpaca-CoT ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](../configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer) | [28 subsets from Alpaca-CoT](../configs/data_juicer_recipes/alpaca_cot/README.md) |

## 5. Synthetic Contrastive Learning Image-text datasets

Data-Juicer provides rich built-in operators that support image-text multimodal data synthesis, such as the Img-Diff dataset. This synthetic data brings a 12-point performance improvement on the MMVP benchmark. For more details, see the Img-Diff [paper](https://arxiv.org/abs/2408.04594); the corresponding recipe implementation is available in the [ImgDiff-Dev](https://github.com/modelscope/data-juicer/tree/ImgDiff) branch.

## 6. Improved Open Source Image-text datasets

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|:---:|:---:|:---:|---|---|---|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](../configs/data_juicer_recipes/llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | [data-juicer-sandbox-optimal.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | [data-juicer-sandbox-self-evolution.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

### 6.1. Evaluation and Verification

- LLaVA pretrain (LCS-558k): The model pre-trained with **the improved pre-training dataset** and fine-tuned with the original instruction dataset outperforms the baseline model LLaVA-1.5-13B on 10 of the 12 evaluation sets.
| Models | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-13B (Baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B (Rectified Pretraining Dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |

- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): Trained on the **refined datasets**, both models outperform the baseline [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). Here, T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). For details, please refer to the [Sandbox Laboratory](./Sandbox.md).

| model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
| Data-Juicer (DJ, 228k) | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |

| model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |

## 7. Basic Example Recipes for Video Data

We provide a sample video dataset processing recipe to help users make better use of video-related operators: [general-video-refine-example.yaml](../configs/data_juicer_recipes/general-video-refine-example.yaml). It applies three types of operators:

- Text-only: improve dataset quality based on the video descriptions
- Video-only: improve dataset quality based on video properties
- Text-video: improve dataset quality based on the alignment between text and video

Users can start their own video dataset processing workflow from this recipe; a minimal sketch of such a configuration is shown below.
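The sketch uses real Data-Juicer video operators (one of each type above), but the parameter values are illustrative assumptions rather than the tuned settings of the linked recipe:

```yaml
# Illustrative video-refine recipe; values are placeholders, not tuned settings.
process:
  - language_id_score_filter:               # text-only: keep confident English captions
      lang: en
      min_score: 0.8
  - video_duration_filter:                  # video-only: drop overly short/long clips
      min_duration: 2                       # seconds
      max_duration: 60
  - video_frames_text_similarity_filter:    # text-video: keep well-aligned pairs
      hf_clip: openai/clip-vit-base-patch32
      min_score: 0.25
```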
## 8. Synthesize Human-centric Video Benchmarks

Data-Juicer also supports video benchmark synthesis, e.g., [HumanVBench](https://arxiv.org/abs/2412.17574), which converts in-the-wild videos into human-centric video benchmarks. The corresponding data recipes and construction process can be found in the [HumanVBench-dev](https://github.com/modelscope/data-juicer/tree/HumanVBench) branch.

## 9. Improve Existing Open Source Video Datasets

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|:---:|:---:|:---:|---|---|---|
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | [data-juicer-sandbox-optimal.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | [data-juicer-sandbox-self-evolution.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

### 9.1. Evaluation and Verification

- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): Trained on the **refined datasets**, both models fully surpass the baseline [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). Here, T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). For details, please refer to the [Sandbox Laboratory](./Sandbox.md).

| model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
| Data-Juicer (DJ, 228k) | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |

| model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |
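The refined datasets above are released as ready-to-download files (see the Aliyun/ModelScope/HuggingFace links in each table), and they can be fed directly into a new Data-Juicer recipe for further processing. A minimal sketch, assuming a locally downloaded jsonl file; the paths and the follow-up operator below are illustrative placeholders:

```yaml
# Minimal sketch: feed a downloaded refined subset into a new recipe.
dataset_path: ./the-pile-europarl-refine-result.jsonl  # local copy of a refined subset
export_path: ./my-further-refined.jsonl
np: 4                                                  # number of worker processes

process:
  - document_deduplicator:     # e.g., one extra exact-dedup pass on top
      lowercase: true
      ignore_non_character: true
```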