# Data Recipe Gallery

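- The recipe [folder](../configs) contains a rich set of example Data-Juicer data recipes that help users understand, reuse, and extend configurations for a variety of scenarios.
- 📣📣📣 Community contributors are welcome to submit PRs that add custom data recipes, which helps spread, reuse, and advance the related techniques. We warmly welcome co-construction and will highlight contributors in the [acknowledgements](https://github.com/modelscope/data-juicer?tab=readme-ov-file#acknowledgement)!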

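Table of Contents
- [1. Minimal Example Recipes for Data-Juicer](#1-minimal-example-recipes-for-data-juicer)
- [2. Reproducing Open-Source Text Datasets](#2-reproducing-open-source-text-datasets)
- [3. Refining Open-Source Text Pretraining Datasets](#3-refining-open-source-text-pretraining-datasets)
- [4. Refining Open-Source Text Post-tuning Datasets](#4-refining-open-source-text-post-tuning-datasets)
- [5. Synthesizing Contrastive Learning Image-Text Datasets](#5-synthesizing-contrastive-learning-image-text-datasets)
- [6. Refining Open-Source Image-Text Datasets](#6-refining-open-source-image-text-datasets)
  - [6.1. Evaluation Results](#61-evaluation-results)
- [7. Basic Example Recipes for Video Data](#7-basic-example-recipes-for-video-data)
- [8. Synthesizing Human-Centric Video Benchmarks](#8-synthesizing-human-centric-video-benchmarks)
- [9. Refining Existing Open-Source Video Datasets](#9-refining-existing-open-source-video-datasets)
  - [9.1. Evaluation Results](#91-evaluation-results)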

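## 1. Minimal Example Recipes for Data-Juicer
The [Demo](../configs/demo/) folder contains several basic configuration files that help users quickly get familiar with Data-Juicer's core features; see that folder for details.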

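## 2. Reproducing Open-Source Text Datasets
- We reproduced part of the processing pipeline of the Redpajama dataset; see the [reproduced_redpajama](../configs/reproduced_redpajama) folder for details.
- We reproduced part of the processing pipeline of the BLOOM dataset; see the [reproduced_bloom](../configs/reproduced_bloom) folder for details.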

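## 3. Refining Open-Source Text Pretraining Datasets

We found that existing, already-processed datasets (such as Redpajama and The Pile) still contain some "dirty" samples. We therefore use Data-Juicer to refine these datasets and feed the refined data to LLMs in pursuit of better performance.

We use a simple 3-σ rule to set the hyperparameters of the operators in each data processing recipe.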
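
As a rough illustration of the 3-σ rule, the sketch below derives lower and upper bounds from the mean and standard deviation of a per-sample statistic. The function name, the choice of statistic, and the sample values are assumptions made for illustration; they are not taken from the recipes themselves.

```python
import statistics


def three_sigma_bounds(values):
    """Return (low, high) = mean -/+ 3 population standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std


# Hypothetical per-sample statistic (e.g. text length); 5000 is an outlier.
text_lengths = [100] * 10 + [110] * 10 + [5000]

low, high = three_sigma_bounds(text_lengths)
kept = [v for v in text_lengths if low <= v <= high]
print(f"keep values in [{low:.1f}, {high:.1f}]: {len(kept)} of {len(text_lengths)} samples")
```

Bounds computed this way could then serve as the minimum and maximum thresholds of a filter operator, which is one plausible reading of how the rule translates into recipe hyperparameters.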

| Data subset | #Samples before refinement | #Samples after refinement | Retention rate | Config link | Data link | Source |
|----------------------|:---------------------------:|:--------------:|:---------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------|
| arXiv                |          1,724,497          |   1,655,259    |   95.99%   | [redpajama-arxiv-refine.yaml](../configs/data_juicer_recipes/redpajama-arxiv-refine.yaml)                                                                                                                                                                         | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-arxiv-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-arxiv-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer)                                        | Redpajama               |
| Books                |           205,182           |    195,983     |   95.51%   | [redpajama-book-refine.yaml](../configs/data_juicer_recipes/redpajama-book-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-book-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-book-refined-by-data-juicer/summary)   <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-book-refined-by-data-juicer)                                        | Redpajama               |
| Wikipedia            |         29,834,171          |   26,990,659   |   90.47%   | [redpajama-wiki-refine.yaml](../configs/data_juicer_recipes/redpajama-wiki-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-wiki-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-wiki-refined-by-data-juicer/summary)   <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-wiki-refined-by-data-juicer)                                        | Redpajama               |
| C4                   |         364,868,892         |  344,491,171   |   94.42%   | [redpajama-c4-refine.yaml](../configs/data_juicer_recipes/redpajama-c4-refine.yaml)                                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-c4-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-c4-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer)                                             | Redpajama               |
| Common Crawl 2019-30 |         81,085,420          |   36,557,283   |   45.08%   | [redpajama-cc-2019-30-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2019-30-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2019-30-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2019-30-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer)  | Redpajama               |
| Common Crawl 2020-05 |         90,850,492          |   42,612,596   |   46.90%   | [redpajama-cc-2020-05-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2020-05-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2020-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2020-05-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2020-05-refined-by-data-juicer)  | Redpajama               |
| Common Crawl 2021-04 |         98,878,523          |   44,724,752   |   45.23%   | [redpajama-cc-2021-04-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2021-04-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2021-04-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2021-04-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer)  | Redpajama               |
| Common Crawl 2022-05 |         94,058,868          |   42,648,496   |   45.34%   | [redpajama-cc-2022-05-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2022-05-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2022-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2022-05-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer)  | Redpajama               |
| Common Crawl 2023-06 |         111,402,716         |   50,643,699   |   45.46%   | [redpajama-cc-2023-06-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2023-06-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2023-06-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2023-06-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2023-06-refined-by-data-juicer) | Redpajama               |
| GitHub Code          | 73,208,524 <br>+ 21,387,703 |   49,279,344   |   52.09%   | [redpajama-code-refine.yaml](../configs/data_juicer_recipes/github_code/redpajama-code-refine.yaml)<br>[stack-code-refine.yaml](../configs/data_juicer_recipes/github_code/stack-code-refine.yaml)<br>[redpajama-stack-code-deduplicate.yaml](../configs/data_juicer_recipes/github_code/redpajama-stack-code-deduplicate.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-stack-code-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-stack-code-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-stack-code-refined-by-data-juicer)                             | Redpajama<br>The Stack  |
| StackExchange        |         45,447,328          |   26,309,203   |   57.89%   | [redpajama-pile-stackexchange-refine.yaml](../configs/data_juicer_recipes/redpajama-pile-stackexchange-refine.yaml)                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-pile-stackexchange-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-pile-stackexchange-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-pile-stackexchange-refined-by-data-juicer)             | Redpajama<br>The Pile   |
| EuroParl             |           69,814            |     61,601     |   88.23%   | [pile-europarl-refine.yaml](../configs/data_juicer_recipes/pile-europarl-refine.yaml)                                                                                                                                                                             | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-europarl-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-europarl-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-europarl-refined-by-data-juicer)                                   | The Pile                |
| FreeLaw              |          3,562,015          |   2,942,612    |   82.61%   | [pile-freelaw-refine.yaml](../configs/data_juicer_recipes/pile-freelaw-refine.yaml)                                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-freelaw-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-freelaw-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-freelaw-refined-by-data-juicer)                                     | The Pile                |
| HackerNews           |           373,027           |    371,331     |   99.55%   | [pile-hackernews-refine.yaml](../configs/data_juicer_recipes/pile-hackernews-refine.yaml)                                                                                                                                                                         | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hackernews-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-hackernews-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-hackernews-refined-by-data-juicer)                               | The Pile                |
| NIH ExPorter         |           939,661           |    858,492     |   91.36%   | [pile-nih-refine.yaml](../configs/data_juicer_recipes/pile-nih-refine.yaml)                                                                                                                                                                                       | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hin-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-nih-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-nih-refined-by-data-juicer)                                             | The Pile                |
| PhilPapers           |           32,782            |     29,117     |   88.82%   | [pile-philpaper-refine.yaml](../configs/data_juicer_recipes/pile-philpaper-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-philpaper-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-philpaper-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-philpaper-refined-by-data-juicer)                                 | The Pile                |
| PubMed Abstracts     |         15,518,009          |   15,009,325   |   96.72%   | [pile-pubmed-abstract-refine.yaml](../configs/data_juicer_recipes/pile-pubmed-abstract-refine.yaml)                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-abstract-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-abstracts-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-abstracts-refined-by-data-juicer)                    | The Pile                |
| PubMed Central       |          3,098,930          |   2,694,860    |   86.96%   | [pile-pubmed-central-refine.yaml](../configs/data_juicer_recipes/pile-pubmed-central-refine.yaml)                                                                                                                                                                 | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-central-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-central-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-central-refined-by-data-juicer)                        | The Pile                |
| USPTO                |          5,883,024          |   4,516,283    |   76.77%   | [pile-uspto-refine.yaml](../configs/data_juicer_recipes/pile-uspto-refine.yaml)                                                                                                                                                                                   | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-uspto-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-uspto-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-uspto-refined-by-data-juicer) | The Pile                |
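
The retention rates in the table above are consistent with dividing the number of samples after refinement by the number before refinement. The sketch below reproduces two rows under that assumption; for the code row, which merges Redpajama and The Stack, the two source counts are summed first. The helper name `retention_rate` is hypothetical.

```python
def retention_rate(kept, *originals):
    """Fraction of samples kept, relative to the sum of all source counts."""
    total = sum(originals)
    if total <= 0:
        raise ValueError("total number of original samples must be positive")
    return kept / total


# Counts copied from the Wikipedia row and the code row of the table.
wikipedia = retention_rate(26_990_659, 29_834_171)
code = retention_rate(49_279_344, 73_208_524, 21_387_703)

print(f"Wikipedia: {wikipedia:.2%}")
print(f"Code:      {code:.2%}")
```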


## 4. Refining Open-Source Text Post-tuning Datasets
Taking the Alpaca-CoT dataset as an example:

| Data subset | #Samples before refinement | #Samples after refinement | Retention rate | Config link | Data link | Source |
|-------------------|:------------------------:|:----------------------------------:|:---------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|
| Alpaca-Cot EN     |       136,219,879        | 72,855,345 |   53.48%   | [alpaca-cot-en-refine.yaml](../configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer)   | [39 subsets from Alpaca-CoT](../configs/data_juicer_recipes/alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
| Alpaca-Cot ZH     |        21,197,246        |             9,873,214              |  46.58%   | [alpaca-cot-zh-refine.yaml](../configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer)   | [28 subsets from Alpaca-CoT](../configs/data_juicer_recipes/alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |

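## 5. Synthesizing Contrastive Learning Image-Text Datasets
Data-Juicer ships with a rich set of operators that support multimodal image data synthesis, for example the Img-Diff dataset. This synthetic data improves model performance on the MMVP benchmark by 12 points. See the Img-Diff [paper](https://arxiv.org/abs/2408.04594) for details; the corresponding recipe implementation is available at [ImgDiff-Dev](https://github.com/modelscope/data-juicer/tree/ImgDiff).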


## 6. Refining Open-Source Image-Text Datasets

| Data subset | #Samples before refinement | #Samples after refinement | Retention rate | Config link | Data link | Source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) |          558,128          |   500,380    |   89.65%   | [llava-pretrain-refine.yaml](../configs/data_juicer_recipes/llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer)                                        | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
| Data-Juicer (T2V, 147k) |          1,217,346          |   147,176    |   12.09%   | [data-juicer-sandbox-optimal.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (DJ, 228k) |          3,408,553          |   227,867    |   6.69%   | [data-juicer-sandbox-self-evolution.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

### 6.1. Evaluation Results
- LLaVA pretrain (LCS-558k): the model pretrained on the **refined pretraining dataset** and fine-tuned on the original instruction dataset outperforms the baseline LLaVA-1.5-13B on 10 of 12 benchmarks.

| Model                           | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|---------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-13B <br> (baseline)   | **80.0**  | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretraining dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
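
The "10 of 12" count can be checked directly from the table. The short sketch below copies the scores from the two rows above and counts the benchmarks on which the refined model scores strictly higher, assuming that a higher score is better on every benchmark.

```python
# (baseline, refined) scores copied from the table above.
scores = {
    "VQAv2": (80.0, 79.94),
    "GQA": (63.3, 63.5),
    "VizWiz": (53.6, 54.09),
    "SQA": (71.6, 74.20),
    "TextVQA": (61.3, 60.82),
    "POPE": (85.9, 86.67),
    "MME": (1531.3, 1565.53),
    "MM-Bench": (67.7, 68.2),
    "MM-Bench-CN": (63.6, 63.9),
    "SEED": (61.6, 61.8),
    "LLaVA-Bench-Wild": (72.5, 75.9),
    "MM-Vet": (36.1, 37.4),
}

# Assumption: higher is better on every benchmark listed.
wins = [name for name, (base, refined) in scores.items() if refined > base]
losses = [name for name, (base, refined) in scores.items() if refined <= base]

print(f"refined model ahead on {len(wins)} of {len(scores)} benchmarks")
print("baseline ahead on:", ", ".join(losses))
```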

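- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models trained on the **refined datasets** outperform the baseline [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) in Total, Quality, and Semantic scores and on most individual dimensions, though not all (for example, T2V-Turbo still leads on imaging quality). Here T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). See the [Sandbox Laboratory](./Sandbox-ZH.md) for details.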

| model                         | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo               | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
| Data-Juicer (DJ, 228k)  | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |

| model                         | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo               | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k)  | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |

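## 7. Basic Example Recipes for Video Data
We provide a sample recipe for processing video datasets to help users make better use of video-related operators: [general-video-refine-example.yaml](../configs/data_juicer_recipes/general-video-refine-example.yaml). It applies three types of operators:
- Text-only: improve dataset quality based on the video captions
- Video-only: improve dataset quality based on video properties
- Text-video: improve dataset quality based on the alignment between text and video

Users can start their video dataset processing workflow from this recipe.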
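
To make the three operator types concrete, here is a toy sketch in plain Python. It does not use Data-Juicer's operators or configuration format; the `VideoSample` fields, the three check functions, and every threshold are hypothetical stand-ins chosen only to show how text-only, video-only, and text-video criteria can be combined.

```python
from dataclasses import dataclass


@dataclass
class VideoSample:
    caption: str
    duration_s: float
    alignment: float  # hypothetical precomputed text-video similarity in [0, 1]


def text_only_ok(sample, min_words=3):
    """Text-only check: the caption must have enough words."""
    return len(sample.caption.split()) >= min_words


def video_only_ok(sample, min_s=2.0, max_s=60.0):
    """Video-only check: the clip duration must fall in a sane range."""
    return min_s <= sample.duration_s <= max_s


def text_video_ok(sample, min_alignment=0.2):
    """Text-video check: caption and video must be sufficiently aligned."""
    return sample.alignment >= min_alignment


PIPELINE = [text_only_ok, video_only_ok, text_video_ok]


def refine(samples):
    """Keep only the samples that pass every check in the pipeline."""
    return [s for s in samples if all(check(s) for check in PIPELINE)]


samples = [
    VideoSample("a dog runs across a grassy field", 8.0, 0.31),
    VideoSample("video", 8.0, 0.30),                      # caption too short
    VideoSample("a chef slices vegetables", 0.5, 0.35),   # clip too short
    VideoSample("a cat sleeping on a sofa", 12.0, 0.05),  # weak alignment
]
kept = refine(samples)
print(f"kept {len(kept)} of {len(samples)} samples")
```

In the actual recipe each criterion is expressed as an operator entry in the YAML file, so the example recipe linked above remains the reference for real operator names and parameters.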

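## 8. Synthesizing Human-Centric Video Benchmarks
Data-Juicer can also support the synthesis of video benchmarks, such as [HumanVBench](https://arxiv.org/abs/2412.17574), which converts in-the-wild videos into a human-centric video benchmark. The corresponding data recipes and construction pipeline are available at [HumanVBench-dev](https://github.com/modelscope/data-juicer/tree/HumanVBench).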

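## 9. Refining Existing Open-Source Video Datasets

| Data subset | #Samples before refinement | #Samples after refinement | Retention rate | Config link | Data link | Source |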
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| Data-Juicer (T2V, 147k) |          1,217,346          |   147,176    |   12.09%   | [data-juicer-sandbox-optimal.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (DJ, 228k) |          3,408,553          |   227,867    |   6.69%   | [data-juicer-sandbox-self-evolution.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

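### 9.1. Evaluation Results
- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models trained on the **refined datasets** outperform the baseline [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) in Total, Quality, and Semantic scores and on most individual dimensions, though not all (for example, T2V-Turbo still leads on imaging quality). Here T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). See the [Sandbox Laboratory](./Sandbox-ZH.md) for details.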

| model                         | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo               | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
| Data-Juicer (DJ, 228k)  | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |

| model                         | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo               | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k)  | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |