# Data Recipe Gallery

- The recipe [folder](../configs) contains a rich set of example Data-Juicer data recipes that help users understand, reuse, and extend configurations for a variety of scenarios.
- 📣📣📣 Community contributors are welcome to submit PRs that add custom data recipes, which helps spread and reuse them and advances the related techniques. We warmly welcome co-construction and will highlight contributors in the [acknowledgements](https://github.com/modelscope/data-juicer?tab=readme-ov-file#acknowledgement)!

Table of Contents
- [1. Minimal Data-Juicer Example Recipes](#1-minimal-data-juicer-example-recipes)
- [2. Reproducing Open-Source Text Datasets](#2-reproducing-open-source-text-datasets)
- [3. Refined Open-Source Pre-training Text Datasets](#3-refined-open-source-pre-training-text-datasets)
- [4. Refined Open-Source Post-tuning Text Datasets](#4-refined-open-source-post-tuning-text-datasets)
- [5. Synthesized Image-Text Datasets for Contrastive Learning](#5-synthesized-image-text-datasets-for-contrastive-learning)
- [6. Refined Open-Source Image-Text Datasets](#6-refined-open-source-image-text-datasets)
  - [6.1. Evaluation and Verification](#61-evaluation-and-verification)
- [7. Basic Example Recipes for Video Data](#7-basic-example-recipes-for-video-data)
- [8. Synthesized Human-Centric Video Benchmarks](#8-synthesized-human-centric-video-benchmarks)
- [9. Refined Open-Source Video Datasets](#9-refined-open-source-video-datasets)
  - [9.1. Evaluation and Verification](#91-evaluation-and-verification)

## 1. Minimal Data-Juicer Example Recipes

The [Demo](../configs/demo/) folder contains basic configuration files that help users quickly get familiar with the basic functions of Data-Juicer. Please refer to that folder for detailed descriptions.

## 2. Reproducing Open-Source Text Datasets

- We reproduced part of the processing pipeline of the Redpajama dataset. Please refer to the [reproduced_redpajama](../configs/reproduced_redpajama) folder for detailed descriptions.
- We reproduced part of the processing pipeline of the BLOOM dataset. Please refer to the [reproduced_bloom](../configs/reproduced_bloom) folder for detailed descriptions.

## 3. Refined Open-Source Pre-training Text Datasets

We found that existing, already-processed datasets (such as Redpajama and The Pile) still contain some "dirty" samples. We therefore use Data-Juicer to refine these datasets and feed them to LLMs in pursuit of better performance.

We use a simple 3-σ rule to set the hyperparameters of the operators in each data processing recipe.

| Subset | #Samples before | #Samples after | Keep ratio | Config link | Data link | Source |
|----------------------|:---------------------------:|:--------------:|:---------:|---|---|--------------------------|
| arXiv | 1,724,497 | 1,655,259 | 95.99% | [redpajama-arxiv-refine.yaml](../configs/data_juicer_recipes/redpajama-arxiv-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-arxiv-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-arxiv-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer) | Redpajama |
| Books | 205,182 | 195,983 | 95.51% | [redpajama-book-refine.yaml](../configs/data_juicer_recipes/redpajama-book-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-book-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-book-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-book-refined-by-data-juicer) | Redpajama |
| Wikipedia | 29,834,171 | 26,990,659 | 90.47% | [redpajama-wiki-refine.yaml](../configs/data_juicer_recipes/redpajama-wiki-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-wiki-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-wiki-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-wiki-refined-by-data-juicer) | Redpajama |
| C4 | 364,868,892 | 344,491,171 | 94.42% | [redpajama-c4-refine.yaml](../configs/data_juicer_recipes/redpajama-c4-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-c4-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-c4-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer) | Redpajama |
| Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | [redpajama-cc-2019-30-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2019-30-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2019-30-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2019-30-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer) | Redpajama |
| Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | [redpajama-cc-2020-05-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2020-05-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2020-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2020-05-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2020-05-refined-by-data-juicer) | Redpajama |
| Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | [redpajama-cc-2021-04-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2021-04-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2021-04-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2021-04-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer) | Redpajama |
| Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | [redpajama-cc-2022-05-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2022-05-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2022-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2022-05-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer) | Redpajama |
| Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | [redpajama-cc-2023-06-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2023-06-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2023-06-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2023-06-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2023-06-refined-by-data-juicer) | Redpajama |
| Github Code | 73,208,524 <br> + 21,387,703 | 49,279,344 | 52.09% | [redpajama-code-refine.yaml](../configs/data_juicer_recipes/github_code/redpajama-code-refine.yaml) <br> [stack-code-refine.yaml](../configs/data_juicer_recipes/github_code/stack-code-refine.yaml) <br> [redpajama-stack-code-deduplicate.yaml](../configs/data_juicer_recipes/github_code/redpajama-stack-code-deduplicate.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-stack-code-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-stack-code-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-stack-code-refined-by-data-juicer) | Redpajama <br> The Stack |
| StackExchange | 45,447,328 | 26,309,203 | 57.89% | [redpajama-pile-stackexchange-refine.yaml](../configs/data_juicer_recipes/redpajama-pile-stackexchange-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-pile-stackexchange-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-pile-stackexchange-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-pile-stackexchange-refined-by-data-juicer) | Redpajama <br> The Pile |
| EuroParl | 69,814 | 61,601 | 88.23% | [pile-europarl-refine.yaml](../configs/data_juicer_recipes/pile-europarl-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-europarl-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-europarl-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-europarl-refined-by-data-juicer) | The Pile |
| FreeLaw | 3,562,015 | 2,942,612 | 82.61% | [pile-freelaw-refine.yaml](../configs/data_juicer_recipes/pile-freelaw-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-freelaw-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-freelaw-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-freelaw-refined-by-data-juicer) | The Pile |
| HackerNews | 373,027 | 371,331 | 99.55% | [pile-hackernews-refine.yaml](../configs/data_juicer_recipes/pile-hackernews-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hackernews-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-hackernews-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-hackernews-refined-by-data-juicer) | The Pile |
| NIH ExPorter | 939,661 | 858,492 | 91.36% | [pile-nih-refine.yaml](../configs/data_juicer_recipes/pile-nih-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hin-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-nih-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-nih-refined-by-data-juicer) | The Pile |
| PhilPapers | 32,782 | 29,117 | 88.82% | [pile-philpaper-refine.yaml](../configs/data_juicer_recipes/pile-philpaper-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-philpaper-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-philpaper-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-philpaper-refined-by-data-juicer) | The Pile |
| PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | [pile-pubmed-abstract-refine.yaml](../configs/data_juicer_recipes/pile-pubmed-abstract-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-abstract-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-abstracts-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-abstracts-refined-by-data-juicer) | The Pile |
| PubMed Central | 3,098,930 | 2,694,860 | 86.96% | [pile-pubmed-central-refine.yaml](../configs/data_juicer_recipes/pile-pubmed-central-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-central-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-central-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-central-refined-by-data-juicer) | The Pile |
| USPTO | 5,883,024 | 4,516,283 | 76.77% | [pile-uspto-refine.yaml](../configs/data_juicer_recipes/pile-uspto-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-uspto-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-uspto-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-uspto-refined-by-data-juicer) | The Pile |

## 4. Refined Open-Source Post-tuning Text Datasets

Taking the Alpaca-CoT dataset as an example:

| Subset | #Samples before | #Samples after | Keep ratio | Config link | Data link | Source |
|-------------------|:------------------------:|:----------------------------------:|:---------:|---|---|-----------------------------------------------|
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](../configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer) | [39 subsets of Alpaca-CoT](../configs/data_juicer_recipes/alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](../configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer) | [28 subsets of Alpaca-CoT](../configs/data_juicer_recipes/alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |

## 5. Synthesized Image-Text Datasets for Contrastive Learning

Data-Juicer has rich built-in operators that support synthesizing image multimodal data, such as the Img-Diff dataset. This synthetic data yields a 12-point model improvement on the MMVP benchmark. See the Img-Diff [paper](https://arxiv.org/abs/2408.04594) for more details; the corresponding recipe implementation is available at [ImgDiff-Dev](https://github.com/modelscope/data-juicer/tree/ImgDiff).

## 6. Refined Open-Source Image-Text Datasets

| Subset | #Samples before | #Samples after | Keep ratio | Config link | Data link | Source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|---|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](../configs/data_juicer_recipes/llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | [data-juicer-sandbox-optimal.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | [data-juicer-sandbox-self-evolution.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

### 6.1. Evaluation and Verification

- LLaVA pretrain (LCS-558k): the model pre-trained on the **refined pre-training dataset** and fine-tuned on the original instruction dataset outperforms the baseline LLaVA-1.5-13B on 10 of 12 benchmarks.

| Model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|---------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pre-training dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |

- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models trained on the **refined datasets** surpass the baseline [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) in total score and on most individual dimensions. Here T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). See the [Sandbox Laboratory](./Sandbox-ZH.md) for details.

| Model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
| Data-Juicer (DJ, 228k) | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |

| Model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |

## 7. Basic Example Recipes for Video Data

We provide an example recipe for processing video datasets to help users make better use of the video-related operators: [general-video-refine-example.yaml](../configs/data_juicer_recipes/general-video-refine-example.yaml). It applies three types of operators:
- Text-only: improve dataset quality based on the video captions
- Video-only: improve dataset quality based on the properties of the videos
- Text-video: improve dataset quality based on the alignment between text and video

Users can start their own video dataset processing pipelines from this recipe.

## 8. Synthesized Human-Centric Video Benchmarks

Data-Juicer also supports synthesizing video benchmarks, such as [HumanVBench](https://arxiv.org/abs/2412.17574), which converts in-the-wild videos into a human-centric video benchmark. The corresponding data recipes and construction process are available at [HumanVBench-dev](https://github.com/modelscope/data-juicer/tree/HumanVBench).

## 9. Refined Open-Source Video Datasets

| Subset | #Samples before | #Samples after | Keep ratio | Config link | Data link | Source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|---|---------------|
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | [data-juicer-sandbox-optimal.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | [data-juicer-sandbox-self-evolution.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

### 9.1. Evaluation and Verification

- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models trained on the **refined datasets** surpass the baseline [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) in total score and on most individual dimensions. Here T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). See the [Sandbox Laboratory](./Sandbox-ZH.md) for details.

| Model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
| Data-Juicer (DJ, 228k) | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |

| Model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |
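
Section 3 states that operator hyperparameters are set with a simple 3-σ rule, and the tables report a keep ratio for each refined subset. The following is a minimal sketch of that idea in plain Python, not the Data-Juicer implementation. The `text_len` field, the helper names, and the choice of population standard deviation are all assumptions made for illustration. In practice, the per-sample statistics would come from an analysis pass over the dataset.

```python
from statistics import mean, pstdev


def three_sigma_bounds(values):
    """Return (lower, upper) = mean -/+ 3 * population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return mu - 3 * sigma, mu + 3 * sigma


def keep_within_bounds(samples, stat_key, lower, upper):
    """Keep only samples whose statistic lies inside [lower, upper]."""
    return [s for s in samples if lower <= s[stat_key] <= upper]


# Hypothetical per-sample statistic: 50 ordinary samples plus one extreme outlier.
samples = [{"id": i, "text_len": 100 + (i % 5)} for i in range(50)]
samples.append({"id": 50, "text_len": 10_000})

# Derive filter thresholds from the data, then apply them.
lower, upper = three_sigma_bounds([s["text_len"] for s in samples])
kept = keep_within_bounds(samples, "text_len", lower, upper)

# Keep ratio = samples after refinement / samples before refinement.
keep_ratio = len(kept) / len(samples)
print(f"bounds=({lower:.1f}, {upper:.1f}) kept={len(kept)}/{len(samples)} ({keep_ratio:.2%})")
```

The derived `lower` and `upper` values play the role of the minimum and maximum thresholds that a filter operator in a recipe would take. The keep ratio is computed the same way as the "Keep ratio" column in the tables above.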