Data Recipe Gallery

The recipe folder contains a rich set of sample configuration files for Data-Juicer data recipes, which help users understand, reuse, and extend the configurations across a range of functional scenarios.
📣📣📣 Community contributors are welcome to submit PRs that add customized data recipes, promoting their dissemination and reuse and the evolution of related techniques. We welcome joint development and will highlight acknowledgements!

Table of Contents

1. Data-Juicer Minimal Example Recipe
2. Reproduce Open Source Text Datasets
3. Improved Open Source Pre-training Text Datasets
4. Improved Open Source Post-tuning Text Dataset
5. Synthetic Contrastive Learning Image-text Datasets
6. Improved Open Source Image-text Datasets
- 6.1. Evaluation and Verification
7. Basic Example Recipes for Video Data
8. Synthesize Human-centric Video Benchmarks
9. Improve Existing Open Source Video Datasets
- 9.1. Evaluation and Verification

1. Data-Juicer Minimal Example Recipe

Some basic configuration files are placed in the Demo folder to help users quickly get familiar with the basic functions of Data-Juicer. Please refer to the folder for a detailed description.

2. Reproduce Open Source Text Datasets

We reproduced the processing flow for part of the Redpajama dataset. Please refer to the reproduced_redpajama folder for a detailed description.
We reproduced the processing flow for part of the BLOOM dataset. Please refer to the reproduced_bloom folder for a detailed description.

3. Improved Open Source Pre-training Text Datasets

We found that existing processed datasets (such as Redpajama and The Pile) still contain some “bad” data samples. We therefore use Data-Juicer to refine these datasets and feed the refined data to LLMs to get better performance.

We use a simple 3-σ rule to set the hyperparameters of the operators in each data processing recipe.
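As a rough illustration of the idea, the sketch below derives filter bounds as the mean plus or minus three standard deviations of a per-sample statistic. The statistic (text length), the sample values, and the helper name `three_sigma_bounds` are assumptions made for this example; the recipes themselves may compute their thresholds differently.

```python
import statistics


def three_sigma_bounds(values):
    """Return (lower, upper) = mean -/+ 3 * standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return mean - 3 * std, mean + 3 * std


# Hypothetical per-sample statistic (e.g. text length); illustrative values only.
text_lengths = [120, 135, 128, 142, 119, 131, 125, 138, 127, 133,
                122, 129, 136, 124, 130, 126, 134, 121, 137, 5000]

lower, upper = three_sigma_bounds(text_lengths)

# Samples whose statistic falls outside the bounds would be filtered out.
kept = [v for v in text_lengths if lower <= v <= upper]
print(f"bounds=({lower:.1f}, {upper:.1f}), kept {len(kept)} of {len(text_lengths)}")
```

In a recipe, bounds obtained this way would be written into the minimum and maximum parameters of the corresponding filter operator.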

Data subset	Number of samples before refinement	Number of samples after refinement	Sample retention rate	Config link	Data link	Source
arXiv	1,724,497	1,655,259	95.99%	redpajama-arxiv-refine.yaml	Aliyun ModelScope HuggingFace	Redpajama
Books	205,182	195,983	95.51%	redpajama-book-refine.yaml	Aliyun ModelScope HuggingFace	Redpajama
Wikipedia	29,834,171	26,990,659	90.47%	redpajama-wiki-refine.yaml	Aliyun ModelScope HuggingFace	Redpajama
C4	364,868,892	344,491,171	94.42%	redpajama-c4-refine.yaml	Aliyun ModelScope HuggingFace	Redpajama
Common Crawl 2019-30	81,085,420	36,557,283	45.08%	redpajama-cc-2019-30-refine.yaml	Aliyun ModelScope HuggingFace	Redpajama
Common Crawl 2020-05	90,850,492	42,612,596	46.90%	redpajama-cc-2020-05-refine.yaml	Aliyun ModelScope HuggingFace	Redpajama
Common Crawl 2021-04	98,878,523	44,724,752	45.23%	redpajama-cc-2021-04-refine.yaml	Aliyun ModelScope HuggingFace	Redpajama
Common Crawl 2022-05	94,058,868	42,648,496	45.34%	redpajama-cc-2022-05-refine.yaml	Aliyun ModelScope HuggingFace	Redpajama
Common Crawl 2023-06	111,402,716	50,643,699	45.46%	redpajama-cc-2023-06-refine.yaml	Aliyun ModelScope HuggingFace	Redpajama
Github Code	73,208,524 + 21,387,703	49,279,344	52.09%	redpajama-code-refine.yaml stack-code-refine.yaml redpajama-stack-code-deduplicate.yaml	Aliyun ModelScope HuggingFace	Redpajama The Stack
StackExchange	45,447,328	26,309,203	57.89%	redpajama-pile-stackexchange-refine.yaml	Aliyun ModelScope HuggingFace	Redpajama The Pile
EuroParl	69,814	61,601	88.23%	pile-europarl-refine.yaml	Aliyun ModelScope HuggingFace	The Pile
FreeLaw	3,562,015	2,942,612	82.61%	pile-freelaw-refine.yaml	Aliyun ModelScope HuggingFace	The Pile
HackerNews	373,027	371,331	99.55%	pile-hackernews-refine.yaml	Aliyun ModelScope HuggingFace	The Pile
NIH ExPorter	939,661	858,492	91.36%	pile-nih-refine.yaml	Aliyun ModelScope HuggingFace	The Pile
PhilPapers	32,782	29,117	88.82%	pile-philpaper-refine.yaml	Aliyun ModelScope HuggingFace	The Pile
PubMed Abstracts	15,518,009	15,009,325	96.72%	pile-pubmed-abstract-refine.yaml	Aliyun ModelScope HuggingFace	The Pile
PubMed Central	3,098,930	2,694,860	86.96%	pile-pubmed-central-refine.yaml	Aliyun ModelScope HuggingFace	The Pile
USPTO	5,883,024	4,516,283	76.77%	pile-uspto-refine.yaml	Aliyun ModelScope HuggingFace	The Pile
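The retention rates in the table can be checked with a short calculation. This sketch assumes the rate is the number of samples after refinement divided by the number before, expressed as a percentage; the helper name `retention_rate` is introduced here for illustration.

```python
def retention_rate(before, after):
    """Percentage of samples kept after refinement, rounded to two decimals."""
    return round(100 * after / before, 2)


# Counts taken from the table above.
arxiv = retention_rate(1_724_497, 1_655_259)

# The Github Code row merges two sources, so the "before" counts are summed.
github_code = retention_rate(73_208_524 + 21_387_703, 49_279_344)

print(arxiv, github_code)
```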

4. Improved Open Source Post-tuning Text Dataset

Take the Alpaca-CoT dataset as an example:

Data subset	Number of samples before improvement	Number of samples after improvement	Sample retention rate	Configuration link	Data link	Source
Alpaca-CoT EN	136,219,879	72,855,345	53.48%	alpaca-cot-en-refine.yaml	Aliyun ModelScope HuggingFace	39 subsets from Alpaca-CoT
Alpaca-CoT ZH	21,197,246	9,873,214	46.58%	alpaca-cot-zh-refine.yaml	Aliyun ModelScope HuggingFace	28 subsets from Alpaca-CoT

5. Synthetic Contrastive Learning Image-text Datasets

Data-Juicer has a rich set of built-in operators that support the synthesis of multimodal image data, such as the Img-Diff dataset. This synthetic data brings a 12-point performance improvement on the MMVP benchmark. For more details, see the Img-Diff paper; the corresponding recipe implementation is available in ImgDiff-Dev.

6. Improved Open Source Image-text Datasets

Data subset	Number of samples before improvement	Number of samples after improvement	Sample retention rate	Configuration link	Data link	Source
LLaVA pretrain (LCS-558k)	558,128	500,380	89.65%	llava-pretrain-refine.yaml	Aliyun ModelScope HuggingFace	LLaVA-1.5
Data-Juicer (T2V, 147k)	1,217,346	147,176	12.09%	data-juicer-sandbox-optimal.yaml	Aliyun ModelScope HuggingFace	InternVid (606k) Panda-70M (605k) MSR-VTT (6k)
Data-Juicer (DJ, 228k)	3,408,553	227,867	6.69%	data-juicer-sandbox-self-evolution.yaml	Aliyun ModelScope	InternVid (606k) Panda-70M (2,599k) Pexels (198k) MSR-VTT (6k)

6.1. Evaluation and Verification

LLaVA pretrain (LCS-558k): The model pre-trained with the improved pre-training dataset and fine-tuned with the original instruction dataset outperformed the baseline model LLaVA-1.5-13B on 10 of the 12 evaluation sets.

Models	VQAv2	GQA	VizWiz	SQA	TextVQA	POPE	MME	MM-Bench	MM-Bench-CN	SEED	LLaVA-Bench-Wild	MM-Vet
LLaVA-1.5-13B (Baseline)	80.0	63.3	53.6	71.6	61.3	85.9	1531.3	67.7	63.6	61.6	72.5	36.1
LLaVA-1.5-13B (Rectified Pretraining Dataset)	79.94	63.5	54.09	74.20	60.82	86.67	1565.53	68.2	63.9	61.8	75.9	37.4
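The "10 of the 12" figure can be reproduced directly from the two rows above. The sketch below is a plain comparison of the reported scores, treating a strictly higher score as an improvement; the variable names are chosen for this example.

```python
benchmarks = ["VQAv2", "GQA", "VizWiz", "SQA", "TextVQA", "POPE", "MME",
              "MM-Bench", "MM-Bench-CN", "SEED", "LLaVA-Bench-Wild", "MM-Vet"]

# Scores copied from the table above.
baseline = [80.0, 63.3, 53.6, 71.6, 61.3, 85.9, 1531.3, 67.7, 63.6, 61.6, 72.5, 36.1]
refined = [79.94, 63.5, 54.09, 74.20, 60.82, 86.67, 1565.53, 68.2, 63.9, 61.8, 75.9, 37.4]

improved = [name for name, b, r in zip(benchmarks, baseline, refined) if r > b]
regressed = [name for name, b, r in zip(benchmarks, baseline, refined) if r < b]

print(f"improved on {len(improved)} of {len(benchmarks)}; regressed on {regressed}")
```

The two evaluation sets where the refined model scores slightly lower are VQAv2 and TextVQA.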

Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): Trained on the refined datasets, both models outperform the baseline model T2V-Turbo on the VBench Total, Quality, and Semantic scores. Here T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). For details, please refer to Sandbox Laboratory.

model	Total Score	Quality Score	Semantic Score	subject consistency	background consistency	temporal flickering	motion smoothness	dynamic degree	aesthetic quality
T2V-Turbo	81.01	82.57	74.76	96.28	97.02	97.48	97.34	49.17	63.04
Data-Juicer (T2V, 147k)	82.10	83.14	77.93	97.32	99.03	96.60	96.51	51.67	68.92
Data-Juicer (DJ, 228k)	82.53	83.38	79.13	97.92	99.27	98.14	97.77	38.89	67.39

model	imaging quality	object class	multiple objects	human action	color	spatial relationship	scene	appearance style	temporal style	overall consistency
T2V-Turbo	72.49	93.96	54.65	95.20	89.90	38.67	55.58	24.42	25.51	28.16
Data-Juicer (T2V, 147k)	70.42	95.85	61.63	95.60	94.06	46.95	57.57	24.42	26.34	28.90
Data-Juicer (DJ, 228k)	70.41	96.44	64.51	95.40	95.51	47.17	57.30	25.55	26.82	29.25

7. Basic Example Recipes for Video Data

We provide a sample video dataset processing recipe to help users make better use of video-related operators: general-video-refine-example.yaml. It applies three types of operators:

Text-only: Improve the dataset quality based on video description
Video-only: Improve the dataset quality based on video properties
Text-Video: Improve the dataset quality based on the alignment between text and video

Users can start their video dataset processing workflow based on this recipe.
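To make the three operator types concrete, the following sketch chains one filter of each kind over a few toy samples. It is not Data-Juicer's API: the sample fields (`text`, `duration_s`, `text_video_sim`), the filter functions, and the thresholds are all hypothetical and only illustrate how the three kinds of checks differ.

```python
# Hypothetical samples: field names and values are illustrative only,
# not Data-Juicer's schema.
samples = [
    {"text": "A dog runs across a grassy field.", "duration_s": 8.0, "text_video_sim": 0.31},
    {"text": "vid", "duration_s": 9.0, "text_video_sim": 0.29},
    {"text": "A chef slices vegetables in a kitchen.", "duration_s": 0.5, "text_video_sim": 0.33},
    {"text": "A plane lands on a runway at night.", "duration_s": 12.0, "text_video_sim": 0.05},
]


def text_only(sample):
    """Text-only check: keep descriptions with at least three words."""
    return len(sample["text"].split()) >= 3


def video_only(sample):
    """Video-only check: keep clips whose duration is in a plausible range."""
    return 2.0 <= sample["duration_s"] <= 60.0


def text_video(sample):
    """Text-video check: keep pairs whose alignment score is high enough."""
    return sample["text_video_sim"] >= 0.2


pipeline = [text_only, video_only, text_video]

# A sample is kept only if it passes every filter in the pipeline.
kept = [s for s in samples if all(f(s) for f in pipeline)]
print([s["text"] for s in kept])
```

Each of the three discarded samples fails a different kind of check, which is why a recipe typically combines all three operator types.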

8. Synthesize Human-centric Video Benchmarks

Data-Juicer also supports video benchmark synthesis, such as HumanVBench, which converts in-the-wild videos into human-centric video benchmarks. The corresponding data recipes and construction process can be found in HumanVBench-dev.

9. Improve Existing Open Source Video Datasets

Data subset	Number of samples before improvement	Number of samples after improvement	Sample retention rate	Configuration link	Data link	Source
Data-Juicer (T2V, 147k)	1,217,346	147,176	12.09%	data-juicer-sandbox-optimal.yaml	Aliyun ModelScope HuggingFace	InternVid (606k) Panda-70M (605k) MSR-VTT (6k)
Data-Juicer (DJ, 228k)	3,408,553	227,867	6.69%	data-juicer-sandbox-self-evolution.yaml	Aliyun ModelScope	InternVid (606k) Panda-70M (2,599k) Pexels (198k) MSR-VTT (6k)

9.1. Evaluation and Verification

Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): Using the refined datasets, both models surpass the baseline model T2V-Turbo on the VBench Total, Quality, and Semantic scores, and on most of the individual dimensions. Here, T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). For details, please refer to Sandbox Lab.

model	Total Score	Quality Score	Semantic Score	subject consistency	background consistency	temporal flickering	motion smoothness	dynamic degree	aesthetic quality
T2V-Turbo	81.01	82.57	74.76	96.28	97.02	97.48	97.34	49.17	63.04
Data-Juicer (T2V, 147k)	82.10	83.14	77.93	97.32	99.03	96.60	96.51	51.67	68.92
Data-Juicer (DJ, 228k)	82.53	83.38	79.13	97.92	99.27	98.14	97.77	38.89	67.39

model	imaging quality	object class	multiple objects	human action	color	spatial relationship	scene	appearance style	temporal style	overall consistency
T2V-Turbo	72.49	93.96	54.65	95.20	89.90	38.67	55.58	24.42	25.51	28.16
Data-Juicer (T2V, 147k)	70.42	95.85	61.63	95.60	94.06	46.95	57.57	24.42	26.34	28.90
Data-Juicer (DJ, 228k)	70.41	96.44	64.51	95.40	95.51	47.17	57.30	25.55	26.82	29.25
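For a per-dimension view of the two tables above, the sketch below counts on how many of the 16 VBench dimensions each model scores higher than, equal to, or lower than T2V-Turbo. The scores are copied from the tables; the helper name `compare` is introduced for this example.

```python
dimensions = [
    "subject consistency", "background consistency", "temporal flickering",
    "motion smoothness", "dynamic degree", "aesthetic quality",
    "imaging quality", "object class", "multiple objects", "human action",
    "color", "spatial relationship", "scene", "appearance style",
    "temporal style", "overall consistency",
]

# Scores copied from the two tables above, in the order of `dimensions`.
t2v_turbo = [96.28, 97.02, 97.48, 97.34, 49.17, 63.04,
             72.49, 93.96, 54.65, 95.20, 89.90, 38.67, 55.58, 24.42, 25.51, 28.16]
dj_t2v_147k = [97.32, 99.03, 96.60, 96.51, 51.67, 68.92,
               70.42, 95.85, 61.63, 95.60, 94.06, 46.95, 57.57, 24.42, 26.34, 28.90]
dj_228k = [97.92, 99.27, 98.14, 97.77, 38.89, 67.39,
           70.41, 96.44, 64.51, 95.40, 95.51, 47.17, 57.30, 25.55, 26.82, 29.25]


def compare(candidate, baseline):
    """Count dimensions where the candidate is higher, equal, or lower."""
    wins = sum(c > b for c, b in zip(candidate, baseline))
    ties = sum(c == b for c, b in zip(candidate, baseline))
    losses = sum(c < b for c, b in zip(candidate, baseline))
    return wins, ties, losses


t2v_result = compare(dj_t2v_147k, t2v_turbo)
dj_result = compare(dj_228k, t2v_turbo)
print("Data-Juicer (T2V, 147k):", t2v_result)
print("Data-Juicer (DJ, 228k):", dj_result)
```

Data-Juicer (T2V, 147k) is higher on 12 dimensions, tied on appearance style, and lower on temporal flickering, motion smoothness, and imaging quality. Data-Juicer (DJ, 228k) is higher on 14 dimensions and lower on dynamic degree and imaging quality.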