Data Recipe Gallery
The recipe folder contains a rich set of sample configuration files for Data-Juicer data recipes, which help users understand, reuse, and extend the configurations across a range of functional scenarios.
📣📣📣 Community contributors are welcome to submit PRs that add customized data recipes, which promotes their dissemination and reuse and helps related techniques evolve. We welcome joint development and will prominently acknowledge contributions!
1. Data-Juicer Minimal Example Recipe
Some basic configuration files are placed in the Demo folder to help users quickly become familiar with the basic functions of Data-Juicer. Please refer to that folder for a detailed description.
2. Reproduce Open Source Text Datasets
We reproduced the processing flow for part of the Redpajama dataset. Please refer to the reproduced_redpajama folder for a detailed description.
We reproduced the processing flow for part of the BLOOM dataset. Please refer to the reproduced_bloom folder for a detailed description.
3. Improved Open Source Pre-training Text Datasets
We found that existing processed datasets (such as Redpajama and The Pile) still contain some “bad” data samples. We therefore use Data-Juicer to refine these datasets and train LLMs on the refined data, with the aim of getting better performance.
We use a simple 3-σ rule to set the hyperparameters of the operators in each data processing recipe.
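As a rough illustration of the 3-σ rule, the sketch below derives filter thresholds as the mean ± 3 standard deviations of a per-sample statistic and keeps only the samples inside that range. The helper names and the `text_len` statistic are hypothetical, and taking the bounds symmetrically around the mean is an assumption made for this example; the actual recipes may derive their thresholds differently.

```python
import statistics


def three_sigma_bounds(values):
    """Return (lower, upper) = mean -/+ 3 * population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std


def keep_within(samples, stat_key, lower, upper):
    """Keep the samples whose statistic lies inside [lower, upper]."""
    return [s for s in samples if lower <= s[stat_key] <= upper]


# Hypothetical per-sample statistic: 19 ordinary documents and one extreme outlier.
samples = [{"text_len": 100} for _ in range(19)] + [{"text_len": 10_000}]
lower, upper = three_sigma_bounds([s["text_len"] for s in samples])
refined = keep_within(samples, "text_len", lower, upper)
print(f"bounds=({lower:.1f}, {upper:.1f}), kept {len(refined)} of {len(samples)}")
```

In a recipe, bounds obtained this way would typically be written into the min/max parameters of the corresponding filter operator rather than applied in ad hoc code.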
| Data subset | Number of samples before refinement | Number of samples after refinement | Sample retention rate | Config link | Data link | Source |
|---|---|---|---|---|---|---|
| arXiv | 1,724,497 | 1,655,259 | 95.99% | | | Redpajama |
| Books | 205,182 | 195,983 | 95.51% | | | Redpajama |
| Wikipedia | 29,834,171 | 26,990,659 | 90.47% | | | Redpajama |
| C4 | 364,868,892 | 344,491,171 | 94.42% | | | Redpajama |
| Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | | | Redpajama |
| Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | | | Redpajama |
| Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | | | Redpajama |
| Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | | | Redpajama |
| Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | | | Redpajama |
| Github Code | 73,208,524 | 49,279,344 | 52.09% | redpajama-code-refine.yaml | | Redpajama |
| StackExchange | 45,447,328 | 26,309,203 | 57.89% | | | Redpajama |
| EuroParl | 69,814 | 61,601 | 88.23% | | | The Pile |
| FreeLaw | 3,562,015 | 2,942,612 | 82.61% | | | The Pile |
| HackerNews | 373,027 | 371,331 | 99.55% | | | The Pile |
| NIH ExPorter | 939,661 | 858,492 | 91.36% | | | The Pile |
| PhilPapers | 32,782 | 29,117 | 88.82% | | | The Pile |
| PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | | | The Pile |
| PubMed Central | 3,098,930 | 2,694,860 | 86.96% | | | The Pile |
| USPTO | 5,883,024 | 4,516,283 | 76.77% | | | The Pile |
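For readers who want to compute the same statistic on their own refinement runs, the sample retention rate appears to be the number of samples after refinement divided by the number before, expressed as a percentage. The function name below is hypothetical, and the two checks use the Wikipedia and HackerNews rows of the table above.

```python
def retention_rate(before: int, after: int) -> float:
    """Percentage of samples that survive refinement."""
    if before <= 0:
        raise ValueError("`before` must be a positive sample count")
    return 100.0 * after / before


# Sample counts taken from the Wikipedia and HackerNews rows.
print(f"Wikipedia:  {retention_rate(29_834_171, 26_990_659):.2f}%")
print(f"HackerNews: {retention_rate(373_027, 371_331):.2f}%")
```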
4. Improved Open Source Post-tuning Text Dataset
Take the Alpaca-CoT dataset as an example:
| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|---|---|---|---|---|---|
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | | | |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | | | |
5. Synthetic Contrastive Learning Image-text Datasets
Data-Juicer provides a rich set of built-in operators that support multimodal image data synthesis, such as the Img-Diff dataset. This synthetic data brings a 12-point performance improvement on the MMVP benchmark. For more details, see the Img-Diff paper; for the corresponding recipe implementation, refer to ImgDiff-Dev.
6. Improved Open Source Image-text Datasets
| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|---|---|---|---|---|---|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | | | |
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | | | |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | | | InternVid (606k) |
6.1. Evaluation and Verification
LLaVA pretrain (LCS-558k): The model pre-trained with the improved pre-training dataset and fine-tuned with the original instruction dataset outperformed the baseline model LLaVA-1.5-13B on 10 of the 12 evaluation sets.
| Model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-13B (baseline) | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B (improved pre-training dataset) | 79.94 | 63.5 | 54.09 | 74.20 | 60.82 | 86.67 | 1565.53 | 68.2 | 63.9 | 61.8 | 75.9 | 37.4 |
Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): Trained on the refined datasets, both models surpass the baseline model T2V-Turbo on the VBench Total, Quality, and Semantic Scores. Here T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). For details, please refer to Sandbox Laboratory.
| Model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | 51.67 | 68.92 |
| Data-Juicer (DJ, 228k) | 82.53 | 83.38 | 79.13 | 97.92 | 99.27 | 98.14 | 97.77 | 38.89 | 67.39 |
| Model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 72.49 | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | 95.60 | 94.06 | 46.95 | 57.57 | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | 96.44 | 64.51 | 95.40 | 95.51 | 47.17 | 57.30 | 25.55 | 26.82 | 29.25 |
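To help read the VBench tables, the sketch below checks how the three headline scores relate. It assumes the Total Score is a 4:1 weighted average of the Quality Score and the Semantic Score; this weighting is an assumption here, not something stated in this document, although the reported numbers are consistent with it. The function and constant names are hypothetical.

```python
# Assumed weights: quality counts four times as much as semantic.
QUALITY_WEIGHT = 4.0
SEMANTIC_WEIGHT = 1.0


def total_score(quality: float, semantic: float) -> float:
    """Weighted average of the quality and semantic scores."""
    weighted = QUALITY_WEIGHT * quality + SEMANTIC_WEIGHT * semantic
    return weighted / (QUALITY_WEIGHT + SEMANTIC_WEIGHT)


# (Quality Score, Semantic Score, reported Total Score) from the table above.
reported = {
    "T2V-Turbo": (82.57, 74.76, 81.01),
    "Data-Juicer (T2V, 147k)": (83.14, 77.93, 82.10),
    "Data-Juicer (DJ, 228k)": (83.38, 79.13, 82.53),
}

for name, (quality, semantic, total) in reported.items():
    computed = total_score(quality, semantic)
    print(f"{name}: computed {computed:.2f}, reported {total:.2f}")
```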
7. Basic Example Recipes for Video Data
We provide a sample video dataset processing recipe to help users make better use of video-related operators: general-video-refine-example.yaml. It applies three types of operators:
- Text-only: Improve the dataset quality based on the video descriptions
- Video-only: Improve the dataset quality based on video properties
- Text-Video: Improve the dataset quality based on the alignment between text and video

Users can start their video dataset processing workflow based on this recipe.
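The three operator types can be pictured as successive filters over the same samples. The sketch below is plain Python, not the Data-Juicer API: the sample fields (`text`, `duration_sec`, `text_video_similarity`), the function names, and the thresholds are all hypothetical and only illustrate how each type looks at a different part of a sample.

```python
from typing import Callable, Dict, List

Sample = Dict[str, object]


def text_only_filter(sample: Sample, min_words: int = 3) -> bool:
    """Text-only: judge the sample by its video description alone."""
    return len(str(sample["text"]).split()) >= min_words


def video_only_filter(sample: Sample, min_sec: float = 2.0, max_sec: float = 60.0) -> bool:
    """Video-only: judge the sample by a property of the video itself."""
    return min_sec <= float(sample["duration_sec"]) <= max_sec


def text_video_filter(sample: Sample, min_similarity: float = 0.2) -> bool:
    """Text-Video: judge the sample by how well text and video align."""
    return float(sample["text_video_similarity"]) >= min_similarity


def run_filters(samples: List[Sample], filters: List[Callable[[Sample], bool]]) -> List[Sample]:
    """Apply the filters in order, keeping only samples that pass all of them."""
    kept = samples
    for keep in filters:
        kept = [s for s in kept if keep(s)]
    return kept


samples = [
    {"text": "a dog runs across a field", "duration_sec": 8.0, "text_video_similarity": 0.31},
    {"text": "video", "duration_sec": 8.0, "text_video_similarity": 0.31},
    {"text": "a cat sleeps on a sofa", "duration_sec": 0.5, "text_video_similarity": 0.35},
    {"text": "a man cooks pasta at home", "duration_sec": 12.0, "text_video_similarity": 0.05},
]
kept = run_filters(samples, [text_only_filter, video_only_filter, text_video_filter])
print(f"kept {len(kept)} of {len(samples)} samples")
```

In an actual recipe, each of these steps would be an operator entry in the YAML configuration rather than a Python function.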
8. Synthesize Human-centric Video Benchmarks
Data-Juicer also supports video benchmark synthesis, such as HumanVBench, which converts in-the-wild videos into human-centric video benchmarks. The corresponding data recipes and construction process can be found in HumanVBench-dev.
9. Improve Existing Open Source Video Datasets
| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|---|---|---|---|---|---|
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | | | |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | | | InternVid (606k) |
9.1. Evaluation and Verification
Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): Trained on the refined datasets, both models surpass the baseline model T2V-Turbo on the VBench Total, Quality, and Semantic Scores. Here, T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). For details, please refer to Sandbox Laboratory.
| Model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | 51.67 | 68.92 |
| Data-Juicer (DJ, 228k) | 82.53 | 83.38 | 79.13 | 97.92 | 99.27 | 98.14 | 97.77 | 38.89 | 67.39 |
| Model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 72.49 | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | 95.60 | 94.06 | 46.95 | 57.57 | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | 96.44 | 64.51 | 95.40 | 95.51 | 47.17 | 57.30 | 25.55 | 26.82 | 29.25 |