Data Recipe Gallery

  • The recipe folder contains a rich set of sample configuration files for Data-Juicer data recipes, which help users understand, reuse, and extend the configurations across various functional scenarios.

  • 📣📣📣 Community contributors can submit PRs to add customized data recipes, which promotes their dissemination and reuse and the evolution of related techniques. We welcome contributions and will prominently acknowledge contributors!

Table of Contents

  • 1. Data-Juicer Minimal Example Recipe

  • 2. Reproduce Open Source Text Datasets

  • 3. Improved Open Source Pre-training Text Datasets

  • 4. Improved Open Source Post-tuning Text Dataset

  • 5. Synthetic Contrastive Learning Image-text datasets

  • 6. Improved Open Source Image-text datasets

    • 6.1. Evaluation and Verification

  • 7. Basic Example Recipes for Video Data

  • 8. Synthesize Human-centric Video Benchmarks

  • 9. Improve Existing Open Source Video Datasets

    • 9.1. Evaluation and Verification

1. Data-Juicer Minimal Example Recipe

Some basic configuration files are placed in the Demo folder to help users quickly get familiar with the basic functions of Data-Juicer. Please refer to the folder for a detailed description.

2. Reproduce Open Source Text Datasets

  • We reproduced part of the processing flow of the RedPajama dataset. Please refer to the reproduced_redpajama folder for a detailed description.

  • We reproduced part of the processing flow of the BLOOM dataset. Please refer to the reproduced_bloom folder for a detailed description.

3. Improved Open Source Pre-training Text Datasets

We found that existing processed datasets (such as RedPajama and The Pile) still contain some “bad” data samples. We therefore use Data-Juicer to refine these datasets and feed them to LLMs to get better performance.

We use a simple 3-σ rule to set the hyperparameters of the operators in each data processing recipe.
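
The 3-σ rule treats values more than three standard deviations from the mean as outliers. The sketch below shows one way such bounds could be derived from a per-sample statistic and then used as an operator's lower and upper thresholds. It is only an illustration under assumptions: the function name `three_sigma_bounds` and the sample values are hypothetical, and the exact procedure used for the recipes here may differ.

```python
import statistics


def three_sigma_bounds(values):
    """Return (lower, upper) = mean -/+ 3 * population std of the values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std


# Hypothetical per-sample statistic (e.g. text length): 40 typical values
# plus one extreme outlier.
text_lengths = [100, 102, 98, 101, 99, 100, 103, 97] * 5 + [5000]

lower, upper = three_sigma_bounds(text_lengths)

# Keep only the samples whose statistic falls inside the 3-sigma interval.
kept = [v for v in text_lengths if lower <= v <= upper]
print(f"bounds=({lower:.1f}, {upper:.1f}), kept {len(kept)} of {len(text_lengths)}")
```

In this toy data the single extreme value falls outside the interval and is dropped, while all typical values are kept.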

| Data subset | Number of samples before refinement | Number of samples after refinement | Sample retention rate | Config link | Data link | Source |
|---|---|---|---|---|---|---|
| arXiv | 1,724,497 | 1,655,259 | 95.99% | redpajama-arxiv-refine.yaml | Aliyun, ModelScope, HuggingFace | RedPajama |
| Books | 205,182 | 195,983 | 95.51% | redpajama-book-refine.yaml | Aliyun, ModelScope, HuggingFace | RedPajama |
| Wikipedia | 29,834,171 | 26,990,659 | 90.47% | redpajama-wiki-refine.yaml | Aliyun, ModelScope, HuggingFace | RedPajama |
| C4 | 364,868,892 | 344,491,171 | 94.42% | redpajama-c4-refine.yaml | Aliyun, ModelScope, HuggingFace | RedPajama |
| Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | redpajama-cc-2019-30-refine.yaml | Aliyun, ModelScope, HuggingFace | RedPajama |
| Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | redpajama-cc-2020-05-refine.yaml | Aliyun, ModelScope, HuggingFace | RedPajama |
| Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | redpajama-cc-2021-04-refine.yaml | Aliyun, ModelScope, HuggingFace | RedPajama |
| Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | redpajama-cc-2022-05-refine.yaml | Aliyun, ModelScope, HuggingFace | RedPajama |
| Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | redpajama-cc-2023-06-refine.yaml | Aliyun, ModelScope, HuggingFace | RedPajama |
| GitHub Code | 73,208,524 + 21,387,703 | 49,279,344 | 52.09% | redpajama-code-refine.yaml, stack-code-refine.yaml, redpajama-stack-code-deduplicate.yaml | Aliyun, ModelScope, HuggingFace | RedPajama, The Stack |
| StackExchange | 45,447,328 | 26,309,203 | 57.89% | redpajama-pile-stackexchange-refine.yaml | Aliyun, ModelScope, HuggingFace | RedPajama, The Pile |
| EuroParl | 69,814 | 61,601 | 88.23% | pile-europarl-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
| FreeLaw | 3,562,015 | 2,942,612 | 82.61% | pile-freelaw-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
| HackerNews | 373,027 | 371,331 | 99.55% | pile-hackernews-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
| NIH ExPorter | 939,661 | 858,492 | 91.36% | pile-nih-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
| PhilPapers | 32,782 | 29,117 | 88.82% | pile-philpaper-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
| PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | pile-pubmed-abstract-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
| PubMed Central | 3,098,930 | 2,694,860 | 86.96% | pile-pubmed-central-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
| USPTO | 5,883,024 | 4,516,283 | 76.77% | pile-uspto-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
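
The sample retention rate in the table is the number of samples after refinement divided by the number before refinement. As a small worked example, the sketch below recomputes it for two rows using the counts listed above; the helper name `retention_rate` is hypothetical, and for the GitHub Code row it assumes that the two "before" counts (one per source dataset) are summed.

```python
def retention_rate(before, after):
    """Percentage of samples kept after refinement, rounded to two decimals."""
    return round(100 * after / before, 2)


# GitHub Code: the "before" count is the sum over the two source datasets
# (assumed interpretation of "73,208,524 + 21,387,703").
github_before = 73_208_524 + 21_387_703
github_rate = retention_rate(github_before, 49_279_344)

# Wikipedia: a single source dataset.
wiki_rate = retention_rate(29_834_171, 26_990_659)

print(f"GitHub Code: {github_rate}%")
print(f"Wikipedia:   {wiki_rate}%")
```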

4. Improved Open Source Post-tuning Text Dataset

Take the Alpaca-CoT dataset as an example:

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|---|---|---|---|---|---|
| Alpaca-CoT EN | 136,219,879 | 72,855,345 | 54.48% | alpaca-cot-en-refine.yaml | Aliyun, ModelScope, HuggingFace | 39 subsets from Alpaca-CoT |
| Alpaca-CoT ZH | 21,197,246 | 9,873,214 | 46.58% | alpaca-cot-zh-refine.yaml | Aliyun, ModelScope, HuggingFace | 28 subsets from Alpaca-CoT |

5. Synthetic Contrastive Learning Image-text datasets

Data-Juicer has a rich set of built-in operators that support multimodal image data synthesis, as used to build the Img-Diff dataset. This synthetic data brings a 12-point performance improvement on the MMVP benchmark. For more details, see the Img-Diff paper; for the corresponding recipe implementation, refer to ImgDiff-Dev.

6. Improved Open Source Image-text datasets

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|---|---|---|---|---|---|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | llava-pretrain-refine.yaml | Aliyun, ModelScope, HuggingFace | LLaVA-1.5 |
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | data-juicer-sandbox-optimal.yaml | Aliyun, ModelScope, HuggingFace | InternVid (606k), Panda-70M (605k), MSR-VTT (6k) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | data-juicer-sandbox-self-evolution.yaml | Aliyun, ModelScope | InternVid (606k), Panda-70M (2,599k), Pexels (198k), MSR-VTT (6k) |

6.1. Evaluation and Verification

  • LLaVA pretrain (LCS-558k): The model pre-trained with the improved pre-training dataset and fine-tuned with the original instruction dataset outperformed the baseline model LLaVA-1.5-13B on 10 of the 12 evaluation sets.

| Models | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-13B (Baseline) | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B (Rectified Pretraining Dataset) | 79.94 | 63.5 | 54.09 | 74.20 | 60.82 | 86.67 | 1565.53 | 68.2 | 63.9 | 61.8 | 75.9 | 37.4 |
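
The "10 of the 12 evaluation sets" figure can be checked directly against the scores above. The sketch below is a simple tally, assuming that a higher score is better on every benchmark listed; the variable names are illustrative only.

```python
benchmarks = [
    "VQAv2", "GQA", "VizWiz", "SQA", "TextVQA", "POPE",
    "MME", "MM-Bench", "MM-Bench-CN", "SEED", "LLaVA-Bench-Wild", "MM-Vet",
]
# Scores copied from the table above, in the same column order.
baseline = [80.0, 63.3, 53.6, 71.6, 61.3, 85.9, 1531.3, 67.7, 63.6, 61.6, 72.5, 36.1]
refined = [79.94, 63.5, 54.09, 74.20, 60.82, 86.67, 1565.53, 68.2, 63.9, 61.8, 75.9, 37.4]

# Assumption: higher is better for every benchmark.
wins = [name for name, b, r in zip(benchmarks, baseline, refined) if r > b]
losses = [name for name, b, r in zip(benchmarks, baseline, refined) if r < b]

print(f"refined model wins on {len(wins)} of {len(benchmarks)} benchmarks")
print("baseline still ahead on:", ", ".join(losses))
```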

  • Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): Trained on the refined datasets, both models outperform the baseline model T2V-Turbo on the VBench Total, Quality, and Semantic scores. Here T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). For details, please refer to Sandbox Laboratory.

| model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | 51.67 | 68.92 |
| Data-Juicer (DJ, 228k) | 82.53 | 83.38 | 79.13 | 97.92 | 99.27 | 98.14 | 97.77 | 38.89 | 67.39 |

| model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 72.49 | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | 95.60 | 94.06 | 46.95 | 57.57 | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | 96.44 | 64.51 | 95.40 | 95.51 | 47.17 | 57.30 | 25.55 | 26.82 | 29.25 |

7. Basic Example Recipes for Video Data

We provide a sample video dataset processing recipe, general-video-refine-example.yaml, to help users make better use of video-related operators. It applies three types of operators:

  • Text-only: Improve the dataset quality based on the video description

  • Video-only: Improve the dataset quality based on video properties

  • Text-Video: Improve the dataset quality based on the alignment between text and video

Users can start their video dataset processing workflow based on this recipe.
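
To make the three operator types concrete, the sketch below assembles a recipe-like structure in Python with one operator per type. It is an illustration under assumptions, not the content of general-video-refine-example.yaml: the operator names, parameters, threshold values, paths, and the helper `build_recipe` are all chosen for illustration and should be checked against the Operator Schemas before use.

```python
# One illustrative operator per type. Names, parameters, and thresholds are
# assumptions for this sketch, not values taken from the sample recipe.
OPS_BY_TYPE = {
    "text-only": [{"text_length_filter": {"min_len": 10, "max_len": 500}}],
    "video-only": [{"video_duration_filter": {"min_duration": 2, "max_duration": 60}}],
    "text-video": [{"video_frames_text_similarity_filter": {"min_score": 0.2}}],
}


def build_recipe(dataset_path, export_path, ops_by_type):
    """Flatten the grouped operators into one ordered processing list."""
    process = [op for ops in ops_by_type.values() for op in ops]
    return {
        "dataset_path": dataset_path,  # hypothetical input path
        "export_path": export_path,    # hypothetical output path
        "process": process,
    }


recipe = build_recipe("path/to/videos.jsonl", "path/to/output.jsonl", OPS_BY_TYPE)
op_names = [next(iter(op)) for op in recipe["process"]]
for name in op_names:
    print(name)
```

Grouping operators by type before flattening keeps the order explicit: text-only checks come first, then video-only checks, and the text-video alignment check runs last on the samples that remain.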

8. Synthesize Human-centric Video Benchmarks

Data-Juicer can also support video benchmark synthesis, such as HumanVBench, which converts in-the-wild videos into human-centric video benchmarks. The corresponding data recipes and construction process can be found in HumanVBench-dev.

9. Improve Existing Open Source Video Datasets

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---|---|---|---|---|---|---|
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | data-juicer-sandbox-optimal.yaml | Aliyun, ModelScope, HuggingFace | InternVid (606k), Panda-70M (605k), MSR-VTT (6k) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | data-juicer-sandbox-self-evolution.yaml | Aliyun, ModelScope | InternVid (606k), Panda-70M (2,599k), Pexels (198k), MSR-VTT (6k) |

9.1. Evaluation and Verification

  • Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): Trained on the refined datasets, both models surpass the baseline model T2V-Turbo on the VBench Total, Quality, and Semantic scores, although not on every individual dimension. Here, T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). For details, please refer to Sandbox Lab.

| model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | 51.67 | 68.92 |
| Data-Juicer (DJ, 228k) | 82.53 | 83.38 | 79.13 | 97.92 | 99.27 | 98.14 | 97.77 | 38.89 | 67.39 |

| model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| T2V-Turbo | 72.49 | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | 95.60 | 94.06 | 46.95 | 57.57 | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | 96.44 | 64.51 | 95.40 | 95.51 | 47.17 | 57.30 | 25.55 | 26.82 | 29.25 |

Copyright © 2024, Data-Juicer Team