data_juicer.core.data.dataset_builder module

class data_juicer.core.data.dataset_builder.DatasetBuilder(cfg: Namespace, executor_type: str = 'default')

Bases: object

DatasetBuilder is a class that builds a dataset from a configuration.

__init__(cfg: Namespace, executor_type: str = 'default')

load_dataset(**kwargs) → DJDataset

classmethod load_dataset_by_generated_config(generated_dataset_config)

Load a dataset from a generated dataset config.

data_juicer.core.data.dataset_builder.rewrite_cli_datapath(dataset_path, max_sample_num=None) → List

Rewrite a dataset_path given on the command line into the dataset config format used by YAML configs. This adapts CLI input that refers to local files or Hugging Face paths.

Parameters:
  • dataset_path – a dataset file, a dataset directory, or a list of them, each optionally preceded by a weight, e.g. <w1> ds1.jsonl <w2> ds2_dir <w3> ds3_file.json

  • max_sample_num – the maximum number of samples to load

Returns:

list of dataset configs
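
The shape of such a config list can be pictured with a minimal sketch. This is not Data-Juicer's implementation: the function name sketch_dataset_configs, the dictionary keys (type, path, weight, max_sample_num), and the rule "a path that exists on disk is local, anything else is a Hugging Face identifier" are all assumptions made for illustration.

```python
import os
from typing import Any, Dict, List, Optional


def sketch_dataset_configs(
    paths: List[str],
    weights: List[float],
    max_sample_num: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Illustrative only: build one config dict per dataset path.

    The keys used here are hypothetical, not Data-Juicer's schema.
    """
    configs: List[Dict[str, Any]] = []
    for path, weight in zip(paths, weights):
        # Assumption: a path that exists on disk is local; anything else
        # is treated as a Hugging Face dataset identifier.
        source = "local" if os.path.exists(path) else "huggingface"
        config: Dict[str, Any] = {"type": source, "path": path, "weight": weight}
        if max_sample_num is not None:
            # Assumption: the sample cap is simply copied onto each entry.
            config["max_sample_num"] = max_sample_num
        configs.append(config)
    return configs
```

For the real config schema, consult the function's source rather than this sketch.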

data_juicer.core.data.dataset_builder.parse_cli_datapath(dataset_path) → Tuple[List[str], List[float]]

Split a CLI dataset path string into its individual dataset paths and their corresponding weights.

Parameters:

dataset_path – a dataset file, a dataset directory, or a list of them, each optionally preceded by a weight, e.g. <w1> ds1.jsonl <w2> ds2_dir <w3> ds3_file.json

Returns:

A list of dataset paths and a list of their weights.
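
To make the "<w1> ds1.jsonl <w2> ds2_dir" input format concrete, here is a minimal standalone sketch of this kind of splitting. It is not the library's code: the name split_paths_and_weights is hypothetical, and it assumes that tokens are whitespace-separated, that any token parseable as a float is a weight, and that a path without a preceding weight gets a default weight of 1.0.

```python
from typing import List, Optional, Tuple


def split_paths_and_weights(dataset_path: str) -> Tuple[List[str], List[float]]:
    """Illustrative only: split '<w1> path1 <w2> path2 ...' into paths and weights."""
    paths: List[str] = []
    weights: List[float] = []
    pending_weight: Optional[float] = None  # weight read but not yet attached to a path

    for token in dataset_path.split():
        try:
            value = float(token)
        except ValueError:
            # Not a number, so treat the token as a dataset path.
            paths.append(token)
            # Assumption: a path with no explicit weight defaults to 1.0.
            weights.append(1.0 if pending_weight is None else pending_weight)
            pending_weight = None
        else:
            if pending_weight is not None:
                raise ValueError(f"two consecutive weights before {token!r}")
            pending_weight = value

    if pending_weight is not None:
        raise ValueError("trailing weight without a dataset path")
    return paths, weights


paths, weights = split_paths_and_weights("0.5 ds1.jsonl 2 ds2_dir ds3_file.json")
print(paths)
print(weights)
```

Under these assumptions, a path whose name is itself a number, or one that contains spaces, would be misread; the sketch does not attempt to handle either case.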

data_juicer.core.data.dataset_builder.get_sample_numbers(weights, max_sample_num)