data_juicer.core.data.dataset_builder module

class data_juicer.core.data.dataset_builder.DatasetBuilder(cfg: Namespace, executor_type: str = 'default')

Bases: object

DatasetBuilder is a class that builds a dataset from a configuration.

__init__(cfg: Namespace, executor_type: str = 'default')
load_dataset(**kwargs) -> DJDataset
classmethod load_dataset_by_generated_config(generated_dataset_config)

Load a dataset from a generated dataset config.

data_juicer.core.data.dataset_builder.rewrite_cli_datapath(dataset_path, max_sample_num=None) -> List

Rewrite the dataset_path given on the CLI into the dataset config format used by YAML configs. This retrofits CLI input for both local files and Hugging Face paths.

Parameters:
  • dataset_path -- a dataset file or a dataset dir or a list of them, e.g. <w1> ds1.jsonl <w2> ds2_dir <w3> ds3_file.json

  • max_sample_num -- the maximum number of samples to load

Returns:

list of dataset configs
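
As an illustration of the rewrite described above, the following minimal sketch turns already-split paths and weights into a list of config dictionaries. It is not the library's implementation: the function name `to_dataset_configs`, the `exists` parameter, and the dictionary keys `type`, `path`, and `weight` are assumptions made for this example, as is the rule that a path not found locally is treated as a Hugging Face path.

```python
import os
from typing import Callable, Dict, List, Union


def to_dataset_configs(
    paths: List[str],
    weights: List[float],
    exists: Callable[[str], bool] = os.path.exists,
) -> List[Dict[str, Union[str, float]]]:
    """Hypothetical helper: build one config dict per dataset path.

    The keys used here ("type", "path", "weight") are assumptions for
    illustration, not a documented schema.
    """
    if len(paths) != len(weights):
        raise ValueError("paths and weights must have the same length")

    configs: List[Dict[str, Union[str, float]]] = []
    for path, weight in zip(paths, weights):
        # Assumption: anything that exists on disk is a local dataset,
        # everything else is looked up as a Hugging Face path.
        kind = "local" if exists(path) else "huggingface"
        configs.append({"type": kind, "path": path, "weight": weight})
    return configs
```

The `exists` hook only makes the sketch easy to try without touching the file system, for example `to_dataset_configs(["ds1.jsonl"], [1.0], exists=lambda p: True)`.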

data_juicer.core.data.dataset_builder.parse_cli_datapath(dataset_path) -> Tuple[List[str], List[float]]

Split every dataset path and its weight.

Parameters:

dataset_path -- a dataset file or a dataset dir or a list of them, e.g. <w1> ds1.jsonl <w2> ds2_dir <w3> ds3_file.json

Returns:

a list of dataset paths and a list of the corresponding weights
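
The `<w1> ds1.jsonl <w2> ds2_dir <w3> ds3_file.json` format can be split as in the sketch below. This is a stand-in written for illustration, not the library's code: the name `parse_weighted_paths` is hypothetical, and it assumes that tokens are whitespace-separated, that any token parseable as a number is a weight, and that a path without a preceding weight gets a weight of 1.0.

```python
from typing import List, Optional, Tuple


def parse_weighted_paths(dataset_path: str) -> Tuple[List[str], List[float]]:
    """Hypothetical helper: split "<w1> path1 <w2> path2 ..." into two lists."""
    paths: List[str] = []
    weights: List[float] = []
    pending: Optional[float] = None  # weight read but not yet attached to a path

    for token in dataset_path.split():
        try:
            value = float(token)
        except ValueError:
            # Not a number, so treat the token as a dataset path.
            paths.append(token)
            # Assumption: a missing weight defaults to 1.0.
            weights.append(1.0 if pending is None else pending)
            pending = None
        else:
            if pending is not None:
                raise ValueError("found two weights in a row")
            pending = value

    if pending is not None:
        raise ValueError("trailing weight without a dataset path")
    return paths, weights
```

With these assumptions, `parse_weighted_paths("0.5 ds1.jsonl 2 ds2_dir ds3_file.json")` yields the paths `["ds1.jsonl", "ds2_dir", "ds3_file.json"]` and the weights `[0.5, 2.0, 1.0]`.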

data_juicer.core.data.dataset_builder.get_sample_numbers(weights, max_sample_num)
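
This page gives no description for get_sample_numbers. Judging only from its name and parameters, it may split max_sample_num across datasets in proportion to their weights. The sketch below shows one way such a split could work. The function name `split_sample_budget`, the rounding-up rule, and the capping of later datasets are all assumptions, not documented behavior.

```python
import math
from typing import List


def split_sample_budget(weights: List[float], max_sample_num: int) -> List[int]:
    """Hypothetical helper: share a sample budget in proportion to weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    remaining = max_sample_num
    counts: List[int] = []
    for weight in weights:
        # Assumption: round each proportional share up ...
        share = math.ceil(max_sample_num * weight / total)
        # ... then cap it so the running total never exceeds the budget.
        take = min(share, remaining)
        counts.append(take)
        remaining -= take
    return counts
```

Under these assumptions the counts never sum to more than max_sample_num, and datasets listed later absorb the rounding shortfall.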