data_juicer.core.data.dataset_builder module
- class data_juicer.core.data.dataset_builder.DatasetBuilder(cfg: Namespace, executor_type: str = 'default')
Bases: object
DatasetBuilder is a class that builds a dataset from a configuration.
- data_juicer.core.data.dataset_builder.rewrite_cli_datapath(dataset_path, max_sample_num=None) → List
Rewrite a dataset_path given on the CLI into the dataset config format used by YAML configs. This adapts CLI input that refers to local files or Hugging Face paths.
- Parameters:
dataset_path – a dataset file or a dataset dir or a list of them, e.g. <w1> ds1.jsonl <w2> ds2_dir <w3> ds3_file.json
max_sample_num – the maximum number of samples to load
- Returns:
list of dataset configs
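To make the idea of "dataset configs" concrete, here is a minimal sketch of pairing already-split paths and weights into config entries. It is not data_juicer's implementation: the helper name `sketch_to_configs`, the keys `type`, `path`, and `weight`, and the rule "a path that exists on disk is local, anything else is treated as a Hugging Face path" are all assumptions made for illustration, and `max_sample_num` is left out.

```python
import os
import tempfile
from typing import Any, Dict, List


def sketch_to_configs(paths: List[str], weights: List[float]) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for the rewrite step; keys are illustrative only."""
    configs = []
    for path, weight in zip(paths, weights):
        # Assumption: existing files/dirs are local, everything else is remote.
        kind = "local" if os.path.exists(path) else "huggingface"
        configs.append({"type": kind, "path": path, "weight": weight})
    return configs


# Usage: one real temporary file and one made-up remote-style name.
with tempfile.TemporaryDirectory() as tmp:
    local_file = os.path.join(tmp, "ds1.jsonl")
    open(local_file, "w").close()
    configs = sketch_to_configs([local_file, "some_org/some_dataset"], [0.5, 1.0])
```

The real function may use different keys and stricter validation; consult its source before relying on any particular output shape.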
- data_juicer.core.data.dataset_builder.parse_cli_datapath(dataset_path) → Tuple[List[str], List[float]]
Split the CLI dataset path string into individual dataset paths and their weights.
- Parameters:
dataset_path – a dataset file or a dataset dir or a list of them, e.g. <w1> ds1.jsonl <w2> ds2_dir <w3> ds3_file.json
- Returns:
a list of dataset paths and a list of their weights
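The `<w1> ds1.jsonl <w2> ds2_dir <w3> ds3_file.json` format can be illustrated with a short standalone sketch. The helper `split_weighted_paths` below is hypothetical and is not the library's code; in particular, defaulting a missing weight to 1.0 is an assumption rather than something this page states.

```python
import shlex
from typing import List, Optional, Tuple


def split_weighted_paths(dataset_path: str) -> Tuple[List[str], List[float]]:
    """Illustrative parser for '<w1> path1 <w2> path2 ...' strings."""
    paths: List[str] = []
    weights: List[float] = []
    pending: Optional[float] = None  # weight seen but not yet attached to a path
    for token in shlex.split(dataset_path):
        try:
            # A numeric token is read as the weight of the next path.
            pending = float(token)
        except ValueError:
            # A non-numeric token is a path; assumed default weight is 1.0.
            paths.append(token)
            weights.append(1.0 if pending is None else pending)
            pending = None
    return paths, weights


paths, weights = split_weighted_paths("0.7 ds1.jsonl 0.3 ds2_dir ds3_file.json")
```

In this sketch a numeric token is always read as a weight, so a file literally named like a number would need a different convention.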