data_juicer.format.formatter module
- class data_juicer.format.formatter.LocalFormatter(dataset_path: str, type: str, suffixes: str | List[str] | None = None, text_keys: List[str] = None, add_suffix=False, **kwargs)[source]
This class loads a dataset from local files or a local directory.
- __init__(dataset_path: str, type: str, suffixes: str | List[str] | None = None, text_keys: List[str] = None, add_suffix=False, **kwargs)[source]
Initialization method.
- Parameters:
dataset_path -- path to a dataset file or a dataset directory
type -- a packaged dataset module type (json, csv, etc.)
suffixes -- files with specified suffixes to be processed
text_keys -- key names of the fields that store sample text
add_suffix -- whether to add the file suffix to dataset meta info
kwargs -- extra keyword arguments
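The interaction between dataset_path (a file or a directory) and suffixes can be illustrated with a small standalone sketch. This is not the library's implementation: collect_files_by_suffix is a hypothetical helper written only for this example, and it assumes that suffixes may be given with or without a leading dot and that directories are searched recursively.

```python
from pathlib import Path
from typing import Dict, List, Optional, Union


def collect_files_by_suffix(
    dataset_path: str,
    suffixes: Optional[Union[str, List[str]]] = None,
) -> Dict[str, List[str]]:
    """Hypothetical helper: group files under dataset_path by suffix."""
    # Accept a single suffix string as well as a list of suffixes.
    if isinstance(suffixes, str):
        suffixes = [suffixes]

    # Normalize to lowercase with a leading dot, e.g. "json" -> ".json".
    wanted = None
    if suffixes is not None:
        wanted = {
            s.lower() if s.startswith(".") else "." + s.lower()
            for s in suffixes
        }

    root = Path(dataset_path)
    # dataset_path may be a single file or a directory (assumed recursive).
    if root.is_file():
        candidates = [root]
    else:
        candidates = sorted(p for p in root.rglob("*") if p.is_file())

    grouped: Dict[str, List[str]] = {}
    for path in candidates:
        suffix = path.suffix.lower()
        # With no suffix filter, every file is kept.
        if wanted is None or suffix in wanted:
            grouped.setdefault(suffix, []).append(str(path))
    return grouped
```

Grouping by suffix rather than returning a flat list keeps the suffix available for later steps, such as recording it in the dataset meta info when add_suffix is enabled.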
- class data_juicer.format.formatter.RemoteFormatter(dataset_path: str, text_keys: List[str] = None, **kwargs)[source]
This class loads a dataset from a repository on the Hugging Face Hub.
- data_juicer.format.formatter.add_suffixes(datasets: DatasetDict, num_proc: int = 1) -> Dataset[source]
Add a suffix field to datasets.
- Parameters:
datasets -- a DatasetDict object
num_proc -- number of processes to add suffixes
- Returns:
a dataset with suffix features.
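The effect of adding a suffix field can be sketched without the datasets library. In the example below, a plain dict of lists stands in for the DatasetDict and a flat list stands in for the returned Dataset. Both the helper add_suffix_field and the field name "suffix" are hypothetical, and the sketch assumes that each key of the input mapping is the file suffix its samples came from.

```python
from typing import Any, Dict, List

# Hypothetical field name chosen for this sketch only.
SUFFIX_FIELD = "suffix"


def add_suffix_field(
    splits: Dict[str, List[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Tag every sample with the suffix of its split and merge the splits."""
    merged: List[Dict[str, Any]] = []
    for suffix, samples in splits.items():
        for sample in samples:
            # Copy each sample so the input mapping is left unmodified.
            merged.append({**sample, SUFFIX_FIELD: suffix})
    return merged


# Assumed layout: one entry per file suffix.
splits = {
    ".json": [{"text": "first"}],
    ".csv": [{"text": "second"}],
}
merged = add_suffix_field(splits)
```

After merging, each sample still records which kind of file it came from, which is the information the suffix feature is meant to preserve.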
- data_juicer.format.formatter.unify_format(dataset: Dataset, text_keys: List[str] | str = 'text', num_proc: int = 1, global_cfg: dict | Namespace = None) -> Dataset[source]
Convert the dataset to a unified internal format by applying the following modifications:
check the keys of the dataset
filter out samples whose text is empty or None
- Parameters:
dataset -- input dataset
text_keys -- original text key(s) of dataset.
num_proc -- number of processes for mapping
global_cfg -- the global cfg used in subsequent processes; it is passed in because cfg.text_key may be modified after unifying
- Returns:
the dataset in the unified format
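The two modifications listed above (checking keys, then dropping samples with empty or None text) can be sketched on plain Python dicts. This is only an illustration under stated assumptions, not the library's code: unify_samples is a hypothetical helper, a list of dicts stands in for the Dataset, a sample is dropped if any of its text keys is None or the empty string, and a missing key raises KeyError.

```python
from typing import Any, Dict, List, Union


def unify_samples(
    samples: List[Dict[str, Any]],
    text_keys: Union[List[str], str] = "text",
) -> List[Dict[str, Any]]:
    """Hypothetical sketch of the key check and the empty-text filter."""
    # Accept a single key as well as a list of keys.
    if isinstance(text_keys, str):
        text_keys = [text_keys]

    # Step 1: check that every text key exists in every sample.
    for index, sample in enumerate(samples):
        for key in text_keys:
            if key not in sample:
                raise KeyError(f"sample {index} has no key {key!r}")

    # Step 2: keep only samples whose text keys are all non-empty.
    def has_text(sample: Dict[str, Any]) -> bool:
        for key in text_keys:
            value = sample[key]
            if value is None or value == "":
                return False
        return True

    return [sample for sample in samples if has_text(sample)]


samples = [{"text": "hello"}, {"text": ""}, {"text": None}]
kept = unify_samples(samples)
```

Running the key check before the filter means a misconfigured text key fails early instead of silently producing an empty dataset.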