data_juicer.format.formatter module

class data_juicer.format.formatter.BaseFormatter

Bases: object

Base class for loading datasets.

load_dataset(*args) -> Dataset

class data_juicer.format.formatter.LocalFormatter(dataset_path: str, type: str, suffixes: str | List[str] | None = None, text_keys: List[str] = None, add_suffix=False, **kwargs)

Bases: BaseFormatter

Loads a dataset from local files or a local directory.

__init__(dataset_path: str, type: str, suffixes: str | List[str] | None = None, text_keys: List[str] = None, add_suffix=False, **kwargs)

Initialization method.

Parameters:
  • dataset_path -- path to a dataset file or a dataset directory

  • type -- a packaged dataset module type (json, csv, etc.)

  • suffixes -- only files with the specified suffixes are processed

  • text_keys -- key names of the fields that store the sample text

  • add_suffix -- whether to add the file suffix to dataset meta info

  • kwargs -- extra args
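
To illustrate what the suffixes parameter is for, here is a minimal sketch of suffix-based file selection. It is not the library's implementation: the helper name find_files_with_suffixes and the choice to accept suffixes with or without a leading dot are assumptions made for this example.

```python
from pathlib import Path
from typing import List, Optional, Union


def find_files_with_suffixes(
    path: str,
    suffixes: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    """Hypothetical helper: list files under `path`, keeping only given suffixes."""
    # Accept a single suffix string, mirroring the `str | List[str] | None`
    # type of the `suffixes` parameter.
    if isinstance(suffixes, str):
        suffixes = [suffixes]

    root = Path(path)
    # A single file is treated as a one-element candidate list;
    # a directory is scanned recursively.
    if root.is_file():
        candidates = [root]
    else:
        candidates = [p for p in root.rglob("*") if p.is_file()]

    # No suffixes given: keep every file.
    if suffixes is None:
        return sorted(str(p) for p in candidates)

    # Assumption: suffixes may be written as "jsonl" or ".jsonl".
    wanted = {s if s.startswith(".") else "." + s for s in suffixes}
    return sorted(str(p) for p in candidates if p.suffix in wanted)
```

Note that this sketch compares only the final extension of each file name, so a compound suffix such as ".jsonl.zst" would need extra handling.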

load_dataset(num_proc: int = 1, global_cfg=None) -> Dataset

Load a dataset from a dataset file or a dataset directory, and unify its format.

Parameters:
  • num_proc -- number of processes used when loading the dataset

  • global_cfg -- the global cfg used in subsequent processes

Returns:

formatted dataset
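
A possible way to use this class, based only on the signatures documented above, is sketched below. The wrapper name load_local_dataset, the suffix spelling ".jsonl" and the text key "text" are assumptions for illustration; the import is deferred so that the function can be defined even where data_juicer is not installed.

```python
def load_local_dataset(dataset_path: str, num_proc: int = 1):
    """Hypothetical wrapper around LocalFormatter; calling it needs data_juicer."""
    # Imported lazily so that defining this helper does not require the package.
    from data_juicer.format.formatter import LocalFormatter

    formatter = LocalFormatter(
        dataset_path=dataset_path,  # a dataset file or a dataset directory
        type="json",                # packaged dataset module type
        suffixes=[".jsonl"],        # assumed suffix spelling (leading dot)
        text_keys=["text"],         # assumed name of the text field
        add_suffix=False,
    )
    # Returns the formatted dataset, as described above.
    return formatter.load_dataset(num_proc=num_proc)
```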

class data_juicer.format.formatter.RemoteFormatter(dataset_path: str, text_keys: List[str] = None, **kwargs)

Bases: BaseFormatter

Loads a dataset from a repository on the Hugging Face Hub.

__init__(dataset_path: str, text_keys: List[str] = None, **kwargs)

Initialization method.

Parameters:
  • dataset_path -- the dataset to load (for this class, a repository on the Hugging Face Hub rather than a local file or directory)

  • text_keys -- key names of the fields that store the sample text

  • kwargs -- extra args

load_dataset(num_proc: int = 1, global_cfg=None) -> Dataset

Load a dataset from the Hugging Face Hub and unify its format.

Parameters:
  • num_proc -- number of processes used when loading the dataset

  • global_cfg -- the global cfg used in subsequent processes

Returns:

formatted dataset
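
The remote case could be used in much the same way. This is a sketch under assumptions: the wrapper name load_remote_dataset and the text key "text" are hypothetical, and the caller supplies the repository identifier. The import is deferred so the function can be defined without data_juicer installed.

```python
def load_remote_dataset(repo_id: str, num_proc: int = 1):
    """Hypothetical wrapper around RemoteFormatter; calling it needs data_juicer."""
    # Imported lazily so that defining this helper does not require the package.
    from data_juicer.format.formatter import RemoteFormatter

    formatter = RemoteFormatter(
        dataset_path=repo_id,   # dataset repository on the Hugging Face Hub
        text_keys=["text"],     # assumed name of the text field
    )
    # Returns the formatted dataset, as described above.
    return formatter.load_dataset(num_proc=num_proc)
```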

data_juicer.format.formatter.add_suffixes(datasets: DatasetDict, num_proc: int = 1) -> Dataset

Add a suffix field to datasets.

Parameters:
  • datasets -- a DatasetDict object

  • num_proc -- number of processes to add suffixes

Returns:

datasets with suffix features.
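
The idea can be sketched in plain Python, with a dict of lists standing in for a DatasetDict. This is an illustration, not the library's code: the field name "suffix", the use of each dict key as the suffix value, and the merging of all entries into one flat list (suggested by the Dataset return type) are assumptions.

```python
from typing import Dict, List

SUFFIX_FIELD = "suffix"  # hypothetical field name, chosen for this sketch


def add_suffix_field(datasets: Dict[str, List[dict]]) -> List[dict]:
    """Tag every sample with the suffix of its source and merge the results."""
    merged = []
    for key, samples in datasets.items():
        for sample in samples:
            tagged = dict(sample)  # copy, so the input is left untouched
            # Keep an existing suffix value; otherwise derive it from the key.
            tagged.setdefault(SUFFIX_FIELD, "." + key)
            merged.append(tagged)
    return merged
```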

data_juicer.format.formatter.unify_format(dataset: Dataset, text_keys: List[str] | str = 'text', num_proc: int = 1, global_cfg: dict | Namespace = None) -> Dataset

Convert the dataset to a unified internal format by applying the following modifications:

  1. check the keys of the dataset

  2. filter out samples whose text is empty or None

Parameters:
  • dataset -- input dataset

  • text_keys -- original text key(s) of the dataset

  • num_proc -- number of processes for mapping

  • global_cfg -- the global cfg used in subsequent processes, since cfg.text_key may be modified after unifying

Returns:

unified_format_dataset
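
The two steps listed above can be sketched on a plain list of dicts. This is a minimal illustration, not the library's implementation: the function name unify_samples is hypothetical, and two behaviours are assumptions, namely raising KeyError for a missing key and dropping a sample when any of its text keys is None or an empty string.

```python
from typing import List, Union


def unify_samples(
    samples: List[dict],
    text_keys: Union[List[str], str] = "text",
) -> List[dict]:
    """Check that the text keys exist, then drop samples with empty or None text."""
    # Accept a single key, mirroring the `List[str] | str` parameter type.
    if isinstance(text_keys, str):
        text_keys = [text_keys]

    # Step 1: check the keys of the dataset.
    for index, sample in enumerate(samples):
        for key in text_keys:
            if key not in sample:
                raise KeyError(f"sample {index} has no key {key!r}")

    # Step 2: filter out samples whose text is empty or None.
    def has_text(sample: dict) -> bool:
        return all(sample[key] is not None and sample[key] != "" for key in text_keys)

    return [sample for sample in samples if has_text(sample)]
```

Whitespace-only text is kept by this sketch; whether the real function treats it as empty is not stated here.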