data_juicer.config package

Submodules

data_juicer.config.config module

data_juicer.config.config.init_configs(args: List[str] | None = None, which_entry: object | None = None)[source]
Initialize the jsonargparse parser and parse configs from one of:
  1. POSIX-style command-line args;

  2. config files in YAML (JSON and Jsonnet supersets);

  3. environment variables;

  4. hard-coded defaults.

Parameters:
  • args -- list of params, e.g., ['--config', 'cfg.yaml']; defaults to None.

  • which_entry -- which entry to init configs for (executor/analyzer)

Returns:

a global cfg object used by the DefaultExecutor or Analyzer
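
The four sources listed above can be illustrated with a minimal, standard-library-only sketch. It is not how init_configs is implemented (that relies on jsonargparse). The function name load_layered_config, the DEMO_ environment prefix, the two option names, the JSON-only config file, and the precedence order (command line over config file over environment over defaults) are all assumptions made for illustration.

```python
import argparse
import json
import os

# Hypothetical hard-coded defaults; the keys are illustrative only.
DEFAULTS = {"np": 4, "export_path": "./outputs/result.jsonl"}


def load_layered_config(args=None, env=None):
    """Resolve options from defaults, env vars, a JSON file, and CLI args."""
    env = os.environ if env is None else env

    parser = argparse.ArgumentParser()
    parser.add_argument("--config", help="path to a JSON config file")
    parser.add_argument("--np", type=int)
    parser.add_argument("--export_path")
    cli = parser.parse_args(args)  # args=None falls back to sys.argv[1:]

    # 4. start from the hard-coded defaults
    cfg = dict(DEFAULTS)

    # 3. environment variables such as DEMO_NP (assumed naming scheme)
    for key, default in DEFAULTS.items():
        env_key = "DEMO_" + key.upper()
        if env_key in env:
            cfg[key] = type(default)(env[env_key])

    # 2. values read from the config file
    if cli.config:
        with open(cli.config, encoding="utf-8") as f:
            cfg.update(json.load(f))

    # 1. explicit command-line args win (assumed precedence)
    for key in DEFAULTS:
        value = getattr(cli, key)
        if value is not None:
            cfg[key] = value

    return argparse.Namespace(**cfg)
```

Calling load_layered_config(['--config', 'cfg.json', '--np', '8']) would, under these assumptions, return a namespace whose np is 8 regardless of what the file or the environment says.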

data_juicer.config.config.init_setup_from_cfg(cfg: Namespace)[source]

Do some extra setup tasks after parsing config file or command line.

  1. create working directory and a log directory

  2. update cache directory

  3. update checkpoint and temp_dir of tempfile

Parameters:
  • cfg -- an original cfg

Returns:

an updated cfg
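
A rough sketch of setup steps 1 and 3, using only the standard library. The helper name setup_from_cfg and the field names work_dir, log_dir and temp_dir are assumptions for illustration, and the cache-directory step is left out because it depends on the caching backend in use. This is not the library's implementation.

```python
import os
import tempfile


def setup_from_cfg(cfg):
    """Create run directories and point tempfile at cfg.temp_dir."""
    # 1. working directory plus a log sub-directory
    cfg.work_dir = os.path.abspath(cfg.work_dir)
    cfg.log_dir = os.path.join(cfg.work_dir, "log")
    os.makedirs(cfg.log_dir, exist_ok=True)

    # 3. redirect temporary files if a temp_dir was configured
    temp_dir = getattr(cfg, "temp_dir", None)
    if temp_dir:
        os.makedirs(temp_dir, exist_ok=True)
        tempfile.tempdir = temp_dir  # module-level default used by tempfile

    return cfg
```

Setting tempfile.tempdir changes where later tempfile calls in the same process create their files, which is why it is done once, right after the config is parsed.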

data_juicer.config.config.load_ops_with_stats_meta()[source]
data_juicer.config.config.update_op_attr(op_list: list, attr_dict: dict | None = None)[source]
data_juicer.config.config.sort_op_by_types_and_names(op_name_classes)[source]

Split op items by op type, sort each sub-list by name, then concatenate the sub-lists.

Parameters:

op_name_classes -- a list of op modules

Returns:

sorted op list, each item is a pair of op_name and op_class
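
The split-sort-concatenate idea can be sketched as follows. The assumptions here are that an op's type is the last underscore-separated token of its name, and that the type order is mapper, filter, deduplicator, selector. Neither is stated above, and sort_ops is a hypothetical stand-in rather than the library function.

```python
from operator import itemgetter

# Assumed type order; ops of any other type are appended at the end.
TYPE_ORDER = ("mapper", "filter", "deduplicator", "selector")


def sort_ops(op_name_classes):
    """Group (op_name, op_class) pairs by type, sort each group by name."""
    groups = {op_type: [] for op_type in TYPE_ORDER}
    others = []
    for name, cls in op_name_classes:
        op_type = name.rsplit("_", 1)[-1]  # e.g. "clean_html_mapper" -> "mapper"
        groups.get(op_type, others).append((name, cls))

    by_name = itemgetter(0)
    result = []
    for op_type in TYPE_ORDER:
        result.extend(sorted(groups[op_type], key=by_name))
    result.extend(sorted(others, key=by_name))
    return result
```

Sorting on the name alone avoids comparing class objects, which do not define an ordering.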

data_juicer.config.config.update_op_process(cfg, parser)[source]
data_juicer.config.config.namespace_to_arg_list(namespace, prefix='', includes=None, excludes=None)[source]
data_juicer.config.config.config_backup(cfg: Namespace)[source]
data_juicer.config.config.display_config(cfg: Namespace)[source]
data_juicer.config.config.export_config(cfg: Namespace, path: str, format: str = 'yaml', skip_none: bool = True, skip_check: bool = True, overwrite: bool = False, multifile: bool = True)[source]

Save the config object. Some params are from jsonargparse.

Parameters:
  • cfg -- cfg object to save (Namespace type)

  • path -- the save path

  • format -- the output format: 'yaml', 'json', 'json_indented', or 'parser_mode'

  • skip_none -- Whether to exclude entries whose value is None.

  • skip_check -- Whether to skip parser checking.

  • overwrite -- Whether to overwrite existing files.

  • multifile -- Whether to save multiple config files by using the __path__ metas.
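
To make the skip_none and overwrite switches concrete, here is a small stand-in that only handles the two JSON formats and a flat namespace. The name export_cfg and the choice to raise FileExistsError when overwrite is False are assumptions of this sketch, not documented behaviour of export_config.

```python
import json
import os


def export_cfg(cfg, path, format="json", skip_none=True, overwrite=False):
    """Write a flat Namespace to `path` as JSON."""
    if format not in ("json", "json_indented"):
        raise ValueError(f"unsupported format in this sketch: {format!r}")
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(f"{path} exists; pass overwrite=True to replace it")

    data = vars(cfg)
    if skip_none:
        # Drop entries whose value is None, as skip_none describes.
        data = {k: v for k, v in data.items() if v is not None}

    indent = 2 if format == "json_indented" else None
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=indent)
```

With skip_none left at True, a namespace such as Namespace(np=4, export_path=None) is written as a JSON object that contains only the np entry.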


data_juicer.config.config.merge_config(ori_cfg: Namespace, new_cfg: Namespace)[source]

Merge the configuration from new_cfg into ori_cfg.

Parameters:
  • ori_cfg -- the original configuration object, expected to be a jsonargparse Namespace

  • new_cfg -- the configuration object to be merged, expected to be a dict or a jsonargparse Namespace

Returns:

cfg_after_merge
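
One plausible reading of "merge new_cfg into ori_cfg" is sketched below with argparse.Namespace standing in for the jsonargparse type. The rules used here (values from new_cfg override, nested namespaces are merged recursively, ori_cfg is modified in place and returned) are assumptions; the text above does not say how conflicts are resolved, and merge_cfg is a hypothetical name.

```python
from argparse import Namespace


def merge_cfg(ori_cfg, new_cfg):
    """Merge a dict or Namespace into ori_cfg in place and return it."""
    new_items = vars(new_cfg) if isinstance(new_cfg, Namespace) else dict(new_cfg)
    for key, value in new_items.items():
        current = getattr(ori_cfg, key, None)
        if isinstance(current, Namespace) and isinstance(value, (dict, Namespace)):
            # Both sides are nested: merge key by key instead of replacing.
            merge_cfg(current, value)
        else:
            # Scalars, lists and new keys: the incoming value wins.
            setattr(ori_cfg, key, value)
    return ori_cfg
```

Under these rules, keys that appear only in ori_cfg are kept unchanged, so a partial override does not discard the rest of the original configuration.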

data_juicer.config.config.prepare_side_configs(ori_config: str | Namespace | Dict)[source]
Parse the config if ori_config is a string holding the path of a config file in yaml, yml or json format.

Parameters:

ori_config -- a config dict, or a string holding the path of a config file in yaml, yml or json format

Returns:

a config dict
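
The "path or dict in, dict out" contract can be sketched like this. The helper name load_side_config is hypothetical, argparse.Namespace stands in for the jsonargparse type, and the YAML branch assumes the third-party PyYAML package is installed. None of this is taken from the library's source.

```python
import json
import os
from argparse import Namespace


def load_side_config(ori_config):
    """Return a config dict from a file path, a Namespace, or a dict."""
    if isinstance(ori_config, str):
        ext = os.path.splitext(ori_config)[1].lower()
        if ext == ".json":
            with open(ori_config, encoding="utf-8") as f:
                return json.load(f)
        if ext in (".yaml", ".yml"):
            import yaml  # PyYAML, imported lazily; an assumed dependency

            with open(ori_config, encoding="utf-8") as f:
                return yaml.safe_load(f)
        raise ValueError(f"unsupported config file extension: {ext!r}")
    if isinstance(ori_config, Namespace):
        return vars(ori_config)
    if isinstance(ori_config, dict):
        return ori_config
    raise TypeError(f"unsupported config type: {type(ori_config).__name__}")
```

Dispatching on the file extension keeps the function predictable: an unknown extension fails early instead of being parsed with the wrong loader.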

data_juicer.config.config.get_init_configs(cfg: Namespace | Dict)[source]

Set the init configs of data-juicer for cfg.

data_juicer.config.config.get_default_cfg()[source]

Get default config values from config_all.yaml

Module contents

data_juicer.config.init_configs(args: List[str] | None = None, which_entry: object | None = None)[source]
Initialize the jsonargparse parser and parse configs from one of:
  1. POSIX-style command-line args;

  2. config files in YAML (JSON and Jsonnet supersets);

  3. environment variables;

  4. hard-coded defaults.

Parameters:
  • args -- list of params, e.g., ['--config', 'cfg.yaml']; defaults to None.

  • which_entry -- which entry to init configs for (executor/analyzer)

Returns:

a global cfg object used by the DefaultExecutor or Analyzer

data_juicer.config.get_init_configs(cfg: Namespace | Dict)[source]

Set the init configs of data-juicer for cfg.

data_juicer.config.export_config(cfg: Namespace, path: str, format: str = 'yaml', skip_none: bool = True, skip_check: bool = True, overwrite: bool = False, multifile: bool = True)[source]

Save the config object. Some params are from jsonargparse.

Parameters:
  • cfg -- cfg object to save (Namespace type)

  • path -- the save path

  • format -- the output format: 'yaml', 'json', 'json_indented', or 'parser_mode'

  • skip_none -- Whether to exclude entries whose value is None.

  • skip_check -- Whether to skip parser checking.

  • overwrite -- Whether to overwrite existing files.

  • multifile -- Whether to save multiple config files by using the __path__ metas.


data_juicer.config.merge_config(ori_cfg: Namespace, new_cfg: Namespace)[source]

Merge the configuration from new_cfg into ori_cfg.

Parameters:
  • ori_cfg -- the original configuration object, expected to be a jsonargparse Namespace

  • new_cfg -- the configuration object to be merged, expected to be a dict or a jsonargparse Namespace

Returns:

cfg_after_merge

data_juicer.config.prepare_side_configs(ori_config: str | Namespace | Dict)[source]
Parse the config if ori_config is a string holding the path of a config file in yaml, yml or json format.

Parameters:

ori_config -- a config dict, or a string holding the path of a config file in yaml, yml or json format

Returns:

a config dict

data_juicer.config.get_default_cfg()[source]

Get default config values from config_all.yaml