data_juicer.core.analyzer module

class data_juicer.core.analyzer.Analyzer(cfg: Namespace | None = None)[source]

Bases: object

The Analyzer class analyzes a given dataset.

It computes stats for all filter ops in the config file, applies several analyses (e.g. OverallAnalysis, ColumnWiseAnalysis) to these stats, and generates the analysis results (stats tables, distribution figures, etc.) to help users better understand the input dataset.
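The flow described above (per-sample stats first, then dataset-level analysis over those stats) can be illustrated with a minimal standard-library sketch. It is not Data-Juicer's implementation: `text_len`, `num_words`, `compute_stats`, and `overall_analysis` are hypothetical stand-ins for filter ops and for an overall analysis step.

```python
import statistics

# Hypothetical stand-ins for filter ops: each maps a sample to one numeric stat.
def text_len(sample):
    return len(sample["text"])

def num_words(sample):
    return len(sample["text"].split())

STAT_FUNCS = {"text_len": text_len, "num_words": num_words}

def compute_stats(samples):
    """Attach a stats dict to every sample, one entry per stand-in op."""
    return [
        {**s, "stats": {name: fn(s) for name, fn in STAT_FUNCS.items()}}
        for s in samples
    ]

def overall_analysis(analyzed):
    """Summarize each stat column across the whole dataset."""
    summary = {}
    for name in STAT_FUNCS:
        column = [s["stats"][name] for s in analyzed]
        summary[name] = {
            "min": min(column),
            "max": max(column),
            "mean": statistics.mean(column),
        }
    return summary

samples = [{"text": "hello world"}, {"text": "a b c"}, {"text": "data"}]
analyzed = compute_stats(samples)
summary = overall_analysis(analyzed)
```

Keeping the stats on each sample, rather than only the aggregate, is what allows several different analyses to reuse the same computed values.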

__init__(cfg: Namespace | None = None)[source]

Initialization method.

Parameters:

cfg -- an optional jsonargparse Namespace of configuration options.

run(dataset: Dataset | NestedDataset = None, load_data_np: Annotated[int, Gt(gt=0)] | None = None, skip_export: bool = False, skip_return: bool = False)[source]

Run the dataset analysis pipeline.

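Parameters:
  • dataset -- a Dataset object to be analyzed.

  • load_data_np -- number of workers used when loading the dataset; must be a positive integer if given.

  • skip_export -- whether to skip exporting the results to disk.

  • skip_return -- whether to skip returning the analyzed dataset when called through the API.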

Returns:

The analyzed dataset.
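To make the two skip flags concrete, here is a toy sketch of how they might interact. `ToyAnalyzer` is a hypothetical stand-in, not the Data-Juicer class, and the behavior shown (no file written when `skip_export=True`, `None` returned when `skip_return=True`) is an assumption read off the parameter descriptions above rather than a confirmed detail of the library.

```python
import json
import os
import tempfile

class ToyAnalyzer:
    """Hypothetical stand-in that mirrors only the two skip flags."""

    def __init__(self, export_path):
        self.export_path = export_path

    def run(self, dataset, skip_export=False, skip_return=False):
        # Stand-in "analysis": record the text length of each sample.
        analyzed = [{**s, "text_len": len(s["text"])} for s in dataset]
        if not skip_export:
            with open(self.export_path, "w", encoding="utf-8") as f:
                json.dump(analyzed, f)
        # Assumption: skip_return=True means the caller gets nothing back.
        return None if skip_return else analyzed

tmp_dir = tempfile.mkdtemp()
export_path = os.path.join(tmp_dir, "analysis.json")
analyzer = ToyAnalyzer(export_path)
data = [{"text": "hello"}, {"text": "hi"}]

# In-memory use: keep the result, write nothing to disk.
in_memory = analyzer.run(data, skip_export=True)
exists_after_first = os.path.exists(export_path)

# Export-only use: write the results, return nothing.
exported_only = analyzer.run(data, skip_return=True)
exists_after_second = os.path.exists(export_path)
```

Under these assumptions, the first call suits programmatic use where the caller wants the analyzed data directly, and the second suits batch runs where only the exported files matter.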