data_juicer.core.analyzer module

class data_juicer.core.analyzer.Analyzer(cfg: Namespace | None = None)

Bases: object

This Analyzer class is used to analyze a specific dataset.

It computes stats for all filter ops in the config file, applies multiple analyses (e.g. OverallAnalysis, ColumnWiseAnalysis) to these stats, and generates analysis results (stats tables, distribution figures, etc.) to help users better understand the input dataset.
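The flow described above (compute per-sample stats, then summarize them across the dataset) can be sketched in plain Python. This is only an illustration, not the data_juicer implementation: `STAT_FUNCS`, `compute_stats`, and `overall_analysis` are hypothetical names, and the two stats stand in for whatever the configured filter ops would compute.

```python
from statistics import mean

# Hypothetical stand-ins for filter ops: each maps a sample to one numeric stat.
STAT_FUNCS = {
    "text_len": lambda sample: len(sample["text"]),
    "num_words": lambda sample: len(sample["text"].split()),
}


def compute_stats(dataset):
    """Return copies of the samples, each with a 'stats' dict attached."""
    return [
        {**sample, "stats": {name: fn(sample) for name, fn in STAT_FUNCS.items()}}
        for sample in dataset
    ]


def overall_analysis(dataset_with_stats):
    """Summarize every stat column across the whole dataset."""
    summary = {}
    for name in STAT_FUNCS:
        values = [sample["stats"][name] for sample in dataset_with_stats]
        summary[name] = {
            "count": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }
    return summary


samples = [
    {"text": "hello world"},
    {"text": "a"},
    {"text": "data juicer analyzes datasets"},
]
analyzed = compute_stats(samples)
summary = overall_analysis(analyzed)
```

Keeping the stats step separate from the summary step means several analyses can reuse the same per-sample stats without recomputing them.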

__init__(cfg: Namespace | None = None)

Initialization method.

Parameters: cfg – optional jsonargparse Namespace dict.

run(dataset: Dataset | NestedDataset = None, load_data_np: Annotated[int, Gt(gt=0)] | None = None, skip_export: bool = False, skip_return: bool = False)

Run the dataset analysis pipeline.

Parameters:

dataset – a Dataset object to be analyzed.
load_data_np – number of workers when loading the dataset.
skip_export – whether to skip exporting the results to disk.
skip_return – whether to skip returning the analyzed dataset when called through the API.

Returns:

analyzed dataset.
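To make the two skip flags concrete, here is a minimal sketch built on a stand-in class. `StubAnalyzer` is hypothetical and only mirrors the documented `run()` signature; it does not import data_juicer. It assumes that `skip_return=True` makes the call return `None`, and that a non-positive `load_data_np` is rejected, as the `Gt(gt=0)` annotation suggests.

```python
class StubAnalyzer:
    """Hypothetical stand-in mirroring the documented run() flags; not the real class."""

    def __init__(self, cfg=None):
        self.cfg = cfg
        self.exported = []  # records what would have been written to disk

    def run(self, dataset=None, load_data_np=None, skip_export=False, skip_return=False):
        # The signature annotates load_data_np as a positive integer.
        if load_data_np is not None and load_data_np <= 0:
            raise ValueError("load_data_np must be a positive integer")

        # Placeholder for the real analysis: mark each sample as analyzed.
        analyzed = [dict(sample, analyzed=True) for sample in (dataset or [])]

        if not skip_export:
            self.exported.append(analyzed)

        # Assumption: with skip_return=True nothing is handed back to the caller.
        return None if skip_return else analyzed


analyzer = StubAnalyzer()
in_memory_only = analyzer.run([{"text": "x"}], skip_export=True)
export_only = analyzer.run([{"text": "x"}], skip_return=True)
```

In this sketch, `skip_export=True` suits callers that only need the in-memory result, while `skip_return=True` suits callers that only need the files written to disk.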