data_juicer.analysis.overall_analysis module

class data_juicer.analysis.overall_analysis.OverallAnalysis(dataset, output_path)

Bases: object

Apply analysis on the overall stats, including mean, std, quantiles, etc.

__init__(dataset, output_path)

Initialization method.

Parameters:
  • dataset – the dataset to be analyzed

  • output_path – path to store the analysis results

refine_single_column(col)

analyze(percentiles=[], num_proc=1, skip_export=False)

Apply overall analysis on the whole dataset based on the describe method of pandas.

Parameters:
  • percentiles – percentiles to analyze

  • num_proc – number of processes used to analyze the dataset

  • skip_export – whether to skip exporting the results to disk

Returns:

the overall analysis result.
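
Since analyze is described as building on the describe method of pandas, the following minimal sketch shows the kind of summary that method produces when given custom percentiles. It is not the OverallAnalysis implementation: the DataFrame, its column names, and the chosen percentiles are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical per-sample stats; the column names are illustrative
# assumptions, not fields defined by data_juicer.
stats = pd.DataFrame({
    "text_len": [10, 20, 30, 40, 50],
    "word_count": [2, 4, 6, 8, 10],
})

# Requested percentiles are reported alongside count, mean, std, min, and max.
percentiles = [0.25, 0.5, 0.75, 0.9]
overall = stats.describe(percentiles=percentiles)

print(overall)
```

Each column of the result summarizes one stats field, and each requested percentile appears as its own row (for example "90%").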