data_juicer.analysis.correlation_analysis module

data_juicer.analysis.correlation_analysis.draw_heatmap(data, row_labels, col_labels, ax=None, cbar_kw=None, cbarlabel='', **kwargs)[source]

Create a heatmap from a numpy array and two lists of labels.

Parameters:
  • data -- A 2D numpy array of shape (M, N).

  • row_labels -- A list or array of length M with the labels for the rows.

  • col_labels -- A list or array of length N with the labels for the columns.

  • ax -- A matplotlib.axes.Axes instance to which the heatmap is plotted. If not provided, use current Axes or create a new one. Optional.

  • cbar_kw -- A dictionary with arguments to matplotlib.figure.Figure.colorbar. Optional.

  • cbarlabel -- The label for the colorbar. Optional.

  • **kwargs -- All other arguments are forwarded to imshow.
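As a rough illustration of the parameters above, the following sketch builds a labeled heatmap with matplotlib directly. It is not the data_juicer implementation: the helper name `sketch_heatmap`, its return value, and the sample matrix are hypothetical, and only standard matplotlib calls (`imshow`, `Figure.colorbar`, `set_xticks`, `set_xticklabels`) are used.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np


def sketch_heatmap(data, row_labels, col_labels, ax=None,
                   cbar_kw=None, cbarlabel="", **kwargs):
    """Hypothetical stand-in mirroring the documented parameters."""
    if ax is None:
        ax = plt.gca()  # use the current Axes, or create one
    if cbar_kw is None:
        cbar_kw = {}

    # Remaining keyword arguments go straight to imshow.
    im = ax.imshow(data, **kwargs)

    # Attach a colorbar to the figure that owns the Axes.
    cbar = ax.figure.colorbar(im, ax=ax, **cbar_kw)
    cbar.ax.set_ylabel(cbarlabel, rotation=-90, va="bottom")

    # One tick per column and per row, labeled from the given lists.
    ax.set_xticks(np.arange(data.shape[1]))
    ax.set_xticklabels(col_labels)
    ax.set_yticks(np.arange(data.shape[0]))
    ax.set_yticklabels(row_labels)
    return im, cbar


# Made-up 2 x 3 matrix, purely for illustration.
data = np.array([[1.0, 0.2, -0.4],
                 [0.2, 1.0, 0.7]])
fig, ax = plt.subplots()
im, cbar = sketch_heatmap(data, ["a", "b"], ["x", "y", "z"], ax=ax,
                          cmap="coolwarm", cbarlabel="correlation")
```

Here `cmap="coolwarm"` is an example of an extra keyword argument that is passed through to `imshow`.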

data_juicer.analysis.correlation_analysis.annotate_heatmap(im, data=None, valfmt='{x:.2f}', textcolors=('black', 'white'), threshold=None, **textkw)[source]

A function to annotate a heatmap.

Parameters:
  • im -- The AxesImage to be labeled.

  • data -- Data used to annotate. If None, the image's data is used. Optional.

  • valfmt -- The format of the annotations inside the heatmap. This should either use the string format method, e.g. "$ {x:.2f}", or be a matplotlib.ticker.Formatter. Optional.

  • textcolors -- A pair of colors. The first is used for values below a threshold, the second for those above. Optional.

  • threshold -- Value in data units that separates the two colors in textcolors. If None (the default), the middle of the colormap is used as the separation point. Optional.

  • **textkw -- All other arguments are forwarded to each call to text used to create the text labels.
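The interplay of valfmt, textcolors, and threshold can be sketched without matplotlib. The helper below is hypothetical and assumes a simple min-max normalization of the data (similar to a linear colormap); it only shows how a label string and a text color might be chosen for each cell, not how data_juicer implements it.

```python
def pick_annotations(values, valfmt="{x:.2f}",
                     textcolors=("black", "white"), threshold=None):
    """Hypothetical helper: return a (text, color) pair for each cell."""
    flat = [v for row in values for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for constant data

    def norm(v):
        # Assumed min-max normalization onto [0, 1].
        return (v - lo) / span

    # Without an explicit threshold, split at the middle of the color range.
    cut = 0.5 if threshold is None else norm(threshold)

    # First color for values at or below the cut, second for those above.
    return [[(valfmt.format(x=v), textcolors[int(norm(v) > cut)])
             for v in row] for row in values]


grid = [[0.0, 0.25], [0.75, 1.0]]
cells = pick_annotations(grid)
cells_high_cut = pick_annotations(grid, threshold=0.9)
```

With the default threshold, 0.75 is labeled in the second color; with `threshold=0.9` it falls below the cut and is labeled in the first.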

data_juicer.analysis.correlation_analysis.is_numeric_list_series(series)[source]

Whether a series is a numerical-list column.
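The page does not spell out what counts as a numerical-list column. One plausible reading, offered here only as an assumption, is that every cell holds a non-empty list of numbers. The sketch below uses plain Python lists in place of a pandas Series, and the function name is hypothetical.

```python
from numbers import Number


def looks_like_numeric_list_column(cells):
    """Hypothetical check: every cell is a non-empty list of numbers."""
    return all(
        isinstance(cell, list)
        and len(cell) > 0
        # bool is a Number subclass, so exclude it explicitly
        and all(isinstance(x, Number) and not isinstance(x, bool)
                for x in cell)
        for cell in cells
    )


numeric = looks_like_numeric_list_column([[1, 2.5], [3]])
mixed = looks_like_numeric_list_column([[1, "a"], [3]])
scalar = looks_like_numeric_list_column([1.0, 2.0])
```

Under this assumption, only the first column qualifies; a column of plain scalars or one with non-numeric entries does not.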

class data_juicer.analysis.correlation_analysis.CorrelationAnalysis(dataset, output_path)[source]

Bases: object

Analyze the correlations among different stats. Only for numerical stats.
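To make the idea of correlating numerical stats concrete, here is a minimal pure-Python sketch of a Pearson correlation matrix, Pearson being the default method in the analyze signature. The stat names and values are made up, and this is not the class's own code path.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up numerical stats columns, purely for illustration.
stats = {
    "text_len": [10.0, 20.0, 30.0, 40.0],
    "num_words": [2.0, 4.0, 6.0, 8.0],
    "score": [4.0, 3.0, 2.0, 1.0],
}
names = list(stats)

# Square matrix: corr[i][j] is the correlation between stats i and j.
corr = [[pearson(stats[a], stats[b]) for b in names] for a in names]
```

A matrix like `corr`, together with `names` as both row and column labels, is the kind of input a heatmap function such as draw_heatmap expects.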

__init__(dataset, output_path)[source]

Initialization method.

Parameters:
  • dataset -- the dataset to be analyzed

  • output_path -- path to store the analysis results

analyze(method='pearson', show=False, skip_export=False)[source]