data_juicer.analysis package

Submodules

data_juicer.analysis.collector module

data_juicer.analysis.column_wise_analysis module

data_juicer.analysis.column_wise_analysis.get_row_col(total_num, factor=2)[source]

Given the total number of stats figures, get the “best” number of rows and columns. This function is used when all stats figures need to be stored in one image.

Parameters:
  • total_num – Total number of stats figures

  • factor – Number of sub-figure types in each figure. By default, it’s 2, which means there are a histogram and a box plot for each stats figure

Returns:

“best” number of rows and columns, and the grid list
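The exact layout heuristic is internal to data_juicer; the sketch below shows one plausible grid computation, assuming each stats figure contributes factor sub-figures and columns are rounded up to a multiple of factor so each stat’s sub-figures stay together:

```python
import math

def get_row_col(total_num, factor=2):
    # Total number of sub-figures to lay out on the grid.
    n = total_num * factor
    # Start from a near-square grid, then round columns up to a
    # multiple of `factor` so each stat's sub-figures share a row segment.
    rows = int(math.sqrt(n)) or 1
    cols = math.ceil(n / rows)
    cols = math.ceil(cols / factor) * factor
    rows = math.ceil(n / cols)
    # The grid list maps each sub-figure index to its (row, col) cell.
    grid = [(i // cols, i % cols) for i in range(n)]
    return rows, cols, grid
```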

class data_juicer.analysis.column_wise_analysis.ColumnWiseAnalysis(dataset, output_path, overall_result=None, save_stats_in_one_file=True)[source]

Bases: object

Apply analysis to each stats column separately.

__init__(dataset, output_path, overall_result=None, save_stats_in_one_file=True)[source]

Initialization method

Parameters:
  • dataset – the dataset to be analyzed

  • output_path – path to store the analysis results

  • overall_result – optional precomputed overall stats result

  • save_stats_in_one_file – whether to save all analysis figures of all stats into one image file

analyze(show_percentiles=False, show=False, skip_export=False)[source]

Apply analysis and draw the analysis figure for stats.

Parameters:
  • show_percentiles – whether to show the percentile line in each sub-figure. If it’s true, there will be several red lines to indicate the quantiles of the stats distributions

  • show – whether to show in a single window after drawing

  • skip_export – whether to skip saving the results to disk

Returns:

draw_hist(ax, data, save_path, percentiles=None, show=False)[source]

Draw the histogram for the data.

Parameters:
  • ax – the axes to draw

  • data – data to draw

  • save_path – the path to save the histogram figure

  • percentiles – the overall analysis result of the data including percentile information

  • show – whether to show in a single window after drawing

Returns:

draw_box(ax, data, save_path, percentiles=None, show=False)[source]

Draw the box plot for the data.

Parameters:
  • ax – the axes to draw

  • data – data to draw

  • save_path – the path to save the box figure

  • percentiles – the overall analysis result of the data including percentile information

  • show – whether to show in a single window after drawing

Returns:

draw_wordcloud(ax, data, save_path, show=False)[source]

data_juicer.analysis.correlation_analysis module

data_juicer.analysis.correlation_analysis.draw_heatmap(data, row_labels, col_labels, ax=None, cbar_kw=None, cbarlabel='', **kwargs)[source]

Create a heatmap from a numpy array and two lists of labels.

Parameters:
  • data – A 2D numpy array of shape (M, N).

  • row_labels – A list or array of length M with the labels for the rows.

  • col_labels – A list or array of length N with the labels for the columns.

  • ax – A matplotlib.axes.Axes instance to which the heatmap is plotted. If not provided, use current Axes or create a new one. Optional.

  • cbar_kw – A dictionary with arguments to matplotlib.Figure.colorbar. Optional.

  • cbarlabel – The label for the colorbar. Optional.

  • **kwargs – All other arguments are forwarded to imshow.

data_juicer.analysis.correlation_analysis.annotate_heatmap(im, data=None, valfmt='{x:.2f}', textcolors=('black', 'white'), threshold=None, **textkw)[source]

A function to annotate a heatmap.

Parameters:
  • im – The AxesImage to be labeled.

  • data – Data used to annotate. If None, the image’s data is used. Optional.

  • valfmt – The format of the annotations inside the heatmap. This should either use the string format method, e.g. “$ {x:.2f}”, or be a matplotlib.ticker.Formatter. Optional.

  • textcolors – A pair of colors. The first is used for values below a threshold, the second for those above. Optional.

  • threshold – Value in data units according to which the colors from textcolors are applied. If None (the default) uses the middle of the colormap as separation. Optional.

  • **textkw – All other arguments are forwarded to each call to text used to create the text labels.

data_juicer.analysis.correlation_analysis.is_numeric_list_series(series)[source]

Return whether a series is a column of numerical lists.
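The real function operates on a pandas Series; a dependency-free illustration of the same check, using a hypothetical is_numeric_list helper over a plain Python sequence:

```python
def is_numeric_list(values):
    """Return True if every non-null entry is a list/tuple of numbers."""
    seen_list = False
    for v in values:
        if v is None:
            continue  # ignore missing entries
        if not isinstance(v, (list, tuple)):
            return False
        # Booleans are ints in Python, so exclude them explicitly.
        if not all(isinstance(x, (int, float)) and not isinstance(x, bool)
                   for x in v):
            return False
        seen_list = True
    return seen_list
```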

class data_juicer.analysis.correlation_analysis.CorrelationAnalysis(dataset, output_path)[source]

Bases: object

Analyze the correlations among different stats. Only for numerical stats.

__init__(dataset, output_path)[source]

Initialization method.

Parameters:
  • dataset – the dataset to be analyzed

  • output_path – path to store the analysis results

analyze(method='pearson', show=False, skip_export=False)[source]

data_juicer.analysis.diversity_analysis module

data_juicer.analysis.diversity_analysis.find_root_verb_and_its_dobj(tree_root)[source]

Find the verb and its object closest to the root.

Parameters:

tree_root – the root of the lexical tree

Returns:

valid verb and its object.

data_juicer.analysis.diversity_analysis.find_root_verb_and_its_dobj_in_string(nlp, s, first_sent=True)[source]

Find the verb and its object closest to the root of the lexical tree of the input string.

Parameters:
  • nlp – the diversity model to analyze the diversity strings

  • s – the string to be analyzed

  • first_sent – whether to analyze only the first sentence of the input string. If it’s true, return the analysis result of the first sentence regardless of whether it’s valid. If it’s false, return the first valid result over all sentences

Returns:

valid verb and its object of this string

data_juicer.analysis.diversity_analysis.get_diversity(dataset, top_k_verbs=20, top_k_nouns=4, **kwargs)[source]

Given the lexical tree analysis result, return the diversity results.

Parameters:
  • dataset – lexical tree analysis result

  • top_k_verbs – only keep the top_k_verbs largest verb groups

  • top_k_nouns – only keep the top_k_nouns largest noun groups for each verb group

  • kwargs – extra args

Returns:

the diversity results
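The top-k grouping behaviour can be illustrated with collections.Counter. This is a simplified sketch that assumes the analysis result is a list of (verb, noun) pairs; the real function consumes the lexical-tree analysis result as a dataset:

```python
from collections import Counter, defaultdict

def get_diversity(pairs, top_k_verbs=20, top_k_nouns=4):
    # Count samples per verb, then per noun within each verb group.
    by_verb = defaultdict(Counter)
    for verb, noun in pairs:
        by_verb[verb][noun] += 1
    verb_totals = Counter({v: sum(c.values()) for v, c in by_verb.items()})
    result = []
    # Keep only the largest verb groups, and within each of them
    # only the largest noun groups.
    for verb, _total in verb_totals.most_common(top_k_verbs):
        for noun, count in by_verb[verb].most_common(top_k_nouns):
            result.append({'verb': verb, 'noun': noun, 'count': count})
    return result

pairs = [('write', 'code'), ('write', 'code'), ('write', 'essay'),
         ('draw', 'figure')]
```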

class data_juicer.analysis.diversity_analysis.DiversityAnalysis(dataset, output_path, lang_or_model='en')[source]

Bases: object

Apply diversity analysis for each sample and get an overall analysis result.

__init__(dataset, output_path, lang_or_model='en')[source]

Initialization method.

Parameters:
  • dataset – the dataset to be analyzed

  • output_path – path to store the analysis results

  • lang_or_model – the diversity model or a specific language used to load the diversity model

compute(lang_or_model=None, column_name='text')[source]

Apply lexical tree analysis on each sample.

Parameters:
  • lang_or_model – the diversity model or a specific language used to load the diversity model

  • column_name – the name of column to be analyzed

Returns:

the analysis result.

analyze(lang_or_model=None, column_name='text', postproc_func=<function get_diversity>, **postproc_kwarg)[source]

Apply diversity analysis on the whole dataset.

Parameters:
  • lang_or_model – the diversity model or a specific language used to load the diversity model

  • column_name – the name of column to be analyzed

  • postproc_func – function to analyze diversity. By default, it’s the function get_diversity

  • postproc_kwarg – arguments of the postproc_func

Returns:

data_juicer.analysis.measure module

class data_juicer.analysis.measure.Measure[source]

Bases: object

Base class for Measure distribution.

name = 'base'
measure(*args, **kwargs)[source]
class data_juicer.analysis.measure.KLDivMeasure[source]

Bases: Measure

Measure Kullback-Leibler divergence.

name = 'kl_divergence'
measure(p, q)[source]
class data_juicer.analysis.measure.JSDivMeasure[source]

Bases: Measure

Measure Jensen-Shannon divergence.

name = 'js_divergence'
measure(p, q)[source]
class data_juicer.analysis.measure.CrossEntropyMeasure[source]

Bases: Measure

Measure Cross-Entropy.

name = 'cross_entropy'
measure(p, q)[source]
class data_juicer.analysis.measure.EntropyMeasure[source]

Bases: Measure

Measure Entropy.

name = 'entropy'
measure(p)[source]
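Assuming p and q are discrete probability vectors over the same bins, the four measures above can be sketched in pure Python (natural-log base assumed; the real classes compute the same quantities over distributions):

```python
import math

def entropy(p):
    # H(p) = -sum p_i log p_i, with 0 * log 0 treated as 0.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum p_i log q_i; diverges if q_i = 0 where p_i > 0.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # KL(p || q) = H(p, q) - H(p).
    return cross_entropy(p, q) - entropy(p)

def js_divergence(p, q):
    # JS(p, q) = (KL(p || m) + KL(q || m)) / 2, m the midpoint distribution.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (kl_divergence(p, m) + kl_divergence(q, m)) / 2
```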
class data_juicer.analysis.measure.RelatedTTestMeasure[source]

Bases: Measure

Measure the t-test statistic for two related distributions on their histograms over the same bins.

Ref: https://en.wikipedia.org/wiki/Student%27s_t-test

For continuous features or distributions, the input can be a list of dataset stats. For discrete features or distributions, the input can be a list of tags or categories.

name = 't-test'
static stats_to_hist(p, q)[source]
static category_to_hist(p, q)[source]
measure(p, q)[source]
Parameters:
  • p – the first feature or distribution. (stats/tags/categories)

  • q – the second feature or distribution. (stats/tags/categories)

Returns:

the T-Test result object (see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats._result_classes.TtestResult.html)
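The statistic itself can be computed without SciPy. Below is a minimal sketch of a related (paired) t-test on two histograms over the same bins; the real measure returns a SciPy TtestResult, which also carries the p-value:

```python
import math

def related_t_statistic(p_hist, q_hist):
    # Paired t-test: t = mean(d) / (std(d) / sqrt(n)), d_i = p_i - q_i.
    diffs = [pi - qi for pi, qi in zip(p_hist, q_hist)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance with Bessel's correction (n - 1).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n)
```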

data_juicer.analysis.overall_analysis module

class data_juicer.analysis.overall_analysis.OverallAnalysis(dataset, output_path)[source]

Bases: object

Apply analysis on the overall stats, including mean, std, quantiles, etc.

__init__(dataset, output_path)[source]

Initialization method.

Parameters:
  • dataset – the dataset to be analyzed

  • output_path – path to store the analysis results.

refine_single_column(col)[source]
analyze(percentiles=[], num_proc=1, skip_export=False)[source]

Apply overall analysis on the whole dataset based on the describe method of pandas.

Parameters:
  • percentiles – percentiles to analyze

  • num_proc – number of processes to analyze the dataset

  • skip_export – whether to skip exporting the results to disk

Returns:

the overall analysis result.
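The analysis delegates to pandas’ describe; the same summary for one numeric column can be illustrated with the standard library alone (statistics.quantiles, available since Python 3.8, is assumed here):

```python
import statistics

def overall_stats(values, percentiles=(0.25, 0.5, 0.75)):
    # Mimic the core of pandas describe for a single numeric column:
    # count, mean, std, min, max, and the requested percentiles.
    qs = statistics.quantiles(values, n=100, method='inclusive')
    return {
        'count': len(values),
        'mean': statistics.mean(values),
        'std': statistics.stdev(values),
        'min': min(values),
        'max': max(values),
        # qs[i - 1] is the i-th percentile cut point.
        **{f'{int(p * 100)}%': qs[int(p * 100) - 1] for p in percentiles},
    }
```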

Module contents

class data_juicer.analysis.ColumnWiseAnalysis(dataset, output_path, overall_result=None, save_stats_in_one_file=True)[source]

Bases: object

Apply analysis to each stats column separately.

__init__(dataset, output_path, overall_result=None, save_stats_in_one_file=True)[source]

Initialization method

Parameters:
  • dataset – the dataset to be analyzed

  • output_path – path to store the analysis results

  • overall_result – optional precomputed overall stats result

  • save_stats_in_one_file – whether to save all analysis figures of all stats into one image file

analyze(show_percentiles=False, show=False, skip_export=False)[source]

Apply analysis and draw the analysis figure for stats.

Parameters:
  • show_percentiles – whether to show the percentile line in each sub-figure. If it’s true, there will be several red lines to indicate the quantiles of the stats distributions

  • show – whether to show in a single window after drawing

  • skip_export – whether to skip saving the results to disk

Returns:

draw_hist(ax, data, save_path, percentiles=None, show=False)[source]

Draw the histogram for the data.

Parameters:
  • ax – the axes to draw

  • data – data to draw

  • save_path – the path to save the histogram figure

  • percentiles – the overall analysis result of the data including percentile information

  • show – whether to show in a single window after drawing

Returns:

draw_box(ax, data, save_path, percentiles=None, show=False)[source]

Draw the box plot for the data.

Parameters:
  • ax – the axes to draw

  • data – data to draw

  • save_path – the path to save the box figure

  • percentiles – the overall analysis result of the data including percentile information

  • show – whether to show in a single window after drawing

Returns:

draw_wordcloud(ax, data, save_path, show=False)[source]
class data_juicer.analysis.CorrelationAnalysis(dataset, output_path)[source]

Bases: object

Analyze the correlations among different stats. Only for numerical stats.

__init__(dataset, output_path)[source]

Initialization method.

Parameters:
  • dataset – the dataset to be analyzed

  • output_path – path to store the analysis results

analyze(method='pearson', show=False, skip_export=False)[source]
class data_juicer.analysis.DiversityAnalysis(dataset, output_path, lang_or_model='en')[source]

Bases: object

Apply diversity analysis for each sample and get an overall analysis result.

__init__(dataset, output_path, lang_or_model='en')[source]

Initialization method.

Parameters:
  • dataset – the dataset to be analyzed

  • output_path – path to store the analysis results

  • lang_or_model – the diversity model or a specific language used to load the diversity model

compute(lang_or_model=None, column_name='text')[source]

Apply lexical tree analysis on each sample.

Parameters:
  • lang_or_model – the diversity model or a specific language used to load the diversity model

  • column_name – the name of column to be analyzed

Returns:

the analysis result.

analyze(lang_or_model=None, column_name='text', postproc_func=<function get_diversity>, **postproc_kwarg)[source]

Apply diversity analysis on the whole dataset.

Parameters:
  • lang_or_model – the diversity model or a specific language used to load the diversity model

  • column_name – the name of column to be analyzed

  • postproc_func – function to analyze diversity. By default, it’s the function get_diversity

  • postproc_kwarg – arguments of the postproc_func

Returns:

class data_juicer.analysis.OverallAnalysis(dataset, output_path)[source]

Bases: object

Apply analysis on the overall stats, including mean, std, quantiles, etc.

__init__(dataset, output_path)[source]

Initialization method.

Parameters:
  • dataset – the dataset to be analyzed

  • output_path – path to store the analysis results.

refine_single_column(col)[source]
analyze(percentiles=[], num_proc=1, skip_export=False)[source]

Apply overall analysis on the whole dataset based on the describe method of pandas.

Parameters:
  • percentiles – percentiles to analyze

  • num_proc – number of processes to analyze the dataset

  • skip_export – whether to skip exporting the results to disk

Returns:

the overall analysis result.