data_juicer.analysis package¶
Submodules¶
data_juicer.analysis.collector module¶
- class data_juicer.analysis.collector.TextTokenDistCollector(tokenizer)[source]¶
Bases:
object
Tokenize the given dataset and collect the token distribution with a specified tokenizer.
- __init__(tokenizer)[source]¶
Initialization method.
- Parameters:
tokenizer – tokenizer name on HuggingFace
- collect(data_path, text_key, num_proc=1) Categorical [source]¶
Tokenize the input dataset and collect its token distribution.
- Parameters:
data_path – path to the input dataset
text_key – field key(s) considered in the token counts
num_proc – number of processes used to count tokens
- Returns:
the token distribution.
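For illustration, a minimal usage sketch; the tokenizer name and dataset path below are assumptions, not values taken from this documentation:
```python
from data_juicer.analysis.collector import TextTokenDistCollector

# Hypothetical HuggingFace tokenizer name and local dataset path.
collector = TextTokenDistCollector(tokenizer='bert-base-uncased')
token_dist = collector.collect(
    data_path='my-dataset.jsonl',  # path to the input dataset (illustrative)
    text_key='text',               # field whose content is tokenized
    num_proc=4,                    # processes used to count tokens
)
# `token_dist` holds the collected token distribution (a Categorical, per the signature).
```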
data_juicer.analysis.column_wise_analysis module¶
- data_juicer.analysis.column_wise_analysis.get_row_col(total_num, factor=2)[source]¶
Given the total number of stats figures, compute the “best” number of rows and columns. This function is needed when all stats figures are stored in one image.
- Parameters:
total_num – Total number of stats figures
factor – number of sub-figure types in each figure. By default it's 2, which means there is a histogram and a box plot for each stat figure
- Returns:
“best” number of rows and columns, and the grid list
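A small sketch of calling the helper; the three-value unpacking below assumes the documented order of rows, columns, and grid list:
```python
from data_juicer.analysis.column_wise_analysis import get_row_col

# Layout for 7 stats figures, each with 2 sub-figure types (histogram + box plot).
rows, cols, grids = get_row_col(total_num=7, factor=2)
```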
- class data_juicer.analysis.column_wise_analysis.ColumnWiseAnalysis(dataset, output_path, overall_result=None, save_stats_in_one_file=True)[source]¶
Bases:
object
Apply analysis on each column of stats respectively.
- __init__(dataset, output_path, overall_result=None, save_stats_in_one_file=True)[source]¶
Initialization method
- Parameters:
dataset – the dataset to be analyzed
output_path – path to store the analysis results
overall_result – optional precomputed overall stats result
save_stats_in_one_file – whether to save all analysis figures of all stats into one image file
- analyze(show_percentiles=False, show=False, skip_export=False)[source]¶
Apply analysis and draw the analysis figure for stats.
- Parameters:
show_percentiles – whether to show percentile lines in each sub-figure. If true, several red lines will indicate the quantiles of the stats distributions
show – whether to show in a single window after drawing
skip_export – whether to skip exporting the results to disk
- Returns:
- draw_hist(ax, data, save_path, percentiles=None, show=False)[source]¶
Draw the histogram for the data.
- Parameters:
ax – the axes to draw
data – data to draw
save_path – the path to save the histogram figure
percentiles – the overall analysis result of the data including percentile information
show – whether to show in a single window after drawing
- Returns:
- draw_box(ax, data, save_path, percentiles=None, show=False)[source]¶
Draw the box plot for the data.
- Parameters:
ax – the axes to draw
data – data to draw
save_path – the path to save the box figure
percentiles – the overall analysis result of the data including percentile information
show – whether to show in a single window after drawing
- Returns:
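A hedged end-to-end sketch of this class; `analyzed_dataset` is assumed to already carry per-sample stats (e.g. produced by a prior Data-Juicer analysis run) and is not defined here, and the output path is illustrative:
```python
from data_juicer.analysis.column_wise_analysis import ColumnWiseAnalysis

column_analysis = ColumnWiseAnalysis(
    dataset=analyzed_dataset,         # assumed to contain a stats field per sample
    output_path='./analysis_results',
    save_stats_in_one_file=True,      # pack all stats figures into one image file
)
# Draw histograms and box plots for each stats column, with red quantile lines.
column_analysis.analyze(show_percentiles=True)
```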
data_juicer.analysis.diversity_analysis module¶
- data_juicer.analysis.diversity_analysis.find_root_verb_and_its_dobj(tree_root)[source]¶
Find the verb and its object closest to the root.
- Parameters:
tree_root – the root of the lexical tree
- Returns:
valid verb and its object.
- data_juicer.analysis.diversity_analysis.find_root_verb_and_its_dobj_in_string(nlp, s, first_sent=True)[source]¶
Find the verb and its object closest to the root of the lexical tree of the input string.
- Parameters:
nlp – the diversity model used to analyze the input string
s – the string to be analyzed
first_sent – whether to analyze only the first sentence of the input string. If true, return the analysis result of the first sentence whether it is valid or not. If false, return the first valid result over all sentences
- Returns:
valid verb and its object of this string
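An illustrative call, under the assumption that the diversity model is a loaded spaCy pipeline; the model name and example sentence are made up, and the two-value unpacking reflects the documented "verb and its object" return:
```python
import spacy

from data_juicer.analysis.diversity_analysis import \
    find_root_verb_and_its_dobj_in_string

nlp = spacy.load('en_core_web_sm')  # assumed spaCy pipeline acting as the diversity model
verb, noun = find_root_verb_and_its_dobj_in_string(
    nlp, 'Write a short poem about the sea.', first_sent=True)
# Expected to yield something like ('write', 'poem') when the parse is valid.
```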
- data_juicer.analysis.diversity_analysis.get_diversity(dataset, top_k_verbs=20, top_k_nouns=4, **kwargs)[source]¶
Given the lexical tree analysis result, return the diversity results.
- Parameters:
dataset – lexical tree analysis result
top_k_verbs – only keep the top_k_verbs largest verb groups
top_k_nouns – only keep the top_k_nouns largest noun groups for each verb group
kwargs – extra args
- Returns:
the diversity results
- class data_juicer.analysis.diversity_analysis.DiversityAnalysis(dataset, output_path, lang_or_model='en')[source]¶
Bases:
object
Apply diversity analysis for each sample and get an overall analysis result.
- __init__(dataset, output_path, lang_or_model='en')[source]¶
Initialization method.
- Parameters:
dataset – the dataset to be analyzed
output_path – path to store the analysis results
lang_or_model – the diversity model or a specific language used to load the diversity model
- compute(lang_or_model=None, column_name='text')[source]¶
Apply lexical tree analysis on each sample.
- Parameters:
lang_or_model – the diversity model or a specific language used to load the diversity model
column_name – the name of column to be analyzed
- Returns:
the analysis result.
- analyze(lang_or_model=None, column_name='text', postproc_func=<function get_diversity>, **postproc_kwarg)[source]¶
Apply diversity analysis on the whole dataset.
- Parameters:
lang_or_model – the diversity model or a specific language used to load the diversity model
column_name – the name of column to be analyzed
postproc_func – function used to analyze diversity. By default, it's the function get_diversity
postproc_kwarg – arguments of the postproc_func
- Returns:
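A minimal sketch of the full diversity analysis; the dataset variable, output path, and top-k values are illustrative, and the extra keyword arguments are forwarded to get_diversity:
```python
from data_juicer.analysis.diversity_analysis import DiversityAnalysis

diversity_analysis = DiversityAnalysis(
    dataset=dataset,                  # assumed to contain a 'text' column; not defined here
    output_path='./analysis_results',
    lang_or_model='en',               # language tag used to load the diversity model
)
# get_diversity is the default postproc_func; top_k_* kwargs are passed through to it.
diversity_result = diversity_analysis.analyze(top_k_verbs=20, top_k_nouns=4)
```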
data_juicer.analysis.draw module¶
- data_juicer.analysis.draw.draw_heatmap(data, xlabels, ylables=None, figsize=None, triangle=False)[source]¶
Draw a heatmap of the input data with the specified labels.
- Parameters:
data – input data; list, tuple, numpy array, and torch tensor are supported.
xlabels – x-axis labels.
ylabels – y-axis labels; if None, xlabels are used.
figsize – figure size.
triangle – whether to display only a triangle of the heatmap.
- Returns:
a plot figure.
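A sketch of drawing a heatmap from a NumPy array; the matrix and labels are made-up stand-ins for, e.g., pairwise correlations of stats:
```python
import numpy as np

from data_juicer.analysis.draw import draw_heatmap

data = np.array([[1.0, 0.3, 0.7],
                 [0.3, 1.0, 0.2],
                 [0.7, 0.2, 1.0]])
labels = ['alnum_ratio', 'avg_line_length', 'char_rep_ratio']  # illustrative stat names

# y-axis labels default to xlabels; triangle=True displays only a triangle of the matrix.
fig = draw_heatmap(data, xlabels=labels, figsize=(6, 5), triangle=True)
```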
data_juicer.analysis.measure module¶
- class data_juicer.analysis.measure.Measure[source]¶
Bases:
object
Base class for distribution measures.
- name = 'base'¶
- class data_juicer.analysis.measure.KLDivMeasure[source]¶
Bases:
Measure
Measure Kullback-Leibler divergence.
- name = 'kl_divergence'¶
- class data_juicer.analysis.measure.JSDivMeasure[source]¶
Bases:
Measure
Measure Jensen-Shannon divergence.
- name = 'js_divergence'¶
- class data_juicer.analysis.measure.CrossEntropyMeasure[source]¶
Bases:
Measure
Measure Cross-Entropy.
- name = 'cross_entropy'¶
- class data_juicer.analysis.measure.EntropyMeasure[source]¶
Bases:
Measure
Measure Entropy.
- name = 'entropy'¶
- class data_juicer.analysis.measure.RelatedTTestMeasure[source]¶
Bases:
Measure
Apply a t-test to two related distributions based on their histograms over the same bins.
Ref: https://en.wikipedia.org/wiki/Student%27s_t-test
For continuous features or distributions, the input can be a list of dataset stats. For discrete features or distributions, the input can be a list of tags or categories.
- name = 't-test'¶
- measure(p, q)[source]¶
- Parameters:
p – the first feature or distribution. (stats/tags/categories)
q – the second feature or distribution. (stats/tags/categories)
- Returns:
the t-test result object ([scipy.stats.TtestResult](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats._result_classes.TtestResult.html#scipy.stats._result_classes.TtestResult))
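A short sketch of the documented t-test measure; the two stats lists are fabricated, and the result attributes follow scipy's TtestResult:
```python
from data_juicer.analysis.measure import RelatedTTestMeasure

# Hypothetical values of the same stat computed on two related dataset versions.
p = [0.91, 0.85, 0.77, 0.95, 0.88]
q = [0.90, 0.80, 0.79, 0.93, 0.86]

result = RelatedTTestMeasure().measure(p, q)
print(result.statistic, result.pvalue)  # scipy TtestResult fields
```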
data_juicer.analysis.overall_analysis module¶
- class data_juicer.analysis.overall_analysis.OverallAnalysis(dataset, output_path)[source]¶
Bases:
object
Apply analysis on the overall stats, including mean, std, quantiles, etc.
- __init__(dataset, output_path)[source]¶
Initialization method.
- Parameters:
dataset – the dataset to be analyzed
output_path – path to store the analysis results.
- analyze(percentiles=[], num_proc=1, skip_export=False)[source]¶
Apply overall analysis on the whole dataset based on the describe method of pandas.
- Parameters:
percentiles – percentiles to analyze
num_proc – number of processes to analyze the dataset
skip_export – whether to skip exporting the results to disk
- Returns:
the overall analysis result.
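A brief sketch of the overall analysis; the dataset variable (assumed to carry per-sample stats and not defined here), output path, and percentiles are illustrative:
```python
from data_juicer.analysis.overall_analysis import OverallAnalysis

overall = OverallAnalysis(
    dataset=analyzed_dataset,         # assumed to carry per-sample stats
    output_path='./analysis_results',
)
# Wraps pandas' describe; extra percentiles can be requested explicitly.
overall_result = overall.analyze(percentiles=[0.25, 0.5, 0.75], num_proc=4)
```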
Module contents¶
- class data_juicer.analysis.ColumnWiseAnalysis(dataset, output_path, overall_result=None, save_stats_in_one_file=True)[source]¶
Bases:
object
Apply analysis on each column of stats respectively.
- __init__(dataset, output_path, overall_result=None, save_stats_in_one_file=True)[source]¶
Initialization method
- Parameters:
dataset – the dataset to be analyzed
output_path – path to store the analysis results
overall_result – optional precomputed overall stats result
save_stats_in_one_file – whether to save all analysis figures of all stats into one image file
- analyze(show_percentiles=False, show=False, skip_export=False)[source]¶
Apply analysis and draw the analysis figure for stats.
- Parameters:
show_percentiles – whether to show percentile lines in each sub-figure. If true, several red lines will indicate the quantiles of the stats distributions
show – whether to show in a single window after drawing
skip_export – whether to skip exporting the results to disk
- Returns:
- draw_hist(ax, data, save_path, percentiles=None, show=False)[source]¶
Draw the histogram for the data.
- Parameters:
ax – the axes to draw
data – data to draw
save_path – the path to save the histogram figure
percentiles – the overall analysis result of the data including percentile information
show – whether to show in a single window after drawing
- Returns:
- draw_box(ax, data, save_path, percentiles=None, show=False)[source]¶
Draw the box plot for the data.
- Parameters:
ax – the axes to draw
data – data to draw
save_path – the path to save the box figure
percentiles – the overall analysis result of the data including percentile information
show – whether to show in a single window after drawing
- Returns:
- class data_juicer.analysis.DiversityAnalysis(dataset, output_path, lang_or_model='en')[source]¶
Bases:
object
Apply diversity analysis for each sample and get an overall analysis result.
- __init__(dataset, output_path, lang_or_model='en')[source]¶
Initialization method.
- Parameters:
dataset – the dataset to be analyzed
output_path – path to store the analysis results
lang_or_model – the diversity model or a specific language used to load the diversity model
- compute(lang_or_model=None, column_name='text')[source]¶
Apply lexical tree analysis on each sample.
- Parameters:
lang_or_model – the diversity model or a specific language used to load the diversity model
column_name – the name of column to be analyzed
- Returns:
the analysis result.
- analyze(lang_or_model=None, column_name='text', postproc_func=<function get_diversity>, **postproc_kwarg)[source]¶
Apply diversity analysis on the whole dataset.
- Parameters:
lang_or_model – the diversity model or a specific language used to load the diversity model
column_name – the name of column to be analyzed
postproc_func – function used to analyze diversity. By default, it's the function get_diversity
postproc_kwarg – arguments of the postproc_func
- Returns:
- class data_juicer.analysis.OverallAnalysis(dataset, output_path)[source]¶
Bases:
object
Apply analysis on the overall stats, including mean, std, quantiles, etc.
- __init__(dataset, output_path)[source]¶
Initialization method.
- Parameters:
dataset – the dataset to be analyzed
output_path – path to store the analysis results.
- analyze(percentiles=[], num_proc=1, skip_export=False)[source]¶
Apply overall analysis on the whole dataset based on the describe method of pandas.
- Parameters:
percentiles – percentiles to analyze
num_proc – number of processes to analyze the dataset
skip_export – whether to skip exporting the results to disk
- Returns:
the overall analysis result.