data_juicer.analysis.diversity_analysis module

data_juicer.analysis.diversity_analysis.find_root_verb_and_its_dobj(tree_root)

Find the verb closest to the root of the tree, together with its direct object.

Parameters:

tree_root -- the root of the lexical tree

Returns:

the valid verb and its direct object.
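
The search can be pictured as a walk down the parsed tree. What follows is a minimal sketch under stated assumptions, not the library's implementation: `Node` is a hypothetical stand-in for a parsed token, its attribute names (`pos_`, `dep_`, `lemma_`, `children`) are borrowed from spaCy's conventions, and the breadth-first order and the `"VERB"` / `"dobj"` labels are assumptions as well.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a parsed token (not a data_juicer type)."""
    lemma_: str
    pos_: str   # part-of-speech tag, e.g. "VERB" or "NOUN"
    dep_: str   # dependency label, e.g. "ROOT" or "dobj"
    children: List["Node"] = field(default_factory=list)


def sketch_find_root_verb_and_dobj(
        tree_root: Node) -> Tuple[Optional[str], Optional[str]]:
    """Return (verb, object) for the verb nearest the root, else (None, None)."""
    queue = deque([tree_root])
    while queue:
        node = queue.popleft()
        if node.pos_ == "VERB":
            # Assumption: the object is a direct child labelled "dobj".
            for child in node.children:
                if child.dep_ == "dobj":
                    return node.lemma_, child.lemma_
            # A verb without a direct object yields a partial result.
            return node.lemma_, None
        # Not a verb: look one level further from the root.
        queue.extend(node.children)
    return None, None


# "Write a poem": "write" is the root verb and "poem" its direct object.
poem = Node("poem", "NOUN", "dobj", [Node("a", "DET", "det")])
root = Node("write", "VERB", "ROOT", [poem])
result = sketch_find_root_verb_and_dobj(root)
```

In this sketch a tree with no verb at all yields `(None, None)`, which a caller could treat as an invalid result.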

data_juicer.analysis.diversity_analysis.find_root_verb_and_its_dobj_in_string(nlp, s, first_sent=True)

Find the verb closest to the root of the lexical tree of the input string, together with its direct object.

Parameters:
  • nlp -- the diversity model used to analyze the string

  • s -- the string to be analyzed

  • first_sent -- whether to analyze only the first sentence of the input string. If True, return the analysis result of the first sentence whether or not it is valid. If False, return the first valid result across all sentences

Returns:

the valid verb and its direct object for this string
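
The effect of `first_sent` can be illustrated without a language model. This is a minimal sketch: `sketch_pick_result` and `toy_analyzer` are hypothetical names, the sentences are assumed to be already split, and "valid" is assumed to mean that both the verb and the object were found.

```python
from typing import Callable, Iterable, Optional, Tuple

Result = Tuple[Optional[str], Optional[str]]


def sketch_pick_result(sentences: Iterable[str],
                       analyze_sentence: Callable[[str], Result],
                       first_sent: bool = True) -> Result:
    """Mimic the documented first_sent behaviour over pre-split sentences."""
    for sentence in sentences:
        verb, noun = analyze_sentence(sentence)
        if first_sent:
            # Only the first sentence counts, valid or not.
            return verb, noun
        if verb is not None and noun is not None:
            # Otherwise keep scanning until a valid pair appears.
            return verb, noun
    return None, None


def toy_analyzer(sentence: str) -> Result:
    """Hypothetical analyzer: 'write' followed by a final word is valid."""
    words = sentence.lower().split()
    if "write" in words and words[-1] != "write":
        return "write", words[-1]
    return None, None


sentences = ["Hello there", "Please write poem"]
only_first = sketch_pick_result(sentences, toy_analyzer, first_sent=True)
first_valid = sketch_pick_result(sentences, toy_analyzer, first_sent=False)
```

Here `only_first` is the (invalid) result for "Hello there", while `first_valid` skips it and comes from the second sentence.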

data_juicer.analysis.diversity_analysis.get_diversity(dataset, top_k_verbs=20, top_k_nouns=4, **kwargs)

Given the lexical tree analysis result, return the diversity results.

Parameters:
  • dataset -- the lexical tree analysis result

  • top_k_verbs -- keep only the top_k_verbs largest verb groups

  • top_k_nouns -- keep only the top_k_nouns largest noun groups within each verb group

  • kwargs -- extra arguments

Returns:

the diversity results
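
The two `top_k` parameters describe a nested selection: first the largest verb groups, then the largest noun groups inside each of them. The sketch below shows that selection with the standard library only. It assumes the analysis result can be reduced to a list of `(verb, noun)` pairs and that group size means occurrence count; `sketch_get_diversity` and its return shape are hypothetical, not the library's.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def sketch_get_diversity(pairs: Iterable[Tuple[str, str]],
                         top_k_verbs: int = 20,
                         top_k_nouns: int = 4) -> List[Tuple[str, str, int]]:
    """Return (verb, noun, count) rows for the largest verb and noun groups."""
    pairs = list(pairs)
    # Size of each verb group = number of pairs that share the verb.
    verb_counts = Counter(verb for verb, _ in pairs)
    rows = []
    for verb, _ in verb_counts.most_common(top_k_verbs):
        # Within one verb group, count how often each noun occurs.
        noun_counts = Counter(noun for v, noun in pairs if v == verb)
        for noun, count in noun_counts.most_common(top_k_nouns):
            rows.append((verb, noun, count))
    return rows


pairs = [("write", "poem"), ("write", "poem"), ("write", "essay"),
         ("give", "example"), ("fix", "bug"), ("give", "example")]
rows = sketch_get_diversity(pairs, top_k_verbs=2, top_k_nouns=1)
```

With `top_k_verbs=2` the single-occurrence verb "fix" is dropped, and with `top_k_nouns=1` only the most frequent noun of each remaining verb is kept.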

class data_juicer.analysis.diversity_analysis.DiversityAnalysis(dataset, output_path, lang_or_model='en')

Bases: object

Apply diversity analysis to each sample and produce an overall analysis result.

__init__(dataset, output_path, lang_or_model='en')

Initialization method.

Parameters:
  • dataset -- the dataset to be analyzed

  • output_path -- path where the analysis results are stored

  • lang_or_model -- the diversity model, or a specific language used to load the diversity model

compute(lang_or_model=None, column_name='text')

Apply lexical tree analysis to each sample.

Parameters:
  • lang_or_model -- the diversity model, or a specific language used to load the diversity model

  • column_name -- the name of the column to be analyzed

Returns:

the analysis result.

analyze(lang_or_model=None, column_name='text', postproc_func=<function get_diversity>, **postproc_kwarg)

Apply diversity analysis to the whole dataset.

Parameters:
  • lang_or_model -- the diversity model, or a specific language used to load the diversity model

  • column_name -- the name of the column to be analyzed

  • postproc_func -- the function used to analyze diversity. Defaults to get_diversity

  • postproc_kwarg -- keyword arguments passed to postproc_func

Returns: