data_juicer.ops.common package

Submodules

data_juicer.ops.common.helper_func module

class data_juicer.ops.common.helper_func.UnionFind[source]

Bases: object

__init__()[source]

Initialization method.

find(x)[source]
union(x, y)[source]
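The actual data_juicer implementation may differ in details (e.g., rank or size heuristics); the following is a minimal sketch of a disjoint-set structure with the same __init__ / find / union interface:

```python
class UnionFind:
    """Minimal disjoint-set sketch mirroring the documented interface."""

    def __init__(self):
        # Lazily map each element to its parent; roots map to themselves.
        self.parent = {}

    def find(self, x):
        # Walk up to the root, compressing the path as we go.
        if x not in self.parent:
            self.parent[x] = x
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        # Merge the two sets by attaching one root under the other.
        self.parent[self.find(x)] = self.find(y)
```

Two elements are in the same set exactly when their roots compare equal, which is how such a structure is typically used to cluster near-duplicate documents.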
data_juicer.ops.common.helper_func.strip(document, strip_characters)[source]

Way faster than document.strip(strip_characters) since strip_characters is a set instead of a str: membership checks stay cheap even though the set contains a lot of elements (all the emojis).

Parameters:
  • document – document to be processed

  • strip_characters – characters used for stripping document

Returns:

stripped document
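A hypothetical sketch of the documented behavior, scanning inward from both ends and testing membership in the set (the real implementation may differ):

```python
def strip(document, strip_characters):
    # strip_characters is a set, so `in` checks are O(1) even when the
    # set holds thousands of characters (e.g., all emojis).
    start, end = 0, len(document)
    while start < end and document[start] in strip_characters:
        start += 1
    while end > start and document[end - 1] in strip_characters:
        end -= 1
    return document[start:end]
```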

data_juicer.ops.common.helper_func.split_on_whitespace(document, new_line=False, tab=False)[source]

This method also removes concatenated spaces.

Parameters:
  • document – document to be split

  • new_line – whether to split document with ‘\n’

  • tab – whether to split document with ‘\t’

Returns:

word list obtained after splitting document
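One plausible way to implement this (a sketch, not necessarily the library's exact code) is to split on a character class quantified with +, which collapses runs of separators and thereby removes concatenated spaces:

```python
import re

def split_on_whitespace(document, new_line=False, tab=False):
    # Build the separator set; the '+'-quantified class merges runs of
    # separators, so empty strings from concatenated spaces never appear.
    seps = " "
    if new_line:
        seps += "\n"
    if tab:
        seps += "\t"
    tokens = re.split("[" + re.escape(seps) + "]+", document)
    return [t for t in tokens if t]
```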

data_juicer.ops.common.helper_func.split_on_newline_tab_whitespace(document)[source]

This method is used to split the document into different levels of sub-sentences.

First split on ‘\n’, then on ‘\t’, then on ‘ ’.

Parameters:

document – document to be split

Returns:

sentence list obtained after splitting document
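The three split levels produce a nested list of shape list[list[list[str]]] (lines, then tab-separated chunks, then words). A minimal sketch, assuming empty words from repeated spaces are dropped as in split_on_whitespace:

```python
def split_on_newline_tab_whitespace(document):
    # Level 1: lines, level 2: tab-separated chunks, level 3: words.
    return [
        [[w for w in chunk.split(" ") if w] for chunk in line.split("\t")]
        for line in document.split("\n")
    ]
```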

data_juicer.ops.common.helper_func.merge_on_whitespace_tab_newline(sentences)[source]

This method is used to merge different levels of sub-sentences into one document. It is the inverse of split_on_newline_tab_whitespace and removes concatenated separators.

Parameters:

sentences – sentence list to be merged

Returns:

document obtained after merging sub-sentences
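A sketch of the inverse operation: join words with spaces, chunks with tabs, and lines with newlines, dropping empty pieces at each level so no duplicated separators are emitted (the real implementation may handle edge cases differently):

```python
def merge_on_whitespace_tab_newline(sentences):
    # sentences has shape list[list[list[str]]]; rebuild the document by
    # joining each level with its separator, skipping empty fragments.
    lines = []
    for line in sentences:
        chunks = [" ".join(w for w in chunk if w) for chunk in line]
        lines.append("\t".join(c for c in chunks if c))
    return "\n".join(l for l in lines if l)
```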

data_juicer.ops.common.helper_func.words_augmentation(words, group_size, join_char)[source]

Augment words, especially for Chinese (without a space between words) and Vietnamese (with a space between syllables).

Parameters:
  • words – word list to be augmented

  • group_size – the size of word groups that need to be merged

  • join_char – characters to be added between word group

Returns:

word list after augmentation
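For languages where a "word" may span several tokens (Chinese characters, Vietnamese syllables), one common approach is to slide a window of group_size over the list and join each group. A sketch of that idea, assuming a simple sliding window:

```python
def words_augmentation(words, group_size, join_char):
    # Join each run of `group_size` consecutive words with `join_char`,
    # producing candidate multi-token words.
    return [
        join_char.join(words[i:i + group_size])
        for i in range(len(words) - group_size + 1)
    ]
```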

data_juicer.ops.common.helper_func.get_words_from_document(document, token_func=None, new_line=True, tab=True)[source]

Get words from a document. Useful to compute ratios, like the stopwords ratio.

Parameters:
  • document – document to be split into words.

  • token_func – tokenizer function; if specified, it will be used to split the document into tokens.

  • new_line – whether to use ‘\n’ to split words.

  • tab – whether to use ‘\t’ to split words.

Returns:

word list obtained from document
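A hedged sketch of the documented dispatch: prefer the external tokenizer when one is supplied, otherwise fall back to whitespace splitting (optionally also on ‘\n’ and ‘\t’):

```python
import re

def get_words_from_document(document, token_func=None, new_line=True, tab=True):
    # An external tokenizer (e.g., a SentencePiece model's encode) takes
    # precedence; otherwise split on the selected whitespace characters.
    if token_func is not None:
        return token_func(document)
    seps = " " + ("\n" if new_line else "") + ("\t" if tab else "")
    return [w for w in re.split("[" + re.escape(seps) + "]+", document) if w]
```

With the resulting word list, ratios such as the stopwords ratio reduce to counting matches against a reference set.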

data_juicer.ops.common.helper_func.words_refinement(words, lower_case=False, strip_chars=None, use_words_aug=False, words_aug_group_sizes=[2], words_aug_join_char='')[source]

Refine split words. Non-reversible, since the document is split on multiple characters, words are stripped of special characters, and characters are converted to lowercase.

Parameters:
  • words – the word list to be refined

  • lower_case – whether to convert word to lowercase

  • strip_chars – chars that need to be stripped in words

  • use_words_aug – whether to use word augmentation

  • words_aug_group_sizes – the size of word groups that need to be merged

  • words_aug_join_char – characters to be added between word group

Returns:

refined words or word list
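The pipeline suggested by the parameters is: optional lowercasing, optional stripping (dropping words that become empty), then optional n-gram augmentation for each requested group size. A sketch under those assumptions; the real ordering and edge-case handling may differ:

```python
def words_refinement(words, lower_case=False, strip_chars=None,
                     use_words_aug=False, words_aug_group_sizes=(2,),
                     words_aug_join_char=""):
    # Lowercase, then strip unwanted characters from both ends of each word.
    if lower_case:
        words = [w.lower() for w in words]
    if strip_chars:
        words = [w.strip(strip_chars) for w in words]
        words = [w for w in words if w]
    # Optionally replace the list with joined n-grams of each group size.
    if use_words_aug:
        augmented = []
        for size in words_aug_group_sizes:
            augmented += [
                words_aug_join_char.join(words[i:i + size])
                for i in range(len(words) - size + 1)
            ]
        words = augmented or words
    return words
```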

data_juicer.ops.common.helper_func.get_sentences_from_document(document, model_func=None)[source]

Get sentences from a document.

Parameters:
  • document – document to be split into sentences

  • model_func – sentence model function; if specified, it will be used to split the document into sentences.

Returns:

document with the sentences separated by ‘\n’
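A sketch of the documented behavior, with a naive punctuation-based fallback standing in for whatever segmentation the library uses when no model is given (the fallback regex is an assumption, not the library's rule):

```python
import re

def get_sentences_from_document(document, model_func=None):
    # A real sentence model (e.g., an NLTK punkt tokenizer's tokenize
    # method) takes precedence; otherwise split after ., ! or ?.
    if model_func is not None:
        sentences = model_func(document)
    else:
        sentences = re.split(r"(?<=[.!?])\s+", document)
    return "\n".join(s for s in sentences if s)
```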

data_juicer.ops.common.helper_func.split_text_by_punctuation(text)[source]

Split text by any Chinese (zh) or English (en) punctuation

Parameters:

text – text to be split.

Returns:

sub-texts split by any Chinese or English punctuation
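A sketch of the idea with a small illustrative punctuation set; the character set used by the real implementation is likely wider:

```python
import re

def split_text_by_punctuation(text):
    # Split on common English and Chinese (full-width) punctuation marks
    # and drop empty or whitespace-only fragments.
    puncts = r"[,.!?;:\"'，。！？；：、“”‘’]"
    return [t.strip() for t in re.split(puncts, text) if t.strip()]
```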

data_juicer.ops.common.special_characters module

Module contents

data_juicer.ops.common.get_sentences_from_document(document, model_func=None)[source]

Get sentences from a document.

Parameters:
  • document – document to be split into sentences

  • model_func – sentence model function; if specified, it will be used to split the document into sentences.

Returns:

document with the sentences separated by ‘\n’

data_juicer.ops.common.get_words_from_document(document, token_func=None, new_line=True, tab=True)[source]

Get words from a document. Useful to compute ratios, like the stopwords ratio.

Parameters:
  • document – document to be split into words.

  • token_func – tokenizer function; if specified, it will be used to split the document into tokens.

  • new_line – whether to use ‘\n’ to split words.

  • tab – whether to use ‘\t’ to split words.

Returns:

word list obtained from document

data_juicer.ops.common.merge_on_whitespace_tab_newline(sentences)[source]

This method is used to merge different levels of sub-sentences into one document. It is the inverse of split_on_newline_tab_whitespace and removes concatenated separators.

Parameters:

sentences – sentence list to be merged

Returns:

document obtained after merging sub-sentences

data_juicer.ops.common.split_on_newline_tab_whitespace(document)[source]

This method is used to split the document into different levels of sub-sentences.

First split on ‘\n’, then on ‘\t’, then on ‘ ’.

Parameters:

document – document to be split

Returns:

sentence list obtained after splitting document

data_juicer.ops.common.split_on_whitespace(document, new_line=False, tab=False)[source]

This method also removes concatenated spaces.

Parameters:
  • document – document to be split

  • new_line – whether to split document with ‘\n’

  • tab – whether to split document with ‘\t’

Returns:

word list obtained after splitting document

data_juicer.ops.common.strip(document, strip_characters)[source]

Way faster than document.strip(strip_characters) since strip_characters is a set instead of a str: membership checks stay cheap even though the set contains a lot of elements (all the emojis).

Parameters:
  • document – document to be processed

  • strip_characters – characters used for stripping document

Returns:

stripped document

data_juicer.ops.common.words_augmentation(words, group_size, join_char)[source]

Augment words, especially for Chinese (without a space between words) and Vietnamese (with a space between syllables).

Parameters:
  • words – word list to be augmented

  • group_size – the size of word groups that need to be merged

  • join_char – characters to be added between word group

Returns:

word list after augmentation

data_juicer.ops.common.words_refinement(words, lower_case=False, strip_chars=None, use_words_aug=False, words_aug_group_sizes=[2], words_aug_join_char='')[source]

Refine split words. Non-reversible, since the document is split on multiple characters, words are stripped of special characters, and characters are converted to lowercase.

Parameters:
  • words – the word list to be refined

  • lower_case – whether to convert word to lowercase

  • strip_chars – chars that need to be stripped in words

  • use_words_aug – whether to use word augmentation

  • words_aug_group_sizes – the size of word groups that need to be merged

  • words_aug_join_char – characters to be added between word group

Returns:

refined words or word list

data_juicer.ops.common.split_text_by_punctuation(text)[source]

Split text by any Chinese (zh) or English (en) punctuation

Parameters:

text – text to be split.

Returns:

sub-texts split by any Chinese or English punctuation