data_juicer.ops.common.helper_func module
- data_juicer.ops.common.helper_func.strip(document, strip_characters)
Strips characters from both ends of document. Much faster than document.strip(strip_characters) because strip_characters is a set instead of a str and contains many elements (for example, all the emojis).
- Parameters:
document -- document to be processed
strip_characters -- set of characters to strip from both ends of document
- Returns:
stripped document
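To illustrate why a set helps here, the following is a minimal sketch of set-based stripping. It is not the library's source: the name `strip_with_set` is hypothetical, and the two-pointer scan is an assumed approach.

```python
def strip_with_set(document: str, strip_characters: set) -> str:
    """Trim leading and trailing characters that belong to strip_characters.

    Hypothetical stand-in for illustration, not the data_juicer implementation.
    """
    if not document:
        return document
    start, end = 0, len(document)
    # Advance past leading characters found in the set (O(1) lookup each).
    while start < end and document[start] in strip_characters:
        start += 1
    # Step back over trailing characters found in the set.
    while end > start and document[end - 1] in strip_characters:
        end -= 1
    return document[start:end]


chars = {" ", "!", "\U0001F600"}  # space, "!", and one emoji
print(strip_with_set("\U0001F600 hello world! ", chars))  # hello world
```

Characters inside the document are left alone; only the two ends are trimmed.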
- data_juicer.ops.common.helper_func.split_on_whitespace(document, new_line=False, tab=False)
Splits document on spaces and, optionally, on newlines and tabs. Consecutive separators are collapsed.
- Parameters:
document -- document to be split
new_line -- whether to also split document on '\n'
tab -- whether to also split document on '\t'
- Returns:
word list obtained after splitting document
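The effect of the two flags can be sketched as follows. This is an illustrative re-implementation under assumptions, not the library's code: `split_on_separators` is a hypothetical name, and dropping empty pieces is how the sketch models "removes concatenated spaces".

```python
import re


def split_on_separators(document, new_line=False, tab=False):
    """Split on spaces, optionally also on newlines and tabs (sketch)."""
    separators = [" "]
    if new_line:
        separators.append("\n")
    if tab:
        separators.append("\t")
    pattern = "|".join(re.escape(sep) for sep in separators)
    # Consecutive separators yield empty strings; filter them out.
    return [piece for piece in re.split(pattern, document) if piece]


print(split_on_separators("a  b\tc\nd"))
# ['a', 'b\tc\nd']
print(split_on_separators("a  b\tc\nd", new_line=True, tab=True))
# ['a', 'b', 'c', 'd']
```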
- data_juicer.ops.common.helper_func.split_on_newline_tab_whitespace(document)
Splits the document into nested levels of sub-sentences: first on "\n", then on "\t", then on " ".
- Parameters:
document -- document to be split
- Returns:
sentence list obtained after splitting document
- data_juicer.ops.common.helper_func.merge_on_whitespace_tab_newline(sentences)
Merges the nested levels of sub-sentences back into one document. This is the inverse of split_on_newline_tab_whitespace, except that repeated separators are removed.
- Parameters:
sentences -- sentence list to be merged
- Returns:
document obtained after merging sub-sentences
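The relationship between split_on_newline_tab_whitespace and this function can be sketched as a round trip. The helpers `split_levels` and `merge_levels` below are hypothetical stand-ins, and the three-level nested list layout is an assumption made for illustration.

```python
def split_levels(document):
    """Split on newline, then tab, then space into a 3-level nested list."""
    return [
        [[word for word in sub.split(" ") if word] for sub in line.split("\t")]
        for line in document.split("\n")
    ]


def merge_levels(sentences):
    """Join the nested list back together, skipping empty levels."""
    lines = []
    for line in sentences:
        subs = [" ".join(words) for words in line if words]
        if subs:
            lines.append("\t".join(subs))
    return "\n".join(lines)


nested = split_levels("a  b\tc\n\nd")
print(nested)                # [[['a', 'b'], ['c']], [[]], [['d']]]
print(repr(merge_levels(nested)))  # 'a b\tc\nd'
```

The double space and the blank line in the input do not survive the round trip, which is why the merge is only an inverse up to repeated separators.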
- data_juicer.ops.common.helper_func.words_augmentation(words, group_size, join_char)
Augments words by merging them into groups. Especially useful for Chinese (no space between words) and Vietnamese (spaces between syllables).
- Parameters:
words -- word list to be augmented
group_size -- the number of words merged into each group
join_char -- characters inserted between the words of a group
- Returns:
augmented word list
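One plausible reading of this grouping is a sliding window over adjacent words. The sketch below assumes that reading; `ngram_augment` is a hypothetical name and the windowing strategy is not confirmed by this page.

```python
def ngram_augment(words, group_size, join_char):
    """Join every run of group_size adjacent words with join_char (sketch)."""
    return [
        join_char.join(words[i:i + group_size])
        for i in range(len(words) - group_size + 1)
    ]


# Chinese: no separator between merged words.
print(ngram_augment(["我", "爱", "北京"], 2, ""))        # ['我爱', '爱北京']
# Vietnamese: syllables are rejoined with a space.
print(ngram_augment(["xin", "chào", "bạn"], 2, " "))  # ['xin chào', 'chào bạn']
```

When group_size exceeds the number of words, the sketch returns an empty list.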
- data_juicer.ops.common.helper_func.get_words_from_document(document, token_func=None, new_line=True, tab=True)
Gets words from a document. Useful for computing ratios, such as the stopwords ratio.
- Parameters:
document -- document to be split into words
token_func -- tokenizer function; if specified, it is used to split the document into tokens
new_line -- whether to split words on '\n'
tab -- whether to split words on '\t'
- Returns:
word list obtained from document
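As a rough illustration of the stopwords-ratio use case, the sketch below pairs a simplified word splitter with a ratio computation. Both `get_words` and `stopword_ratio` are hypothetical names, and the whitespace fallback used when no tokenizer is given is an assumption.

```python
def get_words(document, token_func=None, new_line=True, tab=True):
    """Use token_func when given, otherwise split on whitespace (sketch)."""
    if token_func is not None:
        return token_func(document)
    if new_line:
        document = document.replace("\n", " ")
    if tab:
        document = document.replace("\t", " ")
    return [word for word in document.split(" ") if word]


def stopword_ratio(document, stopwords):
    """Fraction of words in document that appear in the stopwords set."""
    words = get_words(document)
    if not words:
        return 0.0
    return sum(word in stopwords for word in words) / len(words)


print(stopword_ratio("the cat\tsat on\nthe mat", {"the", "on"}))  # 0.5
```

Passing any callable as token_func, for example `list` for character-level tokens, bypasses the whitespace split entirely.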
- data_juicer.ops.common.helper_func.words_refinement(words, lower_case=False, strip_chars=None, use_words_aug=False, words_aug_group_sizes=[2], words_aug_join_char='')
Refines split words. Not reversible: the document is split on multiple characters, words are stripped of special characters, and characters are converted to lower case.
- Parameters:
words -- the word list to be refined
lower_case -- whether to convert words to lowercase
strip_chars -- characters to strip from the words
use_words_aug -- whether to use word augmentation
words_aug_group_sizes -- the sizes of the word groups to merge
words_aug_join_char -- characters inserted between the words of a group
- Returns:
refined word list
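How the options might combine can be sketched as a small pipeline. The order of the steps (lowercase, strip, then augment) and the choice to append augmented groups to the original words are assumptions; `refine_words` is a hypothetical name.

```python
def refine_words(words, lower_case=False, strip_chars=None,
                 use_words_aug=False, words_aug_group_sizes=(2,),
                 words_aug_join_char=""):
    """Lowercase, strip, and optionally augment a word list (sketch)."""
    if lower_case:
        words = [word.lower() for word in words]
    if strip_chars:
        chars = "".join(strip_chars)
        words = [word.strip(chars) for word in words]
        # Words made only of stripped characters become empty; drop them.
        words = [word for word in words if word]
    if use_words_aug:
        groups = []
        for size in words_aug_group_sizes:
            groups += [
                words_aug_join_char.join(words[i:i + size])
                for i in range(len(words) - size + 1)
            ]
        # Assumption: augmented groups are appended after the original words.
        words = words + groups
    return words


raw = ["Hello,", "World!", "..."]
print(refine_words(raw, lower_case=True, strip_chars={",", "!", "."}))
# ['hello', 'world']
print(refine_words(raw, lower_case=True, strip_chars={",", "!", "."},
                   use_words_aug=True, words_aug_join_char="_"))
# ['hello', 'world', 'hello_world']
```

The sketch uses a tuple default for the group sizes to avoid sharing a mutable default between calls.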
- data_juicer.ops.common.helper_func.get_sentences_from_document(document, model_func=None)
Gets sentences from a document.
- Parameters:
document -- document to be split into sentences
model_func -- sentence model function; if specified, it is used to split the document into sentences
- Returns:
document with the sentences separated by '\n'
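The return shape (a single string with one sentence per line) can be sketched as below. `get_sentences` and `naive_splitter` are hypothetical names, and falling back to existing line breaks when no model is supplied is an assumption.

```python
import re


def get_sentences(document, model_func=None):
    """Return document with one sentence per line (sketch)."""
    if model_func is not None:
        sentences = model_func(document)
    else:
        # Assumed fallback: treat existing line breaks as sentence boundaries.
        sentences = document.splitlines()
    return "\n".join(sentences)


def naive_splitter(text):
    """Toy sentence model: split after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


print(get_sentences("Hi there. How are you? Fine!", model_func=naive_splitter))
# Hi there.
# How are you?
# Fine!
```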