data_juicer.ops.mapper.remove_repeat_sentences_mapper module

data_juicer.ops.mapper.remove_repeat_sentences_mapper.split_sentence(text)[source]

class data_juicer.ops.mapper.remove_repeat_sentences_mapper.RemoveRepeatSentencesMapper(lowercase: bool = False, ignore_special_character: bool = True, min_repeat_sentence_length: int = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove repeated sentences in text samples.

This operator removes duplicate sentences from text samples. It splits the text into lines, then splits each line into sentences. Two sentences count as duplicates if they are identical after optional lowercasing and optional removal of special characters. A hash set tracks the sentences already seen. Sentences shorter than min_repeat_sentence_length are never deduplicated. If ignore_special_character is enabled, special characters (everything except Chinese characters, letters, and numbers) are ignored when checking for duplicates. The output text is reassembled from the unique sentences.
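The procedure described above can be sketched in a few lines of plain Python. This is an illustrative approximation, not the library's implementation: the function name dedup_sentences, the SENTENCE splitting pattern, and the exact character ranges in SPECIAL_CHARS are assumptions made for this example, and the real split_sentence may split differently.

```python
import re

# Assumption: "special" means anything other than ASCII letters, digits,
# common Chinese characters (U+4E00..U+9FA5), and whitespace.
SPECIAL_CHARS = re.compile(r"[^a-zA-Z0-9\u4e00-\u9fa5\s]")

# Hypothetical splitter: a sentence is a run of text plus its trailing
# end punctuation. The library's split_sentence may behave differently.
SENTENCE = re.compile(r"[^.!?。！？]*[.!?。！？]+|[^.!?。！？]+")


def dedup_sentences(text, lowercase=False, ignore_special_character=True,
                    min_repeat_sentence_length=2):
    """Keep the first occurrence of each sentence, line by line."""
    seen = set()          # normalized sentences seen so far in this text
    out_lines = []
    for line in text.split("\n"):
        kept = []
        for sentence in SENTENCE.findall(line):
            # Build the comparison key; the original sentence is what gets kept.
            key = sentence.strip()
            if lowercase:
                key = key.lower()
            if ignore_special_character:
                key = SPECIAL_CHARS.sub("", key)
            if len(key) < min_repeat_sentence_length:
                kept.append(sentence)      # too short: never deduplicated
            elif key not in seen:
                seen.add(key)
                kept.append(sentence)      # first occurrence: keep
        out_lines.append("".join(kept))
    return "\n".join(out_lines)


print(dedup_sentences("Hello world. Hello world. Bye."))
```

In this sketch the normalization applies only to the comparison key, so the surviving sentences keep their original casing and punctuation.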

__init__(lowercase: bool = False, ignore_special_character: bool = True, min_repeat_sentence_length: int = 2, *args, **kwargs)[source]

Initialization method.

Parameters:

lowercase -- Whether to convert the sample text to lower case when checking for duplicates.
ignore_special_character -- Whether to ignore special characters when judging repeated sentences. Special characters are all characters except Chinese characters, letters, and numbers.
min_repeat_sentence_length -- Sentences shorter than this length will not be deduplicated. If ignore_special_character is set to True, special characters do not count toward this length.
args -- extra positional arguments
kwargs -- extra keyword arguments
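The interaction between ignore_special_character and min_repeat_sentence_length is easiest to see with a small calculation. The helper below is hypothetical (effective_length is not part of the library) and assumes that special characters are everything outside ASCII letters, digits, the range U+4E00..U+9FA5, and whitespace.

```python
import re

# Assumption: the set of characters that are NOT treated as special.
NOT_SPECIAL = r"a-zA-Z0-9\u4e00-\u9fa5\s"
SPECIAL = re.compile("[^" + NOT_SPECIAL + "]")


def effective_length(sentence, ignore_special_character=True):
    """Length compared against min_repeat_sentence_length (illustrative)."""
    stripped = sentence.strip()
    if ignore_special_character:
        stripped = SPECIAL.sub("", stripped)
    return len(stripped)


# "Ok!!!" counts as 2 characters when punctuation is ignored, 5 otherwise.
print(effective_length("Ok!!!"), effective_length("Ok!!!", False))
```

Under these assumptions, a sentence made only of punctuation has an effective length of 0 with the default settings, so it would fall below the default threshold of 2 and be left in place even if it repeats.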

process_batched(samples)[source]