data_juicer.ops.mapper.remove_repeat_sentences_mapper module

data_juicer.ops.mapper.remove_repeat_sentences_mapper.split_sentence(text)
class data_juicer.ops.mapper.remove_repeat_sentences_mapper.RemoveRepeatSentencesMapper(lowercase: bool = False, ignore_special_character: bool = True, min_repeat_sentence_length: int = 2, *args, **kwargs)

Bases: Mapper

Mapper to remove repeated sentences in text samples.

__init__(lowercase: bool = False, ignore_special_character: bool = True, min_repeat_sentence_length: int = 2, *args, **kwargs)

Initialization method.

Parameters:
  • lowercase -- Whether to convert sample text to lower case

  • ignore_special_character -- Whether to ignore special characters when judging repeated sentences. Special characters are all characters except Chinese characters, letters and numbers.

  • min_repeat_sentence_length -- Sentences shorter than this length will not be deduplicated. If ignore_special_character is set to True, special characters do not count toward this length.

  • args -- extra positional arguments

  • kwargs -- extra keyword arguments
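
To illustrate how the three options above interact, here is a minimal standalone sketch. It is not the data_juicer implementation: the function name `dedup_sentences`, the sentence-splitting regex, and the Unicode range used for "Chinese characters" are all assumptions made for this example, and it assumes `lowercase` affects only the comparison, not the returned text.

```python
import re

# Assumption: a sentence ends at ., !, ? or the full-width forms 。!?
_TERMINATORS = '.!?。!?'
_SENTENCE = re.compile(
    '[^{0}]*[{0}]+|[^{0}]+'.format(re.escape(_TERMINATORS)))

# Assumption: "special characters" are everything except CJK ideographs
# in U+4E00..U+9FFF, ASCII letters and ASCII digits.
_SPECIAL = re.compile(r'[^\u4e00-\u9fffA-Za-z0-9]')


def dedup_sentences(text,
                    lowercase=False,
                    ignore_special_character=True,
                    min_repeat_sentence_length=2):
    """Hypothetical helper: drop sentences whose comparison key was seen."""
    seen = set()
    kept = []
    for sentence in _SENTENCE.findall(text):
        # Build the key used only for comparison; the kept text is untouched.
        key = sentence.strip()
        if lowercase:
            key = key.lower()
        if ignore_special_character:
            key = _SPECIAL.sub('', key)
        if len(key) < min_repeat_sentence_length:
            # Too short to be deduplicated: always keep.
            kept.append(sentence)
        elif key not in seen:
            seen.add(key)
            kept.append(sentence)
    return ''.join(kept)


if __name__ == '__main__':
    print(dedup_sentences('Hello world. Hello world! Bye.'))
    print(dedup_sentences('Hi there. hi there.', lowercase=True))
    print(dedup_sentences('A. A. B.'))
```

With the defaults, punctuation and spaces are ignored when comparing, so "Hello world." and "Hello world!" count as the same sentence. Single-character sentences such as "A." fall below the default minimum length of 2 and are therefore kept even when repeated.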

process_batched(samples)