data_juicer.ops.mapper.remove_long_words_mapper module

class data_juicer.ops.mapper.remove_long_words_mapper.RemoveLongWordsMapper(min_len: int = 1, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove words whose length falls outside a specified range.

This operator removes words from the text that are shorter than min_len or longer than max_len. Each word is first checked at its original length; if it fails that check, special characters are stripped from it and the stripped form is checked again. Length is measured in characters. The processed text keeps only the words that pass one of these two checks. The operator processes samples in batches for efficiency.
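The two-step check described above can be sketched roughly as follows. This is a standalone illustration, not the operator's code: the helper names `should_keep_word` and `remove_out_of_range_words` are hypothetical, `string.punctuation` stands in for whatever special-character set the operator actually uses, and splitting on whitespace is a simplifying assumption.

```python
import string

# Assumption: the real operator has its own special-character set;
# string.punctuation is only a stand-in for this sketch.
SPECIAL_CHARS = string.punctuation


def should_keep_word(word, min_len=1, max_len=10):
    """Return True if the word passes the length check, raw or stripped."""
    # First check: the word exactly as it appears in the text.
    if min_len <= len(word) <= max_len:
        return True
    # Second check: the word with leading/trailing special characters removed.
    stripped = word.strip(SPECIAL_CHARS)
    return min_len <= len(stripped) <= max_len


def remove_out_of_range_words(text, min_len=1, max_len=10):
    """Keep only the words that pass should_keep_word (whitespace split)."""
    kept = [w for w in text.split() if should_keep_word(w, min_len, max_len)]
    return " ".join(kept)


print(remove_out_of_range_words("a supercalifragilistic word", 1, 10))
```

Note that in this sketch a word that passes only after stripping is kept in its original form, punctuation included; stripping is used for the length test only.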

__init__(min_len: int = 1, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_len -- The minimum word length for this op. Words shorter than this value are removed.

  • max_len -- The maximum word length for this op. Words longer than this value are removed. The default, 9223372036854775807 (2^63 - 1), effectively means no upper limit.

  • args -- extra positional args

  • kwargs -- extra keyword args

should_keep_long_word(word)[source]
process_batched(samples)[source]
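
This page gives no description for the two methods. As a minimal sketch, assuming `should_keep_long_word` returns a boolean for a single word and `process_batched` receives a dict that maps a text field to a list of strings, they might fit together as shown below. The class `LongWordsSketch`, the `text_key` argument, and the use of `string.punctuation` are all hypothetical stand-ins and are not taken from Data-Juicer.

```python
import string


class LongWordsSketch:
    """Hypothetical stand-in for the operator; not the Data-Juicer class."""

    def __init__(self, min_len=1, max_len=10, text_key="text"):
        self.min_len = min_len
        self.max_len = max_len
        # Assumption: batches store their texts under this key.
        self.text_key = text_key

    def should_keep_long_word(self, word):
        # Accept the word if its raw length is in range ...
        if self.min_len <= len(word) <= self.max_len:
            return True
        # ... or if it is in range once surrounding punctuation is stripped.
        stripped = word.strip(string.punctuation)
        return self.min_len <= len(stripped) <= self.max_len

    def process_batched(self, samples):
        # Filter every text in the batch. Joining with single spaces
        # collapses the original whitespace, which is a simplification.
        samples[self.text_key] = [
            " ".join(w for w in text.split() if self.should_keep_long_word(w))
            for text in samples[self.text_key]
        ]
        return samples


op = LongWordsSketch(min_len=2, max_len=5)
batch = {"text": ["a tiny enormous (word)", "ok"]}
print(op.process_batched(batch))
```

Handling the whole batch in one call avoids a per-sample function call, which is the efficiency the class description refers to.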