remove_long_words_mapper

Mapper that removes words whose length falls outside a specified range.

This operator filters out words that are shorter than the specified minimum length or longer than the specified maximum length. Each word is first checked at its original length; if it falls outside the range, special characters are stripped and the word is re-evaluated. Length is measured in characters. A word that passes either check is kept in its original form, so the processed text retains only the words that fall within the defined length range. The operator processes text in batches for efficiency.
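The two-step check described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual implementation: the helper names `strip_special`, `keep_word`, and `filter_words` are hypothetical, and treating every non-alphabetic character at the edges of a word as a "special character" is an assumption made for this example (the operator's own character set may differ). Splitting on single spaces is likewise a simplification.

```python
def strip_special(word: str) -> str:
    """Drop leading and trailing non-alphabetic characters (assumed rule)."""
    start, end = 0, len(word)
    while start < end and not word[start].isalpha():
        start += 1
    while end > start and not word[end - 1].isalpha():
        end -= 1
    return word[start:end]


def keep_word(word: str, min_len: int, max_len: int) -> bool:
    """Check the original length first, then the stripped length."""
    if min_len <= len(word) <= max_len:
        return True
    return min_len <= len(strip_special(word)) <= max_len


def filter_words(text: str, min_len: int = 1, max_len: int = 2**63 - 1) -> str:
    """Keep only words that pass keep_word; kept words are left unmodified."""
    lines = []
    for line in text.split("\n"):
        words = [w for w in line.split(" ") if keep_word(w, min_len, max_len)]
        lines.append(" ".join(words))
    return "\n".join(lines)


# "a" and "on" are shorter than 3 characters, so they are dropped.
print(filter_words("This paper proposed a novel method on LLM.", 3, 15))
```

Note that the stripped form is used only for the length check; a word that passes is emitted exactly as it appeared in the input.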

Type: mapper

Tags: cpu, text

🔧 Parameter Configuration

| name | type | default | desc |
| --- | --- | --- | --- |
| `min_len` | `int` | `1` | Minimum word length. Words shorter than this value are removed. |
| `max_len` | `int` | `9223372036854775807` | Maximum word length. Words longer than this value are removed. |
| `args` | | `''` | Extra positional arguments. |
| `kwargs` | | `''` | Extra keyword arguments. |
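The default `max_len` is the largest signed 64-bit integer, so in practice no upper bound applies unless one is set explicitly. The snippet below is only a quick arithmetic check of that value. That it also matches Python's `sys.maxsize` on 64-bit builds is an observation about CPython, not something this page states, and the constant name `DEFAULT_MAX_LEN` is made up for the example.

```python
import sys

# Default value of max_len as listed in the parameter table.
DEFAULT_MAX_LEN = 9223372036854775807

# The default equals 2**63 - 1, the largest signed 64-bit integer.
print(DEFAULT_MAX_LEN == 2**63 - 1)

# On 64-bit CPython builds this is also sys.maxsize (platform-dependent).
print(DEFAULT_MAX_LEN == sys.maxsize)
```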

📊 Effect demonstration

test_normal_case

RemoveLongWordsMapper(min_len=3, max_len=15)

📥 input data

Sample 1: text

This paper proposed novel method LLM pretraining.

📤 output data

Sample 1: text

This paper proposed novel method LLM pretraining.

✨ explanation

This example demonstrates the operator's behavior when every word in the text falls within the specified length range (3 to 15 characters). No words are removed, so the output is identical to the input.
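As a quick sanity check of this case, the character length of each whitespace-separated word can be computed directly. This sketch only measures lengths with plain Python; it does not call the operator, and the variable names are illustrative.

```python
text = "This paper proposed novel method LLM pretraining."
min_len, max_len = 3, 15

# Map each whitespace-separated word to its character length.
# Trailing punctuation counts toward the length, e.g. "pretraining." has 12 characters.
lengths = {word: len(word) for word in text.split()}
print(lengths)

# True when every word already lies inside [min_len, max_len].
all_in_range = all(min_len <= n <= max_len for n in lengths.values())
print(all_in_range)
```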

test_special_words_case

RemoveLongWordsMapper(min_len=3, max_len=15)

📥 input data

Sample 1: text

This paper proposed a novel eqeqweqwewqenhq😊😠 method on LLM.

Sample 2: text

Sur la plateforme MT4, plusieurs manières d'accéder0123813976125

Sample 3: text

The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.

📤 output data

Sample 1: text

This paper proposed novel eqeqweqwewqenhq😊😠 method LLM.

Sample 2: text

Sur plateforme MT4, plusieurs manières d'accéder0123813976125

Sample 3: text

The Mona Lisa have eyebrows.

✨ explanation

This example illustrates how the operator handles special characters and words that are too short or too long. Short words such as ‘a’, ‘on’, and ‘la’ fall below the minimum length of 3 and are removed. Words that exceed the maximum at their original length are stripped of special characters and re-evaluated: if the stripped word fits the length criteria, the original word is kept; otherwise it is removed. ‘eqeqweqwewqenhq😊😠’ and ‘d'accéder0123813976125’ are longer than 15 characters with their emojis and digits, but fit the range once those are stripped, so both are kept unchanged. ‘doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t’ is still too long after stripping and is removed.
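The decisions in this example can be reproduced by comparing each word's raw length with its length after trimming. The sketch below is an approximation under a stated assumption: "special characters" are taken to be any non-alphabetic characters at the start or end of a word, which may not match the operator's actual character set. The function names `stripped_length` and `decision` are hypothetical.

```python
MIN_LEN, MAX_LEN = 3, 15


def stripped_length(word: str) -> int:
    """Length after trimming non-alphabetic characters from both ends (assumed rule)."""
    letters = [i for i, ch in enumerate(word) if ch.isalpha()]
    return letters[-1] - letters[0] + 1 if letters else 0


def decision(word: str) -> str:
    """A word is kept if its raw or stripped length lies in [MIN_LEN, MAX_LEN]."""
    raw, stripped = len(word), stripped_length(word)
    ok = MIN_LEN <= raw <= MAX_LEN or MIN_LEN <= stripped <= MAX_LEN
    return "kept" if ok else "removed"


words = [
    "a",
    "on",
    "eqeqweqwewqenhq😊😠",
    "d'accéder0123813976125",
    "doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t",
]
for w in words:
    print(f"{w!r}: raw={len(w)}, stripped={stripped_length(w)}, {decision(w)}")
```

The garbled word begins and ends with ordinary letters, so trimming the edges does not shorten it, which is why it remains over the limit.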
