remove_repeat_sentences_mapper
Mapper to remove repeated sentences in text samples.
This operator removes duplicate sentences from text samples. It splits the text into lines and then splits each line into sentences. Two sentences are duplicates if they are identical after optional case normalization and special character removal. The operator tracks the sentences it has already seen in a hash set, keeps the first occurrence of each sentence, and drops later repeats. Sentences shorter than min_repeat_sentence_length are not deduplicated. If ignore_special_character is enabled, special characters (all characters except Chinese characters, letters, and numbers) are ignored when checking for duplicates. The output text is reassembled from the remaining sentences.
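The procedure described above can be sketched in a few lines of Python. This is a simplified illustration, not the operator's actual implementation: the function name `remove_repeat_sentences`, the sentence-splitting regex, and the exact definition of "special character" (here whitespace is kept) are assumptions made for the example.

```python
import re

# Simplified sentence splitter (an assumption, not the library's own rules):
# a sentence ends at one or more terminators plus any closing quotes.
_SENTENCE = re.compile(r"""[^.。!?!?]*[.。!?!?]+[”’"']*|[^.。!?!?]+""")
# Assumed definition of "special": anything except ASCII letters, digits,
# common Chinese characters, and whitespace.
_SPECIAL = re.compile(r"[^a-zA-Z0-9\u4e00-\u9fa5\n\t ]")


def remove_repeat_sentences(text, lowercase=False,
                            ignore_special_character=True,
                            min_repeat_sentence_length=2):
    """Keep the first occurrence of each sentence and drop later repeats."""
    seen = set()  # comparison keys seen so far, shared across all lines
    out_lines = []
    for line in text.split("\n"):
        kept = []
        for sentence in _SENTENCE.findall(line):
            # Build the comparison key; the original sentence is what gets kept.
            key = sentence.strip()
            if lowercase:
                key = key.lower()
            if ignore_special_character:
                key = _SPECIAL.sub("", key)
            if len(key) < min_repeat_sentence_length:
                kept.append(sentence)  # too short: never deduplicated
            elif key not in seen:
                seen.add(key)
                kept.append(sentence)
        out_lines.append("".join(kept))
    return "\n".join(out_lines)


if __name__ == "__main__":
    print(remove_repeat_sentences("A cat. A cat. B dog."))
```

Only the comparison key is normalized; the sentences that survive are emitted exactly as they appeared in the input, including their original case and punctuation.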
Type: mapper
Tags: cpu, text
🔧 Parameter Configuration
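name | type | default | desc |
---|---|---|---|
lowercase | <class 'bool'> | False | Whether to convert sample text to lower case |
ignore_special_character | <class 'bool'> | True | Whether to ignore special characters (all except Chinese characters, letters, and numbers) when checking for duplicates |
min_repeat_sentence_length | <class 'int'> | 2 | Sentences shorter than this length are not deduplicated |
args | | | extra args |
kwargs | | | extra args |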
📊 Effect demonstration
test_text
RemoveRepeatSentencesMapper()
📥 input data
['今天天气真不错,阳光明媚,适合出去散步。小明说:“今天天气真不错,我们去海边吧。” 小红回答说:“好主意!” 但是,小李觉得:“今天天气真不错,我们去爬山吧。” 今天天气真不错,阳光明媚,适合出去散步。昨天下了一整天的雨,今天终于放晴了。昨天下了一整天的雨,今天终于放晴了。', 'The quick brown fox jumps over the lazy dog. Isn\'t it amazing how a simple sentence can contain every letter of the alphabet? The quick brown fox jumps over the lazy dog. Speaking of weather, yesterday was quite dreary; however, today is absolutely delightful. Isn\'t it amazing how a simple sentence can contain every letter of the alphabet? "Let\'s seize the day," Tom exclaimed, full of enthusiasm. "Let\'s seize the day," Tom exclaimed, full of enthusiasm.', '我很开心 。但是你不开心 。我很开心 。\n你好呀!我很开心 。我好的。你好呀!', '默认配置下,长度低于2的句子不会被去重。去重?去重。去重!重。重...... 重! 1234?3215. 1234. 3. 3. 3']
📤 output data
['今天天气真不错,阳光明媚,适合出去散步。小明说:“今天天气真不错,我们去海边吧。” 小红回答说:“好主意!” 但是,小李觉得:“今天天气真不错,我们去爬山吧。”昨天下了一整天的雨,今天终于放晴了。', 'The quick brown fox jumps over the lazy dog. Isn\'t it amazing how a simple sentence can contain every letter of the alphabet? Speaking of weather, yesterday was quite dreary; however, today is absolutely delightful. "Let\'s seize the day," Tom exclaimed, full of enthusiasm.', '我很开心 。但是你不开心 。\n你好呀!我好的。', '默认配置下,长度低于2的句子不会被去重。去重?重。重...... 重! 1234?3215. 3. 3. 3']
✨ explanation
This example demonstrates the default behavior of RemoveRepeatSentencesMapper: the first occurrence of each sentence is kept and later repeats are removed. In the first sample, the second occurrences of ‘今天天气真不错,阳光明媚,适合出去散步。’ and ‘昨天下了一整天的雨,今天终于放晴了。’ are removed. In the last sample, ‘去重?’, ‘去重。’, and ‘去重!’ differ only in punctuation, so only the first is kept, and ‘1234.’ is removed as a repeat of ‘1234?’. By default, sentences shorter than 2 characters are not deduplicated, which is why the repeated ‘重’ and ‘3’ sentences in the last sample are all kept.
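The length rule in the last sample can be checked with a small sketch. The helper names `default_key` and `is_dedup_candidate` are hypothetical, and the regex is an assumed definition of "special character" (whitespace is kept); the point is only that the length is measured on the comparison key, after punctuation has been stripped.

```python
import re

# Assumed definition of "special": anything except ASCII letters, digits,
# common Chinese characters, and whitespace.
SPECIAL = re.compile(r"[^a-zA-Z0-9\u4e00-\u9fa5\n\t ]")


def default_key(sentence):
    """Comparison key under the default settings (hypothetical helper)."""
    return SPECIAL.sub("", sentence.strip())


def is_dedup_candidate(sentence, min_repeat_sentence_length=2):
    """True if the sentence is long enough to take part in deduplication."""
    return len(default_key(sentence)) >= min_repeat_sentence_length


if __name__ == "__main__":
    for s in ["去重?", "去重。", "重。", " 3.", " 1234."]:
        print(repr(s), repr(default_key(s)), is_dedup_candidate(s))
```

Under these assumptions ‘重。’ and ‘3.’ reduce to one-character keys and are skipped, while ‘去重?’ and ‘去重。’ reduce to the same two-character key and are treated as duplicates.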
test_text2
RemoveRepeatSentencesMapper(lowercase=True, ignore_special_character=False, min_repeat_sentence_length=5)
📥 input data
["Life is what happens when you're busy making other plans. John Lennon once said. Life is what happens when you're busy making other plans. This phrase has resonated with many people over the years. 人生就是当你忙于制定其他计划时发生的事情。对很多人来说,这句话引起了共鸣。", 'The quick brown fox jumps over the lazy dog. Isn\'t it amazing how a simple sentence can contain every letter of the alphabet? The quick brown fox jumps over the lazy dog. Speaking of weather, yesterday was quite dreary; however, today is absolutely delightful. Isn\'t it amazing how a simple sentence can contain every letter of the alphabet? "Let\'s seize the day," Tom exclaimed, full of enthusiasm. "Let\'s seize the day," Tom exclaimed, full of enthusiasm.', '我很开心 。但是你不开心 。我很开心 。\n你好呀!我很开心 。我好的。你好呀!', '去重?去重。去重!重。重...... 重! 1234?3215. 1234. 3. 3. 3']
📤 output data
["Life is what happens when you're busy making other plans. John Lennon once said. This phrase has resonated with many people over the years. 人生就是当你忙于制定其他计划时发生的事情。对很多人来说,这句话引起了共鸣。", 'The quick brown fox jumps over the lazy dog. Isn\'t it amazing how a simple sentence can contain every letter of the alphabet? Speaking of weather, yesterday was quite dreary; however, today is absolutely delightful. "Let\'s seize the day," Tom exclaimed, full of enthusiasm.', '我很开心 。但是你不开心 。\n你好呀!我好的。你好呀!', '去重?去重。去重!重。重...... 重! 1234?3215. 1234. 3. 3. 3']
✨ explanation
This example sets lowercase=True, ignore_special_character=False, and min_repeat_sentence_length=5. With lowercase=True, sentences that differ only in case are treated as duplicates. With ignore_special_character=False, special characters take part in the comparison, so sentences that differ only in punctuation are not duplicates. Sentences shorter than 5 characters are not deduplicated, which is why ‘你好呀!’ (4 characters) appears twice in the output of the third sample. The fourth sample is returned unchanged because its sentences are either shorter than 5 characters or differ in punctuation.
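To see how the two flags change what counts as a duplicate, the sketch below builds the comparison key under both configurations. The helper `comparison_key` and the regex are illustrative assumptions, not the operator's API.

```python
import re

# Assumed definition of "special": anything except ASCII letters, digits,
# common Chinese characters, and whitespace.
SPECIAL = re.compile(r"[^a-zA-Z0-9\u4e00-\u9fa5\n\t ]")


def comparison_key(sentence, lowercase=False, ignore_special_character=True):
    """Normalize a sentence for duplicate checks (hypothetical helper)."""
    key = sentence.strip()
    if lowercase:
        key = key.lower()
    if ignore_special_character:
        key = SPECIAL.sub("", key)
    return key


# Settings used in test_text2 (punctuation is part of the key).
strict = dict(lowercase=True, ignore_special_character=False)

if __name__ == "__main__":
    # Punctuation counts toward the length when it is not ignored.
    print(len(comparison_key("你好呀!", **strict)))
    # Case differences vanish, punctuation differences remain.
    print(comparison_key("The Fox.", **strict) == comparison_key("the fox.", **strict))
    print(comparison_key("1234?", **strict) == comparison_key("1234.", **strict))
```

With ignore_special_character=False the punctuation mark is part of both the key and its length, so ‘你好呀!’ has a 4-character key and stays below the threshold of 5, while ‘1234?’ and ‘1234.’ reach the threshold but compare as different sentences.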