remove_specific_chars_mapper

Removes specific characters from text samples.

This operator removes specified characters from the text. The characters to be removed can be provided as a string or a list of strings. If no characters are specified, the default set includes special and non-alphanumeric characters. The operator processes the text using a regular expression pattern that matches any of the specified characters and replaces them with an empty string. This is done in a batched manner for efficiency.

移除文本样本中的特定字符。

该算子从文本中移除指定的字符。要移除的字符可以作为字符串或字符串列表提供。如果没有指定字符,默认设置包括特殊字符和非字母数字字符。该算子使用正则表达式模式匹配任何指定的字符,并将其替换为空字符串。为了提高效率,这以批量方式进行。

Type 算子类型: mapper

Tags 标签: cpu, text

🔧 Parameter Configuration 参数配置

name 参数名

type 类型

default 默认值

desc 说明

chars_to_remove

typing.Union[str, typing.List[str]]

'◆●■►▼▲▴∆▻▷❖♡□'

a list or a string including all

args

''

extra args

kwargs

''

extra args

📊 Effect demonstration 效果演示

test_complete_html_text

RemoveSpecificCharsMapper()

📥 input data 输入数据

Sample 1: list
['这是一个干净的文本。Including Chinese and English.', '◆●■►▼▲▴∆▻▷❖♡□', '►This is a dirty text ▻ 包括中文和英文', '多个●■►▼这样的特殊字符可以►▼▲▴∆吗?', '未指定的●■☛₨➩►▼▲特殊字符会☻▷❖被删掉吗??']

📤 output data 输出数据

Sample 1: list
['这是一个干净的文本。Including Chinese and English.', '', 'This is a dirty text  包括中文和英文', '多个这样的特殊字符可以吗?', '未指定的☛₨➩特殊字符会☻被删掉吗??']

✨ explanation 解释

This example demonstrates how the operator removes specific special characters from the text, leaving only alphanumeric and some punctuation. In the first sample, no special characters are present, so the text remains unchanged. In the second sample, all characters are special, hence the resulting text is empty. The third to fifth samples show that only specified special characters are removed, while others remain. 这个示例展示了算子如何从文本中移除特定的特殊字符,只留下字母数字和某些标点符号。在第一个样本中,没有特殊字符,因此文本保持不变。在第二个样本中,所有的字符都是特殊字符,所以结果文本是空的。第三到第五个样本显示只有指定的特殊字符被移除,而其他字符则保留下来。