remove_non_chinese_character_mapper¶
Removes non-Chinese characters from text samples.
This mapper removes all characters that are not part of the Chinese character set.
It can optionally keep alphabets, numbers, and punctuation based on the configuration.
The removal is done using a regular expression pattern.
The pattern is constructed to exclude or include alphabets, numbers, and punctuation as specified.
The key metric for this operation is the presence of non-Chinese characters, which are removed.
The operator processes samples in a batched manner.
移除文本样本中的非汉字字符。
该映射器移除所有不属于汉字字符集的字符。
可根据配置选择性保留字母、数字和标点符号。
移除操作使用正则表达式模式进行。
模式构建时会根据指定情况排除或包含字母、数字和标点符号。
该操作的关键指标是存在非汉字字符,这些字符将被移除。
该算子以批量方式处理样本。
Type 算子类型: mapper
Tags 标签: cpu, text
🔧 Parameter Configuration 参数配置¶
name 参数名 |
type 类型 |
default 默认值 |
desc 说明 |
---|---|---|---|
|
<class ‘bool’> |
|
whether to keep alphabet |
|
<class ‘bool’> |
|
whether to keep number |
|
<class ‘bool’> |
|
whether to keep punctuation |
|
|
extra args |
|
|
|
extra args |
📊 Effect demonstration 效果演示¶
test_remove_non_chinese_character¶
RemoveNonChineseCharacterlMapper(True, True, True)
📥 input data 输入数据¶
['特殊的康熙部首或者扩展部首会被去除,⼏几⺇', '请问你是谁dasoidhao@1264fg.45om', 'ftp://exam匹配ple汉字ma-niè包括rdas繁體字h@hqbchd.ckdhnfes.cds', '👊 所有的非汉字a44sh都12@46h会被*&……*qb^4525去掉']
📤 output data 输出数据¶
['特殊的康熙部首或者扩展部首会被去除几', '请问你是谁', '匹配汉字包括繁體字', '所有的非汉字都会被去掉']
✨ explanation 解释¶
This example shows the operator removing all non-Chinese characters, including alphabets, numbers, and punctuation. The result contains only Chinese characters, which is useful when you want to keep only the Chinese text. 这个例子展示了算子移除所有非汉字字符,包括字母、数字和标点符号。结果只包含汉字,这在你只想保留中文文本时非常有用。
test_remove_non_chinese_character5¶
RemoveNonChineseCharacterlMapper(True, True, True)
📥 input data 输入数据¶
['特殊的康熙部首或者扩展部首会被去除,⼏几⺇', '请问你是谁dasoidhao@1264fg.45om', 'f://exam匹配ple汉12字ma-niè包括rdas繁88體字h@hqbchd.ds1', '👊 所有的非汉字a44sh都12@46h会被*&……*qb^4525去掉']
📤 output data 输出数据¶
['特殊的康熙部首或者扩展部首会被去除几', '请问你是谁dasoidhao1264fg45om', 'fexam匹配ple汉12字mani包括rdas繁88體字hhqbchdds1', '所有的非汉字a44sh都1246h会被qb4525去掉']
✨ explanation 解释¶
In this example, the operator keeps both alphabets and numbers while removing all other non-Chinese characters. This is useful for cases where you want to preserve some additional information along with the Chinese text, such as alphanumeric codes or file paths. 在这个例子中,算子保留了字母和数字,同时移除了其他所有非汉字字符。这在你想在中文文本之外还保留一些额外信息(如字母数字代码或文件路径)时非常有用。