remove_non_chinese_character_mapper

Removes non-Chinese characters from text samples.

This mapper removes all characters that are not part of the Chinese character set.

  • Alphabetic characters, numbers, and punctuation can each be kept, depending on keep_alphabet, keep_number, and keep_punc.

  • Removal is done with a regular expression pattern.

  • The pattern is built from the configuration: each enabled keep option adds its characters to the set that is preserved.

  • Any character the pattern does not preserve is removed from the text.

  • The operator processes samples in batches.
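The behaviour described above can be sketched with Python's standard re module. This is a minimal illustration, not the operator's actual source: the helper names build_pattern and process_batched, the "text" key, the Unicode range used for Chinese characters, and the small punctuation set are all assumptions made for the example.

```python
import re

# Assumption: "Chinese characters" means the basic CJK Unified Ideographs
# block (U+4E00..U+9FFF). The real operator may use a different range.
CJK_RANGE = "\u4e00-\u9fff"

# Assumption: a deliberately small, illustrative punctuation set.
PUNCTUATION = ".,!?;:'\"()，。！？；：、（）"


def build_pattern(keep_alphabet=True, keep_number=True, keep_punc=True):
    """Build a negated character class matching every character to remove."""
    allowed = CJK_RANGE
    if keep_alphabet:
        allowed += "A-Za-z"
    if keep_number:
        allowed += "0-9"
    if keep_punc:
        # Escape so that no punctuation mark is read as regex syntax.
        allowed += re.escape(PUNCTUATION)
    return re.compile(f"[^{allowed}]")


def process_batched(samples, pattern, text_key="text"):
    """Strip unwanted characters from every text in a batch.

    Assumption: a batch is a dict mapping a column name to a list of values.
    """
    samples[text_key] = [pattern.sub("", text) for text in samples[text_key]]
    return samples


# Keep letters and digits, drop punctuation and everything else.
batch = {"text": ["请问你是谁dasoidhao@1264fg.45om"]}
batch = process_batched(batch, build_pattern(True, True, False))
print(batch["text"])  # ['请问你是谁dasoidhao1264fg45om']
```

Using a single negated character class keeps the logic in one place: each enabled option only widens the set of preserved characters, and a single substitution pass removes the rest.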

Type: mapper

Tags: cpu, text

🔧 Parameter Configuration

| name | type | default | desc |
|------|------|---------|------|
| keep_alphabet | <class 'bool'> | True | whether to keep alphabetic characters |
| keep_number | <class 'bool'> | True | whether to keep numbers |
| keep_punc | <class 'bool'> | True | whether to keep punctuation |
| args | | '' | extra args |
| kwargs | | '' | extra keyword args |

📊 Effect demonstration

test_remove_non_chinese_character

RemoveNonChineseCharacterlMapper(keep_alphabet=False, keep_number=False, keep_punc=False)

📥 input data

Sample 1: list
['特殊的康熙部首或者扩展部首会被去除,⼏几⺇', '请问你是谁dasoidhao@1264fg.45om', 'ftp://exam匹配ple汉字ma-niè包括rdas繁體字h@hqbchd.ckdhnfes.cds', '👊    所有的非汉字a44sh都12@46h会被*&……*qb^4525去掉']

📤 output data

Sample 1: list
['特殊的康熙部首或者扩展部首会被去除几', '请问你是谁', '匹配汉字包括繁體字', '所有的非汉字都会被去掉']

✨ explanation

This example shows the operator removing all non-Chinese characters, including letters, numbers, and punctuation. The result contains only Chinese characters, which is useful when only the Chinese text should be kept.

test_remove_non_chinese_character5

RemoveNonChineseCharacterlMapper(keep_alphabet=True, keep_number=True, keep_punc=False)

📥 input data

Sample 1: list
['特殊的康熙部首或者扩展部首会被去除,⼏几⺇', '请问你是谁dasoidhao@1264fg.45om', 'f://exam匹配ple汉12字ma-niè包括rdas繁88體字h@hqbchd.ds1', '👊    所有的非汉字a44sh都12@46h会被*&……*qb^4525去掉']

📤 output data

Sample 1: list
['特殊的康熙部首或者扩展部首会被去除几', '请问你是谁dasoidhao1264fg45om', 'fexam匹配ple汉12字mani包括rdas繁88體字hhqbchdds1', '所有的非汉字a44sh都1246h会被qb4525去掉']

✨ explanation

In this example, the operator keeps letters and numbers while removing all other non-Chinese characters, including punctuation and whitespace. This is useful when you want to preserve some additional information along with the Chinese text, such as alphanumeric codes.