# remove_non_chinese_character_mapper

Removes non-Chinese characters from text samples.

This mapper removes every character that falls outside the Chinese character set.

- Alphabets, numbers, and punctuation can each be kept, depending on the configuration.
- Removal is done with a regular expression.
- The pattern is built so that alphabets, numbers, and punctuation are either included in or excluded from the set of characters to remove, as configured.
- Every character the pattern identifies as non-Chinese is deleted from the text.
- The operator processes samples in batches.

Type: **mapper**

Tags: cpu, text

## 🔧 Parameter Configuration

| name | type | default | desc |
|--------|------|--------|------|
| `keep_alphabet` | `bool` | `True` | whether to keep alphabetic characters |
| `keep_number` | `bool` | `True` | whether to keep numbers |
| `keep_punc` | `bool` | `True` | whether to keep punctuation |
| `args` | | `''` | extra positional args |
| `kwargs` | | `''` | extra keyword args |

## 📊 Effect demonstration

### test_remove_non_chinese_character

```python
RemoveNonChineseCharacterlMapper(keep_alphabet=False, keep_number=False, keep_punc=False)
```

#### 📥 input data
Sample 1: list
['特殊的康熙部首或者扩展部首会被去除,⼏几⺇', '请问你是谁dasoidhao@1264fg.45om', 'ftp://exam匹配ple汉字ma-niè包括rdas繁體字h@hqbchd.ckdhnfes.cds', '👊    所有的非汉字a44sh都12@46h会被*&……*qb^4525去掉']
#### 📤 output data
Sample 1: list
['特殊的康熙部首或者扩展部首会被去除几', '请问你是谁', '匹配汉字包括繁體字', '所有的非汉字都会被去掉']
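
The output above can be approximated with a few lines of standard-library Python. The sketch below is not the operator's source code: `build_pattern`, `remove_non_chinese`, the `\u4e00-\u9fa5` range used for "Chinese characters", and the punctuation set are all assumptions chosen for illustration. It shows how a negated character class might grow as each `keep_*` flag is enabled.

```python
import re

# Assumed range for "Chinese characters": the basic CJK Unified Ideographs
# block. The real operator may use a different range.
CJK_RANGE = "\u4e00-\u9fa5"

# Assumed punctuation set, for illustration only.
PUNCTUATION = ".,!?;:，。！？；：、"


def build_pattern(keep_alphabet=True, keep_number=True, keep_punc=True):
    """Build a regex that matches every character that should be removed."""
    allowed = CJK_RANGE
    if keep_alphabet:
        allowed += "A-Za-z"
    if keep_number:
        allowed += "0-9"
    if keep_punc:
        allowed += re.escape(PUNCTUATION)
    # Negated class: anything NOT listed in `allowed` is matched.
    return re.compile(f"[^{allowed}]")


def remove_non_chinese(text, **flags):
    """Delete every character matched by the pattern built from `flags`."""
    return build_pattern(**flags).sub("", text)


cleaned = remove_non_chinese(
    "请问你是谁dasoidhao@1264fg.45om",
    keep_alphabet=False, keep_number=False, keep_punc=False,
)
print(cleaned)  # 请问你是谁
```

With every flag set to `False`, the allowed set shrinks to the assumed CJK range alone, so whitespace, emoji, and radical look-alike characters outside that range are removed along with letters and digits.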
#### ✨ explanation

This example shows the operator removing every non-Chinese character, including alphabets, numbers, punctuation, whitespace, and emoji, which corresponds to disabling all three `keep_*` options. The radical look-alikes ⼏ and ⺇ are removed as well, while the ordinary character 几 is kept. The result contains only Chinese characters, which is useful when you want to keep only the Chinese text.

### test_remove_non_chinese_character5

```python
RemoveNonChineseCharacterlMapper(keep_alphabet=True, keep_number=True, keep_punc=False)
```

#### 📥 input data
Sample 1: list
['特殊的康熙部首或者扩展部首会被去除,⼏几⺇', '请问你是谁dasoidhao@1264fg.45om', 'f://exam匹配ple汉12字ma-niè包括rdas繁88體字h@hqbchd.ds1', '👊    所有的非汉字a44sh都12@46h会被*&……*qb^4525去掉']
#### 📤 output data
Sample 1: list
['特殊的康熙部首或者扩展部首会被去除几', '请问你是谁dasoidhao1264fg45om', 'fexam匹配ple汉12字mani包括rdas繁88體字hhqbchdds1', '所有的非汉字a44sh都1246h会被qb4525去掉']
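
Because the operator works on batches, a single call handles a whole column of texts rather than one string. The following is a minimal sketch of that shape, assuming a batch is a dict that maps a column name to a list of values. The `clean_batch` function, the `text` key, and the hard-coded pattern are hypothetical stand-ins for illustration, not Data-Juicer's API.

```python
import re

# Assumed pattern for "keep Chinese characters, alphabets, and numbers":
# everything outside these ranges is matched and removed.
KEEP_ALNUM = re.compile("[^\u4e00-\u9fa5A-Za-z0-9]")


def clean_batch(samples, text_key="text"):
    """Return a copy of the batch with every text in `text_key` cleaned."""
    cleaned = dict(samples)  # shallow copy, other columns are left as-is
    cleaned[text_key] = [KEEP_ALNUM.sub("", t) for t in samples[text_key]]
    return cleaned


batch = {
    "text": [
        "请问你是谁dasoidhao@1264fg.45om",
        "👊    所有的非汉字a44sh都12@46h会被*&……*qb^4525去掉",
    ]
}
result = clean_batch(batch)
print(result["text"])
# ['请问你是谁dasoidhao1264fg45om', '所有的非汉字a44sh都1246h会被qb4525去掉']
```

Compiling the pattern once and reusing it for every text in the batch avoids rebuilding the regular expression per sample.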
#### ✨ explanation

In this example, the operator keeps alphabets and numbers while removing all other non-Chinese characters. This is useful when you want to preserve additional information alongside the Chinese text, such as alphanumeric codes. Punctuation such as `@`, `.`, `:`, `/`, and `-` is still removed, so URLs, e-mail addresses, and file paths lose their separators, and accented letters such as `è` are dropped.

## 🔗 related links

- [source code](../../../data_juicer/ops/mapper/remove_non_chinese_character_mapper.py)
- [unit test](../../../tests/ops/mapper/test_remove_non_chinese_character_mapper.py)
- [Return operator list](../../Operators.md)