data_juicer.ops.mapper.remove_non_chinese_character_mapper module
- class data_juicer.ops.mapper.remove_non_chinese_character_mapper.RemoveNonChineseCharacterlMapper(keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs)
Bases: Mapper
Removes non-Chinese characters from text samples.
This mapper removes all characters that are not part of the Chinese character set.
- It can optionally keep alphabetic characters, numbers, and punctuation, depending on the configuration.
- The removal is done using a regular expression pattern.
- The pattern is constructed to include or exclude alphabetic characters, numbers, and punctuation as specified.
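The pattern construction described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper names `build_pattern` and `strip_non_chinese` are hypothetical, and the character ranges (CJK Unified Ideographs U+4E00 to U+9FFF, ASCII letters and digits, and a small punctuation set) are assumptions chosen for the example.

```python
import re


def build_pattern(keep_alphabet=True, keep_number=True, keep_punc=True):
    """Build a regex matching runs of characters that should be removed.

    Hypothetical helper: the ranges below are assumptions, not the
    exact sets used by data_juicer.
    """
    # Always keep CJK Unified Ideographs (assumed range).
    allowed = "\u4e00-\u9fff"
    if keep_alphabet:
        allowed += "A-Za-z"  # assumed: ASCII letters only
    if keep_number:
        allowed += "0-9"  # assumed: ASCII digits only
    if keep_punc:
        # Assumed punctuation set; re.escape keeps it safe inside [...].
        allowed += re.escape(".,!?;:。，！？；：、")
    # Negated character class: anything NOT in the allowed set.
    return re.compile(f"[^{allowed}]+")


def strip_non_chinese(text, **flags):
    """Delete every character not covered by the allowed set."""
    return build_pattern(**flags).sub("", text)
```

Because whitespace is not in the assumed allowed set, this sketch also drops spaces; the real operator may treat them differently.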
The operation targets non-Chinese characters: any such character not covered by one of the keep options is removed.
The operator processes samples in a batched manner.
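Batched processing can be pictured as one call receiving a dictionary of columns, with each column holding a list of values. The sketch below assumes that layout and a text column named `text`; the function name `process_batch` and the `text_key` parameter are hypothetical and are not taken from the library's API. For brevity it keeps only Chinese characters, using an assumed range of U+4E00 to U+9FFF.

```python
import re

# Assumption: "Chinese" means CJK Unified Ideographs U+4E00 to U+9FFF.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def process_batch(samples, text_key="text"):
    """Clean every text in a column-oriented batch.

    `samples` maps column names to equal-length lists (assumed layout).
    Returns a new dict; the input batch is left unmodified.
    """
    cleaned = dict(samples)  # shallow copy so other columns pass through
    cleaned[text_key] = [NON_CHINESE.sub("", t) for t in samples[text_key]]
    return cleaned
```

Handling a whole list per call lets one compiled pattern be reused across all samples in the batch.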
- __init__(keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs)
Initialization method.
- Parameters:
keep_alphabet – whether to keep alphabetic characters (default: True)
keep_number – whether to keep numbers (default: True)
keep_punc – whether to keep punctuation (default: True)
args – extra positional arguments
kwargs – extra keyword arguments