data_juicer.ops.mapper.remove_non_chinese_character_mapper module

class data_juicer.ops.mapper.remove_non_chinese_character_mapper.RemoveNonChineseCharacterlMapper(keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs)[source]

Bases: Mapper

Removes non-Chinese characters from text samples.

This mapper removes all characters that are not part of the Chinese character set.

  • It can optionally keep alphabets, numbers, and punctuation based on the configuration.

  • The removal is done using a regular expression pattern.

  • The pattern is constructed to exclude or include alphabets, numbers, and punctuation as specified.

  • The key metric for this operation is the presence of non-Chinese characters, which are removed.

  • The operator processes samples in a batched manner.
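The pattern construction described above can be sketched as follows. This is a minimal illustration, not the operator's actual source: the helper name build_removal_pattern, the CJK range \u4e00-\u9fa5, and the punctuation set ASSUMED_PUNCTUATION are all assumptions made for the example, and the real operator may keep a different set of characters.

```python
import re

# Hypothetical punctuation set, chosen only for this illustration.
ASSUMED_PUNCTUATION = ".,!?;:\"'()。,!?;:、"


def build_removal_pattern(keep_alphabet=True, keep_number=True, keep_punc=True):
    """Build a regex matching every character that should be removed."""
    # Characters listed here survive; the class is negated below.
    # Assumption: the common CJK ideograph range counts as "Chinese".
    kept = "\u4e00-\u9fa5"
    if keep_alphabet:
        kept += "A-Za-z"
    if keep_number:
        kept += "0-9"
    if keep_punc:
        # Escape so that characters such as "-" or "]" stay literal.
        kept += re.escape(ASSUMED_PUNCTUATION)
    return re.compile("[^" + kept + "]")


text = "你好, world 123!"
keep_all = build_removal_pattern().sub("", text)
no_letters = build_removal_pattern(keep_alphabet=False).sub("", text)
chinese_only = build_removal_pattern(False, False, False).sub("", text)
```

Because the character class is negated, each keep_* flag widens the set of characters that survive. Note that in this sketch whitespace is not in the kept set, so spaces are removed as well.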

__init__(keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • keep_alphabet -- whether to keep alphabetic characters

  • keep_number -- whether to keep numeric characters

  • keep_punc -- whether to keep punctuation

  • args -- extra positional arguments

  • kwargs -- extra keyword arguments

process_batched(samples)[source]
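As a rough sketch of what batched processing could look like, the example below assumes that a batch is a dict mapping field names to equal-length lists and that the text lives under a "text" key. Both the layout and the helper name remove_from_batch are assumptions for illustration; the page does not document the structure of samples or the return value.

```python
import re


def remove_from_batch(samples, pattern, text_key="text"):
    """Strip every match of `pattern` from each text in a column-style batch."""
    cleaned = [pattern.sub("", text) for text in samples[text_key]]
    # Return a new dict so the input batch is left unmodified.
    return {**samples, text_key: cleaned}


# Assumed pattern: remove everything outside the common CJK ideograph range.
non_chinese = re.compile("[^\u4e00-\u9fa5]")

batch = {"text": ["数据 data 处理", "abc", "你好"], "id": [1, 2, 3]}
result = remove_from_batch(batch, non_chinese)
```

Other fields in the batch, such as the hypothetical "id" column here, pass through unchanged, and a text that contains no Chinese characters becomes an empty string.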