data_juicer.ops.mapper.fix_unicode_mapper module

class data_juicer.ops.mapper.fix_unicode_mapper.FixUnicodeMapper(normalization: str = None, *args, **kwargs)

Bases: Mapper

Fixes unicode errors in text samples.

This operator corrects common unicode errors and normalizes the text to a specified Unicode normalization form. The form defaults to 'NFC' (used when normalization is left as None) and can instead be set to 'NFKC', 'NFD', or 'NFKD' during initialization. Text samples are processed in batches, with the chosen normalization applied to each sample. If an unsupported normalization form is provided, a ValueError is raised.
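To make the four normalization forms concrete, here is a minimal sketch built on Python's standard unicodedata module. It is not the operator's implementation, which is not shown on this page. The helper names resolve_form and normalize_text are hypothetical, and the sketch covers only the normalization and validation steps, not the correction of other unicode errors.

```python
import unicodedata

# The four forms listed in the description above.
SUPPORTED_FORMS = ("NFC", "NFKC", "NFD", "NFKD")


def resolve_form(normalization=None):
    """Hypothetical helper: fall back to 'NFC' and reject unknown forms."""
    form = normalization or "NFC"
    if form not in SUPPORTED_FORMS:
        raise ValueError(f"Unsupported normalization form: {normalization!r}")
    return form


def normalize_text(text, normalization=None):
    """Hypothetical helper: apply the resolved form to a single string."""
    return unicodedata.normalize(resolve_form(normalization), text)


decomposed = "e\u0301"  # 'e' followed by a combining acute accent
ligature = "\ufb01"     # the single-code-point 'fi' ligature

# NFC composes the pair into one code point; NFD splits it again.
composed = normalize_text(decomposed)
redecomposed = normalize_text(composed, "NFD")

# Only the compatibility forms (NFKC, NFKD) expand the ligature.
kept = normalize_text(ligature, "NFC")
expanded = normalize_text(ligature, "NFKC")
```

The canonical forms (NFC, NFD) preserve characters such as ligatures, while the compatibility forms (NFKC, NFKD) replace them with plain equivalents, which can matter when the text is later tokenized or deduplicated.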

__init__(normalization: str = None, *args, **kwargs)

Initialization method.

Parameters:
  • normalization -- the Unicode normalization form to apply; one of 'NFC', 'NFKC', 'NFD', or 'NFKD'. Defaults to 'NFC'.

  • args -- extra positional arguments

  • kwargs -- extra keyword arguments

process_batched(samples)
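
This page gives no description for process_batched, so the following is only a rough sketch of the batched behavior outlined in the class summary. It assumes a column-oriented batch (a dict mapping a field name to a list of values) and a "text" field; both are assumptions, as are the function name process_batched_sketch and its parameters. It uses the standard unicodedata module as a stand-in for whatever the operator actually calls.

```python
import unicodedata


def process_batched_sketch(samples, text_key="text", normalization="NFC"):
    """Hypothetical stand-in: normalize every string in one batch column."""
    samples[text_key] = [
        unicodedata.normalize(normalization, text)
        for text in samples[text_key]
    ]
    return samples


# A batch of two samples in the assumed column-oriented layout.
batch = {"text": ["cafe\u0301", "\ufb01ne"]}
result = process_batched_sketch(batch)
```

Under NFC the first sample's combining accent is composed into a single code point, while the ligature in the second sample is left unchanged.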