data_juicer.ops.mapper.fix_unicode_mapper module
- class data_juicer.ops.mapper.fix_unicode_mapper.FixUnicodeMapper(normalization: str = None, *args, **kwargs)
Bases: Mapper
Fixes Unicode errors in text samples.
This operator corrects common Unicode errors and normalizes text to a specified Unicode normalization form. The default form is 'NFC'; pass 'NFKC', 'NFD', or 'NFKD' as normalization at initialization to use a different one. Samples are processed in batches, and the chosen normalization is applied to each sample in the batch. An unsupported normalization form raises a ValueError.
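To make the four normalization forms concrete, here is a minimal sketch that uses only Python's standard-library `unicodedata` module. It is not the operator's implementation: `normalize_batch` and `SUPPORTED_FORMS` are hypothetical names introduced for illustration, the fallback from `None` to 'NFC' is an assumption drawn from the description above, and the error-fixing step is not reproduced here.

```python
import unicodedata

# Hypothetical constant: the four forms named in the description above.
SUPPORTED_FORMS = ("NFC", "NFKC", "NFD", "NFKD")


def normalize_batch(texts, normalization=None):
    """Hypothetical helper: normalize every string in a batch.

    Assumes that a missing form falls back to 'NFC' and that an
    unsupported form raises ValueError, as the description states.
    """
    form = normalization or "NFC"
    if form not in SUPPORTED_FORMS:
        raise ValueError(f"Unsupported normalization form: {form!r}")
    return [unicodedata.normalize(form, text) for text in texts]


decomposed = "e\u0301"  # 'e' followed by a combining acute accent
ligature = "\ufb01"     # the single-character 'fi' ligature

# NFC composes the accent but leaves the compatibility ligature alone.
nfc = normalize_batch([decomposed, ligature])

# NFKC composes the accent and also expands the ligature to 'f' + 'i'.
nfkc = normalize_batch([decomposed, ligature], "NFKC")
```

The canonical forms (NFC, NFD) only compose or decompose equivalent sequences, while the compatibility forms (NFKC, NFKD) additionally replace characters such as ligatures with plain equivalents. That makes them lossy, which is likely why a canonical form is the default.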