data_juicer.ops.mapper.punctuation_normalization_mapper module

class data_juicer.ops.mapper.punctuation_normalization_mapper.PunctuationNormalizationMapper(*args, **kwargs)[source]

Bases: Mapper

Normalizes Unicode punctuation to its English equivalents in text samples.

This operator processes a batch of text samples and replaces each Unicode punctuation character with its corresponding English punctuation. The mapping includes common substitutions such as the fullwidth comma (，) to a comma (,), the ideographic full stop (。) to a period (.), and the left curly double quote (“) to a straight double quote ("). It iterates over each character in the text, replacing it if it is found in the predefined punctuation map and leaving it unchanged otherwise. The result is a batch of text samples with consistent punctuation formatting.
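The character-by-character replacement described above can be sketched as follows. This is a minimal illustration, not the operator's actual code: `PUNCTUATION_MAP` and `normalize_punctuation` are hypothetical names, and the map holds only a few assumed entries, whereas the operator's predefined table is larger.

```python
# Hypothetical subset of a punctuation map (assumption for illustration);
# the operator's real predefined table contains more entries.
PUNCTUATION_MAP = {
    "，": ",",   # fullwidth comma
    "。": ".",   # ideographic full stop
    "“": '"',   # left curly double quote
    "”": '"',   # right curly double quote
}


def normalize_punctuation(text: str) -> str:
    """Replace each mapped character; leave every other character unchanged."""
    return "".join(PUNCTUATION_MAP.get(ch, ch) for ch in text)


print(normalize_punctuation("你好，世界。“hi”"))  # prints: 你好,世界."hi"
```

A dictionary lookup with the character itself as the default keeps the pass to a single scan of the text and guarantees that unmapped characters pass through untouched.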

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra positional arguments

  • kwargs – extra keyword arguments

process_batched(samples)[source]
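
A batched call might look roughly like the sketch below. It assumes the batch is a dictionary mapping column names to lists of values and that the text lives under a `"text"` key; both are assumptions for illustration. `process_batched_sketch` is a hypothetical standalone function, not the library's method, and `_SKETCH_MAP` is a two-entry stand-in for the predefined map.

```python
from typing import Dict, List

# Hypothetical two-entry map, for illustration only.
_SKETCH_MAP = {"，": ",", "。": "."}


def process_batched_sketch(
    samples: Dict[str, List[str]], text_key: str = "text"
) -> Dict[str, List[str]]:
    """Normalize every text in the batch and return the batch with that column replaced."""
    samples[text_key] = [
        "".join(_SKETCH_MAP.get(ch, ch) for ch in text)
        for text in samples[text_key]
    ]
    return samples


batch = {"text": ["一，二。", "already ok."]}
print(process_batched_sketch(batch)["text"])
```

Only the text column is rewritten here; any other columns in the batch are returned as they were.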