data_juicer.ops.mapper.whitespace_normalization_mapper module

class data_juicer.ops.mapper.whitespace_normalization_mapper.WhitespaceNormalizationMapper(*args, **kwargs)[source]

Bases: Mapper

Normalizes various types of whitespace characters to standard spaces in text samples.

This mapper converts all non-standard whitespace characters, such as tabs and newlines, to the standard space character (’ ‘, 0x20). It also trims leading and trailing whitespace from the text. This ensures consistent spacing across all text samples, improving readability and consistency. The normalization process is based on a comprehensive list of whitespace characters, which can be found at https://en.wikipedia.org/wiki/Whitespace_character.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]