whitespace_normalization_mapper

Normalizes various types of whitespace characters to standard spaces in text samples.

This mapper converts all non-standard whitespace characters, such as tabs and newlines, to the standard space character (' ', 0x20). It also trims leading and trailing whitespace from the text. The result is consistent spacing across all text samples. The normalization is based on a comprehensive list of whitespace characters, which can be found at https://en.wikipedia.org/wiki/Whitespace_character.
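
The behavior described above can be approximated in a few lines of Python. This is a minimal sketch, not the operator's actual source: the name `normalize_whitespace` and the set `ASSUMED_WHITESPACE` are hypothetical, and the set is simply assembled from the characters in the demonstration input on this page plus newline and carriage return. The real operator may use a different list or a different order of steps.

```python
# Illustrative sketch only; not the operator's actual implementation.
# ASSUMED_WHITESPACE is a hypothetical set: the characters from this page's
# demonstration input, plus newline and carriage return.
ASSUMED_WHITESPACE = frozenset(
    "\t\n\r"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a"
    "\xa0\u202f\u205f\u3000\u200b\u200c\u200d\u2060\x84"
)


def normalize_whitespace(text: str) -> str:
    """Map each listed character to ' ' (0x20), then trim both ends."""
    # One output space per input character: runs are not collapsed.
    replaced = "".join(" " if ch in ASSUMED_WHITESPACE else ch for ch in text)
    # After replacement, only standard spaces can remain at the ends.
    return replaced.strip(" ")


print(repr(normalize_whitespace("\tx\u3000\u200by \n")))  # 'x  y'
```

Replacing before trimming is a design choice in this sketch: it lets leading or trailing characters that `str.strip()` would not recognize (for example U+200B) be removed along with ordinary spaces.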


Type: mapper

Tags: cpu, text

🔧 Parameter Configuration

| name | type | default | desc |
|------|------|---------|------|
| args | | '' | extra args |
| kwargs | | '' | extra args |

📊 Effect demonstration

test_case

WhitespaceNormalizationMapper()

📥 input data

Sample 1: list
['x \t\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\xa0\u202f\u205f\u3000\u200b\u200c\u200d\u2060\x84y']

📤 output data

Sample 1: list
['x                       y']

✨ explanation

This example demonstrates the operator's ability to convert various non-standard whitespace characters, such as tabs and special spaces, into standard spaces. Here, all the unusual whitespace characters between 'x' and 'y' are replaced with a series of standard spaces, so the output is 'x' and 'y' separated by that run of spaces rather than by a single space. This makes the text more consistent and readable.
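
As a side note on why a list-based approach matters for this input: several characters in the demonstration string are not classified as whitespace by Python's built-in `str.isspace()`, so an approach built on `str.split()` or `isspace()` alone would leave them in place. The sketch below checks this on the whitespace run from the input above; the helper name `missed_by_isspace` is hypothetical and only serves the illustration.

```python
import unicodedata

# The run of characters between 'x' and 'y' in the demonstration input.
run = (
    " \t\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a"
    "\xa0\u202f\u205f\u3000\u200b\u200c\u200d\u2060\x84"
)


def missed_by_isspace(chars):
    """Return the characters that str.isspace() does not treat as whitespace."""
    return [ch for ch in chars if not ch.isspace()]


missed = missed_by_isspace(run)
for ch in missed:
    # Print code point, Unicode general category, and name (if one exists).
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          unicodedata.name(ch, "<unnamed>"))
```

The characters reported are the zero-width and joiner code points (U+200B, U+200C, U+200D, U+2060) and the C1 control character U+0084, all of which fall outside the Unicode space-separator category.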