whitespace_normalization_mapper
Normalizes various types of whitespace characters to standard spaces in text samples.
This mapper converts non-standard whitespace characters, such as tabs and newlines, to the standard space character (' ', 0x20) and trims leading and trailing whitespace from the text. The result is consistent spacing across all text samples, which makes the text easier to read. Normalization is based on a comprehensive list of whitespace characters, which can be found at https://en.wikipedia.org/wiki/Whitespace_character.
将文本样本中的各种空白字符标准化为空格。
该映射器将所有非标准空白字符(如制表符和换行符)转换为标准空格字符 (' ', 0x20)。它还修剪文本前后的空白。这确保了所有文本样本的一致间距,提高了可读性和一致性。规范化过程基于全面的空白字符列表,可以在 https://en.wikipedia.org/wiki/Whitespace_character 找到。
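The behaviour described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the operator's actual implementation: the function name `normalize_whitespace` and the `WHITESPACE_CHARS` set are hypothetical, the set is only a small subset of the characters on the referenced list, and the order of the two steps (replace first, then trim) is an assumption.

```python
# Hypothetical subset of the whitespace list; the operator's real list is longer.
WHITESPACE_CHARS = frozenset({
    "\t",      # horizontal tab
    "\n",      # line feed
    "\r",      # carriage return
    "\u00a0",  # no-break space
    "\u2002",  # en space
    "\u2003",  # em space
    "\u2009",  # thin space
    "\u202f",  # narrow no-break space
    "\u3000",  # ideographic space
})


def normalize_whitespace(text: str) -> str:
    """Map each listed whitespace character to U+0020, then trim both ends."""
    # One-for-one replacement: every listed character becomes a single space.
    replaced = "".join(" " if ch in WHITESPACE_CHARS else ch for ch in text)
    # Remove leading and trailing whitespace left over after the replacement.
    return replaced.strip()


print(repr(normalize_whitespace("\tfoo\u3000bar\n")))  # 'foo bar'
```

Because the replacement is one-for-one, runs of whitespace are not collapsed: two consecutive tabs become two consecutive spaces.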
Type 算子类型: mapper
Tags 标签: cpu, text
🔧 Parameter Configuration 参数配置
| name 参数名 | type 类型 | default 默认值 | desc 说明 |
|---|---|---|---|
| | | | extra args |
| | | | extra args |
📊 Effect demonstration 效果演示
test_case
WhitespaceNormalizationMapper()
📥 input data 输入数据
['x \t\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\xa0\u202f\u205f\u3000\u200b\u200c\u200d\u2060\x84y']
📤 output data 输出数据
['x y']
✨ explanation 解释
This example shows the operator converting various non-standard whitespace characters, such as tabs and special spaces, into standard spaces. All of the unusual whitespace characters between 'x' and 'y' are replaced with a series of standard spaces, resulting in 'x y', which makes the text more consistent and readable.
此示例展示了算子将各种非标准空白字符(例如制表符和特殊空格)转换为标准空格的能力。在这个例子中,'x' 和 'y' 之间的所有不寻常的空白字符都被替换为一系列的标准空格,结果是 'x y'。这使得文本更加一致且易于阅读。
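To see why an explicit character list is useful for this input, the short check below inspects the characters between 'x' and 'y' in the sample. It uses only Python built-ins and is an observation about `str.isspace()`, not a description of how the operator is implemented; the variable names are illustrative.

```python
# The demo input from the test case above.
sample = (
    "x \t\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a"
    "\xa0\u202f\u205f\u3000\u200b\u200c\u200d\u2060\x84y"
)

# Everything between the leading 'x' and the trailing 'y'.
inner = sample[1:-1]

# Code points that Python's built-in str.isspace() does NOT treat as whitespace.
not_space = [f"U+{ord(ch):04X}" for ch in inner if not ch.isspace()]

print(len(inner))  # 22
print(not_space)   # ['U+200B', 'U+200C', 'U+200D', 'U+2060', 'U+0084']
```

The zero-width characters and the control character at the end are not whitespace as far as `str.isspace()` is concerned, so approaches built on `str.split()` or `str.strip()` alone would leave them in place.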