clean_ip_mapper

Cleans IPv4 and IPv6 addresses from text samples.

This operator removes or replaces IPv4 and IPv6 addresses in the text. It uses a regular expression to identify and clean the IP addresses. By default, it replaces the IP addresses with an empty string, effectively removing them. The operator can be configured with a custom pattern and replacement string. If no pattern is provided, a default pattern for both IPv4 and IPv6 addresses is used. The operator processes samples in batches.

  • Uses a regular expression to find and clean IP addresses.

  • Replaces found IP addresses with a specified replacement string.

  • Default replacement string is an empty string, which removes the IP addresses.

  • Can use a custom regular expression pattern if provided.

  • Processes samples in batches for efficiency.

从文本样本中清理IPv4和IPv6地址。

此算子删除或替换文本中的IPv4和IPv6地址。它使用正则表达式来识别和清理IP地址。默认情况下,它将IP地址替换为空字符串,从而删除它们。可以通过自定义模式和替换字符串配置算子。如果没有提供模式,则使用默认的IPv4和IPv6地址模式。算子以批量方式处理样本。

  • 使用正则表达式查找并清理IP地址。

  • 将找到的IP地址替换为指定的替换字符串。

  • 默认替换字符串为空字符串,从而删除IP地址。

  • 如果提供了自定义正则表达式模式,则可以使用。

  • 以批量方式处理样本以提高效率。

Type 算子类型: mapper

Tags 标签: cpu, text

🔧 Parameter Configuration 参数配置

name 参数名

type 类型

default 默认值

desc 说明

pattern

typing.Optional[str]

None

regular expression pattern to search for within text.

repl

<class ‘str’>

''

replacement string, default is empty string.

args

''

extra args

kwargs

''

extra args

📊 Effect demonstration 效果演示

test_ipv4

CleanIpMapper()

📥 input data 输入数据

Sample 1: list
['test of ip 234.128.124.123', '34.0.124.123', 'ftp://example.com/188.46.244.216my-page.html', 'ft174.1421.237.246my']

📤 output data 输出数据

Sample 1: list
['test of ip ', '', 'ftp://example.com/my-page.html', 'ft174.1421.237.246my']

✨ explanation 解释

This example demonstrates the operator’s default behavior of removing IPv4 addresses from the text. The operator uses a regular expression to find and remove any IPv4 addresses, leaving the rest of the text unchanged. In the output, you can see that the IPv4 addresses have been removed, and the remaining text is preserved as it is. For instance, ‘234.128.124.123’ is removed, resulting in ‘test of ip ‘. 这个例子展示了算子的默认行为,即从文本中移除IPv4地址。算子使用正则表达式来查找并移除任何IPv4地址,而其余文本保持不变。在输出中,你可以看到IPv4地址已经被移除,剩余的文本被保留。例如,’234.128.124.123’ 被移除后,结果是 ‘test of ip ‘。

test_replace_ipv4

CleanIpMapper(repl='<IP>')

📥 input data 输入数据

Sample 1: list
['test of ip 234.128.124.123', '34.0.124.123', 'ftp://example.com/188.46.244.216my-page.html', 'ft174.1421.237.246my']

📤 output data 输出数据

Sample 1: list
['test of ip <IP>', '<IP>', 'ftp://example.com/<IP>my-page.html', 'ft174.1421.237.246my']

✨ explanation 解释

This example shows how the operator can be configured to replace IPv4 addresses with a custom string, ‘’, instead of removing them. The operator still uses a regular expression to identify the IPv4 addresses, but instead of deleting them, it replaces each occurrence with the specified string. This is useful for preserving the structure of the text while marking where IP addresses were located. In the output, you can see that each IPv4 address is replaced by ‘’. 这个例子展示了如何配置算子用自定义字符串 ‘’ 替换IPv4地址,而不是移除它们。算子仍然使用正则表达式来识别IPv4地址,但不是删除它们,而是用指定的字符串替换每个出现的地址。这对于在保留文本结构的同时标记IP地址的位置非常有用。在输出中,你可以看到每个IPv4地址都被替换为 ‘’。