clean_links_mapper
Mapper to clean links like http/https/ftp in text samples.
This operator removes or replaces URLs and other web links in text. It identifies links with a regular expression and, by default, replaces each match with an empty string, which removes it. Both the pattern and the replacement string can be customized. Samples are processed in batches and the text is modified in place; a sample that contains no links is left unchanged.
Type: mapper
Tags: cpu, text
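The behavior described above can be sketched with Python's standard `re` module. This is a minimal illustration, not the operator's implementation: the helper name `clean_links_batch` and the simplified `SIMPLE_LINK_PATTERN` are assumptions made for this example, and the operator's real default pattern may match more or fewer kinds of links. The sketch also returns a new list instead of modifying samples in place.

```python
import re
from typing import List, Optional

# Assumption: a simplified stand-in pattern, NOT the operator's real default.
# It matches "<scheme>://<run of non-whitespace>", ignoring case.
SIMPLE_LINK_PATTERN = r"(?i)\b[a-z][a-z0-9+.-]*://\S+"


def clean_links_batch(
    texts: List[str], pattern: Optional[str] = None, repl: str = ""
) -> List[str]:
    """Hypothetical helper: replace every link in each text with `repl`."""
    compiled = re.compile(pattern if pattern is not None else SIMPLE_LINK_PATTERN)
    # re.sub returns the input string unchanged when nothing matches,
    # so samples without links pass through untouched.
    return [compiled.sub(repl, text) for text in texts]


batch = [
    "Docs live at https://example.com/guide today",
    "No link in this sample",
]
cleaned = clean_links_batch(batch)
print(cleaned)
```

With the default empty replacement, the text around a removed link is kept exactly as it was, including any whitespace that surrounded the link.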
🔧 Parameter Configuration

| name | type | default | desc |
|---|---|---|---|
| `pattern` | `typing.Optional[str]` | | regular expression pattern to search for within text. |
| `repl` | `<class 'str'>` | `''` | replacement string, default is empty string. |
| | | | extra args |
| | | | extra args |
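To show how a pattern and a replacement string interact, here is a small sketch that uses Python's `re.sub` directly as a stand-in for the operator. The values `custom_pattern` and `custom_repl` are hypothetical choices for this example: the pattern deliberately matches only HTTP(S) links, so other schemes are left in the text.

```python
import re

# Hypothetical values standing in for a custom pattern and replacement.
custom_pattern = r"(?i)https?://\S+"  # match HTTP and HTTPS links only
custom_repl = "[URL]"

samples = [
    "Mirror: ftp://files.example.com/pub and https://example.com/pub",
    "plain text",
]

# Apply the substitution to every sample in the batch.
rewritten = [re.sub(custom_pattern, custom_repl, s) for s in samples]
print(rewritten)
```

Narrowing the pattern is one way to keep some link types while masking others; a non-empty replacement leaves a visible marker where each link used to be.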
📊 Effect demonstration
test_mixed_https_links_text
CleanLinksMapper()
📥 input data
['This is a test,https://www.example.com/file.html?param1=value1&param2=value2', '这是个测试,https://example.com/my-page.html?param1=value1&param2=value2', '这是个测试,https://example.com']
📤 output data
['This is a test,', '这是个测试,', '这是个测试,']
✨ explanation
This example shows the operator removing HTTPS links from text that contains both plain text and a link. The operator identifies and removes the links and leaves the rest of the text intact. For example, 'This is a test,https://www.example.com/file.html?param1=value1&param2=value2' becomes 'This is a test,' after processing.
test_replace_links_text
CleanLinksMapper(repl='<LINKS>')
📥 input data
['ftp://user:password@ftp.example.com:21/', 'This is a sample for test', 'abcd://ef is a sample for test', 'HTTP://example.com/my-page.html?param1=value1&param2=value2']
📤 output data
['<LINKS>', 'This is a sample for test', '<LINKS> is a sample for test', '<LINKS>']
✨ explanation
This example demonstrates the operator replacing different types of links with a custom string, `<LINKS>`.