# clean_links_mapper

Mapper to clean links like http/https/ftp in text samples.

This operator removes or replaces URLs and other web links in the text. It uses a regular expression pattern to identify links and, by default, replaces each match with an empty string, which removes it. Both the pattern and the replacement string can be customized. The operator processes samples in batches and modifies the text in place. A sample that contains no links is left unchanged.

Type: **mapper**

Tags: cpu, text

## 🔧 Parameter Configuration

| name | type | default | desc |
|--------|------|--------|------|
| `pattern` | typing.Optional[str] | `None` | Regular expression pattern to search for within the text. |
| `repl` | | `''` | Replacement string; defaults to the empty string. |
| `args` | | `''` | extra args |
| `kwargs` | | `''` | extra args |

## 📊 Effect demonstration

### test_mixed_https_links_text

```python
CleanLinksMapper()
```

#### 📥 input data
Sample 1: list
['This is a test,https://www.example.com/file.html?param1=value1&param2=value2', '这是个测试,https://example.com/my-page.html?param1=value1&param2=value2', '这是个测试,https://example.com']
#### 📤 output data
Sample 1: list
['This is a test,', '这是个测试,', '这是个测试,']
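
The removal shown above can be approximated outside the operator with a plain `re.sub` call. This is a minimal sketch, not the operator's implementation: `SIMPLE_LINK_PATTERN` and `strip_links` are hypothetical names, and the pattern is a simplified stand-in for the operator's built-in default, which may behave differently on other inputs.

```python
import re

# Simplified stand-in pattern (an assumption, not the operator's default):
# a URI scheme, then "://", then everything up to whitespace or a bracket.
SIMPLE_LINK_PATTERN = re.compile(
    r"\b[a-z][a-z0-9+.\-]*://[^\s()<>]+", re.IGNORECASE
)


def strip_links(texts, repl=""):
    """Return a new list with every matched link replaced by `repl`."""
    return [SIMPLE_LINK_PATTERN.sub(repl, text) for text in texts]


samples = [
    "This is a test,https://www.example.com/file.html?param1=value1&param2=value2",
    "这是个测试,https://example.com/my-page.html?param1=value1&param2=value2",
    "这是个测试,https://example.com",
]
cleaned = strip_links(samples)
print(cleaned)  # ['This is a test,', '这是个测试,', '这是个测试,']
```

With the default empty `repl`, each link is simply dropped and the surrounding text is kept as is.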
#### ✨ explanation

This example shows the operator removing HTTPS links from samples that mix plain text with a link. The operator identifies and removes each link and leaves the rest of the text intact. For example, 'This is a test,https://www.example.com/file.html?param1=value1&param2=value2' becomes 'This is a test,' after processing.

### test_replace_links_text

```python
CleanLinksMapper(repl='<LINKS>')
```

#### 📥 input data
Sample 1: list
['ftp://user:password@ftp.example.com:21/', 'This is a sample for test', 'abcd://ef is a sample for test', 'HTTP://example.com/my-page.html?param1=value1&param2=value2']
#### 📤 output data
Sample 1: list
['<LINKS>', 'This is a sample for test', '<LINKS> is a sample for test', '<LINKS>']
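
The batched, in-place replacement with a placeholder can be sketched in a similar way. The following is an illustration under stated assumptions: the batch is taken to be a dict of column lists with a `text` key, `replace_links_batched` is a hypothetical helper, and `LINK_PATTERN` is a simplified stand-in rather than the operator's built-in default.

```python
import re

# Simplified stand-in pattern (an assumption, not the operator's default).
LINK_PATTERN = re.compile(r"\b[a-z][a-z0-9+.\-]*://[^\s()<>]+", re.IGNORECASE)


def replace_links_batched(batch, repl="", text_key="text"):
    """Replace links in batch[text_key] in place and return the batch."""
    for i, text in enumerate(batch[text_key]):
        # Leave samples without any link untouched.
        if LINK_PATTERN.search(text):
            batch[text_key][i] = LINK_PATTERN.sub(repl, text)
    return batch


batch = {
    "text": [
        "ftp://user:password@ftp.example.com:21/",
        "This is a sample for test",
        "abcd://ef is a sample for test",
        "HTTP://example.com/my-page.html?param1=value1&param2=value2",
    ]
}
replace_links_batched(batch, repl="<LINKS>")
print(batch["text"])
# ['<LINKS>', 'This is a sample for test', '<LINKS> is a sample for test', '<LINKS>']
```

Because the stand-in pattern accepts any scheme followed by `://`, it also rewrites `abcd://ef`, which matches the output listed above.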
#### ✨ explanation

This example demonstrates the operator replacing different types of links with the custom string '<LINKS>'. Every link in a sample is replaced by '<LINKS>', while samples without links remain unchanged. For instance, 'ftp://user:password@ftp.example.com:21/' is transformed into '<LINKS>', whereas 'This is a sample for test' stays as it is because it contains no link. The match is not limited to well-known schemes: 'abcd://ef' is replaced as well.

## 🔗 related links

- [source code](../../../data_juicer/ops/mapper/clean_links_mapper.py)
- [unit test](../../../tests/ops/mapper/test_clean_links_mapper.py)
- [Return operator list](../../Operators.md)