Data Juicer

clean_links_mapper

Mapper that cleans links (such as http, https, and ftp URLs) from text samples.

This operator removes or replaces URLs and other web links in text. It identifies links with a regular expression and, by default, replaces each match with an empty string, which removes the link. Both the pattern and the replacement string can be customized. Samples are processed in batches and the text is modified in place; a sample that contains no links is left unchanged.
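The behaviour described above can be approximated in a few lines of plain Python. This is a minimal sketch, not the operator's implementation: `SIMPLE_LINK_PATTERN`, `clean_links_batch`, and the `text_key` argument are hypothetical names introduced for illustration, and the pattern is a deliberately simplified stand-in for the operator's default, which is not shown on this page.

```python
import re

# Hypothetical, simplified link pattern: "<scheme>://<non-space characters>".
# The operator's real default pattern is not reproduced here.
SIMPLE_LINK_PATTERN = r"(?i)\b[a-z][a-z0-9+.-]*://[^\s<>]+"


def clean_links_batch(samples, pattern=None, repl="", text_key="text"):
    """Replace every link in a batch of texts, editing the batch in place."""
    regex = re.compile(SIMPLE_LINK_PATTERN if pattern is None else pattern)
    for idx, text in enumerate(samples[text_key]):
        # Samples without any link are left untouched.
        if not regex.search(text):
            continue
        samples[text_key][idx] = regex.sub(repl, text)
    return samples


batch = {
    "text": [
        "This is a test,https://www.example.com/file.html?param1=value1&param2=value2",
        "No link here",
    ]
}
clean_links_batch(batch)
print(batch["text"])  # ['This is a test,', 'No link here']
```

The batch is a dictionary of lists here only because that is a convenient shape for showing in-place, batched edits; the layout the operator actually receives may differ.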


Type: mapper

Tags: cpu, text

🔧 Parameter Configuration

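| name | type | default | desc |
|------|------|---------|------|
| pattern | typing.Optional[str] | None | Regular expression pattern to search for within the text. None selects the operator's default link pattern. |
| repl | str | '' | Replacement string; defaults to the empty string. |
| args | | '' | extra args |
| kwargs | | '' | extra args |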
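To make the roles of `pattern` and `repl` concrete, the sketch below applies a caller-supplied pattern and replacement with Python's standard `re.sub`. It is an illustration under stated assumptions rather than the operator's code: `replace_links`, `HTTP_ONLY`, and the `[URL]` placeholder are hypothetical names chosen for this example.

```python
import re


def replace_links(text, pattern, repl=""):
    """Apply a caller-supplied link pattern and replacement to one string."""
    return re.sub(pattern, repl, text)


# Hypothetical custom pattern: only http:// and https:// count as links.
HTTP_ONLY = r"(?i)\bhttps?://\S+"

# The https link matches the custom pattern and is replaced.
print(replace_links("docs at https://example.com/a?b=1", HTTP_ONLY, repl="[URL]"))
# The ftp link does not match the custom pattern, so the text is unchanged.
print(replace_links("mirror at ftp://ftp.example.com/pub", HTTP_ONLY, repl="[URL]"))
```

A narrower pattern such as this one trades coverage for precision: anything it does not match, including other URL schemes, passes through untouched.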

📊 Effect demonstration

test_mixed_https_links_text

CleanLinksMapper()

📥 input data

Sample 1: list
['This is a test,https://www.example.com/file.html?param1=value1&param2=value2', '这是个测试,https://example.com/my-page.html?param1=value1&param2=value2', '这是个测试,https://example.com']

📤 output data

Sample 1: list
['This is a test,', '这是个测试,', '这是个测试,']

✨ explanation

This example shows the operator removing HTTPS links from samples that mix plain text with a link. Each link is identified and removed, and the rest of the text is left intact. For example, 'This is a test,https://www.example.com/file.html?param1=value1&param2=value2' becomes 'This is a test,' after processing.

test_replace_links_text

CleanLinksMapper(repl='<LINKS>')

📥 input data

Sample 1: list
['ftp://user:password@ftp.example.com:21/', 'This is a sample for test', 'abcd://ef is a sample for test', 'HTTP://example.com/my-page.html?param1=value1&param2=value2']

📤 output data

Sample 1: list
['<LINKS>', 'This is a sample for test', '<LINKS> is a sample for test', '<LINKS>']

✨ explanation

This example demonstrates the operator replacing different types of links with the custom string '<LINKS>'. Every link in a sample is replaced by '<LINKS>', while samples without links remain unchanged. For instance, 'ftp://user:password@ftp.example.com:21/' is transformed into '<LINKS>', whereas 'This is a sample for test' stays as it is because it contains no links. The match is not limited to well-known schemes or to lowercase: 'abcd://ef' and 'HTTP://example.com/my-page.html?param1=value1&param2=value2' are both replaced as well.
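The input and output pairs of this example can be reproduced with a short standalone script. The sketch assumes a simplified, hypothetical pattern (`LINK`) that treats any case-insensitive "scheme://" token as a link; the operator's own default pattern is not shown here and may behave differently on other inputs. `re.subn` is used so that the number of replacements per sample is visible.

```python
import re

# Hypothetical simplified pattern: any "<scheme>://..." token, case-insensitive.
LINK = re.compile(r"\b[a-z][a-z0-9+.-]*://[^\s<>]+", re.IGNORECASE)

samples = [
    "ftp://user:password@ftp.example.com:21/",
    "This is a sample for test",
    "abcd://ef is a sample for test",
    "HTTP://example.com/my-page.html?param1=value1&param2=value2",
]

# re.subn returns (new_text, number_of_replacements) for each sample.
results = [LINK.subn("<LINKS>", s) for s in samples]

for original, (cleaned, count) in zip(samples, results):
    status = "unchanged" if count == 0 else f"{count} link(s) replaced"
    print(f"{original!r} -> {cleaned!r} ({status})")
```

A replacement count of zero identifies the samples that pass through unchanged, which in this data is only 'This is a sample for test'.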

🔗 related links

  • source code

  • unit test

  • Return operator list

Copyright © 2024, Data-Juicer Team