Data Juicer

clean_copyright_mapper

Cleans copyright comments at the beginning of text samples.

This operator removes copyright comments from the start of text samples. It identifies and strips multiline comments that contain the word “copyright” using a regular expression. It also greedily removes lines starting with comment markers like //, #, or -- at the beginning of the text, as these are often part of copyright headers. The operator processes each sample individually but can handle batches for efficiency.
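
The behavior described above can be sketched in plain Python. This is a minimal illustration inferred from the description and the example on this page, not the operator's actual source: the function names `clean_copyright` and `clean_copyright_batch`, the exact regular expressions, the case-insensitive match on "copyright", and the choice to inspect only the first block comment and skip the line-based pass when one is found are all assumptions.

```python
import re

# Assumed patterns: a non-greedy C-style block comment, and a
# case-insensitive match on the word "copyright".
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
COPYRIGHT = re.compile(r"copyright", re.IGNORECASE)
LINE_MARKERS = ("//", "#", "--")


def clean_copyright(text: str) -> str:
    """Hypothetical stand-in for the operator's per-sample logic."""
    match = BLOCK_COMMENT.search(text)
    if match:
        # Only the first block comment is inspected (assumption). It is cut
        # out when it mentions "copyright"; otherwise the text is unchanged.
        if COPYRIGHT.search(match.group()):
            return text[:match.start()] + text[match.end():]
        return text

    # No block comment: drop leading lines that start with a comment marker,
    # stopping at the first line that does not.
    lines = text.split("\n")
    skip = 0
    for line in lines:
        if line.startswith(LINE_MARKERS):
            skip += 1
        else:
            break
    return "\n".join(lines[skip:])


def clean_copyright_batch(texts: list[str]) -> list[str]:
    """Apply the per-sample function to a batch of texts."""
    return [clean_copyright(t) for t in texts]


print(clean_copyright_batch(["//header\n//more\n body", "http://example.com"]))
# prints [' body', 'http://example.com']
```

Under these assumptions the sketch reproduces the input/output pairs shown in the effect demonstration on this page, including the URL that is left alone because `http://` does not start with `//`.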

Type: mapper

Tags: cpu, text

🔧 Parameter Configuration

| name   | type | default | desc       |
|--------|------|---------|------------|
| args   |      | ''      | extra args |
| kwargs |      | ''      | extra args |

📊 Effect demonstration

test_clean_copyright

CleanCopyrightMapper()

📥 input data

Sample 1: list
['这是一段 /* 多行注释\n注释内容copyright\n*/ 的文本。另外还有一些 // 单行注释。', '如果多行/*注释中没有\n关键词,那么\n这部分注释也不会\n被清除*/\n会保留下来', '//if start with\n//that will be cleaned \n evenly', 'http://www.nasosnsncc.com', '#if start with\nthat will be cleaned \n#evenly', '--if start with\n--that will be cleaned \n#evenly']

📤 output data

Sample 1: list
['这是一段  的文本。另外还有一些 // 单行注释。', '如果多行/*注释中没有\n关键词,那么\n这部分注释也不会\n被清除*/\n会保留下来', ' evenly', 'http://www.nasosnsncc.com', 'that will be cleaned \n#evenly', '']

✨ explanation

This example shows the two kinds of removal the operator performs. A multi-line comment is stripped only when it contains 'copyright': in the first sample the block comment is removed even though it sits in the middle of the text, while the trailing '//' comment stays because it is not at the start of the text. In the second sample the block comment has no 'copyright' keyword, so the text is returned unchanged.

Lines that start with '//', '#', or '--' are removed only while they form an unbroken run at the beginning of the text. In the third sample both leading '//' lines are dropped and ' evenly' is kept. In the fifth sample only the first '#' line is removed, and the later '#evenly' survives because a non-comment line precedes it. The URL in the fourth sample is untouched because it does not begin with a comment marker. In the last sample every line starts with a comment marker, so the output is an empty string.

🔗 related links

  • source code

  • unit test

  • Return operator list

Copyright © 2024, Data-Juicer Team