Data Juicer

data_juicer.ops.mapper.clean_copyright_mapper module

class data_juicer.ops.mapper.clean_copyright_mapper.CleanCopyrightMapper(*args, **kwargs)

Bases: Mapper

Cleans copyright comments at the beginning of text samples.

This operator removes copyright comments from the start of text samples. Using a regular expression, it identifies and strips multiline comments that contain the word “copyright”. It also greedily removes lines at the beginning of the text that start with a comment marker such as //, #, or --, since these are often part of copyright headers. Each sample is processed individually, but the operator accepts batches for efficiency.
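The two passes described above can be sketched in plain Python. This is a minimal illustration, not the library's code: the regular expressions, the helper name `strip_copyright_header`, and the choice to run the line-based pass only when no block comment is found are all assumptions, since the description does not spell them out.

```python
import re

# Assumed patterns: the description only says "multiline comments that
# contain the word copyright" and does not give the actual expressions.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
COPYRIGHT = re.compile(r"copyright", re.IGNORECASE)
LINE_MARKERS = ("//", "#", "--")


def strip_copyright_header(text: str) -> str:
    """Hypothetical stand-in for the per-sample cleaning step."""
    match = BLOCK_COMMENT.search(text)
    if match:
        # Cut the first block comment only if it mentions "copyright".
        if COPYRIGHT.search(match.group(0)):
            return text[:match.start()] + text[match.end():]
        return text

    # No block comment (assumed branch): greedily drop leading comment lines.
    lines = text.split("\n")
    skip = 0
    while skip < len(lines) and lines[skip].startswith(LINE_MARKERS):
        skip += 1
    return "\n".join(lines[skip:])
```

Note that the greedy pass drops every leading comment line, whether or not it mentions copyright, which matches the "greedily removes" wording above.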

__init__(*args, **kwargs)

Initialization method.

Parameters:
  • args – extra positional arguments

  • kwargs – extra keyword arguments

process_batched(samples)
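
The page gives no description for this method, so the following is only a sketch of what a batched call over `samples` might look like. It assumes a batch is a dict mapping column names to lists and that the text lives under a `"text"` key. `clean_batch` and `drop_leading_hash_lines` are hypothetical names used for illustration, not part of Data-Juicer.

```python
from typing import Callable, Dict, List


def clean_batch(samples: Dict[str, List[str]],
                clean_one: Callable[[str], str],
                text_key: str = "text") -> Dict[str, List[str]]:
    """Apply a per-sample cleaner to every text in a column-oriented batch."""
    # Assumed layout: samples[text_key] is a list with one string per sample.
    samples[text_key] = [clean_one(text) for text in samples[text_key]]
    return samples


def drop_leading_hash_lines(text: str) -> str:
    """Toy per-sample cleaner: remove leading lines that start with '#'."""
    lines = text.split("\n")
    while lines and lines[0].startswith("#"):
        lines.pop(0)
    return "\n".join(lines)


batch = {"text": ["# Copyright Foo\nprint(1)", "print(2)"]}
cleaned = clean_batch(batch, drop_leading_hash_lines)
```

Each sample is still cleaned on its own; batching only amortizes the per-call overhead across the list.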
Copyright © 2024, Data-Juicer Team