Data Juicer
Version: v1.4.3

data_juicer.ops.mapper.clean_copyright_mapper module

class data_juicer.ops.mapper.clean_copyright_mapper.CleanCopyrightMapper(*args, **kwargs)[source]

Bases: Mapper

Cleans copyright comments at the beginning of text samples.

This operator removes copyright comments from the start of text samples. It identifies and strips multiline comments that contain the word "copyright" using a regular expression. It also greedily removes lines starting with comment markers like //, #, or -- at the beginning of the text, as these are often part of copyright headers. The operator processes each sample individually but can handle batches for efficiency.
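The behavior described above can be approximated with a small standalone function. This is a minimal sketch, not the library's source: the regular expression for block comments, the order of the two steps, and the name `clean_copyright` are all assumptions made for illustration.

```python
import re

# Assumed pattern for a C-style block comment: /* ... */ (may span lines).
BLOCK_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")
# The word "copyright", matched case-insensitively.
COPYRIGHT = re.compile(r"copyright", re.IGNORECASE)
# Line-comment markers named in the description above.
LINE_MARKERS = ("//", "#", "--")


def clean_copyright(text: str) -> str:
    """Hypothetical helper: strip a copyright header from one text sample."""
    # Step 1: cut the first block comment if it mentions "copyright".
    match = BLOCK_COMMENT.search(text)
    if match and COPYRIGHT.search(match.group(0)):
        return text[:match.start()] + text[match.end():]

    # Step 2: greedily drop leading lines that start with a comment marker.
    lines = text.split("\n")
    skip = 0
    for line in lines:
        if line.startswith(LINE_MARKERS):
            skip += 1
        else:
            break
    return "\n".join(lines[skip:])


block_sample = "/*\n * Copyright 2024 Example Corp\n */\nint main() {}"
line_sample = "// Copyright 2024\n// All rights reserved\nx = 1"

print(repr(clean_copyright(block_sample)))  # '\nint main() {}'
print(repr(clean_copyright(line_sample)))   # 'x = 1'
```

Because the second step is greedy, the sketch also drops leading lines that merely look like comments, such as a `#!` shebang or a C `#include`, even when no copyright notice is present.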

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args -- extra positional arguments

  • kwargs -- extra keyword arguments

process_batched(samples)[source]
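The page does not document the shape of `samples`. The sketch below assumes a column-oriented batch, that is, a dict mapping each field name to a list of values, with the text stored under a `"text"` key. It shows how a per-sample cleaning function could be mapped over such a batch. The helper names `apply_to_text_column` and `drop_leading_hash_lines` are hypothetical stand-ins, not part of Data-Juicer.

```python
from typing import Any, Callable, Dict, List


def drop_leading_hash_lines(text: str) -> str:
    """Hypothetical stand-in for the per-sample cleaner: drops leading '#' lines."""
    lines = text.split("\n")
    while lines and lines[0].startswith("#"):
        lines.pop(0)
    return "\n".join(lines)


def apply_to_text_column(
    samples: Dict[str, List[Any]],
    fn: Callable[[str], str],
    text_key: str = "text",  # assumed field name
) -> Dict[str, List[Any]]:
    """Apply fn to every text in the batch; other columns are left untouched."""
    samples[text_key] = [fn(text) for text in samples[text_key]]
    return samples


batch = {
    "text": ["# Copyright 2024\nprint('hi')", "no header here"],
    "meta": [{"id": 1}, {"id": 2}],
}
cleaned = apply_to_text_column(batch, drop_leading_hash_lines)
print(cleaned["text"])  # ["print('hi')", 'no header here']
```

Processing a whole column per call keeps the per-sample logic simple while letting the framework hand over many samples at once.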
Copyright © 2024, Data-Juicer Team