clean_copyright_mapper

Cleans copyright comments at the beginning of text samples.

This operator removes copyright comments from text samples. Using a regular expression, it locates a multiline comment (`/* ... */`) and strips it if the comment contains the word "copyright"; as Sample 1 in the demonstration shows, such a comment does not have to sit at the very start of the text. It also greedily removes leading lines that start with a comment marker such as `//`, `#`, or `--`, since these are often part of copyright headers. Stripping stops at the first line that does not start with one of these markers. The operator processes each sample individually but can handle batches for efficiency.
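The behaviour described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual source: the function name `strip_copyright_header`, the regular expression used for block comments, the case-insensitive keyword match, the treatment of empty leading lines, and the choice to skip line-based stripping once a block comment has been found are all assumptions. Under those assumptions it reproduces the sample outputs listed in the effect demonstration.

```python
import re

# Assumed pattern for a C-style /* ... */ block comment (may span several lines).
BLOCK_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")
# Assumed keyword check; case-insensitivity is a guess, not stated in the docs.
COPYRIGHT = re.compile(r"copyright", re.IGNORECASE)
# Single-line comment markers named in the operator description.
LINE_MARKERS = ("//", "#", "--")


def strip_copyright_header(text: str) -> str:
    """Hypothetical helper that mimics the documented behaviour."""
    match = BLOCK_COMMENT.search(text)
    if match:
        # Only the first block comment is inspected (assumption).
        # It is cut out only if it mentions "copyright".
        if COPYRIGHT.search(match.group()):
            text = text[:match.start()] + text[match.end():]
        return text

    # No block comment: greedily skip leading lines that are empty
    # or start with a comment marker, then stop at the first other line.
    lines = text.split("\n")
    skip = 0
    for line in lines:
        if not line or line.startswith(LINE_MARKERS):
            skip += 1
        else:
            break
    return "\n".join(lines[skip:])


if __name__ == "__main__":
    header = "// Copyright 2024\n# license text\ncode()"
    print(repr(strip_copyright_header(header)))  # prints 'code()'
```

Because the line-based pass only looks at the start of each line, a URL such as `http://www.nasosnsncc.com` is left untouched even though it contains `//`.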

Type: mapper

Tags: cpu, text

🔧 Parameter Configuration

| name | type | default | desc |
|---|---|---|---|
| `args` | | `''` | extra args |
| `kwargs` | | `''` | extra args |

📊 Effect demonstration

test_clean_copyright

CleanCopyrightMapper()

📥 Input data

Sample 1: text

这是一段 /* 多行注释
注释内容copyright
*/ 的文本。另外还有一些 // 单行注释。

Sample 2: text

如果多行/*注释中没有
关键词,那么
这部分注释也不会
被清除*/
会保留下来

Sample 3: text

//if start with
//that will be cleaned 
 evenly

Sample 4: text

http://www.nasosnsncc.com

Sample 5: text

#if start with
that will be cleaned 
#evenly

Sample 6: text

--if start with
--that will be cleaned 
#evenly

📤 Output data

Sample 1: text

这是一段  的文本。另外还有一些 // 单行注释。

Sample 2: text

如果多行/*注释中没有
关键词,那么
这部分注释也不会
被清除*/
会保留下来

Sample 3: text

 evenly

Sample 4: text

http://www.nasosnsncc.com

Sample 5: text

that will be cleaned 
#evenly

Sample 6: empty

✨ Explanation

This example shows how the operator removes copyright comments, both multi-line and single-line, from text samples. A multi-line comment containing "copyright" is stripped, and leading lines that start with `//`, `#`, or `--` are removed. In Sample 1, the multi-line comment with "copyright" is removed, while the trailing `//` comment remains because it does not begin a line at the start of the text. In Sample 2, the multi-line comment is kept because it does not contain the keyword. In Sample 4, the URL is untouched because `//` appears in the middle of the line. In Sample 5, only the first line is removed: stripping stops at the first line without a marker, so the later `#evenly` line stays. In Sample 6, every line starts with a comment marker, so the whole text is removed and the output is empty.
