clean_copyright_mapper
Cleans copyright comments at the beginning of text samples.
This operator removes copyright comments from the start of text samples. It identifies and strips multiline comments that contain the word “copyright” using a regular expression. It also greedily removes lines starting with comment markers like `//`, `#`, or `--` at the beginning of the text, as these are often part of copyright headers. The operator processes each sample individually but can handle batches for efficiency.
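The behavior described above can be approximated with the standalone sketch below. It is not the operator's actual source: the function name `strip_copyright`, the two regular expressions, the case-insensitive keyword match, and the choice to return early once a block comment is found are all assumptions, chosen to be consistent with the description and with the demonstration data on this page.

```python
import re

# Assumed patterns (hypothetical, not taken from the operator's source):
# the first /* ... */ block comment, and a case-insensitive keyword.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
KEYWORD = re.compile(r"copyright", re.IGNORECASE)
LINE_MARKERS = ("//", "#", "--")


def strip_copyright(text: str) -> str:
    """Remove a copyright block comment, or leading comment-marker lines."""
    match = BLOCK_COMMENT.search(text)
    if match:
        # Cut the block comment only if it mentions the keyword.
        if KEYWORD.search(match.group(0)):
            text = text[:match.start()] + text[match.end():]
        # Assumption: when a block comment exists, leading lines are left alone.
        return text

    # Greedily drop leading lines that start with a comment marker,
    # stopping at the first line that does not.
    lines = text.split("\n")
    skip = 0
    for line in lines:
        if line.startswith(LINE_MARKERS):
            skip += 1
        else:
            break
    return "\n".join(lines[skip:])


samples = [
    "//if start with\n//that will be cleaned \n evenly",
    "http://www.nasosnsncc.com",
]
print([strip_copyright(s) for s in samples])
# [' evenly', 'http://www.nasosnsncc.com']
```

The URL is left untouched because its `//` is not at the start of a line, which matches the fourth sample in the demonstration.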
Type: mapper
Tags: cpu, text
🔧 Parameter Configuration
| name | type | default | desc |
|---|---|---|---|
| | | | extra args |
| | | | extra args |
📊 Effect demonstration
test_clean_copyright
CleanCopyrightMapper()
📥 input data
['这是一段 /* 多行注释\n注释内容copyright\n*/ 的文本。另外还有一些 // 单行注释。', '如果多行/*注释中没有\n关键词,那么\n这部分注释也不会\n被清除*/\n会保留下来', '//if start with\n//that will be cleaned \n evenly', 'http://www.nasosnsncc.com', '#if start with\nthat will be cleaned \n#evenly', '--if start with\n--that will be cleaned \n#evenly']
📤 output data
['这是一段 的文本。另外还有一些 // 单行注释。', '如果多行/*注释中没有\n关键词,那么\n这部分注释也不会\n被清除*/\n会保留下来', ' evenly', 'http://www.nasosnsncc.com', 'that will be cleaned \n#evenly', '']
✨ explanation
This example demonstrates how the operator removes copyright comments, covering both multi-line and single-line comment styles. A multi-line comment is stripped only if it contains ‘copyright’: in the first sample the comment is removed, while the trailing single-line comment remains because it is not at the start of the text. In the second sample the multi-line comment lacks the keyword, so the text is unchanged. Lines at the beginning of the text that start with ‘//’, ‘#’, or ‘--’ are removed greedily, and removal stops at the first line without a marker. This is why the third sample keeps only ‘ evenly’, and why the fifth sample keeps ‘#evenly’: that line follows a line with no marker. The URL in the fourth sample is untouched because its ‘//’ is not at the start of a line. In the last sample every line starts with a comment marker, so all content is removed.