data_juicer.ops.mapper.remove_comments_mapper module
- class data_juicer.ops.mapper.remove_comments_mapper.RemoveCommentsMapper(doc_type: str | List[str] = 'tex', inline: bool = True, multiline: bool = True, *args, **kwargs)
Bases:
Mapper
Removes comments from documents, currently supporting only 'tex' format.
This operator removes inline and multiline comments from text samples; the inline and multiline parameters control which kinds are removed. It processes each sample in the batch, applies regular expressions to strip the comments, and writes the processed text back to the original samples.
Inline comments are removed using the pattern `[^\\]%.+$`, which skips a `%` that is escaped with a backslash.
Multiline comments are removed using the pattern `^%.*\n?`.
Important notes:
- Only the 'tex' document type is supported at present.
- The operator processes the text in place and updates the original samples.
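The effect of the two regular expressions can be sketched with the standard `re` module. This is a minimal illustration, not the operator's actual code. The helper name `strip_tex_comments` is hypothetical. The pattern strings `[^\\]%.+$` and `^%.*\n?`, the use of `re.MULTILINE`, and the order of the two passes (inline first) are all assumptions here.

```python
import re

# Assumed patterns, reconstructed from the class description.
INLINE_PATTERN = r"[^\\]%.+$"   # an unescaped '%' followed by text up to end of line
MULTILINE_PATTERN = r"^%.*\n?"  # a whole line that starts with '%'


def strip_tex_comments(text: str, inline: bool = True, multiline: bool = True) -> str:
    """Hypothetical helper mirroring the described behavior of the operator."""
    if inline:
        # Also consumes the single character preceding '%' (e.g. a space).
        text = re.sub(INLINE_PATTERN, "", text, flags=re.MULTILINE)
    if multiline:
        # Drops full comment lines together with their trailing newline.
        text = re.sub(MULTILINE_PATTERN, "", text, flags=re.MULTILINE)
    return text


sample = (
    "% header comment\n"
    "\\section{Intro} % trailing note\n"
    "We gain 5\\% here.\n"
)
cleaned = strip_tex_comments(sample)
print(cleaned)
```

In this sample the leading comment line and the trailing note are removed, while the escaped `\%` in the last line is left alone because the character before `%` is a backslash.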
- __init__(doc_type: str | List[str] = 'tex', inline: bool = True, multiline: bool = True, *args, **kwargs)
Initialization method.
- Parameters:
doc_type -- Type of document to remove comments from. Only 'tex' is currently supported.
inline -- Whether to remove inline comments.
multiline -- Whether to remove multiline comments.
args -- Extra positional arguments.
kwargs -- Extra keyword arguments.