data_juicer.ops.mapper.remove_header_mapper module

class data_juicer.ops.mapper.remove_header_mapper.RemoveHeaderMapper(drop_no_head: bool = True, *args, **kwargs)[source]

Bases: Mapper

Removes headers at the beginning of documents in LaTeX samples.

This operator uses a regular expression to identify and remove headers introduced by the LaTeX sectioning commands chapter, part, section, subsection, subsubsection, paragraph, and subparagraph. If a sample contains no headers, the outcome depends on drop_no_head: when it is True, the sample text is removed; when it is False, the sample is left unchanged. The operator processes samples in batches for efficiency.
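The matching and the drop_no_head branch described above can be sketched in plain Python. This is a minimal illustration, not the operator's implementation: the pattern SECTION_CMD and the helpers strip_leading_text and process_batch are hypothetical names, the batch is assumed to be a dict of lists keyed by "text", and "removing the header" is assumed here to mean discarding the text that precedes the first sectioning command. The operator's actual regular expression and behavior may differ.

```python
import re

# Illustrative pattern (an assumption, not the operator's actual regex):
# a sectioning command, an optional star, an optional [short title],
# and a {title} argument.
SECTION_CMD = re.compile(
    r"\\(?:chapter|part|section|subsection|subsubsection|paragraph|subparagraph)\b"
    r"\*?(?:\[[^\]]*\])?\{[^}]*\}"
)


def strip_leading_text(text: str, drop_no_head: bool = True) -> str:
    """Hypothetical helper mirroring the behavior described above."""
    match = SECTION_CMD.search(text)
    if match is None:
        # No sectioning command found: empty the text or keep it as-is.
        return "" if drop_no_head else text
    # Assumption: keep the text from the first sectioning command onward.
    return text[match.start():]


def process_batch(samples: dict, text_key: str = "text",
                  drop_no_head: bool = True) -> dict:
    """Apply the helper to every text in a column-oriented batch."""
    samples[text_key] = [
        strip_leading_text(t, drop_no_head) for t in samples[text_key]
    ]
    return samples


batch = {
    "text": [
        "\\documentclass{article}\n\\section{Intro}\nBody.",
        "No commands here.",
    ]
}
result = process_batch(batch)
```

With drop_no_head left at True, the second text (which has no sectioning command) becomes an empty string, while passing drop_no_head=False would leave it untouched.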

__init__(drop_no_head: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • drop_no_head -- whether to drop sample texts without headers.

  • args -- extra positional args

  • kwargs -- extra keyword args

process_batched(samples)[source]