data_juicer.ops.mapper.remove_table_text_mapper module

class data_juicer.ops.mapper.remove_table_text_mapper.RemoveTableTextMapper(min_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 2, max_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 20, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove table texts from text samples.

This operator uses regular expressions to identify and remove tables from the text. It targets only tables whose column count falls within a configured range, bounded by min_col and max_col. The operator iterates over each sample in the batch, applies the regex pattern to remove tables that match the column criteria, and stores the processed text back in the sample. The operation is batched for efficiency.
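The behavior described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the regular expression, the helper names `build_table_pattern`, `remove_tables`, and `process_batched_sketch`, and the batch layout (a dict mapping a `"text"` key to a list of strings) are all assumptions made for this example.

```python
import re


def build_table_pattern(n_cols: int) -> re.Pattern:
    """Hypothetical helper: match 2+ consecutive lines of exactly n_cols cells."""
    # One row: n_cols whitespace-free cells separated by a single space or tab.
    row = r"\S+(?:[ \t]\S+){%d}" % (n_cols - 1)
    # A table: at least two such rows, each followed by one or more newlines.
    return re.compile(r"^(?:%s\n+){2,}" % row, flags=re.MULTILINE)


def remove_tables(text: str, min_col: int = 2, max_col: int = 20) -> str:
    """Remove table-like blocks for every column count in [min_col, max_col]."""
    for n_cols in range(min_col, max_col + 1):
        text = build_table_pattern(n_cols).sub("", text)
    return text


def process_batched_sketch(samples: dict, text_key: str = "text",
                           min_col: int = 2, max_col: int = 20) -> dict:
    """Assumed batch layout: a dict of lists, with texts under text_key."""
    samples[text_key] = [
        remove_tables(text, min_col, max_col) for text in samples[text_key]
    ]
    return samples


batch = {"text": ["Intro paragraph here.\nname age\nalice 30\nbob 25\nDone.\n"]}
result = process_batched_sketch(batch)
print(repr(result["text"][0]))  # 'Intro paragraph here.\nDone.\n'
```

Because a heuristic like this treats any run of similarly shaped lines as a table, raising min_col is one way to avoid deleting short two-word lines that are not tabular data.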

__init__(min_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 2, max_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 20, *args, **kwargs)[source]

Initialization method.

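Parameters:
  • min_col -- The minimum number of columns a table must have to be removed. Must be between 2 and 20; defaults to 2.

  • max_col -- The maximum number of columns a table may have to be removed. Must be between 2 and 20; defaults to 20.

  • args -- Extra positional arguments.

  • kwargs -- Extra keyword arguments.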

process_batched(samples)[source]