data_juicer.ops.mapper.extract_tables_from_html_mapper module

class data_juicer.ops.mapper.extract_tables_from_html_mapper.ExtractTablesFromHtmlMapper(tables_field_name: str = 'html_tables', retain_html_tags: bool = False, include_header: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to extract tables from HTML content.

__init__(tables_field_name: str = 'html_tables', retain_html_tags: bool = False, include_header: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:

tables_field_name -- Field name to store the extracted tables.

retain_html_tags -- If True, retains HTML tags in the tables; otherwise, removes them.

include_header -- If True, includes the table header; otherwise, excludes it. This parameter is effective only when retain_html_tags is False and applies solely to the extracted table content.
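To make the interaction between retain_html_tags and include_header concrete, here is a minimal standard-library sketch. It is not the data_juicer implementation: the function name extract_tables, the regular expressions, and the list-of-rows return shape are all assumptions chosen for illustration, and the real operator may structure its output differently.

```python
import re

# Illustrative patterns only; the real operator may parse HTML differently.
TABLE_RE = re.compile(r"<table\b.*?>.*?</table>", re.DOTALL | re.IGNORECASE)
ROW_RE = re.compile(r"<tr\b.*?>(.*?)</tr>", re.DOTALL | re.IGNORECASE)
CELL_RE = re.compile(r"<(td|th)\b.*?>(.*?)</\1>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def extract_tables(html, retain_html_tags=False, include_header=True):
    """Hypothetical helper mirroring the documented parameters."""
    tables = []
    for table_html in TABLE_RE.findall(html):
        if retain_html_tags:
            # Keep the raw markup; include_header has no effect here.
            tables.append(table_html)
            continue
        rows = []
        for row_html in ROW_RE.findall(table_html):
            cells = [
                TAG_RE.sub("", content).strip()
                for tag, content in CELL_RE.findall(row_html)
                # Header cells (<th>) are dropped when include_header is False.
                if include_header or tag.lower() == "td"
            ]
            if cells:
                rows.append(cells)
        tables.append(rows)
    return tables


html = (
    "<p>intro</p><table><tr><th>Name</th><th>Age</th></tr>"
    "<tr><td><b>Ann</b></td><td>30</td></tr></table>"
)
with_header = extract_tables(html)
without_header = extract_tables(html, include_header=False)
raw = extract_tables(html, retain_html_tags=True, include_header=False)
```

In this sketch, with_header contains both the header row and the data row, without_header contains only the data row, and raw holds the untouched table markup regardless of include_header.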

process_single(sample)[source]

Processes one sample at a time: takes a sample and returns the processed sample (sample --> sample).

Parameters:

sample -- sample to process

Returns:

processed sample
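The sample-to-sample contract can be pictured with a small stand-in class. This is a hedged sketch rather than the library's code: SketchTablesMapper, the 'text' input key, and storing the result at the top level of the sample dict are assumptions made for illustration, and the extraction is simplified to keeping the raw table markup.

```python
import re


class SketchTablesMapper:
    """Hypothetical stand-in, not the data_juicer class."""

    def __init__(self, tables_field_name="html_tables", text_key="text"):
        self.tables_field_name = tables_field_name
        self.text_key = text_key  # assumed name of the input field

    def process_single(self, sample):
        html = sample.get(self.text_key, "")
        # Simplification for this sketch: keep the raw <table> markup.
        sample[self.tables_field_name] = re.findall(
            r"<table\b.*?>.*?</table>", html, flags=re.DOTALL | re.IGNORECASE
        )
        # The same sample object is returned with one extra field.
        return sample


sample = {"text": "<div><table><tr><td>1</td></tr></table></div>"}
processed = SketchTablesMapper().process_single(sample)
```

Here the input text is left unchanged and the extracted tables are added under the configured field name, which is the shape a sample-level mapper is expected to follow.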