data_juicer.ops.mapper.extract_tables_from_html_mapper module

class data_juicer.ops.mapper.extract_tables_from_html_mapper.ExtractTablesFromHtmlMapper(tables_field_name: str = 'html_tables', retain_html_tags: bool = False, include_header: bool = True, *args, **kwargs)

Bases: Mapper

Mapper to extract tables from HTML content.

__init__(tables_field_name: str = 'html_tables', retain_html_tags: bool = False, include_header: bool = True, *args, **kwargs)

Initialization method.

Parameters:

tables_field_name – Field name to store the extracted tables.

retain_html_tags – If True, retains HTML tags in the tables; otherwise, removes them.

include_header – If True, includes the table header; otherwise, excludes it. This parameter is effective only when retain_html_tags is False and applies solely to the extracted table content.
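The interaction between retain_html_tags and include_header can be illustrated with a small standalone sketch. It is not the mapper's actual implementation: the function name extract_tables_sketch, the regular expressions, the nested-list return shape, and the rule that a header row is any row containing a th cell are all assumptions made for illustration. It uses only the Python standard library.

```python
import re

# Hypothetical helper patterns (assumptions, not taken from data_juicer).
TABLE_RE = re.compile(r"<table\b[^>]*>(.*?)</table>", re.DOTALL | re.IGNORECASE)
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.DOTALL | re.IGNORECASE)
CELL_RE = re.compile(r"<(td|th)\b[^>]*>(.*?)</\1>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def extract_tables_sketch(html, retain_html_tags=False, include_header=True):
    """Illustrative only: mimics the documented parameter semantics."""
    tables = TABLE_RE.findall(html)
    if retain_html_tags:
        # Tags are kept, so include_header has no effect on this path.
        return tables

    result = []
    for table_html in tables:
        rows = []
        for row_html in ROW_RE.findall(table_html):
            cells = CELL_RE.findall(row_html)  # list of (tag, inner_html)
            is_header = any(tag.lower() == "th" for tag, _ in cells)
            if is_header and not include_header:
                continue  # drop header rows when requested
            # Strip any remaining inline tags from each cell.
            rows.append([TAG_RE.sub("", inner).strip() for _, inner in cells])
        result.append(rows)
    return result


html = (
    "<table><tr><th>Name</th><th>Age</th></tr>"
    "<tr><td><b>Ann</b></td><td>30</td></tr></table>"
)
print(extract_tables_sketch(html))
# [[['Name', 'Age'], ['Ann', '30']]]
print(extract_tables_sketch(html, include_header=False))
# [[['Ann', '30']]]
```

With retain_html_tags=True the sketch returns the inner HTML of each table unchanged, which is why include_header is documented as effective only when retain_html_tags is False.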

process_single(sample)

Processes a single sample and returns the processed sample (sample -> sample).

Parameters:

sample – sample to process

Returns:

processed sample