data_juicer.ops.mapper.extract_tables_from_html_mapper module

class data_juicer.ops.mapper.extract_tables_from_html_mapper.ExtractTablesFromHtmlMapper(tables_field_name: str = 'html_tables', retain_html_tags: bool = False, include_header: bool = True, *args, **kwargs)[source]

Bases: Mapper

Extracts tables from HTML content and stores them in a specified field.

This operator processes HTML content to extract tables. It can either retain or remove HTML tags based on the retain_html_tags parameter. If retain_html_tags is False, it can also include or exclude table headers based on the include_header parameter. The extracted tables are stored in the tables_field_name field within the sample's metadata. If no tables are found, an empty list is stored. If the tables have already been extracted, the operator will not reprocess the sample.
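The interaction between retain_html_tags and include_header can be illustrated with a minimal standard-library sketch. It is not the operator's actual implementation: the function name extract_tables_sketch, the regular expressions, and the rule that a "header" is any row containing th cells are all assumptions made for illustration.

```python
import re

# Assumed patterns for this sketch; real-world HTML may need a proper parser.
TABLE_RE = re.compile(r"<table\b[^>]*>.*?</table>", re.DOTALL | re.IGNORECASE)
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.DOTALL | re.IGNORECASE)
CELL_RE = re.compile(r"<(td|th)\b[^>]*>(.*?)</\1>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def extract_tables_sketch(html, retain_html_tags=False, include_header=True):
    """Hypothetical helper mirroring the documented parameters."""
    tables = []
    for table_html in TABLE_RE.findall(html):
        if retain_html_tags:
            # Keep the raw markup; include_header has no effect in this mode.
            tables.append(table_html)
            continue
        rows = []
        for row_html in ROW_RE.findall(table_html):
            cells = CELL_RE.findall(row_html)  # list of (tag, inner_html)
            # Assumption: a row with any <th> cell counts as a header row.
            is_header = any(tag.lower() == "th" for tag, _ in cells)
            if is_header and not include_header:
                continue
            # Strip nested tags from each cell and trim whitespace.
            rows.append([TAG_RE.sub("", inner).strip() for _, inner in cells])
        tables.append(rows)
    return tables  # an empty list when the input contains no tables


html = (
    "<p>intro</p><table><tr><th>Name</th><th>Age</th></tr>"
    "<tr><td>Ann</td><td><b>30</b></td></tr></table>"
)
print(extract_tables_sketch(html))
print(extract_tables_sketch(html, include_header=False))
print(extract_tables_sketch(html, retain_html_tags=True))
```

With tags removed, each table becomes a list of rows and each row a list of cell strings; with tags retained, each table stays a single markup string. The output shape of the real operator may differ.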

__init__(tables_field_name: str = 'html_tables', retain_html_tags: bool = False, include_header: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:

tables_field_name -- Field name to store the extracted tables.

retain_html_tags -- If True, retains HTML tags in the tables; otherwise, removes them.

include_header -- If True, includes the table header; otherwise, excludes it. This parameter is effective only when retain_html_tags is False and applies solely to the extracted table content.

process_single(sample)[source]

Processes a single sample and returns it (sample --> sample).

Parameters:

sample -- sample to process

Returns:

processed sample
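The sample-level flow described for this operator (skip samples whose tables were already extracted, otherwise store the result in the metadata, using an empty list when nothing is found) might look roughly like the sketch below. The keys "text" and "meta", the helper find_tables, and the function process_single_sketch are hypothetical stand-ins; the real operator's sample layout and extraction logic are not shown in this page.

```python
import re

TEXT_KEY = "text"  # assumed key holding the HTML content
META_KEY = "meta"  # assumed key holding the sample's metadata dict


def find_tables(html):
    """Stand-in extractor: return raw <table>...</table> fragments."""
    return re.findall(
        r"<table\b[^>]*>.*?</table>", html, flags=re.DOTALL | re.IGNORECASE
    )


def process_single_sketch(sample, tables_field_name="html_tables"):
    """Hypothetical sample --> sample step with a skip-if-present check."""
    meta = sample.setdefault(META_KEY, {})
    if tables_field_name in meta:
        # Tables were already extracted; leave the sample untouched.
        return sample
    # Store the result; this is an empty list when no tables are found.
    meta[tables_field_name] = find_tables(sample.get(TEXT_KEY, ""))
    return sample


sample = {"text": "<table><tr><td>1</td></tr></table>"}
print(process_single_sketch(sample)["meta"]["html_tables"])
```

Checking for the field before extracting makes the step safe to run repeatedly over the same dataset, which matches the documented "will not reprocess" behavior.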