data_juicer.ops.mapper.extract_tables_from_html_mapper module
- class data_juicer.ops.mapper.extract_tables_from_html_mapper.ExtractTablesFromHtmlMapper(tables_field_name: str = 'html_tables', retain_html_tags: bool = False, include_header: bool = True, *args, **kwargs)
Bases: Mapper
Extracts tables from HTML content and stores them in a specified field.
This operator processes HTML content to extract tables. It can either retain or remove HTML tags based on the retain_html_tags parameter. If retain_html_tags is False, it can also include or exclude table headers based on the include_header parameter. The extracted tables are stored in the tables_field_name field within the sample’s metadata. If no tables are found, an empty list is stored. If the tables have already been extracted, the operator will not reprocess the sample.
- __init__(tables_field_name: str = 'html_tables', retain_html_tags: bool = False, include_header: bool = True, *args, **kwargs)
Initialization method.
- Parameters:
  - tables_field_name – Field name to store the extracted tables.
  - retain_html_tags – If True, retains HTML tags in the tables; otherwise, removes them.
  - include_header – If True, includes the table header; otherwise, excludes it. This parameter is effective only when retain_html_tags is False and applies solely to the extracted table content.
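The interaction between the three parameters can be easier to follow with a small standalone sketch. The code below is not Data-Juicer's implementation and does not import the library; it only mimics the documented behavior using the Python standard library. The function name `extract_tables`, the helper class `_TableTextParser`, the sample keys `"text"` and `"meta"`, and the choice to drop rows that end up with no cells are all assumptions made for illustration. Nested tables are not handled.

```python
import re
from html.parser import HTMLParser


class _TableTextParser(HTMLParser):
    """Hypothetical helper: collects each <table> as a list of rows of cell text."""

    def __init__(self, include_header=True):
        super().__init__()
        self.include_header = include_header
        self.tables = []   # finished tables
        self._rows = None  # rows of the table currently being parsed
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            # Header cells (<th>) are skipped when include_header is False.
            if tag == "td" or self.include_header:
                self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # assumption: rows left with no cells are dropped
                self._rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def extract_tables(sample, tables_field_name="html_tables",
                   retain_html_tags=False, include_header=True):
    """Hypothetical stand-in for the operator's per-sample behavior."""
    meta = sample.setdefault("meta", {})  # "meta" key is an assumption
    if tables_field_name in meta:
        return sample                     # already extracted: do not reprocess
    html = sample.get("text", "")         # "text" key is an assumption
    if retain_html_tags:
        # Keep each table's raw markup; include_header has no effect here.
        tables = re.findall(r"<table\b.*?</table>", html, flags=re.I | re.S)
    else:
        parser = _TableTextParser(include_header)
        parser.feed(html)
        parser.close()
        tables = parser.tables
    meta[tables_field_name] = tables      # empty list when no tables are found
    return sample


html = ("<table><tr><th>Name</th><th>Qty</th></tr>"
        "<tr><td>apple</td><td>3</td></tr></table>")
print(extract_tables({"text": html})["meta"]["html_tables"])
print(extract_tables({"text": html}, include_header=False)["meta"]["html_tables"])
```

With the default settings the sketch yields one table containing both the header row and the data row; with `include_header=False` only the data row remains; with `retain_html_tags=True` the stored value is the table markup itself, regardless of `include_header`.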