extract_tables_from_html_mapper¶
Extracts tables from HTML content and stores them in a specified field.
This operator processes HTML content to extract tables. It can either retain or remove HTML tags based on the retain_html_tags
parameter. If retain_html_tags
is False, it can also include or exclude table headers based on the include_header
parameter. The extracted tables are stored in the tables_field_name
field within the sample’s metadata. If no tables are found, an empty list is stored. If the tables have already been extracted, the operator will not reprocess the sample.
从HTML内容中提取表格并存储在指定字段中。
此算子处理HTML内容以提取表格。根据retain_html_tags
参数,它可以保留或移除HTML标签。如果retain_html_tags
为False,还可以根据include_header
参数选择包含或排除表格标题。提取的表格将存储在样本元数据中的tables_field_name
字段内。如果没有找到表格,则会存储一个空列表。如果表格已经被提取,算子将不会重新处理样本。
Type 算子类型: mapper
Tags 标签: cpu, text
🔧 Parameter Configuration 参数配置¶
name 参数名 |
type 类型 |
default 默认值 |
desc 说明 |
---|---|---|---|
|
<class ‘str’> |
|
Field name to store the extracted tables. |
|
<class ‘bool’> |
|
If True, retains HTML tags in the tables; |
|
<class ‘bool’> |
|
If True, includes the table header; |
|
|
||
|
|
📊 Effect demonstration 效果演示¶
not available 暂无