extract_tables_from_html_mapper
Extracts tables from HTML content and stores them in a specified field.
This operator parses HTML content and extracts its tables. The retain_html_tags parameter controls whether HTML tags are kept in the extracted tables or removed. When retain_html_tags is False, the include_header parameter controls whether table headers are included. The extracted tables are stored in the sample's metadata under the field named by tables_field_name. If no tables are found, an empty list is stored. If the tables have already been extracted, the operator skips the sample instead of reprocessing it.
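The behavior described above for retain_html_tags=False can be illustrated with a minimal sketch built on Python's standard html.parser module. This is not the operator's actual implementation: the class TableTextParser and the function extract_tables are hypothetical names, nested tables are not handled, and any row containing a th cell is treated as a header row, which is a simplifying assumption.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the cell text of each <table> as a list of rows (hypothetical helper)."""

    def __init__(self, include_header=True):
        super().__init__()
        self.include_header = include_header
        self.tables = []             # one list of rows per table
        self._in_table = False
        self._row = None             # cells of the row being read, or None
        self._cell = None            # text fragments of the cell being read, or None
        self._is_header_row = False  # assumption: a row with any <th> is a header

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
            self._in_table = True
        elif tag == "tr" and self._in_table:
            self._row, self._is_header_row = [], False
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            if tag == "th":
                self._is_header_row = True

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Drop header rows only when include_header is False.
            if self.include_header or not self._is_header_row:
                self.tables[-1].append(self._row)
            self._row = self._cell = None
        elif tag == "table":
            self._in_table = False


def extract_tables(html, include_header=True):
    """Return every table in `html` as rows of plain-text cells."""
    parser = TableTextParser(include_header=include_header)
    parser.feed(html)
    parser.close()
    return parser.tables


sample_html = (
    "<table><thead><tr><th>姓名</th><th>年龄</th><th>城市</th></tr></thead>"
    "<tbody><tr><td>张三</td><td>25</td><td>北京</td></tr>"
    "<tr><td>李四</td><td>30</td><td>上海</td></tr></tbody></table>"
)
with_header = extract_tables(sample_html, include_header=True)
# [[['姓名', '年龄', '城市'], ['张三', '25', '北京'], ['李四', '30', '上海']]]
without_header = extract_tables(sample_html, include_header=False)
# [[['张三', '25', '北京'], ['李四', '30', '上海']]]
```

The nested-list output shape is a choice made for this sketch; the format the operator itself stores is not specified here.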
Type: mapper
Tags: cpu, text
🔧 Parameter Configuration
name | type | default | desc |
---|---|---|---|
tables_field_name | <class 'str'> | | Field name to store the extracted tables. |
retain_html_tags | <class 'bool'> | | If True, retains HTML tags in the tables; otherwise, removes them. |
include_header | <class 'bool'> | | If True, includes the table header; otherwise, excludes it. This parameter is effective only when retain_html_tags is False. |
📊 Effect demonstration
test_extract_tables_include_header
ExtractTablesFromHtmlMapper(retain_html_tags=False, include_header=True)
📥 input data
<!DOCTYPE html> <html lang="zh"> <head> <meta charset="UTF-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>表格示例</title> </head> <body> <h1>表格示例</h1> <table border="1"> <thead> <tr> <th>姓名</th> <th>年龄</th> <th>城市</th> </tr> </thead> <tbody> <tr> <td>张三</td> <td>25</td> <td>北京</td> </tr> <tr> <td>李四</td> <td>30</td> <td>上海</td> </tr> <tr> <td>王五</td> <td>28</td> <td>广州</td> </tr> </tbody> </table> </body> </html>
📤 output data
<!DOCTYPE html> <html lang="zh"> <head> <meta charset="UTF-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>表格示例</title> </head> <body> <h1>表格示例</h1> <table border="1"> <thead> <tr> <th>姓名</th> <th>年龄</th> <th>城市</th> </tr> </thead> <tbody> <tr> <td>张三</td> <td>25</td> <td>北京</td> </tr> <tr> <td>李四</td> <td>30</td> <td>上海</td> </tr> <tr> <td>王五</td> <td>28</td> <td>广州</td> </tr> </tbody> </table> </body> </html>
✨ explanation
This example shows the operator extracting a table from HTML content with its header row included. The input is a simple HTML document containing one table with a header row and three data rows. The operator extracts the table and stores it in the 'html_tables' field of the sample's metadata. The output keeps the original text unchanged and adds the extracted table, header included, to the metadata.
test_no_tables
ExtractTablesFromHtmlMapper(retain_html_tags=False, include_header=True)
📥 input data
<html><body>New testCase - No tables here!</body></html>
📤 output data
<html><body>New testCase - No tables here!</body></html>
✨ explanation
In this example, the input is an HTML document that contains no tables. Because there is nothing to extract, the operator stores an empty list in the 'html_tables' field of the sample's metadata. The output text is identical to the input text; the only change is the empty 'html_tables' list in the metadata, which indicates that no tables were found.
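The metadata handling described in this example, together with the skip-if-already-extracted rule, can be sketched as follows. This is an illustration under stated assumptions, not the operator's code: the function add_tables_to_meta, the "text" and "meta" dictionary keys, and the regular expression are all hypothetical. The regex keeps each table's raw markup (similar in spirit to retain_html_tags=True) and does not handle nested tables.

```python
import re

# Matches each <table>...</table> block and keeps its markup.
# Simplification: nested tables are not handled correctly.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def add_tables_to_meta(sample, tables_field_name="html_tables"):
    """Store extracted tables in the sample's metadata (hypothetical helper).

    Assumes the sample is a dict with the HTML under "text" and the
    metadata under "meta"; both key names are illustrative.
    """
    meta = sample.setdefault("meta", {})
    if tables_field_name in meta:
        # Tables were already extracted: leave the sample untouched.
        return sample
    # findall returns [] when no table is present, so an empty list is stored.
    meta[tables_field_name] = TABLE_RE.findall(sample["text"])
    return sample


no_tables = add_tables_to_meta(
    {"text": "<html><body>New testCase - No tables here!</body></html>"}
)
# no_tables["meta"] == {"html_tables": []}
```

Checking for the field's presence rather than its truthiness matters here: an empty list is a valid stored result, so a sample with no tables is also skipped on a second pass.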