extract_tables_from_html_mapper

Extracts tables from HTML content and stores them in a specified field.

This operator parses HTML content and extracts every table it finds. The retain_html_tags parameter controls whether the extracted tables keep their HTML tags or are reduced to cell text. When retain_html_tags is False, the include_header parameter further controls whether table headers are included. The extracted tables are stored in the tables_field_name field of the sample's metadata; if no tables are found, an empty list is stored. Samples whose tables have already been extracted are not reprocessed.
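The tag-stripping mode described above can be sketched with Python's standard library. This is a minimal illustration, not the operator's actual implementation: the names TableTextParser and extract_tables are hypothetical, the parser is html.parser rather than whatever the operator uses internally, and "header" is assumed to mean `<th>` cells. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self, include_header=True):
        super().__init__()
        self.include_header = include_header
        self.tables = []    # finished tables
        self._rows = None   # rows of the table currently being parsed
        self._cells = None  # cells of the row currently being parsed
        self._text = None   # text chunks of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._cells = []
        elif tag in ("td", "th") and self._cells is not None:
            # Assumption: header cells are <th>; skip them when excluded.
            keep = tag == "td" or self.include_header
            self._text = [] if keep else None

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cells is not None:
            if self._text is not None:
                self._cells.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._rows is not None and self._cells is not None:
            if self._cells:  # drop rows left empty after header removal
                self._rows.append(self._cells)
            self._cells = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def extract_tables(html, include_header=True):
    """Return all tables in `html` as nested lists of cell text."""
    parser = TableTextParser(include_header=include_header)
    parser.feed(html)
    parser.close()
    return parser.tables


demo = (
    "<table><thead><tr><th>Name</th><th>Age</th></tr></thead>"
    "<tbody><tr><td>Ann</td><td>25</td></tr></tbody></table>"
)
print(extract_tables(demo))                        # [[['Name', 'Age'], ['Ann', '25']]]
print(extract_tables(demo, include_header=False))  # [[['Ann', '25']]]
```

Each table becomes a list of rows, which matches the nested-list shape shown in the effect demonstration for this operator.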

Type: mapper

Tags: cpu, text

🔧 Parameter Configuration

| name | type | default | desc |
|---|---|---|---|
| tables_field_name | <class 'str'> | 'html_tables' | Field name to store the extracted tables. |
| retain_html_tags | <class 'bool'> | False | If True, retains HTML tags in the tables; otherwise, removes them. |
| include_header | <class 'bool'> | True | If True, includes the table header; otherwise, excludes it. This parameter is effective only when retain_html_tags is False and applies solely to the extracted table content. |
| args | | '' | |
| kwargs | | '' | |
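To make the retain_html_tags switch more concrete, here is a rough sketch of what keeping the tags could look like, assuming it means each table's markup is kept verbatim. The function name extract_raw_tables and the regular expression are illustrative assumptions, not the operator's code, and the pattern does not handle nested tables.

```python
import re

# Non-greedy match of one <table>...</table> element (nested tables unsupported).
TABLE_RE = re.compile(r"<table\b[^>]*>.*?</table>", re.IGNORECASE | re.DOTALL)


def extract_raw_tables(html):
    """Return each table as its original markup, tags included."""
    return TABLE_RE.findall(html)


page = '<p>intro</p><table border="1"><tr><td>a</td></tr></table><p>end</p>'
print(extract_raw_tables(page))  # ['<table border="1"><tr><td>a</td></tr></table>']
```

In this mode there is no separate header handling, which is consistent with include_header taking effect only when retain_html_tags is False.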

📊 Effect demonstration

test_extract_tables_include_header

ExtractTablesFromHtmlMapper(retain_html_tags=False, include_header=True)

📥 input data

Sample 1: text
    <!DOCTYPE html>
            <html lang="zh">
            <head>
                <meta charset="UTF-8">
                <meta name="viewport" content="width=device-width, initial-scale=1.0">
                <title>表格示例</title>
            </head>
            <body>
                <h1>表格示例</h1>
                <table border="1">
                    <thead>
                        <tr>
                            <th>姓名</th>
                            <th>年龄</th>
                            <th>城市</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td>张三</td>
                            <td>25</td>
                            <td>北京</td>
                        </tr>
                        <tr>
                            <td>李四</td>
                            <td>30</td>
                            <td>上海</td>
                        </tr>
                        <tr>
                            <td>王五</td>
                            <td>28</td>
                            <td>广州</td>
                        </tr>
                    </tbody>
                </table>
            </body>
            </html>
    

📤 output data

Sample 1: text
    <!DOCTYPE html>
            <html lang="zh">
            <head>
                <meta charset="UTF-8">
                <meta name="viewport" content="width=device-width, initial-scale=1.0">
                <title>表格示例</title>
            </head>
            <body>
                <h1>表格示例</h1>
                <table border="1">
                    <thead>
                        <tr>
                            <th>姓名</th>
                            <th>年龄</th>
                            <th>城市</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td>张三</td>
                            <td>25</td>
                            <td>北京</td>
                        </tr>
                        <tr>
                            <td>李四</td>
                            <td>30</td>
                            <td>上海</td>
                        </tr>
                        <tr>
                            <td>王五</td>
                            <td>28</td>
                            <td>广州</td>
                        </tr>
                    </tbody>
                </table>
            </body>
            </html>
    
__dj__meta__
html_tables: [[['姓名', '年龄', '城市'], ['张三', '25', '北京'], ['李四', '30', '上海'], ['王五', '28', '广州']]]

✨ explanation

This example shows how the operator extracts a table from HTML content with its header included. The input is an HTML string containing one table with a header row and three data rows. The operator extracts the table and stores it in the 'html_tables' field of the metadata. The output keeps the original text unchanged and adds the extracted table, whose first row is the header.

test_no_tables

ExtractTablesFromHtmlMapper(retain_html_tags=False, include_header=True)

📥 input data

Sample 1: text
<html><body>New testCase - No tables here!</body></html>

📤 output data

Sample 1: text
<html><body>New testCase - No tables here!</body></html>
__dj__meta__
html_tables: []

✨ explanation

In this example, the input is an HTML document that contains no tables. Since there is nothing to extract, the operator stores an empty list in the 'html_tables' field of the metadata. The text is unchanged; the empty 'html_tables' list indicates that no tables were found.
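The bookkeeping around extraction (store an empty list when nothing is found, and skip samples that already carry the field) can be sketched as follows. This is a hedged illustration rather than the operator's code: process_sample and the regex-based stand-in extractor are hypothetical, and the sample is assumed to be a plain dict with a "text" key and a "__dj__meta__" metadata dict, as the output above suggests.

```python
import re

META_KEY = "__dj__meta__"  # metadata key as displayed in the example output

# Hypothetical stand-in extractor: raw <table> markup, nested tables unsupported.
_TABLE_RE = re.compile(r"<table\b[^>]*>.*?</table>", re.IGNORECASE | re.DOTALL)


def process_sample(sample, tables_field_name="html_tables",
                   extractor=_TABLE_RE.findall):
    """Store extracted tables in the sample's metadata, skipping processed samples."""
    meta = sample.setdefault(META_KEY, {})
    if tables_field_name in meta:
        return sample  # already extracted: leave the stored value untouched
    meta[tables_field_name] = extractor(sample["text"])  # [] when no table is found
    return sample


sample = {"text": "<html><body>New testCase - No tables here!</body></html>"}
process_sample(sample)
print(sample[META_KEY])  # {'html_tables': []}
```

Checking for the field name rather than for a non-empty value means a sample with no tables is also treated as processed on a second pass.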