clean_html_mapper¶
Cleans HTML code from text samples, converting HTML to plain text.
This operator processes text samples by removing HTML tags and converting HTML elements to a more readable format. Specifically, it replaces <li>
and <ol>
tags with newline and bullet points. The Selectolax HTML parser is used to extract the text content from the HTML. This operation is performed in a batched manner, making it efficient for large datasets.
将HTML代码从文本样本中清理,将HTML转换为纯文本。
此算子通过删除HTML标签并将HTML元素转换为更易读的格式来处理文本样本。具体来说,它将<li>
和<ol>
标签替换为换行符和项目符号。使用Selectolax HTML解析器从HTML中提取文本内容。此操作以批量方式执行,使其适用于大型数据集。
Type 算子类型: mapper
Tags 标签: cpu, text
🔧 Parameter Configuration 参数配置¶
name 参数名 |
type 类型 |
default 默认值 |
desc 说明 |
---|---|---|---|
|
|
extra args |
|
|
|
extra args |
📊 Effect demonstration 效果演示¶
test_complete_html_text¶
CleanHtmlMapper()
📥 input data 输入数据¶
['<header><nav><ul><tile>测试</title><li><a href="#">Home</a></li><li><a href="#">About</a></li><li><a href="#">Services</a></li><li><a href="#">Contact</a></li></ul></nav></header><main><h1>Welcome to My Website</h1><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.<button>Learn More</button></main><footer><p>© 2021 My Website. All Rights Reserved.</p></footer>']
📤 output data 输出数据¶
['测试\n*Home\n*About\n*Services\n*ContactWelcome to My WebsiteLorem ipsum dolor sit amet, consectetur adipiscing elit.Learn More© 2021 My Website. All Rights Reserved.']
✨ explanation 解释¶
This example demonstrates the operator’s ability to process a full HTML document, converting it into plain text. It removes all HTML tags and preserves the text content. The <li>
tags are replaced with bullet points, and other elements like headers and paragraphs are flattened into a continuous string. This is useful for extracting readable text from web pages.
此示例展示了算子处理完整HTML文档的能力,将其转换为纯文本。它移除所有HTML标签并保留文本内容。<li>
标签被替换为项目符号,而其他如标题和段落的元素则被展平成连续的字符串。这对于从网页中提取可读文本非常有用。
test_no_html_text¶
CleanHtmlMapper()
📥 input data 输入数据¶
['This is a test', '这是个测试', '12345678']
📤 output data 输出数据¶
['This is a test', '这是个测试', '12345678']
✨ explanation 解释¶
In this example, the input data does not contain any HTML tags. As a result, the operator simply returns the original text without making any changes. This case illustrates that the operator can handle plain text inputs effectively, ensuring that non-HTML content remains unchanged. 在此示例中,输入数据不包含任何HTML标签。因此,算子直接返回原始文本,不做任何更改。这个案例说明了算子可以有效处理纯文本输入,确保非HTML内容保持不变。