# clean_html_mapper
Cleans HTML code from text samples, converting HTML to plain text.
This operator strips HTML tags from text samples and keeps only the readable text. Before parsing, it rewrites `<li>` and `<ol>` tags into newlines and `*` bullet markers so that list structure is still visible in the result; the Selectolax HTML parser then extracts the text content. Samples are processed in batches, which keeps the operator efficient on large datasets.
将HTML代码从文本样本中清理,将HTML转换为纯文本。
此算子通过删除HTML标签并将HTML元素转换为更易读的格式来处理文本样本。具体来说,它将`<li>`和`<ol>`标签替换为换行符和项目符号。使用Selectolax HTML解析器从HTML中提取文本内容。此操作以批量方式执行,使其适用于大型数据集。
Type 算子类型: **mapper**
Tags 标签: cpu, text
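
The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual implementation: it uses Python's standard-library `html.parser` as a stand-in for Selectolax, and `clean_html`, `clean_batch`, and the `text` key are hypothetical names chosen for the example. Only the `<li>` rule is shown, because it is the one visible in the example output further down; the `<ol>` handling is left out.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores every tag (stdlib stand-in for Selectolax)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called once for each run of text between tags.
        self.chunks.append(data)


def clean_html(raw_html: str) -> str:
    """Hypothetical helper: mark list items as bullets, then drop all tags."""
    # Assumed rule, inferred from the example output:
    # every <li> starts a new line that begins with '*'.
    raw_html = raw_html.replace("<li>", "\n*").replace("</li>", "")
    collector = _TextCollector()
    collector.feed(raw_html)
    collector.close()  # flush any text still buffered by the parser
    # Text nodes are joined with no separator, so adjacent elements run together.
    return "".join(collector.chunks)


def clean_batch(samples: dict, text_key: str = "text") -> dict:
    """Hypothetical batched wrapper: clean every string in one column at once."""
    samples[text_key] = [clean_html(text) for text in samples[text_key]]
    return samples


batch = {"text": ['<ul><li><a href="#">Home</a></li><li>About</li></ul>', "This is a test"]}
print(clean_batch(batch)["text"])
# → ['\n*Home\n*About', 'This is a test']
```

Joining the text nodes with an empty separator is what makes neighbouring elements run together in the result, while plain-text input passes through unchanged.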
## 🔧 Parameter Configuration 参数配置
| name 参数名 | type 类型 | default 默认值 | desc 说明 |
|--------|------|--------|------|
| `args` | | `''` | extra positional args |
| `kwargs` | | `''` | extra keyword args |
## 📊 Effect demonstration 效果演示
### test_complete_html_text
```python
CleanHtmlMapper()
```
#### 📥 input data 输入数据
['<header><nav><ul><tile>测试</title><li><a href="#">Home</a></li><li><a href="#">About</a></li><li><a href="#">Services</a></li><li><a href="#">Contact</a></li></ul></nav></header><main><h1>Welcome to My Website</h1><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.<button>Learn More</button></main><footer><p>© 2021 My Website. All Rights Reserved.</p></footer>']
#### 📤 output data 输出数据
['测试\n*Home\n*About\n*Services\n*ContactWelcome to My WebsiteLorem ipsum dolor sit amet, consectetur adipiscing elit.Learn More© 2021 My Website. All Rights Reserved.']
#### ✨ explanation 解释
This example shows the operator converting a full HTML document to plain text. All HTML tags are removed and their text content is kept. Each `<li>` tag becomes a newline followed by a `*` bullet, while the text of other elements such as headings and paragraphs is concatenated with no separator, which is why `Contact` runs directly into `Welcome to My Website` in the output. This is useful for extracting readable text from web pages.
此示例展示了算子处理完整HTML文档的能力,将其转换为纯文本。它移除所有HTML标签并保留文本内容。`<li>`标签被替换为项目符号,而其他如标题和段落的元素则被展平成连续的字符串。这对于从网页中提取可读文本非常有用。
### test_no_html_text
```python
CleanHtmlMapper()
```
#### 📥 input data 输入数据
['This is a test', '这是个测试', '12345678']
#### 📤 output data 输出数据
['This is a test', '这是个测试', '12345678']
#### ✨ explanation 解释
In this example, the input data contains no HTML tags, so the operator returns each string exactly as it was given. This confirms that plain-text samples pass through the operator unchanged.
在此示例中,输入数据不包含任何HTML标签。因此,算子直接返回原始文本,不做任何更改。这个案例说明了算子可以有效处理纯文本输入,确保非HTML内容保持不变。
## 🔗 related links 相关链接
- [source code 源代码](../../../data_juicer/ops/mapper/clean_html_mapper.py)
- [unit test 单元测试](../../../tests/ops/mapper/test_clean_html_mapper.py)
- [Return operator list 返回算子列表](../../Operators.md)