# clean_html_mapper

Cleans HTML code from text samples, converting HTML to plain text.

This operator processes text samples by removing HTML tags and converting HTML elements to a more readable format. Specifically, list tags such as `<li>` and `<ol>` are replaced with bullet points, and other elements like headers and paragraphs are flattened into a continuous string. This is useful for extracting readable text from web pages.

Given a complete HTML document, the operator removes all HTML tags and retains the text content, converting the document to plain text.

### test_no_html_text
```python
CleanHtmlMapper()
```

#### 📥 input data
  Sample 1: list
```
['This is a test', '这是个测试', '12345678']
```
#### 📤 output data
  Sample 1: list
```
['This is a test', '这是个测试', '12345678']
```
#### ✨ explanation
In this example, the input data does not contain any HTML tags, so the operator returns the original text unchanged. This case shows that plain-text input passes through the operator unmodified.

## 🔗 related links
- [source code](../../../data_juicer/ops/mapper/clean_html_mapper.py)
- [unit test](../../../tests/ops/mapper/test_clean_html_mapper.py)
- [Return operator list](../../Operators.md)
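
## illustrative sketch

The behavior described on this page (tags removed, list items turned into bullets, plain text left untouched) can be approximated with the Python standard library. The sketch below is not the operator's actual implementation. The names `_TextExtractor`, `clean_html`, and `clean_html_batch` are hypothetical, and the `"\n*"` bullet marker is an assumption chosen for illustration. The sketch also does not skip `<script>` or `<style>` content, which a production cleaner would likely need to handle.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and emits a bullet marker at each <li> (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: list items are marked with a newline followed by "*".
        if tag == "li":
            self.parts.append("\n*")

    def handle_data(self, data):
        # Text between tags is kept as-is; the tags themselves are dropped.
        self.parts.append(data)


def clean_html(text: str) -> str:
    """Return `text` with HTML tags removed and <li> items bulleted."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


def clean_html_batch(texts):
    """Apply clean_html to every sample in a list of strings."""
    return [clean_html(t) for t in texts]


if __name__ == "__main__":
    print(clean_html_batch(["This is a test", "这是个测试", "12345678"]))
    print(clean_html("<ul><li>one</li><li>two</li></ul>"))
```

As in the `test_no_html_text` case, samples without any tags come back unchanged, because the parser only ever sees text nodes.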