frequency_specified_field_selector

Selector to filter samples based on the frequency of a specified field.

This operator selects samples based on the frequency of values in a specified field. The field can be multi-level, with keys separated by dots. It supports filtering by either a top ratio or a fixed number (topk) of the most frequent values. If both top_ratio and topk are provided, the one resulting in fewer samples is used. The sorting order can be controlled with the reverse parameter. The operator processes the dataset and returns a new dataset containing only the selected samples.
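The behavior described above can be sketched in plain Python. This is a minimal illustration over a list of dicts, not the operator's actual implementation: the names `get_nested` and `select_by_field_frequency` are hypothetical, and it assumes that `top_ratio` is applied to the number of distinct field values and truncated to an integer.

```python
from functools import reduce
from typing import Any, Dict, List, Optional


def get_nested(sample: Dict[str, Any], field_key: str) -> Any:
    """Follow a dotted key such as 'meta.suffix' through nested dicts."""
    return reduce(lambda node, key: node[key], field_key.split("."), sample)


def select_by_field_frequency(
    samples: List[Dict[str, Any]],
    field_key: str,
    top_ratio: Optional[float] = None,
    topk: Optional[int] = None,
    reverse: bool = True,
) -> List[Dict[str, Any]]:
    """Keep the samples whose field value is among the selected values."""
    if top_ratio is None and topk is None:
        return list(samples)  # no limit given, nothing to filter

    # Group sample indices by field value; dicts keep first-seen order.
    groups: Dict[Any, List[int]] = {}
    for index, sample in enumerate(samples):
        groups.setdefault(get_nested(sample, field_key), []).append(index)

    # Assumption: the ratio counts distinct values and is truncated to an int.
    limits = []
    if top_ratio is not None:
        limits.append(int(top_ratio * len(groups)))
    if topk is not None:
        limits.append(topk)
    select_num = min(limits)  # the stricter limit wins

    # Stable sort by group size; reverse=True puts the most frequent first.
    ranked = sorted(groups.values(), key=len, reverse=reverse)
    return [samples[i] for group in ranked[:select_num] for i in group]


docs = [
    {"text": "a", "meta": {"suffix": ".pdf"}},
    {"text": "b", "meta": {"suffix": ".txt"}},
    {"text": "c", "meta": {"suffix": ".pdf"}},
]
kept = select_by_field_frequency(docs, "meta.suffix", topk=1)
print([d["text"] for d in kept])  # ['a', 'c']
```

Grouping indices by value keeps all samples that share a selected value together in the result, which is why the kept samples come out ordered by value rather than by their original position.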

Type: selector

Tags: cpu

🔧 Parameter Configuration

| name | type | default | desc |
|---|---|---|---|
| `field_key` | `str` | `''` | Key of the field whose values drive the selection. For a multi-level field, separate the keys with '.'. |
| `top_ratio` | `Optional[float]`, constrained to 0 ≤ value ≤ 1 | `None` | Ratio of top-ranked values of the specified field to select. A sample is selected if its field value is among them. When both `topk` and `top_ratio` are set, whichever selects fewer samples applies. |
| `topk` | `Optional[int]`, constrained to value > 0 | `None` | Number of top-ranked values of the specified field to select. A sample is selected if its field value is among them. When both `topk` and `top_ratio` are set, whichever selects fewer samples applies. |
| `reverse` | `bool` | `True` | Sorting rule. If `reverse=True`, values are sorted in descending order. |
| `args` | | `''` | Extra positional arguments. |
| `kwargs` | | `''` | Extra keyword arguments. |
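The rule that the stricter of `top_ratio` and `topk` applies can be written out as a small helper. This is a sketch under assumptions: `resolve_select_num` is a hypothetical name, the ratio is assumed to be taken over the number of distinct field values and truncated, and the range checks simply mirror the constraints listed in the table.

```python
from typing import Optional


def resolve_select_num(
    num_distinct: int,
    top_ratio: Optional[float] = None,
    topk: Optional[int] = None,
) -> Optional[int]:
    """Return how many distinct values to keep, or None if no limit is set."""
    if top_ratio is not None and not 0 <= top_ratio <= 1:
        raise ValueError("top_ratio must be within [0, 1]")
    if topk is not None and topk <= 0:
        raise ValueError("topk must be greater than 0")

    limits = []
    if top_ratio is not None:
        # Assumption: a fractional result is truncated, not rounded.
        limits.append(int(top_ratio * num_distinct))
    if topk is not None:
        limits.append(topk)
    return min(limits) if limits else None


print(resolve_select_num(7, top_ratio=0.3, topk=5))  # 2
print(resolve_select_num(7, topk=5))                 # 5
print(resolve_select_num(7))                         # None
```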

📊 Effect demonstration

test_topratio_select

FrequencySpecifiedFieldSelector(field_key='meta.suffix', top_ratio=0.3, topk=5, reverse=True)

📥 input data

Sample 1: text

Today is Sun

count

101

meta

suffix

.pdf

key1

key2

count

Sample 2: text

a v s e c s f e f g a a a

count

meta

suffix

.docx

key1

key2

count

243

count

Sample 3: text

中文也是一个字算一个长度

count

162

meta

suffix

.txt

key1

key2

count

None

count

Sample 4: text

，。、„”“«»１」「《》´∶：？！

count

None

meta

suffix

.html

key1

key2

count

Sample 5: text

他的英文名字叫Harry Potter

count

meta

suffix

.pdf

key1

key2

count

551

count

Sample 6: text

这是一个测试

count

None

meta

suffix

.py

key1

key2

count

Sample 7: text

我出生于2023年12月15日

count

None

meta

suffix

.java

key1

key2

count

354.32

count

Sample 8: text

emoji表情测试下😊，😸31231

count

meta

suffix

.html

key1

key2

count

354.32

count

Sample 9: text

a=1
b
c=1+2+3+5
d=6

count

178

meta

suffix

.pdf

key1

key2

count

Sample 10: text

使用片段分词器对每个页面进行分词，使用语言

count

666

meta

suffix

.xml

key1

key2

count

📤 output data

Sample 1: text

Today is Sun

count

101

meta

suffix

.pdf

key1

key2

count

Sample 2: text

他的英文名字叫Harry Potter

count

meta

suffix

.pdf

key1

key2

count

551

count

Sample 3: text

a=1
b
c=1+2+3+5
d=6

count

178

meta

suffix

.pdf

key1

key2

count

Sample 4: text

，。、„”“«»１」「《》´∶：？！

count

None

meta

suffix

.html

key1

key2

count

Sample 5: text

emoji表情测试下😊，😸31231

count

meta

suffix

.html

key1

key2

count

354.32

count

✨ explanation

The operator ranks the values of the 'meta.suffix' field by frequency in descending order (reverse=True) and applies top_ratio=0.3 together with topk=5. The input contains seven distinct suffixes, so the ratio covers about two of them (0.3 × 7 = 2.1), which is fewer than topk=5; the ratio is therefore the limit that applies. The two most frequent suffixes are '.pdf' (three samples) and '.html' (two samples). The five samples carrying these suffixes are kept and the rest are removed.
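The selection in this example can be checked by hand with `collections.Counter`. The suffix list below is copied from the ten input samples; the truncation of 0.3 × 7 to 2 is an assumption that happens to match the output shown above.

```python
from collections import Counter

# The 'meta.suffix' values of the ten input samples, in order.
suffixes = [".pdf", ".docx", ".txt", ".html", ".pdf",
            ".py", ".java", ".html", ".pdf", ".xml"]

counts = Counter(suffixes)
num_distinct = len(counts)          # 7 distinct suffixes
by_ratio = int(0.3 * num_distinct)  # assumption: truncated, giving 2
select_num = min(by_ratio, 5)       # topk=5 is the looser limit here

kept_values = [value for value, _ in counts.most_common(select_num)]
# 1-based sample numbers, grouped by kept value.
kept_samples = [i + 1 for value in kept_values
                for i, s in enumerate(suffixes) if s == value]

print(kept_values)   # ['.pdf', '.html']
print(kept_samples)  # [1, 5, 9, 4, 8]
```

The sample numbers 1, 5, 9, 4, 8 correspond to the five output samples in the order they are listed.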

test_reverse_select

FrequencySpecifiedFieldSelector(field_key='meta.key1.key2.count', top_ratio=0.4, topk=2, reverse=False)

📥 input data

The input is the same set of ten samples shown under test_topratio_select above.

📤 output data

Sample 1: text

a v s e c s f e f g a a a

count

meta

suffix

.docx

key1

key2

count

243

count

Sample 2: text

中文也是一个字算一个长度

count

162

meta

suffix

.txt

key1

key2

count

None

count

✨ explanation

This example sets reverse to False, so the values of the 'meta.key1.key2.count' field are ranked in ascending order of frequency and the least frequent values are selected. A null value (None) is counted like any other value. Two samples are kept, one whose field value is 243 and one whose field value is None; all others are removed.
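The effect of the sort direction can be seen on a toy list. The values below are hypothetical and are not the test data; the sketch assumes a stable sort, so values with equal frequency keep their first-seen order, and it treats None as an ordinary value.

```python
from collections import defaultdict

# Hypothetical field values, not taken from the test data.
values = ["x", "y", "x", None, "x", "y", "z"]

# Group sample indices by value; None is a valid dict key.
groups = defaultdict(list)
for index, value in enumerate(values):
    groups[value].append(index)


def pick(reverse: bool, topk: int) -> list:
    """Return the first topk values after sorting by frequency."""
    ranked = sorted(groups.items(), key=lambda item: len(item[1]), reverse=reverse)
    return [value for value, _ in ranked[:topk]]


print(pick(reverse=True, topk=2))   # ['x', 'y']
print(pick(reverse=False, topk=2))  # [None, 'z']
```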
