frequency_specified_field_selector

Selector to filter samples based on the frequency of a specified field.

This operator selects samples based on how frequently each value of a specified field occurs. The field can be nested: use a dot-separated key path such as meta.key1.key2.count. Selection is limited either by a ratio (top_ratio) or by a fixed number (topk) of the most frequent values. If both top_ratio and topk are provided, whichever yields the smaller selection is used. The reverse parameter controls the sorting order. The operator returns a new dataset containing only the selected samples.
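The selection described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual implementation: the helper names `resolve_field` and `select_by_frequency` are hypothetical, and the sketch assumes that top_ratio is applied to the number of distinct field values and that reverse=True ranks the most frequent values first.

```python
def resolve_field(sample, field_key):
    """Follow a dot-separated key path, e.g. 'meta.key1.key2.count'."""
    value = sample
    for key in field_key.split("."):
        value = value[key]
    return value


def select_by_frequency(samples, field_key, top_ratio=None, topk=None,
                        reverse=True):
    """Keep the samples whose field value falls in the selected groups."""
    if top_ratio is None and topk is None:
        return list(samples)  # no limit given, nothing to filter

    # Group sample indices by field value, in first-appearance order.
    groups = {}
    for index, sample in enumerate(samples):
        groups.setdefault(resolve_field(sample, field_key), []).append(index)

    # Assumption: top_ratio counts distinct values, not samples.
    limits = []
    if top_ratio is not None:
        limits.append(int(top_ratio * len(groups)))
    if topk is not None:
        limits.append(topk)
    select_num = min(limits)  # the smaller limit wins

    # sorted() is stable, so equally frequent values keep their input order.
    ranked = sorted(groups.values(), key=len, reverse=reverse)
    kept = [i for group in ranked[:select_num] for i in group]
    return [samples[i] for i in kept]
```

Calling `select_by_frequency(samples, "meta.suffix", topk=1)` on a list of dict samples would keep only the samples that share the single most frequent suffix.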

Type: selector

Tags: cpu

🔧 Parameter Configuration

| name | type | default | desc |
| --- | --- | --- | --- |
| field_key | `<class 'str'>` | `''` | Selector based on the specified value |
| top_ratio | `typing.Optional[typing.Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])]]` | `None` | Ratio of selected top specified field value, |
| topk | `typing.Optional[typing.Annotated[int, Gt(gt=0)]]` | `None` | Number of selected top specified field value, |
| reverse | `<class 'bool'>` | `True` | Determine the sorting rule, if reverse=True, |
| args | | `''` | extra args |
| kwargs | | `''` | extra args |
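The type annotations above bound top_ratio to the range [0, 1] (`Ge(ge=0)`, `Le(le=1)`) and require topk to be positive (`Gt(gt=0)`). The following sketch spells out those bounds as explicit checks. The helper `validate_params` is hypothetical and only illustrates the constraints; it is not part of the operator.

```python
from typing import Optional


def validate_params(top_ratio: Optional[float] = None,
                    topk: Optional[int] = None) -> None:
    """Raise ValueError if a value violates the bounds listed in the table."""
    # top_ratio: Ge(ge=0) and Le(le=1); None means "not set".
    if top_ratio is not None and not 0 <= top_ratio <= 1:
        raise ValueError(f"top_ratio must be in [0, 1], got {top_ratio}")
    # topk: Gt(gt=0); None means "not set".
    if topk is not None and topk <= 0:
        raise ValueError(f"topk must be greater than 0, got {topk}")
```

For example, `validate_params(top_ratio=0.3, topk=5)` passes silently, while `validate_params(top_ratio=1.5)` raises `ValueError`.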

📊 Effect demonstration

test_topratio_select

FrequencySpecifiedFieldSelector(field_key='meta.suffix', top_ratio=0.3, topk=5, reverse=True)

📥 input data

Sample 1:
text: Today is Sun
count: 101
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 34}, 'count': 5}}

Sample 2:
text: a v s e c s f e f g a a a  
count: 16
meta: {'suffix': '.docx', 'key1': {'key2': {'count': 243}, 'count': 63}}

Sample 3:
text: 中文也是一个字算一个长度
count: 162
meta: {'suffix': '.txt', 'key1': {'key2': {'count': None}, 'count': 23}}

Sample 4:
text: ,。、„”“«»1」「《》´∶:?!
count: None
meta: {'suffix': '.html', 'key1': {'key2': {'count': 18}, 'count': 48}}

Sample 5:
text: 他的英文名字叫Harry Potter
count: 88
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 551}, 'count': 78}}

Sample 6:
text: 这是一个测试
count: None
meta: {'suffix': '.py', 'key1': {'key2': {'count': 89}, 'count': 3}}

Sample 7:
text: 我出生于2023年12月15日
count: None
meta: {'suffix': '.java', 'key1': {'key2': {'count': 354.32}, 'count': 67}}

Sample 8:
text: emoji表情测试下😊,😸31231
count: 2
meta: {'suffix': '.html', 'key1': {'key2': {'count': 354.32}, 'count': 32}}

Sample 9:
text: a=1
b
c=1+2+3+5
d=6
count: 178
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 33}, 'count': 33}}

Sample 10:
text: 使用片段分词器对每个页面进行分词,使用语言
count: 666
meta: {'suffix': '.xml', 'key1': {'key2': {'count': 18}, 'count': 48}}

📤 output data

Sample 1:
text: Today is Sun
count: 101
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 34}, 'count': 5}}

Sample 2:
text: 他的英文名字叫Harry Potter
count: 88
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 551}, 'count': 78}}

Sample 3:
text: a=1
b
c=1+2+3+5
d=6
count: 178
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 33}, 'count': 33}}

Sample 4:
text: ,。、„”“«»1」「《》´∶:?!
count: None
meta: {'suffix': '.html', 'key1': {'key2': {'count': 18}, 'count': 48}}

Sample 5:
text: emoji表情测试下😊,😸31231
count: 2
meta: {'suffix': '.html', 'key1': {'key2': {'count': 354.32}, 'count': 32}}

✨ explanation

The operator selects samples by the frequency of the 'meta.suffix' field, with top_ratio=0.3, topk=5, and reverse=True (descending order of frequency). The two most frequent suffixes, '.pdf' (3 samples) and '.html' (2 samples), are selected, and samples with any other suffix are removed.
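The arithmetic behind this result can be checked with a short script over the ten input suffixes. It is a sketch under one assumption that the page does not state explicitly: top_ratio is applied to the number of distinct values (7 here), not to the number of samples.

```python
from collections import Counter

# 'meta.suffix' of input samples 1..10, in order.
suffixes = [".pdf", ".docx", ".txt", ".html", ".pdf",
            ".py", ".java", ".html", ".pdf", ".xml"]
counts = Counter(suffixes)

# Assumption: the ratio is taken over the 7 distinct suffixes.
ratio_limit = int(0.3 * len(counts))  # int(2.1) == 2
select_num = min(ratio_limit, 5)      # topk=5 is the looser limit

# Most frequent values first, as with reverse=True.
top_values = [value for value, _ in counts.most_common(select_num)]

# Input sample numbers that are kept, grouped by selected value.
kept = [number
        for value in top_values
        for number, suffix in enumerate(suffixes, start=1)
        if suffix == value]

print(top_values)  # ['.pdf', '.html']
print(kept)        # [1, 5, 9, 4, 8]
```

The kept sample numbers match the five output samples listed above, in the same order.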

test_reverse_select

FrequencySpecifiedFieldSelector(field_key='meta.key1.key2.count', top_ratio=0.4, topk=2, reverse=False)

📥 input data

Sample 1:
text: Today is Sun
count: 101
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 34}, 'count': 5}}

Sample 2:
text: a v s e c s f e f g a a a  
count: 16
meta: {'suffix': '.docx', 'key1': {'key2': {'count': 243}, 'count': 63}}

Sample 3:
text: 中文也是一个字算一个长度
count: 162
meta: {'suffix': '.txt', 'key1': {'key2': {'count': None}, 'count': 23}}

Sample 4:
text: ,。、„”“«»1」「《》´∶:?!
count: None
meta: {'suffix': '.html', 'key1': {'key2': {'count': 34}, 'count': 48}}

Sample 5:
text: 他的英文名字叫Harry Potter
count: 88
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 551}, 'count': 78}}

Sample 6:
text: 这是一个测试
count: None
meta: {'suffix': '.py', 'key1': {'key2': {'count': 89}, 'count': 3}}

Sample 7:
text: 我出生于2023年12月15日
count: None
meta: {'suffix': '.java', 'key1': {'key2': {'count': 354.32}, 'count': 67}}

Sample 8:
text: emoji表情测试下😊,😸31231
count: 2
meta: {'suffix': '.html', 'key1': {'key2': {'count': 354.32}, 'count': 32}}

Sample 9:
text: a=1
b
c=1+2+3+5
d=6
count: 178
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 34}, 'count': 33}}

Sample 10:
text: 使用片段分词器对每个页面进行分词,使用语言
count: 666
meta: {'suffix': '.xml', 'key1': {'key2': {'count': 18}, 'count': 48}}

📤 output data

Sample 1:
text: a v s e c s f e f g a a a  
count: 16
meta: {'suffix': '.docx', 'key1': {'key2': {'count': 243}, 'count': 63}}

Sample 2:
text: 中文也是一个字算一个长度
count: 162
meta: {'suffix': '.txt', 'key1': {'key2': {'count': None}, 'count': 23}}

✨ explanation

This example sets reverse=False, so the least frequent values of the 'meta.key1.key2.count' field are selected. Two samples are kept: their values, 243 and None, each occur only once in the input. All other samples are removed.
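Five values in this input occur exactly once (243, None, 551, 89, 18), yet only 243 and None are selected. The sketch below reproduces that outcome under two assumptions that the page does not confirm: top_ratio is applied to the 7 distinct values, and ties in frequency are broken by first appearance in the input, which is what a stable sort gives.

```python
# 'meta.key1.key2.count' of input samples 1..10, in order.
values = [34, 243, None, 34, 551, 89, 354.32, 354.32, 34, 18]

# Map each distinct value to the sample numbers that carry it.
groups = {}
for number, value in enumerate(values, start=1):
    groups.setdefault(value, []).append(number)

# int(0.4 * 7) == int(2.8) == 2, and topk == 2, so two values are selected.
select_num = min(int(0.4 * len(groups)), 2)

# Ascending frequency (reverse=False); Python's sort is stable, so values
# with equal frequency stay in first-appearance order.
ranked = sorted(groups.items(), key=lambda item: len(item[1]))
selected = ranked[:select_num]

print(selected)  # [(243, [2]), (None, [3])]
```

Input samples 2 and 3 are exactly the two samples shown in the output data above.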