topk_specified_field_selector

Selects top samples based on the sorted values of a specified field.

This operator ranks the samples of a dataset by the value of a specified field and keeps the top ones. The field can be nested, with keys separated by dots (for example, meta.key1.key2.count). The number of samples to keep is given either as a ratio of the dataset (top_ratio) or as a fixed count (topk); if both are provided, the one that yields fewer samples is used. The reverse parameter controls the sort order: reverse=True (the default) sorts in descending order so the largest values are kept, and reverse=False sorts in ascending order so the smallest values are kept.
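The selection rule described above can be sketched in plain Python. This is an illustrative sketch, not the operator's actual implementation: the helper names get_nested and select_top are hypothetical, and two behaviours are assumptions made for the example, namely truncating top_ratio * len(samples) to an integer and ranking missing or None values last.

```python
import math


def get_nested(sample, field_key):
    """Follow a dot-separated key such as 'meta.key1.key2.count' (hypothetical helper)."""
    value = sample
    for key in field_key.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def select_top(samples, field_key, top_ratio=None, topk=None, reverse=True):
    """Keep the top samples ranked by the value found at field_key."""
    # Work out how many samples each limit allows, then take the smaller one.
    limits = []
    if top_ratio is not None:
        limits.append(int(top_ratio * len(samples)))  # assumption: truncate
    if topk is not None:
        limits.append(topk)
    if not limits:
        return list(samples)  # no limit given: keep everything
    keep = min(limits)

    def sort_key(sample):
        value = get_nested(sample, field_key)
        if value is None:
            # Assumption: missing values always rank last in either order.
            return -math.inf if reverse else math.inf
        return float(value)

    ranked = sorted(samples, key=sort_key, reverse=reverse)
    return ranked[:keep]


samples = [
    {"text": "a", "meta": {"score": 3}},
    {"text": "b", "meta": {"score": None}},
    {"text": "c", "meta": {"score": 9}},
    {"text": "d", "meta": {"score": 5}},
]
top = select_top(samples, "meta.score", top_ratio=0.5, topk=3)
print([s["text"] for s in top])  # ['c', 'd']
```

Here top_ratio=0.5 allows 2 of the 4 samples and topk=3 allows 3, so the smaller limit of 2 applies.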


Type: selector

Tags: cpu

🔧 Parameter Configuration

| name | type | default | desc |
|------|------|---------|------|
| field_key | <class 'str'> | '' | Selector based on the specified value |
| top_ratio | typing.Optional[typing.Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])]] | None | Ratio of selected top samples, samples will be |
| topk | typing.Optional[typing.Annotated[int, Gt(gt=0)]] | None | Number of selected top samples, samples will be |
| reverse | <class 'bool'> | True | Determine the sorting rule, if reverse=True, |
| args | | '' | extra args |
| kwargs | | '' | extra args |
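The constraints listed for top_ratio and topk read as follows: top_ratio, when given, must lie between 0 and 1 inclusive (Ge(ge=0), Le(le=1)), and topk, when given, must be greater than 0 (Gt(gt=0)). A minimal plain-Python sketch of those bounds, where check_params is a hypothetical helper and not part of the library:

```python
def check_params(top_ratio=None, topk=None):
    """Mirror the documented bounds: 0 <= top_ratio <= 1 and topk > 0."""
    if top_ratio is not None and not 0 <= top_ratio <= 1:
        raise ValueError(f"top_ratio must be in [0, 1], got {top_ratio}")
    if topk is not None and topk <= 0:
        raise ValueError(f"topk must be greater than 0, got {topk}")
    return top_ratio, topk


check_params(top_ratio=0.2, topk=5)  # accepted
check_params()                       # accepted: both default to None
```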

📊 Effect demonstration

test_topratio_select

TopkSpecifiedFieldSelector(field_key='meta.key1.key2.count', top_ratio=0.2, topk=5, reverse=True)

📥 input data

Sample 1: text
Today is Sun
count: 101
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 34}, 'count': 5}}
Sample 2: text
a v s e c s f e f g a a a  
count: 16
meta: {'suffix': '.docx', 'key1': {'key2': {'count': 243}, 'count': 63}}
Sample 3: text
中文也是一个字算一个长度
count: 162
meta: {'suffix': '.txt', 'key1': {'key2': {'count': None}, 'count': 23}}
Sample 4: text
,。、„”“«»1」「《》´∶:?!
count: None
meta: {'suffix': '.html', 'key1': {'key2': {'count': 18}, 'count': 48}}
Sample 5: text
他的英文名字叫Harry Potter
count: 88
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 551}, 'count': 78}}
Sample 6: text
这是一个测试
count: None
meta: {'suffix': '.py', 'key1': {'key2': {'count': 89}, 'count': 3}}
Sample 7: text
我出生于2023年12月15日
count: None
meta: {'suffix': '.java', 'key1': {'key2': {'count': 354.32}, 'count': 67}}
Sample 8: text
emoji表情测试下😊,😸31231
count: 2
meta: {'suffix': '.html', 'key1': {'key2': {'count': 354.32}, 'count': 32}}
Sample 9: text
a=1
b
c=1+2+3+5
d=6
count: 178
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 33}, 'count': 33}}
Sample 10: text
使用片段分词器对每个页面进行分词,使用语言
count: 666
meta: {'suffix': '.xml', 'key1': {'key2': {'count': 18}, 'count': 48}}

📤 output data

Sample 1: text
他的英文名字叫Harry Potter
count: 88
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 551}, 'count': 78}}
Sample 2: text
我出生于2023年12月15日
count: None
meta: {'suffix': '.java', 'key1': {'key2': {'count': 354.32}, 'count': 67}}

✨ explanation

The operator selects the top 20% of samples based on the 'meta.key1.key2.count' field, in descending order. With 10 input samples, top_ratio=0.2 allows 2 samples and topk=5 allows 5, so the smaller limit of 2 applies. The output contains the two highest values for this field (551 and 354.32). Samples 7 and 8 tie at 354.32, and the output keeps Sample 7.

test_reverse_select

TopkSpecifiedFieldSelector(field_key='meta.key1.key2.count', top_ratio=0.5, topk=3, reverse=False)

📥 input data

Sample 1: text
Today is Sun
count: 101
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 34}, 'count': 5}}
Sample 2: text
a v s e c s f e f g a a a  
count: 16
meta: {'suffix': '.docx', 'key1': {'key2': {'count': 243}, 'count': 63}}
Sample 3: text
中文也是一个字算一个长度
count: 162
meta: {'suffix': '.txt', 'key1': {'key2': {'count': None}, 'count': 23}}
Sample 4: text
,。、„”“«»1」「《》´∶:?!
count: None
meta: {'suffix': '.html', 'key1': {'key2': {'count': 18}, 'count': 48}}
Sample 5: text
他的英文名字叫Harry Potter
count: 88
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 551}, 'count': 78}}
Sample 6: text
这是一个测试
count: None
meta: {'suffix': '.py', 'key1': {'key2': {'count': 89}, 'count': 3}}
Sample 7: text
我出生于2023年12月15日
count: None
meta: {'suffix': '.java', 'key1': {'key2': {'count': 354.32}, 'count': 67}}
Sample 8: text
emoji表情测试下😊,😸31231
count: 2
meta: {'suffix': '.html', 'key1': {'key2': {'count': 354.32}, 'count': 32}}
Sample 9: text
a=1
b
c=1+2+3+5
d=6
count: 178
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 33}, 'count': 33}}
Sample 10: text
使用片段分词器对每个页面进行分词,使用语言
count: 666
meta: {'suffix': '.xml', 'key1': {'key2': {'count': 2}, 'count': 48}}

📤 output data

Sample 1: text
使用片段分词器对每个页面进行分词,使用语言
count: 666
meta: {'suffix': '.xml', 'key1': {'key2': {'count': 2}, 'count': 48}}
Sample 2: text
,。、„”“«»1」「《》´∶:?!
count: None
meta: {'suffix': '.html', 'key1': {'key2': {'count': 18}, 'count': 48}}
Sample 3: text
a=1
b
c=1+2+3+5
d=6
count: 178
meta: {'suffix': '.pdf', 'key1': {'key2': {'count': 33}, 'count': 33}}

✨ explanation

With reverse=False the samples are sorted in ascending order, so the operator keeps those with the smallest 'meta.key1.key2.count' values. With 10 input samples, top_ratio=0.5 allows 5 samples and topk=3 allows 3, so 3 samples are kept. The output contains the three lowest non-null values for this field (2, 18, and 33); Sample 3, whose value is None, is not selected.
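The result above can be checked by hand. The sketch below copies the ten meta.key1.key2.count values from the input of this demo, computes how many samples to keep, and ranks them in ascending order. Treating None as larger than any number, so that it never enters an ascending selection, is an assumption chosen to match this demo's output rather than a statement about the operator's internals.

```python
import math

# meta.key1.key2.count for Samples 1-10 of the input, in order.
values = [34, 243, None, 18, 551, 89, 354.32, 354.32, 33, 2]

top_ratio, topk = 0.5, 3
keep = min(int(top_ratio * len(values)), topk)  # min(5, 3) = 3

# Rank sample indices by value, ascending (reverse=False); None sorts last.
order = sorted(
    range(len(values)),
    key=lambda i: math.inf if values[i] is None else values[i],
)

# Pair each selected value with its 1-based input sample number.
selected = [(i + 1, values[i]) for i in order[:keep]]
print(selected)  # [(10, 2), (4, 18), (9, 33)]
```

The selected sample numbers (10, 4, 9) match the three output samples in the order shown.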