range_specified_field_selector

Selects a range of samples based on the sorted values of a specified field.

This operator selects samples whose values for a specified field fall within a given range. The range can be defined by percentiles, by ranks, or by both. When both are provided for the same side, the stricter bound applies: the larger of the two lower bounds and the smaller of the two upper bounds. Field values are first sorted in ascending order, and samples are then selected between the lower and upper bounds. If no bounds are provided, the original dataset is returned. The operator checks that the specified field exists in the dataset; for multi-level fields, separate the keys with dots (for example, `meta.key1.count`).

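The selection rule can be sketched in plain Python. This is an illustrative sketch, not the operator's actual implementation: `select_range` is a hypothetical helper, the input values are made up, and two details are assumptions (percentiles are converted to positions by truncating `percentile * n`, and the window is half-open over ascending sort positions).

```python
def select_range(values, lower_percentile=None, upper_percentile=None,
                 lower_rank=None, upper_rank=None):
    """Return indices of `values` whose ascending sort position lies in the window.

    Hypothetical helper for illustration only.
    """
    n = len(values)
    bounds = (lower_percentile, upper_percentile, lower_rank, upper_rank)
    if all(b is None for b in bounds):
        return list(range(n))  # no bounds given: keep the whole dataset

    lower, upper = 0, n
    if lower_percentile is not None:
        lower = int(lower_percentile * n)  # assumption: truncate to a position
    if lower_rank is not None:
        lower = max(lower, lower_rank)     # the larger lower bound applies
    if upper_percentile is not None:
        upper = int(upper_percentile * n)
    if upper_rank is not None:
        upper = min(upper, upper_rank)     # the smaller upper bound applies
    upper = max(lower, upper)              # never let the window go negative

    # Sort indices by field value (ascending), then slice out the window.
    order = sorted(range(n), key=values.__getitem__)
    return order[lower:upper]


# Made-up field values, one per sample.
values = [50, 10, 40, 20, 30, 90, 70, 60, 80, 100]
picked = select_range(values, lower_percentile=0.78, upper_percentile=0.9,
                      lower_rank=5, upper_rank=10)
print([values[i] for i in picked])  # [80, 90]
```

With ten values, the percentile bounds give positions 7 and 9 and the rank bounds give 5 and 10, so the window is positions 7 to 9 and two samples are kept.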

Type: selector

Tags: cpu

🔧 Parameter Configuration

name	type	default	description
`field_key`	`str`	`''`	Key of the field whose value drives the selection. For multi-level fields, separate the keys with '.'.
`lower_percentile`	`Optional[float]`, 0 ≤ value ≤ 1	`None`	Lower percentile bound. Samples are selected if their field values are greater than this bound. When both `lower_percentile` and `lower_rank` are set, the one corresponding to the larger number of samples is applied.
`upper_percentile`	`Optional[float]`, 0 ≤ value ≤ 1	`None`	Upper percentile bound. Samples are selected if their field values are less than or equal to this bound. When both `upper_percentile` and `upper_rank` are set, the one corresponding to the smaller number of samples is applied.
`lower_rank`	`Optional[int]`, value > 0	`None`	Lower rank bound. Samples are selected if their field values are greater than this bound. When both `lower_percentile` and `lower_rank` are set, the one corresponding to the larger number of samples is applied.
`upper_rank`	`Optional[int]`, value > 0	`None`	Upper rank bound. Samples are selected if their field values are less than or equal to this bound. When both `upper_percentile` and `upper_rank` are set, the one corresponding to the smaller number of samples is applied.
`args`		`''`	extra args
`kwargs`		`''`	extra args
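To illustrate the dot-separated `field_key` convention, here is a small sketch of how such a key could be resolved against a nested sample. `get_nested` is a hypothetical helper written for this example and is not part of the operator's API; the sample reuses values shown in the demonstration data.

```python
def get_nested(sample, field_key):
    """Walk a nested dict along a dot-separated key (hypothetical helper)."""
    value = sample
    for key in field_key.split("."):
        # Fail loudly when a level is missing, mirroring the operator's
        # documented check that the specified field exists.
        if not isinstance(value, dict) or key not in value:
            raise KeyError(f"'{key}' (from '{field_key}') not found in sample")
        value = value[key]
    return value


sample = {"text": "Today is Sun", "count": 101, "meta": {"suffix": ".pdf"}}

print(get_nested(sample, "count"))        # 101
print(get_nested(sample, "meta.suffix"))  # .pdf
```

A key such as `meta.key1.count` would raise `KeyError` for this sample, because `meta` has no `key1` entry here.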

📊 Effect demonstration

test_percentile_select

RangeSpecifiedFieldSelector(field_key='meta.key1.count', lower_percentile=0.78, upper_percentile=0.9, lower_rank=5, upper_rank=10)

📥 input data

sample	text	count	meta.suffix	meta.key1.key2.count	meta.key1.count
1	Today is Sun	101	.pdf		
2	a v s e c s f e f g a a a		.docx	243	
3	中文也是一个字算一个长度	162	.txt	None	
4	，。、„”“«»１」「《》´∶：？！	None	.html		
5	他的英文名字叫Harry Potter		.pdf	551	
6	这是一个测试	None	.py		
7	我出生于2023年12月15日	None	.java	354.32	
8	emoji表情测试下😊，😸31231		.html	354.32	
9	`a=1\nb\nc=1+2+3+5\nd=6`	178	.pdf		
10	使用片段分词器对每个页面进行分词，使用语言	666	.xml		

📤 output data

sample	text	count	meta.suffix	meta.key1.key2.count	meta.key1.count
1	a v s e c s f e f g a a a		.docx	243	
2	我出生于2023年12月15日	None	.java	354.32	

✨ explanation

The operator selects samples by the `meta.key1.count` field, using percentile bounds of 0.78 and 0.9 together with rank bounds of 5 and 10. The resulting dataset contains two samples, whose `meta.key1.count` values of 63 and 67 fall within the specified range.

test_list_select

RangeSpecifiedFieldSelector(field_key='meta.key1.key2.count', lower_percentile=0.0, upper_percentile=0.5, lower_rank=2, upper_rank=4)

📥 input data

sample	text	count	meta.suffix	meta.key1.key2.count	meta.key1.count
1	Today is Sun	101	.pdf	[34.0]	
2	a v s e c s f e f g a a a		.docx	[243.0]	
3	中文也是一个字算一个长度	162	.txt	[]	
4	，。、„”“«»１」「《》´∶：？！	None	.html	None	
5	他的英文名字叫Harry Potter		.pdf	[551.0]	
6	这是一个测试	None	.py	[89.0]	
7	我出生于2023年12月15日	None	.java	[354.32]	
8	emoji表情测试下😊，😸31231		.html	[354.32]	
9	`a=1\nb\nc=1+2+3+5\nd=6`	178	.pdf	[33.0, 33.0]	
10	使用片段分词器对每个页面进行分词，使用语言	666	.xml	[2.0, 2.0]	

📤 output data

sample	text	count	meta.suffix	meta.key1.key2.count	meta.key1.count
1	`a=1\nb\nc=1+2+3+5\nd=6`	178	.pdf	[33.0, 33.0]	
2	使用片段分词器对每个页面进行分词，使用语言	666	.xml	[2.0, 2.0]	

✨ explanation

This test demonstrates selection when `meta.key1.key2.count` holds a list, using percentile bounds of 0.0 and 0.5 together with rank bounds of 2 and 4. The output contains the samples with lists [33.0, 33.0] and [2.0, 2.0], which meet the selection criteria.
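The number of samples kept in the two demonstrations can be checked with a little arithmetic. The sketch below uses a hypothetical helper, `resolve_window`, and assumes that a percentile is converted to a position by truncating `percentile * n`. That assumption is used here because it reproduces the two-sample outputs shown above. It is not taken from the operator's source, and the sketch says nothing about how list-valued fields are ordered.

```python
def resolve_window(n, lower_percentile, upper_percentile, lower_rank, upper_rank):
    """Combine percentile and rank bounds into one (lower, upper) window.

    Hypothetical helper: the larger lower bound and the smaller upper
    bound are applied, as the parameter descriptions state.
    """
    lower = max(int(lower_percentile * n), lower_rank)
    upper = min(int(upper_percentile * n), upper_rank)
    return lower, max(lower, upper)


# test_percentile_select: 10 samples, percentiles 0.78 to 0.9, ranks 5 to 10
print(resolve_window(10, 0.78, 0.9, 5, 10))  # (7, 9)

# test_list_select: 10 samples, percentiles 0.0 to 0.5, ranks 2 to 4
print(resolve_window(10, 0.0, 0.5, 2, 4))    # (2, 4)
```

Both windows are two positions wide, which matches the two samples in each output.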
