data_juicer.ops.selector

class data_juicer.ops.selector.FrequencySpecifiedFieldSelector(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)[source]

Bases: Selector

Selector to filter samples based on the frequency of a specified field.

This operator selects samples based on the frequency of values in a specified field. The field can be multi-level, with keys separated by dots. It supports filtering by either a top ratio or a fixed number (topk) of the most frequent values. If both top_ratio and topk are provided, the one resulting in fewer samples is used. The sorting order can be controlled with the reverse parameter. The operator processes the dataset and returns a new dataset containing only the selected samples.
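The behavior described above can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper names `get_nested` and `select_by_frequency` are hypothetical, samples are modeled as plain dicts, and the sketch assumes that `top_ratio` is applied to the number of distinct field values.

```python
from collections import defaultdict


def get_nested(sample, field_key):
    """Resolve a dot-separated key such as 'meta.lang' in a nested dict."""
    value = sample
    for key in field_key.split("."):
        value = value[key]
    return value


def select_by_frequency(samples, field_key, top_ratio=None, topk=None, reverse=True):
    """Keep the samples whose field value is among the top-ranked values by frequency."""
    if top_ratio is None and topk is None:
        return list(samples)

    # Group sample indices by field value.
    groups = defaultdict(list)
    for idx, sample in enumerate(samples):
        groups[get_nested(sample, field_key)].append(idx)

    # Assumption: top_ratio is a fraction of the distinct values.
    limits = []
    if top_ratio is not None:
        limits.append(int(top_ratio * len(groups)))
    if topk is not None:
        limits.append(topk)
    n_values = min(limits)  # when both are set, the smaller limit applies

    # reverse=True ranks the most frequent values first.
    ranked = sorted(groups.values(), key=len, reverse=reverse)
    keep = [i for group in ranked[:n_values] for i in group]
    return [samples[i] for i in keep]


samples = [{"meta": {"lang": lang}} for lang in ["en", "zh", "en", "fr", "en", "zh"]]
most_common = select_by_frequency(samples, "meta.lang", topk=1)
print([s["meta"]["lang"] for s in most_common])
```

With `reverse=False` the same call would keep the samples of the least frequent value instead.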

__init__(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:

field_key -- Key of the target field whose value drives the selection. For a multi-level field, the key levels must be separated by '.'.
top_ratio -- Ratio of the top field values to select. Samples are selected if their field values fall within this ratio. When both topk and top_ratio are set, the one corresponding to the smaller number of samples is applied.
topk -- Number of the top field values to select. Samples are selected if their field values fall within this number. When both topk and top_ratio are set, the one corresponding to the smaller number of samples is applied.
reverse -- Sorting rule. If reverse=True, sort in descending order; otherwise sort in ascending order.
args -- extra positional args
kwargs -- extra keyword args

process(dataset)[source]

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.

class data_juicer.ops.selector.RandomSelector(select_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, select_num: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)[source]

Bases: Selector

Randomly selects a subset of samples from the dataset.

This operator randomly selects a subset of samples based on either a specified ratio or a fixed number. If both select_ratio and select_num are provided, the one that results in fewer samples is used. The selection is skipped if the dataset has only one or no samples. The random_sample function is used to perform the actual sampling.

select_ratio: The ratio of samples to select (0 to 1).
select_num: The exact number of samples to select.
If neither select_ratio nor select_num is set, the dataset remains unchanged.
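The rules above can be illustrated with a small standalone sketch. It does not use the library's `random_sample` function; the helper name `random_select` and its `seed` argument are hypothetical and exist only to make the example reproducible.

```python
import random


def random_select(samples, select_ratio=None, select_num=None, seed=None):
    """Randomly keep a subset, using whichever limit yields fewer samples."""
    # Nothing to sample from: return the input unchanged.
    if len(samples) <= 1:
        return list(samples)
    # Neither limit set: return the input unchanged.
    if select_ratio is None and select_num is None:
        return list(samples)

    counts = []
    if select_ratio is not None:
        counts.append(int(select_ratio * len(samples)))
    if select_num is not None:
        counts.append(select_num)
    # Never ask for more samples than the dataset holds.
    n = min(min(counts), len(samples))

    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    return rng.sample(list(samples), n)


subset = random_select(list(range(100)), select_ratio=0.5, select_num=10, seed=0)
print(len(subset))
```

Here `select_ratio=0.5` would allow 50 samples and `select_num=10` allows 10, so the smaller limit determines the subset size.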

__init__(select_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, select_num: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:

select_ratio -- The ratio of samples to select. When both select_ratio and select_num are set, the one corresponding to the smaller number of samples is applied.
select_num -- The number of samples to select. When both select_ratio and select_num are set, the one corresponding to the smaller number of samples is applied.
args -- extra positional args
kwargs -- extra keyword args

process(dataset)[source]

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.

class data_juicer.ops.selector.RangeSpecifiedFieldSelector(field_key: str = '', lower_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, upper_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, lower_rank: Annotated[int, Gt(gt=0)] | None = None, upper_rank: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)[source]

Bases: Selector

Selects a range of samples based on the sorted values of a specified field.

This operator selects samples whose values for a specified field fall within a given range. The range can be defined using percentiles or ranks, and the operator will use the more inclusive bounds if both are provided. The field values are first sorted in ascending order, and then the samples are selected based on the lower and upper bounds. If no bounds are provided, the original dataset is returned. The operator ensures that the specified field exists in the dataset and handles multi-level fields by separating keys with dots.
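A minimal sketch of the percentile case, using plain dicts in place of a dataset. The helper names `get_nested` and `select_range` are hypothetical, and the sketch deliberately leaves out the rule for combining percentile and rank bounds; it only shows how percentiles translate into positions in the ascending order.

```python
def get_nested(sample, field_key):
    """Resolve a dot-separated key such as 'stats.len' in a nested dict."""
    value = sample
    for key in field_key.split("."):
        value = value[key]
    return value


def select_range(samples, field_key, lower_percentile=None, upper_percentile=None):
    """Keep the samples that sit between two percentiles of the sorted field values."""
    # No bounds given: return the input unchanged.
    if lower_percentile is None and upper_percentile is None:
        return list(samples)

    n = len(samples)
    # Convert percentiles into positions in the sorted order.
    lower = 0 if lower_percentile is None else int(lower_percentile * n)
    upper = n if upper_percentile is None else int(upper_percentile * n)
    upper = max(lower, upper)  # guard against an inverted range

    # Sort ascending by the field value, then slice out the requested range.
    ordered = sorted(samples, key=lambda s: get_nested(s, field_key))
    return ordered[lower:upper]


samples = [{"stats": {"len": v}} for v in [50, 10, 40, 20, 30, 60, 80, 70]]
middle = select_range(samples, "stats.len", lower_percentile=0.25, upper_percentile=0.75)
print([s["stats"]["len"] for s in middle])
```

Rank bounds could be handled the same way by using them directly as the slice positions instead of deriving the positions from percentiles.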

__init__(field_key: str = '', lower_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, upper_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, lower_rank: Annotated[int, Gt(gt=0)] | None = None, upper_rank: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:

field_key -- Key of the target field whose value drives the selection. For a multi-level field, the key levels must be separated by '.'.
lower_percentile -- The lower percentile bound of the range to be sampled. Samples are selected if their field values are greater than this lower bound. When both lower_percentile and lower_rank are set, the one corresponding to the larger number of samples is applied.
upper_percentile -- The upper percentile bound of the range to be sampled. Samples are selected if their field values are less than or equal to this upper bound. When both upper_percentile and upper_rank are set, the one corresponding to the smaller number of samples is applied.
lower_rank -- The lower rank bound of the range to be sampled. Samples are selected if their field values are greater than this lower bound. When both lower_percentile and lower_rank are set, the one corresponding to the larger number of samples is applied.
upper_rank -- The upper rank bound of the range to be sampled. Samples are selected if their field values are less than or equal to this upper bound. When both upper_percentile and upper_rank are set, the one corresponding to the smaller number of samples is applied.
args -- extra positional args
kwargs -- extra keyword args

process(dataset)[source]

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.

class data_juicer.ops.selector.TagsSpecifiedFieldSelector(field_key: str = '', target_tags: List[str] = None, *args, **kwargs)[source]

Bases: Selector

Selector to filter samples based on the tags of a specified field.

This operator selects samples where the value of the specified field matches one of the target tags. The field can be multi-level, with levels separated by dots (e.g., 'level1.level2'). The operator checks if the specified field exists in the dataset and if the field value is a string, number, or None. If the field value matches any of the target tags, the sample is kept. The selection is case-sensitive.

The field_key parameter specifies the field to check.
The target_tags parameter is a list of tags to match against the field value.
If the dataset has fewer than two samples or if field_key is empty, the dataset is returned unchanged.
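The matching rule can be sketched as follows. This is an illustration on plain dicts rather than the operator itself; the helper names `get_nested` and `select_by_tags` are hypothetical.

```python
def get_nested(sample, field_key):
    """Resolve a dot-separated key such as 'meta.source' in a nested dict."""
    value = sample
    for key in field_key.split("."):
        value = value[key]
    return value


def select_by_tags(samples, field_key, target_tags):
    """Keep the samples whose field value exactly matches one of the target tags."""
    # Fewer than two samples or no field key: return the input unchanged.
    if len(samples) <= 1 or not field_key:
        return list(samples)

    wanted = set(target_tags)
    # Exact comparison, so 'Wiki' does not match 'wiki'.
    return [s for s in samples if get_nested(s, field_key) in wanted]


samples = [{"meta": {"source": src}} for src in ["wiki", "Wiki", "news", "web"]]
kept = select_by_tags(samples, "meta.source", ["wiki", "news"])
print([s["meta"]["source"] for s in kept])
```

Because the comparison is exact, tags that differ only in letter case would need to be normalized before selection if they should be treated as the same tag.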

__init__(field_key: str = '', target_tags: List[str] = None, *args, **kwargs)[source]

Initialization method.

Parameters:

field_key -- Key of the target field whose value drives the selection. For a multi-level field, the key levels must be separated by '.'.
target_tags -- Target tags to be selected.
args -- extra positional args
kwargs -- extra keyword args

process(dataset)[source]

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.

class data_juicer.ops.selector.TopkSpecifiedFieldSelector(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)[source]

Bases: Selector

Selects top samples based on the sorted values of a specified field.

This operator selects the top samples from a dataset based on the values of a specified field. The field can be multi-level, with keys separated by dots. The selection is based on either a specified ratio of the dataset or a fixed number of top samples. If both top_ratio and topk are provided, the one resulting in fewer samples is used. The sorting order can be ascending or descending, controlled by the reverse parameter. The key metric is the value of the specified field, and the operator uses this to determine which samples to keep.
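As a rough illustration of this selection rule, the sketch below ranks plain dicts by a nested field and keeps the top entries. The helper names `get_nested` and `select_topk` are hypothetical, and using `heapq` here is a choice made for the example, not a statement about the operator's internals.

```python
import heapq


def get_nested(sample, field_key):
    """Resolve a dot-separated key such as 'stats.score' in a nested dict."""
    value = sample
    for key in field_key.split("."):
        value = value[key]
    return value


def select_topk(samples, field_key, top_ratio=None, topk=None, reverse=True):
    """Keep the top samples by field value, limited by a ratio and/or a count."""
    if top_ratio is None and topk is None:
        return list(samples)

    counts = []
    if top_ratio is not None:
        counts.append(int(top_ratio * len(samples)))
    if topk is not None:
        counts.append(topk)
    n = min(counts)  # when both are set, the smaller limit applies

    def field_value(sample):
        return get_nested(sample, field_key)

    # reverse=True keeps the largest values, reverse=False the smallest.
    pick = heapq.nlargest if reverse else heapq.nsmallest
    return pick(n, samples, key=field_value)


samples = [{"stats": {"score": v}} for v in [0.1, 0.9, 0.5, 0.7]]
best = select_topk(samples, "stats.score", top_ratio=0.5, topk=3)
print([s["stats"]["score"] for s in best])
```

In this call `top_ratio=0.5` allows two of the four samples and `topk=3` allows three, so two samples are kept.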

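__init__(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)[source]

Initialization method.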

Parameters:

field_key -- Key of the target field whose value drives the selection. For a multi-level field, the key levels must be separated by '.'.
top_ratio -- Ratio of the top samples to select. Samples are selected if their field values fall within this ratio. When both topk and top_ratio are set, the one corresponding to the smaller number of samples is applied.
topk -- Number of the top samples to select. Samples are selected if their field values fall within this number. When both topk and top_ratio are set, the one corresponding to the smaller number of samples is applied.
reverse -- Sorting rule. If reverse=True, sort in descending order; otherwise sort in ascending order.
args -- extra positional args
kwargs -- extra keyword args

process(dataset)[source]

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.