data_juicer.ops.selector.range_specified_field_selector module

class data_juicer.ops.selector.range_specified_field_selector.RangeSpecifiedFieldSelector(field_key: str = '', lower_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, upper_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, lower_rank: Annotated[int, Gt(gt=0)] | None = None, upper_rank: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)[source]

Bases: Selector

Selects a range of samples based on the sorted values of a specified field.

This operator selects samples whose values for a specified field fall within a given range. The range can be defined using percentiles or ranks, and the operator will use the more inclusive bounds if both are provided. The field values are first sorted in ascending order, and then the samples are selected based on the lower and upper bounds. If no bounds are provided, the original dataset is returned. The operator ensures that the specified field exists in the dataset and handles multi-level fields by separating keys with dots.

__init__(field_key: str = '', lower_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, upper_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, lower_rank: Annotated[int, Gt(gt=0)] | None = None, upper_rank: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • field_key – Selector based on the specified value corresponding to the target key. The target key corresponding to multi-level field information need to be separated by ‘.’.

  • lower_percentile – The lower bound of the percentile to be sample, samples will be selected if their specified field values are greater than this lower bound. When both lower_percentile and lower_rank are set, the value corresponding to the larger number of samples will be applied.

  • upper_percentile – The upper bound of the percentile to be sample, samples will be selected if their specified field values are less or equal to the upper bound. When both upper_percentile and upper_rank are set, the value corresponding to the smaller number of samples will be applied.

  • lower_rank – The lower bound of the rank to be sample, samples will be selected if their specified field values are greater than this lower bound. When both lower_percentile and lower_rank are set, the value corresponding to the larger number of samples will be applied.

  • upper_rank – The upper bound of the rank to be sample, samples will be selected if their specified field values are less or equal to the upper bound. When both upper_percentile and upper_rank are set, the value corresponding to the smaller number of samples will be applied.

  • args – extra args

  • kwargs – extra args

process(dataset)[source]

Dataset –> dataset.

Parameters:

dataset – input dataset

Returns:

selected dataset.