data_juicer.ops.selector.topk_specified_field_selector module

class data_juicer.ops.selector.topk_specified_field_selector.TopkSpecifiedFieldSelector(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)

Bases: Selector

Selects top samples based on the sorted values of a specified field.

This operator sorts the dataset by the values of a specified field and keeps the top samples. The field can be multi-level, with the keys at each level separated by dots. The number of samples kept is given either as a ratio of the dataset (top_ratio) or as a fixed count (topk). If both top_ratio and topk are provided, the one resulting in fewer samples is used. The reverse parameter controls the sorting order: with reverse=True the samples are sorted in descending order, so those with the largest field values are kept; otherwise they are sorted in ascending order.
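The selection rule described above can be sketched in plain Python. This is a minimal illustration on a list of dicts, not the operator's actual implementation: the helpers select_count and select_top are hypothetical names, and truncating top_ratio * len(samples) to an integer is an assumption about how the ratio maps to a count.

```python
import heapq


def select_count(n_samples, top_ratio=None, topk=None):
    """Number of samples to keep (hypothetical helper, not part of data_juicer)."""
    candidates = []
    if top_ratio is not None:
        # Assumption: the ratio is converted to a count by truncation.
        candidates.append(int(top_ratio * n_samples))
    if topk is not None:
        candidates.append(topk)
    # When both limits are given, the one yielding fewer samples wins.
    # With neither limit set, this sketch keeps every sample.
    return min(candidates) if candidates else n_samples


def select_top(samples, key, top_ratio=None, topk=None, reverse=True):
    """Keep the top samples ranked by samples[i][key] (hypothetical helper)."""
    n = min(select_count(len(samples), top_ratio, topk), len(samples))
    # reverse=True ranks in descending order, so the largest values are kept.
    pick = heapq.nlargest if reverse else heapq.nsmallest
    return pick(n, samples, key=lambda sample: sample[key])


samples = [{"score": s} for s in [3, 9, 1, 7, 5, 8]]
both_limits = select_top(samples, "score", top_ratio=0.5, topk=2)
ascending = select_top(samples, "score", top_ratio=0.5, reverse=False)
```

Here top_ratio=0.5 alone would keep three of the six samples, but topk=2 is smaller, so both_limits holds only the two highest-scoring samples. With reverse=False the ranking is ascending and the three lowest-scoring samples are kept.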

__init__(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)

Initialization method.

Parameters:
  • field_key – Key of the field whose values are used for selection. For a multi-level (nested) field, the keys at each level need to be separated by '.'.

  • top_ratio – Ratio of top samples to select, in the range [0, 1]. Samples are selected if their specified field values rank within this fraction of the dataset. When both topk and top_ratio are set, whichever yields the smaller number of samples is applied.

  • topk – Number of top samples to select; must be greater than 0. Samples are selected if their specified field values rank within this count. When both topk and top_ratio are set, whichever yields the smaller number of samples is applied.

  • reverse – Sorting rule. If reverse=True (the default), samples are sorted in descending order; otherwise they are sorted in ascending order.

  • args – extra positional arguments

  • kwargs – extra keyword arguments
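To illustrate what a dot-separated field_key refers to, the sketch below resolves a key path through nested dicts. The helper get_nested and the sample layout are hypothetical and only show the idea; the operator's own lookup may differ in detail.

```python
def get_nested(sample, field_key):
    """Follow a dot-separated key path through nested dicts (hypothetical helper)."""
    value = sample
    for part in field_key.split("."):
        # Each part of the path selects one level of nesting.
        value = value[part]
    return value


# Hypothetical sample with a two-level nested field.
sample = {"text": "hello", "meta": {"stats": {"score": 0.87}}}
nested_value = get_nested(sample, "meta.stats.score")
top_level_value = get_nested(sample, "text")
```

A single-level key such as "text" is just the degenerate case of a path with one part.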

process(dataset)

Dataset -> dataset.

Parameters:

dataset – input dataset

Returns:

selected dataset.