data_juicer.ops.selector.frequency_specified_field_selector module

class data_juicer.ops.selector.frequency_specified_field_selector.FrequencySpecifiedFieldSelector(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)[source]

Bases: Selector

Selector to filter samples based on the frequency of a specified field.

This operator selects samples based on the frequency of values in a specified field. The field can be multi-level, with keys separated by dots. It supports filtering by either a top ratio or a fixed number (topk) of the most frequent values. If both top_ratio and topk are provided, the one resulting in fewer samples is used. The sorting order can be controlled with the reverse parameter. The operator processes the dataset and returns a new dataset containing only the selected samples.
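The selection described above can be sketched in plain Python. This is an illustration only, not the operator's actual implementation: `select_by_frequency` is a hypothetical helper, it works on a list of dictionaries rather than a real dataset object, it uses a top-level key for brevity, and it assumes that both `top_ratio` and `topk` count distinct field values rather than samples.

```python
from collections import defaultdict


def select_by_frequency(samples, field_key, top_ratio=None, topk=None, reverse=True):
    """Keep samples whose field value is among the top-ranked values by frequency.

    Hypothetical helper for illustration; not part of data_juicer.
    """
    # Group sample indices by the value found under field_key.
    groups = defaultdict(list)
    for index, sample in enumerate(samples):
        groups[sample[field_key]].append(index)

    # Assumption: both limits count distinct field values, not samples.
    limits = []
    if top_ratio is not None:
        limits.append(int(top_ratio * len(groups)))
    if topk is not None:
        limits.append(topk)
    if not limits:
        return list(samples)  # no limit given, nothing to restrict
    num_values = min(limits)  # the stricter limit wins

    # reverse=True ranks the most frequent values first.
    ranked = sorted(groups.values(), key=len, reverse=reverse)
    kept = [i for group in ranked[:num_values] for i in group]
    return [samples[i] for i in kept]


samples = [
    {"lang": "en"}, {"lang": "en"}, {"lang": "en"},
    {"lang": "zh"}, {"lang": "zh"},
    {"lang": "fr"},
]
most_common = select_by_frequency(samples, "lang", topk=1)
least_common = select_by_frequency(samples, "lang", topk=1, reverse=False)
```

In this sketch `most_common` holds the three `en` samples, while `least_common` holds only the single `fr` sample, because `reverse=False` ranks the least frequent value first.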

__init__(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • field_key -- Key of the field whose values are counted. For a nested (multi-level) field, separate the keys of each level with '.'.

  • top_ratio -- Ratio, between 0 and 1, of the top-ranked field values to select. Samples are selected if their field value falls within this ratio. When both topk and top_ratio are set, whichever selects fewer samples is applied.

  • topk -- Number of top-ranked field values to select; must be greater than 0. Samples are selected if their field value is among these values. When both topk and top_ratio are set, whichever selects fewer samples is applied.

  • reverse -- Sorting order. If reverse=True, field values are sorted in descending order.

  • args -- Extra positional arguments.

  • kwargs -- Extra keyword arguments.
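To illustrate the dot-separated form of field_key, here is a minimal sketch of how such a key could be resolved against a nested sample. `resolve_field` and the sample layout are hypothetical and serve only as an example; the operator's own lookup may differ in details such as error handling.

```python
def resolve_field(sample, field_key):
    """Follow a dot-separated key through nested dictionaries.

    Hypothetical helper for illustration; not part of data_juicer.
    """
    value = sample
    for key in field_key.split("."):
        value = value[key]  # raises KeyError if a level is missing
    return value


sample = {"text": "hello", "meta": {"source": {"lang": "en"}}}
lang = resolve_field(sample, "meta.source.lang")
```

With this layout, a field_key of 'meta.source.lang' would point at the innermost value, while a key without dots, such as 'text', addresses a top-level field.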

process(dataset)[source]

Process the input dataset and return the selected dataset.

Parameters:

dataset -- input dataset

Returns:

selected dataset.