data_juicer.ops.selector

class data_juicer.ops.selector.FrequencySpecifiedFieldSelector(field_key: str = '', top_ratio: float | None = None, topk: int | None = None, reverse: bool = True, *args, **kwargs)

Bases: Selector

Selector to select samples based on the sorted frequency of the values in the specified field.

__init__(field_key: str = '', top_ratio: float | None = None, topk: int | None = None, reverse: bool = True, *args, **kwargs)

Initialization method.

Parameters:
  • field_key – Key of the field whose value drives the selection. For a nested (multi-level) field, separate the key levels with '.'.

  • top_ratio – Ratio of the top field values to select. Samples are selected if their specified field value falls within this ratio. When both topk and top_ratio are set, the one that yields fewer samples is applied.

  • topk – Number of the top field values to select. Samples are selected if their specified field value falls within this number. When both topk and top_ratio are set, the one that yields fewer samples is applied.

  • reverse – Sorting rule. If reverse=True, sort in descending order; otherwise sort in ascending order.

  • args – extra positional arguments

  • kwargs – extra keyword arguments
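The '.'-separated form of field_key can be pictured with a small lookup helper. This is a minimal sketch in plain Python, not the library's own code: the helper name get_nested_field and the sample layout are hypothetical and serve only to illustrate how a key such as 'meta.stats.length' walks a nested sample.

```python
from typing import Any, Mapping


def get_nested_field(sample: Mapping[str, Any], field_key: str) -> Any:
    """Resolve a dot-separated key (hypothetical helper, for illustration)."""
    value: Any = sample
    for part in field_key.split("."):
        # Each part descends one level; a missing level raises KeyError.
        value = value[part]
    return value


# Hypothetical sample with two levels of nesting under "meta".
sample = {"text": "hello", "meta": {"language": "en", "stats": {"length": 5}}}
language = get_nested_field(sample, "meta.language")
length = get_nested_field(sample, "meta.stats.length")
```

A single-level key such as 'text' passes through the same loop unchanged, so one code path covers both flat and nested fields.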

process(dataset)[source]

Dataset –> dataset.

Parameters:

dataset – input dataset

Returns:

selected dataset.
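The frequency-based selection described for this class can be sketched in plain Python. The function below is an illustration under stated assumptions, not the library's implementation: select_by_frequency is a hypothetical name, top_ratio is assumed to apply to the number of distinct field values and to round down, and a call with neither limit set is assumed to return every sample.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def select_by_frequency(
    samples: List[Dict[str, Any]],
    field_key: str,
    top_ratio: Optional[float] = None,
    topk: Optional[int] = None,
    reverse: bool = True,
) -> List[Dict[str, Any]]:
    # Count how often each distinct field value occurs.
    counts = Counter(s[field_key] for s in samples)
    # Rank distinct values by frequency (descending when reverse=True).
    ranked = sorted(counts, key=lambda v: counts[v], reverse=reverse)

    limits = []
    if top_ratio is not None:
        # Assumption: the ratio is taken over distinct values, rounded down.
        limits.append(int(top_ratio * len(ranked)))
    if topk is not None:
        limits.append(topk)
    if not limits:
        # Assumption: with no limit given, nothing is filtered out.
        return list(samples)

    # When both limits are set, the smaller one applies.
    keep = set(ranked[: min(limits)])
    return [s for s in samples if s[field_key] in keep]


samples = [{"lang": v} for v in ["en", "zh", "en", "fr", "en", "zh"]]
selected = select_by_frequency(samples, "lang", top_ratio=0.5, topk=2)
```

In this example there are three distinct values, so top_ratio=0.5 allows one value and topk=2 allows two; the smaller limit wins and only the samples carrying the most frequent value are kept.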

class data_juicer.ops.selector.RandomSelector(select_ratio: float | None = None, select_num: int | None = None, *args, **kwargs)

Bases: Selector

Selector to randomly select samples.

__init__(select_ratio: float | None = None, select_num: int | None = None, *args, **kwargs)

Initialization method.

Parameters:
  • select_ratio – Ratio of samples to select. When both select_ratio and select_num are set, the one that yields fewer samples is applied.

  • select_num – Number of samples to select. When both select_ratio and select_num are set, the one that yields fewer samples is applied.

  • args – extra positional arguments

  • kwargs – extra keyword arguments

process(dataset)[source]

Dataset –> dataset.

Parameters:

dataset – input dataset

Returns:

selected dataset.
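How select_ratio and select_num combine can be sketched with the standard library alone. This is an illustrative sketch, not the library's code: random_select and its seed parameter are hypothetical, the ratio is assumed to round down, and a call with neither limit set is assumed to return all samples.

```python
import random
from typing import Any, List, Optional


def random_select(
    samples: List[Any],
    select_ratio: Optional[float] = None,
    select_num: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[Any]:
    n = len(samples)
    limits = []
    if select_ratio is not None:
        # Assumption: the ratio is converted to a count by rounding down.
        limits.append(int(select_ratio * n))
    if select_num is not None:
        limits.append(select_num)
    if not limits:
        # Assumption: with no limit given, nothing is filtered out.
        return list(samples)

    # The smaller limit applies; never request more than is available.
    count = min(min(limits), n)
    rng = random.Random(seed)  # local generator keeps the run reproducible
    return rng.sample(list(samples), count)


samples = list(range(10))
picked = random_select(samples, select_ratio=0.5, select_num=3, seed=0)
```

With ten samples, select_ratio=0.5 allows five and select_num=3 allows three, so three distinct samples are drawn.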

class data_juicer.ops.selector.RangeSpecifiedFieldSelector(field_key: str = '', lower_percentile: float | None = None, upper_percentile: float | None = None, lower_rank: int | None = None, upper_rank: int | None = None, *args, **kwargs)

Bases: Selector

Selector to select a range of samples after sorting them by the specified field value from smallest to largest.

__init__(field_key: str = '', lower_percentile: float | None = None, upper_percentile: float | None = None, lower_rank: int | None = None, upper_rank: int | None = None, *args, **kwargs)

Initialization method.

Parameters:
  • field_key – Key of the field whose value drives the selection. For a nested (multi-level) field, separate the key levels with '.'.

  • lower_percentile – Lower percentile bound of the samples to select. Samples are selected if their specified field values are greater than this lower bound. When both lower_percentile and lower_rank are set, the value corresponding to the larger number of samples is applied.

  • upper_percentile – Upper percentile bound of the samples to select. Samples are selected if their specified field values are less than or equal to this upper bound. When both upper_percentile and upper_rank are set, the value corresponding to the smaller number of samples is applied.

  • lower_rank – Lower rank bound of the samples to select. Samples are selected if their specified field values are greater than this lower bound. When both lower_percentile and lower_rank are set, the value corresponding to the larger number of samples is applied.

  • upper_rank – Upper rank bound of the samples to select. Samples are selected if their specified field values are less than or equal to this upper bound. When both upper_percentile and upper_rank are set, the value corresponding to the smaller number of samples is applied.

  • args – extra positional arguments

  • kwargs – extra keyword arguments

process(dataset)[source]

Dataset –> dataset.

Parameters:

dataset – input dataset

Returns:

selected dataset.
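The interplay of the percentile and rank bounds can be sketched as a slice over the sorted samples. This is a sketch under assumptions, not the library's implementation: select_range is a hypothetical name, percentiles are assumed to be fractions in [0, 1] that round down to a position, and "the larger number of samples" for the lower bound is read here as the larger cutoff position.

```python
from typing import Any, Dict, List, Optional


def select_range(
    samples: List[Dict[str, Any]],
    field_key: str,
    lower_percentile: Optional[float] = None,
    upper_percentile: Optional[float] = None,
    lower_rank: Optional[int] = None,
    upper_rank: Optional[int] = None,
) -> List[Dict[str, Any]]:
    n = len(samples)
    # Sort from smallest to largest field value.
    ordered = sorted(samples, key=lambda s: s[field_key])

    lower, upper = 0, n
    if lower_percentile is not None:
        lower = int(lower_percentile * n)  # assumption: round down
    if lower_rank is not None:
        lower = max(lower, lower_rank)  # assumption: larger cutoff wins
    if upper_percentile is not None:
        upper = int(upper_percentile * n)  # assumption: round down
    if upper_rank is not None:
        upper = min(upper, upper_rank)  # smaller cutoff wins
    upper = max(lower, upper)  # avoid an inverted range

    # Keeps 1-based ranks r with lower < r <= upper.
    return ordered[lower:upper]


samples = [{"score": v} for v in [5, 1, 9, 3, 7, 0, 8, 2, 6, 4]]
selected = select_range(
    samples,
    "score",
    lower_percentile=0.2,
    upper_percentile=0.8,
    lower_rank=3,
    upper_rank=9,
)
scores = [s["score"] for s in selected]
```

With ten samples the percentile bounds give positions 2 and 8 and the rank bounds give 3 and 9, so the slice runs from 3 to 8 and keeps the scores 3 through 7.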

class data_juicer.ops.selector.TopkSpecifiedFieldSelector(field_key: str = '', top_ratio: float | None = None, topk: int | None = None, reverse: bool = True, *args, **kwargs)

Bases: Selector

Selector to select top samples based on the sorted specified field value.

__init__(field_key: str = '', top_ratio: float | None = None, topk: int | None = None, reverse: bool = True, *args, **kwargs)

Initialization method.

Parameters:
  • field_key – Key of the field whose value drives the selection. For a nested (multi-level) field, separate the key levels with '.'.

  • top_ratio – Ratio of the top samples to select. Samples are selected if their specified field value falls within this ratio. When both topk and top_ratio are set, the one that yields fewer samples is applied.

  • topk – Number of the top samples to select. Samples are selected if their specified field value falls within this number. When both topk and top_ratio are set, the one that yields fewer samples is applied.

  • reverse – Sorting rule. If reverse=True, sort in descending order; otherwise sort in ascending order.

  • args – extra positional arguments

  • kwargs – extra keyword arguments

process(dataset)[source]

Dataset –> dataset.

Parameters:

dataset – input dataset

Returns:

selected dataset.
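Top-k selection by field value can be sketched as a sort followed by a slice. This is a minimal illustration, not the library's implementation: select_topk is a hypothetical name, top_ratio is assumed to round down to a sample count, and a call with neither limit set is assumed to return all samples.

```python
from typing import Any, Dict, List, Optional


def select_topk(
    samples: List[Dict[str, Any]],
    field_key: str,
    top_ratio: Optional[float] = None,
    topk: Optional[int] = None,
    reverse: bool = True,
) -> List[Dict[str, Any]]:
    limits = []
    if top_ratio is not None:
        # Assumption: the ratio is converted to a count by rounding down.
        limits.append(int(top_ratio * len(samples)))
    if topk is not None:
        limits.append(topk)
    if not limits:
        # Assumption: with no limit given, nothing is filtered out.
        return list(samples)

    # reverse=True sorts in descending order, so the largest values come first.
    ordered = sorted(samples, key=lambda s: s[field_key], reverse=reverse)
    # When both limits are set, the smaller one applies.
    return ordered[: min(limits)]


samples = [{"score": v} for v in [0.2, 0.9, 0.5, 0.7]]
top = select_topk(samples, "score", top_ratio=0.5, topk=3)
top_scores = [s["score"] for s in top]
```

Here top_ratio=0.5 allows two of the four samples and topk=3 allows three, so the two highest-scoring samples are returned. Passing reverse=False would return the two lowest instead.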