data_juicer.ops.selector.random_selector module

class data_juicer.ops.selector.random_selector.RandomSelector(select_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, select_num: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)

Bases: Selector

Randomly selects a subset of samples from the dataset.

This operator randomly selects a subset of samples based on either a specified ratio or a fixed number. If both select_ratio and select_num are provided, the one that results in fewer samples is used. If the dataset contains at most one sample, selection is skipped and the dataset is returned as is. The actual sampling is performed by the random_sample function.

  • select_ratio: The ratio of samples to select (0 to 1).

  • select_num: The exact number of samples to select.

  • If neither select_ratio nor select_num is set, the dataset remains unchanged.
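The rule for combining the two parameters can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: the helper name resolve_sample_count is hypothetical, and both the truncation of ratio * size to an integer and the cap at the dataset size are assumptions made for the example.

```python
from typing import Optional


def resolve_sample_count(
    total: int,
    select_ratio: Optional[float] = None,
    select_num: Optional[int] = None,
) -> Optional[int]:
    """Return how many samples to keep, or None to leave the dataset unchanged.

    Hypothetical helper for illustration only.
    """
    if select_ratio is None and select_num is None:
        return None  # neither parameter set: dataset stays as it is

    candidates = []
    if select_ratio is not None:
        # Assumption: truncate toward zero; the real rounding rule may differ.
        candidates.append(int(select_ratio * total))
    if select_num is not None:
        candidates.append(select_num)

    # The parameter that yields fewer samples wins.
    # Assumption: the result is capped at the dataset size.
    return min(min(candidates), total)


print(resolve_sample_count(1000, select_ratio=0.5))                  # 500
print(resolve_sample_count(1000, select_ratio=0.5, select_num=200))  # 200
print(resolve_sample_count(1000, select_num=5000))                   # 1000
print(resolve_sample_count(1000))                                    # None
```

In the second call the ratio alone would keep 500 samples, so the smaller select_num of 200 is the value applied.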

__init__(select_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, select_num: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)

Initialization method.

Parameters:
  • select_ratio -- The ratio to select. When both select_ratio and select_num are set, the value corresponding to the smaller number of samples will be applied.

  • select_num -- The number of samples to select. When both select_ratio and select_num are set, the value corresponding to the smaller number of samples will be applied.

  • args -- extra positional args

  • kwargs -- extra keyword args

process(dataset)

Maps the input dataset to the selected dataset.

Parameters:

dataset -- input dataset

Returns:

selected dataset.
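
The dataset-in, dataset-out behaviour described for this class can be approximated on a plain Python list. This is a minimal sketch under stated assumptions, not the operator's code: random_select and its seed parameter are hypothetical names, the standard library's random.Random.sample stands in for the library's random_sample function, and truncating ratio * size to an integer is an assumption.

```python
import random
from typing import List, Optional, TypeVar

T = TypeVar("T")


def random_select(
    samples: List[T],
    select_ratio: Optional[float] = None,
    select_num: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[T]:
    """Hypothetical stand-in for the selector, operating on a list."""
    # Selection is skipped for datasets with zero or one sample.
    if len(samples) <= 1:
        return samples
    # With neither parameter set, the dataset remains unchanged.
    if select_ratio is None and select_num is None:
        return samples

    # Collect the candidate counts; the smallest one is applied.
    counts = [len(samples)]
    if select_ratio is not None:
        counts.append(int(select_ratio * len(samples)))  # assumed truncation
    if select_num is not None:
        counts.append(select_num)
    k = min(counts)

    # Sample k distinct items without replacement.
    rng = random.Random(seed)
    return rng.sample(samples, k)


data = list(range(10))
subset = random_select(data, select_ratio=0.5, select_num=3, seed=0)
print(len(subset))                         # 3
print(random_select([42], select_num=1))   # [42]
```

Passing a seed makes the sketch reproducible; whether and how the real operator exposes seeding is not stated in this page.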