data_juicer.ops.selector package

Submodules

data_juicer.ops.selector.frequency_specified_field_selector module

class data_juicer.ops.selector.frequency_specified_field_selector.FrequencySpecifiedFieldSelector(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)

Bases: Selector

Selector that selects samples based on the sorted frequency of the specified field's values.

__init__(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)

Initialization method.

Parameters:

field_key -- The key of the field whose value the selection is based on. For a nested (multi-level) field, separate the levels of the key with '.'.
top_ratio -- Ratio of the top field values to select; a sample is selected if its field value falls within this ratio. When both topk and top_ratio are set, whichever corresponds to the smaller number of samples is applied.
topk -- Number of the top field values to select; a sample is selected if its field value falls within this number. When both topk and top_ratio are set, whichever corresponds to the smaller number of samples is applied.
reverse -- The sorting rule. If reverse=True, sort in descending order.
args -- extra positional arguments
kwargs -- extra keyword arguments

process(dataset)

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.
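
The frequency-based selection described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: `get_nested` and `select_by_frequency` are hypothetical helpers, samples are plain dicts rather than a real dataset, and the sketch assumes that both limits count distinct field values and that setting neither limit keeps every sample.

```python
from collections import Counter


def get_nested(sample, field_key):
    """Resolve a dotted key such as 'meta.lang' in a nested dict (hypothetical helper)."""
    value = sample
    for part in field_key.split("."):
        value = value[part]
    return value


def select_by_frequency(samples, field_key, top_ratio=None, topk=None, reverse=True):
    """Keep samples whose field value ranks among the top values by frequency."""
    if top_ratio is None and topk is None:
        return list(samples)  # assumption: no limit means nothing is filtered out
    counts = Counter(get_nested(s, field_key) for s in samples)
    # Rank distinct values by occurrence count; reverse=True puts the most frequent first.
    ranked = sorted(counts, key=lambda v: counts[v], reverse=reverse)
    limits = []
    if top_ratio is not None:
        limits.append(int(top_ratio * len(ranked)))
    if topk is not None:
        limits.append(topk)
    # When both limits are given, the smaller one wins.
    keep = set(ranked[:min(limits)])
    return [s for s in samples if get_nested(s, field_key) in keep]


samples = [{"meta": {"lang": lang}} for lang in ["en", "zh", "en", "fr", "en", "zh"]]
most_common = select_by_frequency(samples, "meta.lang", topk=1)
top_two = select_by_frequency(samples, "meta.lang", top_ratio=1.0, topk=2)
rarest = select_by_frequency(samples, "meta.lang", topk=1, reverse=False)
```

Here `most_common` holds the three 'en' samples, `top_two` holds the five 'en' and 'zh' samples because topk=2 is the smaller limit, and `rarest` holds the single 'fr' sample.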

data_juicer.ops.selector.random_selector module

class data_juicer.ops.selector.random_selector.RandomSelector(select_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, select_num: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)

Bases: Selector

Selector that randomly selects samples.

__init__(select_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, select_num: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)

Initialization method.

Parameters:

select_ratio -- The ratio of samples to select. When both select_ratio and select_num are set, whichever corresponds to the smaller number of samples is applied.
select_num -- The number of samples to select. When both select_ratio and select_num are set, whichever corresponds to the smaller number of samples is applied.
args -- extra positional arguments
kwargs -- extra keyword arguments

process(dataset)

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.
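
How select_ratio and select_num interact can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: `random_select` and its `seed` parameter are hypothetical, and the sketch assumes that setting neither limit keeps every sample.

```python
import random


def random_select(samples, select_ratio=None, select_num=None, seed=None):
    """Randomly pick samples, honouring the smaller of the two limits."""
    samples = list(samples)
    if select_ratio is None and select_num is None:
        return samples  # assumption: no limit means nothing is dropped
    limits = []
    if select_ratio is not None:
        limits.append(int(len(samples) * select_ratio))
    if select_num is not None:
        limits.append(select_num)
    # Never ask for more samples than exist.
    n = min(min(limits), len(samples))
    rng = random.Random(seed)  # hypothetical seed, only for reproducibility here
    return rng.sample(samples, n)


data = list(range(100))
by_num = random_select(data, select_ratio=0.25, select_num=10, seed=0)
by_ratio = random_select(data, select_ratio=0.25, select_num=50, seed=0)
```

With 100 samples, `by_num` has 10 items because select_num is the smaller limit, while `by_ratio` has 25 items because the ratio is.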

data_juicer.ops.selector.range_specified_field_selector module

class data_juicer.ops.selector.range_specified_field_selector.RangeSpecifiedFieldSelector(field_key: str = '', lower_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, upper_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, lower_rank: Annotated[int, Gt(gt=0)] | None = None, upper_rank: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)

Bases: Selector

Selector that selects a range of samples based on the specified field value, sorted from smallest to largest.

__init__(field_key: str = '', lower_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, upper_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, lower_rank: Annotated[int, Gt(gt=0)] | None = None, upper_rank: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)

Initialization method.

Parameters:

field_key -- The key of the field whose value the selection is based on. For a nested (multi-level) field, separate the levels of the key with '.'.
lower_percentile -- The lower percentile bound of the range to sample; samples are selected if their field values are greater than this lower bound. When both lower_percentile and lower_rank are set, whichever corresponds to the larger number of samples is applied.
upper_percentile -- The upper percentile bound of the range to sample; samples are selected if their field values are less than or equal to this upper bound. When both upper_percentile and upper_rank are set, whichever corresponds to the smaller number of samples is applied.
lower_rank -- The lower rank bound of the range to sample; samples are selected if their field values are greater than this lower bound. When both lower_percentile and lower_rank are set, whichever corresponds to the larger number of samples is applied.
upper_rank -- The upper rank bound of the range to sample; samples are selected if their field values are less than or equal to this upper bound. When both upper_percentile and upper_rank are set, whichever corresponds to the smaller number of samples is applied.
args -- extra positional arguments
kwargs -- extra keyword arguments

process(dataset)

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.
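
One possible reading of the four bounds is sketched below in plain Python. It is an interpretation, not the library's implementation: `select_range` is a hypothetical helper that uses a flat key, and it assumes each percentile is converted to a rank position, the larger position wins for the lower bound, and the smaller position wins for the upper bound. Slicing the ascending order with `[lower:upper]` keeps ranks greater than the lower bound and less than or equal to the upper bound, matching the wording above.

```python
def select_range(samples, field_key, lower_percentile=None, upper_percentile=None,
                 lower_rank=None, upper_rank=None):
    """Keep the samples whose rank, in ascending field order, lies in (lower, upper]."""
    n = len(samples)
    lower, upper = 0, n
    if lower_percentile is not None:
        lower = int(lower_percentile * n)
    if lower_rank is not None:
        lower = max(lower, lower_rank)  # assumption: the larger position wins
    if upper_percentile is not None:
        upper = int(upper_percentile * n)
    if upper_rank is not None:
        upper = min(upper, upper_rank)  # assumption: the smaller position wins
    upper = max(lower, upper)  # an inverted range selects nothing
    ranked = sorted(samples, key=lambda s: s[field_key])
    return ranked[lower:upper]


samples = [{"score": v} for v in [5, 3, 8, 1, 7, 2, 6, 4]]
middle = select_range(samples, "score", lower_percentile=0.25, upper_percentile=0.75)
capped = select_range(samples, "score", lower_percentile=0.25,
                      upper_percentile=0.75, upper_rank=4)
raised = select_range(samples, "score", lower_percentile=0.25, lower_rank=3,
                      upper_percentile=0.75)
```

With eight samples scored 1 to 8, `middle` holds scores 3 to 6, `capped` holds scores 3 and 4, and `raised` holds scores 4 to 6.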

data_juicer.ops.selector.tags_specified_field_selector module

class data_juicer.ops.selector.tags_specified_field_selector.TagsSpecifiedFieldSelector(field_key: str = '', target_tags: List[str] | None = None, *args, **kwargs)

Bases: Selector

Selector that selects samples based on the tags of the specified field.

__init__(field_key: str = '', target_tags: List[str] | None = None, *args, **kwargs)

Initialization method.

Parameters:

field_key -- The key of the field whose value the selection is based on. For a nested (multi-level) field, separate the levels of the key with '.'.
target_tags -- The target tags to select.
args -- extra positional arguments
kwargs -- extra keyword arguments

process(dataset)

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.
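
Tag-based selection amounts to a membership test on the field value. The sketch below shows the idea in plain Python; `select_by_tags` is a hypothetical helper with a flat key, and returning an empty list when target_tags is None is a choice of this sketch, not documented library behaviour.

```python
def select_by_tags(samples, field_key, target_tags=None):
    """Keep samples whose field value is one of the target tags."""
    # Assumption of this sketch: no target tags means nothing matches.
    targets = set(target_tags or [])
    return [s for s in samples if s[field_key] in targets]


samples = [{"source": tag} for tag in ["web", "book", "code", "web"]]
selected = select_by_tags(samples, "source", target_tags=["web", "code"])
```

`selected` keeps the two 'web' samples and the 'code' sample in their original order and drops the 'book' sample.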

data_juicer.ops.selector.topk_specified_field_selector module

class data_juicer.ops.selector.topk_specified_field_selector.TopkSpecifiedFieldSelector(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)

Bases: Selector

Selector that selects the top samples based on the sorted specified field value.

__init__(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)

Initialization method.

Parameters:

field_key -- The key of the field whose value the selection is based on. For a nested (multi-level) field, separate the levels of the key with '.'.
top_ratio -- Ratio of the top samples to select; a sample is selected if its field value falls within this ratio. When both topk and top_ratio are set, whichever corresponds to the smaller number of samples is applied.
topk -- Number of the top samples to select; a sample is selected if its field value falls within this number. When both topk and top_ratio are set, whichever corresponds to the smaller number of samples is applied.
reverse -- The sorting rule. If reverse=True, sort in descending order.
args -- extra positional arguments
kwargs -- extra keyword arguments

process(dataset)

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.
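
The top-k behaviour can be sketched as a sort followed by a slice. This plain-Python illustration is not the library's implementation: `select_topk` is a hypothetical helper with a flat key, and it assumes that setting neither limit keeps every sample.

```python
def select_topk(samples, field_key, top_ratio=None, topk=None, reverse=True):
    """Sort by the field value and keep the first few samples."""
    if top_ratio is None and topk is None:
        return list(samples)  # assumption: no limit means nothing is dropped
    limits = []
    if top_ratio is not None:
        limits.append(int(top_ratio * len(samples)))
    if topk is not None:
        limits.append(topk)
    # reverse=True sorts in descending order, so the largest values come first.
    ranked = sorted(samples, key=lambda s: s[field_key], reverse=reverse)
    # When both limits are given, the smaller one wins.
    return ranked[:min(limits)]


samples = [{"score": v} for v in [0.1, 0.9, 0.5, 0.7]]
best_two = select_topk(samples, "score", topk=2)
best_one = select_topk(samples, "score", top_ratio=0.25, topk=2)
worst_one = select_topk(samples, "score", topk=1, reverse=False)
```

`best_two` holds scores 0.9 and 0.7, `best_one` holds only 0.9 because a ratio of 0.25 over four samples is the smaller limit, and `worst_one` holds 0.1. Going by the signature documented above, the library equivalent would presumably be to construct `TopkSpecifiedFieldSelector(field_key=..., topk=...)` and call its `process(dataset)` method.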

Module contents

class data_juicer.ops.selector.FrequencySpecifiedFieldSelector(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)

Bases: Selector

Selector that selects samples based on the sorted frequency of the specified field's values.

__init__(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)

Initialization method.

Parameters:

field_key -- The key of the field whose value the selection is based on. For a nested (multi-level) field, separate the levels of the key with '.'.
top_ratio -- Ratio of the top field values to select; a sample is selected if its field value falls within this ratio. When both topk and top_ratio are set, whichever corresponds to the smaller number of samples is applied.
topk -- Number of the top field values to select; a sample is selected if its field value falls within this number. When both topk and top_ratio are set, whichever corresponds to the smaller number of samples is applied.
reverse -- The sorting rule. If reverse=True, sort in descending order.
args -- extra positional arguments
kwargs -- extra keyword arguments

process(dataset)

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.

class data_juicer.ops.selector.RandomSelector(select_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, select_num: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)

Bases: Selector

Selector that randomly selects samples.

__init__(select_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, select_num: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)

Initialization method.

Parameters:

select_ratio -- The ratio of samples to select. When both select_ratio and select_num are set, whichever corresponds to the smaller number of samples is applied.
select_num -- The number of samples to select. When both select_ratio and select_num are set, whichever corresponds to the smaller number of samples is applied.
args -- extra positional arguments
kwargs -- extra keyword arguments

process(dataset)

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.

class data_juicer.ops.selector.RangeSpecifiedFieldSelector(field_key: str = '', lower_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, upper_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, lower_rank: Annotated[int, Gt(gt=0)] | None = None, upper_rank: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)

Bases: Selector

Selector that selects a range of samples based on the specified field value, sorted from smallest to largest.

__init__(field_key: str = '', lower_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, upper_percentile: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, lower_rank: Annotated[int, Gt(gt=0)] | None = None, upper_rank: Annotated[int, Gt(gt=0)] | None = None, *args, **kwargs)

Initialization method.

Parameters:

field_key -- The key of the field whose value the selection is based on. For a nested (multi-level) field, separate the levels of the key with '.'.
lower_percentile -- The lower percentile bound of the range to sample; samples are selected if their field values are greater than this lower bound. When both lower_percentile and lower_rank are set, whichever corresponds to the larger number of samples is applied.
upper_percentile -- The upper percentile bound of the range to sample; samples are selected if their field values are less than or equal to this upper bound. When both upper_percentile and upper_rank are set, whichever corresponds to the smaller number of samples is applied.
lower_rank -- The lower rank bound of the range to sample; samples are selected if their field values are greater than this lower bound. When both lower_percentile and lower_rank are set, whichever corresponds to the larger number of samples is applied.
upper_rank -- The upper rank bound of the range to sample; samples are selected if their field values are less than or equal to this upper bound. When both upper_percentile and upper_rank are set, whichever corresponds to the smaller number of samples is applied.
args -- extra positional arguments
kwargs -- extra keyword arguments

process(dataset)

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.

class data_juicer.ops.selector.TagsSpecifiedFieldSelector(field_key: str = '', target_tags: List[str] | None = None, *args, **kwargs)

Bases: Selector

Selector that selects samples based on the tags of the specified field.

__init__(field_key: str = '', target_tags: List[str] | None = None, *args, **kwargs)

Initialization method.

Parameters:

field_key -- The key of the field whose value the selection is based on. For a nested (multi-level) field, separate the levels of the key with '.'.
target_tags -- The target tags to select.
args -- extra positional arguments
kwargs -- extra keyword arguments

process(dataset)

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.

class data_juicer.ops.selector.TopkSpecifiedFieldSelector(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)

Bases: Selector

Selector that selects the top samples based on the sorted specified field value.

__init__(field_key: str = '', top_ratio: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] | None = None, topk: Annotated[int, Gt(gt=0)] | None = None, reverse: bool = True, *args, **kwargs)

Initialization method.

Parameters:

field_key -- The key of the field whose value the selection is based on. For a nested (multi-level) field, separate the levels of the key with '.'.
top_ratio -- Ratio of the top samples to select; a sample is selected if its field value falls within this ratio. When both topk and top_ratio are set, whichever corresponds to the smaller number of samples is applied.
topk -- Number of the top samples to select; a sample is selected if its field value falls within this number. When both topk and top_ratio are set, whichever corresponds to the smaller number of samples is applied.
reverse -- The sorting rule. If reverse=True, sort in descending order.
args -- extra positional arguments
kwargs -- extra keyword arguments

process(dataset)

Dataset --> dataset.

Parameters: dataset -- input dataset
Returns: selected dataset.