data_juicer.ops.filter.specified_field_filter module

class data_juicer.ops.filter.specified_field_filter.SpecifiedFieldFilter(field_key: str = '', target_value: List = [], *args, **kwargs)

Bases: Filter

Filter samples based on the specified field information.

This operator checks whether the value of a specified field in each sample is among a given list of target values. The field can be a multi-level key, with levels separated by dots. The target value is a list of acceptable values for the field. If the field value is not a list or tuple, it is wrapped in a list for comparison. A sample is retained only if every value in the field appears in the target list; otherwise it is filtered out.

Uses the ‘field_key’ and ‘target_value’ parameters.
Supports multi-level field keys, e.g., ‘level1.level2’.
Converts non-list/tuple field values to a list for comparison.
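The behavior described above can be sketched in plain Python. This is a minimal standalone illustration, not the operator's actual implementation: the helper names `resolve_field` and `keep_sample` and the sample data are hypothetical, and the sketch does not model how the operator treats an empty `field_key` or `target_value`.

```python
from typing import Any, List, Mapping


def resolve_field(sample: Mapping[str, Any], field_key: str) -> Any:
    """Walk a dot-separated key such as 'meta.lang' through nested dicts.

    Hypothetical helper for illustration only.
    """
    value: Any = sample
    for level in field_key.split("."):
        value = value[level]  # raises KeyError if a level is missing
    return value


def keep_sample(sample: Mapping[str, Any], field_key: str,
                target_value: List[Any]) -> bool:
    """Return True if every value of the field is in target_value."""
    value = resolve_field(sample, field_key)
    # A scalar field value is wrapped in a list so that scalars and
    # sequences go through the same membership check.
    if not isinstance(value, (list, tuple)):
        value = [value]
    return all(v in target_value for v in value)


# Made-up samples with a two-level field key 'meta.lang'.
samples = [
    {"text": "a", "meta": {"lang": "en"}},
    {"text": "b", "meta": {"lang": ["en", "zh"]}},
    {"text": "c", "meta": {"lang": ["en", "fr"]}},
]

kept = [s["text"] for s in samples
        if keep_sample(s, "meta.lang", ["en", "zh"])]
print(kept)  # ['a', 'b']
```

Sample "c" is dropped because one of its values, "fr", is not in the target list, even though "en" is.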

__init__(field_key: str = '', target_value: List = [], *args, **kwargs)

Initialization method.

Parameters:

field_key – The key of the field whose value is used for filtering. For multi-level fields, the levels of the key need to be separated by ‘.’.
target_value – The list of accepted values for the specified field. Samples whose field values all fall within this list are retained.
args – extra args
kwargs – extra keyword args

compute_stats_single(sample)

Compute stats for the sample, which are used as metrics to decide whether to filter this sample.

Parameters:

sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)

Sample-level decision: maps a single sample to a Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

True to keep the sample, False to filter it out