data_juicer.ops.filter.general_field_filter module

class data_juicer.ops.filter.general_field_filter.GeneralFieldFilter(filter_condition: str = '', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples based on a general field filter condition.

The filter condition is a string that can include logical operators (and/or) and chained comparisons. For example: "10 < num <= 30 and text != 'nothing here' and __dj__meta__.a == 3". The condition is evaluated for each sample, and only samples that meet the condition are kept. The result of the filter condition is stored in the sample's stats under the key 'general_field_filter_condition'. If the filter condition is empty, or its result has already been computed for a sample, the sample is not re-evaluated.
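To make the semantics of the example condition concrete, it can be read as roughly the following plain Python predicate. This is an illustrative sketch only, not the operator's implementation: the function name `keep` is hypothetical, and it assumes that a dotted name such as `__dj__meta__.a` refers to key `a` inside the sample's `__dj__meta__` dict.

```python
from typing import Any, Dict


def keep(sample: Dict[str, Any]) -> bool:
    """Hand-written equivalent of the example condition (illustrative only)."""
    return (
        10 < sample["num"] <= 30                # chained comparison
        and sample["text"] != "nothing here"    # string inequality
        and sample["__dj__meta__"]["a"] == 3    # assumed meaning of __dj__meta__.a
    )


kept = keep({"num": 20, "text": "hello", "__dj__meta__": {"a": 3}})
dropped = keep({"num": 40, "text": "hello", "__dj__meta__": {"a": 3}})
```

Here `kept` is true because all three clauses hold, while `dropped` is false because `num` falls outside the range.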

__init__(filter_condition: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:

filter_condition – The filter condition as a string. It can include logical operators (and/or) and chained comparisons. For example: "10 < num <= 30 and text != 'nothing here' and __dj__meta__.a == 3".

compute_stats_single(sample, context=False)[source]

Compute the stats for the sample, which are used as the metric to decide whether to filter it.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample: Dict) → bool[source]

Sample-level decision: maps a sample to a Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

True to keep the sample, False to filter it out.
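The two-step flow described above (compute and cache the result in the stats, then decide from the cached value) can be sketched as follows. This is a simplified stand-alone sketch, not the class's code: the stats field name `__dj__stats__`, the free-function signatures, the `evaluate` callback, and the choice to keep samples that have no stored result are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict

STATS_FIELD = "__dj__stats__"  # assumed name of the per-sample stats field
STATS_KEY = "general_field_filter_condition"  # key named in the description


def compute_stats_single(
    sample: Dict[str, Any],
    condition: str,
    evaluate: Callable[[Dict[str, Any]], bool],  # hypothetical evaluator
) -> Dict[str, Any]:
    """Store the condition result in the sample's stats, at most once."""
    stats = sample.setdefault(STATS_FIELD, {})
    # Skip evaluation when the condition is empty or already computed.
    if not condition or STATS_KEY in stats:
        return sample
    stats[STATS_KEY] = bool(evaluate(sample))
    return sample


def process_single(sample: Dict[str, Any]) -> bool:
    """Return True to keep the sample, False to filter it out."""
    # Assumption: a sample without a stored result is kept.
    return sample.get(STATS_FIELD, {}).get(STATS_KEY, True)


sample = compute_stats_single({"num": 20}, "num > 10", lambda s: s["num"] > 10)
decision = process_single(sample)
```

Separating the two steps means the comparatively expensive evaluation runs once per sample, and the keep-or-drop decision is a cheap lookup.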

class data_juicer.ops.filter.general_field_filter.ExpressionTransformer(sample: Dict)[source]

Bases: NodeVisitor

__init__(sample: Dict)[source]

visit_BoolOp(node: BoolOp) → bool[source]

visit_Compare(node: Compare) → bool[source]

visit_Name(node: Name) → Any[source]

visit_Attribute(node: Attribute) → Any[source]

visit_Constant(node: Constant) → Any[source]

generic_visit(node: AST) → None[source]

Called if no explicit visitor function exists for a node.

transform(ast_tree: Expression) → bool[source]
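The method names listed above follow the standard `ast.NodeVisitor` pattern from the Python standard library. The following is a minimal sketch of how such a visitor could evaluate a condition against a sample; it is not the library's implementation. The class name `SketchTransformer`, the helper `evaluate`, the set of supported comparison operators, and the decision to raise `ValueError` on unsupported nodes are assumptions made for illustration.

```python
import ast
import operator
from typing import Any, Dict

# Comparison operators supported by this sketch (an assumed subset).
_COMPARE_OPS = {
    ast.Eq: operator.eq,
    ast.NotEq: operator.ne,
    ast.Lt: operator.lt,
    ast.LtE: operator.le,
    ast.Gt: operator.gt,
    ast.GtE: operator.ge,
}


class SketchTransformer(ast.NodeVisitor):
    """Evaluate a restricted boolean expression against a sample dict."""

    def __init__(self, sample: Dict[str, Any]):
        self.sample = sample

    def visit_BoolOp(self, node: ast.BoolOp) -> bool:
        # The generator is consumed lazily, so and/or short-circuit.
        values = (self.visit(value) for value in node.values)
        return all(values) if isinstance(node.op, ast.And) else any(values)

    def visit_Compare(self, node: ast.Compare) -> bool:
        # Chained comparison: a < b <= c means (a < b) and (b <= c).
        left = self.visit(node.left)
        for op, comparator in zip(node.ops, node.comparators):
            func = _COMPARE_OPS.get(type(op))
            if func is None:
                raise ValueError(f"Unsupported operator: {type(op).__name__}")
            right = self.visit(comparator)
            if not func(left, right):
                return False
            left = right
        return True

    def visit_Name(self, node: ast.Name) -> Any:
        # A bare name is looked up as a top-level field of the sample.
        return self.sample[node.id]

    def visit_Attribute(self, node: ast.Attribute) -> Any:
        # Assumption: `a.b` means key `b` of the dict found at `a`.
        return self.visit(node.value)[node.attr]

    def visit_Constant(self, node: ast.Constant) -> Any:
        return node.value

    def generic_visit(self, node: ast.AST) -> None:
        # Reject anything outside the small supported grammar.
        raise ValueError(f"Unsupported node: {type(node).__name__}")

    def transform(self, ast_tree: ast.Expression) -> bool:
        return bool(self.visit(ast_tree.body))


def evaluate(condition: str, sample: Dict[str, Any]) -> bool:
    """Parse the condition in eval mode and evaluate it on the sample."""
    tree = ast.parse(condition, mode="eval")
    return SketchTransformer(sample).transform(tree)


condition = "10 < num <= 30 and text != 'nothing here' and __dj__meta__.a == 3"
result = evaluate(condition, {"num": 20, "text": "hi", "__dj__meta__": {"a": 3}})
```

Walking the syntax tree with an explicit whitelist of node types, rather than calling `eval`, keeps arbitrary code in a condition string from being executed: any node without a dedicated visitor falls through to `generic_visit` and is rejected.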