data_juicer.ops.filter.general_field_filter module

class data_juicer.ops.filter.general_field_filter.GeneralFieldFilter(filter_condition: str = '', *args, **kwargs)[source]

Bases: Filter

Filter that keeps samples satisfying a general field filter condition. The condition is a string that can combine logical operators (and/or) with chained comparisons.

__init__(filter_condition: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:

filter_condition – The filter condition as a string. It can include logical operators (and/or) and chained comparisons. For example: "10 < num <= 30 and text != 'nothing here' and __dj__meta__.a == 3".
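To make the shape of such a condition concrete, the sketch below parses the example string with Python's standard `ast` module and inspects the result. It only illustrates how Python itself reads the expression; it does not use Data-Juicer and makes no claim about how `GeneralFieldFilter` handles the string internally.

```python
import ast

# The example condition from the parameter description above.
condition = "10 < num <= 30 and text != 'nothing here' and __dj__meta__.a == 3"

# mode="eval" parses a single expression and returns an ast.Expression.
tree = ast.parse(condition, mode="eval")
root = tree.body

# "a and b and c" becomes one BoolOp holding three Compare nodes.
clause_types = [type(value).__name__ for value in root.values]

# The chained comparison "10 < num <= 30" is a single Compare node
# with two operators and two comparators.
chained = root.values[0]
chained_ops = [type(op).__name__ for op in chained.ops]

# "__dj__meta__.a" is an Attribute access on the name "__dj__meta__".
attr_node = root.values[2].left

print(type(root).__name__, clause_types)
print(chained_ops)
print(type(attr_node).__name__, attr_node.value.id, attr_node.attr)
```

The node types that appear here (`BoolOp`, `Compare`, `Name`, `Attribute`, `Constant`) match the `visit_*` methods listed for `ExpressionTransformer` in this module.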

compute_stats_single(sample, context=False)[source]

Compute the stats for the sample, which serve as the metric for deciding whether to filter it.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample: Dict) → bool[source]

Sample-level decision: sample → Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

True to keep the sample, False to filter it out

class data_juicer.ops.filter.general_field_filter.ExpressionTransformer(sample: Dict)[source]

Bases: NodeVisitor

__init__(sample: Dict)[source]
visit_BoolOp(node: BoolOp) → bool[source]
visit_Compare(node: Compare) → bool[source]
visit_Name(node: Name) → Any[source]
visit_Attribute(node: Attribute) → Any[source]
visit_Constant(node: Constant) → Any[source]
generic_visit(node: AST) → None[source]

Called if no explicit visitor function exists for a node.

transform(ast_tree: Expression) → bool[source]
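The methods above are undocumented, so the following is only a minimal sketch of how an `ast.NodeVisitor` with these hooks could evaluate a condition against a sample. The class `SketchEvaluator` is hypothetical and is not the library's implementation. Resolving attribute access such as `__dj__meta__.a` through nested dictionaries, and raising `ValueError` on unsupported nodes, are both assumptions made for illustration.

```python
import ast
import operator
from typing import Any, Dict

# Comparison operators supported by this sketch (an assumed subset).
_OPS = {
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}


class SketchEvaluator(ast.NodeVisitor):
    """Hypothetical evaluator; NOT the real ExpressionTransformer."""

    def __init__(self, sample: Dict):
        self.sample = sample

    def visit_Expression(self, node: ast.Expression) -> Any:
        return self.visit(node.body)

    def visit_BoolOp(self, node: ast.BoolOp) -> bool:
        results = (self.visit(value) for value in node.values)
        # all()/any() stop at the first deciding clause, like and/or.
        return all(results) if isinstance(node.op, ast.And) else any(results)

    def visit_Compare(self, node: ast.Compare) -> bool:
        # "a < b <= c" is evaluated pairwise: (a < b) and (b <= c).
        left = self.visit(node.left)
        for op, comparator in zip(node.ops, node.comparators):
            right = self.visit(comparator)
            if not _OPS[type(op)](left, right):
                return False
            left = right
        return True

    def visit_Name(self, node: ast.Name) -> Any:
        # Assumption: a bare name is a top-level key of the sample.
        return self.sample[node.id]

    def visit_Attribute(self, node: ast.Attribute) -> Any:
        # Assumption: "x.a" means sample["x"]["a"] (nested dict lookup).
        return self.visit(node.value)[node.attr]

    def visit_Constant(self, node: ast.Constant) -> Any:
        return node.value

    def generic_visit(self, node: ast.AST) -> None:
        # Reject anything without an explicit visitor (calls, arithmetic, ...).
        raise ValueError(f"Unsupported node: {type(node).__name__}")


def evaluate(condition: str, sample: Dict) -> bool:
    tree = ast.parse(condition, mode="eval")
    return SketchEvaluator(sample).visit(tree)


condition = "10 < num <= 30 and text != 'nothing here' and __dj__meta__.a == 3"
kept = evaluate(condition, {"num": 20, "text": "hello", "__dj__meta__": {"a": 3}})
dropped = evaluate(condition, {"num": 40, "text": "hello", "__dj__meta__": {"a": 3}})
print(kept, dropped)
```

Walking the tree with explicit visitors, rather than calling `eval` on the string, restricts a condition to the node types that have a handler; everything else falls through to `generic_visit` and is rejected in this sketch.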