data_juicer.ops.filter.text_action_filter module
- class data_juicer.ops.filter.text_action_filter.TextActionFilter(lang: str = 'en', min_action_num: int = 1, *args, **kwargs)
Bases: Filter
Filter to keep texts that contain a minimum number of actions.
This operator uses a spaCy model to detect actions in the text. It keeps a sample if the number of detected actions meets or exceeds the specified minimum. The supported languages are English (‘en’) and Chinese (‘zh’). The ‘num_action’ statistic is computed and cached for each sample. Actions are identified from each token's part-of-speech (POS) tag together with verb-specific tags.
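The counting rule described above can be sketched without loading a model. This is a minimal illustration, not the operator's actual code: `TaggedToken` is a hypothetical stand-in for a tagged spaCy token, and the `ACTION_POS` and `ACTION_TAGS` sets are assumed values chosen for the example.

```python
from typing import Iterable, NamedTuple


class TaggedToken(NamedTuple):
    """Hypothetical stand-in for a token that a tagger has already labeled."""
    text: str
    pos: str  # coarse part-of-speech label, e.g. "VERB"
    tag: str  # fine-grained tag, e.g. "VBD"


# Assumed label sets for this sketch; the real operator may use different ones.
ACTION_POS = {"VERB"}
ACTION_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "VV"}


def count_actions(tokens: Iterable[TaggedToken]) -> int:
    """Count tokens whose coarse POS and fine-grained tag both mark a verb."""
    return sum(1 for t in tokens if t.pos in ACTION_POS and t.tag in ACTION_TAGS)


tokens = [
    TaggedToken("She", "PRON", "PRP"),
    TaggedToken("opened", "VERB", "VBD"),
    TaggedToken("the", "DET", "DT"),
    TaggedToken("door", "NOUN", "NN"),
    TaggedToken("and", "CCONJ", "CC"),
    TaggedToken("left", "VERB", "VBD"),
    TaggedToken("is", "AUX", "VBZ"),  # auxiliary: verb tag but not a VERB POS
]
num_action = count_actions(tokens)
```

Requiring both labels to match means that an auxiliary such as "is" is not counted as an action in this sketch, even though it carries a verb tag.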
- __init__(lang: str = 'en', min_action_num: int = 1, *args, **kwargs)
Initialization method.
- Parameters:
lang – Language of the text in the samples: ‘en’ to detect actions in English, ‘zh’ to detect actions in Chinese.
min_action_num – Minimum number of actions required. Samples are filtered out if the number of actions detected in their text is below this value.
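As a rough illustration of how `min_action_num` acts as a threshold, the sketch below applies the "meets or exceeds" rule to precomputed action counts. `keep_sample` and the sample names are hypothetical helpers for this example, not part of the library.

```python
def keep_sample(num_action: int, min_action_num: int = 1) -> bool:
    """Keep a sample when its action count meets or exceeds the minimum."""
    return num_action >= min_action_num


# Hypothetical action counts, keyed by made-up sample names.
counts = {"a": 0, "b": 1, "c": 3}

kept_default = [name for name, n in counts.items() if keep_sample(n)]
kept_strict = [name for name, n in counts.items() if keep_sample(n, min_action_num=2)]
```

With the default of 1, only samples with no detected action are dropped; raising the threshold narrows the kept set further.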
- compute_stats_single(sample, context=False)
Compute the stats for the sample, which are used as the metric to decide whether to filter it.
- Parameters:
sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.
- Returns:
sample with computed stats
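The compute-and-cache behavior can be sketched as follows. This is an assumption-laden illustration rather than the library's implementation: the `"stats"` and `"text"` keys, the extra `count_fn` argument, and the `toy_count` helper are all hypothetical, and only the `num_action` statistic name comes from the documentation above.

```python
from typing import Any, Callable, Dict

STATS_KEY = "stats"            # hypothetical location of per-sample stats
NUM_ACTION_KEY = "num_action"  # statistic name from the class description


def compute_stats_single(
    sample: Dict[str, Any], count_fn: Callable[[str], int]
) -> Dict[str, Any]:
    """Attach the action count to the sample, reusing a cached value if present."""
    stats = sample.setdefault(STATS_KEY, {})
    if NUM_ACTION_KEY not in stats:
        # Only run the (potentially expensive) counter when nothing is cached.
        stats[NUM_ACTION_KEY] = count_fn(sample["text"])
    return sample


TOY_VERBS = {"run", "jump", "open"}


def toy_count(text: str) -> int:
    """Toy stand-in for model-based detection: counts words from a fixed verb list."""
    return sum(1 for word in text.lower().split() if word in TOY_VERBS)


sample = compute_stats_single({"text": "Run to the door and open it"}, toy_count)
```

Calling the function a second time on the same sample leaves the stored value untouched, which is the point of caching the statistic.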