data_juicer.ops.filter.suffix_filter module

class data_juicer.ops.filter.suffix_filter.SuffixFilter(suffixes: str | List[str] = [], *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with the specified suffix(es).

__init__(suffixes: str | List[str] = [], *args, **kwargs)[source]

Initialization method.

Parameters:
  • suffixes -- the suffix or suffixes of the text to keep. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']

  • args -- extra positional arguments

  • kwargs -- extra keyword arguments
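
Since suffixes may be either a single string or a list of strings, a constructor has to bring both forms into one shape before comparing. The sketch below illustrates one way to do that; normalize_suffixes is a hypothetical helper written for this example and is not part of data_juicer, and treating None as "no suffixes" is an assumption.

```python
from typing import List, Optional, Union


def normalize_suffixes(suffixes: Optional[Union[str, List[str]]]) -> List[str]:
    """Hypothetical helper: coerce a `suffixes` argument into a list of strings."""
    if suffixes is None:
        # Assumption: a missing value means "no suffixes specified".
        return []
    if isinstance(suffixes, str):
        # A single suffix such as '.txt' becomes a one-element list.
        return [suffixes]
    # Copy the input so the caller's list is never mutated.
    return list(suffixes)


single = normalize_suffixes(".txt")
several = normalize_suffixes(["txt", ".pdf", "docx"])
print(single)
print(several)
```

Copying the list also sidesteps the usual pitfall of a mutable default argument such as `[]` being shared between calls.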

compute_stats_single(sample)[source]

Compute stats for the sample, which are used as metrics to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to temporarily store context information about intermediate variables in the sample.

Returns:

sample with computed stats

process_single(sample)[source]

For sample level, sample --> Boolean.

Parameters:

sample -- the sample to evaluate

Returns:

True to keep the sample, False to filter it out
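
The sample --> Boolean decision can be illustrated with a small standalone sketch. It does not import data_juicer and is not the library's implementation: the "suffix" key, the keep_sample function, and the rule that an empty suffix list keeps every sample are all assumptions made for this example.

```python
from typing import Dict, List

# Assumption: each sample is a dict that stores its file suffix under this key.
SUFFIX_KEY = "suffix"


def keep_sample(sample: Dict[str, str], suffixes: List[str]) -> bool:
    """Hypothetical decision rule: True keeps the sample, False filters it out."""
    if not suffixes:
        # Assumption: with no suffixes configured, nothing is filtered.
        return True
    # Exact string match, so '.txt' and 'txt' are treated as different values here.
    return sample.get(SUFFIX_KEY) in suffixes


samples = [
    {"text": "first document", "suffix": ".txt"},
    {"text": "second document", "suffix": ".pdf"},
    {"text": "third document", "suffix": ".jpg"},
]

kept = [s for s in samples if keep_sample(s, [".txt", ".pdf"])]
print([s["suffix"] for s in kept])
```

Because the comparison in this sketch is an exact match, the configured suffixes need to use the same form (with or without the leading dot) as the values stored in the samples.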