data_juicer.ops.filter.suffix_filter module

class data_juicer.ops.filter.suffix_filter.SuffixFilter(suffixes: str | List[str] = [], *args, **kwargs)[source]

Bases: Filter

Filter to keep samples with specified suffix.

This operator keeps a sample when its suffix matches any of the provided suffixes. If no suffixes are specified, all samples are kept. The 'suffix' field of each sample is checked against the list of allowed suffixes, and the result determines the key metric 'keep': a sample whose suffix matches is kept; otherwise it is filtered out.
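The decision described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `keep_sample`, the key name `"suffix"`, and the use of exact string comparison (no normalization of a leading dot) are assumptions made for the example.

```python
from typing import Any, Dict, List

# Hypothetical key name; the library may store the suffix under a different key.
SUFFIX_KEY = "suffix"


def keep_sample(sample: Dict[str, Any], allowed: List[str]) -> bool:
    """Return True if the sample should be kept."""
    if not allowed:
        # No suffixes configured: every sample is kept.
        return True
    # Assumption: exact string membership, so ".txt" and "txt" are different.
    return sample.get(SUFFIX_KEY) in allowed


samples = [
    {"text": "a", "suffix": ".txt"},
    {"text": "b", "suffix": ".pdf"},
    {"text": "c", "suffix": ".md"},
]
kept = [s["text"] for s in samples if keep_sample(s, [".txt", ".pdf"])]
```

Under these assumptions, only the samples whose suffix appears in the allowed list survive, and an empty allowed list disables filtering entirely.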

__init__(suffixes: str | List[str] = [], *args, **kwargs)[source]

Initialization method.

Parameters:
  • suffixes -- the suffix or suffixes of the text samples to keep. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']

  • args -- extra positional args

  • kwargs -- extra keyword args
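Because `suffixes` accepts either a single string or a list, the constructor presumably has to bring both forms into one shape. The sketch below shows one plausible way to do that; the helper name `normalize_suffixes` is hypothetical, and treating `None` as "no suffixes" is an assumption rather than documented behavior.

```python
from typing import List, Optional, Union


def normalize_suffixes(suffixes: Optional[Union[str, List[str]]]) -> List[str]:
    """Turn a str, a list of str, or None into a list of suffixes."""
    if suffixes is None:
        # Assumption: None is treated the same as an empty list.
        return []
    if isinstance(suffixes, str):
        # A single suffix such as ".txt" becomes a one-element list.
        return [suffixes]
    # Copy the list so later changes by the caller do not leak in.
    return list(suffixes)


single = normalize_suffixes(".txt")
several = normalize_suffixes(["txt", ".pdf", "docx"])
empty = normalize_suffixes(None)
```

Copying the incoming list also sidesteps the usual pitfall of a shared mutable default such as `[]` in the signature.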

compute_stats_single(sample)[source]

Compute stats for the sample which is used as a metric to decide whether to filter this sample.

Parameters:
  • sample -- input sample.

  • context -- whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample)[source]

Sample-level decision: maps a sample to a Boolean.

Parameters:
  • sample -- the sample to decide whether to filter

Returns:

True to keep the sample, False to filter it out
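The two methods form a two-stage flow: stats are computed for a sample first, then the Boolean decision is made on the result. The sketch below illustrates that flow with a stand-in class. `SuffixFilterSketch` and `run_filter` are hypothetical names, the `"suffix"` key is assumed, and the stand-in's `compute_stats_single` records nothing, which may differ from the real operator.

```python
from typing import Any, Dict, List


class SuffixFilterSketch:
    """Stand-in with the same two method names; not the library class."""

    def __init__(self, suffixes: List[str]) -> None:
        self.suffixes = suffixes

    def compute_stats_single(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        # Assumption: the suffix is already in the sample, so nothing is added.
        return sample

    def process_single(self, sample: Dict[str, Any]) -> bool:
        # Keep everything when no suffixes are set, else require a match.
        return not self.suffixes or sample.get("suffix") in self.suffixes


def run_filter(op: SuffixFilterSketch, samples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Apply the stats stage, then the keep/drop stage, to each sample."""
    kept = []
    for sample in samples:
        sample = op.compute_stats_single(sample)
        if op.process_single(sample):
            kept.append(sample)
    return kept


data = [{"text": "a", "suffix": ".txt"}, {"text": "b", "suffix": ".md"}]
result = run_filter(SuffixFilterSketch([".txt"]), data)
```

In practice the library's own pipeline drives these calls; the explicit loop here only makes the order of the two stages visible.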