data_juicer.ops.filter.audio_duration_filter module

class data_juicer.ops.filter.audio_duration_filter.AudioDurationFilter(min_duration: int = 0, max_duration: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)

Bases: Filter

Keep data samples whose audio durations are within a specified range.

This operator filters data samples based on the duration of their audio files. It keeps samples where the audio duration is between a minimum and maximum value, in seconds. The operator supports two strategies for keeping samples: ‘any’ (keep if any audio meets the condition) or ‘all’ (keep only if all audios meet the condition). The audio duration is computed using the librosa library. If the audio duration has already been computed, it is retrieved from the sample’s stats under the key ‘audio_duration’. If no audio is present in the sample, an empty array is stored in the stats.

__init__(min_duration: int = 0, max_duration: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)

Initialization method.

Parameters:

min_duration – Minimum audio duration, in seconds, for a sample to be kept. Defaults to 0.
max_duration – Maximum audio duration, in seconds, for a sample to be kept. Defaults to sys.maxsize.
any_or_all – Strategy applied across all audios in a sample, either ‘any’ or ‘all’. ‘any’: keep the sample if any audio meets the condition. ‘all’: keep the sample only if all audios meet the condition.
args – extra positional args
kwargs – extra keyword args

compute_stats_single(sample, context=False)

Compute stats for the sample, which are used as the metric for deciding whether to filter it.

Parameters:

sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats
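
The caching behavior described for this operator can be illustrated without Data-Juicer itself. The following is a minimal standalone sketch, not the library's implementation: it swaps librosa for the standard-library wave module (so it only handles uncompressed WAV files), it stores a plain list rather than an array, and the sample layout (an "audios" list of file paths and a "stats" dict) and all helper names are hypothetical. Only the stats key ‘audio_duration’ comes from the description above.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Duration of an uncompressed WAV file (stand-in for librosa here)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def compute_audio_duration_stats(sample):
    """Fill sample['stats']['audio_duration'] unless it is already present."""
    stats = sample.setdefault("stats", {})
    if "audio_duration" in stats:
        # Already computed earlier: reuse the stored value.
        return sample
    # With no audios, this stores an empty list.
    stats["audio_duration"] = [
        wav_duration_seconds(path) for path in sample.get("audios", [])
    ]
    return sample


def write_silent_wav(path, seconds, rate=8000):
    """Hypothetical helper: write mono 16-bit silence for the demo."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


def demo_durations():
    """Create a 2-second WAV in a temp dir and compute its stats."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "clip.wav")
        write_silent_wav(path, 2)
        sample = compute_audio_duration_stats({"audios": [path]})
        return sample["stats"]["audio_duration"]
```

In this sketch a second call on the same sample returns immediately, because the ‘audio_duration’ entry is already in the stats.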

process_single(sample)

Sample-level decision: maps a sample to a Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

True to keep the sample, False to filter it out
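
The ‘any’ and ‘all’ strategies can be sketched as a small pure-Python function. This is an illustration under stated assumptions rather than the operator's actual code: the function name keep_sample is hypothetical, the range bounds are assumed to be inclusive, a sample with no audio is assumed to be kept, and an unknown strategy is assumed to raise ValueError. None of these three behaviors is confirmed by the documentation above.

```python
import sys


def keep_sample(durations, min_duration=0, max_duration=sys.maxsize,
                any_or_all="any"):
    """Return True to keep a sample given its audio durations in seconds."""
    if any_or_all not in ("any", "all"):
        # Assumption: invalid strategies are rejected.
        raise ValueError(f"any_or_all must be 'any' or 'all', got {any_or_all!r}")
    if not durations:
        # Assumption: samples without audio are kept.
        return True
    # Assumption: both bounds are inclusive.
    in_range = [min_duration <= d <= max_duration for d in durations]
    return any(in_range) if any_or_all == "any" else all(in_range)


# One clip in range, one too long: kept under 'any', dropped under 'all'.
print(keep_sample([5.0, 30.0], 2, 10, "any"))  # True
print(keep_sample([5.0, 30.0], 2, 10, "all"))  # False
```

The ‘any’ strategy is the more permissive choice for samples with several audios, since one in-range clip is enough to keep the whole sample.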