data_juicer.ops.filter.video_duration_filter module

class data_juicer.ops.filter.video_duration_filter.VideoDurationFilter(min_duration: float = 0, max_duration: float = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Keep data samples whose videos’ durations are within a specified range.

This operator filters data samples based on the duration of their associated videos. It keeps samples whose video durations fall within a specified minimum and maximum range. The filtering strategy can be set to ‘any’ or ‘all’:
  • ‘any’: Keep the sample if any of its videos meets the duration criteria.
  • ‘all’: Keep the sample only if all of its videos meet the duration criteria.

The video durations are computed and stored in the ‘video_duration’ field of the sample’s stats. If no videos are present, an empty array is stored.
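The ‘any’ / ‘all’ decision can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the helper name keep_sample is hypothetical, the range bounds are assumed to be inclusive, and keeping a sample that has no videos is likewise an assumption, since the text above only says that an empty array is stored in that case.

```python
import sys
from typing import Sequence


def keep_sample(
    durations: Sequence[float],
    min_duration: float = 0,
    max_duration: float = sys.maxsize,
    any_or_all: str = "any",
) -> bool:
    """Hypothetical helper mirroring the documented 'any'/'all' strategy."""
    if any_or_all not in ("any", "all"):
        raise ValueError(f"any_or_all must be 'any' or 'all', got {any_or_all!r}")

    # One flag per video: is its duration inside [min_duration, max_duration]?
    # Inclusive bounds are an assumption of this sketch.
    in_range = [min_duration <= d <= max_duration for d in durations]

    # Assumption: a sample without videos has nothing to reject, so keep it.
    if not in_range:
        return True

    return any(in_range) if any_or_all == "any" else all(in_range)


# A sample with a 5 s clip and a 50 s clip, filtered to the range 10-60 s.
print(keep_sample([5.0, 50.0], 10, 60, "any"))  # one clip is in range
print(keep_sample([5.0, 50.0], 10, 60, "all"))  # the 5 s clip is out of range
```

With ‘any’, a single in-range video is enough to keep the sample; with ‘all’, one out-of-range video is enough to drop it.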

__init__(min_duration: float = 0, max_duration: float = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_duration – Minimum video duration, in seconds, for a sample to be kept. Defaults to 0.

  • max_duration – Maximum video duration, in seconds, for a sample to be kept. Defaults to sys.maxsize (9223372036854775807 on 64-bit platforms).

  • any_or_all – Strategy for combining the per-video results within a sample. ‘any’: keep the sample if any of its videos meets the condition. ‘all’: keep the sample only if all of its videos meet the condition.

  • args – extra positional args

  • kwargs – extra keyword args

compute_stats_single(sample, context=False)[source]

Compute the stats for the sample that serve as the metric for deciding whether to filter it.

Parameters:
  • sample – input sample.

  • context – whether to temporarily store intermediate variables as context information in the sample.

Returns:

sample with computed stats
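
As a rough illustration of the stats step, the sketch below stores one duration per video under the ‘video_duration’ key and an empty list when the sample has no videos. Everything else is an assumption made for the example: the function name compute_stats_sketch, the ‘videos’ and ‘stats’ field names, and the probe callable that stands in for real video decoding.

```python
from typing import Callable, Dict


def compute_stats_sketch(sample: Dict, probe: Callable[[str], float]) -> Dict:
    """Hypothetical stand-in for the stats step described above.

    `probe` maps a video path to its duration in seconds; a real
    operator would read this from the video file instead.
    """
    # 'stats' and 'videos' are illustrative field names, not the library's.
    stats = sample.setdefault("stats", {})
    videos = sample.get("videos", [])

    # One duration per video; an empty list if the sample has no videos.
    stats["video_duration"] = [probe(path) for path in videos]
    return sample


# Fake durations so the example runs without any video files.
fake_durations = {"a.mp4": 3.5, "b.mp4": 12.0}

with_videos = compute_stats_sketch({"videos": ["a.mp4", "b.mp4"]}, fake_durations.__getitem__)
without_videos = compute_stats_sketch({"videos": []}, fake_durations.__getitem__)

print(with_videos["stats"]["video_duration"])
print(without_videos["stats"]["video_duration"])
```

Separating the stats computation from the keep-or-drop decision means the stored durations can be inspected or reused without decoding the videos again.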

process_single(sample)[source]

Sample-level decision: maps a sample to a Boolean.

Parameters:

sample – the sample to keep or filter out

Returns:

True to keep the sample, False to filter it out