data_juicer.ops.filter.image_shape_filter module

class data_juicer.ops.filter.image_shape_filter.ImageShapeFilter(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Filter

Filter to keep samples whose image shape (width, height) falls within specified ranges.

This operator filters samples based on the width and height of images. It keeps samples where the image dimensions fall within the specified ranges. The operator supports two strategies: ‘any’ and ‘all’. In ‘any’ mode, a sample is kept if at least one image meets the criteria. In ‘all’ mode, all images in the sample must meet the criteria for the sample to be kept. The image width and height are stored in the ‘image_width’ and ‘image_height’ fields of the sample’s stats. If no images are present in the sample, the corresponding stats fields will be empty arrays.
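The 'any' and 'all' strategies described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual code: `keep_sample` is a hypothetical helper, the bounds are assumed to be inclusive, and keeping a sample that has no images is likewise an assumption.

```python
from typing import Sequence


def keep_sample(
    widths: Sequence[int],
    heights: Sequence[int],
    min_width: int = 1,
    max_width: int = 2**63 - 1,
    min_height: int = 1,
    max_height: int = 2**63 - 1,
    any_or_all: str = "any",
) -> bool:
    """Hypothetical helper mirroring the 'any' / 'all' decision."""
    if any_or_all not in ("any", "all"):
        raise ValueError(f"any_or_all must be 'any' or 'all', got {any_or_all!r}")

    # Assumption: a sample without images is kept, as there is nothing to reject.
    if len(widths) == 0:
        return True

    # One boolean per image: both dimensions must lie within their ranges
    # (inclusive bounds are an assumption of this sketch).
    in_range = [
        min_width <= w <= max_width and min_height <= h <= max_height
        for w, h in zip(widths, heights)
    ]
    return any(in_range) if any_or_all == "any" else all(in_range)


# Two images: one 640x480, one 50x50, with a 100-pixel minimum on both sides.
print(keep_sample([640, 50], [480, 50], min_width=100, min_height=100))
print(keep_sample([640, 50], [480, 50], min_width=100, min_height=100, any_or_all="all"))
```

With these inputs the first image satisfies the bounds and the second does not, so the two strategies disagree about the sample.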

__init__(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_width – The minimum image width for a sample to be kept.

  • max_width – The maximum image width for a sample to be kept.

  • min_height – The minimum image height for a sample to be kept.

  • max_height – The maximum image height for a sample to be kept.

  • any_or_all – The strategy applied across all images in a sample, either ‘any’ or ‘all’. ‘any’: keep the sample if at least one image meets the condition. ‘all’: keep the sample only if every image meets the condition.

  • args – extra positional arguments

  • kwargs – extra keyword arguments
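The default upper bound of 9223372036854775807 in the signature is the largest signed 64-bit integer, so leaving max_width or max_height unset imposes no practical upper limit. The short check below illustrates this; `is_unbounded` is a hypothetical helper, and the comparison with `sys.maxsize` is expected to hold only on typical 64-bit CPython builds.

```python
import sys

# The documented default for max_width and max_height.
DOCUMENTED_DEFAULT_MAX = 9223372036854775807


def is_unbounded(max_value: int) -> bool:
    """Hypothetical check: True if max_value is at least the documented default."""
    return max_value >= DOCUMENTED_DEFAULT_MAX


# The default equals the largest signed 64-bit integer.
print(DOCUMENTED_DEFAULT_MAX == 2**63 - 1)

# On typical 64-bit CPython builds this also matches sys.maxsize.
print(DOCUMENTED_DEFAULT_MAX == sys.maxsize)
```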

compute_stats_single(sample, context=False)[source]

Compute the stats for the sample; these stats serve as the metrics used to decide whether to filter the sample.

Parameters:
  • sample – input sample.

  • context – whether to temporarily store intermediate variables as context information in the sample.

Returns:

sample with computed stats
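As a rough sketch of what computing these stats involves, the example below fills the ‘image_width’ and ‘image_height’ stats fields from a list of image keys. It is not the operator's implementation: `compute_shape_stats`, the `get_size` callable, and the plain "images" and "stats" dictionary keys are hypothetical stand-ins, and skipping samples whose stats already exist is an assumption.

```python
from typing import Callable, Dict, Tuple


def compute_shape_stats(
    sample: Dict,
    get_size: Callable[[str], Tuple[int, int]],
) -> Dict:
    """Hypothetical stand-in for a stats step that records image shapes."""
    stats = sample.setdefault("stats", {})

    # Assumption: stats that are already present are not recomputed.
    if "image_width" in stats and "image_height" in stats:
        return sample

    # get_size is a hypothetical callable returning (width, height) per image.
    sizes = [get_size(key) for key in sample.get("images", [])]

    # A sample without images ends up with two empty lists.
    stats["image_width"] = [w for w, _ in sizes]
    stats["image_height"] = [h for _, h in sizes]
    return sample


# Fake size lookup so the sketch runs without any image files.
fake_sizes = {"a.jpg": (640, 480), "b.jpg": (50, 50)}
sample = compute_shape_stats({"images": ["a.jpg", "b.jpg"]}, fake_sizes.__getitem__)
print(sample["stats"])
```

Passing the size lookup in as a callable keeps the sketch independent of any particular image library.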

process_single(sample)[source]

Sample-level decision: maps a sample to a Boolean.

Parameters:

sample – the sample to evaluate

Returns:

True to keep the sample, False to filter it out