data_juicer.ops.deduplicator.ray_document_deduplicator module

class data_juicer.ops.deduplicator.ray_document_deduplicator.RayDocumentDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)

Bases: RayBasicDeduplicator

Deduplicates samples at the document level using exact matching in Ray distributed mode.

This operator computes an MD5 hash of each document's text and filters out samples whose hash exactly matches that of another document. The text can optionally be normalized before hashing: if lowercase is set, the text is converted to lowercase; if ignore_non_character is enabled, all non-alphabetic characters (whitespace, digits, and punctuation) are removed. The operator supports two backends, ‘ray_actor’ (the default) and ‘redis’.
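The normalization and hashing described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual implementation: the function name `sketch_document_hash` and the exact regular expression are assumptions made for this example, and only the standard library is used.

```python
import hashlib
import re
import string

# Assumed pattern: whitespace, digits, and ASCII punctuation count as
# "non-alphabetic". The real operator may use a different definition.
NON_ALPHA = re.compile(r"\s+|\d+|[" + re.escape(string.punctuation) + "]")


def sketch_document_hash(text, lowercase=False, ignore_non_character=False):
    """Return the MD5 hex digest of the optionally normalized text."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        text = NON_ALPHA.sub("", text)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# With both options on, these two strings normalize to "helloworld".
a = sketch_document_hash("Hello, World 42!", lowercase=True, ignore_non_character=True)
b = sketch_document_hash("helloworld")
print(a == b)
```

With both options left at their defaults, the hash is case-sensitive and sensitive to every character, so "Hello" and "hello" are treated as different documents.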

__init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)

Initialization method.

:param backend: the backend for deduplication, either ‘ray_actor’ or ‘redis’
:param redis_address: the address of the Redis server
:param lowercase: whether to convert sample text to lowercase
:param ignore_non_character: whether to ignore non-alphabetic characters, including whitespace, digits, and punctuation
:param args: extra positional arguments
:param kwargs: extra keyword arguments

calculate_hash(sample, context=False)

Calculate the hash value for the sample.
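To illustrate how a per-sample hash can drive exact-match deduplication, the sketch below keeps the first sample seen for each hash. It is a single-process stand-in under stated assumptions: a local `set` plays the role of the shared ‘ray_actor’ or ‘redis’ backend, the helper name `keep_first_occurrences` is hypothetical, and keeping the first occurrence is an assumption rather than documented behavior.

```python
import hashlib


def keep_first_occurrences(texts):
    """Drop texts whose MD5 digest has already been seen (illustrative only)."""
    seen = set()  # stand-in for the distributed backend's hash store
    kept = []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


docs = ["alpha", "beta", "alpha", "gamma", "beta"]
print(keep_first_occurrences(docs))
```

In a distributed setting the membership check and insert would have to go through a store shared by all workers, which is the role the backend parameter suggests.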