data_juicer.ops.deduplicator.ray_document_deduplicator module

class data_juicer.ops.deduplicator.ray_document_deduplicator.RayDocumentDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Bases: RayBasicDeduplicator

Deduplicates samples at the document level using exact matching.

__init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', lowercase: bool = False, ignore_non_character: bool = False, *args, **kwargs)[source]

Initialization method.

:param backend: the backend for deduplication, either 'ray_actor' or 'redis'
:param redis_address: the address of the Redis server
:param lowercase: whether to convert sample text to lower case
:param ignore_non_character: whether to ignore non-alphabet characters, including whitespace, digits, and punctuation
:param args: extra positional args
:param kwargs: extra keyword args
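To make the effect of `lowercase` and `ignore_non_character` concrete, here is a minimal sketch of the kind of text normalization these two flags describe. It is not taken from the operator's source: the helper name `normalize_text` is hypothetical, and the exact character classes that are stripped are an assumption based on the parameter description above (whitespace, digits, and ASCII punctuation).

```python
import re
import string

# Assumption: "non-alphabet characters" covers whitespace, digits, and
# ASCII punctuation, as listed in the parameter description.
_NON_CHAR = re.compile(f"\\s+|\\d+|[{re.escape(string.punctuation)}]")


def normalize_text(text, lowercase=False, ignore_non_character=False):
    """Hypothetical helper: normalize text before exact-match comparison."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        # Drop every whitespace run, digit run, and punctuation character.
        text = _NON_CHAR.sub("", text)
    return text
```

With both flags enabled, two documents that differ only in case, spacing, digits, or punctuation normalize to the same string and would therefore be treated as exact duplicates.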

calculate_hash(sample, context=False)[source]

Calculate the hash value for the sample.
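The idea behind hash-based exact matching can be sketched in a few lines. This is an illustration under stated assumptions, not the operator's actual code: the function names `text_hash` and `dedup_exact`, the `"text"` field name, and the choice of MD5 are all assumptions, and a local `set` stands in for the shared state that the `ray_actor` or `redis` backend would hold across workers.

```python
import hashlib


def text_hash(text):
    """Hypothetical helper: hex digest of the UTF-8 encoded text.

    MD5 is an assumption chosen for illustration only.
    """
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dedup_exact(samples, text_key="text"):
    """Keep the first sample for each distinct hash, preserving order."""
    seen = set()  # stand-in for the shared backend (Ray actor or Redis)
    kept = []
    for sample in samples:
        digest = text_hash(sample[text_key])
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept
```

Storing fixed-size digests rather than full documents keeps the shared lookup structure small, which matters when the set of seen documents has to be consulted from many workers.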