data_juicer.ops.deduplicator.ray_basic_deduplicator module
- data_juicer.ops.deduplicator.ray_basic_deduplicator.get_remote_dedup_set()[source]
Get the remote version of DedupSet, with the Ray decorator applied at runtime.
- class data_juicer.ops.deduplicator.ray_basic_deduplicator.Backend(*args, **kwargs)[source]
Bases:
ABC
Backend for deduplicator.
- class data_juicer.ops.deduplicator.ray_basic_deduplicator.ActorBackend(dedup_set_num: int, RemoteDedupSet=None)[source]
Bases:
Backend
Ray actor backend for deduplicator.
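The dedup_set_num parameter suggests that seen keys are spread across several dedup sets. The following is a minimal in-process sketch of that idea, not the library's implementation: LocalShardedBackend, its is_unique method, and the SHA-256 routing rule are all assumptions made for illustration, and no Ray actors are involved.

```python
import hashlib


class LocalShardedBackend:
    """Hypothetical in-process stand-in for a sharded dedup backend (no Ray)."""

    def __init__(self, dedup_set_num: int):
        if dedup_set_num < 1:
            raise ValueError("dedup_set_num must be at least 1")
        self.dedup_set_num = dedup_set_num
        # One plain set per shard; in a Ray setup each shard could live in its own actor.
        self.dedup_sets = [set() for _ in range(dedup_set_num)]

    def is_unique(self, key: str) -> bool:
        """Return True the first time a key is seen, False on every repeat."""
        # Route the key to a shard deterministically (assumed routing rule).
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        shard = self.dedup_sets[int(digest, 16) % self.dedup_set_num]
        if key in shard:
            return False
        shard.add(key)
        return True


backend = LocalShardedBackend(dedup_set_num=4)
first = backend.is_unique("abc")   # first sighting of "abc"
second = backend.is_unique("abc")  # repeat of "abc"
other = backend.is_unique("xyz")   # first sighting of "xyz"
```

Because identical keys always map to the same shard, each shard can answer membership queries independently of the others.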
- class data_juicer.ops.deduplicator.ray_basic_deduplicator.RedisBackend(redis_address: str)[source]
Bases:
Backend
Redis backend for deduplicator.
- class data_juicer.ops.deduplicator.ray_basic_deduplicator.RayBasicDeduplicator(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)[source]
Bases:
Filter
A basic exact-matching deduplicator for Ray. Although it performs deduplication, it is implemented as a Filter subclass.
- EMPTY_HASH_VALUE = 'EMPTY'
- __init__(backend: str = 'ray_actor', redis_address: str = 'redis://localhost:6379', *args, **kwargs)[source]
Initialization.
- Parameters:
backend -- the deduplication backend, either 'ray_actor' or 'redis'.
redis_address -- the address of the Redis server.
args -- extra positional args.
kwargs -- extra keyword args.
- compute_stats_single(sample, context=False)[source]
Compute stats for the sample, which are used as a metric to decide whether to filter out this sample.
- Parameters:
sample -- input sample.
context -- whether to store context information of intermediate vars in the sample temporarily.
- Returns:
sample with computed stats
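To illustrate how exact-match deduplication can be expressed as a filter, here is a minimal sketch under stated assumptions. The "text" and "is_unique" field names, the explicit seen set, the process_single helper, and the use of SHA-256 are all hypothetical choices for this example and are not taken from the library.

```python
import hashlib


def compute_stats_single(sample: dict, seen: set) -> dict:
    """Record whether this sample's content hash has been seen before."""
    # Hash the text exactly as given; only identical text collides.
    digest = hashlib.sha256(sample["text"].encode("utf-8")).hexdigest()
    sample["is_unique"] = digest not in seen
    seen.add(digest)
    return sample


def process_single(sample: dict) -> bool:
    """Filter decision: keep the sample only if it was the first occurrence."""
    return sample["is_unique"]


samples = [{"text": "hello"}, {"text": "world"}, {"text": "hello"}]
seen: set = set()

# Stats are computed first, then the filter decision is applied per sample.
stats = [compute_stats_single(s, seen) for s in samples]
kept_texts = [s["text"] for s in stats if process_single(s)]
```

Splitting the work into a stats step and a keep-or-drop step is what lets a deduplicator fit the same two-phase shape as any other filter.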