data_juicer.core.executor.ray_executor module
- class data_juicer.core.executor.ray_executor.RayExecutor(cfg: Namespace | None = None)[source]
Bases: ExecutorBase
Executor based on Ray.
Runs Data-Juicer data processing on a distributed cluster.
Currently supports only Filter, Mapper, and Exact Deduplicator operators.
Only .json files can be loaded.
Advanced features such as checkpointing and the tracer are not supported.
- __init__(cfg: Namespace | None = None)[source]
Initialization method.
- Parameters:
cfg -- optional configuration, passed as a Namespace.
- run(load_data_np: Annotated[int, Gt(gt=0)] | None = None, skip_export: bool = False, skip_return: bool = False)[source]
Run the dataset processing pipeline.
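- Parameters:
load_data_np -- number of workers to use when loading the dataset; must be a positive integer if given.
skip_export -- whether to skip exporting the results to disk.
skip_return -- whether to skip returning the dataset when called through the API.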
- Returns:
The processed dataset.
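The following is a minimal usage sketch, not the library's reference workflow. It assumes that a full configuration can be built with `init_configs` from `data_juicer.config` and that a YAML recipe exists at a path you supply. The helper names `build_run_kwargs` and `run_on_ray` are hypothetical and introduced here only for illustration. The first helper mirrors the `Gt(gt=0)` constraint on `load_data_np` from the signature above, so an invalid worker count is rejected before any cluster work starts.

```python
from typing import Any, Dict, Optional


def build_run_kwargs(
    load_data_np: Optional[int] = None,
    skip_export: bool = False,
    skip_return: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for RayExecutor.run().

    Mirrors the documented constraint that load_data_np, when given,
    must be greater than 0.
    """
    if load_data_np is not None and load_data_np <= 0:
        raise ValueError("load_data_np must be a positive integer or None")
    return {
        "load_data_np": load_data_np,
        "skip_export": skip_export,
        "skip_return": skip_return,
    }


def run_on_ray(config_path: str, **run_kwargs: Any):
    """Hypothetical helper: run a Data-Juicer recipe with RayExecutor.

    Imports are done lazily so this module can be loaded without
    Data-Juicer or Ray installed.
    """
    # Assumption: init_configs accepts a list of CLI-style arguments.
    from data_juicer.config import init_configs
    from data_juicer.core.executor.ray_executor import RayExecutor

    cfg = init_configs(args=["--config", config_path])
    executor = RayExecutor(cfg)
    return executor.run(**build_run_kwargs(**run_kwargs))


# Example call (requires Data-Juicer, Ray, and a real recipe file):
# dataset = run_on_ray("path/to/recipe.yaml", load_data_np=4)
```

Passing `skip_return=True` would be the natural choice when only the exported files matter and the caller has no use for the in-memory result.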