data_juicer.core.executor.ray_executor module

class data_juicer.core.executor.ray_executor.TempDirManager(tmp_dir)[source]

Bases: object

__init__(tmp_dir)[source]

class data_juicer.core.executor.ray_executor.RayExecutor(cfg: Namespace | None = None)[source]

Bases: ExecutorBase

Executor based on Ray.

Run Data-Juicer data processing in a distributed cluster.

  1. Only Filter, Mapper, and Exact Deduplicator operators are supported for now.

  2. Only .json files can be loaded.

  3. Advanced features such as checkpointing and the tracer are not supported.
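The second limitation can be checked before a job is submitted to the cluster. The following is a minimal sketch, not part of Data-Juicer: the helper name `has_json_suffix` is hypothetical, and it assumes that "loading .json files" means an exact, case-sensitive `.json` extension.

```python
from pathlib import Path


def has_json_suffix(dataset_path: str) -> bool:
    """Return True if the path ends in the .json extension.

    Hypothetical pre-flight check; assumes an exact, case-sensitive
    match on the final extension of the file name.
    """
    return Path(dataset_path).suffix == ".json"


if __name__ == "__main__":
    for candidate in ("data/demo.json", "data/demo.jsonl", "data/demo.parquet"):
        print(candidate, has_json_suffix(candidate))
```

Running such a check locally avoids starting a distributed job that would fail at the loading step.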

__init__(cfg: Namespace | None = None)[source]

Initialization method.

Parameters:

cfg -- optional config, given as a Namespace (or None).

run(load_data_np: Annotated[int, Gt(gt=0)] | None = None, skip_export: bool = False, skip_return: bool = False)[source]

Run the dataset processing pipeline.

Parameters:
  • load_data_np -- number of workers to use when loading the dataset. Must be greater than 0 if given.

  • skip_export -- whether to skip exporting the results to disk.

  • skip_return -- whether to skip returning the result when called through the API.

Returns:

The processed dataset.
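A possible way to call `run()` is sketched below. The import path and keyword names are taken from the signatures on this page; the helper names `build_run_kwargs` and `process_with_ray` are hypothetical, and `cfg` is assumed to be a config Namespace prepared elsewhere. The Data-Juicer import is deferred so that the argument helper can be used without the package installed.

```python
from typing import Any, Dict, Optional


def build_run_kwargs(load_data_np: Optional[int] = None,
                     export: bool = True,
                     want_dataset: bool = True) -> Dict[str, Any]:
    """Map caller intent onto the keyword arguments documented for run().

    Hypothetical helper: it mirrors the `Annotated[int, Gt(gt=0)] | None`
    annotation by accepting only None or an int greater than 0.
    """
    if load_data_np is not None:
        if isinstance(load_data_np, bool) or not isinstance(load_data_np, int):
            raise TypeError("load_data_np must be an int or None")
        if load_data_np <= 0:
            raise ValueError("load_data_np must be greater than 0")
    return {
        "load_data_np": load_data_np,
        "skip_export": not export,
        "skip_return": not want_dataset,
    }


def process_with_ray(cfg, **intent):
    """Run the pipeline on a Ray cluster (needs data_juicer and ray installed).

    `cfg` is assumed to be an already prepared config Namespace.
    """
    # Imported lazily so this module loads even without data_juicer.
    from data_juicer.core.executor.ray_executor import RayExecutor

    executor = RayExecutor(cfg)
    return executor.run(**build_run_kwargs(**intent))


if __name__ == "__main__":
    # Only the argument helper is exercised here; no cluster is required.
    print(build_run_kwargs(load_data_np=4, export=False))
```

Keeping the argument validation separate from the executor call lets invalid values be rejected before any cluster resources are touched.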