data_juicer.core.executor
- class data_juicer.core.executor.ExecutorFactory
Bases: object
- static create_executor(executor_type: str) -> DefaultExecutor | RayExecutor
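The factory maps a type string to an executor instance. A minimal sketch of that pattern follows, using stand-in classes rather than the real data_juicer ones. The accepted strings ("default" and "ray") and the error raised for unknown types are assumptions made for illustration, not behaviour confirmed by this reference.

```python
# Illustrative stand-ins only; these are NOT the data_juicer classes.
class DefaultExecutor:
    """Stand-in for an executor that runs on the local machine."""


class RayExecutor:
    """Stand-in for an executor that runs on a Ray cluster."""


class ExecutorFactory:
    # Assumed mapping from type string to executor class (hypothetical keys).
    _REGISTRY = {"default": DefaultExecutor, "ray": RayExecutor}

    @staticmethod
    def create_executor(executor_type: str):
        """Return a new executor instance for the given type string."""
        try:
            executor_cls = ExecutorFactory._REGISTRY[executor_type]
        except KeyError:
            # Assumed behaviour: reject unknown types explicitly.
            raise ValueError(
                f"Unsupported executor type: {executor_type!r}"
            ) from None
        return executor_cls()


executor = ExecutorFactory.create_executor("default")
print(type(executor).__name__)
```

Keeping the mapping in one place means callers can pick the executor from a config value without importing either executor class directly.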
- class data_juicer.core.executor.DefaultExecutor(cfg: Namespace | None = None)
Bases: ExecutorBase
This executor processes a single dataset.
It loads the dataset and unifies its format, then applies all the ops in the config file in order to generate the processed dataset.
- __init__(cfg: Namespace | None = None)
Initialization method.
- Parameters:
cfg – optional jsonargparse Namespace.
- run(dataset: Dataset | NestedDataset | None = None, load_data_np: Annotated[int, Gt(gt=0)] | None = None, skip_return=False)
Run the dataset processing pipeline.
- Parameters:
dataset – a Dataset object to be processed.
load_data_np – number of workers to use when loading the dataset.
skip_return – whether to skip returning the processed dataset; intended for API calls.
- Returns:
processed dataset.
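The flow described for this class (load, unify the format, apply the ops in order, return the result) can be sketched in plain Python. This does not use data_juicer itself: the dataset is a list of dicts, and `unify_format`, `strip_whitespace`, `drop_empty` and this `run` function are hypothetical stand-ins. Treating `skip_return=True` as "return None" is also an assumption.

```python
from typing import Callable, Dict, List, Optional

Sample = Dict[str, str]
Op = Callable[[List[Sample]], List[Sample]]


def unify_format(raw: List[Dict[str, str]]) -> List[Sample]:
    # Hypothetical unification: store each record's text under one "text" key.
    return [{"text": r.get("text", r.get("content", ""))} for r in raw]


def strip_whitespace(samples: List[Sample]) -> List[Sample]:
    # Mapper-style op: transforms every sample.
    return [{"text": s["text"].strip()} for s in samples]


def drop_empty(samples: List[Sample]) -> List[Sample]:
    # Filter-style op: keeps only samples with non-empty text.
    return [s for s in samples if s["text"]]


def run(
    raw: List[Dict[str, str]],
    ops: List[Op],
    skip_return: bool = False,
) -> Optional[List[Sample]]:
    dataset = unify_format(raw)
    for op in ops:  # ops are applied strictly in the configured order
        dataset = op(dataset)
    # Assumed semantics: with skip_return=True nothing is handed back.
    return None if skip_return else dataset


raw = [{"text": "  hello "}, {"content": "world"}, {"text": "   "}]
processed = run(raw, [strip_whitespace, drop_empty])
print(processed)
```

Order matters in this sketch: running `drop_empty` before `strip_whitespace` would keep the whitespace-only sample, because its text is not empty until it has been stripped.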
- sample_data(dataset_to_sample: Dataset | None = None, load_data_np=None, sample_ratio: float = 1.0, sample_algo: str = 'uniform', **kwargs)
Sample a subset from the given dataset. TODO: add support for executors other than LocalExecutor.
- Parameters:
dataset_to_sample – Dataset to sample from. If None, will use the formatter linked by the executor. Default is None.
load_data_np – number of workers when loading the dataset.
sample_ratio – The ratio of the sample size to the original dataset size. Default is 1.0 (no sampling).
sample_algo – Sampling algorithm to use. Options are “uniform”, “frequency_specified_field_selector”, or “topk_specified_field_selector”. Default is “uniform”.
- Returns:
A sampled Dataset.
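To make the role of sample_ratio concrete, here is a minimal sketch of uniform sampling over a plain Python sequence. It is not the library's implementation. The helper name `sample_uniform`, the `seed` parameter, the restriction of the ratio to [0, 1], and rounding the sample size to the nearest integer are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_uniform(
    dataset: Sequence[T],
    sample_ratio: float = 1.0,
    seed: int = 0,
) -> List[T]:
    """Draw round(len(dataset) * sample_ratio) items without replacement."""
    if not 0.0 <= sample_ratio <= 1.0:
        # Assumed constraint for this sketch only.
        raise ValueError("sample_ratio must be in [0, 1]")
    n_samples = round(len(dataset) * sample_ratio)
    rng = random.Random(seed)  # local RNG keeps the sketch reproducible
    return rng.sample(list(dataset), n_samples)


data = list(range(100))
subset = sample_uniform(data, sample_ratio=0.1)
print(len(subset))
```

With the default sample_ratio of 1.0 every item is kept, which matches the "no sampling" default described above, although this sketch returns the items in shuffled order.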