data_juicer.core.data.load_strategy module
- class data_juicer.core.data.load_strategy.StrategyKey(executor_type: str, data_type: str, data_source: str)
Bases:
object
Immutable key for strategy registration with wildcard support
- executor_type: str
- data_type: str
- data_source: str
- matches(other: StrategyKey) → bool
Check whether this key matches another key, with wildcard support.
Supports Unix-style wildcards:
- ‘*’ matches any string
- ‘?’ matches any single character
- ‘[seq]’ matches any character in seq
- ‘[!seq]’ matches any character not in seq
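The wildcard rules above are the same ones implemented by Python's standard `fnmatch` module. The following is a minimal sketch of field-by-field matching, not the library's implementation: `KeySketch` is a hypothetical stand-in for `StrategyKey`, and it assumes that the key on which `matches` is called holds the patterns while `other` holds concrete values.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)  # frozen=True makes instances immutable and hashable
class KeySketch:
    """Hypothetical stand-in for StrategyKey; not the library's class."""

    executor_type: str
    data_type: str
    data_source: str

    def matches(self, other: "KeySketch") -> bool:
        # Assumption: each field of `self` is a pattern, and the
        # corresponding field of `other` is tested against it.
        return (
            fnmatchcase(other.executor_type, self.executor_type)
            and fnmatchcase(other.data_type, self.data_type)
            and fnmatchcase(other.data_source, self.data_source)
        )


pattern = KeySketch("default", "local", "*")
print(pattern.matches(KeySketch("default", "local", "arxiv")))  # True
print(pattern.matches(KeySketch("ray", "local", "arxiv")))      # False
```

`fnmatchcase` is used here instead of `fnmatch` so that the comparison stays case-sensitive on every platform.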
- __init__(executor_type: str, data_type: str, data_source: str) → None
- class data_juicer.core.data.load_strategy.DataLoadStrategy(ds_config: Dict, cfg: Namespace)
Bases:
ABC, ConfigValidator
Abstract base class for data load strategies.
- class data_juicer.core.data.load_strategy.DataLoadStrategyRegistry
Bases:
object
Flexible strategy registry with wildcard matching
- classmethod get_strategy_class(executor_type: str, data_type: str, data_source: str) → Type[DataLoadStrategy] | None
Retrieve the most specific matching strategy
Matching priority:
1. Exact match
2. Wildcard matches, from most specific to most general
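The priority order above can be sketched as follows. This is an illustration under stated assumptions rather than the registry's actual code: `pick_most_specific` and the string values in `registry` are hypothetical, keys are plain tuples, and "most specific" is assumed to mean "fewest fields equal to `*`".

```python
from fnmatch import fnmatchcase
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, str]  # (executor_type, data_type, data_source)


def pick_most_specific(registry: Dict[Key, str], query: Key) -> Optional[str]:
    # 1. An exact match wins outright.
    if query in registry:
        return registry[query]

    # 2. Otherwise collect every pattern whose three fields all match.
    candidates = [
        (pattern, value)
        for pattern, value in registry.items()
        if all(fnmatchcase(q, p) for q, p in zip(query, pattern))
    ]
    if not candidates:
        return None

    # Assumed specificity measure: fewer "*" fields means more specific.
    _, value = min(candidates, key=lambda item: item[0].count("*"))
    return value


registry = {
    ("default", "remote", "arxiv"): "ArxivStrategy",
    ("default", "local", "*"): "LocalStrategy",
    ("default", "*", "*"): "FallbackStrategy",
}
print(pick_most_specific(registry, ("default", "remote", "arxiv")))  # ArxivStrategy
print(pick_most_specific(registry, ("default", "local", "json")))    # LocalStrategy
print(pick_most_specific(registry, ("default", "remote", "s3")))     # FallbackStrategy
print(pick_most_specific(registry, ("ray", "local", "json")))        # None
```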
- classmethod register(executor_type: str, data_type: str, data_source: str)
Decorator for registering data load strategies with wildcard support
- Parameters:
executor_type – Type of executor (e.g., ‘default’, ‘ray’)
data_type – Type of data (e.g., ‘local’, ‘remote’)
data_source – Specific data source (e.g., ‘arxiv’, ‘s3’)
- Returns:
Decorator function
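A decorator-based registry of this kind generally follows the pattern sketched below. `RegistrySketch` and `LocalStrategySketch` are hypothetical names used for illustration only; the storage layout (a dict keyed by a three-field tuple) is an assumption, not a description of how `DataLoadStrategyRegistry` stores its entries.

```python
from typing import Callable, Dict, Tuple, Type

Key = Tuple[str, str, str]  # (executor_type, data_type, data_source)


class RegistrySketch:
    """Hypothetical registry; illustrates the decorator pattern only."""

    _strategies: Dict[Key, Type] = {}

    @classmethod
    def register(
        cls, executor_type: str, data_type: str, data_source: str
    ) -> Callable[[Type], Type]:
        def decorator(strategy_class: Type) -> Type:
            # Store the class under its key and return it unchanged,
            # so the decorated class remains usable as normal.
            cls._strategies[(executor_type, data_type, data_source)] = strategy_class
            return strategy_class

        return decorator


# The "*" data source registers this class for any source of local data.
@RegistrySketch.register("default", "local", "*")
class LocalStrategySketch:
    pass


print(RegistrySketch._strategies[("default", "local", "*")].__name__)  # LocalStrategySketch
```

Returning the class unchanged from the inner function is what lets `register(...)` be applied as a class decorator without altering the class itself.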
- class data_juicer.core.data.load_strategy.RayDataLoadStrategy(ds_config: Dict, cfg: Namespace)
Bases:
DataLoadStrategy
Abstract base class for data load strategies used with RayExecutor.
- class data_juicer.core.data.load_strategy.DefaultDataLoadStrategy(ds_config: Dict, cfg: Namespace)
Bases:
DataLoadStrategy
Abstract base class for data load strategies used with LocalExecutor.
- class data_juicer.core.data.load_strategy.RayLocalJsonDataLoadStrategy(ds_config: Dict, cfg: Namespace)
Bases:
RayDataLoadStrategy
- CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
- class data_juicer.core.data.load_strategy.RayHuggingfaceDataLoadStrategy(ds_config: Dict, cfg: Namespace)
Bases:
RayDataLoadStrategy
- CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
- class data_juicer.core.data.load_strategy.DefaultLocalDataLoadStrategy(ds_config: Dict, cfg: Namespace)
Bases:
DefaultDataLoadStrategy
Data load strategy for on-disk data with LocalExecutor. Relies on AutoFormatter for the actual data loading.
- CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
- class data_juicer.core.data.load_strategy.DefaultHuggingfaceDataLoadStrategy(ds_config: Dict, cfg: Namespace)
Bases:
DefaultDataLoadStrategy
data load strategy for Huggingface dataset for LocalExecutor
- CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'optional_fields': ['split', 'limit', 'name', 'data_files', 'data_dir'], 'required_fields': ['path']}
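To show how a rules dictionary of this shape could be applied to a dataset config, here is a minimal sketch. It does not reproduce `ConfigValidator`, whose behavior is not documented on this page: `check_config` is a hypothetical helper, the sample `path` value is made up, and custom validators are assumed to return a truthy value on success.

```python
from typing import Any, Dict, List

# Same shape as the CONFIG_VALIDATION_RULES shown above.
RULES: Dict[str, Any] = {
    "required_fields": ["path"],
    "optional_fields": ["split", "limit", "name", "data_files", "data_dir"],
    "field_types": {"path": str},
    "custom_validators": {},
}


def check_config(ds_config: Dict[str, Any], rules: Dict[str, Any]) -> List[str]:
    """Hypothetical checker: return a list of problems (empty if none)."""
    problems = []
    for field in rules.get("required_fields", []):
        if field not in ds_config:
            problems.append(f"missing required field: {field}")
    for field, expected in rules.get("field_types", {}).items():
        if field in ds_config and not isinstance(ds_config[field], expected):
            problems.append(f"{field} must be {expected.__name__}")
    for field, validator in rules.get("custom_validators", {}).items():
        # Assumption: a validator returns a truthy value when the field is valid.
        if field in ds_config and not validator(ds_config[field]):
            problems.append(f"{field} failed custom validation")
    return problems


print(check_config({"path": "some/dataset", "split": "train"}, RULES))  # []
print(check_config({"split": "train"}, RULES))  # ['missing required field: path']
print(check_config({"path": 3}, RULES))         # ['path must be str']
```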
- class data_juicer.core.data.load_strategy.DefaultModelScopeDataLoadStrategy(ds_config: Dict, cfg: Namespace)
Bases:
DefaultDataLoadStrategy
data load strategy for ModelScope dataset for LocalExecutor
- class data_juicer.core.data.load_strategy.DefaultArxivDataLoadStrategy(ds_config: Dict, cfg: Namespace)
Bases:
DefaultDataLoadStrategy
data load strategy for arXiv dataset for LocalExecutor
- CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
- class data_juicer.core.data.load_strategy.DefaultWikiDataLoadStrategy(ds_config: Dict, cfg: Namespace)
Bases:
DefaultDataLoadStrategy
data load strategy for wiki dataset for LocalExecutor
- CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
- class data_juicer.core.data.load_strategy.DefaultCommonCrawlDataLoadStrategy(ds_config: Dict, cfg: Namespace)
Bases:
DefaultDataLoadStrategy
data load strategy for Common Crawl dataset for LocalExecutor
- CONFIG_VALIDATION_RULES = {'custom_validators': {'end_snapshot': <function validate_snapshot_format>, 'start_snashot': <function validate_snapshot_format>, 'url_limit': <function DefaultCommonCrawlDataLoadStrategy.<lambda>>}, 'field_types': {'end_snapshot': <class 'str'>, 'start_snapshot': <class 'str'>}, 'optional_fields': ['aws', 'url_limit'], 'required_fields': ['start_snapshot', 'end_snapshot']}