data_juicer.core.data.load_strategy module

class data_juicer.core.data.load_strategy.StrategyKey(executor_type: str, data_type: str, data_source: str)[source]

Bases: object

Immutable key for strategy registration with wildcard support

executor_type: str
data_type: str
data_source: str
matches(other: StrategyKey) → bool[source]

Check if this key matches another key with wildcard support

Supports Unix-style wildcards:
  • '*' matches any string

  • '?' matches any single character

  • '[seq]' matches any character in seq

  • '[!seq]' matches any character not in seq
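The wildcard rules listed for matches() are the same ones Python's standard fnmatch module implements, so the behavior can be illustrated with a minimal sketch. KeySketch is a hypothetical stand-in written for this example; it is not the library's StrategyKey, and using fnmatch here is an assumption about how such matching could be done.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)  # frozen=True makes instances immutable and hashable
class KeySketch:
    """Hypothetical stand-in for StrategyKey; not the library's code."""

    executor_type: str
    data_type: str
    data_source: str

    def matches(self, other: "KeySketch") -> bool:
        # This key's fields act as patterns; the other key's fields are the names.
        return (
            fnmatch(other.executor_type, self.executor_type)
            and fnmatch(other.data_type, self.data_type)
            and fnmatch(other.data_source, self.data_source)
        )


pattern = KeySketch("default", "local", "*")
exact_hit = pattern.matches(KeySketch("default", "local", "arxiv"))
miss = pattern.matches(KeySketch("ray", "local", "json"))
# '[!s]*' accepts any data_source whose first character is not 's'.
bracket_hit = KeySketch("*", "remote", "[!s]*").matches(
    KeySketch("ray", "remote", "huggingface")
)
```

Because the sketch is a frozen dataclass, its instances can be used as dictionary keys, which is what a registry keyed on such objects would need.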

__init__(executor_type: str, data_type: str, data_source: str) → None
class data_juicer.core.data.load_strategy.DataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: ABC, ConfigValidator

abstract class for data load strategy

__init__(ds_config: Dict, cfg: Namespace)[source]
abstractmethod load_data(**kwargs) → DJDataset[source]

Need to be implemented in the
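As an illustration of the abstract-method contract, the following minimal sketch shows a base class whose load_data must be overridden before it can be instantiated. LoadStrategySketch and InMemoryStrategySketch are hypothetical names, the 'rows' config field is invented for the example, and a plain list stands in for DJDataset.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class LoadStrategySketch(ABC):
    """Hypothetical simplified base; the real class also takes cfg and validates ds_config."""

    def __init__(self, ds_config: Dict[str, Any]) -> None:
        self.ds_config = ds_config

    @abstractmethod
    def load_data(self, **kwargs: Any) -> List[Dict[str, Any]]:
        """Return the loaded records (a list stands in for DJDataset here)."""


class InMemoryStrategySketch(LoadStrategySketch):
    def load_data(self, **kwargs: Any) -> List[Dict[str, Any]]:
        # 'rows' is an assumed config field used only for this illustration.
        return list(self.ds_config.get("rows", []))


rows = InMemoryStrategySketch({"rows": [{"text": "hello"}]}).load_data()
```

Instantiating LoadStrategySketch directly raises TypeError, because load_data is still abstract.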

class data_juicer.core.data.load_strategy.DataLoadStrategyRegistry[source]

Bases: object

Flexible strategy registry with wildcard matching

classmethod get_strategy_class(executor_type: str, data_type: str, data_source: str) → Type[DataLoadStrategy] | None[source]

Retrieve the most specific matching strategy

Matching priority:
  1. Exact match

  2. Wildcard matches from most specific to most general
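The matching priority can be sketched as a two-step lookup over a dictionary of patterns. This is a sketch under stated assumptions, not the library's implementation: the lookup and specificity helpers are hypothetical, and "most specific" is assumed to mean the pattern with the most wildcard-free fields.

```python
from fnmatch import fnmatch
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, str]  # (executor_type, data_type, data_source)


def specificity(key: Key) -> int:
    # Assumed ranking: count the fields that contain no wildcard characters.
    return sum(1 for part in key if not any(ch in part for ch in "*?["))


def lookup(registry: Dict[Key, str], query: Key) -> Optional[str]:
    # Step 1: an exact match always wins.
    if query in registry:
        return registry[query]
    # Step 2: collect wildcard matches and keep the most specific one.
    candidates = [
        key for key in registry
        if all(fnmatch(name, pat) for name, pat in zip(query, key))
    ]
    if not candidates:
        return None
    return registry[max(candidates, key=specificity)]


# Strategy classes are represented by plain strings to keep the sketch short.
registry = {
    ("default", "local", "*"): "DefaultLocal",
    ("default", "*", "*"): "DefaultAny",
    ("default", "remote", "arxiv"): "DefaultArxiv",
}
exact = lookup(registry, ("default", "remote", "arxiv"))
specific = lookup(registry, ("default", "local", "json"))
general = lookup(registry, ("default", "remote", "s3"))
missing = lookup(registry, ("ray", "local", "json"))
```

Returning None when nothing matches mirrors the `Type[DataLoadStrategy] | None` return annotation.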

classmethod register(executor_type: str, data_type: str, data_source: str)[source]

Decorator for registering data load strategies with wildcard support

Parameters:
  • executor_type -- Type of executor (e.g., 'default', 'ray')

  • data_type -- Type of data (e.g., 'local', 'remote')

  • data_source -- Specific data source (e.g., 'arxiv', 's3')

Returns:

Decorator function
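A decorator-based registration of this kind is commonly written as a classmethod that returns a closure. The sketch below illustrates that pattern only; RegistrySketch, its _strategies dictionary, and LocalLoaderSketch are hypothetical names, and the real registry's internal storage may differ.

```python
from typing import Callable, Dict, Tuple


class RegistrySketch:
    """Hypothetical registry used only to illustrate the decorator pattern."""

    _strategies: Dict[Tuple[str, str, str], type] = {}

    @classmethod
    def register(
        cls, executor_type: str, data_type: str, data_source: str
    ) -> Callable[[type], type]:
        def decorator(strategy_class: type) -> type:
            # Store the class under its (possibly wildcarded) key, then return it
            # unchanged so the decorated name still refers to the class itself.
            cls._strategies[(executor_type, data_type, data_source)] = strategy_class
            return strategy_class

        return decorator


@RegistrySketch.register("default", "local", "*")
class LocalLoaderSketch:
    def load_data(self, **kwargs):
        return []


registered = RegistrySketch._strategies[("default", "local", "*")]
```

Registration happens as a side effect when the class definition is executed, so a strategy becomes available as soon as its module is imported.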

class data_juicer.core.data.load_strategy.RayDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DataLoadStrategy

abstract class for data load strategy for RayExecutor

abstractmethod load_data(**kwargs) → DJDataset[source]

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DataLoadStrategy

abstract class for data load strategy for LocalExecutor

abstractmethod load_data(**kwargs) → DJDataset[source]

Need to be implemented in the

class data_juicer.core.data.load_strategy.RayLocalJsonDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: RayDataLoadStrategy

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]

Need to be implemented in the

class data_juicer.core.data.load_strategy.RayHuggingfaceDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: RayDataLoadStrategy

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultLocalDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

Data load strategy for on-disk data for LocalExecutor; relies on AutoFormatter for the actual data loading

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultHuggingfaceDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

data load strategy for Huggingface dataset for LocalExecutor

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'optional_fields': ['split', 'limit', 'name', 'data_files', 'data_dir'], 'required_fields': ['path']}
load_data(**kwargs)[source]

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultModelScopeDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

data load strategy for ModelScope dataset for LocalExecutor

load_data(**kwargs)[source]

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultArxivDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

data load strategy for arxiv dataset for LocalExecutor

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultWikiDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

data load strategy for wiki dataset for LocalExecutor

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]

Need to be implemented in the

class data_juicer.core.data.load_strategy.DefaultCommonCrawlDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

data load strategy for commoncrawl dataset for LocalExecutor

CONFIG_VALIDATION_RULES = {'custom_validators': {'end_snapshot': <function validate_snapshot_format>, 'start_snashot': <function validate_snapshot_format>, 'url_limit': <function DefaultCommonCrawlDataLoadStrategy.<lambda>>}, 'field_types': {'end_snapshot': <class 'str'>, 'start_snapshot': <class 'str'>}, 'optional_fields': ['aws', 'url_limit'], 'required_fields': ['start_snapshot', 'end_snapshot']}
load_data(**kwargs)[source]

Need to be implemented in the
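The CONFIG_VALIDATION_RULES dictionaries shown for the concrete strategies share a shape: 'required_fields', 'optional_fields', 'field_types', and 'custom_validators'. The sketch below shows one way such rules could be applied to a dataset config. It is not the library's ConfigValidator: validate_config and looks_like_snapshot are hypothetical helpers, the 'YYYY-WW' snapshot format is an assumption, and the real validator may raise an exception instead of returning messages.

```python
import re
from typing import Any, Callable, Dict, List


def validate_config(config: Dict[str, Any], rules: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []
    for field in rules.get("required_fields", []):
        if field not in config:
            errors.append(f"missing required field: {field}")
    for field, expected in rules.get("field_types", {}).items():
        if field in config and not isinstance(config[field], expected):
            errors.append(f"wrong type for field: {field}")
    for field, check in rules.get("custom_validators", {}).items():
        if field in config and not check(config[field]):
            errors.append(f"custom validation failed for field: {field}")
    return errors


def looks_like_snapshot(value: Any) -> bool:
    # Assumed format for this sketch only: four-digit year, dash, two-digit week.
    return isinstance(value, str) and re.fullmatch(r"\d{4}-\d{2}", value) is not None


positive_int: Callable[[Any], bool] = lambda v: isinstance(v, int) and v > 0

rules = {
    "required_fields": ["start_snapshot", "end_snapshot"],
    "optional_fields": ["aws", "url_limit"],
    "field_types": {"start_snapshot": str, "end_snapshot": str},
    "custom_validators": {
        "start_snapshot": looks_like_snapshot,
        "end_snapshot": looks_like_snapshot,
        "url_limit": positive_int,
    },
}

ok = validate_config({"start_snapshot": "2020-50", "end_snapshot": "2021-04"}, rules)
bad = validate_config({"start_snapshot": "2020", "url_limit": 0}, rules)
```

Collecting all problems before reporting lets a user fix a config in one pass rather than one error at a time.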