data_juicer.core.data package

Submodules

data_juicer.core.data.config_validator module

exception data_juicer.core.data.config_validator.ConfigValidationError[source]

Bases: Exception

Custom exception for validation errors

class data_juicer.core.data.config_validator.ConfigValidator[source]

Bases: object

Mixin class for configuration validation

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {}, 'optional_fields': [], 'required_fields': []}
validate_config(ds_config: Dict) → None[source]

Validate the configuration dictionary.

Parameters:

ds_config -- Configuration dictionary to validate

Raises:

ConfigValidationError -- If validation fails

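The rules dict above acts as a small contract for dataset configs. A minimal sketch of how such a mixin might consume it — the class name and validation logic here are illustrative, not data_juicer's actual implementation:

```python
# Illustrative sketch: how a CONFIG_VALIDATION_RULES dict of the shape
# shown above could drive validation. Not data_juicer's real code.
from typing import Dict


class ConfigValidationError(Exception):
    """Raised when a dataset config fails validation."""


class ConfigValidatorSketch:
    CONFIG_VALIDATION_RULES = {
        "required_fields": ["path"],
        "optional_fields": ["split"],
        "field_types": {"path": str},
        "custom_validators": {},
    }

    def validate_config(self, ds_config: Dict) -> None:
        rules = self.CONFIG_VALIDATION_RULES
        # every required field must be present
        for field in rules["required_fields"]:
            if field not in ds_config:
                raise ConfigValidationError(f"missing required field: {field}")
        # present fields must have the declared type
        for field, expected in rules["field_types"].items():
            if field in ds_config and not isinstance(ds_config[field], expected):
                raise ConfigValidationError(f"{field} must be {expected.__name__}")
        # custom per-field callables get the raw value
        for field, check in rules["custom_validators"].items():
            if field in ds_config:
                check(ds_config[field])
```

Subclasses override `CONFIG_VALIDATION_RULES`; the concrete load strategies below populate it the same way.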
data_juicer.core.data.data_validator module

class data_juicer.core.data.data_validator.DataValidator(config: Dict)[source]

Bases: ABC

Base class for data validation

__init__(config: Dict)[source]
abstract validate(dataset: DJDataset) → None[source]

Validate dataset content

Parameters:

dataset -- The dataset to validate

Raises:

DataValidationError -- If validation fails

exception data_juicer.core.data.data_validator.DataValidationError[source]

Bases: Exception

Custom exception for data validation errors

class data_juicer.core.data.data_validator.DataValidatorRegistry[source]

Bases: object

Registry for data validators

classmethod register(validator_type: str)[source]
classmethod get_validator(validator_type: str) → Type[DataValidator] | None[source]
class data_juicer.core.data.data_validator.BaseConversationValidator(config: Dict)[source]

Bases: DataValidator

Base class for conversation validators

__init__(config: Dict)[source]
validate(dataset: DJDataset) → None[source]

Base validation for all conversation formats

abstract validate_conversation(data: Dict) → None[source]

Validate specific conversation format
class data_juicer.core.data.data_validator.SwiftMessagesValidator(config: Dict)[source]

Bases: BaseConversationValidator

Validator for the Swift Messages conversation format.

This validator ensures conversations follow the Swift Messages format with proper message structure and role assignments.

Parameters:

config (Dict) --

Configuration dictionary containing:
  • min_turns (int, optional): Minimum number of messages. Defaults to 1.

  • max_turns (int, optional): Maximum number of messages. Defaults to 100.

  • sample_size (int, optional): Number of samples to validate. Defaults to 100.

Example Format:
{
    "messages": [
        {"role": "system", "content": "<system>"},
        {"role": "user", "content": "<query>"},
        {"role": "assistant", "content": "<response>"},
        ...
    ]
}
Raises:

DataValidationError -- If validation fails due to:
  • Missing 'messages' field

  • Invalid message structure

  • Invalid role values

  • Missing content

  • Message count outside allowed range

validate_conversation(data: Dict) → None[source]

Validate specific conversation format

class data_juicer.core.data.data_validator.DataJuicerFormatValidator(config: Dict)[source]

Bases: BaseConversationValidator

Validator for the Data-Juicer default conversation format.

This validator ensures conversations follow the Data-Juicer format with proper fields and structure.

Parameters:

config (Dict) --

Configuration dictionary containing:
  • min_turns (int, optional): Minimum number of conversation turns. Defaults to 1.

  • max_turns (int, optional): Maximum number of conversation turns. Defaults to 100.

  • sample_size (int, optional): Number of samples to validate. Defaults to 100.

Example Format:
{
    "system": "<system>",  # Optional
    "instruction": "<query-inst>",
    "query": "<query2>",
    "response": "<response2>",
    "history": [  # Optional
        ["<query1>", "<response1>"],
        ...
    ]
}
Raises:

DataValidationError -- If validation fails due to:
  • Missing required fields

  • Invalid field types

  • Invalid conversation structure

  • Turn count outside allowed range

validate_conversation(data: Dict) → None[source]

Validate specific conversation format

class data_juicer.core.data.data_validator.CodeDataValidator(config: Dict)[source]

Bases: DataValidator

Validator for code data

__init__(config: Dict)[source]
validate(dataset: DJDataset) → None[source]

Validate dataset content

Parameters:

dataset -- The dataset to validate

Raises:

DataValidationError -- If validation fails

class data_juicer.core.data.data_validator.RequiredFieldsValidator(config: Dict)[source]

Bases: DataValidator

Validator that checks for required fields in a dataset.

This validator ensures that specified fields exist in the dataset and optionally checks their types and missing value ratios.

Parameters:

config (Dict) -- Configuration dictionary containing:
  • required_fields (List[str]): List of field names that must exist

  • field_types (Dict[str, type], optional): Map of field names to expected types

  • allow_missing (float, optional): Maximum ratio of missing values allowed. Defaults to 0.0.

Example Config:
{
    "required_fields": ["field1", "field2"],
    "field_types": {"field1": str, "field2": int},
    "allow_missing": 0.0
}
Raises:

DataValidationError -- If validation fails

__init__(config: Dict)[source]

Initialize validator with config

Parameters:

config -- Dict containing:
  • required_fields: List of field names that must exist

  • field_types: Optional map of field names to expected types

  • allow_missing: Optional float for the maximum ratio of missing values allowed

validate(dataset: DJDataset) → None[source]

Validate that the dataset has the required fields with correct types

Parameters:

dataset -- NestedDataset or RayDataset to validate

Raises:

DataValidationError -- If validation fails

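The required_fields / field_types / allow_missing semantics can be sketched over plain rows; the real RequiredFieldsValidator operates on DJDataset columns, so this function is only an illustration of the checks:

```python
# Sketch of the RequiredFieldsValidator checks over a list of dicts;
# data_juicer applies the same idea to dataset columns.
from typing import Dict, List


class DataValidationError(Exception):
    pass


def validate_required_fields(rows: List[Dict], config: Dict) -> None:
    required = config["required_fields"]
    field_types = config.get("field_types", {})
    allow_missing = config.get("allow_missing", 0.0)
    for field in required:
        values = [row.get(field) for row in rows]
        # ratio of missing (None/absent) values must stay under the cap
        missing_ratio = sum(v is None for v in values) / max(len(values), 1)
        if missing_ratio > allow_missing:
            raise DataValidationError(f"too many missing values in {field!r}")
        expected = field_types.get(field)
        if expected and any(v is not None and not isinstance(v, expected)
                            for v in values):
            raise DataValidationError(f"{field!r} has wrong type")
```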
data_juicer.core.data.dataset_builder module

class data_juicer.core.data.dataset_builder.DatasetBuilder(cfg: Namespace, executor_type: str = 'default')[source]

Bases: object

DatasetBuilder is a class that builds a dataset from a configuration.

__init__(cfg: Namespace, executor_type: str = 'default')[source]
load_dataset(**kwargs) → DJDataset[source]
classmethod load_dataset_by_generated_config(generated_dataset_config)[source]

Load a dataset from a generated config.

data_juicer.core.data.dataset_builder.rewrite_cli_datapath(dataset_path, max_sample_num=None) → List[source]

Rewrite the dataset_path from the CLI into a dataset config format compatible with the YAML config style, retrofitting CLI input of local files and HuggingFace paths.

Parameters:
  • dataset_path -- a dataset file, a dataset dir, or a list of them, e.g. <w1> ds1.jsonl <w2> ds2_dir <w3> ds3_file.json

  • max_sample_num -- the maximum number of samples to load

Returns:

list of dataset configs

data_juicer.core.data.dataset_builder.parse_cli_datapath(dataset_path) → Tuple[List[str], List[float]][source]

Split every dataset path and its weight.

Parameters:

dataset_path -- a dataset file, a dataset dir, or a list of them, e.g. <w1> ds1.jsonl <w2> ds2_dir <w3> ds3_file.json

Returns:

list of dataset paths and list of weights

data_juicer.core.data.dataset_builder.get_sample_numbers(weights, max_sample_num)[source]

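How a "<w1> ds1.jsonl <w2> ds2_dir" string might decompose into paths and weights, and how weights could then become per-dataset sample counts, can be sketched as follows; data_juicer's exact tokenization and rounding rules are assumptions here:

```python
# Sketch of parse_cli_datapath/get_sample_numbers-style behavior:
# a bare number is read as the weight for the path that follows it,
# and weights default to 1.0. Tokenization rules are assumed.
from typing import List, Tuple


def parse_datapath_sketch(dataset_path: str) -> Tuple[List[str], List[float]]:
    paths, weights = [], []
    weight = 1.0
    for tok in dataset_path.split():
        try:
            weight = float(tok)  # weight for the next path
        except ValueError:
            paths.append(tok)
            weights.append(weight)
            weight = 1.0
    return paths, weights


def sample_numbers_sketch(weights: List[float], max_sample_num: int) -> List[int]:
    # distribute max_sample_num proportionally to the weights
    total = sum(weights)
    return [int(max_sample_num * w / total) for w in weights]
```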
data_juicer.core.data.dj_dataset module

class data_juicer.core.data.dj_dataset.DJDataset[source]

Bases: ABC

Base dataset of DJ

abstract process(operators, *, exporter=None, checkpointer=None, tracer=None) → DJDataset[source]

Process a list of operators on the dataset.

abstract schema() → Schema[source]

Get dataset schema.

Returns:

Dataset schema containing column names and types

Return type:

Schema

abstract get(k: int) → List[Dict[str, Any]][source]

Get k rows from the dataset.

Parameters:

k -- Number of rows to take

Returns:

A list of rows from the dataset.

Return type:

List[Dict[str, Any]]

abstract get_column(column: str, k: int | None = None) → List[Any][source]

Get values from a specific column/field, optionally limited to the first k rows.

Parameters:
  • column -- Name of the column to retrieve

  • k -- Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:
  • KeyError -- If column doesn't exist in the dataset

  • ValueError -- If k is negative
data_juicer.core.data.dj_dataset.wrap_func_with_nested_access(f)[source]

Before calling the actual function f, wrap its args and kargs into nested ones.

Parameters:

f -- function to be wrapped.

Returns:

wrapped function

data_juicer.core.data.dj_dataset.nested_obj_factory(obj)[source]

Use nested classes to wrap the input object.

Parameters:

obj -- object to be nested.

Returns:

nested object

class data_juicer.core.data.dj_dataset.NestedQueryDict(*args, **kargs)[source]

Bases: dict

Enhanced dict for better usability.

__init__(*args, **kargs)[source]
class data_juicer.core.data.dj_dataset.NestedDatasetDict(*args, **kargs)[source]

Bases: DatasetDict

Enhanced HuggingFace-DatasetDict for better usability and efficiency.

__init__(*args, **kargs)[source]
map(**args)[source]

Override the map func, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

class data_juicer.core.data.dj_dataset.NestedDataset(*args, **kargs)[source]

Bases: Dataset, DJDataset

Enhanced HuggingFace-Dataset for better usability and efficiency.

__init__(*args, **kargs)[source]
schema() → Schema[source]

Get dataset schema.

get(k: int) → List[Dict[str, Any]][source]

Get k rows from the dataset.

get_column(column: str, k: int | None = None) → List[Any][source]

Get column values from a HuggingFace dataset.

Parameters:
  • column -- Name of the column to retrieve

  • k -- Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:
  • KeyError -- If column doesn't exist

  • ValueError -- If k is negative

process(operators, *, work_dir=None, exporter=None, checkpointer=None, tracer=None, adapter=None, open_monitor=True)[source]

Process a list of operators on the dataset.

update_args(args, kargs, is_filter=False)[source]
map(*args, **kargs)[source]

Override the map func, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

filter(*args, **kargs)[source]

Override the filter func, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

select(*args, **kargs)[source]

Override the select func so that selected samples can be accessed in a nested manner.

classmethod from_dict(*args, **kargs)[source]

Override the from_dict func, which is called by most from_xx constructors, so that the constructed dataset object is a NestedDataset.

add_column(*args, **kargs)[source]

Override the add_column func so that the processed samples can be accessed in a nested manner.

select_columns(*args, **kargs)[source]

Override the select_columns func so that the processed samples can be accessed in a nested manner.

remove_columns(*args, **kargs)[source]

Override the remove_columns func so that the processed samples can be accessed in a nested manner.

cleanup_cache_files()[source]

Override the cleanup_cache_files func to clear raw and compressed cache files.

static load_from_disk(*args, **kargs)[source]

Loads a dataset that was previously saved using [save_to_disk] from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.

Parameters:
  • dataset_path (path-like) -- Path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train") of the dataset directory where the dataset will be loaded from.

  • keep_in_memory (bool, defaults to None) -- Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See more details in the [improve performance](../cache#improve-performance) section.

  • storage_options (dict, optional) --

    Key/value pairs to be passed on to the file-system backend, if any.

    <Added version="2.8.0"/>

Returns:

  • If dataset_path is a path of a dataset directory, the dataset requested.

  • If dataset_path is a path of a dataset dict directory, a datasets.DatasetDict with each split.

Return type:

[Dataset] or [DatasetDict]

Example:

>>> ds = load_from_disk("path/to/dataset/directory")

data_juicer.core.data.dj_dataset.nested_query(root_obj: NestedDatasetDict | NestedDataset | NestedQueryDict, key)[source]

Find an item in a given object, by first checking the flat layer, then checking nested layers.

Parameters:
  • root_obj -- the object to query

  • key -- the stored item to be queried, e.g., "meta" or "meta.date"

Returns:

the queried item

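The flat-first, then-nested lookup order can be sketched in a few lines; this is an illustration of the idea, not nested_query's actual implementation (which also handles dataset objects):

```python
# Sketch of dotted-key lookup in the spirit of nested_query: try the
# flat key first, then descend through nested dicts part by part.
def nested_query_sketch(root, key):
    if key in root:                       # flat layer, e.g. {"meta.date": ...}
        return root[key]
    obj = root
    for part in key.split("."):           # nested layers, e.g. root["meta"]["date"]
        if not isinstance(obj, dict) or part not in obj:
            return None
        obj = obj[part]
    return obj
```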
data_juicer.core.data.dj_dataset.add_same_content_to_new_column(sample, new_column_name, initial_value=None)[source]

A helper function to speed up the add_column function. Apply map on this function in parallel instead of using add_column.

Parameters:
  • sample -- a single sample to add this new column/field to

  • new_column_name -- the name of this new column/field

  • initial_value -- the initial value of this new column/field

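The idea is that mapping this helper over samples (in data_juicer's case, via Dataset.map with num_proc) parallelizes what add_column does serially. Shown here on plain dicts as a sketch:

```python
# Per-sample helper mirroring add_same_content_to_new_column's
# signature; applied with a plain comprehension here, while data_juicer
# would apply it via Dataset.map for parallelism.
def add_same_content_to_new_column(sample, new_column_name, initial_value=None):
    sample[new_column_name] = initial_value
    return sample


rows = [{"text": "a"}, {"text": "b"}]
rows = [add_same_content_to_new_column(s, "score", 0.0) for s in rows]
```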
data_juicer.core.data.load_strategy module

class data_juicer.core.data.load_strategy.StrategyKey(executor_type: str, data_type: str, data_source: str)[source]

Bases: object

Immutable key for strategy registration with wildcard support

executor_type: str
data_type: str
data_source: str
matches(other: StrategyKey) → bool[source]

Check if this key matches another key with wildcard support.

Supports Unix-style wildcards:
  • '*' matches any string

  • '?' matches any single character

  • '[seq]' matches any character in seq

  • '[!seq]' matches any character not in seq

__init__(executor_type: str, data_type: str, data_source: str) → None
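The wildcard semantics listed above are those of Python's fnmatch module; a minimal StrategyKey-like match could look like this sketch (the real class may differ in details):

```python
# Sketch of StrategyKey.matches using stdlib fnmatch, which implements
# exactly the '*' / '?' / '[seq]' / '[!seq]' semantics listed above.
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class StrategyKeySketch:
    executor_type: str
    data_type: str
    data_source: str

    def matches(self, other: "StrategyKeySketch") -> bool:
        # self's fields are treated as patterns, other's as concrete values
        return (fnmatchcase(other.executor_type, self.executor_type)
                and fnmatchcase(other.data_type, self.data_type)
                and fnmatchcase(other.data_source, self.data_source))
```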
class data_juicer.core.data.load_strategy.DataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: ABC, ConfigValidator

Abstract class for data load strategies

__init__(ds_config: Dict, cfg: Namespace)[source]
abstract load_data(**kwargs) → DJDataset[source]
class data_juicer.core.data.load_strategy.DataLoadStrategyRegistry[source]

Bases: object

Flexible strategy registry with wildcard matching

classmethod get_strategy_class(executor_type: str, data_type: str, data_source: str) → Type[DataLoadStrategy] | None[source]

Retrieve the most specific matching strategy.

Matching priority:
  1. Exact match

  2. Wildcard matches from most specific to most general

classmethod register(executor_type: str, data_type: str, data_source: str)[source]

Decorator for registering data load strategies with wildcard support

Parameters:
  • executor_type -- Type of executor (e.g., 'default', 'ray')

  • data_type -- Type of data (e.g., 'local', 'remote')

  • data_source -- Specific data source (e.g., 'arxiv', 's3')

Returns:

Decorator function

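The "exact match first, then most specific wildcard" priority can be approximated by ranking candidate keys by how many of their fields are concrete; a sketch under that assumption (the registry's actual tie-breaking rules are not documented here):

```python
# Sketch of specificity-ranked lookup in the spirit of
# DataLoadStrategyRegistry.get_strategy_class; keys are plain
# (executor_type, data_type, data_source) tuples for illustration.
from fnmatch import fnmatchcase


def pick_strategy(registered, executor_type, data_type, data_source):
    query = (executor_type, data_type, data_source)
    # keep keys whose every field (a pattern) matches the query field
    candidates = [key for key in registered
                  if all(fnmatchcase(q, pat) for q, pat in zip(query, key))]
    if not candidates:
        return None
    # more concrete (non-'*') fields -> higher priority; an exact match
    # has three concrete fields and therefore always wins
    return max(candidates, key=lambda key: sum(pat != "*" for pat in key))
```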
class data_juicer.core.data.load_strategy.RayDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DataLoadStrategy

Abstract class for data load strategies for RayExecutor

abstract load_data(**kwargs) → DJDataset[source]
class data_juicer.core.data.load_strategy.DefaultDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DataLoadStrategy

Abstract class for data load strategies for LocalExecutor

abstract load_data(**kwargs) → DJDataset[source]
class data_juicer.core.data.load_strategy.RayLocalJsonDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: RayDataLoadStrategy

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]
class data_juicer.core.data.load_strategy.RayHuggingfaceDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: RayDataLoadStrategy

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]
class data_juicer.core.data.load_strategy.DefaultLocalDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

Data load strategy for on-disk data for LocalExecutor; relies on AutoFormatter for the actual data loading

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]
class data_juicer.core.data.load_strategy.DefaultHuggingfaceDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

Data load strategy for HuggingFace datasets for LocalExecutor

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'optional_fields': ['split', 'limit', 'name', 'data_files', 'data_dir'], 'required_fields': ['path']}
load_data(**kwargs)[source]
class data_juicer.core.data.load_strategy.DefaultModelScopeDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

Data load strategy for ModelScope datasets for LocalExecutor

load_data(**kwargs)[source]
class data_juicer.core.data.load_strategy.DefaultArxivDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

Data load strategy for the arXiv dataset for LocalExecutor

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]
class data_juicer.core.data.load_strategy.DefaultWikiDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

Data load strategy for the wiki dataset for LocalExecutor

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]
class data_juicer.core.data.load_strategy.DefaultCommonCrawlDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

Data load strategy for the Common Crawl dataset for LocalExecutor

CONFIG_VALIDATION_RULES = {'custom_validators': {'end_snapshot': <function validate_snapshot_format>, 'start_snashot': <function validate_snapshot_format>, 'url_limit': <function DefaultCommonCrawlDataLoadStrategy.<lambda>>}, 'field_types': {'end_snapshot': <class 'str'>, 'start_snapshot': <class 'str'>}, 'optional_fields': ['aws', 'url_limit'], 'required_fields': ['start_snapshot', 'end_snapshot']}
load_data(**kwargs)[source]

data_juicer.core.data.ray_dataset module

data_juicer.core.data.ray_dataset.get_abs_path(path, dataset_dir)[source]
data_juicer.core.data.ray_dataset.convert_to_absolute_paths(samples, dataset_dir, path_keys)[source]
data_juicer.core.data.ray_dataset.set_dataset_to_absolute_path(dataset, dataset_path, cfg)[source]

Set all the paths in the input data to absolute paths. Checks dataset_dir and project_dir for valid paths.

data_juicer.core.data.ray_dataset.preprocess_dataset(dataset: Dataset, dataset_path, cfg) → Dataset[source]
data_juicer.core.data.ray_dataset.get_num_gpus(op, op_proc)[source]
data_juicer.core.data.ray_dataset.filter_batch(batch, filter_func)[source]
class data_juicer.core.data.ray_dataset.RayDataset(dataset: Dataset, dataset_path: str | None = None, cfg: Namespace | None = None)[source]

Bases: DJDataset

__init__(dataset: Dataset, dataset_path: str | None = None, cfg: Namespace | None = None) → None[source]
schema() → Schema[source]

Get dataset schema.

Returns:

Dataset schema containing column names and types

Return type:

Schema

get(k: int) → List[Dict[str, Any]][source]

Get k rows from the dataset.

get_column(column: str, k: int | None = None) → List[Any][source]

Get column values from a Ray dataset.

Parameters:
  • column -- Name of the column to retrieve

  • k -- Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:
  • KeyError -- If column doesn't exist

  • ValueError -- If k is negative

process(operators, *, exporter=None, checkpointer=None, tracer=None) → DJDataset[source]

Process a list of operators on the dataset.

classmethod read_json(paths: str | List[str]) → RayDataset[source]
class data_juicer.core.data.ray_dataset.JSONStreamDatasource(paths: str | List[str], *, arrow_json_args: Dict[str, Any] | None = None, **file_based_datasource_kwargs)[source]

Bases: JSONDatasource

A temporary Datasource for reading JSON streams.

Note

Depends on a customized pyarrow with an open_json method.

data_juicer.core.data.ray_dataset.read_json_stream(paths: str | List[str], *, filesystem: FileSystem | None = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] | None = None, arrow_open_stream_args: Dict[str, Any] | None = None, meta_provider=None, partition_filter=None, partitioning=Partitioning(style='hive', base_dir='', field_names=None, field_types={}, filesystem=None), include_paths: bool = False, ignore_missing_paths: bool = False, shuffle: Literal['files'] | None = None, file_extensions: List[str] | None = ['json', 'jsonl'], concurrency: int | None = None, override_num_blocks: int | None = None, **arrow_json_args) → Dataset[source]

data_juicer.core.data.schema module

class data_juicer.core.data.schema.Schema(column_types: Dict[str, Any], columns: List[str])[source]

Bases: object

Dataset schema representation.

column_types

Mapping of column names to their types

Type:

Dict[str, Any]

columns

List of column names in order

Type:

List[str]

column_types: Dict[str, Any]
columns: List[str]
classmethod map_hf_type_to_python(feature)[source]

Map a HuggingFace feature type to a Python type.

Recursively maps nested types (e.g., List[str], Dict[str, int]).

Example

Value('string') -> str
Sequence(Value('int32')) -> List[int]
Dict({'text': Value('string')}) -> Dict[str, Any]

Parameters:

feature -- HuggingFace feature type

Returns:

Corresponding Python type

classmethod map_ray_type_to_python(ray_type: DataType) → type[source]

Map a Ray/Arrow data type to a Python type.

Parameters:

ray_type -- PyArrow DataType

Returns:

Corresponding Python type

__init__(column_types: Dict[str, Any], columns: List[str]) → None

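Since Schema is a simple two-field record, its shape is easy to mirror; a hedged sketch of constructing and reading a schema-like object (the dataclass here stands in for the real class, whose construction details may differ):

```python
# Stand-in for data_juicer's Schema: a mapping of column names to
# Python types plus the ordered column list, as documented above.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SchemaSketch:
    column_types: Dict[str, Any]
    columns: List[str]


schema = SchemaSketch(column_types={"text": str, "score": float},
                      columns=["text", "score"])
```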
Module contents

class data_juicer.core.data.DJDataset[source]

Bases: ABC

Base dataset of DJ

abstract process(operators, *, exporter=None, checkpointer=None, tracer=None) → DJDataset[source]

Process a list of operators on the dataset.

abstract schema() → Schema[source]

Get dataset schema.

Returns:

Dataset schema containing column names and types

Return type:

Schema

abstract get(k: int) → List[Dict[str, Any]][source]

Get k rows from the dataset.

Parameters:

k -- Number of rows to take

Returns:

A list of rows from the dataset.

Return type:

List[Dict[str, Any]]

abstract get_column(column: str, k: int | None = None) → List[Any][source]

Get values from a specific column/field, optionally limited to the first k rows.

Parameters:
  • column -- Name of the column to retrieve

  • k -- Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:
  • KeyError -- If column doesn't exist in the dataset

  • ValueError -- If k is negative

class data_juicer.core.data.NestedDataset(*args, **kargs)[source]

Bases: Dataset, DJDataset

Enhanced HuggingFace-Dataset for better usability and efficiency.

__init__(*args, **kargs)[source]
schema() → Schema[source]

Get dataset schema.

get(k: int) → List[Dict[str, Any]][source]

Get k rows from the dataset.

get_column(column: str, k: int | None = None) → List[Any][source]

Get column values from a HuggingFace dataset.

Parameters:
  • column -- Name of the column to retrieve

  • k -- Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:
  • KeyError -- If column doesn't exist

  • ValueError -- If k is negative

process(operators, *, work_dir=None, exporter=None, checkpointer=None, tracer=None, adapter=None, open_monitor=True)[source]

Process a list of operators on the dataset.

update_args(args, kargs, is_filter=False)[source]
map(*args, **kargs)[source]

Override the map func, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

filter(*args, **kargs)[source]

Override the filter func, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

select(*args, **kargs)[source]

Override the select func so that selected samples can be accessed in a nested manner.

classmethod from_dict(*args, **kargs)[source]

Override the from_dict func, which is called by most from_xx constructors, so that the constructed dataset object is a NestedDataset.

add_column(*args, **kargs)[source]

Override the add_column func so that the processed samples can be accessed in a nested manner.

select_columns(*args, **kargs)[source]

Override the select_columns func so that the processed samples can be accessed in a nested manner.

remove_columns(*args, **kargs)[source]

Override the remove_columns func so that the processed samples can be accessed in a nested manner.

cleanup_cache_files()[source]

Override the cleanup_cache_files func to clear raw and compressed cache files.

static load_from_disk(*args, **kargs)[source]

Loads a dataset that was previously saved using [save_to_disk] from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.

Parameters:
  • dataset_path (path-like) -- Path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train") of the dataset directory where the dataset will be loaded from.

  • keep_in_memory (bool, defaults to None) -- Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See more details in the [improve performance](../cache#improve-performance) section.

  • storage_options (dict, optional) --

    Key/value pairs to be passed on to the file-system backend, if any.

    <Added version="2.8.0"/>

Returns:

  • If dataset_path is a path of a dataset directory, the dataset requested.

  • If dataset_path is a path of a dataset dict directory, a datasets.DatasetDict with each split.

Return type:

[Dataset] or [DatasetDict]

Example:

>>> ds = load_from_disk("path/to/dataset/directory")

data_juicer.core.data.wrap_func_with_nested_access(f)[source]

Before calling the actual function f, wrap its args and kargs into nested ones.

Parameters:

f -- function to be wrapped.

Returns:

wrapped function

data_juicer.core.data.add_same_content_to_new_column(sample, new_column_name, initial_value=None)[source]

A helper function to speed up the add_column function. Apply map on this function in parallel instead of using add_column.

Parameters:
  • sample -- a single sample to add this new column/field to

  • new_column_name -- the name of this new column/field

  • initial_value -- the initial value of this new column/field