data_juicer.core.data package

Submodules

data_juicer.core.data.config_validator module

exception data_juicer.core.data.config_validator.ConfigValidationError[source]

Bases: Exception

Custom exception for validation errors

class data_juicer.core.data.config_validator.ConfigValidator[source]

Bases: object

Mixin class for configuration validation

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {}, 'optional_fields': [], 'required_fields': []}
validate_config(ds_config: Dict) → None[source]

Validate the configuration dictionary.

Parameters:

ds_config -- Configuration dictionary to validate

Raises:

ConfigValidationError -- If validation fails

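The rules dict above acts as a small contract for dataset configs. A minimal sketch of how such a mixin might consume it — the class name and validation logic here are illustrative, not data_juicer's actual implementation:

```python
# Illustrative sketch: how a CONFIG_VALIDATION_RULES dict of the shape
# shown above could drive validation. Not data_juicer's real code.
from typing import Dict


class ConfigValidationError(Exception):
    """Raised when a dataset config fails validation."""


class ConfigValidatorSketch:
    CONFIG_VALIDATION_RULES = {
        "required_fields": ["path"],
        "optional_fields": ["split"],
        "field_types": {"path": str},
        "custom_validators": {},
    }

    def validate_config(self, ds_config: Dict) -> None:
        rules = self.CONFIG_VALIDATION_RULES
        # every required field must be present
        for field in rules["required_fields"]:
            if field not in ds_config:
                raise ConfigValidationError(f"missing required field: {field}")
        # present fields must have the declared type
        for field, expected in rules["field_types"].items():
            if field in ds_config and not isinstance(ds_config[field], expected):
                raise ConfigValidationError(f"{field} must be {expected.__name__}")
        # custom per-field callables get the raw value
        for field, check in rules["custom_validators"].items():
            if field in ds_config:
                check(ds_config[field])
```

Subclasses override `CONFIG_VALIDATION_RULES`; the concrete load strategies below populate it the same way.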
data_juicer.core.data.data_validator module

class data_juicer.core.data.data_validator.DataValidator(config: Dict)[source]

Bases: ABC

Base class for data validation

__init__(config: Dict)[source]
abstract validate(dataset: DJDataset) → None[source]

Validate dataset content

Parameters:

dataset -- The dataset to validate

Raises:

DataValidationError -- If validation fails

exception data_juicer.core.data.data_validator.DataValidationError[source]

Bases: Exception

Custom exception for data validation errors

class data_juicer.core.data.data_validator.DataValidatorRegistry[source]

Bases: object

Registry for data validators

classmethod register(validator_type: str)[source]
classmethod get_validator(validator_type: str) → Type[DataValidator] | None[source]
class data_juicer.core.data.data_validator.BaseConversationValidator(config: Dict)[source]

Bases: DataValidator

Base class for conversation validators

__init__(config: Dict)[source]
validate(dataset: DJDataset) → None[source]

Base validation for all conversation formats

abstract validate_conversation(data: Dict) → None[source]

Validate specific conversation format
class data_juicer.core.data.data_validator.SwiftMessagesValidator(config: Dict)[source]

Bases: BaseConversationValidator

Validator for the Swift Messages conversation format.

This validator ensures conversations follow the Swift Messages format with proper message structure and role assignments.

Parameters:

config (Dict) --

Configuration dictionary containing:
  • min_turns (int, optional): Minimum number of messages. Defaults to 1.

  • max_turns (int, optional): Maximum number of messages. Defaults to 100.

  • sample_size (int, optional): Number of samples to validate. Defaults to 100.

Example Format:
{
    "messages": [
        {"role": "system", "content": "<system>"},
        {"role": "user", "content": "<query>"},
        {"role": "assistant", "content": "<response>"},
        ...
    ]
}
Raises:

DataValidationError -- If validation fails due to:
  • Missing 'messages' field

  • Invalid message structure

  • Invalid role values

  • Missing content

  • Message count outside allowed range

validate_conversation(data: Dict) → None[source]

Validate specific conversation format

class data_juicer.core.data.data_validator.DataJuicerFormatValidator(config: Dict)[source]

Bases: BaseConversationValidator

Validator for the Data-Juicer default conversation format.

This validator ensures conversations follow the Data-Juicer format with proper fields and structure.

Parameters:

config (Dict) --

Configuration dictionary containing:
  • min_turns (int, optional): Minimum number of conversation turns. Defaults to 1.

  • max_turns (int, optional): Maximum number of conversation turns. Defaults to 100.

  • sample_size (int, optional): Number of samples to validate. Defaults to 100.

Example Format:
{
    "system": "<system>",  # Optional
    "instruction": "<query-inst>",
    "query": "<query2>",
    "response": "<response2>",
    "history": [  # Optional
        ["<query1>", "<response1>"],
        ...
    ]
}
Raises:

DataValidationError -- If validation fails due to:
  • Missing required fields

  • Invalid field types

  • Invalid conversation structure

  • Turn count outside allowed range

validate_conversation(data: Dict) → None[source]

Validate specific conversation format

class data_juicer.core.data.data_validator.CodeDataValidator(config: Dict)[source]

Bases: DataValidator

Validator for code data

__init__(config: Dict)[source]
validate(dataset: DJDataset) → None[source]

Validate dataset content

Parameters:

dataset -- The dataset to validate

Raises:

DataValidationError -- If validation fails

class data_juicer.core.data.data_validator.RequiredFieldsValidator(config: Dict)[source]

Bases: DataValidator

Validator that checks for required fields in a dataset.

This validator ensures that specified fields exist in the dataset and optionally checks their types and missing value ratios.

Parameters:

config (Dict) -- Configuration dictionary containing:
  • required_fields (List[str]): List of field names that must exist

  • field_types (Dict[str, type], optional): Map of field names to expected types

  • allow_missing (float, optional): Maximum ratio of missing values allowed. Defaults to 0.0.

Example Config:
{
    "required_fields": ["field1", "field2"],
    "field_types": {"field1": str, "field2": int},
    "allow_missing": 0.0
}
Raises:

DataValidationError -- If validation fails

__init__(config: Dict)[source]

Initialize validator with config

Parameters:

config -- Dict containing:
  • required_fields: List of field names that must exist

  • field_types: Optional map of field names to expected types

  • allow_missing: Optional float for the maximum ratio of missing values allowed

validate(dataset: DJDataset) → None[source]

Validate that the dataset has the required fields with correct types

Parameters:

dataset -- NestedDataset or RayDataset to validate

Raises:

DataValidationError -- If validation fails

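The required_fields / field_types / allow_missing semantics can be sketched over plain rows; the real RequiredFieldsValidator operates on DJDataset columns, so this function is only an illustration of the checks:

```python
# Sketch of the RequiredFieldsValidator checks over a list of dicts;
# data_juicer applies the same idea to dataset columns.
from typing import Dict, List


class DataValidationError(Exception):
    pass


def validate_required_fields(rows: List[Dict], config: Dict) -> None:
    required = config["required_fields"]
    field_types = config.get("field_types", {})
    allow_missing = config.get("allow_missing", 0.0)
    for field in required:
        values = [row.get(field) for row in rows]
        # ratio of missing (None/absent) values must stay under the cap
        missing_ratio = sum(v is None for v in values) / max(len(values), 1)
        if missing_ratio > allow_missing:
            raise DataValidationError(f"too many missing values in {field!r}")
        expected = field_types.get(field)
        if expected and any(v is not None and not isinstance(v, expected)
                            for v in values):
            raise DataValidationError(f"{field!r} has wrong type")
```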
data_juicer.core.data.dataset_builder module

class data_juicer.core.data.dataset_builder.DatasetBuilder(cfg: Namespace, executor_type: str = 'default')[source]

Bases: object

DatasetBuilder is a class that builds a dataset from a configuration.

__init__(cfg: Namespace, executor_type: str = 'default')[source]
load_dataset(**kwargs) → DJDataset[source]
classmethod load_dataset_by_generated_config(generated_dataset_config)[source]

Load a dataset from a generated config.

data_juicer.core.data.dataset_builder.rewrite_cli_datapath(dataset_path, max_sample_num=None) → List[source]

Rewrite the dataset_path from the CLI into a dataset config format compatible with the YAML config style, retrofitting CLI input of local files and HuggingFace paths.

Parameters:
  • dataset_path -- a dataset file, a dataset dir, or a list of them, e.g. <w1> ds1.jsonl <w2> ds2_dir <w3> ds3_file.json

  • max_sample_num -- the maximum number of samples to load

Returns:

list of dataset configs

data_juicer.core.data.dataset_builder.parse_cli_datapath(dataset_path) → Tuple[List[str], List[float]][source]

Split every dataset path and its weight.

Parameters:

dataset_path -- a dataset file, a dataset dir, or a list of them, e.g. <w1> ds1.jsonl <w2> ds2_dir <w3> ds3_file.json

Returns:

list of dataset paths and list of weights

data_juicer.core.data.dataset_builder.get_sample_numbers(weights, max_sample_num)[source]

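How a "<w1> ds1.jsonl <w2> ds2_dir" string might decompose into paths and weights, and how weights could then become per-dataset sample counts, can be sketched as follows; data_juicer's exact tokenization and rounding rules are assumptions here:

```python
# Sketch of parse_cli_datapath/get_sample_numbers-style behavior:
# a bare number is read as the weight for the path that follows it,
# and weights default to 1.0. Tokenization rules are assumed.
from typing import List, Tuple


def parse_datapath_sketch(dataset_path: str) -> Tuple[List[str], List[float]]:
    paths, weights = [], []
    weight = 1.0
    for tok in dataset_path.split():
        try:
            weight = float(tok)  # weight for the next path
        except ValueError:
            paths.append(tok)
            weights.append(weight)
            weight = 1.0
    return paths, weights


def sample_numbers_sketch(weights: List[float], max_sample_num: int) -> List[int]:
    # distribute max_sample_num proportionally to the weights
    total = sum(weights)
    return [int(max_sample_num * w / total) for w in weights]
```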
data_juicer.core.data.dj_dataset module

class data_juicer.core.data.dj_dataset.DJDataset[source]

Bases: ABC

Base dataset of DJ

abstract process(operators, *, exporter=None, checkpointer=None, tracer=None) → DJDataset[source]

Process a list of operators on the dataset.

abstract schema() → Schema[source]

Get dataset schema.

Returns:

Dataset schema containing column names and types

Return type:

Schema

abstract get(k: int) → List[Dict[str, Any]][source]

Get k rows from the dataset.

Parameters:

k -- Number of rows to take

Returns:

A list of rows from the dataset.

Return type:

List[Dict[str, Any]]

abstract get_column(column: str, k: int | None = None) → List[Any][source]

Get values from a specific column/field, optionally limited to the first k rows.

Parameters:
  • column -- Name of the column to retrieve

  • k -- Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:
  • KeyError -- If column doesn't exist in the dataset

  • ValueError -- If k is negative
data_juicer.core.data.dj_dataset.wrap_func_with_nested_access(f)[source]

Before calling the actual function f, wrap its args and kargs into nested ones.

Parameters:

f -- function to be wrapped.

Returns:

wrapped function

data_juicer.core.data.dj_dataset.nested_obj_factory(obj)[source]

Use nested classes to wrap the input object.

Parameters:

obj -- object to be nested.

Returns:

nested object

class data_juicer.core.data.dj_dataset.NestedQueryDict(*args, **kargs)[source]

Bases: dict

Enhanced dict for better usability.

__init__(*args, **kargs)[source]
class data_juicer.core.data.dj_dataset.NestedDatasetDict(*args, **kargs)[source]

Bases: DatasetDict

Enhanced HuggingFace-DatasetDict for better usability and efficiency.

__init__(*args, **kargs)[source]
map(**args)[source]

Override the map func, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

class data_juicer.core.data.dj_dataset.NestedDataset(*args, **kargs)[source]

Bases: Dataset, DJDataset

Enhanced HuggingFace-Dataset for better usability and efficiency.

__init__(*args, **kargs)[source]
schema() → Schema[source]

Get dataset schema.

get(k: int) → List[Dict[str, Any]][source]

Get k rows from the dataset.

get_column(column: str, k: int | None = None) → List[Any][source]

Get column values from a HuggingFace dataset.

Parameters:
  • column -- Name of the column to retrieve

  • k -- Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:
  • KeyError -- If column doesn't exist

  • ValueError -- If k is negative

process(operators, *, work_dir=None, exporter=None, checkpointer=None, tracer=None, adapter=None, open_monitor=True)[source]

Process a list of operators on the dataset.

update_args(args, kargs, is_filter=False)[source]
map(*args, **kargs)[source]

Override the map func, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

filter(*args, **kargs)[source]

Override the filter func, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

select(*args, **kargs)[source]

Override the select func so that selected samples can be accessed in a nested manner.

classmethod from_dict(*args, **kargs)[source]

Override the from_dict func, which is called by most from_xx constructors, so that the constructed dataset object is a NestedDataset.

add_column(*args, **kargs)[source]

Override the add_column func so that the processed samples can be accessed in a nested manner.

select_columns(*args, **kargs)[source]

Override the select_columns func so that the processed samples can be accessed in a nested manner.

remove_columns(*args, **kargs)[source]

Override the remove_columns func so that the processed samples can be accessed in a nested manner.

cleanup_cache_files()[source]

Override the cleanup_cache_files func to clear raw and compressed cache files.

static load_from_disk(*args, **kargs)[source]

Loads a dataset that was previously saved using [save_to_disk] from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.

Parameters:
  • dataset_path (path-like) -- Path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train") of the dataset directory where the dataset will be loaded from.

  • keep_in_memory (bool, defaults to None) -- Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See more details in the [improve performance](../cache#improve-performance) section.

  • storage_options (dict, optional) --

    Key/value pairs to be passed on to the file-system backend, if any.

    <Added version="2.8.0"/>

Returns:

  • If dataset_path is a path of a dataset directory, the dataset requested.

  • If dataset_path is a path of a dataset dict directory, a datasets.DatasetDict with each split.

Return type:

[Dataset] or [DatasetDict]

Example:

>>> ds = load_from_disk("path/to/dataset/directory")

data_juicer.core.data.dj_dataset.nested_query(root_obj: NestedDatasetDict | NestedDataset | NestedQueryDict, key)[source]

Find an item in a given object, by first checking the flat layer, then checking nested layers.

Parameters:
  • root_obj -- the object to query

  • key -- the stored item to be queried, e.g., "meta" or "meta.date"

Returns:

the queried item

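The flat-first, then-nested lookup order can be sketched in a few lines; this is an illustration of the idea, not nested_query's actual implementation (which also handles dataset objects):

```python
# Sketch of dotted-key lookup in the spirit of nested_query: try the
# flat key first, then descend through nested dicts part by part.
def nested_query_sketch(root, key):
    if key in root:                       # flat layer, e.g. {"meta.date": ...}
        return root[key]
    obj = root
    for part in key.split("."):           # nested layers, e.g. root["meta"]["date"]
        if not isinstance(obj, dict) or part not in obj:
            return None
        obj = obj[part]
    return obj
```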
data_juicer.core.data.dj_dataset.add_same_content_to_new_column(sample, new_column_name, initial_value=None)[source]

A helper function to speed up the add_column function. Apply map on this function in parallel instead of using add_column.

Parameters:
  • sample -- a single sample to add this new column/field to

  • new_column_name -- the name of this new column/field

  • initial_value -- the initial value of this new column/field

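The idea is that mapping this helper over samples (in data_juicer's case, via Dataset.map with num_proc) parallelizes what add_column does serially. Shown here on plain dicts as a sketch:

```python
# Per-sample helper mirroring add_same_content_to_new_column's
# signature; applied with a plain comprehension here, while data_juicer
# would apply it via Dataset.map for parallelism.
def add_same_content_to_new_column(sample, new_column_name, initial_value=None):
    sample[new_column_name] = initial_value
    return sample


rows = [{"text": "a"}, {"text": "b"}]
rows = [add_same_content_to_new_column(s, "score", 0.0) for s in rows]
```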
data_juicer.core.data.load_strategy module

class data_juicer.core.data.load_strategy.StrategyKey(executor_type: str, data_type: str, data_source: str)[source]

Bases: object

Immutable key for strategy registration with wildcard support

executor_type: str
data_type: str
data_source: str
matches(other: StrategyKey) → bool[source]

Check if this key matches another key with wildcard support.

Supports Unix-style wildcards:
  • '*' matches any string

  • '?' matches any single character

  • '[seq]' matches any character in seq

  • '[!seq]' matches any character not in seq

__init__(executor_type: str, data_type: str, data_source: str) → None
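The wildcard semantics listed above are those of Python's fnmatch module; a minimal StrategyKey-like match could look like this sketch (the real class may differ in details):

```python
# Sketch of StrategyKey.matches using stdlib fnmatch, which implements
# exactly the '*' / '?' / '[seq]' / '[!seq]' semantics listed above.
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class StrategyKeySketch:
    executor_type: str
    data_type: str
    data_source: str

    def matches(self, other: "StrategyKeySketch") -> bool:
        # self's fields are treated as patterns, other's as concrete values
        return (fnmatchcase(other.executor_type, self.executor_type)
                and fnmatchcase(other.data_type, self.data_type)
                and fnmatchcase(other.data_source, self.data_source))
```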
class data_juicer.core.data.load_strategy.DataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: ABC, ConfigValidator

Abstract class for data load strategies

__init__(ds_config: Dict, cfg: Namespace)[source]
abstract load_data(**kwargs) → DJDataset[source]
class data_juicer.core.data.load_strategy.DataLoadStrategyRegistry[source]

Bases: object

Flexible strategy registry with wildcard matching

classmethod get_strategy_class(executor_type: str, data_type: str, data_source: str) → Type[DataLoadStrategy] | None[source]

Retrieve the most specific matching strategy.

Matching priority:
  1. Exact match

  2. Wildcard matches from most specific to most general

classmethod register(executor_type: str, data_type: str, data_source: str)[source]

Decorator for registering data load strategies with wildcard support

Parameters:
  • executor_type -- Type of executor (e.g., 'default', 'ray')

  • data_type -- Type of data (e.g., 'local', 'remote')

  • data_source -- Specific data source (e.g., 'arxiv', 's3')

Returns:

Decorator function

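The "exact match first, then most specific wildcard" priority can be approximated by ranking candidate keys by how many of their fields are concrete; a sketch under that assumption (the registry's actual tie-breaking rules are not documented here):

```python
# Sketch of specificity-ranked lookup in the spirit of
# DataLoadStrategyRegistry.get_strategy_class; keys are plain
# (executor_type, data_type, data_source) tuples for illustration.
from fnmatch import fnmatchcase


def pick_strategy(registered, executor_type, data_type, data_source):
    query = (executor_type, data_type, data_source)
    # keep keys whose every field (a pattern) matches the query field
    candidates = [key for key in registered
                  if all(fnmatchcase(q, pat) for q, pat in zip(query, key))]
    if not candidates:
        return None
    # more concrete (non-'*') fields -> higher priority; an exact match
    # has three concrete fields and therefore always wins
    return max(candidates, key=lambda key: sum(pat != "*" for pat in key))
```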
class data_juicer.core.data.load_strategy.RayDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DataLoadStrategy

Abstract class for data load strategies for RayExecutor

abstract load_data(**kwargs) → DJDataset[source]
class data_juicer.core.data.load_strategy.DefaultDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DataLoadStrategy

Abstract class for data load strategies for LocalExecutor

abstract load_data(**kwargs) → DJDataset[source]
class data_juicer.core.data.load_strategy.RayLocalJsonDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: RayDataLoadStrategy

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]
class data_juicer.core.data.load_strategy.RayHuggingfaceDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: RayDataLoadStrategy

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]
class data_juicer.core.data.load_strategy.DefaultLocalDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

Data load strategy for on-disk data for LocalExecutor; relies on AutoFormatter for the actual data loading

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]
class data_juicer.core.data.load_strategy.DefaultHuggingfaceDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

Data load strategy for HuggingFace datasets for LocalExecutor

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'optional_fields': ['split', 'limit', 'name', 'data_files', 'data_dir'], 'required_fields': ['path']}
load_data(**kwargs)[source]
class data_juicer.core.data.load_strategy.DefaultModelScopeDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

Data load strategy for ModelScope datasets for LocalExecutor

load_data(**kwargs)[source]
class data_juicer.core.data.load_strategy.DefaultArxivDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

Data load strategy for the arXiv dataset for LocalExecutor

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]
class data_juicer.core.data.load_strategy.DefaultWikiDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

Data load strategy for the wiki dataset for LocalExecutor

CONFIG_VALIDATION_RULES = {'custom_validators': {}, 'field_types': {'path': <class 'str'>}, 'required_fields': ['path']}
load_data(**kwargs)[source]
class data_juicer.core.data.load_strategy.DefaultCommonCrawlDataLoadStrategy(ds_config: Dict, cfg: Namespace)[source]

Bases: DefaultDataLoadStrategy

Data load strategy for the Common Crawl dataset for LocalExecutor

CONFIG_VALIDATION_RULES = {'custom_validators': {'end_snapshot': <function validate_snapshot_format>, 'start_snashot': <function validate_snapshot_format>, 'url_limit': <function DefaultCommonCrawlDataLoadStrategy.<lambda>>}, 'field_types': {'end_snapshot': <class 'str'>, 'start_snapshot': <class 'str'>}, 'optional_fields': ['aws', 'url_limit'], 'required_fields': ['start_snapshot', 'end_snapshot']}
load_data(**kwargs)[source]

data_juicer.core.data.ray_dataset module

data_juicer.core.data.ray_dataset.get_abs_path(path, dataset_dir)[source]
data_juicer.core.data.ray_dataset.convert_to_absolute_paths(samples, dataset_dir, path_keys)[source]
data_juicer.core.data.ray_dataset.set_dataset_to_absolute_path(dataset, dataset_path, cfg)[source]

Set all the paths in the input data to absolute paths. Checks dataset_dir and project_dir for valid paths.

data_juicer.core.data.ray_dataset.preprocess_dataset(dataset: Dataset, dataset_path, cfg) → Dataset[source]
data_juicer.core.data.ray_dataset.get_num_gpus(op, op_proc)[source]
data_juicer.core.data.ray_dataset.filter_batch(batch, filter_func)[source]
class data_juicer.core.data.ray_dataset.RayDataset(dataset: Dataset, dataset_path: str | None = None, cfg: Namespace | None = None)[source]

Bases: DJDataset

__init__(dataset: Dataset, dataset_path: str | None = None, cfg: Namespace | None = None) → None[source]
schema() → Schema[source]

Get dataset schema.

Returns:

Dataset schema containing column names and types

Return type:

Schema

get(k: int) → List[Dict[str, Any]][source]

Get k rows from the dataset.

get_column(column: str, k: int | None = None) → List[Any][source]

Get column values from a Ray dataset.

Parameters:
  • column -- Name of the column to retrieve

  • k -- Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:
  • KeyError -- If column doesn't exist

  • ValueError -- If k is negative

process(operators, *, exporter=None, checkpointer=None, tracer=None) → DJDataset[source]

Process a list of operators on the dataset.

classmethod read_json(paths: str | List[str]) → RayDataset[source]
class data_juicer.core.data.ray_dataset.JSONStreamDatasource(paths: str | List[str], *, arrow_json_args: Dict[str, Any] | None = None, **file_based_datasource_kwargs)[source]

Bases: JSONDatasource

A temporary Datasource for reading JSON streams.

Note

Depends on a customized pyarrow with an open_json method.

data_juicer.core.data.ray_dataset.read_json_stream(paths: str | List[str], *, filesystem: FileSystem | None = None, parallelism: int = -1, ray_remote_args: Dict[str, Any] | None = None, arrow_open_stream_args: Dict[str, Any] | None = None, meta_provider=None, partition_filter=None, partitioning=Partitioning(style='hive', base_dir='', field_names=None, field_types={}, filesystem=None), include_paths: bool = False, ignore_missing_paths: bool = False, shuffle: Literal['files'] | None = None, file_extensions: List[str] | None = ['json', 'jsonl'], concurrency: int | None = None, override_num_blocks: int | None = None, **arrow_json_args) → Dataset[source]

data_juicer.core.data.schema module

class data_juicer.core.data.schema.Schema(column_types: Dict[str, Any], columns: List[str])[source]

Bases: object

Dataset schema representation.

column_types

Mapping of column names to their types

Type:

Dict[str, Any]

columns

List of column names in order

Type:

List[str]

column_types: Dict[str, Any]
columns: List[str]
classmethod map_hf_type_to_python(feature)[source]

Map a HuggingFace feature type to a Python type.

Recursively maps nested types (e.g., List[str], Dict[str, int]).

Example

Value('string') -> str
Sequence(Value('int32')) -> List[int]
Dict({'text': Value('string')}) -> Dict[str, Any]

Parameters:

feature -- HuggingFace feature type

Returns:

Corresponding Python type

classmethod map_ray_type_to_python(ray_type: DataType) → type[source]

Map a Ray/Arrow data type to a Python type.

Parameters:

ray_type -- PyArrow DataType

Returns:

Corresponding Python type

__init__(column_types: Dict[str, Any], columns: List[str]) → None

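Since Schema is a simple two-field record, its shape is easy to mirror; a hedged sketch of constructing and reading a schema-like object (the dataclass here stands in for the real class, whose construction details may differ):

```python
# Stand-in for data_juicer's Schema: a mapping of column names to
# Python types plus the ordered column list, as documented above.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SchemaSketch:
    column_types: Dict[str, Any]
    columns: List[str]


schema = SchemaSketch(column_types={"text": str, "score": float},
                      columns=["text", "score"])
```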
Module contents

class data_juicer.core.data.DJDataset[source]

Bases: ABC

Base dataset of DJ

abstract process(operators, *, exporter=None, checkpointer=None, tracer=None) → DJDataset[source]

Process a list of operators on the dataset.

abstract schema() → Schema[source]

Get dataset schema.

Returns:

Dataset schema containing column names and types

Return type:

Schema

abstract get(k: int) → List[Dict[str, Any]][source]

Get k rows from the dataset.

Parameters:

k -- Number of rows to take

Returns:

A list of rows from the dataset.

Return type:

List[Dict[str, Any]]

abstract get_column(column: str, k: int | None = None) → List[Any][source]

Get values from a specific column/field, optionally limited to the first k rows.

Parameters:
  • column -- Name of the column to retrieve

  • k -- Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:
  • KeyError -- If column doesn't exist in the dataset

  • ValueError -- If k is negative

class data_juicer.core.data.NestedDataset(*args, **kargs)[source]

Bases: Dataset, DJDataset

Enhanced HuggingFace-Dataset for better usability and efficiency.

__init__(*args, **kargs)[source]
schema() → Schema[source]

Get dataset schema.

get(k: int) → List[Dict[str, Any]][source]

Get k rows from the dataset.

get_column(column: str, k: int | None = None) → List[Any][source]

Get column values from a HuggingFace dataset.

Parameters:
  • column -- Name of the column to retrieve

  • k -- Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:
  • KeyError -- If column doesn't exist

  • ValueError -- If k is negative

process(operators, *, work_dir=None, exporter=None, checkpointer=None, tracer=None, adapter=None, open_monitor=True)[source]

Process a list of operators on the dataset.

update_args(args, kargs, is_filter=False)[source]
map(*args, **kargs)[source]

Override the map func, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

filter(*args, **kargs)[source]

Override the filter func, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

select(*args, **kargs)[source]

Override the select func so that selected samples can be accessed in a nested manner.

classmethod from_dict(*args, **kargs)[source]

Override the from_dict func, which is called by most from_xx constructors, so that the constructed dataset object is a NestedDataset.

add_column(*args, **kargs)[source]

Override the add_column func so that the processed samples can be accessed in a nested manner.

select_columns(*args, **kargs)[source]

Override the select_columns func so that the processed samples can be accessed in a nested manner.

remove_columns(*args, **kargs)[source]

Override the remove_columns func so that the processed samples can be accessed in a nested manner.

cleanup_cache_files()[source]

Override the cleanup_cache_files func to clear raw and compressed cache files.

static load_from_disk(*args, **kargs)[source]

Loads a dataset that was previously saved using [save_to_disk] from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.

Parameters:
  • dataset_path (path-like) -- Path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train") of the dataset directory where the dataset will be loaded from.

  • keep_in_memory (bool, defaults to None) -- Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See more details in the [improve performance](../cache#improve-performance) section.

  • storage_options (dict, optional) --

    Key/value pairs to be passed on to the file-system backend, if any.

    <Added version="2.8.0"/>

Returns:

  • If dataset_path is a path of a dataset directory, the dataset requested.

  • If dataset_path is a path of a dataset dict directory, a datasets.DatasetDict with each split.

Return type:

[Dataset] or [DatasetDict]

Example:

>>> ds = load_from_disk("path/to/dataset/directory")

data_juicer.core.data.wrap_func_with_nested_access(f)[source]

Before calling the actual function f, wrap its args and kargs into nested ones.

Parameters:

f -- function to be wrapped.

Returns:

wrapped function

data_juicer.core.data.add_same_content_to_new_column(sample, new_column_name, initial_value=None)[source]

A helper function to speed up the add_column function. Apply map on this function in parallel instead of using add_column.

Parameters:
  • sample -- a single sample to add this new column/field to

  • new_column_name -- the name of this new column/field

  • initial_value -- the initial value of this new column/field