data_juicer.core.data

class data_juicer.core.data.DJDataset

Bases: ABC

Abstract base class for Data-Juicer (DJ) datasets.

contain_column(column: str) → bool

Check whether the dataset contains a specific column/field.

Parameters: column -- Name of the column to check
Returns: True if the dataset contains the column, False otherwise
Return type: bool

abstractmethod get(k: int) → List[Dict[str, Any]]

Get k rows from the dataset.

Parameters: k -- Number of rows to take
Returns: A list of rows from the dataset.
Return type: List[Dict[str, Any]]

abstractmethod get_column(column: str, k: int | None = None) → List[Any]

Get values from a specific column/field, optionally limited to the first k rows.

Parameters:

column -- Name of the column to retrieve
k -- Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:

KeyError -- If column doesn't exist in dataset
ValueError -- If k is negative
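
The get_column contract above can be illustrated on a plain list of dicts. This is a minimal sketch, not the Data-Juicer implementation: the free function and the sample rows are hypothetical, and the order in which the two errors are checked is an assumption.

```python
from typing import Any, Dict, List, Optional


def get_column(rows: List[Dict[str, Any]], column: str,
               k: Optional[int] = None) -> List[Any]:
    """Return the values of `column`, optionally limited to the first k rows."""
    if k is not None and k < 0:
        raise ValueError(f"k must be non-negative, got {k}")
    # Assumption: all rows share the same keys, so checking the first row
    # is enough; an empty dataset has no columns in this sketch.
    if not rows or column not in rows[0]:
        raise KeyError(f"column {column!r} not found in dataset")
    selected = rows if k is None else rows[:k]
    return [row[column] for row in selected]


rows = [
    {"text": "a", "score": 0.1},
    {"text": "b", "score": 0.7},
    {"text": "c", "score": 0.4},
]
first_two = get_column(rows, "text", k=2)   # ['a', 'b']
all_scores = get_column(rows, "score")      # k=None returns every row
```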

abstractmethod process(operators, *, exporter=None, checkpointer=None, tracer=None) → DJDataset: Apply a list of operators to the dataset.
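
As a rough illustration of what "applying a list of operators" can look like, the sketch below chains operators over an in-memory dataset. Everything in it is hypothetical: ToyDataset, LowercaseMapper, MinLengthFilter, and the assumption that each operator exposes a run(dataset) method returning a new dataset. The exporter, checkpointer, and tracer hooks are left out.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


class ToyDataset:
    """Hypothetical in-memory dataset; not the Data-Juicer implementation."""

    def __init__(self, rows: List[Row]):
        self.rows = rows

    def process(self, operators) -> "ToyDataset":
        # Accept a single operator or a list, then apply them in order.
        if not isinstance(operators, list):
            operators = [operators]
        dataset = self
        for op in operators:
            dataset = op.run(dataset)  # assumed operator interface
        return dataset


class LowercaseMapper:
    """Hypothetical mapper: lowercases the 'text' field of every row."""

    def run(self, dataset: ToyDataset) -> ToyDataset:
        return ToyDataset([{**r, "text": r["text"].lower()} for r in dataset.rows])


class MinLengthFilter:
    """Hypothetical filter: keeps rows whose 'text' has at least min_len chars."""

    def __init__(self, min_len: int):
        self.min_len = min_len

    def run(self, dataset: ToyDataset) -> ToyDataset:
        return ToyDataset([r for r in dataset.rows if len(r["text"]) >= self.min_len])


ds = ToyDataset([{"text": "Hello World"}, {"text": "Hi"}])
result = ds.process([LowercaseMapper(), MinLengthFilter(min_len=5)])
```

Because process returns a dataset, calls can be chained, and the original ToyDataset is left unchanged in this sketch.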

abstractmethod schema() → Schema

Get dataset schema.

Returns: Dataset schema containing column names and types
Return type: Schema
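
To make "column names and types" concrete, here is a small sketch of a schema object inferred from the first row of a dataset. SimpleSchema and infer_schema are hypothetical stand-ins; the fields of the real Schema class may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SimpleSchema:
    """Hypothetical schema: maps each column name to a Python type."""

    column_types: Dict[str, type]

    @property
    def columns(self) -> List[str]:
        return list(self.column_types)


def infer_schema(rows: List[Dict[str, Any]]) -> SimpleSchema:
    # Assumption: the first row is representative of the whole dataset.
    if not rows:
        return SimpleSchema({})
    return SimpleSchema({name: type(value) for name, value in rows[0].items()})


schema = infer_schema([{"text": "a", "score": 0.1}])
has_text = "text" in schema.columns  # the kind of check contain_column performs
```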

abstractmethod to_list() → list: Convert the current dataset to a Python list.

class data_juicer.core.data.NestedDataset(*args, **kargs)

Bases: Dataset, DJDataset

Enhanced HuggingFace-Dataset for better usability and efficiency.

__init__(*args, **kargs)

add_column(*args, **kargs): Override add_column so that the processed samples can be accessed in a nested manner.

cleanup_cache_files(): Override cleanup_cache_files to clear both raw and compressed cache files.

filter(*args, **kargs): Override filter, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

classmethod from_dict(*args, **kargs): Override from_dict, which is called by most from_xx constructors, so that the constructed dataset object is a NestedDataset.

classmethod from_list(*args, **kargs): Override from_list so that the constructed dataset object is a NestedDataset.

get(k: int) → List[Dict[str, Any]]: Get k rows from the dataset.

get_column(column: str, k: int | None = None) → List[Any]

Get column values from the HuggingFace dataset.

Parameters:

column -- Name of the column to retrieve
k -- Optional number of rows to return. If None, returns all rows

Returns:

List of values from the specified column

Raises:

KeyError -- If column doesn't exist
ValueError -- If k is negative

static load_from_disk(*args, **kargs)

Loads a dataset that was previously saved using save_to_disk from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.

Parameters:

dataset_path (path-like) -- Path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train") of the dataset directory where the dataset will be loaded from.
keep_in_memory (bool, defaults to None) -- Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See the "improve performance" section (../cache#improve-performance) for more details.
storage_options (dict, optional) -- Key/value pairs to be passed on to the file-system backend, if any. Added in version 2.8.0.

Returns:

If dataset_path is a path of a dataset directory, the dataset requested.
If dataset_path is a path of a dataset dict directory, a datasets.DatasetDict with each split.

Return type:

Dataset or DatasetDict

Example:

`>>> ds = load_from_disk("path/to/dataset/directory")`

map(*args, **kargs): Override map, which is called by most common operations, so that the processed samples can be accessed in a nested manner.

process(operators, *, work_dir=None, exporter=None, checkpointer=None, tracer=None, adapter=None, open_monitor=True): Apply a list of operators to the dataset.

remove_columns(*args, **kargs): Override remove_columns so that the processed samples can be accessed in a nested manner.

schema() → Schema: Get dataset schema.

select(*args, **kargs): Override select so that the selected samples can be accessed in a nested manner.

select_columns(*args, **kargs): Override select_columns so that the processed samples can be accessed in a nested manner.

to_list() → list

Returns the dataset as a Python list.

Returns: list

Example:

`>>> ds.to_list()`

update_args(args, kargs, is_filter=False)

data_juicer.core.data.wrap_func_with_nested_access(f)

Wrap the args and kargs of function f into nested ones before calling f.

Parameters: f -- function to be wrapped.
Returns: wrapped function
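
The idea of wrapping arguments before a call can be sketched with a decorator. This is an illustration only, under the assumption that "nested access" means resolving dotted keys such as "meta.lang"; NestedDict and wrap_with_nested_access are hypothetical names, and the real helper may wrap other argument types (for example callables passed to map).

```python
import functools
from typing import Any


class NestedDict(dict):
    """Hypothetical dict that also resolves dotted keys such as 'meta.lang'."""

    def __getitem__(self, key: Any) -> Any:
        # Exact keys win, so a literal "a.b" key keeps working.
        if dict.__contains__(self, key):
            return dict.__getitem__(self, key)
        if isinstance(key, str) and "." in key:
            value: Any = self
            for part in key.split("."):
                if not isinstance(value, dict) or part not in value:
                    raise KeyError(key)
                value = dict.__getitem__(value, part)
            return value
        raise KeyError(key)


def wrap_with_nested_access(f):
    """Convert dict arguments to NestedDict before calling f (illustrative)."""

    def to_nested(obj: Any) -> Any:
        if isinstance(obj, dict) and not isinstance(obj, NestedDict):
            return NestedDict(obj)
        return obj

    @functools.wraps(f)
    def wrapped(*args, **kargs):
        args = [to_nested(a) for a in args]
        kargs = {name: to_nested(v) for name, v in kargs.items()}
        return f(*args, **kargs)

    return wrapped


@wrap_with_nested_access
def get_language(sample):
    return sample["meta.lang"]


lang = get_language({"text": "hi", "meta": {"lang": "en"}})
```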

data_juicer.core.data.add_same_content_to_new_column(sample, new_column_name, initial_value=None)

A helper function to speed up the add_column function. Apply map on this function in parallel instead of using add_column.

Parameters:

sample -- a single sample to add this new column/field.
new_column_name -- the name of this new column/field.
initial_value -- the initial value of this new column/field.
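
A per-sample helper of this kind can be sketched as follows. The function name add_constant_column is a hypothetical stand-in, its body is an assumption based on the parameter descriptions, and a plain list comprehension replaces the parallel map so that the example runs without extra dependencies.

```python
from functools import partial
from typing import Any, Dict


def add_constant_column(sample: Dict[str, Any], new_column_name: str,
                        initial_value: Any = None) -> Dict[str, Any]:
    """Set the same value under a new key on a single sample and return it."""
    sample[new_column_name] = initial_value
    return sample


samples = [{"text": "a"}, {"text": "b"}]

# Bind the column name and value once, then apply the helper to each sample.
add_flag = partial(add_constant_column, new_column_name="flag", initial_value=0)
updated = [add_flag(dict(s)) for s in samples]  # dict(s) copies each sample
```

With a HuggingFace Dataset, the same function could presumably be passed to map together with fn_kwargs and num_proc to run in parallel. Note that a mutable initial_value (such as a dict) would be shared by every sample in this plain-Python sketch.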