data_juicer.core.data
- class data_juicer.core.data.DJDataset
Bases: ABC
Abstract base class for Data-Juicer (DJ) datasets.
- abstract process(operators, *, exporter=None, checkpointer=None, tracer=None) → DJDataset
Apply a list of operators to the dataset.
- abstract schema() → Schema
Get dataset schema.
- Returns:
Dataset schema containing column names and types
- Return type:
Schema
- abstract get(k: int) → List[Dict[str, Any]]
Get k rows from the dataset.
- Parameters:
k – Number of rows to take
- Returns:
A list of rows from the dataset.
- Return type:
List[Dict[str, Any]]
- abstract get_column(column: str, k: int | None = None) → List[Any]
Get values from a specific column/field, optionally limited to first k rows.
- Parameters:
column – Name of the column to retrieve
k – Optional number of rows to return. If None, returns all rows
- Returns:
List of values from the specified column
- Raises:
KeyError – If column doesn’t exist in dataset
ValueError – If k is negative
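To make the abstract interface above concrete, here is a minimal in-memory sketch. It does not import data_juicer: `SketchDataset` and `ListDataset` are hypothetical stand-ins that mirror the documented signatures, `schema()` is omitted, and operators are assumed to be plain callables that take and return a list of rows, which is a simplification and not Data-Juicer's operator API. It also assumes that `get(k)` returns the first k rows.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


class SketchDataset(ABC):
    """Hypothetical stand-in that mirrors part of the documented interface."""

    @abstractmethod
    def process(self, operators, *, exporter=None, checkpointer=None, tracer=None):
        """Apply a list of operators and return a dataset."""

    @abstractmethod
    def get(self, k: int) -> List[Row]:
        """Return k rows."""

    @abstractmethod
    def get_column(self, column: str, k: Optional[int] = None) -> List[Any]:
        """Return values of one column, optionally limited to k rows."""


class ListDataset(SketchDataset):
    """In-memory implementation backed by a list of dict rows."""

    def __init__(self, rows: List[Row]):
        self.rows = list(rows)
        # Assumption: all rows share the columns of the first row.
        self.columns = set(self.rows[0]) if self.rows else set()

    def process(self, operators, *, exporter=None, checkpointer=None, tracer=None):
        rows = self.rows
        for op in operators:  # operators run in the order given
            rows = op(rows)
        return ListDataset(rows)

    def get(self, k: int) -> List[Row]:
        return self.rows[:k]

    def get_column(self, column: str, k: Optional[int] = None) -> List[Any]:
        if k is not None and k < 0:
            raise ValueError(f"k must be non-negative, got {k}")
        if column not in self.columns:
            raise KeyError(column)
        rows = self.rows if k is None else self.rows[:k]
        return [row[column] for row in rows]
```

The two error checks follow the documented contract for `get_column`: a negative `k` raises `ValueError` and an unknown column raises `KeyError`.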
- class data_juicer.core.data.NestedDataset(*args, **kargs)
Bases: Dataset, DJDataset
An enhanced HuggingFace Dataset for better usability and efficiency.
- get_column(column: str, k: int | None = None) → List[Any]
Get column values from HuggingFace dataset.
- Parameters:
column – Name of the column to retrieve
k – Optional number of rows to return. If None, returns all rows
- Returns:
List of values from the specified column
- Raises:
KeyError – If column doesn’t exist
ValueError – If k is negative
- process(operators, *, work_dir=None, exporter=None, checkpointer=None, tracer=None, adapter=None, open_monitor=True)
Apply a list of operators to the dataset.
- map(*args, **kargs)
Override the map function, which is called by most common operations, so that the processed samples can be accessed in a nested manner.
- filter(*args, **kargs)
Override the filter function, which is called by most common operations, so that the processed samples can be accessed in a nested manner.
- select(*args, **kargs)
Override the select function, so that the selected samples can be accessed in a nested manner.
- classmethod from_dict(*args, **kargs)
Override the from_dict function, which is called by most from_xx constructors, so that the constructed dataset object is a NestedDataset.
- add_column(*args, **kargs)
Override the add_column function, so that the processed samples can be accessed in a nested manner.
- select_columns(*args, **kargs)
Override the select_columns function, so that the processed samples can be accessed in a nested manner.
- remove_columns(*args, **kargs)
Override the remove_columns function, so that the processed samples can be accessed in a nested manner.
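The overrides above all exist so that samples support nested access. The page does not define that term, so the sketch below assumes it means looking up a value in a nested dict with a dot-separated key such as "meta.lang". `lookup_dotted` and `NestedSample` are hypothetical names for illustration, not Data-Juicer's classes.

```python
from typing import Any


def lookup_dotted(obj: Any, key: str) -> Any:
    """Resolve a dot-separated key such as "meta.lang" against nested dicts."""
    current = obj
    for part in key.split("."):
        if not isinstance(current, dict) or part not in current:
            raise KeyError(key)
        current = current[part]
    return current


class NestedSample(dict):
    """Hypothetical dict subclass that falls back to dotted-key lookup."""

    def __getitem__(self, key):
        # A literal top-level key always wins; only unknown dotted keys
        # are resolved level by level.
        if isinstance(key, str) and key not in self and "." in key:
            return lookup_dotted(self, key)
        return super().__getitem__(key)


sample = NestedSample({"text": "hi", "meta": {"lang": "en", "stats": {"len": 2}}})
lang = sample["meta.lang"]          # walks sample["meta"]["lang"]
length = sample["meta.stats.len"]   # walks two levels down
```

Under this assumption, the result of a plain HuggingFace `map` or `filter` would lose the wrapper, which is one plausible reason each method is overridden to re-wrap its result.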
- cleanup_cache_files()
Override the cleanup_cache_files function to clear both raw and compressed cache files.
- static load_from_disk(*args, **kargs)
Load a dataset that was previously saved using save_to_disk from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.
- Parameters:
dataset_path (path-like) – Path (e.g. “dataset/train”) or remote URI (e.g. “s3://my-bucket/dataset/train”) of the dataset directory where the dataset will be loaded from.
keep_in_memory (bool, defaults to None) – Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See the “improve performance” section of the datasets cache documentation for details.
storage_options (dict, optional) – Key/value pairs to be passed on to the file-system backend, if any. Added in version 2.8.0.
- Returns:
If dataset_path is a path of a dataset directory, the dataset requested.
If dataset_path is a path of a dataset dict directory, a datasets.DatasetDict with each split.
- Return type:
Dataset or DatasetDict
Example:
>>> ds = load_from_disk("path/to/dataset/directory")
- data_juicer.core.data.wrap_func_with_nested_access(f)
Before calling the actual function f, wrap its positional and keyword arguments into nested ones.
- Parameters:
f – function to be wrapped.
- Returns:
wrapped function
- data_juicer.core.data.add_same_content_to_new_column(sample, new_column_name, initial_value=None)
A helper function to speed up add_column. Apply map on this function in parallel instead of using add_column.
- Parameters:
sample – a single sample to add this new column/field to.
new_column_name – the name of this new column/field.
initial_value – the initial value of this new column/field.
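The idea of adding a column by mapping a per-sample function can be sketched without any dataset library. `add_constant_column` below is a hypothetical stand-in with the same parameters as the documented helper. It is assumed to return a copy of the sample with the new field set, and the built-in `map` stands in for a dataset's parallel map.

```python
from functools import partial


def add_constant_column(sample, new_column_name, initial_value=None):
    """Return a copy of the sample with one extra field set to initial_value."""
    return {**sample, new_column_name: initial_value}


samples = [{"text": "a"}, {"text": "b"}]

# Bind the column name and value once, then apply the function per sample.
add_lang = partial(add_constant_column, new_column_name="lang", initial_value="unknown")
updated = list(map(add_lang, samples))
```

Because each call touches only one sample, the work can be split across workers, which is presumably why the documentation recommends map over add_column here.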