data_juicer.format package¶
Submodules¶
data_juicer.format.csv_formatter module¶
- class data_juicer.format.csv_formatter.CsvFormatter(dataset_path, suffixes=None, **kwargs)[source]¶
Bases:
LocalFormatter
The class is used to load and format csv-type files.
Default suffixes are ['.csv'].
- SUFFIXES = ['.csv']¶
data_juicer.format.empty_formatter module¶
- class data_juicer.format.empty_formatter.EmptyFormatter(length, feature_keys: List[str] = [], *args, **kwargs)[source]¶
Bases:
BaseFormatter
The class is used to create empty data.
- SUFFIXES = []¶
- __init__(length, feature_keys: List[str] = [], *args, **kwargs)[source]¶
Initialization method.
- Parameters:
length – The empty dataset length.
feature_keys – feature key name list.
- property null_value¶
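As an illustration of what "empty data" with a given length and feature keys could look like, here is a minimal sketch in plain Python. It does not use data_juicer; the function name `make_empty_columns` and the `null_value` argument are hypothetical and only mirror the parameter names documented above.

```python
from typing import Any, Dict, List, Optional


def make_empty_columns(
    length: int,
    feature_keys: Optional[List[str]] = None,
    null_value: Any = None,
) -> Dict[str, List[Any]]:
    """Build a column-oriented table with `length` placeholder rows per key.

    Hypothetical helper: it only illustrates the idea of an "empty" dataset.
    """
    # Use None instead of a mutable [] default so calls never share state.
    keys = feature_keys or []
    return {key: [null_value] * length for key in keys}


columns = make_empty_columns(3, ["text", "meta"])
print(columns)
```

With no feature keys the sketch returns an empty mapping, whatever the requested length.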
- class data_juicer.format.empty_formatter.RayEmptyFormatter(length, feature_keys: List[str] = [], *args, **kwargs)[source]¶
Bases:
BaseFormatter
The class is used to create empty data for ray.
- SUFFIXES = []¶
- __init__(length, feature_keys: List[str] = [], *args, **kwargs)[source]¶
Initialization method.
- Parameters:
length – The empty dataset length.
feature_keys – feature key name list.
- property null_value¶
data_juicer.format.formatter module¶
- class data_juicer.format.formatter.LocalFormatter(dataset_path: str, type: str, suffixes: str | List[str] | None = None, text_keys: List[str] | None = None, add_suffix=False, **kwargs)[source]¶
Bases:
BaseFormatter
The class is used to load a dataset from local files or a local directory.
- __init__(dataset_path: str, type: str, suffixes: str | List[str] | None = None, text_keys: List[str] | None = None, add_suffix=False, **kwargs)[source]¶
Initialization method.
- Parameters:
dataset_path – path to a dataset file or a dataset directory
type – a packaged dataset module type (json, csv, etc.)
suffixes – files with specified suffixes to be processed
text_keys – key names of the fields that store sample text.
add_suffix – whether to add the file suffix to dataset meta info
kwargs – extra args
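The interplay of `dataset_path` (a file or a directory) and `suffixes` can be pictured as a file-discovery step. The sketch below is a standalone assumption about that step, written with the standard library only; `find_files_with_suffixes` is a hypothetical name and not part of data_juicer.

```python
import tempfile
from pathlib import Path
from typing import List, Optional, Union


def find_files_with_suffixes(
    dataset_path: str,
    suffixes: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    """Return sorted file paths under dataset_path that end with a wanted suffix.

    Hypothetical helper; with suffixes=None every file is kept.
    """
    if isinstance(suffixes, str):
        suffixes = [suffixes]
    wanted = tuple(suffixes) if suffixes else None

    root = Path(dataset_path)
    # A single file is its own candidate; a directory is walked recursively.
    candidates = [root] if root.is_file() else [p for p in root.rglob("*") if p.is_file()]
    return sorted(
        str(p) for p in candidates if wanted is None or p.name.endswith(wanted)
    )


# Usage: build a throwaway directory and keep only the JSON-like files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.jsonl", "b.csv", "sub/c.jsonl.zst"):
        (root / name).write_text("")
    found = find_files_with_suffixes(tmp, [".jsonl", ".jsonl.zst"])
    relative = [Path(p).relative_to(root).as_posix() for p in found]

print(relative)
```

Matching on the end of the file name rather than on `Path.suffix` lets compound suffixes such as `.jsonl.zst` work.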
- class data_juicer.format.formatter.RemoteFormatter(dataset_path: str, text_keys: List[str] | None = None, **kwargs)[source]¶
Bases:
BaseFormatter
The class is used to load a dataset from a repository on the Hugging Face Hub.
- data_juicer.format.formatter.add_suffixes(datasets: DatasetDict, num_proc: int = 1) Dataset [source]¶
Add a suffix field to datasets.
- Parameters:
datasets – a DatasetDict object
num_proc – number of processes to add suffixes
- Returns:
datasets with suffix features.
- data_juicer.format.formatter.unify_format(dataset: Dataset, text_keys: List[str] | str = 'text', num_proc: int = 1, global_cfg=None) Dataset [source]¶
Get a unified internal format by applying the following modifications:
check the keys of the dataset
filter out samples with empty or None text
- Parameters:
dataset – input dataset
text_keys – original text key(s) of dataset.
num_proc – number of processes for mapping
global_cfg – the global cfg used in subsequent processes, since cfg.text_key may be modified after unifying
- Returns:
unified_format_dataset
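The two modifications listed above can be sketched on a plain list of dictionaries. This is an illustration under assumptions, not the library's code: the real function operates on a `Dataset`, and the name `unify_samples` and the choice of `KeyError` are hypothetical.

```python
from typing import Any, Dict, List, Union

Sample = Dict[str, Any]


def unify_samples(
    samples: List[Sample],
    text_keys: Union[str, List[str]] = "text",
) -> List[Sample]:
    """Check text keys, then drop samples whose text is None or empty."""
    keys = [text_keys] if isinstance(text_keys, str) else list(text_keys)

    # Step 1: every requested text key must exist in every sample.
    for index, sample in enumerate(samples):
        missing = [key for key in keys if key not in sample]
        if missing:
            raise KeyError(f"sample {index} is missing text key(s): {missing}")

    # Step 2: keep a sample only if all of its text fields are non-empty.
    return [s for s in samples if all(s[k] not in (None, "") for k in keys)]


rows = [{"text": "hello"}, {"text": ""}, {"text": None}]
print(unify_samples(rows))
```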
- data_juicer.format.formatter.load_formatter(dataset_path, text_keys=None, suffixes=None, add_suffix=False, **kwargs) BaseFormatter [source]¶
Load the appropriate formatter for different types of data formats.
- Parameters:
dataset_path – Path to dataset file or dataset directory
text_keys – key names of the fields that store sample text. Default: None
suffixes – the suffix of files that will be read. Default: None
- Returns:
a dataset formatter.
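One way to picture "loading the appropriate formatter" is a lookup from file suffix to formatter class. The sketch below uses the `SUFFIXES` values listed on this page but is otherwise an assumption: `pick_formatter_name` is a hypothetical helper that returns a class name as a string, and the actual selection logic in data_juicer may differ (for example when `dataset_path` is a directory or a remote repository).

```python
from typing import Dict, List, Optional

# Suffix lists copied from the SUFFIXES attributes documented on this page.
# TextFormatter is left out to keep the table short.
FORMATTER_SUFFIXES: Dict[str, List[str]] = {
    "CsvFormatter": [".csv"],
    "TsvFormatter": [".tsv"],
    "JsonFormatter": [".json", ".jsonl", ".jsonl.zst"],
    "ParquetFormatter": [".parquet"],
}


def pick_formatter_name(file_name: str) -> Optional[str]:
    """Return the name of the first formatter whose suffixes match file_name."""
    for name, suffixes in FORMATTER_SUFFIXES.items():
        if file_name.endswith(tuple(suffixes)):
            return name
    return None  # no local formatter matched


for candidate in ("train.jsonl.zst", "table.tsv", "blob.bin"):
    print(candidate, "->", pick_formatter_name(candidate))
```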
data_juicer.format.json_formatter module¶
- class data_juicer.format.json_formatter.JsonFormatter(dataset_path, suffixes=None, **kwargs)[source]¶
Bases:
LocalFormatter
The class is used to load and format json-type files.
Default suffixes are ['.json', '.jsonl', '.jsonl.zst'].
- SUFFIXES = ['.json', '.jsonl', '.jsonl.zst']¶
data_juicer.format.load module¶
- data_juicer.format.load.load_formatter(dataset_path, generated_dataset_config=None, text_keys=None, suffixes=[], add_suffix=False, **kwargs) BaseFormatter [source]¶
Load a mixture formatter for multiple datasets in different data formats, each with an optional weight (default 1.0), choosing the formatter for each dataset according to its format.
- Parameters:
dataset_path – path to a dataset file or a dataset directory
generated_dataset_config – Configuration used to create a dataset. The dataset will be created from this configuration if provided. It must contain the type field to specify the dataset name.
text_keys – key names of the fields that store sample text. Default: None
suffixes – files with specified suffixes to be processed.
add_suffix – whether to add the file suffix to dataset meta info
- Returns:
a dataset formatter.
data_juicer.format.mixture_formatter module¶
- class data_juicer.format.mixture_formatter.MixtureFormatter(dataset_path: str, suffixes: str | List[str] | None = None, text_keys=None, add_suffix=False, max_samples=None, **kwargs)[source]¶
Bases:
BaseFormatter
The class mixes multiple datasets by randomly selecting samples from every dataset and merging them, and then exports the merged dataset as a new mixed dataset.
- __init__(dataset_path: str, suffixes: str | List[str] | None = None, text_keys=None, add_suffix=False, max_samples=None, **kwargs)[source]¶
Initialization method.
- Parameters:
dataset_path – a dataset file, a dataset directory, or a list of them, each optionally preceded by a weight (default 1.0), e.g. <w1> ds.jsonl <w2> ds_dir <w3> ds_file.json
suffixes – files with specified suffixes to be processed
text_keys – key names of the fields that store sample text.
add_suffix – whether to add the file suffix to dataset meta info
max_samples – max samples number of mixed dataset.
kwargs – extra args
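The `dataset_path` format `<w1> ds.jsonl <w2> ds_dir <w3> ds_file.json` with optional weights can be parsed roughly as follows. This is a sketch under the assumption that tokens are whitespace-separated and that any token parseable as a float is a weight for the next path; `parse_weighted_paths` is a hypothetical name, not a data_juicer function.

```python
from typing import List, Optional, Tuple


def parse_weighted_paths(dataset_path: str) -> List[Tuple[float, str]]:
    """Split '<w1> path1 <w2> path2 ...' into (weight, path) pairs.

    A path without a preceding weight gets the documented default of 1.0.
    """
    pairs: List[Tuple[float, str]] = []
    pending_weight: Optional[float] = None
    for token in dataset_path.split():
        try:
            # A numeric token is the weight of the path that follows it.
            pending_weight = float(token)
        except ValueError:
            weight = 1.0 if pending_weight is None else pending_weight
            pairs.append((weight, token))
            pending_weight = None
    return pairs


print(parse_weighted_paths("0.5 ds.jsonl ds_dir 2 ds_file.json"))
```

A limitation of this approach is that a path whose name is itself a number would be read as a weight.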
- classmethod random_sample(dataset, weight=1.0, sample_number=0, seed=None)[source]¶
Randomly sample a subset from a dataset by weight or by number. If sample_number is greater than 0, it is used instead of weight.
- Parameters:
dataset – a HuggingFace dataset
weight – sample ratio of the dataset
sample_number – number of samples to draw from the dataset
seed – random sample seed; if None, 42 is used by default
- Returns:
a subset of the dataset
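The precedence of `sample_number` over `weight` and the default seed of 42 can be illustrated on a plain Python list. This is a sketch, not the library implementation: rounding the weighted size up and capping it at the dataset size are assumptions made here, and the real method works on a HuggingFace dataset rather than a list.

```python
import math
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def random_sample(
    items: Sequence[T],
    weight: float = 1.0,
    sample_number: int = 0,
    seed: Optional[int] = None,
) -> List[T]:
    """Sample by explicit count if sample_number > 0, otherwise by ratio."""
    if seed is None:
        seed = 42  # documented default seed

    if sample_number <= 0:
        # Assumption: a fractional size is rounded up.
        sample_number = math.ceil(len(items) * weight)

    # Assumption: never draw more than the dataset holds, never fewer than 0.
    sample_number = max(0, min(sample_number, len(items)))

    rng = random.Random(seed)
    return rng.sample(list(items), sample_number)


data = list(range(10))
print(len(random_sample(data, weight=0.5)))                   # size from weight
print(len(random_sample(data, weight=0.5, sample_number=3)))  # size from count
```

Using a local `random.Random(seed)` keeps repeated calls reproducible without touching the global random state.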
data_juicer.format.parquet_formatter module¶
- class data_juicer.format.parquet_formatter.ParquetFormatter(dataset_path, suffixes=None, **kwargs)[source]¶
Bases:
LocalFormatter
The class is used to load and format parquet-type files.
Default suffixes are ['.parquet'].
- SUFFIXES = ['.parquet']¶
data_juicer.format.text_formatter module¶
- data_juicer.format.text_formatter.extract_txt_from_docx(fn, tgt_path)[source]¶
Extract text from a docx file and save to target path.
- Parameters:
fn – path to input docx file
tgt_path – path to save text file.
- data_juicer.format.text_formatter.extract_txt_from_pdf(fn, tgt_path)[source]¶
Extract text from a pdf file and save to target path.
- Parameters:
fn – path to input pdf file
tgt_path – path to save text file.
- class data_juicer.format.text_formatter.TextFormatter(dataset_path, suffixes=None, add_suffix=False, **kwargs)[source]¶
Bases:
LocalFormatter
The class is used to load and format text-type files.
Supported suffixes include, e.g., ['.txt', '.pdf', '.cpp', '.docx'].
- SUFFIXES = ['.docx', '.pdf', '.txt', '.md', '.tex', '.asm', '.bat', '.cmd', '.c', '.h', '.cs', '.cpp', '.hpp', '.c++', '.h++', '.cc', '.hh', '.C', '.H', '.cmake', '.css', '.dockerfile', '.f90', '.f', '.f03', '.f08', '.f77', '.f95', '.for', '.fpp', '.go', '.hs', '.html', '.java', '.js', '.jl', '.lua', '.markdown', '.php', '.php3', '.php4', '.php5', '.phps', '.phpt', '.pl', '.pm', '.pod', '.perl', '.ps1', '.psd1', '.psm1', '.py', '.rb', '.rs', '.sql', '.scala', '.sh', '.bash', '.command', '.zsh', '.ts', '.tsx', '.vb', 'Dockerfile', 'Makefile', '.xml', '.rst', '.m', '.smali']¶
data_juicer.format.tsv_formatter module¶
- class data_juicer.format.tsv_formatter.TsvFormatter(dataset_path, suffixes=None, **kwargs)[source]¶
Bases:
LocalFormatter
The class is used to load and format tsv-type files.
Default suffixes are ['.tsv'].
- SUFFIXES = ['.tsv']¶
Module contents¶
- data_juicer.format.load_formatter(dataset_path, generated_dataset_config=None, text_keys=None, suffixes=[], add_suffix=False, **kwargs) BaseFormatter [source]¶
Load a mixture formatter for multiple datasets in different data formats, each with an optional weight (default 1.0), choosing the formatter for each dataset according to its format.
- Parameters:
dataset_path – path to a dataset file or a dataset directory
generated_dataset_config – Configuration used to create a dataset. The dataset will be created from this configuration if provided. It must contain the type field to specify the dataset name.
text_keys – key names of the fields that store sample text. Default: None
suffixes – files with specified suffixes to be processed.
add_suffix – whether to add the file suffix to dataset meta info
- Returns:
a dataset formatter.
- class data_juicer.format.JsonFormatter(dataset_path, suffixes=None, **kwargs)[source]¶
Bases:
LocalFormatter
The class is used to load and format json-type files.
Default suffixes are ['.json', '.jsonl', '.jsonl.zst'].
- SUFFIXES = ['.json', '.jsonl', '.jsonl.zst']¶
- class data_juicer.format.LocalFormatter(dataset_path: str, type: str, suffixes: str | List[str] | None = None, text_keys: List[str] | None = None, add_suffix=False, **kwargs)[source]¶
Bases:
BaseFormatter
The class is used to load a dataset from local files or a local directory.
- __init__(dataset_path: str, type: str, suffixes: str | List[str] | None = None, text_keys: List[str] | None = None, add_suffix=False, **kwargs)[source]¶
Initialization method.
- Parameters:
dataset_path – path to a dataset file or a dataset directory
type – a packaged dataset module type (json, csv, etc.)
suffixes – files with specified suffixes to be processed
text_keys – key names of the fields that store sample text.
add_suffix – whether to add the file suffix to dataset meta info
kwargs – extra args
- class data_juicer.format.RemoteFormatter(dataset_path: str, text_keys: List[str] | None = None, **kwargs)[source]¶
Bases:
BaseFormatter
The class is used to load a dataset from a repository on the Hugging Face Hub.
- class data_juicer.format.TextFormatter(dataset_path, suffixes=None, add_suffix=False, **kwargs)[source]¶
Bases:
LocalFormatter
The class is used to load and format text-type files.
Supported suffixes include, e.g., ['.txt', '.pdf', '.cpp', '.docx'].
- SUFFIXES = ['.docx', '.pdf', '.txt', '.md', '.tex', '.asm', '.bat', '.cmd', '.c', '.h', '.cs', '.cpp', '.hpp', '.c++', '.h++', '.cc', '.hh', '.C', '.H', '.cmake', '.css', '.dockerfile', '.f90', '.f', '.f03', '.f08', '.f77', '.f95', '.for', '.fpp', '.go', '.hs', '.html', '.java', '.js', '.jl', '.lua', '.markdown', '.php', '.php3', '.php4', '.php5', '.phps', '.phpt', '.pl', '.pm', '.pod', '.perl', '.ps1', '.psd1', '.psm1', '.py', '.rb', '.rs', '.sql', '.scala', '.sh', '.bash', '.command', '.zsh', '.ts', '.tsx', '.vb', 'Dockerfile', 'Makefile', '.xml', '.rst', '.m', '.smali']¶
- class data_juicer.format.ParquetFormatter(dataset_path, suffixes=None, **kwargs)[source]¶
Bases:
LocalFormatter
The class is used to load and format parquet-type files.
Default suffixes are ['.parquet'].
- SUFFIXES = ['.parquet']¶
- class data_juicer.format.CsvFormatter(dataset_path, suffixes=None, **kwargs)[source]¶
Bases:
LocalFormatter
The class is used to load and format csv-type files.
Default suffixes are ['.csv'].
- SUFFIXES = ['.csv']¶
- class data_juicer.format.TsvFormatter(dataset_path, suffixes=None, **kwargs)[source]¶
Bases:
LocalFormatter
The class is used to load and format tsv-type files.
Default suffixes are ['.tsv'].
- SUFFIXES = ['.tsv']¶
- class data_juicer.format.MixtureFormatter(dataset_path: str, suffixes: str | List[str] | None = None, text_keys=None, add_suffix=False, max_samples=None, **kwargs)[source]¶
Bases:
BaseFormatter
The class mixes multiple datasets by randomly selecting samples from every dataset and merging them, and then exports the merged dataset as a new mixed dataset.
- __init__(dataset_path: str, suffixes: str | List[str] | None = None, text_keys=None, add_suffix=False, max_samples=None, **kwargs)[source]¶
Initialization method.
- Parameters:
dataset_path – a dataset file, a dataset directory, or a list of them, each optionally preceded by a weight (default 1.0), e.g. <w1> ds.jsonl <w2> ds_dir <w3> ds_file.json
suffixes – files with specified suffixes to be processed
text_keys – key names of the fields that store sample text.
add_suffix – whether to add the file suffix to dataset meta info
max_samples – max samples number of mixed dataset.
kwargs – extra args
- classmethod random_sample(dataset, weight=1.0, sample_number=0, seed=None)[source]¶
Randomly sample a subset from a dataset by weight or by number. If sample_number is greater than 0, it is used instead of weight.
- Parameters:
dataset – a HuggingFace dataset
weight – sample ratio of the dataset
sample_number – number of samples to draw from the dataset
seed – random sample seed; if None, 42 is used by default
- Returns:
a subset of the dataset
- class data_juicer.format.EmptyFormatter(length, feature_keys: List[str] = [], *args, **kwargs)[source]¶
Bases:
BaseFormatter
The class is used to create empty data.
- SUFFIXES = []¶
- __init__(length, feature_keys: List[str] = [], *args, **kwargs)[source]¶
Initialization method.
- Parameters:
length – The empty dataset length.
feature_keys – feature key name list.
- property null_value¶
- class data_juicer.format.RayEmptyFormatter(length, feature_keys: List[str] = [], *args, **kwargs)[source]¶
Bases:
BaseFormatter
The class is used to create empty data for ray.
- SUFFIXES = []¶
- __init__(length, feature_keys: List[str] = [], *args, **kwargs)[source]¶
Initialization method.
- Parameters:
length – The empty dataset length.
feature_keys – feature key name list.
- property null_value¶