data_juicer.download.downloader module

class data_juicer.download.downloader.DocumentDownloader

Bases: ABC

Abstract class for downloading remote data to disk

__init__()

abstractmethod download(url)

class data_juicer.download.downloader.DocumentIterator

Bases: ABC

Abstract iterator class for reading in raw records that have been downloaded to disk

__init__()

abstractmethod iterate(file_path)

class data_juicer.download.downloader.DocumentExtractor

Bases: ABC

Abstract class for extracting text from records read from disk

__init__()

abstractmethod extract(content)
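
As an illustration of how the three abstract classes might be subclassed, here is a minimal sketch. To stay runnable without Data-Juicer installed, it defines local stand-ins that mirror only the documented method names; in real code the bases would be imported from data_juicer.download.downloader. The subclass names (LocalCopyDownloader, JsonLinesIterator, TextFieldExtractor), the record shapes, and the convention of returning None to drop a record are all assumptions made for this example, not behaviour documented above.

```python
import json
import shutil
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


# Local stand-ins that mirror the documented interfaces (assumption: only the
# method names and arguments listed in this reference are reproduced).
class DocumentDownloader(ABC):
    @abstractmethod
    def download(self, url): ...


class DocumentIterator(ABC):
    @abstractmethod
    def iterate(self, file_path): ...


class DocumentExtractor(ABC):
    @abstractmethod
    def extract(self, content): ...


class LocalCopyDownloader(DocumentDownloader):
    """Hypothetical downloader: treats `url` as a local path and copies it."""

    def __init__(self, download_dir):
        super().__init__()
        self.download_dir = Path(download_dir)
        self.download_dir.mkdir(parents=True, exist_ok=True)

    def download(self, url):
        target = self.download_dir / Path(url).name
        if not target.exists():
            shutil.copy(url, target)
        # Return the on-disk path of the raw file.
        return str(target)


class JsonLinesIterator(DocumentIterator):
    """Hypothetical iterator: yields one parsed JSON object per non-empty line."""

    def iterate(self, file_path):
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)


class TextFieldExtractor(DocumentExtractor):
    """Hypothetical extractor: keeps the stripped `text` field, drops empty ones."""

    def extract(self, content):
        text = content.get("text", "").strip()
        return {"text": text} if text else None


def demo():
    # Chain the three stages by hand on a small temporary JSONL file.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample.jsonl"
        src.write_text(
            '{"text": " hello "}\n{"text": ""}\n{"text": "world"}\n',
            encoding="utf-8",
        )
        raw_path = LocalCopyDownloader(Path(tmp) / "raw").download(str(src))
        extractor = TextFieldExtractor()
        records = JsonLinesIterator().iterate(raw_path)
        return [r for r in map(extractor.extract, records) if r is not None]
```

Calling demo() runs download, iterate, and extract in sequence and returns only the records whose text field is non-empty.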

data_juicer.download.downloader.download_and_extract(urls: List[str], output_paths: List[str], downloader: DocumentDownloader, iterator: DocumentIterator, extractor: DocumentExtractor, output_format: dict, output_type: str = 'jsonl', keep_raw_download=False, force_download=False, input_meta: str | dict = None, item_limit=None) → Dataset

Downloads and extracts a dataset

Parameters:

urls – A list of urls to download the dataset from
output_paths – A list of paths to save the final extracted output to. The raw output of the downloader will be saved using the path given by downloader.download(url).
downloader – A DocumentDownloader that handles retrieving each file from its url and saving it to storage
iterator – A DocumentIterator that handles iterating through the downloaded file’s format
extractor – A DocumentExtractor that handles extracting the data from its raw format into text
output_format – A dictionary mapping column names to data types for the fields of each datapoint after extraction.
output_type – The file type to save the dataset as.
keep_raw_download – Whether to keep the pre-extracted download file.
force_download – If False, files in output_paths that already exist are not reprocessed; they are read directly instead.
input_meta – A dictionary or a string formatted as a dictionary, which outlines the field names and their respective data types within the JSONL input file.
item_limit – Limit on the number of items downloaded; intended for sampling and testing.

Returns:

A Hugging Face Dataset of the downloaded data
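
The parameters above describe a download, iterate, extract flow with a cache check on output_paths. The sketch below imitates that flow in plain Python to make the roles of force_download, keep_raw_download, and item_limit concrete. It is not Data-Juicer's implementation: run_pipeline and the three stub classes are hypothetical, the one-to-one pairing of urls with output_paths and the per-file application of item_limit are assumptions, and it returns a plain list rather than a Dataset.

```python
import json
import os
import tempfile


def run_pipeline(urls, output_paths, downloader, iterator, extractor,
                 keep_raw_download=False, force_download=False,
                 item_limit=None):
    """Toy stand-in for the documented flow (assumption-laden sketch)."""
    if len(urls) != len(output_paths):
        raise ValueError("urls and output_paths must have the same length")
    all_rows = []
    for url, out_path in zip(urls, output_paths):
        # Existing output is reused unless force_download is set.
        if os.path.exists(out_path) and not force_download:
            with open(out_path, encoding="utf-8") as f:
                all_rows.extend(json.loads(line) for line in f if line.strip())
            continue
        raw_path = downloader.download(url)
        rows = []
        for record in iterator.iterate(raw_path):
            # Assumption: item_limit caps the extracted rows of each file.
            if item_limit is not None and len(rows) >= item_limit:
                break
            extracted = extractor.extract(record)
            if extracted is not None:
                rows.append(extracted)
        with open(out_path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        if not keep_raw_download:
            os.remove(raw_path)
        all_rows.extend(rows)
    return all_rows


class FakeDownloader:
    """Hypothetical stub: writes three fixed lines instead of fetching `url`."""

    def __init__(self, directory):
        self.directory = directory
        self.calls = 0

    def download(self, url):
        self.calls += 1
        path = os.path.join(self.directory, url.rsplit("/", 1)[-1])
        with open(path, "w", encoding="utf-8") as f:
            f.write("alpha\nbeta\ngamma\n")
        return path


class LineIterator:
    """Hypothetical stub: yields each line of the raw file as one record."""

    def iterate(self, file_path):
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")


class UpperExtractor:
    """Hypothetical stub: turns a raw line into a one-field row."""

    def extract(self, content):
        return {"text": content.upper()}
```

Running run_pipeline twice with the same output path calls the downloader only once in this sketch; the second call reads the saved JSONL file, which is the behaviour the force_download description suggests.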

data_juicer.download.downloader.get_wikipedia_urls(language='en', wikidumps_index_prefix='https://dumps.wikimedia.org', dump_date: str | None = None) → List[str]

Retrieves all URLs pointing to the files of a Wikipedia dump: the latest dump by default, or the one selected by dump_date

Parameters:

language – Desired language of the Wikipedia dump.
wikidumps_index_prefix – The base URL for all Wikipedia dumps.
dump_date – A string formatted as “YYYYMMDD” identifying the Wikipedia dump to use. If None, the latest dump is used.
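
To make the dump_date format concrete, here is a small sketch that validates a “YYYYMMDD” string and builds the directory URL where Wikimedia publishes that dump (for example https://dumps.wikimedia.org/enwiki/20240101). The helper wikipedia_dump_index_url is hypothetical and is not part of this module; whether get_wikipedia_urls validates or builds URLs this way is an assumption.

```python
from datetime import datetime


def wikipedia_dump_index_url(language="en",
                             wikidumps_index_prefix="https://dumps.wikimedia.org",
                             dump_date=None):
    """Hypothetical helper: build the index URL for one wiki's dump directory."""
    wiki_index = f"{wikidumps_index_prefix.rstrip('/')}/{language}wiki"
    if dump_date is None:
        # Without a date, point at the listing of all dumps for this wiki.
        return wiki_index
    # strptime alone tolerates single-digit fields, so check the shape first.
    if len(dump_date) != 8 or not (dump_date.isascii() and dump_date.isdigit()):
        raise ValueError(f"dump_date must look like YYYYMMDD, got {dump_date!r}")
    # Raises ValueError for impossible dates such as month 13.
    datetime.strptime(dump_date, "%Y%m%d")
    return f"{wiki_index}/{dump_date}"
```

Checking the date locally gives a clear error before any network request is made.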

data_juicer.download.downloader.get_arxiv_urls()

data_juicer.download.downloader.validate_snapshot_format(snapshot: str | None) → None

Validate snapshot format ‘YYYY-WW’.

Parameters:

snapshot – Snapshot string in the format ‘YYYY-WW’, or None

Raises:

ValueError – If the format is invalid
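
A check of the ‘YYYY-WW’ shape might look like the sketch below. It follows the documented contract (None is accepted, an invalid string raises ValueError), but check_snapshot_format is a hypothetical name, and the accepted week range of 01 to 53 is an assumption; the library's own rules may differ.

```python
import re

# Four digits, a hyphen, two digits (ASCII only).
_SNAPSHOT_RE = re.compile(r"[0-9]{4}-[0-9]{2}")


def check_snapshot_format(snapshot):
    """Hypothetical validator for 'YYYY-WW' strings; returns None when valid."""
    if snapshot is None:
        # None means "no snapshot specified" and is accepted as-is.
        return None
    if not isinstance(snapshot, str) or _SNAPSHOT_RE.fullmatch(snapshot) is None:
        raise ValueError(f"snapshot must be in 'YYYY-WW' format, got {snapshot!r}")
    week = int(snapshot[5:])
    # Assumption: week numbers run from 01 to 53.
    if not 1 <= week <= 53:
        raise ValueError(f"week must be between 01 and 53, got {week:02d}")
    return None
```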