data_juicer.download.wikipedia module

class data_juicer.download.wikipedia.WikipediaDownloader(download_dir, verbose=False)[source]

Bases: DocumentDownloader

__init__(download_dir, verbose=False)[source]
download(url)[source]
class data_juicer.download.wikipedia.WikipediaIterator(language='en', log_frequency=1000)[source]

Bases: DocumentIterator

__init__(language='en', log_frequency=1000)[source]
iterate(file_path)[source]
class data_juicer.download.wikipedia.WikipediaExtractor(language='en', parser=mwparserfromhell)[source]

Bases: DocumentExtractor

__init__(language='en', parser=mwparserfromhell)[source]
extract(content)[source]
data_juicer.download.wikipedia.download_wikipedia(output_path: str, language: str = 'en', dump_date=None, output_type: str = 'jsonl', raw_download_dir=None, keep_raw_download=False, force_download=False, url_limit=None, item_limit=None) → Dataset[source]

Downloads Wikipedia dumps (the latest by default) and extracts their article text using mwparserfromhell.

Parameters:
  • output_path -- The root directory where the output files are written

  • language -- The language of the Wikipedia articles to download

  • dump_date -- A string formatted as "YYYYMMDD" for the wikipedia dump to use. If None, latest dump is used.

  • output_type -- The file type to save the data as.

  • raw_download_dir -- Path to store the raw download files for intermediate processing. If None, they are stored in a folder named "downloads" under output_path.

  • keep_raw_download -- If True, keeps the raw compressed bz2 files after extraction.

  • force_download -- If False, files that already exist in output_path are not re-processed; they are read directly instead.

  • url_limit -- The maximum number of raw files to download from the snapshot. If None, all files from the range of snapshots are downloaded.
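A minimal usage sketch of download_wikipedia, assuming data_juicer is installed and the machine has network access to the Wikipedia dump mirrors. The dump date and the small limit values below are illustrative, not recommendations:

```python
# Illustrative sketch: requires data_juicer and network access to the
# Wikipedia dump mirrors; all concrete values here are hypothetical.
from data_juicer.download.wikipedia import download_wikipedia

# Download a small English sample: restrict the run to a single raw bz2
# file and 50 extracted articles, writing JSONL output under ./wiki_out.
dataset = download_wikipedia(
    output_path="./wiki_out",
    language="en",
    dump_date="20240401",     # hypothetical "YYYYMMDD" date; None uses the latest dump
    output_type="jsonl",
    keep_raw_download=False,  # discard the raw bz2 files after extraction
    url_limit=1,              # only the first raw file from the snapshot
    item_limit=50,            # stop after 50 articles
)
```

Because force_download defaults to False, re-running the same call reads the already-extracted files from output_path instead of downloading again.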