data_juicer.ops.mapper.download_file_mapper module

class data_juicer.ops.mapper.download_file_mapper.DownloadFileMapper(download_field: str = None, save_dir: str = None, save_field: str = None, resume_download: bool = False, timeout: int = 30, max_concurrent: int = 10, *args, **kwargs)

Bases: Mapper

Mapper to download URL files to local files or load them into memory.

This operator downloads files from URLs and either saves them to a specified directory, loads their contents directly into memory, or both. Up to max_concurrent files are downloaded concurrently, and downloads can be resumed when the resume_download flag is set. Nested lists of URLs are flattened for batch processing, and the original structure is reconstructed in the output. If neither save_dir nor save_field is specified, the content is stored under the key image_bytes. Failed download attempts are logged together with an error message for troubleshooting.
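The flatten-then-reconstruct step described above can be illustrated with a minimal sketch. This is not the operator's actual code: the helper names flatten_urls and rebuild, and the "shape" bookkeeping, are hypothetical and chosen only for illustration.

```python
from typing import Any, List, Optional, Tuple, Union

Nested = List[Union[str, List[str]]]


def flatten_urls(nested: Nested) -> Tuple[List[str], List[Optional[int]]]:
    """Flatten one level of nesting and remember the original shape.

    The shape holds None for a bare URL string and the sub-list length
    for a list of URLs. (Hypothetical helper, not part of data_juicer.)
    """
    flat: List[str] = []
    shape: List[Optional[int]] = []
    for item in nested:
        if isinstance(item, list):
            shape.append(len(item))
            flat.extend(item)
        else:
            shape.append(None)
            flat.append(item)
    return flat, shape


def rebuild(flat_results: List[Any], shape: List[Optional[int]]) -> List[Any]:
    """Regroup per-URL results so they mirror the original nesting."""
    out: List[Any] = []
    pos = 0
    for size in shape:
        if size is None:
            out.append(flat_results[pos])
            pos += 1
        else:
            out.append(flat_results[pos:pos + size])
            pos += size
    return out


nested = ["a", ["b", "c"], "d"]
flat, shape = flatten_urls(nested)
# Stand-in for the per-URL download result.
results = [url.upper() for url in flat]
restored = rebuild(results, shape)
```

Here restored has the same nesting as the input, with each URL replaced by its result, which is the property the operator's output is described as having.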

__init__(download_field: str = None, save_dir: str = None, save_field: str = None, resume_download: bool = False, timeout: int = 30, max_concurrent: int = 10, *args, **kwargs)

Initialization method.

Parameters:

save_dir – The directory to save downloaded files.
download_field – The field name that holds the URL(s) to download.
save_field – The field name under which to save the downloaded file content.
resume_download – Whether to resume download. If True, skip the sample if it exists.
timeout – Timeout for download.
max_concurrent – Maximum concurrent downloads.
args – extra positional args
kwargs – extra keyword args
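How timeout and max_concurrent might interact can be sketched with asyncio. This is an assumption-laden illustration, not the operator's implementation: fake_fetch stands in for a real HTTP request, download_all is a hypothetical helper, and treating the timeout as seconds applied to each download is an assumption.

```python
import asyncio
from typing import List, Optional, Tuple

Result = Tuple[str, Optional[bytes], Optional[str]]


async def fake_fetch(url: str, delay: float) -> bytes:
    """Stand-in for a network request (hypothetical, no real I/O)."""
    await asyncio.sleep(delay)
    return f"content of {url}".encode()


async def download_all(
    jobs: List[Tuple[str, float]],
    timeout: float = 30,
    max_concurrent: int = 10,
) -> List[Result]:
    """Run all jobs with at most max_concurrent in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def one(url: str, delay: float) -> Result:
        # The timeout starts only once a slot is acquired, so time spent
        # waiting for the semaphore does not count against it.
        async with semaphore:
            try:
                data = await asyncio.wait_for(fake_fetch(url, delay), timeout=timeout)
                return url, data, None
            except asyncio.TimeoutError:
                # Record the failure instead of raising, so one slow URL
                # does not abort the whole batch.
                return url, None, "timeout"

    # gather returns results in the same order as the input jobs.
    return await asyncio.gather(*(one(u, d) for u, d in jobs))


results = asyncio.run(
    download_all([("u1", 0.0), ("u2", 0.5)], timeout=0.1, max_concurrent=2)
)
```

Each entry in results is a (url, content, error) triple; returning failures as values rather than exceptions mirrors the documented behavior of logging failed downloads while continuing with the rest.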

download_files_async(urls, return_contents, save_dir=None, **kwargs)

download_nested_urls(nested_urls: List[str | List[str]], save_dir=None, save_field_contents=None)

process_batched(samples)