data_juicer.ops.mapper.download_file_mapper module

class data_juicer.ops.mapper.download_file_mapper.DownloadFileMapper(download_field: str = None, save_dir: str = None, save_field: str = None, resume_download: bool = False, timeout: int = 30, max_concurrent: int = 10, *args, **kwargs)

Bases: Mapper

Mapper to download files from URLs to local files or load them into memory.

__init__(download_field: str = None, save_dir: str = None, save_field: str = None, resume_download: bool = False, timeout: int = 30, max_concurrent: int = 10, *args, **kwargs)

Initialization method.

Parameters:
  • save_dir -- The directory to save downloaded files.

  • download_field -- The field name to get the URL to download.

  • save_field -- The field name to save the downloaded file content.

  • resume_download -- Whether to resume download. If True, skip the sample if it already exists.

  • timeout -- Timeout for download.

  • max_concurrent -- Maximum concurrent downloads.

  • args -- Extra positional arguments.

  • kwargs -- Extra keyword arguments.
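
The interaction between max_concurrent and timeout can be illustrated with a minimal, standard-library-only sketch. This is not the DownloadFileMapper implementation: fake_fetch, download_all, and the example.com URLs are hypothetical names introduced here, and the sketch assumes that max_concurrent caps the number of downloads in flight and that timeout applies to each download separately.

```python
import asyncio


async def fake_fetch(url: str) -> bytes:
    # Hypothetical stand-in for a real HTTP request; no network access.
    await asyncio.sleep(0.01)
    return f"content of {url}".encode()


async def download_all(urls, max_concurrent=10, timeout=30):
    # The semaphore admits at most `max_concurrent` downloads at a time.
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0  # highest number of simultaneous downloads observed

    async def download_one(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                # Assumption: the timeout is enforced per download.
                return await asyncio.wait_for(fake_fetch(url), timeout=timeout)
            except asyncio.TimeoutError:
                # A timed-out download yields None instead of raising.
                return None
            finally:
                in_flight -= 1

    # gather() returns results in the same order as the input URLs.
    results = await asyncio.gather(*(download_one(u) for u in urls))
    return results, peak


urls = [f"https://example.com/file_{i}.bin" for i in range(8)]
results, peak = asyncio.run(download_all(urls, max_concurrent=3, timeout=5))
```

Here peak never exceeds the max_concurrent value passed in, and results keeps the order of urls, which is what lets each downloaded item be matched back to its sample.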

download_files_async(urls, return_contents, save_dir=None, **kwargs)
download_nested_urls(nested_urls: List[str | List[str]], save_dir=None, save_field_contents=None)
process_batched(samples)
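
The type List[str | List[str]] accepted by download_nested_urls means each entry is either a single URL or a list of URLs. One plausible way to handle such input, sketched below under assumptions, is to flatten it, process the flat list in one pass, and then restore the original nesting. The helper names map_nested and fake_download and the /tmp/downloads path are hypothetical and are not part of the data_juicer API; the actual method may work differently.

```python
from typing import Callable, List, Union

Nested = List[Union[str, List[str]]]


def map_nested(nested_urls: Nested, fn: Callable[[List[str]], List[str]]) -> list:
    """Apply `fn` to all URLs at once, then rebuild the original nesting."""
    flat: List[str] = []
    shape = []  # None for a bare string, otherwise the inner list length
    for item in nested_urls:
        if isinstance(item, list):
            flat.extend(item)
            shape.append(len(item))
        else:
            flat.append(item)
            shape.append(None)

    flat_results = fn(flat)

    # Walk the recorded shape to slice the flat results back into place.
    out = []
    pos = 0
    for n in shape:
        if n is None:
            out.append(flat_results[pos])
            pos += 1
        else:
            out.append(flat_results[pos:pos + n])
            pos += n
    return out


def fake_download(urls: List[str]) -> List[str]:
    # Hypothetical stand-in: maps each URL to a local path without any I/O.
    return ["/tmp/downloads/" + u.rsplit("/", 1)[-1] for u in urls]


nested = [
    "https://example.com/a.jpg",
    ["https://example.com/b.jpg", "https://example.com/c.jpg"],
    [],
]
local_paths = map_nested(nested, fake_download)
```

Flattening first lets all URLs share a single concurrent download pass, while the recorded shape keeps each sample's entries aligned with its inputs, including empty lists.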