data_juicer.utils.file_utils module

class data_juicer.utils.file_utils.Sizes[source]

Bases: object

KiB = 1024
MiB = 1048576
GiB = 1073741824
TiB = 1099511627776
data_juicer.utils.file_utils.byte_size_to_size_str(byte_size: int)[source]
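A minimal sketch of how such a conversion can be built from the Sizes constants above. The exact output format of the library's function is an assumption here; this is an illustration, not the library's implementation:

```python
# Illustrative sketch only: convert a byte count to a human-readable
# string using the binary units from Sizes (KiB = 1024, MiB = 1024**2, ...).
# The real function's exact formatting may differ.
KiB, MiB, GiB, TiB = 1024, 1024**2, 1024**3, 1024**4

def byte_size_to_size_str(byte_size: int) -> str:
    # Pick the largest unit that fits, falling back to plain bytes.
    for unit, name in ((TiB, "TiB"), (GiB, "GiB"), (MiB, "MiB"), (KiB, "KiB")):
        if byte_size >= unit:
            return f"{byte_size / unit:.2f} {name}"
    return f"{byte_size} B"

print(byte_size_to_size_str(3 * MiB))  # 3.00 MiB
```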
async data_juicer.utils.file_utils.follow_read(logfile_path: str, skip_existing_content: bool = False) AsyncGenerator[source]

Read a file in an online, iterative manner, yielding new lines as they are appended.

Parameters:
  • logfile_path (str) -- The file path to be read.

  • skip_existing_content (bool, defaults to False) -- If True, start reading from the end of the file; otherwise read from the beginning.

Returns:

One line of the file content as a string.
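The tail-following behavior described above can be sketched as a plain async generator. This is an illustrative re-implementation under assumptions (polling interval, text mode), not the library's code:

```python
import asyncio
import os

async def follow_read(logfile_path: str, skip_existing_content: bool = False):
    # Illustrative sketch: yield lines as they are appended, like `tail -f`.
    with open(logfile_path, "r") as f:
        if skip_existing_content:
            f.seek(0, os.SEEK_END)  # start at the end, ignoring existing lines
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                await asyncio.sleep(0.1)  # wait for new content to arrive
```

Because the generator never returns on its own, callers typically break out of the `async for` loop when done.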

data_juicer.utils.file_utils.find_files_with_suffix(path: str | Path, suffixes: str | List[str] | None = None) Dict[str, List[str]][source]

Traverse a path to find all files with the specified suffixes.

Parameters:
  • path (str/Path) -- source path

  • suffixes -- specified file suffixes, e.g. '.txt' or ['.txt', '.md']

Returns:

A dict mapping each suffix to the list of matching file paths.
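The suffix-grouped return shape can be sketched as follows. This is an illustrative version, not the library's implementation; whether the real function recurses or sorts its output is an assumption:

```python
from collections import defaultdict
from pathlib import Path

def find_files_with_suffix(path, suffixes=None):
    # Illustrative sketch: group files under `path` by their suffix,
    # returning a dict like {'.txt': [...], '.md': [...]}.
    if isinstance(suffixes, str):
        suffixes = [suffixes]
    path = Path(path)
    found = defaultdict(list)
    # A single file is its own candidate; a directory is walked recursively.
    candidates = [path] if path.is_file() else path.rglob("*")
    for p in candidates:
        if p.is_file() and (suffixes is None or p.suffix in suffixes):
            found[p.suffix].append(str(p))
    return dict(found)
```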

data_juicer.utils.file_utils.is_remote_path(path: str)[source]

Check if the path is a remote path.

data_juicer.utils.file_utils.is_absolute_path(path: str | Path) bool[source]

Check whether the input path is an absolute path.

Parameters:

path -- input path

Returns:

True if the input path is absolute, False if it is relative.
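Both checks can be sketched in a few lines. The remote-scheme heuristic below is an assumption; the library may recognize a different or fixed set of schemes:

```python
import re
from pathlib import Path

def is_remote_path(path: str) -> bool:
    # Assumed heuristic: a URL-style scheme prefix (s3://, oss://,
    # http://, ...) marks the path as remote.
    return bool(re.match(r"^[A-Za-z][A-Za-z0-9+.-]*://", path))

def is_absolute_path(path) -> bool:
    # Delegate to pathlib's platform-aware check.
    return Path(path).is_absolute()

print(is_remote_path("s3://bucket/data.jsonl"))  # True
print(is_absolute_path("./data.jsonl"))          # False
```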

data_juicer.utils.file_utils.add_suffix_to_filename(filename, suffix)[source]

Add a suffix to the filename. Only the content after the last dot is regarded as the file extension. E.g.
  1. abc.jpg + "_resized" --> abc_resized.jpg
  2. edf.xyz.csv + "_processed" --> edf.xyz_processed.csv
  3. /path/to/file.json + "_suf" --> /path/to/file_suf.json
  4. ds.tar.gz + "_whoops" --> ds.tar_whoops.gz (maybe unexpected)

Parameters:
  • filename -- input filename

  • suffix -- suffix string to be added
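The last-dot rule in the examples above maps directly onto os.path.splitext; a minimal sketch (illustrative, not the library's code):

```python
import os

def add_suffix_to_filename(filename, suffix):
    # splitext splits at the last dot only, so "ds.tar.gz" becomes
    # "ds.tar_whoops.gz", matching example 4 above.
    name, ext = os.path.splitext(filename)
    return f"{name}{suffix}{ext}"

print(add_suffix_to_filename("edf.xyz.csv", "_processed"))  # edf.xyz_processed.csv
```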

data_juicer.utils.file_utils.create_directory_if_not_exists(directory_path)[source]

Create a directory if it does not exist. This function is process-safe.

Parameters:

directory_path -- directory path to be created
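Process safety here usually comes down to tolerating concurrent creation; a sketch of one common approach (whether the library does exactly this is an assumption):

```python
import os

def create_directory_if_not_exists(directory_path):
    # exist_ok=True makes the call idempotent, so two processes racing to
    # create the same directory cannot fail with FileExistsError.
    os.makedirs(directory_path, exist_ok=True)
```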

data_juicer.utils.file_utils.transfer_data_dir(original_dir, op_name)[source]

Transfer the original multimodal data dir to a new dir that stores the newly generated multimodal data. The pattern is {original_dir}/__dj__produced_data__/{op_name}.

data_juicer.utils.file_utils.transfer_filename(original_filepath: str | Path, op_name, save_dir: str = None, **op_kwargs)[source]

Map the original_filepath to a new, unique file path by hashing the op's parameters 'op_kwargs', together with the process id and current time, into a 'hash_val'. E.g.

When save_dir is provided as '/save_dir/path/to/data/':
  /path/to/abc.jpg --> /save_dir/path/to/data/abc__dj_hash_#{hash_val}#.jpg

When the environment variable DJ_PRODUCED_DATA_DIR is provided as '/environment/path/to/data/':
  /path/to/abc.jpg --> /environment/path/to/data/{op_name}/abc__dj_hash_#{hash_val}#.jpg

When neither save_dir nor DJ_PRODUCED_DATA_DIR is provided:
  1. abc.jpg --> __dj__produced_data__/{op_name}/abc__dj_hash_#{hash_val}#.jpg

  2. ./abc.jpg --> ./__dj__produced_data__/{op_name}/abc__dj_hash_#{hash_val}#.jpg

  3. /path/to/abc.jpg --> /path/to/__dj__produced_data__/{op_name}/abc__dj_hash_#{hash_val}#.jpg

  4. /path/to/__dj__produced_data__/{op_name}/abc__dj_hash_#{hash_val1}#.jpg --> /path/to/__dj__produced_data__/{op_name}/abc__dj_hash_#{hash_val2}#.jpg

Priority: save_dir > DJ_PRODUCED_DATA_DIR > original data directory (default)
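The mapping rules above can be sketched as follows. The directory priority and the `__dj_hash_#...#` naming are taken from the docstring, but the exact hash inputs and algorithm are assumptions; this is an illustration, not the library's implementation:

```python
import hashlib
import os
import time

def transfer_filename(original_filepath, op_name, save_dir=None, **op_kwargs):
    # Illustrative sketch: hash op_kwargs + pid + current time into hash_val,
    # then choose the output directory per the documented priority.
    original_dir = os.path.dirname(original_filepath)
    base, ext = os.path.splitext(os.path.basename(original_filepath))
    base = base.split("__dj_hash_#")[0]  # drop a previous hash tag (example 4)
    payload = f"{sorted(op_kwargs.items())}{os.getpid()}{time.time()}"
    hash_val = hashlib.md5(payload.encode()).hexdigest()
    if save_dir:
        new_dir = save_dir
    elif os.environ.get("DJ_PRODUCED_DATA_DIR"):
        new_dir = os.path.join(os.environ["DJ_PRODUCED_DATA_DIR"], op_name)
    else:
        new_dir = os.path.join(original_dir, "__dj__produced_data__", op_name)
    return os.path.join(new_dir, f"{base}__dj_hash_#{hash_val}#{ext}")
```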

data_juicer.utils.file_utils.copy_data(from_dir, to_dir, data_path)[source]

Copy data from from_dir/data_path to to_dir/data_path. Returns True on success.

data_juicer.utils.file_utils.expand_outdir_and_mkdir(outdir)[source]
data_juicer.utils.file_utils.single_partition_write_with_filename(df: DataFrame, output_file_dir: str, keep_filename_column: bool = False, output_type: str = 'jsonl') Series[source]

This function processes a DataFrame and writes it to disk.

Parameters:
  • df -- A DataFrame.

  • output_file_dir -- The output file path.

  • keep_filename_column -- Whether to keep or drop the "filename" column, if it exists.

  • output_type -- The type of output file to write (defaults to "jsonl").

Returns:

A Series containing a single element: True if the DataFrame is non-empty, False if it is empty.

data_juicer.utils.file_utils.read_single_partition(files, filetype='jsonl', add_filename=False, input_meta: str | dict = None, columns: List[str] | None = None, **kwargs) DataFrame[source]

This function reads a file with cuDF, sorts the columns of the DataFrame, and adds a "filename" column.

Parameters:
  • files -- The path to the jsonl files to read.

  • filetype -- The type of the input files (defaults to "jsonl").

  • add_filename -- Whether to add a "filename" column to the DataFrame.

  • input_meta -- A dictionary, or a string formatted as a dictionary, that outlines the field names and their respective data types within the JSONL input file.

  • columns -- If not None, only these columns will be read from the file. There is a significant performance gain when specifying columns for Parquet files.

Returns:

A pandas DataFrame.

data_juicer.utils.file_utils.get_all_files_paths_under(root, recurse_subdirectories=True, followlinks=False)[source]

This function returns a list of all the files under a specified directory. Please note that this can be slow for a large number of files.

Parameters:
  • root -- The path to the directory to read.

  • recurse_subdirectories -- Whether to recurse into subdirectories.

  • followlinks -- Whether to follow symbolic links.
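The traversal can be sketched with os.walk. Whether the real function sorts its output or yields lazily is an assumption; this is an illustrative version:

```python
import os

def get_all_files_paths_under(root, recurse_subdirectories=True, followlinks=False):
    # Illustrative sketch: collect file paths under `root`, optionally
    # descending into subdirectories and following symlinks.
    if recurse_subdirectories:
        return sorted(
            os.path.join(dirpath, name)
            for dirpath, _, filenames in os.walk(root, followlinks=followlinks)
            for name in filenames
        )
    return sorted(
        os.path.join(root, name)
        for name in os.listdir(root)
        if os.path.isfile(os.path.join(root, name))
    )
```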

async data_juicer.utils.file_utils.download_file(session: ClientSession, url: str, save_path: str = None, return_content=False, timeout: int = 300, **kwargs)[source]

Download a file from a given URL and save it to a specified directory.

Parameters:
  • session -- The client session used to issue the HTTP request.

  • url -- The URL of the file to download.

  • save_path -- The path where the downloaded file will be saved.

  • return_content -- Whether to return the content of the downloaded file.

  • timeout -- The timeout in seconds for each HTTP request.

  • kwargs -- The keyword arguments to pass to the HTTP request.

Returns:

The response object from the HTTP request.