data_juicer.utils.compress module
- class data_juicer.utils.compress.FileLock(lock_file: str | PathLike[str], timeout: float = -1, mode: int = 420, thread_local: bool = True, *, blocking: bool = True, is_singleton: bool = False, **kwargs: Any)[source]
Bases:
FileLock
File lock for compression or decompression. The lock file is removed automatically.
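The idea of a lock file that cleans up after itself can be sketched with the standard library alone. This is not the implementation of the class above: `self_cleaning_lock` and `demo_lock` are hypothetical names, and the sketch fails immediately when the lock is already held instead of waiting with a timeout.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def self_cleaning_lock(lock_file: str):
    """Hold an exclusive lock file and delete it on exit (hypothetical helper)."""
    # O_EXCL makes creation fail with FileExistsError if the lock file exists,
    # so only one holder can create it at a time.
    fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield lock_file
    finally:
        # Close first, then remove, so the lock file never outlives the block.
        os.close(fd)
        os.remove(lock_file)


def demo_lock():
    """Return (lock existed while held, lock is gone after release)."""
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, "cache.lock")
        with self_cleaning_lock(lock_path):
            held = os.path.exists(lock_path)
        released = not os.path.exists(lock_path)
        return held, released
```

Wrapping a compression or decompression step in such a block would keep two processes from touching the same file at once, under the assumption that both agree on the lock file path.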
- class data_juicer.utils.compress.Extractor[source]
Bases:
Extractor
Extract content from a compressed file.
- classmethod extract(input_path: Path | str, output_path: Path | str, extractor_format: str)[source]
Extract content from a compressed file.
- Parameters:
input_path -- path to the compressed file.
output_path -- path to the uncompressed file.
extractor_format -- extraction format; see the supported algorithms in the Extractor of Hugging Face datasets.
- class data_juicer.utils.compress.ZstdCompressor[source]
This class compresses a file using the zstd algorithm.
- class data_juicer.utils.compress.Lz4Compressor[source]
This class compresses a file using the lz4 algorithm.
- class data_juicer.utils.compress.GzipCompressor[source]
This class compresses a file using the gzip algorithm.
- class data_juicer.utils.compress.Compressor[source]
Bases:
object
This class contains multiple compressors.
- compressors: Dict[str, Type[BaseCompressor]] = {'gzip': <class 'data_juicer.utils.compress.GzipCompressor'>, 'lz4': <class 'data_juicer.utils.compress.Lz4Compressor'>, 'zstd': <class 'data_juicer.utils.compress.ZstdCompressor'>}
- classmethod compress(input_path: Path | str, output_path: Path | str, compressor_format: str)[source]
Compress the input file and save the result to the output file.
- Parameters:
input_path -- path to the uncompressed file.
output_path -- path to the compressed file.
compressor_format -- compression format; see the supported algorithms in compressors.
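The name-to-compressor lookup described above can be illustrated with a small standalone sketch. It is an assumption-laden stand-in, not the module's code: `OPENERS`, `compress_file`, and `demo_roundtrip` are hypothetical names, and only standard-library codecs (gzip, bz2) are registered because lz4 and zstd require third-party packages.

```python
import bz2
import gzip
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Dict, Union

# Hypothetical registry: format name -> function that opens a compressed
# file for writing. The real module maps names to compressor classes.
OPENERS: Dict[str, Callable] = {
    "gzip": gzip.open,
    "bz2": bz2.open,
}


def compress_file(input_path: Union[Path, str],
                  output_path: Union[Path, str],
                  compressor_format: str) -> None:
    """Compress input_path into output_path using the named format."""
    if compressor_format not in OPENERS:
        raise ValueError(f"unsupported format: {compressor_format!r}")
    opener = OPENERS[compressor_format]
    # Stream the bytes so large files are not loaded into memory at once.
    with open(input_path, "rb") as src, opener(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def demo_roundtrip() -> bool:
    """Compress a temporary file with gzip and check it decompresses back."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "data.txt"
        packed = Path(tmp) / "data.txt.gz"
        payload = b"hello data-juicer\n" * 100
        raw.write_bytes(payload)
        compress_file(raw, packed, "gzip")
        with gzip.open(packed, "rb") as f:
            return f.read() == payload
```

Keeping the supported formats in one dictionary means adding a format is a one-line change, and an unknown name fails early with a clear error.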
- class data_juicer.utils.compress.CompressManager(compressor_format: str = 'zstd')[source]
Bases:
object
This class is used to compress or decompress an input file using the given compression format.
- __init__(compressor_format: str = 'zstd')[source]
Initialization method.
- Parameters:
compressor_format -- compression format to use; defaults to zstd.
- class data_juicer.utils.compress.CacheCompressManager(compressor_format: str = 'zstd')[source]
Bases:
object
This class is used to compress or decompress Hugging Face cache files using the given compression format.
- __init__(compressor_format: str = 'zstd')[source]
Initialization method.
- Parameters:
compressor_format -- compression format to use; defaults to zstd.
- compress(prev_ds: Dataset, this_ds: Dataset = None, num_proc: int = 1)[source]
Compress cache files with fingerprint in dataset cache directory.
- Parameters:
prev_ds -- previous dataset whose cache files need to be compressed here.
this_ds -- current dataset computed from the previous dataset. Their cache files may overlap, so cache files that the current dataset will use again must not be compressed. If None, all cache files of the previous dataset are compressed.
num_proc -- number of processes to compress cache files.
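The overlap rule for prev_ds and this_ds amounts to a set difference over cache file names. The sketch below shows only that selection step; `files_to_compress` is a hypothetical helper operating on plain file-name lists, not a function of this module.

```python
from typing import Iterable, List, Optional


def files_to_compress(prev_files: Iterable[str],
                      this_files: Optional[Iterable[str]] = None) -> List[str]:
    """Return cache files of the previous dataset that are safe to compress.

    Hypothetical helper: files shared with the current dataset are kept
    uncompressed because the current dataset still reads them.
    """
    still_needed = set(this_files) if this_files is not None else set()
    # Preserve the original order while dropping files that are still in use.
    return [f for f in prev_files if f not in still_needed]


prev = ["cache-aaa.arrow", "cache-bbb.arrow"]
curr = ["cache-bbb.arrow", "cache-ccc.arrow"]
only_prev = files_to_compress(prev, curr)   # ['cache-aaa.arrow']
everything = files_to_compress(prev, None)  # both files of prev
```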
- decompress(ds: Dataset, fingerprints: str | List[str] = None, num_proc: int = 1)[source]
Decompress compressed cache files with fingerprint in dataset cache directory.
- Parameters:
ds -- input dataset.
fingerprints -- fingerprints of cache files; a string or a list of strings is accepted. If None, all cache files whose names start with cache- and end with the compression format are found.
num_proc -- number of processes to decompress cache files.
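The file-name matching described for fingerprints can be sketched as follows. This is an illustration under stated assumptions: `find_compressed_cache_files` and `demo_find` are hypothetical names, and matching a fingerprint by checking that it occurs in the file name is an assumption, not something the description above confirms.

```python
import tempfile
from pathlib import Path
from typing import List, Optional, Union


def find_compressed_cache_files(
        cache_dir: Union[Path, str],
        compressor_format: str,
        fingerprints: Optional[Union[str, List[str]]] = None) -> List[str]:
    """List names that start with 'cache-' and end with the format name."""
    if isinstance(fingerprints, str):
        fingerprints = [fingerprints]  # accept a single string as well
    matches = []
    for path in sorted(Path(cache_dir).iterdir()):
        name = path.name
        if not (name.startswith("cache-") and name.endswith(compressor_format)):
            continue
        # Assumption: a fingerprint matches when it appears in the file name.
        if fingerprints is not None and not any(fp in name for fp in fingerprints):
            continue
        matches.append(name)
    return matches


def demo_find():
    """Build a throwaway directory and run the lookup with and without a filter."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ["cache-abc123.arrow.zstd", "cache-def456.arrow.zstd",
                     "cache-abc123.arrow", "other.zstd"]:
            (Path(tmp) / name).touch()
        return (find_compressed_cache_files(tmp, "zstd"),
                find_compressed_cache_files(tmp, "zstd", "abc123"))
```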