data_juicer.format.text_formatter module

data_juicer.format.text_formatter.extract_txt_from_docx(fn, tgt_path)

Extract text from a docx file and save to target path.

Parameters:
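As an illustration of what extracting text from a docx file and saving it involves, here is a minimal standard-library sketch. It is not the data_juicer implementation: the function name `docx_to_txt_sketch` is hypothetical, and it assumes that `tgt_path` is a directory and that the output file reuses the input file's stem with a `.txt` extension. Neither detail is stated on this page.

```python
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_txt_sketch(fn, tgt_path):
    """Hypothetical stand-in: write the paragraph text of `fn` under `tgt_path`."""
    # A .docx file is a zip archive; the main body is word/document.xml.
    with zipfile.ZipFile(fn) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    # Each <w:p> element is a paragraph; its visible text lives in <w:t> runs.
    paragraphs = [
        "".join(node.text or "" for node in para.iter(W_NS + "t"))
        for para in root.iter(W_NS + "p")
    ]

    # Assumption: tgt_path is a directory and the output keeps the input stem.
    out_file = Path(tgt_path) / (Path(fn).stem + ".txt")
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text("\n".join(paragraphs), encoding="utf-8")
    return out_file
```

Calling `docx_to_txt_sketch("report.docx", "out_dir")` would write `out_dir/report.txt` under these assumptions. Headers, footers, and footnotes live in other parts of the archive and are ignored here.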

data_juicer.format.text_formatter.extract_txt_from_pdf(fn, tgt_path)

Extract text from a pdf file and save to target path.

Parameters:

class data_juicer.format.text_formatter.TextFormatter(dataset_path, suffixes=None, add_suffix=False, **kwargs)

Loads and formats text-type files, e.g. files with the suffixes ['.txt', '.pdf', '.cpp', '.docx'].

SUFFIXES = ['.docx', '.pdf', '.txt', '.md', '.tex', '.asm', '.bat', '.cmd', '.c', '.h', '.cs', '.cpp', '.hpp', '.c++', '.h++', '.cc', '.hh', '.C', '.H', '.cmake', '.css', '.dockerfile', '.f90', '.f', '.f03', '.f08', '.f77', '.f95', '.for', '.fpp', '.go', '.hs', '.html', '.java', '.js', '.jl', '.lua', '.markdown', '.php', '.php3', '.php4', '.php5', '.phps', '.phpt', '.pl', '.pm', '.pod', '.perl', '.ps1', '.psd1', '.psm1', '.py', '.rb', '.rs', '.sql', '.scala', '.sh', '.bash', '.command', '.zsh', '.ts', '.tsx', '.vb', 'Dockerfile', 'Makefile', '.xml', '.rst', '.m', '.smali']
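The list mixes dotted extensions (such as '.py') with bare file names ('Dockerfile', 'Makefile'), and it contains case-distinct entries ('.c' and '.C'). The following sketch shows one way a list like this could be used to select files. The helper `find_files_by_suffix` is hypothetical and the matching rules are assumptions; this page does not state how the library itself matches files.

```python
from pathlib import Path


def find_files_by_suffix(root, suffixes):
    """Hypothetical helper: group files under `root` by the entry they match."""
    matched = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        for suffix in suffixes:
            # Assumption: dotted entries ('.py') are compared with the file
            # extension, bare entries ('Makefile') with the whole file name.
            # The comparison is case-sensitive, so '.c' and '.C' stay distinct.
            key = path.suffix if suffix.startswith(".") else path.name
            if key == suffix:
                matched.setdefault(suffix, []).append(str(path))
                break
    return matched
```

For example, `find_files_by_suffix("my_corpus", [".py", ".md", "Makefile"])` returns a dictionary that maps each matched entry to the list of file paths found for it.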

__init__(dataset_path, suffixes=None, add_suffix=False, **kwargs)

Initialization method.

Parameters:

load_dataset(num_proc: int = 1, global_cfg=None) → Dataset

Load a dataset from local text-type files.

Parameters:

Returns:

unified_format_dataset.
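To make the idea of loading local text files into a uniform record layout concrete, here is a small standard-library sketch. It does not reproduce `load_dataset`: the function name `load_text_records` and the single-field layout with the key `text` are assumptions made for illustration, since this page does not describe the unified format.

```python
from pathlib import Path


def load_text_records(paths, text_key="text"):
    """Hypothetical sketch: read each file into one record with a text field."""
    records = []
    for path in paths:
        # Undecodable bytes are dropped so that one bad file does not abort
        # the whole load; a real loader may choose a different policy.
        content = Path(path).read_text(encoding="utf-8", errors="ignore")
        records.append({text_key: content})
    return records
```

Each input file yields exactly one record, in the order the paths were given.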