data_juicer.utils.model_utils module¶

data_juicer.utils.model_utils.get_backup_model_link(model_name)[源代码]¶

data_juicer.utils.model_utils.check_model(model_name, force=False)[源代码]¶

Check whether a model exists in DATA_JUICER_MODELS_CACHE. If exists, return its full path. Else, download it from cached models links.

参数:

model_name -- a specified model name
force -- Whether to download model forcefully or not, Sometimes the model file maybe incomplete for some reason, so need to download again forcefully.

data_juicer.utils.model_utils.check_model_home(model_name)[源代码]¶

data_juicer.utils.model_utils.filter_arguments(func, args_dict)[源代码]¶

Filters and returns only the valid arguments for a given function signature.

参数:

func -- The function or callable to inspect.
args_dict -- A dictionary of argument names and values to filter.

返回:

A dictionary containing only the arguments that match the function's signature, preserving any **kwargs if applicable.

class data_juicer.utils.model_utils.ChatAPIModel(model=None, endpoint=None, response_path=None, **kwargs)[源代码]¶

基类：object

__init__(model=None, endpoint=None, response_path=None, **kwargs)[源代码]¶

Initializes an instance of the APIModel class.

参数:

model -- The name of the model to be used for making API calls. This should correspond to a valid model identifier recognized by the API server. If it's None, use the first available model from the server.
endpoint -- The URL endpoint for the API. If provided as a relative path, it will be appended to the base URL (defined by the OPENAI_BASE_URL environment variable or through an additional base_url parameter). Defaults to '/chat/completions' for OpenAI compatibility.
response_path -- A dot-separated string specifying the path to extract the desired content from the API response. The default value is 'choices.0.message.content', which corresponds to the typical structure of an OpenAI API response.
kwargs -- Additional keyword arguments for configuring the internal OpenAI client.

class data_juicer.utils.model_utils.EmbeddingAPIModel(model=None, endpoint=None, response_path=None, **kwargs)[源代码]¶

基类：object

__init__(model=None, endpoint=None, response_path=None, **kwargs)[源代码]¶

Initializes an instance specialized for embedding APIs.

参数:

model -- The model identifier for embedding API calls. If it's None, use the first available model from the server.
endpoint -- API endpoint URL. Defaults to '/embeddings'.
response_path -- Path to extract embeddings from response. Defaults to 'data.0.embedding'.
kwargs -- Configuration for the OpenAI client.

data_juicer.utils.model_utils.prepare_api_model(model, *, endpoint=None, response_path=None, return_processor=False, processor_config=None, **model_params)[源代码]¶

Creates a callable API model for interacting with OpenAI-compatible API. The callable supports custom response parsing and works with proxy servers that may be incompatible.

参数:

model -- The name of the model to interact with.
endpoint -- The URL endpoint for the API. If provided as a relative path, it will be appended to the base URL (defined by the OPENAI_BASE_URL environment variable or through an additional base_url parameter). Supported endpoints include: - '/chat/completions' for chat models - '/embeddings' for embedding models Defaults to /chat/completions for OpenAI compatibility.
response_path -- The dot-separated path to extract desired content from the API response. Defaults to 'choices.0.message.content' for chat models and 'data.0.embedding' for embedding models.
return_processor -- A boolean flag indicating whether to return a processor along with the model. The processor can be used for tasks like tokenization or encoding. Defaults to False.
processor_config -- A dictionary containing configuration parameters for initializing a Hugging Face processor. It is only relevant if return_processor is set to True.
model_params -- Additional parameters for configuring the API model.

返回:

A callable APIModel instance, and optionally a processor if return_processor is True.

data_juicer.utils.model_utils.prepare_diffusion_model(pretrained_model_name_or_path, diffusion_type, **model_params)[源代码]¶

Prepare and load an Diffusion model from HuggingFace.

参数:

pretrained_model_name_or_path -- input Diffusion model name or local path to the model
diffusion_type -- the use of the diffusion model. It can be 'image2image', 'text2image', 'inpainting'

返回:

a Diffusion model.

data_juicer.utils.model_utils.prepare_fastsam_model(model_path, **model_params)[源代码]¶

data_juicer.utils.model_utils.prepare_fasttext_model(model_name='lid.176.bin', **model_params)[源代码]¶

Prepare and load a fasttext model.

参数:: model_name -- input model name
返回:: model instance.

data_juicer.utils.model_utils.prepare_huggingface_model(pretrained_model_name_or_path, *, return_model=True, return_pipe=False, pipe_task='text-generation', **model_params)[源代码]¶

Prepare and load a huggingface model.

参数:

pretrained_model_name_or_path -- model name or path
return_model -- return model or not
return_pipe -- return pipeline or not
pipe_task -- task for pipeline

返回:

a tuple (model, processor) if return_model is True; otherwise, only the processor is returned.

data_juicer.utils.model_utils.prepare_kenlm_model(lang, name_pattern='{}.arpa.bin', **model_params)[源代码]¶

Prepare and load a kenlm model.

参数:

model_name -- input model name in formatting syntax.
lang -- language to render model name

返回:

model instance.

data_juicer.utils.model_utils.prepare_nltk_model(lang, name_pattern='punkt.{}.pickle', **model_params)[源代码]¶

Prepare and load a nltk punkt model with enhanced resource handling.

参数:

model_name -- input model name in formatting syntax
lang -- language to render model name

返回:

model instance.

data_juicer.utils.model_utils.prepare_nltk_pos_tagger(**model_params)[源代码]¶

Prepare and load NLTK's part-of-speech tagger with enhanced resource: handling.

返回:: The POS tagger model

data_juicer.utils.model_utils.prepare_opencv_classifier(model_path, **model_params)[源代码]¶

data_juicer.utils.model_utils.prepare_recognizeAnything_model(pretrained_model_name_or_path='ram_plus_swin_large_14m.pth', input_size=384, **model_params)[源代码]¶

Prepare and load recognizeAnything model.

参数:

model_name -- input model name.
input_size -- the input size of the model.

data_juicer.utils.model_utils.prepare_sdxl_prompt2prompt(pretrained_model_name_or_path, pipe_func, torch_dtype='fp32', device='cpu')[源代码]¶

data_juicer.utils.model_utils.prepare_sentencepiece_model(model_path, **model_params)[源代码]¶

Prepare and load a sentencepiece model.

参数:: model_path -- input model path
返回:: model instance

data_juicer.utils.model_utils.prepare_sentencepiece_for_lang(lang, name_pattern='{}.sp.model', **model_params)[源代码]¶

Prepare and load a sentencepiece model for specific language.

参数:

lang -- language to render model name
name_pattern -- pattern to render the model name

返回:

model instance.

data_juicer.utils.model_utils.prepare_simple_aesthetics_model(pretrained_model_name_or_path, *, return_model=True, **model_params)[源代码]¶

Prepare and load a simple aesthetics model.

参数:

pretrained_model_name_or_path -- model name or path
return_model -- return model or not

返回:

a tuple (model, input processor) if return_model is True; otherwise, only the processor is returned.

data_juicer.utils.model_utils.prepare_spacy_model(lang, name_pattern='{}_core_web_md-3.7.0', **model_params)[源代码]¶

Prepare spacy model for specific language.

参数:: lang -- language of sapcy model. Should be one of ["zh", "en"]
返回:: corresponding spacy model

data_juicer.utils.model_utils.prepare_video_blip_model(pretrained_model_name_or_path, *, return_model=True, **model_params)[源代码]¶

Prepare and load a video-clip model with the corresponding processor.

参数:

pretrained_model_name_or_path -- model name or path
return_model -- return model or not
trust_remote_code -- passed to transformers

返回:

a tuple (model, input processor) if return_model is True; otherwise, only the processor is returned.

data_juicer.utils.model_utils.prepare_yolo_model(model_path, **model_params)[源代码]¶

data_juicer.utils.model_utils.prepare_vllm_model(pretrained_model_name_or_path, **model_params)[源代码]¶

Prepare and load a HuggingFace model with the corresponding processor.

参数:

pretrained_model_name_or_path -- model name or path
model_params -- LLM initialization parameters.

返回:

a tuple of (model, tokenizer)

data_juicer.utils.model_utils.prepare_embedding_model(model_path, **model_params)[源代码]¶

Prepare and load an embedding model using transformers.

参数:

model_path -- Path to the embedding model.
model_params -- Optional model parameters.

返回:

Model with encode() returning embedding list.

data_juicer.utils.model_utils.update_sampling_params(sampling_params, pretrained_model_name_or_path, enable_vllm=False)[源代码]¶

data_juicer.utils.model_utils.prepare_model(model_type, **model_kwargs)[源代码]¶

data_juicer.utils.model_utils.get_model(model_key=None, rank=None, use_cuda=False)[源代码]¶

data_juicer.utils.model_utils.free_models(clear_model_zoo=True)[源代码]¶