data_juicer.utils.model_utils module

data_juicer.utils.model_utils.check_model(model_name, force=False)[源代码]

Check whether a model exists in DATA_JUICER_MODELS_CACHE. If exists, return its full path. Else, download it from cached models links.

参数:
  • model_name -- a specified model name

  • force -- Whether to download model forcefully or not, Sometimes the model file maybe incomplete for some reason, so need to download again forcefully.

data_juicer.utils.model_utils.check_model_home(model_name)[源代码]
data_juicer.utils.model_utils.filter_arguments(func, args_dict)[源代码]

Filters and returns only the valid arguments for a given function signature.

参数:
  • func -- The function or callable to inspect.

  • args_dict -- A dictionary of argument names and values to filter.

返回:

A dictionary containing only the arguments that match the function's signature, preserving any **kwargs if applicable.

class data_juicer.utils.model_utils.ChatAPIModel(model=None, endpoint=None, response_path=None, **kwargs)[源代码]

基类:object

__init__(model=None, endpoint=None, response_path=None, **kwargs)[源代码]

Initializes an instance of the APIModel class.

参数:
  • model -- The name of the model to be used for making API calls. This should correspond to a valid model identifier recognized by the API server. If it's None, use the first available model from the server.

  • endpoint -- The URL endpoint for the API. If provided as a relative path, it will be appended to the base URL (defined by the OPENAI_BASE_URL environment variable or through an additional base_url parameter). Defaults to '/chat/completions' for OpenAI compatibility.

  • response_path -- A dot-separated string specifying the path to extract the desired content from the API response. The default value is 'choices.0.message.content', which corresponds to the typical structure of an OpenAI API response.

  • kwargs -- Additional keyword arguments for configuring the internal OpenAI client.

class data_juicer.utils.model_utils.EmbeddingAPIModel(model=None, endpoint=None, response_path=None, **kwargs)[源代码]

基类:object

__init__(model=None, endpoint=None, response_path=None, **kwargs)[源代码]

Initializes an instance specialized for embedding APIs.

参数:
  • model -- The model identifier for embedding API calls. If it's None, use the first available model from the server.

  • endpoint -- API endpoint URL. Defaults to '/embeddings'.

  • response_path -- Path to extract embeddings from response. Defaults to 'data.0.embedding'.

  • kwargs -- Configuration for the OpenAI client.

data_juicer.utils.model_utils.prepare_api_model(model, *, endpoint=None, response_path=None, return_processor=False, processor_config=None, **model_params)[源代码]

Creates a callable API model for interacting with OpenAI-compatible API. The callable supports custom response parsing and works with proxy servers that may be incompatible.

参数:
  • model -- The name of the model to interact with.

  • endpoint -- The URL endpoint for the API. If provided as a relative path, it will be appended to the base URL (defined by the OPENAI_BASE_URL environment variable or through an additional base_url parameter). Supported endpoints include: - '/chat/completions' for chat models - '/embeddings' for embedding models Defaults to /chat/completions for OpenAI compatibility.

  • response_path -- The dot-separated path to extract desired content from the API response. Defaults to 'choices.0.message.content' for chat models and 'data.0.embedding' for embedding models.

  • return_processor -- A boolean flag indicating whether to return a processor along with the model. The processor can be used for tasks like tokenization or encoding. Defaults to False.

  • processor_config -- A dictionary containing configuration parameters for initializing a Hugging Face processor. It is only relevant if return_processor is set to True.

  • model_params -- Additional parameters for configuring the API model.

返回:

A callable APIModel instance, and optionally a processor if return_processor is True.

data_juicer.utils.model_utils.prepare_diffusion_model(pretrained_model_name_or_path, diffusion_type, **model_params)[源代码]

Prepare and load an Diffusion model from HuggingFace.

参数:
  • pretrained_model_name_or_path -- input Diffusion model name or local path to the model

  • diffusion_type -- the use of the diffusion model. It can be 'image2image', 'text2image', 'inpainting'

返回:

a Diffusion model.

data_juicer.utils.model_utils.prepare_fastsam_model(model_path, **model_params)[源代码]
data_juicer.utils.model_utils.prepare_fasttext_model(model_name='lid.176.bin', **model_params)[源代码]

Prepare and load a fasttext model.

参数:

model_name -- input model name

返回:

model instance.

data_juicer.utils.model_utils.prepare_huggingface_model(pretrained_model_name_or_path, *, return_model=True, return_pipe=False, pipe_task='text-generation', **model_params)[源代码]

Prepare and load a huggingface model.

参数:
  • pretrained_model_name_or_path -- model name or path

  • return_model -- return model or not

  • return_pipe -- return pipeline or not

  • pipe_task -- task for pipeline

返回:

a tuple (model, processor) if return_model is True; otherwise, only the processor is returned.

data_juicer.utils.model_utils.prepare_kenlm_model(lang, name_pattern='{}.arpa.bin', **model_params)[源代码]

Prepare and load a kenlm model.

参数:
  • model_name -- input model name in formatting syntax.

  • lang -- language to render model name

返回:

model instance.

data_juicer.utils.model_utils.prepare_nltk_model(lang, name_pattern='punkt.{}.pickle', **model_params)[源代码]

Prepare and load a nltk punkt model with enhanced resource handling.

参数:
  • model_name -- input model name in formatting syntax

  • lang -- language to render model name

返回:

model instance.

data_juicer.utils.model_utils.prepare_nltk_pos_tagger(**model_params)[源代码]
Prepare and load NLTK's part-of-speech tagger with enhanced resource

handling.

返回:

The POS tagger model

data_juicer.utils.model_utils.prepare_opencv_classifier(model_path, **model_params)[源代码]
data_juicer.utils.model_utils.prepare_recognizeAnything_model(pretrained_model_name_or_path='ram_plus_swin_large_14m.pth', input_size=384, **model_params)[源代码]

Prepare and load recognizeAnything model.

参数:
  • model_name -- input model name.

  • input_size -- the input size of the model.

data_juicer.utils.model_utils.prepare_sdxl_prompt2prompt(pretrained_model_name_or_path, pipe_func, torch_dtype='fp32', device='cpu')[源代码]
data_juicer.utils.model_utils.prepare_sentencepiece_model(model_path, **model_params)[源代码]

Prepare and load a sentencepiece model.

参数:

model_path -- input model path

返回:

model instance

data_juicer.utils.model_utils.prepare_sentencepiece_for_lang(lang, name_pattern='{}.sp.model', **model_params)[源代码]

Prepare and load a sentencepiece model for specific language.

参数:
  • lang -- language to render model name

  • name_pattern -- pattern to render the model name

返回:

model instance.

data_juicer.utils.model_utils.prepare_simple_aesthetics_model(pretrained_model_name_or_path, *, return_model=True, **model_params)[源代码]

Prepare and load a simple aesthetics model.

参数:
  • pretrained_model_name_or_path -- model name or path

  • return_model -- return model or not

返回:

a tuple (model, input processor) if return_model is True; otherwise, only the processor is returned.

data_juicer.utils.model_utils.prepare_spacy_model(lang, name_pattern='{}_core_web_md-3.7.0', **model_params)[源代码]

Prepare spacy model for specific language.

参数:

lang -- language of sapcy model. Should be one of ["zh", "en"]

返回:

corresponding spacy model

data_juicer.utils.model_utils.prepare_video_blip_model(pretrained_model_name_or_path, *, return_model=True, **model_params)[源代码]

Prepare and load a video-clip model with the corresponding processor.

参数:
  • pretrained_model_name_or_path -- model name or path

  • return_model -- return model or not

  • trust_remote_code -- passed to transformers

返回:

a tuple (model, input processor) if return_model is True; otherwise, only the processor is returned.

data_juicer.utils.model_utils.prepare_yolo_model(model_path, **model_params)[源代码]
data_juicer.utils.model_utils.prepare_vllm_model(pretrained_model_name_or_path, **model_params)[源代码]

Prepare and load a HuggingFace model with the corresponding processor.

参数:
  • pretrained_model_name_or_path -- model name or path

  • model_params -- LLM initialization parameters.

返回:

a tuple of (model, tokenizer)

data_juicer.utils.model_utils.prepare_embedding_model(model_path, **model_params)[源代码]

Prepare and load an embedding model using transformers.

参数:
  • model_path -- Path to the embedding model.

  • model_params -- Optional model parameters.

返回:

Model with encode() returning embedding list.

data_juicer.utils.model_utils.update_sampling_params(sampling_params, pretrained_model_name_or_path, enable_vllm=False)[源代码]
data_juicer.utils.model_utils.prepare_model(model_type, **model_kwargs)[源代码]
data_juicer.utils.model_utils.get_model(model_key=None, rank=None, use_cuda=False)[源代码]
data_juicer.utils.model_utils.free_models(clear_model_zoo=True)[源代码]