trinity.common.models.vllm_patch.api_patch module#

Patch for the vLLM OpenAI API server.

  1. Mocks the add_signal_handler method to do nothing.

  2. Adds token_ids and prompt_token_ids to the ChatCompletionResponse (see the sketch after this list).
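
A minimal client-side sketch of what the patch adds to a chat completion response; the URL, model name, and token id values below are placeholders, and the field names follow the patched models documented on this page:

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "my-model",
        "messages": [{"role": "user", "content": "Hello"}],
    },
).json()

# Fields added by the patch (not present in the stock vLLM response):
prompt_token_ids = resp["prompt_token_ids"]
completion_token_ids = resp["choices"][0]["token_ids"]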

class trinity.common.models.vllm_patch.api_patch.PatchedChatCompletionResponseChoice(*, index: int, message: ~vllm.entrypoints.openai.protocol.ChatMessage, logprobs: ~vllm.entrypoints.openai.protocol.ChatCompletionLogProbs | None = None, finish_reason: str | None = 'stop', stop_reason: int | str | None = None, token_ids: list[int] = <factory>, **extra_data: ~typing.Any)[source]#

Bases: ChatCompletionResponseChoice

token_ids: list[int]#
model_config: ClassVar[ConfigDict] = {'extra': 'allow'}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class trinity.common.models.vllm_patch.api_patch.PatchedChatCompletionResponse(*, id: str = <factory>, object: ~typing.Literal['chat.completion'] = 'chat.completion', created: int = <factory>, model: str, choices: list[~trinity.common.models.vllm_patch.api_patch.PatchedChatCompletionResponseChoice] = <factory>, service_tier: ~typing.Literal['auto', 'default', 'flex', 'scale', 'priority'] | None = None, system_fingerprint: str | None = None, usage: ~vllm.entrypoints.openai.protocol.UsageInfo, prompt_logprobs: list[dict[int, ~vllm.logprobs.Logprob] | None] | None = None, prompt_token_ids: list[int] = <factory>, kv_transfer_params: dict[str, ~typing.Any] | None = None, **extra_data: ~typing.Any)[source]#

Bases: ChatCompletionResponse

prompt_token_ids: list[int]#
choices: list[PatchedChatCompletionResponseChoice]#
model_config: ClassVar[ConfigDict] = {'extra': 'allow'}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
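
A construction sketch showing the two patched models together, assuming vLLM's ChatMessage and UsageInfo protocol classes are importable; all values are illustrative:

from vllm.entrypoints.openai.protocol import ChatMessage, UsageInfo
from trinity.common.models.vllm_patch.api_patch import (
    PatchedChatCompletionResponse,
    PatchedChatCompletionResponseChoice,
)

choice = PatchedChatCompletionResponseChoice(
    index=0,
    message=ChatMessage(role="assistant", content="Hi!"),
    token_ids=[13347, 0],               # completion token ids (example values)
)
response = PatchedChatCompletionResponse(
    model="my-model",
    choices=[choice],
    usage=UsageInfo(prompt_tokens=3, completion_tokens=2, total_tokens=5),
    prompt_token_ids=[9906, 11, 1917],  # prompt token ids (example values)
)
print(response.model_dump_json())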

async trinity.common.models.vllm_patch.api_patch.chat_completion_full_generator(self, request, result_generator, request_id, model_name, conversation, tokenizer, request_metadata) → ErrorResponse | ChatCompletionResponse[source]#
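
How this replacement is installed is not shown on this page; a plausible sketch, assuming it overrides the method of the same name on vLLM's OpenAIServingChat, is:

from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from trinity.common.models.vllm_patch.api_patch import chat_completion_full_generator

# Replace the unbound coroutine method; `self` (the serving instance) is
# forwarded unchanged, so the patched version can reuse its tokenizer and state.
OpenAIServingChat.chat_completion_full_generator = chat_completion_full_generator
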
async trinity.common.models.vllm_patch.api_patch.run_server_in_ray(args, engine_client)[source]#
trinity.common.models.vllm_patch.api_patch.dummy_add_signal_handler(self, *args, **kwargs)[source]#
async trinity.common.models.vllm_patch.api_patch.patch_and_serve_http(app, sock, args)[source]#

Patch the add_signal_handler method and serve the app.
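
The no-op handler exists because asyncio only allows add_signal_handler to be called from the main thread, while the serving loop inside a Ray actor runs elsewhere. A conceptual sketch of the patching step (the real patch targets whichever event loop serves the app, presumably via uvicorn):

import asyncio

from trinity.common.models.vllm_patch.api_patch import dummy_add_signal_handler

loop = asyncio.new_event_loop()
# Bind the no-op onto this loop instance so calls such as
# loop.add_signal_handler(signal.SIGINT, handler) silently do nothing.
loop.add_signal_handler = dummy_add_signal_handler.__get__(loop)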

async trinity.common.models.vllm_patch.api_patch.run_api_server_in_ray_actor(async_llm, host: str, port: int, model_path: str, enable_auto_tool_choice: bool = False, tool_call_parser: str | None = None, reasoning_parser: str | None = None)[source]#
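
A hypothetical usage sketch, assuming async_llm is an already-constructed vLLM async engine and that the surrounding Ray actor owns its lifecycle; the actor name, host, port, and model path are placeholders:

import ray

from trinity.common.models.vllm_patch.api_patch import run_api_server_in_ray_actor


@ray.remote
class OpenAIAPIServer:
    def __init__(self, async_llm):
        self.async_llm = async_llm  # pre-built vLLM async engine (assumption)

    async def serve(self):
        # Blocks for the lifetime of the HTTP server.
        await run_api_server_in_ray_actor(
            self.async_llm,
            host="0.0.0.0",
            port=8000,
            model_path="/path/to/model",
        )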