vLLM Inference Engine

Use vLLM to accelerate autoregressive, LLM-based ASR models. The new engine supports offline batch transcription, SDK-style chunked streaming, and a production WebSocket service with VAD, hotwords, and speaker labels.

Overview

The audio frontend, encoder, adaptor, and optional CTC timestamp decoder still run in PyTorch. The LLM decoder runs in vLLM with prompt-embedding input, PagedAttention, continuous batching, and optional tensor parallelism.

Supported Models

| Model family | vLLM support | Why |
| --- | --- | --- |
| FunASRNano | Yes | Audio encoder + adaptor + Qwen3-0.6B LLM. |
| LLMASR / LLMASRNAR | Yes | Whisper-style audio encoder followed by Qwen, Vicuna, or LLaMA decoding. |
| GLMASR | Yes | GLM-ASR-Nano uses autoregressive LLM decoding. |
| QwenAudioWarp | Yes | LLM-based audio generation path. |
| Paraformer, SenseVoice, Conformer, Transformer | No | These are non-LLM or encoder-decoder/CTC models; use the standard AutoModel. |

Three Entry Points

| Mode | Entry point | Best for |
| --- | --- | --- |
| Offline batch | AutoModelVLLM or FunASRNanoVLLM | Large file sets and throughput-oriented transcription. |
| Streaming SDK | FunASRNanoStreamingVLLM | Applications that want chunk-level incremental text in Python. |
| WebSocket service | serve_realtime_ws.py | Production real-time clients with VAD segmentation and speaker labels. |

Installation

pip install "funasr>=1.3.0"
pip install "vllm>=0.12.0"
pip install safetensors tiktoken websockets regex

# Development install if you are using the source tree
cd /path/to/FunASR
pip install -e .

| Resource | Minimum | Recommended |
| --- | --- | --- |
| GPU memory | 8 GB | 16 GB or more for comfortable KV cache space. |
| CUDA | 11.8 | 12.x |
| GPUs | 1 | 2 or more when using tensor parallelism. |

On first use, FunASR extracts the LLM weights from model.pt into a vLLM-compatible directory such as Qwen3-0.6B-vllm. Later starts reuse the prepared weights.

Offline SDK Inference

Recommended Generic API

from funasr.auto.auto_model_vllm import AutoModelVLLM

model = AutoModelVLLM(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    hub="ms",                    # or "hf"
    tensor_parallel_size=2,
    gpu_memory_utilization=0.8,
)

results = model.generate(
    ["audio1.wav", "audio2.wav"],
    language="中文",
    hotwords=["张三", "北京"],
)
for item in results:
    print(f"[{item['key']}] {item['text']}")
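The loop above only prints each result. As a minimal sketch, assuming each item is a plain JSON-serializable dict with at least the key and text fields shown above, the results can be saved as JSONL. The helper name write_jsonl and the sample values are illustrative and not part of FunASR.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(results, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with Path(path).open("w", encoding="utf-8") as f:
        for item in results:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


# Stand-in for the list returned by model.generate(); values are illustrative.
sample_results = [
    {"key": "audio1", "text": "你好"},
    {"key": "audio2", "text": "hello"},
]

# A temporary directory keeps the sketch side-effect free.
out_path = Path(tempfile.mkdtemp()) / "results.jsonl"
write_jsonl(sample_results, out_path)
print(out_path.read_text(encoding="utf-8"))
```

Passing ensure_ascii=False keeps Chinese text human-readable in the file instead of escaping it as \uXXXX sequences.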

Fun-ASR-Nano Direct API

from funasr.models.fun_asr_nano.inference_vllm import FunASRNanoVLLM

engine = FunASRNanoVLLM.from_pretrained(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    tensor_parallel_size=4,
)

results = engine.generate(
    inputs="wav.scp",
    language="中文",
    hotwords=["开放时间"],
    max_new_tokens=512,
)
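The wav.scp input above is a text index of audio files. The sketch below assumes the Kaldi-style convention of one "utterance_id path" pair per line; the helper build_wav_scp and the example paths are hypothetical, so check the layout your FunASR version expects.

```python
import tempfile
from pathlib import Path


def build_wav_scp(audio_paths, scp_path):
    """Write 'utt_id path' lines, using each file stem as the utterance id."""
    lines, seen = [], set()
    for p in map(Path, audio_paths):
        if p.stem in seen:
            # Ids must be unique, otherwise results cannot be told apart.
            raise ValueError(f"duplicate utterance id: {p.stem}")
        seen.add(p.stem)
        lines.append(f"{p.stem} {p.as_posix()}")
    Path(scp_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Illustrative paths only; nothing is read from them here.
scp_file = Path(tempfile.mkdtemp()) / "wav.scp"
entries = build_wav_scp(["/data/audio1.wav", "/data/audio2.wav"], scp_file)
print(scp_file.read_text(encoding="utf-8"))
```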

Command Line

cd examples/industrial_data_pretraining/fun_asr_nano

# Single file
python demo_vllm.py --input audio.wav --language 中文

# Batch + multi-GPU tensor parallel
python demo_vllm.py --input wav.scp --tensor-parallel-size 4 --batch-size 32

# Hotwords + JSONL output
python demo_vllm.py --input audio.wav --hotwords 张三 北京 --output results.jsonl

Streaming SDK

FunASRNanoStreamingVLLM slices audio into 720 ms chunks, re-encodes the cumulative audio at each step, batches the chunk prompts into vLLM, and returns the text split into a fixed part and an unfixed tail that later chunks may still revise. It is useful when a Python application wants progressive subtitles without running a service.
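To make "re-encodes cumulative audio" concrete, the sketch below lists the prefix lengths that would be encoded for a clip, assuming 16 kHz mono input and that the final, shorter chunk is also decoded. The function name is hypothetical; this is an illustration of the idea, not the library's code.

```python
SAMPLE_RATE = 16_000  # assumption: 16 kHz mono input


def cumulative_prefix_lengths(num_samples, chunk_ms=720, sample_rate=SAMPLE_RATE):
    """Return the sample count of each cumulative prefix, one per chunk step."""
    chunk = sample_rate * chunk_ms // 1000  # 11_520 samples at 16 kHz
    ends = list(range(chunk, num_samples, chunk))
    ends.append(num_samples)  # the last prefix covers the whole clip
    return ends


# A 2 s clip yields three steps; each step re-encodes audio[:end].
print(cumulative_prefix_lengths(2 * SAMPLE_RATE))  # [11520, 23040, 32000]
```

Because every step sees all audio so far, encoder cost grows with clip length, which is one reason chunk prompts are batched into vLLM.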

from funasr.models.fun_asr_nano.inference_vllm_streaming import FunASRNanoStreamingVLLM

engine = FunASRNanoStreamingVLLM.from_pretrained(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    chunk_ms=720,
    rollback_chars=8,
)

for result in engine.streaming_generate("audio.wav", language="中文"):
    if result["is_final"]:
        print(f"Final: {result['text']}")
    else:
        print(f"[{result['audio_duration_ms']:.0f} ms] fixed: {result['fixed_text']}")

| Behavior | Details |
| --- | --- |
| Stage 1 | The first 10 chunks are decoded without prev_text to find a stable prefix. |
| Stage 2 | Remaining chunks use the stable prefix as assistant context. |
| Rollback | The last rollback_chars characters stay unfixed until later chunks. |
| Short audio | The first 1.5 to 3 seconds may be empty or unstable; this is expected for the model. |
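The rollback rule can be sketched as a simple string split. The helpers below are illustrative only: split_fixed holds back the last rollback_chars characters, and common_prefix is one plausible way to find a prefix that two successive hypotheses agree on. The source does not say how the library computes its stable prefix, so treat that part as an assumption.

```python
def split_fixed(text, rollback_chars=8):
    """Split a hypothesis into (fixed, unfixed) by holding back the tail."""
    if rollback_chars <= 0:
        return text, ""
    cut = max(len(text) - rollback_chars, 0)
    return text[:cut], text[cut:]


def common_prefix(a, b):
    """Longest shared prefix of two hypotheses (assumed stability criterion)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


print(split_fixed("abcdefghijkl"))                   # ('abcd', 'efghijkl')
print(common_prefix("hello world", "hello there"))   # 'hello '
```

A larger rollback_chars delays when text becomes fixed but lowers the chance of locking in a character that a later chunk would have corrected.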

WebSocket Service

The real-time service combines streaming VAD, vLLM segment decoding, partial previews, hallucination cleanup, hotwords, language hints, and speaker diarization.

Start the Service

cd examples/industrial_data_pretraining/fun_asr_nano

# Single GPU
CUDA_VISIBLE_DEVICES=0 python serve_realtime_ws.py --port 10095 --language 中文

# Multi-GPU tensor parallel
CUDA_VISIBLE_DEVICES=0,1 python serve_realtime_ws.py \
    --port 10095 \
    --tensor-parallel-size 2 \
    --language 中文

# Full parameter example
python serve_realtime_ws.py \
    --port 10095 \
    --model FunAudioLLM/Fun-ASR-Nano-2512 \
    --hub ms \
    --device cuda:0 \
    --decode-interval 0.48 \
    --hotword-file 热词列表 \
    --language 中文 \
    --dtype bf16 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 2048

Clients

| Client | Usage |
| --- | --- |
| Browser | Open client_mic.html for microphone, file upload, hotwords, and speaker labels. |
| Python CLI | python client_python.py --server ws://localhost:10095 --mic |
| Test script | python client_test.py --server ws://localhost:10095 --file audio.wav |

For a remote GPU server, forward the port first: ssh -L 10095:localhost:10095 <server>.

Protocol

Client -> Server:
  "START"                 initialize a session
  "HOTWORDS:word1,word2"  set hotwords, optional
  "LANGUAGE:中文"          set language, optional
  [binary bytes]          PCM16 16 kHz mono audio
  "STOP"                  finalize the session

Server -> Client:
  {"event": "started"}
  {"event": "hotwords_set", "hotwords": ["word1", "word2"]}
  {"event": "language_set", "language": "中文"}
  {"sentences": [...], "partial": "...", "is_final": false}
  {"sentences": [...], "partial": "", "is_final": true}
  {"event": "stopped"}
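A minimal client for the message flow above might look like the sketch below. It uses the websockets package from the installation step, sends the control messages in the order listed, and returns the first message with is_final set. The 100 ms frame size, the function names, and the assumption that the server sends text JSON frames are illustrative choices, not taken from the bundled clients.

```python
import asyncio
import json


def pcm16_frames(pcm: bytes, chunk_ms: int = 100, sample_rate: int = 16_000):
    """Yield fixed-duration slices of raw PCM16 mono audio (2 bytes per sample)."""
    step = sample_rate * chunk_ms // 1000 * 2
    for i in range(0, len(pcm), step):
        yield pcm[i:i + step]


async def transcribe(uri: str, pcm: bytes, hotwords=None, language=None):
    """Run one session and return the final result message, or None."""
    import websockets  # third-party: pip install websockets

    async with websockets.connect(uri) as ws:

        async def send_audio():
            await ws.send("START")
            if hotwords:
                await ws.send("HOTWORDS:" + ",".join(hotwords))
            if language:
                await ws.send("LANGUAGE:" + language)
            for frame in pcm16_frames(pcm):
                await ws.send(frame)  # bytes are sent as a binary frame
            await ws.send("STOP")

        # Send and receive concurrently so server messages never back up.
        sender = asyncio.create_task(send_audio())
        final = None
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("is_final"):
                final = msg
                break
        await sender
        return final
```

Usage, assuming a service on the default port: asyncio.run(transcribe("ws://localhost:10095", pcm_bytes, language="中文")). This sketch pushes audio as fast as the socket allows; a live microphone client would send frames in real time instead.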

Inference Logic

| Path | Trigger | Output |
| --- | --- | --- |
| Confirmed segment | Dynamic VAD detects an endpoint. | The full segment is decoded and locked in sentences. |
| Partial preview | --decode-interval, default 0.48 s, and enough new audio. | Temporary partial text that may be overwritten. |
| Finalization | STOP. | Flush remaining VAD audio, force-end active speech, run final speaker re-clustering, return is_final: true. |
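On the client side, the three paths reduce to a small state update: locked sentences accumulate, the partial is overwritten, and is_final ends the session. The sketch below assumes each result message carries the full sentences list rather than a delta, and it treats the sentence items as opaque because their structure is not spelled out here. The Transcript class is hypothetical.

```python
import json


class Transcript:
    """Client-side view of one session: locked sentences plus a revisable partial."""

    def __init__(self):
        self.sentences = []
        self.partial = ""
        self.done = False

    def apply(self, raw):
        """Fold one server message into the current state."""
        msg = json.loads(raw)
        if "event" in msg:
            return  # control acknowledgements carry no transcript text
        # Assumption: the server resends the whole locked list every time.
        self.sentences = msg.get("sentences", self.sentences)
        self.partial = msg.get("partial", "")  # previews replace each other
        self.done = bool(msg.get("is_final", False))


state = Transcript()
state.apply('{"event": "started"}')
state.apply('{"sentences": ["first"], "partial": "sec", "is_final": false}')
state.apply('{"sentences": ["first", "second"], "partial": "", "is_final": true}')
print(state.sentences, repr(state.partial), state.done)
```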

Dynamic VAD

Dynamic VAD adjusts the silence threshold from the current speech duration: short utterances wait longer before cutting, while long utterances are split faster to protect ASR quality.

Streaming Wrapper Schedule

| Accumulated speech | Silence threshold | Effect |
| --- | --- | --- |
| ≤ 5 s | 2.0 s | Avoid cutting short turns too early. |
| 5-10 s | 1.5 s | Normal conversational segmentation. |
| 10-15 s | 1.0 s | Start tightening long turns. |
| 15-30 s | 0.8 s | Faster cuts. |
| 30-45 s | 0.4 s | Prevent very long ASR segments. |
| > 45 s | 0.1 s | Force splitting. |
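The schedule above amounts to a band lookup followed by a comparison against the trailing silence. The sketch below encodes the table with upper-inclusive bands, which is an assumption about how boundary values such as exactly 10 s are handled; the function names are illustrative.

```python
from bisect import bisect_left

# Upper bounds of accumulated speech (s) and the silence threshold (s) per band.
LIMITS_S = [5, 10, 15, 30, 45]
THRESHOLDS_S = [2.0, 1.5, 1.0, 0.8, 0.4, 0.1]


def silence_threshold_s(speech_s):
    """Silence needed to close a segment after speech_s seconds of speech."""
    return THRESHOLDS_S[bisect_left(LIMITS_S, speech_s)]


def is_endpoint(speech_s, trailing_silence_s):
    """True when the pause is long enough to cut the current segment."""
    return trailing_silence_s >= silence_threshold_s(speech_s)


print(silence_threshold_s(3))    # 2.0
print(silence_threshold_s(12))   # 1.0
print(is_endpoint(3, 1.0))       # False: a short turn waits longer
print(is_endpoint(50, 0.2))      # True: a long turn is split quickly
```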

Native fsmn-vad Schedule

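| Accumulated speech | Silence threshold |
| --- | --- |
| ≤ 5 s | 800 ms |
| 5-10 s | 600 ms |
| 10-20 s | 500 ms |
| 20-30 s | 400 ms |
| > 30 s | 300 ms |

The native schedule is enabled by default and can be disabled or overridden per call. In the calls below, model is assumed to be a FunASR AutoModel that runs fsmn-vad; its construction is not shown in this section.

# Default: dynamic_silence=True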
model.generate(input="audio.wav")

# Disable dynamic silence thresholds
model.generate(input="audio.wav", dynamic_silence=False)

# Custom schedule: (duration_limit_ms, silence_threshold_ms)
model.generate(input="audio.wav", silence_schedule=[
    (5000, 1000),
    (15000, 500),
    (float("inf"), 200),
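])

For chunk-by-chunk input, wrap the VAD model in DynamicStreamingVAD:

from funasr import AutoModel
from funasr.models.fsmn_vad_streaming.dynamic_vad import DynamicStreamingVAD

vad_model = AutoModel(model="fsmn-vad", device="cuda:0")
vad = DynamicStreamingVAD(vad_model)

# audio_stream is a placeholder for an iterable of audio chunks supplied by the application.
for chunk in audio_stream: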
    segments = vad.feed(chunk)
    for start_ms, end_ms in segments:
        print(f"Speech: {start_ms}-{end_ms} ms")

final_segments = vad.finalize()
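The (duration_limit_ms, silence_threshold_ms) pairs accepted by silence_schedule can be read as "use this threshold while accumulated speech is at or below this limit". The sketch below expresses the native defaults from the table in that format and looks up a threshold. The validation rule and both function names are assumptions added for illustration, not FunASR behavior.

```python
import math

# Native defaults from the table, as (duration_limit_ms, silence_threshold_ms).
NATIVE_SCHEDULE = [
    (5000, 800),
    (10000, 600),
    (20000, 500),
    (30000, 400),
    (math.inf, 300),
]


def validate_schedule(schedule):
    """Require ascending limits and a final catch-all entry (our own check)."""
    limits = [limit for limit, _ in schedule]
    if not limits or limits != sorted(limits) or not math.isinf(limits[-1]):
        raise ValueError("limits must ascend and end with float('inf')")


def threshold_ms(speech_ms, schedule=NATIVE_SCHEDULE):
    """Return the threshold of the first entry whose limit covers speech_ms."""
    for limit, threshold in schedule:
        if speech_ms <= limit:
            return threshold
    raise ValueError("schedule has no entry covering this duration")


custom = [(5000, 1000), (15000, 500), (float("inf"), 200)]
validate_schedule(custom)
print(threshold_ms(12000))          # 500 with the native defaults
print(threshold_ms(8000, custom))   # 500 with the custom schedule
```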

Performance

| Scenario | PyTorch baseline | vLLM | Speedup |
| --- | --- | --- | --- |
| Offline, 5.6 s audio | 0.89 s | 0.30 s | 3x |
| Offline, 2-GPU tensor parallel | 0.89 s | ~0.20 s | 4.5x |
| Batch 16 files | ~16x serial cost | ~4x | 4x |
| Batch 32 files | ~32x serial cost | ~5x | 6x |
| WebSocket RTF | 0.156 | 0.078 | 2x |
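The speedup column follows from dividing the baseline by the vLLM figure, and a real-time factor (RTF) is conventionally processing time divided by audio duration. The short check below reproduces the rounded values from the table; the table does not state how its figures were measured, so the RTF definition is the standard one and not confirmed by the source.

```python
def rtf(processing_s, audio_s):
    """Real-time factor: values below 1 mean faster than real time."""
    return processing_s / audio_s


def speedup(baseline, optimized):
    """How many times faster the optimized figure is than the baseline."""
    return baseline / optimized


print(round(speedup(0.89, 0.30), 2))    # 2.97, reported as 3x
print(round(speedup(0.89, 0.20), 2))    # 4.45, reported as 4.5x
print(round(speedup(0.156, 0.078), 2))  # 2.0
print(round(rtf(0.30, 5.6), 3))         # 0.054 for the offline single-file case
```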

API Reference

AutoModelVLLM

| Parameter | Default | Description |
| --- | --- | --- |
| model | - | ModelScope/Hugging Face id or local model directory. |
| hub | "ms" | "ms", "modelscope", "hf", or "huggingface". |
| device | "cuda:0" | PyTorch audio encoder and adaptor device. |
| dtype | "bf16" | "bf16", "fp16", or "fp32". |
| tensor_parallel_size | 1 | Number of GPUs used by vLLM tensor parallelism. |
| gpu_memory_utilization | 0.8 | Fraction of each GPU's memory that vLLM may use for model weights, activations, and the KV cache. |
| max_model_len | 4096 | Maximum vLLM sequence length. |
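When tuning gpu_memory_utilization and max_model_len, a rough worst-case KV cache estimate helps. The sketch below uses the standard formula of two tensors (keys and values) per layer per token. The decoder shape passed in (28 layers, 8 KV heads, head dimension 128) is an assumption about Qwen3-0.6B and should be checked against the model's config.json; PagedAttention allocates blocks on demand, so real usage is usually lower.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of keys plus values stored for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(max_model_len, concurrent_seqs, per_token_bytes):
    """Worst case: every sequence grows to max_model_len tokens."""
    return max_model_len * concurrent_seqs * per_token_bytes / 2**30


# Assumed Qwen3-0.6B decoder shape with bf16 (2 bytes per value).
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=8, head_dim=128)
print(per_token)                          # 114688 bytes, i.e. 112 KiB
print(kv_cache_gib(4096, 16, per_token))  # 7.0
print(kv_cache_gib(2048, 16, per_token))  # 3.5
```

Halving max_model_len halves the worst-case figure, which is why lowering it is one of the first things to try when memory is tight.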

generate()

| Parameter | Default | Description |
| --- | --- | --- |
| inputs | - | Audio path, path list, numpy array, tensor, wav.scp, or JSONL. |
| language | None | Language hint, for example "中文", "English", or "日本語". |
| hotwords | None | List of hotwords to include in the ASR prompt. |
| itn | True | Apply inverse text normalization. |
| max_new_tokens | 512 | Maximum generated tokens per sample. |
| temperature | 0.0 | Greedy decoding by default. |
| repetition_penalty | 1.0 | Penalty used by vLLM generation. |

Return format: [{"key": str, "text": str, "timestamps": [...]}]. Timestamps are emitted when the model includes the optional CTC decoder and tokenizer.

FAQ

Q: Why is the first startup slow?
vLLM initializes the KV cache and CUDA graphs, and FunASR may extract LLM weights from model.pt. This can take about 60-90 seconds on the first run.
Q: What should I do for CUDA OOM?
Lower gpu_memory_utilization to leave headroom for the PyTorch encoder on the same GPU, lower max_model_len, or increase tensor_parallel_size to shard the LLM across more GPUs.
Q: Can Paraformer use vLLM?
No. Paraformer is non-autoregressive and does not benefit from vLLM KV-cache decoding.
Q: WebSocket service or streaming_generate()?
Use the WebSocket service for production real-time ASR with VAD endpoints. Use streaming_generate() for SDK integration or chunk-level demos.
Q: The browser cannot access my microphone on a remote server.
Chrome requires HTTPS or localhost for microphone access. Use ssh -L 10095:localhost:10095 <server> and open the client from localhost.