vLLM Inference Engine

Use vLLM to accelerate autoregressive, LLM-based ASR models. The new engine supports offline batch transcription, SDK-style chunked streaming, and a production WebSocket service with VAD, hotwords, and speaker labels.

Overview

The audio frontend, encoder, adaptor, and optional CTC timestamp decoder still run in PyTorch. The LLM decoder runs in vLLM with prompt-embedding input, PagedAttention, continuous batching, and optional tensor parallelism.

Supported Models

Model family	vLLM support	Why
FunASRNano	Yes	Audio encoder + adaptor + Qwen3-0.6B LLM.
LLMASR / LLMASRNAR	Yes	Whisper-style audio encoder followed by Qwen, Vicuna, or LLaMA decoding.
GLMASR	Yes	GLM-ASR-Nano uses autoregressive LLM decoding.
QwenAudioWarp	Yes	LLM-based audio generation path.
Paraformer, SenseVoice, Conformer, Transformer	No	These are non-LLM or encoder-decoder/CTC models; use the standard `AutoModel`.

Three Entry Points

Mode	Entry point	Best for
Offline batch	`AutoModelVLLM` or `FunASRNanoVLLM`	Large file sets and throughput-oriented transcription.
Streaming SDK	`FunASRNanoStreamingVLLM`	Applications that want chunk-level incremental text in Python.
WebSocket service	`serve_realtime_ws.py`	Production real-time clients with VAD segmentation and speaker labels.

Installation

pip install "funasr>=1.3.0"
pip install "vllm>=0.12.0"
pip install safetensors tiktoken websockets regex

# Development install if you are using the source tree
cd /path/to/FunASR
pip install -e .

Resource	Minimum	Recommended
GPU memory	8 GB	16 GB or more for comfortable KV cache space.
CUDA	11.8	12.x
GPUs	1	2 or more when using tensor parallelism.

On first use, FunASR extracts the LLM weights from model.pt into a vLLM-compatible directory such as Qwen3-0.6B-vllm. Later starts reuse the prepared weights.

Offline SDK Inference

Recommended Generic API

from funasr.auto.auto_model_vllm import AutoModelVLLM

model = AutoModelVLLM(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    hub="ms",                    # or "hf"
    tensor_parallel_size=2,
    gpu_memory_utilization=0.8,
)

results = model.generate(
    ["audio1.wav", "audio2.wav"],
    language="中文",
    hotwords=["张三", "北京"],
)
for item in results:
    print(f"[{item['key']}] {item['text']}")

Fun-ASR-Nano Direct API

from funasr.models.fun_asr_nano.inference_vllm import FunASRNanoVLLM

engine = FunASRNanoVLLM.from_pretrained(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    tensor_parallel_size=4,
)

results = engine.generate(
    inputs="wav.scp",
    language="中文",
    hotwords=["开放时间"],
    max_new_tokens=512,
)
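
The `inputs="wav.scp"` form refers to a Kaldi-style list file in which each non-empty line holds an utterance key followed by an audio path. The sketch below shows one way to sanity-check such a file before handing it to the engine. `parse_wav_scp` is a hypothetical helper written for this page, not a FunASR function, and it ignores Kaldi's piped-command entries.

```python
def parse_wav_scp(lines):
    """Parse Kaldi-style wav.scp lines into {key: audio_path}.

    Hypothetical helper for illustration; FunASR reads wav.scp on its own.
    """
    entries = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected '<key> <path>'")
        key, path = parts
        if key in entries:
            raise ValueError(f"line {line_no}: duplicate key {key!r}")
        entries[key] = path
    return entries
```

A file object can be passed directly, for example `with open("wav.scp", encoding="utf-8") as f: entries = parse_wav_scp(f)`.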

Command Line

cd examples/industrial_data_pretraining/fun_asr_nano

# Single file
python demo_vllm.py --input audio.wav --language 中文

# Batch + multi-GPU tensor parallel
python demo_vllm.py --input wav.scp --tensor-parallel-size 4 --batch-size 32

# Hotwords + JSONL output
python demo_vllm.py --input audio.wav --hotwords 张三 北京 --output results.jsonl

Streaming SDK

FunASRNanoStreamingVLLM slices audio into 720 ms chunks, re-encodes the cumulative audio at each step, batches the chunk prompts into vLLM, and returns text split into a fixed part (stable) and an unfixed part (still subject to revision). It is useful when a Python application wants progressive subtitles without running a separate service.

from funasr.models.fun_asr_nano.inference_vllm_streaming import FunASRNanoStreamingVLLM

engine = FunASRNanoStreamingVLLM.from_pretrained(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    chunk_ms=720,
    rollback_chars=8,
)

for result in engine.streaming_generate("audio.wav", language="中文"):
    if result["is_final"]:
        print(f"Final: {result['text']}")
    else:
        print(f"[{result['audio_duration_ms']:.0f} ms] fixed: {result['fixed_text']}")

Behavior	Details
Stage 1	The first 10 chunks are decoded without `prev_text` to find a stable prefix.
Stage 2	Remaining chunks use the stable prefix as assistant context.
Rollback	The last `rollback_chars` characters stay unfixed until later chunks.
Short audio	Output for the first 1.5 to 3 seconds may be empty or unstable; this is expected model behavior.
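
The chunking and rollback behavior described above can be sketched in plain Python. This is an illustration under stated assumptions, not the engine's implementation: a 16 kHz sample rate is assumed, the helper names are hypothetical, and the page does not say how Stage 1 derives its stable prefix, so the longest common prefix of successive hypotheses is used here as one plausible reading.

```python
import os


def cumulative_chunk_ends(num_samples, chunk_ms=720, sample_rate=16000):
    """End indices of the growing audio windows, one per decode step.

    Step k would see audio[:ends[k]], i.e. the cumulative audio so far.
    """
    chunk = sample_rate * chunk_ms // 1000  # 11520 samples at 16 kHz
    ends = list(range(chunk, num_samples, chunk))
    if num_samples > 0:
        ends.append(num_samples)  # last window covers the remaining tail
    return ends


def split_fixed(text, rollback_chars=8):
    """Split a hypothesis into (fixed, unfixed) using a character rollback."""
    if rollback_chars <= 0:
        return text, ""
    cut = max(len(text) - rollback_chars, 0)
    return text[:cut], text[cut:]


def stable_prefix(hypotheses):
    """Longest common character prefix of several hypotheses (assumed rule)."""
    return os.path.commonprefix(list(hypotheses))
```

For example, `split_fixed("hello world, this is")` keeps `"hello world,"` as fixed text and leaves the last 8 characters open for revision by later chunks.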

WebSocket Service

The real-time service combines streaming VAD, vLLM segment decoding, partial previews, hallucination cleanup, hotwords, language hints, and speaker diarization.

Start the Service

cd examples/industrial_data_pretraining/fun_asr_nano

# Single GPU
CUDA_VISIBLE_DEVICES=0 python serve_realtime_ws.py --port 10095 --language 中文

# Multi-GPU tensor parallel
CUDA_VISIBLE_DEVICES=0,1 python serve_realtime_ws.py \
    --port 10095 \
    --tensor-parallel-size 2 \
    --language 中文

# Full parameter example
python serve_realtime_ws.py \
    --port 10095 \
    --model FunAudioLLM/Fun-ASR-Nano-2512 \
    --hub ms \
    --device cuda:0 \
    --decode-interval 0.48 \
    --hotword-file 热词列表 \
    --language 中文 \
    --dtype bf16 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 2048

Clients

Client	Usage
Browser	Open `client_mic.html` for microphone, file upload, hotwords, and speaker labels.
Python CLI	`python client_python.py --server ws://localhost:10095 --mic`
Test script	`python client_test.py --server ws://localhost:10095 --file audio.wav`

For a remote GPU server, forward the port first: ssh -L 10095:localhost:10095 <server>.

Protocol

Client -> Server:
  "START"                 initialize a session
  "HOTWORDS:word1,word2"  set hotwords, optional
  "LANGUAGE:中文"          set language, optional
  [binary bytes]          PCM16 16 kHz mono audio
  "STOP"                  finalize the session

Server -> Client:
  {"event": "started"}
  {"event": "hotwords_set", "hotwords": ["word1", "word2"]}
  {"event": "language_set", "language": "中文"}
  {"sentences": [...], "partial": "...", "is_final": false}
  {"sentences": [...], "partial": "", "is_final": true}
  {"event": "stopped"}
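
The client side of the protocol above can be sketched without any network code. The helpers below build the control strings and binary audio frames in the order listed. Their names are hypothetical, little-endian byte order is an assumption (the protocol only states PCM16), and the 3200-byte frame size (100 ms at 16 kHz mono) is an arbitrary choice for illustration.

```python
import struct


def hotwords_message(words):
    """Build the optional "HOTWORDS:..." control string."""
    return "HOTWORDS:" + ",".join(words)


def language_message(language):
    """Build the optional "LANGUAGE:..." control string."""
    return "LANGUAGE:" + language


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit PCM bytes (little-endian assumed)."""
    ints = [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def session_messages(pcm_bytes, hotwords=None, language=None, frame_bytes=3200):
    """Yield the client messages for one session in protocol order."""
    yield "START"
    if hotwords:
        yield hotwords_message(hotwords)
    if language:
        yield language_message(language)
    for i in range(0, len(pcm_bytes), frame_bytes):
        yield pcm_bytes[i:i + frame_bytes]  # binary audio frame
    yield "STOP"
```

With the `websockets` package from the installation step, each yielded `str` would go out as a text frame and each `bytes` object as a binary frame through `await ws.send(message)`.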

Inference Logic

Path	Trigger	Output
Confirmed segment	Dynamic VAD detects an endpoint.	The full segment is decoded and locked in `sentences`.
Partial preview	Every `--decode-interval` seconds (default 0.48 s), provided enough new audio has arrived.	Temporary `partial` text that may be overwritten.
Finalization	`STOP`.	Flush remaining VAD audio, force-end active speech, run final speaker re-clustering, return `is_final: true`.
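
On the receiving side, a client has to tell event acknowledgements from transcription results and treat `partial` as overwritable. A minimal sketch follows, with hypothetical names. It assumes each result message carries the full list of confirmed sentences so far, which this page does not state, and it treats the items in `sentences` as opaque.

```python
import json


def classify_server_message(raw):
    """Return (kind, payload) for one JSON text frame from the server."""
    msg = json.loads(raw)
    if "event" in msg:
        return "event", msg
    if msg.get("is_final"):
        return "final", msg
    return "update", msg


class TranscriptView:
    """Tracks confirmed sentences and the overwritable partial preview."""

    def __init__(self):
        self.sentences = []
        self.partial = ""
        self.done = False

    def apply(self, raw):
        kind, msg = classify_server_message(raw)
        if kind == "event":
            return kind  # acknowledgements carry no transcript text
        # Assumption: the message holds all confirmed sentences so far,
        # so the list is replaced rather than appended to.
        self.sentences = msg.get("sentences", [])
        self.partial = msg.get("partial", "")  # preview text is overwritten
        self.done = kind == "final"
        return kind
```

If the service turns out to send only newly confirmed sentences per message, the replacement in `apply` would need to become an append.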

Dynamic VAD

Dynamic VAD adjusts the silence threshold from the current speech duration: short utterances wait longer before cutting, while long utterances are split faster to protect ASR quality.
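
The idea can be sketched as a small endpointing loop over per-frame speech decisions. This is an illustration only, not the fsmn-vad implementation: `segment` and `threshold_for` are hypothetical names, frame-level speech flags are assumed as input, and "current speech duration" is taken to mean the time from segment start to the last speech frame.

```python
def segment(frames, threshold_for, frame_ms=10):
    """Split per-frame speech flags into (start_ms, end_ms) segments.

    A segment is closed once trailing silence reaches threshold_for(d),
    where d is the segment's current speech duration in milliseconds.
    """
    segments = []
    start = None          # start time of the open segment, if any
    last_speech_end = 0   # end time of the most recent speech frame
    for i, is_speech in enumerate(frames):
        t = i * frame_ms
        if is_speech:
            if start is None:
                start = t
            last_speech_end = t + frame_ms
        elif start is not None:
            silence = t + frame_ms - last_speech_end
            if silence >= threshold_for(last_speech_end - start):
                segments.append((start, last_speech_end))
                start = None
    if start is not None:  # flush speech still open at end of input
        segments.append((start, last_speech_end))
    return segments


# Illustrative two-tier rule: wait 1000 ms after short speech, 500 ms after long.
def two_tier(speech_ms):
    return 1000 if speech_ms <= 5000 else 500
```

With 100 ms frames, a 600 ms pause inside a 2 s utterance does not trigger a cut under `two_tier`, while the same pause after 6 s of speech does, which is the short-versus-long behavior described above.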

Streaming Wrapper Schedule

Accumulated speech	Silence threshold	Effect
≤ 5 s	2.0 s	Avoid cutting short turns too early.
5-10 s	1.5 s	Normal conversational segmentation.
10-15 s	1.0 s	Start tightening long turns.
15-30 s	0.8 s	Faster cuts.
30-45 s	0.4 s	Prevent very long ASR segments.
> 45 s	0.1 s	Force splitting.

Native fsmn-vad Schedule

Accumulated speech	Silence threshold
≤ 5 s	800 ms
5-10 s	600 ms
10-20 s	500 ms
20-30 s	400 ms
> 30 s	300 ms

# Default: dynamic_silence=True
model.generate(input="audio.wav")

# Disable dynamic silence thresholds
model.generate(input="audio.wav", dynamic_silence=False)

# Custom schedule: (duration_limit_ms, silence_threshold_ms)
model.generate(input="audio.wav", silence_schedule=[
    (5000, 1000),
    (15000, 500),
    (float("inf"), 200),
])
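
To make the `(duration_limit_ms, silence_threshold_ms)` format concrete, the two tables above can be written in that form and queried with a small lookup. This is a sketch under assumptions: the function names are hypothetical, each limit is treated as inclusive, and the sanity check reflects one reasonable convention rather than a documented FunASR requirement.

```python
import math

# The two schedules from the tables above, converted to milliseconds.
STREAMING_WRAPPER = [(5000, 2000), (10000, 1500), (15000, 1000),
                     (30000, 800), (45000, 400), (math.inf, 100)]
NATIVE_FSMN_VAD = [(5000, 800), (10000, 600), (20000, 500),
                   (30000, 400), (math.inf, 300)]


def check_schedule(schedule):
    """Raise ValueError unless limits strictly increase and end at infinity."""
    limits = [limit for limit, _ in schedule]
    if not limits or limits[-1] != math.inf:
        raise ValueError("last duration limit should be infinity")
    if any(a >= b for a, b in zip(limits, limits[1:])):
        raise ValueError("duration limits must be strictly increasing")


def lookup(schedule, speech_ms):
    """Return the threshold of the first tier whose limit covers speech_ms."""
    for limit_ms, silence_ms in schedule:
        if speech_ms <= limit_ms:
            return silence_ms
    raise ValueError("schedule has no tier covering this duration")
```

For instance, `lookup(NATIVE_FSMN_VAD, 12000)` falls in the 10-20 s tier and gives 500 ms, while the same duration under `STREAMING_WRAPPER` gives 1000 ms.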

from funasr import AutoModel
from funasr.models.fsmn_vad_streaming.dynamic_vad import DynamicStreamingVAD

vad_model = AutoModel(model="fsmn-vad", device="cuda:0")
vad = DynamicStreamingVAD(vad_model)

# audio_stream: any iterable of audio chunks supplied by the caller (not defined here)
for chunk in audio_stream:
    segments = vad.feed(chunk)
    for start_ms, end_ms in segments:
        print(f"Speech: {start_ms}-{end_ms} ms")

final_segments = vad.finalize()

Performance

Scenario	PyTorch baseline	vLLM	Speedup
Offline, 5.6 s audio	0.89 s	0.30 s	3x
Offline, 2-GPU tensor parallel	0.89 s	~0.20 s	4.5x
Batch of 16 files	~16x single-file time (serial)	~4x single-file time	4x
Batch of 32 files	~32x single-file time (serial)	~5x single-file time	6x
WebSocket RTF	0.156	0.078	2x
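
The speedup column is the ratio of the two timings, and the real-time factor (RTF) is processing time divided by audio duration, so lower is better. The arithmetic can be checked as follows. The function names are hypothetical, and the offline RTF values in the comments are derived here from the table rather than reported measurements.

```python
def speedup(baseline_s, optimized_s):
    """How many times faster the optimized run is than the baseline."""
    return baseline_s / optimized_s


def real_time_factor(processing_s, audio_s):
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_s / audio_s


print(f"{speedup(0.89, 0.30):.2f}")          # about 2.97, reported as 3x
print(f"{speedup(0.89, 0.20):.2f}")          # about 4.45, reported as 4.5x
print(f"{real_time_factor(0.89, 5.6):.3f}")  # about 0.159 for the PyTorch baseline
print(f"{real_time_factor(0.30, 5.6):.3f}")  # about 0.054 for vLLM
```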

API Reference

AutoModelVLLM

Parameter	Default	Description
`model`	-	ModelScope/Hugging Face id or local model directory.
`hub`	`"ms"`	`"ms"`, `"modelscope"`, `"hf"`, or `"huggingface"`.
`device`	`"cuda:0"`	PyTorch audio encoder and adaptor device.
`dtype`	`"bf16"`	`"bf16"`, `"fp16"`, or `"fp32"`.
`tensor_parallel_size`	`1`	Number of GPUs used by vLLM tensor parallelism.
`gpu_memory_utilization`	`0.8`	Fraction of GPU memory that vLLM may use for model weights, activations, and the KV cache.
`max_model_len`	`4096`	Maximum vLLM sequence length.

`generate()`

Parameter	Default	Description
`inputs`	-	Audio path, path list, numpy array, tensor, `wav.scp`, or JSONL.
`language`	`None`	Language hint, for example `"中文"`, `"English"`, or `"日本語"`.
`hotwords`	`None`	List of hotwords to include in the ASR prompt.
`itn`	`True`	Apply inverse text normalization.
`max_new_tokens`	`512`	Maximum generated tokens per sample.
`temperature`	`0.0`	Greedy decoding by default.
`repetition_penalty`	`1.0`	Penalty used by vLLM generation.

Return format: [{"key": str, "text": str, "timestamps": [...]}]. Timestamps are emitted when the model includes the optional CTC decoder and tokenizer.
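
Because each result is a plain dictionary, the list can be stored as JSON Lines with the standard library. This is a minimal sketch with a hypothetical helper name. Whether `demo_vllm.py --output` writes exactly this layout is not stated here.

```python
import json


def results_to_jsonl(results):
    """Serialize generate()-style results to JSON Lines text, one item per line."""
    # ensure_ascii=False keeps non-ASCII transcripts readable in the file.
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in results)
```

The text can then be written with `open("results.jsonl", "w", encoding="utf-8")`, and each line can be read back with `json.loads`.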

FAQ

Q: Why is the first startup slow?
A: vLLM initializes the KV cache and CUDA graphs, and FunASR may extract LLM weights from model.pt. This can take about 60-90 seconds on the first run.

Q: What should I do for CUDA OOM?
A: Lower gpu_memory_utilization, lower max_model_len, or increase tensor_parallel_size.

Q: Can Paraformer use vLLM?
A: No. Paraformer is non-autoregressive and does not benefit from vLLM KV-cache decoding.

Q: WebSocket service or streaming_generate()?
A: Use the WebSocket service for production real-time ASR with VAD endpoints. Use streaming_generate() for SDK integration or chunk-level demos.

Q: The browser cannot access my microphone on a remote server.
A: Chrome requires HTTPS or localhost for microphone access. Use ssh -L 10095:localhost:10095 <server> and open the client from localhost.