vLLM Inference Engine
Use vLLM to accelerate autoregressive, LLM-based ASR models. The new engine supports offline batch transcription, SDK-style chunked streaming, and a production WebSocket service with VAD, hotwords, and speaker labels.
Overview
The audio frontend, encoder, adaptor, and optional CTC timestamp decoder still run in PyTorch. The LLM decoder runs in vLLM with prompt-embedding input, PagedAttention, continuous batching, and optional tensor parallelism.
Supported Models
| Model family | vLLM support | Why |
| FunASRNano | Yes | Audio encoder + adaptor + Qwen3-0.6B LLM. |
| LLMASR / LLMASRNAR | Yes | Whisper-style audio encoder followed by Qwen, Vicuna, or LLaMA decoding. |
| GLMASR | Yes | GLM-ASR-Nano uses autoregressive LLM decoding. |
| QwenAudioWarp | Yes | LLM-based audio generation path. |
| Paraformer, SenseVoice, Conformer, Transformer | No | These are non-LLM or encoder-decoder/CTC models; use the standard AutoModel. |
Three Entry Points
| Mode | Entry point | Best for |
| Offline batch | AutoModelVLLM or FunASRNanoVLLM | Large file sets and throughput-oriented transcription. |
| Streaming SDK | FunASRNanoStreamingVLLM | Applications that want chunk-level incremental text in Python. |
| WebSocket service | serve_realtime_ws.py | Production real-time clients with VAD segmentation and speaker labels. |
Installation
pip install "funasr>=1.3.0"
pip install "vllm>=0.12.0"
pip install safetensors tiktoken websockets regex
# Development install if you are using the source tree
cd /path/to/FunASR
pip install -e .
| Resource | Minimum | Recommended |
| GPU memory | 8 GB | 16 GB or more for comfortable KV cache space. |
| CUDA | 11.8 | 12.x |
| GPUs | 1 | 2 or more when using tensor parallelism. |
On first use, FunASR extracts the LLM weights from model.pt into a vLLM-compatible directory such as Qwen3-0.6B-vllm. Later starts reuse the prepared weights.
Offline SDK Inference
Recommended Generic API
from funasr.auto.auto_model_vllm import AutoModelVLLM

model = AutoModelVLLM(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    hub="ms",  # or "hf"
    tensor_parallel_size=2,
    gpu_memory_utilization=0.8,
)
results = model.generate(
    ["audio1.wav", "audio2.wav"],
    language="中文",
    hotwords=["张三", "北京"],
)
for item in results:
    print(f"[{item['key']}] {item['text']}")
Fun-ASR-Nano Direct API
from funasr.models.fun_asr_nano.inference_vllm import FunASRNanoVLLM

engine = FunASRNanoVLLM.from_pretrained(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    tensor_parallel_size=4,
)
results = engine.generate(
    inputs="wav.scp",
    language="中文",
    hotwords=["开放时间"],
    max_new_tokens=512,
)
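The inputs="wav.scp" argument above points at a list file. A minimal sketch of preparing one, assuming the Kaldi-style layout of one "<key> <audio path>" pair per line (the exact layout FunASR expects is not stated here). build_wav_scp and parse_wav_scp are hypothetical helpers for illustration, not FunASR functions:

```python
from pathlib import Path


def build_wav_scp(audio_dir: str, pattern: str = "*.wav") -> str:
    """Return wav.scp text with one '<key> <absolute path>' line per file.

    Assumption: the key is the file name without its extension.
    """
    lines = []
    for path in sorted(Path(audio_dir).glob(pattern)):
        lines.append(f"{path.stem} {path.resolve()}")
    return "\n".join(lines) + ("\n" if lines else "")


def parse_wav_scp(text: str) -> list[tuple[str, str]]:
    """Split wav.scp text into (key, path) pairs, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace only, so paths may contain spaces.
        key, path = line.split(maxsplit=1)
        pairs.append((key, path))
    return pairs
```

Writing the returned text to a file named wav.scp gives a list that can be checked with parse_wav_scp before a long batch run.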
Command Line
cd examples/industrial_data_pretraining/fun_asr_nano
# Single file
python demo_vllm.py --input audio.wav --language 中文
# Batch + multi-GPU tensor parallel
python demo_vllm.py --input wav.scp --tensor-parallel-size 4 --batch-size 32
# Hotwords + JSONL output
python demo_vllm.py --input audio.wav --hotwords 张三 北京 --output results.jsonl
Streaming SDK
FunASRNanoStreamingVLLM slices audio into 720 ms chunks, re-encodes the audio accumulated so far at each step, batches the chunk prompts into vLLM, and returns text split into a fixed part (considered stable) and an unfixed part (still open to revision). It is useful when a Python application wants progressive subtitles without running a service.
from funasr.models.fun_asr_nano.inference_vllm_streaming import FunASRNanoStreamingVLLM

engine = FunASRNanoStreamingVLLM.from_pretrained(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    chunk_ms=720,
    rollback_chars=8,
)
for result in engine.streaming_generate("audio.wav", language="中文"):
    if result["is_final"]:
        print(f"Final: {result['text']}")
    else:
        print(f"[{result['audio_duration_ms']:.0f} ms] fixed: {result['fixed_text']}")
| Behavior | Details |
| Stage 1 | The first 10 chunks are decoded without prev_text to find a stable prefix. |
| Stage 2 | Remaining chunks use the stable prefix as assistant context. |
| Rollback | The last rollback_chars characters stay unfixed until later chunks. |
| Short audio | The first 1.5 to 3 seconds may be empty or unstable; this is expected for the model. |
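To make the chunking and the Rollback row above concrete, here is a small sketch. It is not FunASR's implementation: chunk_end_samples and split_fixed_unfixed are hypothetical helpers, and the 16 kHz sample rate is an assumption.

```python
def chunk_end_samples(num_samples: int, sample_rate: int = 16000,
                      chunk_ms: int = 720) -> list[int]:
    """End index of each cumulative window: step k re-encodes audio[:ends[k]]."""
    if num_samples <= 0:
        return []
    step = sample_rate * chunk_ms // 1000  # 11520 samples at 16 kHz and 720 ms
    ends = list(range(step, num_samples, step))
    ends.append(num_samples)  # the last window always covers the whole audio
    return ends


def split_fixed_unfixed(text: str, rollback_chars: int = 8,
                        is_final: bool = False) -> tuple[str, str]:
    """Keep the trailing rollback_chars characters unfixed until the final result."""
    if is_final or rollback_chars <= 0:
        return text, ""
    cut = max(len(text) - rollback_chars, 0)
    return text[:cut], text[cut:]
```

A larger rollback_chars keeps more trailing text revisable, which trades display latency for fewer visible corrections.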
WebSocket Service
The real-time service combines streaming VAD, vLLM segment decoding, partial previews, hallucination cleanup, hotwords, language hints, and speaker diarization.
Start the Service
cd examples/industrial_data_pretraining/fun_asr_nano
# Single GPU
CUDA_VISIBLE_DEVICES=0 python serve_realtime_ws.py --port 10095 --language 中文
# Multi-GPU tensor parallel
CUDA_VISIBLE_DEVICES=0,1 python serve_realtime_ws.py \
--port 10095 \
--tensor-parallel-size 2 \
--language 中文
# Full parameter example
python serve_realtime_ws.py \
--port 10095 \
--model FunAudioLLM/Fun-ASR-Nano-2512 \
--hub ms \
--device cuda:0 \
--decode-interval 0.48 \
--hotword-file 热词列表 \
--language 中文 \
--dtype bf16 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.8 \
--max-model-len 2048
Clients
| Client | Usage |
| Browser | Open client_mic.html for microphone, file upload, hotwords, and speaker labels. |
| Python CLI | python client_python.py --server ws://localhost:10095 --mic |
| Test script | python client_test.py --server ws://localhost:10095 --file audio.wav |
For a remote GPU server, forward the port first: ssh -L 10095:localhost:10095 <server>.
Protocol
Client -> Server:
"START" initialize a session
"HOTWORDS:word1,word2" set hotwords, optional
"LANGUAGE:中文" set language, optional
[binary bytes] PCM16 16 kHz mono audio
"STOP" finalize the session
Server -> Client:
{"event": "started"}
{"event": "hotwords_set", "hotwords": ["word1", "word2"]}
{"event": "language_set", "language": "中文"}
{"sentences": [...], "partial": "...", "is_final": false}
{"sentences": [...], "partial": "", "is_final": true}
{"event": "stopped"}
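The message flow above can be sketched on the client side without any network code. This is an illustration under stated assumptions, not the bundled client: client_frames, Transcript, and apply_server_message are hypothetical names, the 3200-byte packet size (100 ms of PCM16 at 16 kHz) is an assumption, and each result message is assumed to carry the complete sentences list so far, so it replaces rather than extends the previous one.

```python
import json
from dataclasses import dataclass, field


def client_frames(pcm: bytes, hotwords=None, language=None, chunk_bytes: int = 3200):
    """Yield frames in the order listed above: str = text frame, bytes = audio frame."""
    yield "START"
    if hotwords:
        yield "HOTWORDS:" + ",".join(hotwords)
    if language:
        yield f"LANGUAGE:{language}"
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]  # PCM16, 16 kHz, mono
    yield "STOP"


@dataclass
class Transcript:
    sentences: list = field(default_factory=list)  # confirmed segments
    partial: str = ""                              # preview, may be overwritten
    is_final: bool = False
    stopped: bool = False


def apply_server_message(raw: str, state: Transcript) -> Transcript:
    """Update the local transcript from one JSON text frame sent by the server."""
    msg = json.loads(raw)
    if "event" in msg:
        # started / hotwords_set / language_set are acknowledgements only.
        if msg["event"] == "stopped":
            state.stopped = True
        return state
    state.sentences = msg.get("sentences", [])
    state.partial = msg.get("partial", "")
    state.is_final = bool(msg.get("is_final", False))
    return state
```

With a WebSocket library of your choice, each str frame is sent as a text message and each bytes frame as a binary message, while incoming text messages are passed to apply_server_message.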
Inference Logic
| Path | Trigger | Output |
| Confirmed segment | Dynamic VAD detects an endpoint. | The full segment is decoded and locked in sentences. |
| Partial preview | Every --decode-interval seconds (default 0.48 s), provided enough new audio has arrived. | Temporary partial text that may be overwritten. |
| Finalization | The client sends STOP. | Flush remaining VAD audio, force-end active speech, run final speaker re-clustering, and return is_final: true. |
Dynamic VAD
Dynamic VAD adjusts the silence threshold from the current speech duration: short utterances wait longer before cutting, while long utterances are split faster to protect ASR quality.
Streaming Wrapper Schedule
| Accumulated speech | Silence threshold | Effect |
| ≤ 5 s | 2.0 s | Avoid cutting short turns too early. |
| 5-10 s | 1.5 s | Normal conversational segmentation. |
| 10-15 s | 1.0 s | Start tightening long turns. |
| 15-30 s | 0.8 s | Faster cuts. |
| 30-45 s | 0.4 s | Prevent very long ASR segments. |
| > 45 s | 0.1 s | Force splitting. |
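The schedule above amounts to a lookup from accumulated speech to a silence threshold. A minimal sketch, assuming each band includes its upper limit (the table does not say which side the boundaries fall on); silence_threshold and should_cut are hypothetical names, not FunASR functions:

```python
# (upper limit of accumulated speech in seconds, silence threshold in seconds)
STREAMING_WRAPPER_SCHEDULE = [
    (5.0, 2.0),
    (10.0, 1.5),
    (15.0, 1.0),
    (30.0, 0.8),
    (45.0, 0.4),
    (float("inf"), 0.1),
]


def silence_threshold(speech_s: float, schedule=STREAMING_WRAPPER_SCHEDULE) -> float:
    """Return the threshold of the first band whose limit covers speech_s."""
    for limit, threshold in schedule:
        if speech_s <= limit:
            return threshold
    return schedule[-1][1]  # only reached if the schedule lacks an open-ended band


def should_cut(speech_s: float, trailing_silence_s: float,
               schedule=STREAMING_WRAPPER_SCHEDULE) -> bool:
    """True when the trailing silence is long enough to end the current segment."""
    return trailing_silence_s >= silence_threshold(speech_s, schedule)
```

The same lookup works for any list of (limit, threshold) pairs as long as both columns share one unit.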
Native fsmn-vad Schedule
| Accumulated speech | Silence threshold |
| ≤ 5 s | 800 ms |
| 5-10 s | 600 ms |
| 10-20 s | 500 ms |
| 20-30 s | 400 ms |
| > 30 s | 300 ms |
# Default: dynamic_silence=True
model.generate(input="audio.wav")
# Disable dynamic silence thresholds
model.generate(input="audio.wav", dynamic_silence=False)
# Custom schedule: (duration_limit_ms, silence_threshold_ms)
model.generate(input="audio.wav", silence_schedule=[
    (5000, 1000),
    (15000, 500),
    (float("inf"), 200),
])
from funasr import AutoModel
from funasr.models.fsmn_vad_streaming.dynamic_vad import DynamicStreamingVAD

vad_model = AutoModel(model="fsmn-vad", device="cuda:0")
vad = DynamicStreamingVAD(vad_model)

# audio_stream is a placeholder for your own iterable of audio chunks
for chunk in audio_stream:
    segments = vad.feed(chunk)
    for start_ms, end_ms in segments:
        print(f"Speech: {start_ms}-{end_ms} ms")

final_segments = vad.finalize()
API Reference
AutoModelVLLM
| Parameter | Default | Description |
| model | - | ModelScope/Hugging Face id or local model directory. |
| hub | "ms" | "ms", "modelscope", "hf", or "huggingface". |
| device | "cuda:0" | PyTorch audio encoder and adaptor device. |
| dtype | "bf16" | "bf16", "fp16", or "fp32". |
| tensor_parallel_size | 1 | Number of GPUs used by vLLM tensor parallelism. |
| gpu_memory_utilization | 0.8 | Fraction of each GPU's memory that vLLM may use for weights, activations, and the KV cache. |
| max_model_len | 4096 | Maximum vLLM sequence length. |
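For a rough feel of how gpu_memory_utilization and max_model_len interact, the KV cache cost per token can be estimated from the decoder's shape. This is back-of-the-envelope arithmetic, not a vLLM API: both functions are hypothetical, and the example values in the comments (28 layers, 8 KV heads, head dimension 128) are assumed for Qwen3-0.6B and should be checked against the model's config.json. In vLLM itself, gpu_memory_utilization bounds the memory of the whole engine, so the weights are subtracted first.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies (dtype_bytes=2 for bf16/fp16).

    Assumed example: 28 layers, 8 KV heads, head_dim 128 for Qwen3-0.6B.
    """
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_budget_tokens(gpu_mem_gib: float, gpu_memory_utilization: float,
                           weights_gib: float, bytes_per_token: int) -> int:
    """Upper bound on tokens that fit in the KV cache on one GPU."""
    budget_gib = gpu_mem_gib * gpu_memory_utilization - weights_gib
    if budget_gib <= 0:
        return 0
    return int(budget_gib * 1024**3 // bytes_per_token)
```

Dividing the token budget by max_model_len gives an optimistic count of full-length sequences that can be cached at once. The estimate ignores activations, CUDA graphs, and the PyTorch encoder sharing the device, so treat it as an upper bound.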
generate()
| Parameter | Default | Description |
| inputs | - | Audio path, path list, numpy array, tensor, wav.scp, or JSONL. |
| language | None | Language hint, for example "中文", "English", or "日本語". |
| hotwords | None | List of hotwords to include in the ASR prompt. |
| itn | True | Apply inverse text normalization. |
| max_new_tokens | 512 | Maximum generated tokens per sample. |
| temperature | 0.0 | Greedy decoding by default. |
| repetition_penalty | 1.0 | Penalty used by vLLM generation. |
Return format: [{"key": str, "text": str, "timestamps": [...]}]. Timestamps are emitted when the model includes the optional CTC decoder and tokenizer.
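Given that return format, results can be serialized one record per line. A minimal sketch, assuming a JSONL layout of your own choosing (the schema that demo_vllm.py writes with --output is not documented here); results_to_jsonl is a hypothetical helper:

```python
import json


def results_to_jsonl(results) -> str:
    """Serialize generate() results as JSON Lines, one record per input."""
    lines = []
    for item in results:
        record = {"key": item["key"], "text": item["text"]}
        # Timestamps are optional; copy them through untouched when present.
        if item.get("timestamps"):
            record["timestamps"] = item["timestamps"]
        # ensure_ascii=False keeps Chinese or Japanese text readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + ("\n" if lines else "")
```

Write the returned string to a file opened with encoding="utf-8" so that non-ASCII transcripts round-trip correctly.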
FAQ
- Q: Why is the first startup slow?
- vLLM initializes the KV cache and CUDA graphs, and FunASR may extract LLM weights from model.pt. This can take about 60-90 seconds on the first run.
- Q: What should I do for CUDA OOM?
- Lower gpu_memory_utilization, lower max_model_len, or increase tensor_parallel_size.
- Q: Can Paraformer use vLLM?
- No. Paraformer is non-autoregressive and does not benefit from vLLM KV-cache decoding.
- Q: WebSocket service or streaming_generate()?
- Use the WebSocket service for production real-time ASR with VAD endpoints. Use streaming_generate() for SDK integration or chunk-level demos.
- Q: The browser cannot access my microphone on a remote server.
- Chrome requires HTTPS or localhost for microphone access. Use ssh -L 10095:localhost:10095 <server> and open the client from localhost.