This guide covers everything from installation to advanced usage — streaming ASR, speaker diarization, emotion detection, and model export.
FunASR can be installed from PyPI or from source. The source install is recommended to get the latest bug fixes and new model support.
# Install from PyPI
pip install funasr
# Install from source (recommended for latest features)
pip install git+https://github.com/modelscope/FunASR.git
FunASR automatically downloads models from ModelScope (default) or HuggingFace. Set hub="hf" to use HuggingFace.
The most common use case: transcribe a complete audio file. FunASR's AutoModel provides a unified interface for all models. Combined with a VAD model, it handles audio of any length by automatically segmenting it into manageable chunks.
Paraformer is a non-autoregressive model optimized for Chinese speech recognition. It offers the best balance of accuracy and speed for production use.
from funasr import AutoModel
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 60000},
    punc_model="ct-punc",
    # spk_model="cam++",  # uncomment for speaker diarization
)
res = model.generate(input="audio.wav", batch_size_s=300, hotword='魔搭')
print(res)
Key parameters:
- vad_model="fsmn-vad": enables Voice Activity Detection to split long audio into segments. Without this, input is limited to ~30 seconds.
- max_single_segment_time=60000: maximum VAD segment length in milliseconds (60s).
- punc_model="ct-punc": adds punctuation to the output text.
- batch_size_s=300: dynamic batching by total audio duration (seconds). Larger values use more memory but are faster.
- hotword: recognition keywords to boost accuracy for domain-specific terms.

If memory runs out, lower batch_size_s, or reduce max_single_segment_time in vad_kwargs. For extremely long segments, set batch_size_threshold_s=30 to force batch_size=1 when a segment exceeds this threshold.
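To make the effect of batch_size_s and batch_size_threshold_s more concrete, here is a minimal sketch of duration-based batching. It is not FunASR's actual implementation: the helper name `group_by_duration`, the greedy packing policy, and the sample durations are all assumptions made for illustration.

```python
def group_by_duration(durations_s, batch_size_s=300, threshold_s=30):
    """Greedily pack segment durations (seconds) into batches.

    Hypothetical helper for illustration only; FunASR's real batching
    logic may differ.
    """
    batches, current, total = [], [], 0.0
    for d in sorted(durations_s):
        if d > threshold_s:
            # Very long segment: flush the open batch, then run it alone.
            if current:
                batches.append(current)
                current, total = [], 0.0
            batches.append([d])
            continue
        if current and total + d > batch_size_s:
            # Adding this segment would exceed the duration budget.
            batches.append(current)
            current, total = [], 0.0
        current.append(d)
        total += d
    if current:
        batches.append(current)
    return batches


# Made-up VAD segment lengths in seconds.
segments = [12.0, 45.0, 8.0, 20.0, 5.0]
batches = group_by_duration(segments, batch_size_s=40, threshold_s=30)
print(batches)
```

A smaller duration budget produces more, smaller batches, which is why lowering batch_size_s trades speed for lower peak memory.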
Fun-ASR-Nano is the latest end-to-end ASR model trained on tens of millions of hours of real speech data. It excels at:
from funasr import AutoModel
model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
    hub="hf",
)
res = model.generate(
    input=["audio.wav"],
    cache={},
    batch_size=1,
    hotwords=["keyword1", "keyword2"],
    language="中文",  # or "英文", "日文", etc.
)
print(res[0]["text"]) # recognized text
print(res[0]["timestamps"]) # character-level timestamps
The output includes timestamps — a list of dictionaries with token, start_time, and end_time (in seconds) for each character.
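As one way to use these character-level timestamps, the sketch below groups them into short caption lines. It assumes only the token, start_time, and end_time fields described above; the helper name `to_caption_lines`, the max_chars limit, and the sample data are hypothetical.

```python
def to_caption_lines(timestamps, max_chars=10):
    """Group character-level timestamps into (start, end, text) caption lines."""
    lines, chars, start = [], [], None
    for item in timestamps:
        if start is None:
            start = item["start_time"]  # first character of a new line
        chars.append(item["token"])
        end = item["end_time"]
        if len(chars) >= max_chars:
            lines.append((start, end, "".join(chars)))
            chars, start = [], None
    if chars:
        # Flush the last, possibly shorter, line.
        lines.append((start, end, "".join(chars)))
    return lines


# Made-up sample in the documented shape (times in seconds).
sample = [
    {"token": "你", "start_time": 0.10, "end_time": 0.30},
    {"token": "好", "start_time": 0.30, "end_time": 0.55},
    {"token": "世", "start_time": 0.80, "end_time": 1.00},
    {"token": "界", "start_time": 1.00, "end_time": 1.25},
]
captions = to_caption_lines(sample, max_chars=2)
for start, end, text in captions:
    print(f"{start:.2f}-{end:.2f}s {text}")
```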
Add spk_model and punc_model to get per-sentence speaker labels. The output sentence_info contains text, timestamps, and speaker ID for each sentence.
model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    spk_model="cam++",
    punc_model="ct-punc",
    device="cuda:0",
    hub="hf",
)
res = model.generate(input=["meeting.wav"], cache={}, batch_size=1, language="中文")
for sent in res[0]["sentence_info"]:
    print(f"Speaker {sent['spk']}: [{sent['start']}ms-{sent['end']}ms] {sent['text']}")
Fun-ASR-Nano requires extra packages: pip install tiktoken huggingface_hub. The model is downloaded from HuggingFace (~1.6GB).
SenseVoice is a multi-task speech understanding model. Beyond ASR, it detects:
It uses a non-autoregressive architecture — processing 10 seconds of audio in just 70ms on GPU.
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)
res = model.generate(
    input="audio.wav",
    cache={},
    language="auto",  # auto-detect language
    use_itn=True,  # inverse text normalization (numbers, dates)
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
The rich_transcription_postprocess function removes the internal tags (like <|zh|>, <|NEUTRAL|>) and returns clean text.
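If you want to keep the tag values (for example the language or emotion label) rather than discard them, a regex over the raw text is one option. The sketch below is not the library's rich_transcription_postprocess, which may do more than strip tags; `split_tags` is a hypothetical helper, and the raw string is a made-up example using the `<|...|>` tag shape shown above.

```python
import re

# Matches markers of the form <|zh|> or <|NEUTRAL|>.
TAG_PATTERN = re.compile(r"<\|([^|]*)\|>")


def split_tags(raw_text):
    """Return (tags, clean_text) for a string containing <|...|> markers."""
    tags = TAG_PATTERN.findall(raw_text)
    clean = TAG_PATTERN.sub("", raw_text).strip()
    return tags, clean


# Made-up raw output for illustration.
raw = "<|zh|><|NEUTRAL|>今天天气很好"
tags, clean = split_tags(raw)
print(tags, clean)
```

Run this on res[0]["text"] before calling the postprocess function, since the tags are gone afterwards.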
SenseVoice also supports speaker diarization when combined with spk_model:
model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    spk_model="cam++",
    punc_model="ct-punc",
    device="cuda:0",
)
res = model.generate(input="meeting.wav", cache={}, language="auto",
                     use_itn=True, batch_size_s=60, merge_vad=True, merge_length_s=15)
for sent in res[0]["sentence_info"]:
    text = rich_transcription_postprocess(sent["text"])
    print(f"Speaker {sent['spk']}: {text}")
For real-time transcription, use the streaming Paraformer model. Audio is processed chunk-by-chunk, and partial results are returned incrementally. This is ideal for live captioning, voice assistants, and real-time meeting transcription.
The cache dictionary maintains state between chunks — it must persist across the entire session.
from funasr import AutoModel
import soundfile
chunk_size = [0, 10, 5] # [0, 10, 5] = 600ms chunks, 300ms lookahead
encoder_chunk_look_back = 4
decoder_chunk_look_back = 1
model = AutoModel(model="paraformer-zh-streaming")
speech, sample_rate = soundfile.read("audio.wav")
chunk_stride = chunk_size[1] * 960 # 600ms = 9600 samples at 16kHz
cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(
        input=speech_chunk, cache=cache, is_final=is_final,
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    print(res)  # partial results for each chunk
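The chunk arithmetic in the loop above can be checked without loading a model. The sketch below assumes one chunk_size unit is 60 ms (960 samples at 16 kHz, as in the chunk_stride line) and a 16 kHz sample rate; the helper names `chunk_layout` and `count_chunks` are hypothetical.

```python
FRAME_MS = 60        # assumed duration of one chunk_size unit
SAMPLE_RATE = 16000  # assumed input sample rate in Hz


def chunk_layout(chunk_size, sample_rate=SAMPLE_RATE):
    """Return (chunk_ms, lookahead_ms, stride_in_samples) for a chunk_size triple."""
    chunk_ms = chunk_size[1] * FRAME_MS
    lookahead_ms = chunk_size[2] * FRAME_MS
    stride = chunk_ms * sample_rate // 1000
    return chunk_ms, lookahead_ms, stride


def count_chunks(num_samples, stride):
    """Ceiling division: matches int((n - 1) / stride + 1) for n >= 1."""
    return (num_samples - 1) // stride + 1


chunk_ms, lookahead_ms, stride = chunk_layout([0, 10, 5])
n_chunks = count_chunks(5 * SAMPLE_RATE, stride)  # a 5-second clip
print(chunk_ms, lookahead_ms, stride, n_chunks)
```

The last chunk is usually shorter than the stride, which is why the loop passes is_final=True on it to flush the remaining text.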
How it works:
- chunk_size=[0, 10, 5]: display granularity is 10×60ms=600ms, lookahead is 5×60ms=300ms
- is_final=True on the last chunk forces output of any remaining buffered text
- The cache dict maintains encoder/decoder state; do NOT reset it between chunks

VAD detects speech segments in audio, returning start/end times in milliseconds. Useful for pre-processing before ASR, or for skipping silence in long recordings.
from funasr import AutoModel
model = AutoModel(model="fsmn-vad")
res = model.generate(input="audio.wav")
print(res)
# [{"key": "audio", "value": [[610, 5530], [7200, 12400], ...]}]
# Each [start_ms, end_ms] is a speech segment
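Once you have the segment list, simple statistics such as total speech time and the silence between segments follow directly. This sketch works on a list shaped like the "value" field above; `speech_stats` is a hypothetical helper and the segment values are taken from the example comment.

```python
def speech_stats(segments_ms):
    """Return (total_speech_ms, gaps_ms) for sorted [start_ms, end_ms] pairs."""
    speech_ms = sum(end - start for start, end in segments_ms)
    # Silence between each segment and the one that follows it.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(segments_ms, segments_ms[1:])]
    return speech_ms, gaps


segments = [[610, 5530], [7200, 12400]]
speech_ms, gaps_ms = speech_stats(segments)
print(f"speech: {speech_ms} ms, gaps: {gaps_ms}")
```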
For real-time voice activity detection (e.g., detecting when a user starts/stops speaking), use the streaming VAD mode. Audio is fed in small chunks (e.g., 200ms), and the model reports speech boundaries as they are detected.
from funasr import AutoModel
import soundfile
chunk_size = 200 # process 200ms at a time
model = AutoModel(model="fsmn-vad")
speech, sample_rate = soundfile.read("audio.wav")
chunk_stride = int(chunk_size * sample_rate / 1000)
cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache,
                         is_final=is_final, chunk_size=chunk_size)
    if len(res[0]["value"]):
        print(res)
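Because streaming VAD reports a start and its matching end in different chunks, the caller usually has to stitch events back into complete segments. The sketch below assumes each chunk yields a list of [beg, end] pairs in which -1 marks a boundary that has not been seen yet; `assemble_segments` and the event stream are hypothetical.

```python
def assemble_segments(events_per_chunk):
    """Stitch per-chunk VAD events into complete [start_ms, end_ms] segments.

    Returns (segments, open_start), where open_start is the start time of a
    segment that has begun but not ended, or None.
    """
    segments, open_start = [], None
    for events in events_per_chunk:
        for beg, end in events:
            if beg != -1 and end != -1:
                segments.append([beg, end])         # complete segment
            elif end == -1:
                open_start = beg                    # speech started
            elif open_start is not None:
                segments.append([open_start, end])  # speech ended
                open_start = None
    return segments, open_start


# Made-up event stream: one list of events per processed chunk.
stream = [[], [[70, -1]], [], [[-1, 2340]], [[2620, 4100]], [[4800, -1]]]
segments, pending = assemble_segments(stream)
print(segments, pending)
```

In a live loop you would pass res[0]["value"] from each chunk instead of a prepared list, and treat a non-None pending start as "user is still speaking".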
Streaming VAD output formats:
- [[beg, end]]: complete speech segment detected
- [[beg, -1]]: speech started but not yet ended
- [[-1, end]]: speech ended (paired with previous start)
- []: no event detected in this chunk

ASR output typically lacks punctuation. The CT-Transformer model adds commas, periods, and question marks to raw text. It supports both Chinese and English.
from funasr import AutoModel
model = AutoModel(model="ct-punc")
res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res[0]["text"])
# "那今天的会就到这里吧,happy new year,明年见。"
The punctuation model is usually used as part of the pipeline (via punc_model="ct-punc" in AutoModel), but can also be used standalone for post-processing text from other sources.
Speaker diarization answers "who spoke when" by clustering audio segments by speaker identity. FunASR supports this for all three major ASR models: Paraformer, Fun-ASR-Nano, and SenseVoice.
The pipeline works as follows:
from funasr import AutoModel
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    spk_model="cam++",  # or "iic/speech_eres2netv2_sv_zh-cn_16k-common"
)
res = model.generate(input="meeting.wav", batch_size_s=300)
for sent in res[0]["sentence_info"]:
    print(f"[Speaker {sent['spk']}] [{sent['start']}-{sent['end']}ms] {sent['text']}")
You can also specify the number of speakers with preset_spk_num=N if known in advance.
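For meeting minutes it is often handy to merge consecutive sentences from the same speaker into a single turn. The sketch below assumes sentence_info entries carry the spk, start, end, and text keys used in the loop above; `merge_turns` and the sample sentences are hypothetical.

```python
def merge_turns(sentence_info):
    """Merge consecutive sentences by the same speaker into one turn."""
    turns = []
    for sent in sentence_info:
        if turns and turns[-1]["spk"] == sent["spk"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = sent["end"]
            turns[-1]["text"] += sent["text"]
        else:
            turns.append({"spk": sent["spk"], "start": sent["start"],
                          "end": sent["end"], "text": sent["text"]})
    return turns


# Made-up sentences in the sentence_info shape (times in ms).
sample = [
    {"spk": 0, "start": 0, "end": 1800, "text": "大家好,"},
    {"spk": 0, "start": 1900, "end": 3500, "text": "我们开始吧。"},
    {"spk": 1, "start": 3900, "end": 5200, "text": "好的。"},
]
turns = merge_turns(sample)
for turn in turns:
    print(f"[Speaker {turn['spk']}] [{turn['start']}-{turn['end']}ms] {turn['text']}")
```

The helper builds new dictionaries, so the original sentence_info list is left unchanged.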
Qwen3-ASR is an LLM-based ASR system supporting 52 languages with automatic language detection. It draws on the LLM's contextual understanding to improve accuracy on complex speech.
# First install: pip install qwen-asr
from funasr import AutoModel
model = AutoModel(
    model="Qwen/Qwen3-ASR-1.7B",  # or Qwen/Qwen3-ASR-0.6B for smaller model
    hub="hf",
    device="cuda:0",
)
# Auto language detection
res = model.generate(input="audio.wav")
print(res[0]["text"], "| Language:", res[0].get("language", ""))
# With specified language (slightly faster, more accurate)
res = model.generate(input="audio_zh.wav", language="Chinese")
Qwen3-ASR requires pip install qwen-asr and a GPU with sufficient memory (~4GB for 0.6B, ~8GB for 1.7B).
Export models to ONNX format for deployment with ONNX Runtime. This enables inference without PyTorch and supports additional optimizations like quantization.
# Export via command line
funasr-export ++model=paraformer ++quantize=false ++device=cpu
# Or via Python
from funasr import AutoModel
model = AutoModel(model="paraformer", device="cpu")
export_dir = model.export(quantize=False)
print(f"Exported to: {export_dir}")
# pip install funasr-onnx
from funasr_onnx import Paraformer
model = Paraformer(
    "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    batch_size=1, quantize=True
)
result = model(["audio.wav"])
print(result)
ONNX models can be further optimized with onnxslim:
pip install onnxslim
onnxslim model.onnx model_optimized.onnx
Complete reference for the two main interfaces: AutoModel() for initialization and model.generate() for inference.
AutoModel() parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | — | Model name (from Hub) or local path |
| device | str | "cuda:0" | Device: "cuda:0", "cpu", "mps", "npu:0" |
| vad_model | str | None | VAD model for long audio segmentation |
| vad_kwargs | dict | {} | VAD config, e.g. {"max_single_segment_time": 60000} |
| punc_model | str | None | Punctuation restoration model |
| spk_model | str | None | Speaker model for diarization ("cam++" or full model ID) |
| hub | str | "ms" | "ms" (ModelScope) or "hf" (HuggingFace) |
| ncpu | int | 4 | CPU threads for parallel operations |
| disable_update | bool | False | Skip version check on startup |
| disable_pbar | bool | False | Disable tqdm progress bars |

model.generate() parameters:

| Parameter | Type | Description |
|---|---|---|
| input | str / array / list | Audio: file path, URL, numpy array, or list of arrays |
| cache | dict | State cache for streaming. Pass {} for first call. |
| hotword | str / list | Keywords to boost recognition accuracy |
| language | str | Language hint: "auto", "zh", "en", "中文", "Chinese", etc. |
| batch_size_s | int | Dynamic batch total duration (seconds) |
| is_final | bool | Last chunk flag for streaming mode |
| return_spk_res | bool | Return speaker diarization in sentence_info |
| sentence_timestamp | bool | Return sentence-level timestamps |
| use_itn | bool | Apply inverse text normalization (SenseVoice) |

Output fields:

| Field | Format | Description |
|---|---|---|
| text | str | Recognized text (all ASR models) |
| timestamp | [[start_ms, end_ms], ...] | Character/word timestamps |
| sentence_info | [{text, start, end, spk, timestamp}, ...] | Sentences with speaker labels |
| value | [[start_ms, end_ms], ...] | VAD speech segments |
| spk_embedding | Tensor [N, 192] | Speaker embeddings (CAM++/ERes2NetV2) |

You can use short aliases such as "paraformer-zh", "fsmn-vad", "ct-punc", and "cam++" instead of full model IDs. FunASR resolves them automatically via the built-in alias map.