FunASR Tutorial

This guide covers everything from installation to advanced usage — streaming ASR, speaker diarization, emotion detection, and model export.

Installation

FunASR can be installed from PyPI or from source. The source install is recommended to get the latest bug fixes and new model support.

# Install from PyPI
pip install funasr

# Install from source (recommended for latest features)
pip install git+https://github.com/modelscope/FunASR.git

FunASR automatically downloads models from ModelScope (default) or HuggingFace. Set hub="hf" to use HuggingFace.

Speech Recognition (Offline)

The most common use case: transcribe a complete audio file. FunASR's AutoModel provides a unified interface for all models. Combined with a VAD model, it handles audio of any length by automatically segmenting it into manageable chunks.

Paraformer — Chinese/English ASR

Paraformer is a non-autoregressive model optimized for Chinese speech recognition. Because it predicts all output tokens in parallel rather than one at a time, it offers a good balance of accuracy and speed for production use.

from funasr import AutoModel

model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 60000},
    punc_model="ct-punc",
    # spk_model="cam++",  # uncomment for speaker diarization
)
res = model.generate(input="audio.wav", batch_size_s=300, hotword='魔搭')
print(res)

Key parameters:

vad_model="fsmn-vad" — enables Voice Activity Detection to split long audio into segments. Without this, input is limited to ~30 seconds.
max_single_segment_time=60000 — maximum VAD segment length in milliseconds (60s).
punc_model="ct-punc" — adds punctuation to the output text.
batch_size_s=300 — dynamic batching by total audio duration (seconds). Larger values use more memory but are faster.
hotword — recognition keywords to boost accuracy for domain-specific terms.

OOM (Out of Memory)? Reduce batch_size_s, or reduce max_single_segment_time in vad_kwargs. For extremely long segments, set batch_size_threshold_s=30 to force batch_size=1 when a segment exceeds this threshold.
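
To make the interaction between batch_size_s and batch_size_threshold_s concrete, here is a simplified sketch of duration-based batching. It is an illustration only: plan_batches is a hypothetical helper, and FunASR's internal batching logic may differ in detail.

```python
from typing import List


def plan_batches(durations_s: List[float], batch_size_s: float,
                 batch_size_threshold_s: float) -> List[List[int]]:
    """Group segment indices into batches bounded by total duration.

    Hypothetical helper for illustration; not FunASR's actual code.
    """
    # Sort by duration so segments of similar length end up together.
    order = sorted(range(len(durations_s)), key=lambda i: durations_s[i])
    batches: List[List[int]] = []
    current: List[int] = []
    current_total = 0.0
    for idx in order:
        dur = durations_s[idx]
        if dur > batch_size_threshold_s:
            # Very long segment: decode it alone (batch size of 1).
            batches.append([idx])
            continue
        if current and current_total + dur > batch_size_s:
            # Adding this segment would exceed the duration budget.
            batches.append(current)
            current, current_total = [], 0.0
        current.append(idx)
        current_total += dur
    if current:
        batches.append(current)
    return batches


# Five VAD segments with made-up durations in seconds.
segments = [12.0, 45.0, 8.0, 70.0, 20.0]
batches = plan_batches(segments, batch_size_s=60, batch_size_threshold_s=30)
print(batches)
```

In this toy run the 45 s and 70 s segments are each decoded alone, while the three short segments share one batch. Lowering batch_size_s shrinks the shared batches, which is why it reduces peak memory.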

Fun-ASR-Nano — 31 Languages

Fun-ASR-Nano is the latest end-to-end ASR model trained on tens of millions of hours of real speech data. It excels at:

31 languages including Chinese dialects (Wu, Cantonese, Min, etc.) and regional accents
Character-level timestamps
Lyrics and rap recognition
Hotword customization

Basic Usage

from funasr import AutoModel

model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
    hub="hf",
)
res = model.generate(
    input=["audio.wav"],
    cache={},
    batch_size=1,
    hotwords=["keyword1", "keyword2"],
    language="中文",  # or "英文", "日文", etc.
)
print(res[0]["text"])        # recognized text
print(res[0]["timestamps"])  # character-level timestamps

The output includes timestamps — a list of dictionaries with token, start_time, and end_time (in seconds) for each character.
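
Character-level timestamps are often more useful once grouped into phrases. The sketch below merges consecutive characters whose gap is below a pause threshold. It assumes the token / start_time / end_time layout described above; group_by_pause and the 0.5 s default are hypothetical choices, and the sample data is invented.

```python
from typing import Dict, List


def group_by_pause(timestamps: List[Dict], max_gap_s: float = 0.5) -> List[Dict]:
    """Merge character entries into phrases, splitting on long pauses."""
    phrases: List[Dict] = []
    for item in timestamps:
        if phrases and item["start_time"] - phrases[-1]["end_time"] <= max_gap_s:
            # Short gap: extend the current phrase.
            phrases[-1]["text"] += item["token"]
            phrases[-1]["end_time"] = item["end_time"]
        else:
            # First character or long pause: start a new phrase.
            phrases.append({
                "text": item["token"],
                "start_time": item["start_time"],
                "end_time": item["end_time"],
            })
    return phrases


# Invented sample in the shape of res[0]["timestamps"].
sample = [
    {"token": "你", "start_time": 0.10, "end_time": 0.30},
    {"token": "好", "start_time": 0.30, "end_time": 0.55},
    {"token": "再", "start_time": 1.50, "end_time": 1.70},
    {"token": "见", "start_time": 1.70, "end_time": 1.95},
]
phrases = group_by_pause(sample)
for p in phrases:
    print(f"{p['start_time']:.2f}s-{p['end_time']:.2f}s {p['text']}")
```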

With Speaker Diarization

Add spk_model and punc_model to get per-sentence speaker labels. The output sentence_info contains text, timestamps, and speaker ID for each sentence.

from funasr import AutoModel

model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    spk_model="cam++",
    punc_model="ct-punc",
    device="cuda:0",
    hub="hf",
)
res = model.generate(input=["meeting.wav"], cache={}, batch_size=1, language="中文")

for sent in res[0]["sentence_info"]:
    print(f"Speaker {sent['spk']}: [{sent['start']}ms-{sent['end']}ms] {sent['text']}")

Requirements for Fun-ASR-Nano: pip install tiktoken huggingface_hub. The model is downloaded from HuggingFace (~1.6GB).

SenseVoice — ASR + Emotion + Audio Events

SenseVoice is a multi-task speech understanding model. Beyond ASR, it detects:

Emotions: happy, sad, angry, neutral
Audio events: BGM, applause, laughter, crying, coughing, sneezing
Languages: Chinese, English, Cantonese, Japanese, Korean

It uses a non-autoregressive architecture — processing 10 seconds of audio in just 70ms on GPU.
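
As a quick worked example, the latency figure above can be expressed as a real-time factor (RTF), i.e. processing time divided by audio duration. The helper name is hypothetical; the numbers are the ones quoted in this section.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF below 1.0 means faster than real time."""
    return processing_s / audio_s


rtf = real_time_factor(processing_s=0.070, audio_s=10.0)
speedup = 1.0 / rtf
print(f"RTF = {rtf:.3f}, about {speedup:.0f}x faster than real time")
```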

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)
res = model.generate(
    input="audio.wav",
    cache={},
    language="auto",   # auto-detect language
    use_itn=True,      # inverse text normalization (numbers, dates)
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)

The rich_transcription_postprocess function removes the internal tags (like <|zh|>, <|NEUTRAL|>) and returns clean text.
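
If you want to keep the tag information (for example, the detected language or emotion) rather than discard it, you can parse the raw text yourself before cleaning it. The following is a minimal sketch that assumes tags always have the <|...|> form shown above. split_tags is a hypothetical helper, not part of FunASR, and it does not reproduce everything rich_transcription_postprocess does.

```python
import re
from typing import List, Tuple

# Matches tags of the form <|zh|> or <|NEUTRAL|>.
TAG_PATTERN = re.compile(r"<\|([^|]*)\|>")


def split_tags(raw_text: str) -> Tuple[List[str], str]:
    """Return (list of tag names, text with tags removed)."""
    tags = TAG_PATTERN.findall(raw_text)
    clean = TAG_PATTERN.sub("", raw_text).strip()
    return tags, clean


# Invented raw output for illustration.
raw = "<|zh|><|NEUTRAL|>今天天气不错。"
tags, clean = split_tags(raw)
print(tags)
print(clean)
```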

SenseVoice + Speaker Diarization

SenseVoice also supports speaker diarization when combined with spk_model:

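from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(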
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    spk_model="cam++",
    punc_model="ct-punc",
    device="cuda:0",
)
res = model.generate(input="meeting.wav", cache={}, language="auto",
                     use_itn=True, batch_size_s=60, merge_vad=True, merge_length_s=15)

for sent in res[0]["sentence_info"]:
    text = rich_transcription_postprocess(sent["text"])
    print(f"Speaker {sent['spk']}: {text}")

Speech Recognition (Streaming)

For real-time transcription, use the streaming Paraformer model. Audio is processed chunk-by-chunk, and partial results are returned incrementally. This is ideal for live captioning, voice assistants, and real-time meeting transcription.

The cache dictionary maintains state between chunks — it must persist across the entire session.

from funasr import AutoModel
import soundfile

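chunk_size = [0, 10, 5]  # 10 x 60ms = 600ms chunks, 5 x 60ms = 300ms lookahead
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention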

model = AutoModel(model="paraformer-zh-streaming")

speech, sample_rate = soundfile.read("audio.wav")
chunk_stride = chunk_size[1] * 960  # 600ms = 9600 samples at 16kHz

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(
        input=speech_chunk, cache=cache, is_final=is_final,
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    print(res)  # partial results for each chunk

How it works:

chunk_size=[0, 10, 5] — display granularity is 10×60ms=600ms, lookahead is 5×60ms=300ms
Each iteration feeds 600ms of audio (9600 samples at 16kHz)
is_final=True on the last chunk forces output of any remaining buffered text
The cache dict maintains encoder/decoder state — do NOT reset it between chunks
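
The chunk arithmetic above can be checked with a few lines of plain Python. This sketch assumes 16 kHz audio and the 60 ms unit stated above; chunk_plan is a hypothetical helper used only to show where 9600 and the chunk count come from.

```python
import math

FRAME_MS = 60        # each chunk_size unit corresponds to 60 ms
SAMPLE_RATE = 16000  # assumed input sample rate (16 kHz)


def chunk_plan(chunk_size, num_samples, sample_rate=SAMPLE_RATE):
    """Derive chunk length, lookahead, stride, and chunk count."""
    chunk_ms = chunk_size[1] * FRAME_MS
    lookahead_ms = chunk_size[2] * FRAME_MS
    chunk_stride = chunk_ms * sample_rate // 1000  # samples per chunk
    total_chunks = math.ceil(num_samples / chunk_stride)
    return {
        "chunk_ms": chunk_ms,
        "lookahead_ms": lookahead_ms,
        "chunk_stride": chunk_stride,
        "total_chunks": total_chunks,
    }


# A 5-second recording at 16 kHz.
plan = chunk_plan([0, 10, 5], num_samples=5 * SAMPLE_RATE)
print(plan)
```

For a 5-second file this gives 9 chunks of 9600 samples, with the last chunk shorter than the rest. That last chunk is the one that should be sent with is_final=True.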

Voice Activity Detection (Offline)

VAD detects speech segments in audio, returning start/end times in milliseconds. Useful for pre-processing before ASR, or for skipping silence in long recordings.

from funasr import AutoModel

model = AutoModel(model="fsmn-vad")
res = model.generate(input="audio.wav")
print(res)
# [{"key": "audio", "value": [[610, 5530], [7200, 12400], ...]}]
# Each [start_ms, end_ms] is a speech segment
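
Once you have the segment list, typical follow-up steps are measuring how much speech a file contains and converting the millisecond boundaries to sample indices for slicing. A minimal sketch, assuming 16 kHz audio and the output shape shown above (speech_stats is a hypothetical helper):

```python
from typing import List, Tuple


def speech_stats(segments_ms: List[List[int]],
                 sample_rate: int = 16000) -> Tuple[int, List[Tuple[int, int]]]:
    """Return total speech time in ms and (start, end) sample indices."""
    total_ms = sum(end - start for start, end in segments_ms)
    slices = [
        (start * sample_rate // 1000, end * sample_rate // 1000)
        for start, end in segments_ms
    ]
    return total_ms, slices


# Same shape as the VAD result above, truncated to two segments.
res = [{"key": "audio", "value": [[610, 5530], [7200, 12400]]}]
total_ms, slices = speech_stats(res[0]["value"])
print(f"speech: {total_ms / 1000:.2f}s")
print(slices)  # use as speech[start:end] on the loaded waveform
```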

Voice Activity Detection (Streaming)

For real-time voice activity detection (e.g., detecting when a user starts/stops speaking), use the streaming VAD mode. Audio is fed in small chunks (e.g., 200ms), and the model reports speech boundaries as they are detected.

from funasr import AutoModel
import soundfile

chunk_size = 200  # process 200ms at a time
model = AutoModel(model="fsmn-vad")

speech, sample_rate = soundfile.read("audio.wav")
chunk_stride = int(chunk_size * sample_rate / 1000)

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache,
                         is_final=is_final, chunk_size=chunk_size)
    if len(res[0]["value"]):
        print(res)

Streaming VAD output formats:

[[beg, end]] — complete speech segment detected
[[beg, -1]] — speech started but not yet ended
[[-1, end]] — speech ended (paired with previous start)
[] — no event detected in this chunk
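
Because a start and its matching end can arrive in different chunks, client code usually needs a small amount of state to reassemble complete segments. The class below is one possible sketch of that bookkeeping. SegmentCollector is a hypothetical name, and the event values are invented.

```python
from typing import List, Optional


class SegmentCollector:
    """Pairs streaming VAD start/end events into complete segments."""

    def __init__(self) -> None:
        self.open_start: Optional[int] = None  # start awaiting its end
        self.segments: List[List[int]] = []

    def update(self, events: List[List[int]]) -> None:
        for beg, end in events:
            if beg != -1 and end != -1:
                # Complete segment reported within one chunk.
                self.segments.append([beg, end])
            elif beg != -1:
                # Speech started; wait for the end event.
                self.open_start = beg
            elif end != -1 and self.open_start is not None:
                # Speech ended; pair with the pending start.
                self.segments.append([self.open_start, end])
                self.open_start = None


collector = SegmentCollector()
# Invented per-chunk values in the four formats listed above.
stream = [[], [[70, -1]], [], [[-1, 2340]], [[2620, 4200]], [[5000, -1]]]
for events in stream:
    collector.update(events)
print(collector.segments)
print(collector.open_start)
```

After this run two segments are complete and one is still open (it started at 5000 ms), which is the state you would expect while the user is still speaking.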

Punctuation Restoration

ASR output typically lacks punctuation. The CT-Transformer model adds commas, periods, and question marks to raw text. It supports both Chinese and English.

from funasr import AutoModel

model = AutoModel(model="ct-punc")
res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res[0]["text"])
# "那今天的会就到这里吧，happy new year，明年见。"

The punctuation model is usually used as part of the pipeline (via punc_model="ct-punc" in AutoModel), but can also be used standalone for post-processing text from other sources.
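
Once text has punctuation, a common post-processing step is splitting it into sentences. The sketch below does this with a regular expression over Chinese and English sentence-ending marks. split_sentences is a hypothetical helper, and the ASCII period is deliberately left out so that decimals such as 3.5 are not split.

```python
import re
from typing import List

# A run of non-terminators followed by any sentence-ending marks.
SENTENCE_PATTERN = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(text: str) -> List[str]:
    """Split punctuated text into sentences, keeping the end marks."""
    return [s.strip() for s in SENTENCE_PATTERN.findall(text) if s.strip()]


sentences = split_sentences("你好。今天开会吗？好的！")
print(sentences)
```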

Speaker Diarization

Speaker diarization answers "who spoke when" by clustering audio segments by speaker identity. FunASR supports this for all three major ASR models: Paraformer, Fun-ASR-Nano, and SenseVoice.

The pipeline works as follows:

1. VAD segments the audio into speech regions
2. Each region is split into 1.5-second chunks
3. CAM++ extracts a 192-dimensional speaker embedding for each chunk
4. Spectral clustering groups chunks by speaker
5. Punctuation model segments text into sentences
6. Each sentence is assigned a speaker label based on time overlap
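
The final step, labeling a sentence by time overlap, can be sketched independently of the models. The code below is a simplified illustration, not FunASR's implementation: it assumes clustering has already produced (start_ms, end_ms, speaker) chunks and picks the speaker with the largest total overlap. All names and data are hypothetical.

```python
from typing import Dict, List, Tuple


def overlap_ms(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Length of the intersection of two time intervals, in ms."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(sentences: List[Dict],
                    speaker_chunks: List[Tuple[int, int, int]]) -> List[Dict]:
    """Label each sentence with the speaker it overlaps the most."""
    labeled = []
    for sent in sentences:
        totals: Dict[int, int] = {}
        for c_start, c_end, spk in speaker_chunks:
            ov = overlap_ms(sent["start"], sent["end"], c_start, c_end)
            if ov > 0:
                totals[spk] = totals.get(spk, 0) + ov
        # -1 marks a sentence that overlaps no labeled chunk.
        spk = max(totals, key=totals.get) if totals else -1
        labeled.append({**sent, "spk": spk})
    return labeled


# Invented 1.5-second chunks with cluster labels 0 and 1.
chunks = [(0, 1500, 0), (1500, 3000, 0), (3000, 4500, 1), (4500, 6000, 1)]
sentences = [
    {"text": "大家好。", "start": 200, "end": 2800},
    {"text": "开始吧。", "start": 2900, "end": 5600},
]
labeled = assign_speakers(sentences, chunks)
for s in labeled:
    print(s["spk"], s["text"])
```

The second sentence begins 100 ms before the speaker change, yet it is still assigned to speaker 1 because most of its duration falls in speaker 1's chunks. In practice, all of this happens inside AutoModel.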

from funasr import AutoModel

model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    spk_model="cam++",  # or "iic/speech_eres2netv2_sv_zh-cn_16k-common"
)
res = model.generate(input="meeting.wav", batch_size_s=300)

for sent in res[0]["sentence_info"]:
    print(f"[Speaker {sent['spk']}] [{sent['start']}-{sent['end']}ms] {sent['text']}")

You can also specify the number of speakers with preset_spk_num=N if known in advance.
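
For meeting transcripts it is often convenient to merge consecutive sentences from the same speaker into turns and to total each speaker's talk time. The sketch below works on data shaped like sentence_info (spk, start, end, text); the helper names and the sample values are hypothetical.

```python
from typing import Dict, List


def merge_turns(sentence_info: List[Dict]) -> List[Dict]:
    """Merge adjacent sentences that share a speaker into one turn."""
    turns: List[Dict] = []
    for sent in sentence_info:
        if turns and turns[-1]["spk"] == sent["spk"]:
            turns[-1]["text"] += sent["text"]
            turns[-1]["end"] = sent["end"]
        else:
            turns.append({"spk": sent["spk"], "start": sent["start"],
                          "end": sent["end"], "text": sent["text"]})
    return turns


def talk_time_ms(turns: List[Dict]) -> Dict[int, int]:
    """Total turn duration per speaker, in milliseconds."""
    totals: Dict[int, int] = {}
    for t in turns:
        totals[t["spk"]] = totals.get(t["spk"], 0) + t["end"] - t["start"]
    return totals


# Invented sample in the shape of res[0]["sentence_info"].
sample = [
    {"spk": 0, "start": 0, "end": 1800, "text": "大家好，"},
    {"spk": 0, "start": 1800, "end": 3500, "text": "我们开始吧。"},
    {"spk": 1, "start": 3900, "end": 5200, "text": "好的。"},
    {"spk": 0, "start": 5600, "end": 7000, "text": "先看第一项。"},
]
turns = merge_turns(sample)
totals = talk_time_ms(turns)
print(turns)
print(totals)
```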

Qwen3-ASR — 52 Languages

Qwen3-ASR is an LLM-based ASR system supporting 52 languages with automatic language detection. It leverages the LLM's contextual understanding to improve accuracy on complex speech.

# First install: pip install qwen-asr
from funasr import AutoModel

model = AutoModel(
    model="Qwen/Qwen3-ASR-1.7B",  # or Qwen/Qwen3-ASR-0.6B for smaller model
    hub="hf",
    device="cuda:0",
)

# Auto language detection
res = model.generate(input="audio.wav")
print(res[0]["text"], "| Language:", res[0].get("language", ""))

# With specified language (slightly faster, more accurate)
res = model.generate(input="audio_zh.wav", language="Chinese")

Requires pip install qwen-asr and a GPU with sufficient memory (~4GB for 0.6B, ~8GB for 1.7B).

ONNX Export

Export models to ONNX format for deployment with ONNX Runtime. This enables inference without PyTorch and supports additional optimizations like quantization.

# Export via command line
funasr-export ++model=paraformer ++quantize=false ++device=cpu

# Or via Python
from funasr import AutoModel
model = AutoModel(model="paraformer", device="cpu")
export_dir = model.export(quantize=False)
print(f"Exported to: {export_dir}")

Using the ONNX Model

# pip install funasr-onnx
from funasr_onnx import Paraformer

model = Paraformer(
    "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    batch_size=1, quantize=True
)
result = model(["audio.wav"])
print(result)

ONNX models can be further optimized with onnxslim:

pip install onnxslim
onnxslim model.onnx model_optimized.onnx

API Reference

Complete reference for the two main interfaces: AutoModel() for initialization and model.generate() for inference.

AutoModel() Parameters

Parameter	Type	Default	Description
`model`	str	—	Model name (from Hub) or local path
`device`	str	"cuda:0"	Device: "cuda:0", "cpu", "mps", "npu:0"
`vad_model`	str	None	VAD model for long audio segmentation
`vad_kwargs`	dict	{}	VAD config, e.g. {"max_single_segment_time": 60000}
`punc_model`	str	None	Punctuation restoration model
`spk_model`	str	None	Speaker model for diarization ("cam++" or full model ID)
`hub`	str	"ms"	"ms" (ModelScope) or "hf" (HuggingFace)
`ncpu`	int	4	CPU threads for parallel operations
`disable_update`	bool	False	Skip version check on startup
`disable_pbar`	bool	False	Disable tqdm progress bars

model.generate() Parameters

Parameter	Type	Description
`input`	str / array / list	Audio: file path, URL, numpy array, or list of arrays
`cache`	dict	State cache for streaming. Pass `{}` for first call.
`hotword`	str / list	Keywords to boost recognition accuracy
`language`	str	Language hint: "auto", "zh", "en", "中文", "Chinese", etc.
`batch_size_s`	int	Dynamic batch total duration (seconds)
`is_final`	bool	Last chunk flag for streaming mode
`return_spk_res`	bool	Return speaker diarization in sentence_info
`sentence_timestamp`	bool	Return sentence-level timestamps
`use_itn`	bool	Apply inverse text normalization (SenseVoice)

Output Fields

Field	Format	Description
`text`	str	Recognized text (all ASR models)
`timestamp`	[[start_ms, end_ms], ...]	Character/word timestamps
`sentence_info`	[{text, start, end, spk, timestamp}, ...]	Sentences with speaker labels
`value`	[[start_ms, end_ms], ...]	VAD speech segments
`spk_embedding`	Tensor [N, 192]	Speaker embeddings (CAM++/ERes2NetV2)

Model Aliases: You can use short names like "paraformer-zh", "fsmn-vad", "ct-punc", "cam++" instead of full model IDs. FunASR resolves them automatically via the built-in alias map.