FunASR Tutorial

This guide covers everything from installation to advanced usage — streaming ASR, speaker diarization, emotion detection, and model export.

Installation

FunASR can be installed from PyPI or from source. The source install is recommended to get the latest bug fixes and new model support.

# Install from PyPI
pip install funasr

# Install from source (recommended for latest features)
pip install git+https://github.com/modelscope/FunASR.git

FunASR automatically downloads models from ModelScope (default) or HuggingFace. Set hub="hf" to use HuggingFace.

Speech Recognition (Offline)

The most common use case: transcribe a complete audio file. FunASR's AutoModel provides a unified interface for all models. Combined with a VAD model, it handles audio of any length by automatically segmenting it into manageable chunks.

Paraformer — Chinese/English ASR

Paraformer is a non-autoregressive model optimized for Chinese speech recognition. It offers the best balance of accuracy and speed for production use.

from funasr import AutoModel

model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 60000},
    punc_model="ct-punc",
    # spk_model="cam++",  # uncomment for speaker diarization
)
res = model.generate(input="audio.wav", batch_size_s=300, hotword='魔搭')
print(res)

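Key parameters: batch_size_s sets the total audio duration (in seconds) packed into one dynamic batch; hotword passes keywords whose recognition should be boosted; vad_kwargs configures the VAD model, where max_single_segment_time caps the length of a single segment in milliseconds.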

OOM (Out of Memory)? Reduce batch_size_s, or reduce max_single_segment_time in vad_kwargs. For extremely long segments, set batch_size_threshold_s=30 to force batch_size=1 when a segment exceeds this threshold.
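
The advice above can be automated with a small retry wrapper. The following is a minimal sketch, not part of FunASR: generate_with_backoff and fake_generate are hypothetical names, and it assumes the out-of-memory condition surfaces as a RuntimeError whose message contains "out of memory", which is how PyTorch reports CUDA OOM. The fake function stands in for model.generate so the example runs without a GPU.

```python
def generate_with_backoff(generate, batch_size_s=300, min_batch_size_s=10, **kwargs):
    """Call generate(...), halving batch_size_s after each out-of-memory error."""
    while True:
        try:
            return generate(batch_size_s=batch_size_s, **kwargs)
        except RuntimeError as err:
            is_oom = "out of memory" in str(err).lower()
            # Re-raise anything that is not OOM, or OOM at the smallest allowed batch.
            if not is_oom or batch_size_s <= min_batch_size_s:
                raise
            batch_size_s = max(min_batch_size_s, batch_size_s // 2)


def fake_generate(input, batch_size_s):
    """Stand-in for model.generate: pretends batches above 100 s do not fit."""
    if batch_size_s > 100:
        raise RuntimeError("CUDA out of memory")
    return [{"text": f"ok at batch_size_s={batch_size_s}"}]


result = generate_with_backoff(fake_generate, batch_size_s=300, input="audio.wav")
print(result[0]["text"])
```

With a real model you would pass model.generate as the first argument. Halving is only one possible policy; lowering max_single_segment_time requires rebuilding the model and is not covered here.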

Fun-ASR-Nano — 31 Languages

Fun-ASR-Nano is the latest end-to-end ASR model, trained on tens of millions of hours of real speech data and covering 31 languages.

Basic Usage

from funasr import AutoModel

model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
    hub="hf",
)
res = model.generate(
    input=["audio.wav"],
    cache={},
    batch_size=1,
    hotwords=["keyword1", "keyword2"],
    language="中文",  # or "英文", "日文", etc.
)
print(res[0]["text"])        # recognized text
print(res[0]["timestamps"])  # character-level timestamps

The output includes timestamps — a list of dictionaries with token, start_time, and end_time (in seconds) for each character.
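
Character-level entries are often easier to use once grouped into phrases. The sketch below assumes only the entry shape described above (token, start_time, end_time in seconds); group_tokens, the 0.5 s pause threshold, and the sample data are illustrative and not real model output.

```python
def group_tokens(timestamps, max_gap=0.5):
    """Group character-level entries into phrases separated by pauses > max_gap seconds."""
    groups = []
    current = []
    for entry in timestamps:
        if current and entry["start_time"] - current[-1]["end_time"] > max_gap:
            groups.append(current)
            current = []
        current.append(entry)
    if current:
        groups.append(current)
    return [
        {
            "text": "".join(e["token"] for e in group),
            "start_time": group[0]["start_time"],
            "end_time": group[-1]["end_time"],
        }
        for group in groups
    ]


# Illustrative data in the documented shape (hypothetical values).
sample = [
    {"token": "你", "start_time": 0.10, "end_time": 0.30},
    {"token": "好", "start_time": 0.30, "end_time": 0.55},
    {"token": "世", "start_time": 1.50, "end_time": 1.70},
    {"token": "界", "start_time": 1.70, "end_time": 1.95},
]
phrases = group_tokens(sample)
for phrase in phrases:
    print(f"{phrase['start_time']:.2f}-{phrase['end_time']:.2f}s {phrase['text']}")
```

In practice you would pass res[0]["timestamps"] instead of the sample list.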

With Speaker Diarization

Add spk_model and punc_model to get per-sentence speaker labels. The output sentence_info contains text, timestamps, and speaker ID for each sentence.

model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    spk_model="cam++",
    punc_model="ct-punc",
    device="cuda:0",
    hub="hf",
)
res = model.generate(input=["meeting.wav"], cache={}, batch_size=1, language="中文")

for sent in res[0]["sentence_info"]:
    print(f"Speaker {sent['spk']}: [{sent['start']}ms-{sent['end']}ms] {sent['text']}")

Requirements: pip install tiktoken huggingface_hub. The model is downloaded from HuggingFace (~1.6GB).

SenseVoice — ASR + Emotion + Audio Events

SenseVoice is a multi-task speech understanding model. Beyond ASR, it detects speech emotion and audio events.

It uses a non-autoregressive architecture — processing 10 seconds of audio in just 70ms on GPU.

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)
res = model.generate(
    input="audio.wav",
    cache={},
    language="auto",   # auto-detect language
    use_itn=True,      # inverse text normalization (numbers, dates)
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)

The rich_transcription_postprocess function removes the internal tags (like <|zh|>, <|NEUTRAL|>) and returns clean text.
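
If you want to keep the tag information (for example the emotion label) rather than discard it, you can read the tags from the raw text before post-processing. This is a rough sketch that assumes every tag has the <|...|> form shown above; split_tags and the sample string are hypothetical, and the function is not a reimplementation of rich_transcription_postprocess.

```python
import re

# Matches tags of the form <|zh|> or <|NEUTRAL|> and captures the label inside.
TAG_PATTERN = re.compile(r"<\|([^|]*)\|>")


def split_tags(raw_text):
    """Return (list of tag labels, text with all tags removed)."""
    tags = TAG_PATTERN.findall(raw_text)
    clean = TAG_PATTERN.sub("", raw_text).strip()
    return tags, clean


# Hypothetical raw output using only the tags mentioned above.
raw = "<|zh|><|NEUTRAL|>今天天气不错"
tags, clean = split_tags(raw)
print(tags)
print(clean)
```

With a real result you would call split_tags(res[0]["text"]) and still use rich_transcription_postprocess for the final display text.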

SenseVoice + Speaker Diarization

SenseVoice also supports speaker diarization when combined with spk_model:

model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    spk_model="cam++",
    punc_model="ct-punc",
    device="cuda:0",
)
res = model.generate(input="meeting.wav", cache={}, language="auto",
                     use_itn=True, batch_size_s=60, merge_vad=True, merge_length_s=15)

for sent in res[0]["sentence_info"]:
    text = rich_transcription_postprocess(sent["text"])
    print(f"Speaker {sent['spk']}: {text}")

Speech Recognition (Streaming)

For real-time transcription, use the streaming Paraformer model. Audio is processed chunk-by-chunk, and partial results are returned incrementally. This is ideal for live captioning, voice assistants, and real-time meeting transcription.

The cache dictionary maintains state between chunks — it must persist across the entire session.

from funasr import AutoModel
import soundfile

chunk_size = [0, 10, 5]  # [0, 10, 5] = 600ms chunks, 300ms lookahead
encoder_chunk_look_back = 4
decoder_chunk_look_back = 1

model = AutoModel(model="paraformer-zh-streaming")

speech, sample_rate = soundfile.read("audio.wav")
chunk_stride = chunk_size[1] * 960  # 600ms = 9600 samples at 16kHz

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(
        input=speech_chunk, cache=cache, is_final=is_final,
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    print(res)  # partial results for each chunk

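How it works: each model.generate() call consumes one 600 ms chunk and returns the partial text for it; cache carries the model state from one call to the next, and is_final=True on the last chunk flushes any remaining output.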
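
The chunk arithmetic in the example can be checked without loading a model. The sketch below assumes that one chunk_size unit corresponds to 60 ms, which is what the comments in the example imply (10 units = 600 ms, 5 units = 300 ms); streaming_layout is a hypothetical helper.

```python
SAMPLE_RATE = 16000  # Hz, as in the example above
UNIT_MS = 60         # assumption: one chunk_size unit = 60 ms (10 units = 600 ms)


def streaming_layout(chunk_size, num_samples, sample_rate=SAMPLE_RATE):
    """Return stride, lookahead, and chunk count for one streaming session."""
    samples_per_unit = sample_rate * UNIT_MS // 1000  # 960 samples at 16 kHz
    chunk_stride = chunk_size[1] * samples_per_unit
    lookahead = chunk_size[2] * samples_per_unit
    # Same count as int((len(speech) - 1) / chunk_stride + 1) in the example.
    total_chunks = (num_samples - 1) // chunk_stride + 1
    last_chunk = num_samples - (total_chunks - 1) * chunk_stride
    return {
        "chunk_stride": chunk_stride,
        "lookahead": lookahead,
        "total_chunks": total_chunks,
        "last_chunk_samples": last_chunk,
    }


# A 5-second recording at 16 kHz.
layout = streaming_layout([0, 10, 5], num_samples=5 * SAMPLE_RATE)
print(layout)
```

For a 5-second file this gives nine calls to model.generate(): eight full 9600-sample chunks and a final 3200-sample chunk, which is the one that must be sent with is_final=True.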

Voice Activity Detection (Offline)

VAD detects speech segments in audio, returning start/end times in milliseconds. Useful for pre-processing before ASR, or for skipping silence in long recordings.

from funasr import AutoModel

model = AutoModel(model="fsmn-vad")
res = model.generate(input="audio.wav")
print(res)
# [{"key": "audio", "value": [[610, 5530], [7200, 12400], ...]}]
# Each [start_ms, end_ms] is a speech segment
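
To use these segments for slicing or for skipping silence, the millisecond values have to be turned into sample indices. The following sketch works on the segment list alone; summarize_segments is a hypothetical helper, and the 16 kHz sample rate is an assumption you should replace with the rate of your audio.

```python
def summarize_segments(segments, sample_rate=16000):
    """Convert [start_ms, end_ms] pairs to sample spans and measure speech and gaps."""
    spans = [
        (start * sample_rate // 1000, end * sample_rate // 1000)
        for start, end in segments
    ]
    speech_ms = sum(end - start for start, end in segments)
    # Silence between consecutive segments, in milliseconds.
    gaps_ms = [nxt[0] - cur[1] for cur, nxt in zip(segments, segments[1:])]
    return spans, speech_ms, gaps_ms


# The first two segments from the example output above.
segments = [[610, 5530], [7200, 12400]]
spans, speech_ms, gaps_ms = summarize_segments(segments)
print(spans)
print(speech_ms, gaps_ms)
```

Each span can then be used to slice the waveform, for example speech[start:end] on the array returned by soundfile.read.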

Voice Activity Detection (Streaming)

For real-time voice activity detection (e.g., detecting when a user starts/stops speaking), use the streaming VAD mode. Audio is fed in small chunks (e.g., 200ms), and the model reports speech boundaries as they are detected.

from funasr import AutoModel
import soundfile

chunk_size = 200  # process 200ms at a time
model = AutoModel(model="fsmn-vad")

speech, sample_rate = soundfile.read("audio.wav")
chunk_stride = int(chunk_size * sample_rate / 1000)

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache,
                         is_final=is_final, chunk_size=chunk_size)
    if len(res[0]["value"]):
        print(res)

Streaming VAD output formats:
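
As far as I recall from the upstream FunASR README, the streaming value field takes one of these shapes: [[beg, end], ...] for complete segments, [[beg, -1]] when only a start point has been detected, [[-1, end]] when only an end point has been detected, and [] when nothing was detected. Treat that as an assumption and verify it against your installed version. Under that assumption, a small state machine can reassemble complete segments; SegmentAssembler and the sample chunk outputs are hypothetical.

```python
class SegmentAssembler:
    """Reassemble complete [start_ms, end_ms] segments from streaming VAD values."""

    def __init__(self):
        self.open_start = None  # start time of a segment still waiting for its end
        self.segments = []

    def feed(self, value):
        """Consume one chunk's value list; return the segments completed by it."""
        completed = []
        for beg, end in value:
            if beg != -1 and end != -1:
                completed.append([beg, end])  # full segment within one chunk
            elif beg != -1:
                self.open_start = beg  # [[beg, -1]]: speech started
            elif self.open_start is not None:
                completed.append([self.open_start, end])  # [[-1, end]]: speech ended
                self.open_start = None
        self.segments.extend(completed)
        return completed


# Hypothetical per-chunk outputs, one entry per model.generate() call.
chunk_values = [[], [[610, -1]], [], [[-1, 5530]], [[7200, 12400]]]
assembler = SegmentAssembler()
for value in chunk_values:
    for segment in assembler.feed(value):
        print("speech segment:", segment)
```

In the streaming loop above you would call assembler.feed(res[0]["value"]) after each chunk.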

Punctuation Restoration

ASR output typically lacks punctuation. The CT-Transformer model adds commas, periods, and question marks to raw text. It supports both Chinese and English.

from funasr import AutoModel

model = AutoModel(model="ct-punc")
res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res[0]["text"])
# "那今天的会就到这里吧,happy new year,明年见。"

The punctuation model is usually used as part of the pipeline (via punc_model="ct-punc" in AutoModel), but can also be used standalone for post-processing text from other sources.

Speaker Diarization

Speaker diarization answers "who spoke when" by clustering audio segments by speaker identity. FunASR supports this for all three major ASR models: Paraformer, Fun-ASR-Nano, and SenseVoice.

The pipeline works as follows:

  1. VAD segments the audio into speech regions
  2. Each region is split into 1.5-second chunks
  3. CAM++ extracts a 192-dimensional speaker embedding for each chunk
  4. Spectral clustering groups chunks by speaker
  5. Punctuation model segments text into sentences
  6. Each sentence is assigned a speaker label based on time overlap
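
Step 6 can be illustrated in a few lines. This is a simplified sketch of overlap-based assignment, not FunASR's actual implementation: assign_speakers, overlap_ms, and the sample data are hypothetical, and each sentence simply takes the speaker whose turns overlap it the longest.

```python
def overlap_ms(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in milliseconds."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(sentences, speaker_turns):
    """Label each sentence with the speaker that overlaps it the most."""
    labeled = []
    for sent in sentences:
        totals = {}
        for turn in speaker_turns:
            shared = overlap_ms(sent["start"], sent["end"], turn["start"], turn["end"])
            if shared > 0:
                totals[turn["spk"]] = totals.get(turn["spk"], 0) + shared
        spk = max(totals, key=totals.get) if totals else None
        labeled.append({**sent, "spk": spk})
    return labeled


# Hypothetical clustering result (step 4) and sentence boundaries (step 5).
speaker_turns = [
    {"spk": 0, "start": 0, "end": 3000},
    {"spk": 1, "start": 3000, "end": 6000},
]
sentences = [
    {"text": "A", "start": 200, "end": 2800},
    {"text": "B", "start": 2500, "end": 5500},
]
labeled = assign_speakers(sentences, speaker_turns)
print([(s["text"], s["spk"]) for s in labeled])
```

The second sentence straddles both turns but overlaps speaker 1 for 2500 ms and speaker 0 for only 500 ms, so it is labeled speaker 1.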

from funasr import AutoModel

model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    spk_model="cam++",  # or "iic/speech_eres2netv2_sv_zh-cn_16k-common"
)
res = model.generate(input="meeting.wav", batch_size_s=300)

for sent in res[0]["sentence_info"]:
    print(f"[Speaker {sent['spk']}] [{sent['start']}-{sent['end']}ms] {sent['text']}")

You can also specify the number of speakers with preset_spk_num=N if known in advance.
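
For meeting minutes it is often more readable to merge consecutive sentences from the same speaker into a single turn. The sketch below relies only on the sentence_info fields used above (spk, start, end, text); merge_turns and the sample entries are hypothetical. It concatenates text without a separator, which suits Chinese; add a space for languages that need one.

```python
def merge_turns(sentence_info):
    """Merge consecutive sentences with the same speaker into one turn."""
    turns = []
    for sent in sentence_info:
        if turns and turns[-1]["spk"] == sent["spk"]:
            turns[-1]["end"] = sent["end"]
            turns[-1]["text"] += sent["text"]
        else:
            turns.append(
                {
                    "spk": sent["spk"],
                    "start": sent["start"],
                    "end": sent["end"],
                    "text": sent["text"],
                }
            )
    return turns


# Hypothetical entries in the sentence_info shape.
sentence_info = [
    {"spk": 0, "start": 0, "end": 1500, "text": "大家好。"},
    {"spk": 0, "start": 1600, "end": 3200, "text": "今天开会。"},
    {"spk": 1, "start": 3500, "end": 4200, "text": "好的。"},
]
turns = merge_turns(sentence_info)
for turn in turns:
    print(f"[Speaker {turn['spk']}] [{turn['start']}-{turn['end']}ms] {turn['text']}")
```

With a real result you would pass res[0]["sentence_info"].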

Qwen3-ASR — 52 Languages

Qwen3-ASR is an LLM-based ASR system that supports 52 languages with automatic language detection. It draws on the LLM's contextual understanding to improve accuracy on complex speech.

# First install: pip install qwen-asr
from funasr import AutoModel

model = AutoModel(
    model="Qwen/Qwen3-ASR-1.7B",  # or Qwen/Qwen3-ASR-0.6B for smaller model
    hub="hf",
    device="cuda:0",
)

# Auto language detection
res = model.generate(input="audio.wav")
print(res[0]["text"], "| Language:", res[0].get("language", ""))

# With specified language (slightly faster, more accurate)
res = model.generate(input="audio_zh.wav", language="Chinese")

Requires pip install qwen-asr and a GPU with sufficient memory (~4GB for 0.6B, ~8GB for 1.7B).

ONNX Export

Export models to ONNX format for deployment with ONNX Runtime. This enables inference without PyTorch and supports additional optimizations like quantization.

# Export via command line
funasr-export ++model=paraformer ++quantize=false ++device=cpu

# Or via Python
from funasr import AutoModel
model = AutoModel(model="paraformer", device="cpu")
export_dir = model.export(quantize=False)
print(f"Exported to: {export_dir}")

Using the ONNX Model

# pip install funasr-onnx
from funasr_onnx import Paraformer

model = Paraformer(
    "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    batch_size=1, quantize=True
)
result = model(["audio.wav"])
print(result)

ONNX models can be further optimized with onnxslim:

pip install onnxslim
onnxslim model.onnx model_optimized.onnx

API Reference

Complete reference for the two main interfaces: AutoModel() for initialization and model.generate() for inference.

AutoModel() Parameters

Parameter        Type   Default    Description
model            str    -          Model name (from Hub) or local path
device           str    "cuda:0"   Device: "cuda:0", "cpu", "mps", "npu:0"
vad_model        str    None       VAD model for long audio segmentation
vad_kwargs       dict   {}         VAD config, e.g. {"max_single_segment_time": 60000}
punc_model       str    None       Punctuation restoration model
spk_model        str    None       Speaker model for diarization ("cam++" or full model ID)
hub              str    "ms"       "ms" (ModelScope) or "hf" (HuggingFace)
ncpu             int    4          CPU threads for parallel operations
disable_update   bool   False      Skip version check on startup
disable_pbar     bool   False      Disable tqdm progress bars

model.generate() Parameters

Parameter            Type                 Description
input                str / array / list   Audio: file path, URL, numpy array, or list of arrays
cache                dict                 State cache for streaming. Pass {} for first call.
hotword              str / list           Keywords to boost recognition accuracy
language             str                  Language hint: "auto", "zh", "en", "中文", "Chinese", etc.
batch_size_s         int                  Dynamic batch total duration (seconds)
is_final             bool                 Last chunk flag for streaming mode
return_spk_res       bool                 Return speaker diarization in sentence_info
sentence_timestamp   bool                 Return sentence-level timestamps
use_itn              bool                 Apply inverse text normalization (SenseVoice)

Output Fields

Field           Format                                      Description
text            str                                         Recognized text (all ASR models)
timestamp       [[start_ms, end_ms], ...]                   Character/word timestamps
sentence_info   [{text, start, end, spk, timestamp}, ...]   Sentences with speaker labels
value           [[start_ms, end_ms], ...]                   VAD speech segments
spk_embedding   Tensor [N, 192]                             Speaker embeddings (CAM++/ERes2NetV2)

Model Aliases: You can use short names like "paraformer-zh", "fsmn-vad", "ct-punc", "cam++" instead of full model IDs. FunASR resolves them automatically via the built-in alias map.