Open-source speech understanding toolkit
Production-ready ASR, VAD, punctuation, speaker diarization, emotion detection, and audio event recognition with one unified Python interface.
from funasr import AutoModel

model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    spk_model="cam++",
)
res = model.generate(input="meeting.wav")
print(res[0]["sentence_info"])
Everything needed for speech understanding, from raw audio segmentation to speaker-aware transcripts.
Streaming and offline ASR with VAD segmentation. Process long-form audio with a single API call.
Fun-ASR-Nano covers 31 languages; Qwen3-ASR covers 52 languages with automatic language detection.
Identify who spoke when, then attach speaker labels to sentence-level ASR output.
SenseVoice detects emotion and audio events including background music, applause, laughter, and crying.
Non-autoregressive models support fast batch and realtime workloads across common deployment targets.
Fine-tune with DeepSpeed, export to ONNX, and deploy through Docker runtime or the Python SDK.
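The speaker diarization feature above pairs "who spoke when" with sentence-level ASR output. One common way to do that pairing is to give each sentence the speaker whose turn overlaps it the most. The sketch below illustrates that idea in plain Python. It is not FunASR's internal implementation: the `Segment` class, the `attach_speakers` helper, and the sample timings are all hypothetical, and only the `spk` and `text` keys mirror the output shown elsewhere on this page.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A labeled time span (hypothetical helper type, not a FunASR class)."""
    start_ms: int
    end_ms: int
    label: str  # speaker id for a diarization turn, text for an ASR sentence


def overlap_ms(a: Segment, b: Segment) -> int:
    """Length of the time overlap between two segments, in milliseconds."""
    return max(0, min(a.end_ms, b.end_ms) - max(a.start_ms, b.start_ms))


def attach_speakers(sentences, turns):
    """Label each sentence with the speaker whose turn overlaps it the most.

    Sentences that overlap no turn at all get spk=None.
    """
    labeled = []
    for sent in sentences:
        best = max(turns, key=lambda t: overlap_ms(sent, t), default=None)
        has_overlap = best is not None and overlap_ms(sent, best) > 0
        labeled.append({
            "spk": best.label if has_overlap else None,
            "text": sent.label,
            "start_ms": sent.start_ms,
            "end_ms": sent.end_ms,
        })
    return labeled


# Made-up diarization turns and ASR sentences, for illustration only.
turns = [Segment(0, 4000, "0"), Segment(4000, 9000, "1")]
sentences = [
    Segment(200, 3500, "Good morning, everyone."),
    Segment(3800, 8700, "Shall we start with the roadmap?"),
]

labeled = attach_speakers(sentences, turns)
for item in labeled:
    print(f"[Speaker {item['spk']}] {item['text']}")
```

Maximum overlap is a simple rule that tolerates small boundary disagreements between the diarizer and the recognizer. A sentence that genuinely spans two speakers would still receive a single label under this rule, so real pipelines may split on word-level timestamps instead.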
Pre-trained industrial models ready for recognition, segmentation, and speech understanding workflows.
End-to-end ASR trained on tens of millions of hours of speech data. Covers 31 languages plus dialects and accents, with lyric recognition, timestamps, and hotwords.
Non-autoregressive Chinese and English ASR with streaming and offline variants for production systems.
Multi-task speech understanding for ASR, language ID, emotion, and audio events across five languages.
LLM-based ASR with 52 languages, contextual understanding, and automatic language detection.
Install the package, compose the pipeline, and run recognition from Python.
pip install funasr
# Or install the latest version from source:
pip install git+https://github.com/modelscope/FunASR.git
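from funasr import AutoModel

# Compose the pipeline: ASR + VAD + punctuation + speaker model.
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    spk_model="cam++",
)

# batch_size_s enables dynamic batching by total audio duration (seconds).
res = model.generate(input="meeting.wav", batch_size_s=300)

# Print each sentence with its speaker label.
for sent in res[0]["sentence_info"]:
    print(f"[Speaker {sent['spk']}] {sent['text']}")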
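The loop above prints one line per sentence, so a long monologue shows up as many lines with the same speaker. A minimal sketch of merging consecutive sentences from the same speaker into turns follows. It assumes only the `spk` and `text` keys used above. The `merge_turns` helper and the sample data are hypothetical and not part of FunASR.

```python
def merge_turns(sentence_info, sep=" "):
    """Merge consecutive sentences by the same speaker into one turn.

    `sentence_info` is a list of dicts with "spk" and "text" keys.
    Use sep="" for languages written without spaces, such as Chinese.
    """
    turns = []
    for sent in sentence_info:
        if turns and turns[-1]["spk"] == sent["spk"]:
            # Same speaker as the previous sentence: extend the current turn.
            turns[-1]["text"] += sep + sent["text"]
        else:
            # Speaker changed (or first sentence): start a new turn.
            turns.append({"spk": sent["spk"], "text": sent["text"]})
    return turns


# Made-up sample standing in for res[0]["sentence_info"].
sample = [
    {"spk": 0, "text": "Let's begin."},
    {"spk": 0, "text": "First item is the release plan."},
    {"spk": 1, "text": "The build is ready."},
    {"spk": 0, "text": "Great."},
]

merged = merge_turns(sample)
for turn in merged:
    print(f"[Speaker {turn['spk']}] {turn['text']}")
```

With real output, the call would be `merge_turns(res[0]["sentence_info"])`. The helper copies only the two keys it needs, so any other fields in the original entries are left out of the merged turns.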
Related projects around ASR, speech understanding, video clipping, and voice generation.
The latest ASR large model with multilingual recognition, timestamps, speaker diarization, and hotwords.
Multi-task speech understanding for ASR, emotion detection, and audio event recognition.
AI video clipping powered by FunASR and LLM-assisted editing workflows.
Natural speech generation with multilingual support plus timbre and emotion control.