Benchmark

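Speed and accuracy measurements for long-form ASR workloads. The headline result: on this dataset, FunASR models running on CPU (SenseVoice-Small at 17.2x realtime, Paraformer-Large at 15.6x) are faster than OpenAI Whisper-large-v3 running on an H100 GPU (13.4x), and they also reach a lower CER.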


Summary

Metric	Result
Dataset	184 long-form Chinese audio files, 11,539 s total, 192.3 min.
GPU	NVIDIA H100 80GB HBM3.
Best GPU speed	SenseVoice-Small: 169.6x realtime in the full benchmark, 211.8x in the initial run.
Best CPU speed	SenseVoice-Small: 17.2x realtime; Paraformer-Large: 15.6x realtime.
Baseline	OpenAI Whisper-large-v3: 13.4x realtime on GPU.

Results

Model	Device	RTF	Speed (x realtime)	CER	Notes
SenseVoice-Small	GPU	0.005896	169.6x	7.81%	ASR + language / emotion / event tags; CER after tag stripping.
Paraformer-Large	GPU	0.008359	119.6x	10.18%	Fast non-autoregressive Chinese ASR with VAD/punctuation pipeline.
Fun-ASR-Nano	GPU	0.058803	17.0x	8.06%	LLM-based ASR for Chinese, English, Japanese, seven Chinese dialect groups, and 26 regional accents; supports hotwords. Reliable checkpoint-native timestamps are not available (#106).
GLM-ASR-Nano	GPU	0.026974	37.1x	31.07%	LLM-based multilingual ASR.
Whisper-large-v3-turbo (OpenAI)	GPU	0.021708	46.1x	21.71%	OpenAI Whisper implementation.
Whisper-large-v3 (OpenAI)	GPU	0.074694	13.4x	20.02%	Baseline for large Whisper quality.
SenseVoice-Small	CPU	0.057988	17.2x	7.81%	CPU run produced by the run_remaining.py benchmark script.
Paraformer-Large	CPU	0.064056	15.6x	10.18%	Viable on CPU for batch jobs.
Fun-ASR-Nano	CPU	0.274318	3.6x	8.06%	Heavier LLM-based model, but still faster than realtime on CPU.

Methodology

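Measurements were collected on the 184 audio files with the benchmark scripts in the workspace. The real-time factor (RTF) is total inference time divided by total audio duration, so lower is better; speed is 1 / RTF, reported as a multiple of realtime. Character error rate (CER) is computed after model-specific text cleanup, most notably stripping the tags that SenseVoice emits.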

python benchmark/run_full_benchmark.py
python benchmark/run_remaining.py
python benchmark/fix_sensevoice_cer.py
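
The RTF and speed definitions above can be checked against the Results table with a few lines of Python. This is a minimal sketch, not the code in the benchmark scripts: the function names `rtf` and `speed` are illustrative, and the implied inference time is derived here from the reported RTF and the 11,539 s of audio rather than taken from a measurement log.

```python
def rtf(inference_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: total inference time / total audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


def speed(rtf_value: float) -> float:
    """Speed as a multiple of realtime: 1 / RTF."""
    if rtf_value <= 0:
        raise ValueError("rtf_value must be positive")
    return 1.0 / rtf_value


# Total audio duration from the Summary table.
AUDIO_SECONDS = 11_539.0

# RTF values copied from the Results table.
REPORTED_RTF = {
    ("SenseVoice-Small", "GPU"): 0.005896,
    ("SenseVoice-Small", "CPU"): 0.057988,
    ("Whisper-large-v3 (OpenAI)", "GPU"): 0.074694,
}

for (model, device), value in REPORTED_RTF.items():
    # Derived, not measured: inference time implied by the reported RTF.
    implied_seconds = value * AUDIO_SECONDS
    print(
        f"{model} on {device}: {speed(value):.1f}x realtime, "
        f"about {implied_seconds:.0f} s of inference"
    )
```

Because RTF is computed from totals, long files weigh more than short ones; averaging per-file RTFs would generally give a different number.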

Use these numbers as practical guidance, not a universal leaderboard: hardware, batch size, audio length, decoding options, and text normalization all affect results.
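
To illustrate why text normalization matters for CER, the sketch below computes a character-level edit distance before and after cleanup. It makes several assumptions that are not confirmed by this page: the tag shape `<|...|>` is how SenseVoice output is assumed to look, the `normalize` rules (NFKC, lowercasing, dropping punctuation and whitespace) are illustrative, and the actual cleanup in `fix_sensevoice_cer.py` may differ. The sample sentences are made up.

```python
import re
import unicodedata

# Assumed tag shape, e.g. "<|zh|><|NEUTRAL|><|Speech|>"; adjust to the real output.
TAG_PATTERN = re.compile(r"<\|[^|]*\|>")


def strip_tags(text: str) -> str:
    """Remove language / emotion / event tags from a hypothesis."""
    return TAG_PATTERN.sub("", text)


def normalize(text: str) -> str:
    """Illustrative normalization: NFKC, lowercase, drop punctuation and whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance, keeping only one previous row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, remove_tags: bool = True) -> float:
    """Character error rate: edit distance / reference length, after cleanup."""
    ref_n = normalize(ref)
    hyp_n = normalize(strip_tags(hyp) if remove_tags else hyp)
    if not ref_n:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref_n, hyp_n) / len(ref_n)


# Made-up example pair, not taken from the benchmark dataset.
reference = "今天天气很好。"
hypothesis = "<|zh|><|NEUTRAL|><|Speech|>今天天气真好"

print(f"CER with tags stripped: {cer(reference, hypothesis):.2%}")
print(f"CER with tags left in:  {cer(reference, hypothesis, remove_tags=False):.2%}")
```

Leaving the tags in inflates the error count with characters the speaker never said, which is why CER values are only comparable when the same cleanup is applied to every model.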

How to Choose

Need	Recommended model
Fastest production transcription	SenseVoice-Small or Paraformer-Large.
CPU batch transcription	SenseVoice-Small first; Paraformer-Large for Chinese production pipelines.
Chinese/English/Japanese LLM-style recognition with dialect and accent coverage	Fun-ASR-Nano; use the separate Fun-ASR-MLT-Nano checkpoint for 31 languages, and use vLLM for higher LLM decoding throughput.
OpenAI-compatible local endpoint	funasr-server with model alias `sensevoice`, `paraformer`, or `fun-asr-nano`.
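
For the OpenAI-compatible endpoint, a client could look roughly like the sketch below, which uses only the Python standard library. Everything about the server here is an assumption: the base URL `http://localhost:8000/v1` is hypothetical, and the `/audio/transcriptions` route, the `model` and `file` form fields, and the `text` response field follow the OpenAI audio transcription API rather than anything documented on this page. Check the funasr-server documentation for the real host, port, and any authentication it expects.

```python
import json
import urllib.request
import uuid
from pathlib import Path

# Hypothetical address; change it to wherever your funasr-server instance listens.
BASE_URL = "http://localhost:8000/v1"


def build_transcription_request(
    audio_name: str, audio_bytes: bytes, model: str = "sensevoice"
) -> urllib.request.Request:
    """Build a multipart/form-data POST shaped like an OpenAI transcription call."""
    boundary = uuid.uuid4().hex
    parts = [
        # Form field carrying the model alias.
        (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="model"\r\n\r\n'
            f"{model}\r\n"
        ).encode(),
        # Form field carrying the audio file.
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{audio_name}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode(),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ]
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(path: str, model: str = "sensevoice") -> str:
    """Send one audio file and return the transcript (assumes a JSON body with 'text')."""
    audio = Path(path)
    request = build_transcription_request(audio.name, audio.read_bytes(), model)
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["text"]
```

With a server running, `transcribe("meeting.wav", model="paraformer")` would return the transcript of a local file; the file name is a placeholder. Swapping the `model` argument between the aliases listed above is the only change needed to compare models through the same endpoint.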