7 models tested on GPU, 3 on CPU. 184 real-world Chinese audio files (192 minutes total). FunASR models vs Whisper variants: speed and accuracy head-to-head on GPU. All models run under PyTorch inference.
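All models were tested on 184 long-form Chinese audio files (44-83 seconds each, 192 minutes total). Rows are sorted by speed. RTF (real-time factor) is processing time divided by audio duration, so lower is faster. CER (character error rate) measures transcription errors, so lower is more accurate.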

| Model | Type | Speed (RTF) |
|---|---|---|
| SenseVoice-Small | NAR | 0.0059 |
| Paraformer-Large | NAR | 0.0083 |
| Whisper-large-v3-turbo | AR | 0.0217 |
| Whisper-large-v3-turbo (FunASR) | AR | 0.0385 |
| faster-whisper-large-v3 | AR | 0.0464 |
| Fun-ASR-Nano | LLM | 0.0588 |
| Whisper-large-v3 | AR | 0.0746 |

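
The two metrics used in this benchmark can be sketched in a few lines. This is a minimal illustration of the standard definitions, not the benchmark's own scoring script: the function names `rtf` and `cer` are hypothetical, and any text normalization the benchmark applies before scoring (punctuation stripping, number formatting) is not known and is not modeled here.

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # Levenshtein distance by dynamic programming, keeping one row at a time.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference char missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra char in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(reference)


# 6 seconds of compute for 60 seconds of audio gives RTF 0.1.
print(rtf(6.0, 60.0))
# One wrong character out of six gives a CER of about 0.167.
print(cer("今天天气很好", "今天天汽很好"))
```

An RTF below 1.0 means the model transcribes faster than the audio plays back.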
Same 184-file test set on CPU. Only the FunASR models are practical for CPU inference, the non-autoregressive ones most of all.

| Model | Type | Speed (RTF) |
|---|---|---|
| SenseVoice-Small | NAR | 0.058 |
| Paraformer-Large | NAR | 0.064 |
| Fun-ASR-Nano | LLM | 0.278 |

SenseVoice-Small processes 192 minutes of audio in just 68 seconds on GPU. 12.7x faster than Whisper-large-v3.
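
The headline numbers follow from the GPU table by simple arithmetic. The sketch below uses only the published, rounded RTF values, so the speedup comes out at roughly 12.6x; the 12.7x figure was presumably computed from unrounded measurements, which is an assumption rather than something the page states.

```python
AUDIO_MINUTES = 192  # total test-set duration from the benchmark description

# Rounded GPU RTF values copied from the table.
GPU_RTF = {"SenseVoice-Small": 0.0059, "Whisper-large-v3": 0.0746}


def processing_seconds(rtf: float, audio_minutes: float = AUDIO_MINUTES) -> float:
    """Wall-clock seconds needed to transcribe the test set at a given RTF."""
    return rtf * audio_minutes * 60


sensevoice_s = processing_seconds(GPU_RTF["SenseVoice-Small"])
whisper_s = processing_seconds(GPU_RTF["Whisper-large-v3"])
speedup = whisper_s / sensevoice_s

print(f"SenseVoice-Small: {sensevoice_s:.0f} s")   # about 68 s
print(f"Whisper-large-v3: {whisper_s:.0f} s")
print(f"Speedup from rounded RTFs: {speedup:.1f}x")
```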
SenseVoice runs faster-than-realtime even without GPU. Whisper models are impractical on CPU for production use.
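
"Faster than realtime" simply means an RTF below 1.0. As a quick check against the CPU table, the reciprocal of the RTF gives how many seconds of audio are transcribed per second of compute. The helper name `realtime_multiple` is illustrative only.

```python
# CPU RTF values copied from the table.
CPU_RTF = {
    "SenseVoice-Small": 0.058,
    "Paraformer-Large": 0.064,
    "Fun-ASR-Nano": 0.278,
}


def realtime_multiple(rtf: float) -> float:
    """Seconds of audio transcribed per second of compute (1 / RTF)."""
    return 1.0 / rtf


for name, value in CPU_RTF.items():
    status = "faster than realtime" if value < 1.0 else "slower than realtime"
    print(f"{name}: {realtime_multiple(value):.1f}x realtime ({status})")
```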
FunASR includes VAD, punctuation, and speaker diarization in one call. No pipeline glue code needed.
Understanding the architecture tradeoffs behind each model category.

| Type | Architecture | Characteristics |
|---|---|---|
| NAR | Non-Autoregressive | Parallel decoding, extremely fast. Best speed/accuracy tradeoff. |
| AR | Autoregressive | Sequential token-by-token decoding. Slower but widely adopted. |
| LLM | LLM-based ASR | Audio LLM with speech understanding. Versatile but compute-heavy. |

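
The speed gap between AR and NAR models comes largely from how many decoder passes each needs. The toy sketch below counts decoder calls for both styles. It assumes stand-in "models" (`toy_next_token`, `toy_all_tokens`) that simply replay a fixed transcript; real decoders differ in many details, so treat this as an illustration of the control flow only.

```python
from typing import Callable, List, Tuple

EOS = "<eos>"


def decode_autoregressive(
    next_token: Callable[[List[str]], str], max_len: int = 100
) -> Tuple[List[str], int]:
    """One decoder call per token; each call depends on all previous tokens."""
    tokens: List[str] = []
    calls = 0
    while len(tokens) < max_len:
        tok = next_token(tokens)
        calls += 1
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens, calls


def decode_non_autoregressive(
    all_tokens: Callable[[], List[str]]
) -> Tuple[List[str], int]:
    """A single decoder call emits every output position at once."""
    return all_tokens(), 1


# Hypothetical stand-ins for real models: they replay a fixed transcript.
TARGET = list("语音识别")


def toy_next_token(prefix: List[str]) -> str:
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else EOS


def toy_all_tokens() -> List[str]:
    return list(TARGET)


ar_tokens, ar_calls = decode_autoregressive(toy_next_token)
nar_tokens, nar_calls = decode_non_autoregressive(toy_all_tokens)
print(ar_tokens == nar_tokens, ar_calls, nar_calls)
```

Both paths produce the same four characters, but the autoregressive loop needs five sequential calls (four tokens plus the end marker) while the non-autoregressive path needs one. The sequential calls cannot be parallelized across output positions, which is why AR cost grows with transcript length.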
Get started with a single pip install and three lines of code.
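
A minimal sketch of what that could look like, assuming the `AutoModel` interface and the model aliases (`paraformer-zh`, `fsmn-vad`, `ct-punc`) used in FunASR's public README. The page's own snippet is not reproduced here, so check the names against the current FunASR documentation. The wrapper function `transcribe` is hypothetical and exists only to keep the example importable on machines without FunASR installed.

```python
# Install first (shell): pip install -U funasr


def transcribe(audio_path: str, device: str = "cpu") -> str:
    """Transcribe one audio file with VAD and punctuation restoration."""
    # Imported lazily so this module loads even when funasr is not installed.
    from funasr import AutoModel

    model = AutoModel(
        model="paraformer-zh",  # assumed alias for the Chinese Paraformer model
        vad_model="fsmn-vad",   # assumed alias: voice activity detection
        punc_model="ct-punc",   # assumed alias: punctuation restoration
        device=device,          # e.g. "cuda:0" when a GPU is available
    )
    result = model.generate(input=audio_path)
    # generate() is assumed to return a list of dicts with a "text" field.
    return result[0]["text"]
```

Calling `transcribe("meeting.wav")` with a real audio path would download the models on first use and return the punctuated transcript; the file name here is a placeholder.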