170x realtime on GPU · 7 models compared

ASR Benchmark 2025
PyTorch Inference

7 models tested on GPU, 3 on CPU. 184 real-world Chinese audio files (192 minutes total). FunASR models vs Whisper variants: speed and accuracy head-to-head on GPU. All models run under PyTorch inference.

GPU Results (PyTorch, GPU)

All models tested on 184 long-form Chinese audio files (44-83 seconds each, 192 minutes total). Sorted by speed. Lower RTF = faster. Lower CER = more accurate.

| Model | Type | Speed (RTF) | Realtime Factor |
|---|---|---|---|
| SenseVoice-Small | NAR | 0.0059 | 170x |
| Paraformer-Large | NAR | 0.0083 | 120x |
| Whisper-large-v3-turbo | AR | 0.0217 | 46x |
| Whisper-large-v3-turbo (FunASR) | AR | 0.0385 | 26x |
| faster-whisper-large-v3 | AR | 0.0464 | 21.5x |
| Fun-ASR-Nano | LLM | 0.0588 | 17x |
| Whisper-large-v3 | AR | 0.0746 | 13.4x |

CPU Results

Same 184-file test set on CPU. Only non-autoregressive FunASR models are practical for CPU inference.

| Model | Type | Speed (RTF) | Realtime Factor |
|---|---|---|---|
| SenseVoice-Small | NAR | 0.058 | 17.2x |
| Paraformer-Large | NAR | 0.064 | 15.6x |
| Fun-ASR-Nano | LLM | 0.278 | 3.6x |

Note: Whisper models (autoregressive decoding) are impractical on CPU: decoding 192 minutes of audio would take over 2 hours. Only non-autoregressive FunASR models achieve real-time or faster CPU inference.

GPU Speed Comparison (Realtime Factor)

SenseVoice-Small: 170x
Paraformer-Large: 120x
Whisper-large-v3-turbo: 46x
Whisper-turbo (FunASR): 26x
faster-whisper-large-v3: 21.5x
Fun-ASR-Nano: 17x
Whisper-large-v3: 13.4x

Key Findings

170x

Realtime on GPU

SenseVoice-Small processes 192 minutes of audio in just 68 seconds on GPU, 12.7x faster than Whisper-large-v3.
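
The arithmetic behind these two figures can be checked directly. The sketch below multiplies the total audio duration by the published RTF to get wall-clock time, then takes the ratio of the published realtime factors (170x and 13.4x). The variable names are illustrative. Using the rounded RTF values instead (0.0746 / 0.0059) gives roughly 12.6x, so the last digit depends on which rounded column is used.

```python
AUDIO_MINUTES = 192           # total test-set duration
SENSEVOICE_GPU_RTF = 0.0059   # published RTF for SenseVoice-Small on GPU

# Wall-clock processing time = audio duration * RTF.
processing_seconds = AUDIO_MINUTES * 60 * SENSEVOICE_GPU_RTF

# Speedup computed from the published realtime factors.
SENSEVOICE_REALTIME = 170.0
WHISPER_V3_REALTIME = 13.4
speedup = SENSEVOICE_REALTIME / WHISPER_V3_REALTIME

print(f"Processing time: {processing_seconds:.0f} s")
print(f"Speedup over Whisper-large-v3: {speedup:.1f}x")
```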

17.2x

Realtime on CPU

SenseVoice-Small runs faster than real time even without a GPU. Whisper models are impractical on CPU for production use.

3 lines

One API for Everything

FunASR includes VAD, punctuation, and speaker diarization in one call. No pipeline glue code needed.

Model Types Explained

Understanding the architecture tradeoffs behind each model category.

| Badge | Architecture | Characteristics |
|---|---|---|
| NAR | Non-Autoregressive | Parallel decoding, extremely fast. Best speed/accuracy tradeoff. |
| AR | Autoregressive | Sequential token-by-token decoding. Slower but widely adopted. |
| LLM | LLM-based ASR | Audio LLM with speech understanding. Versatile but compute-heavy. |
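
To make the AR vs NAR distinction concrete, here is a toy sketch that counts model calls for each decoding style. It is not taken from any of the benchmarked models: the "models" are hypothetical lambda stand-ins that simply replay a fixed token sequence, and the function names are chosen for illustration. The point is only that autoregressive decoding needs one call per output token (plus one to emit end-of-sequence), while non-autoregressive decoding predicts every position in a single call.

```python
from typing import Callable, List, Tuple


def decode_autoregressive(
    next_token: Callable[[List[int]], int], eos: int, max_len: int
) -> Tuple[List[int], int]:
    """Sequential decoding: each token depends on the tokens emitted so far."""
    tokens: List[int] = []
    calls = 0
    while len(tokens) < max_len:
        tok = next_token(tokens)  # one model call per step
        calls += 1
        if tok == eos:
            break
        tokens.append(tok)
    return tokens, calls


def decode_non_autoregressive(
    predict_all: Callable[[int], List[int]], length: int
) -> Tuple[List[int], int]:
    """Parallel decoding: a single model call predicts all positions at once."""
    return predict_all(length), 1


# Hypothetical stand-in "models" that replay a fixed target sequence.
TARGET = [10, 11, 12, 13, 14]
EOS = -1

ar_tokens, ar_calls = decode_autoregressive(
    lambda prefix: TARGET[len(prefix)] if len(prefix) < len(TARGET) else EOS,
    eos=EOS,
    max_len=32,
)
nar_tokens, nar_calls = decode_non_autoregressive(lambda n: TARGET[:n], len(TARGET))

print(f"AR: {ar_calls} model calls, NAR: {nar_calls} model call")
```

On real hardware the gap is smaller than the call count suggests, since each call differs in cost, but the sequential dependency is why AR models benefit less from parallel compute.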

Test Setup

GPU: High-end NVIDIA GPU
Audio Dataset: 184 files, 192 min total
Language: Mandarin Chinese
Duration Range: 44-83 seconds per file
FunASR Version: 1.3.1
Backend: PyTorch 2.x (no vLLM)
CUDA: 12.8 / Driver 550.127
CPU: Intel Xeon (server-grade)

Try FunASR Today

Get started with a single pip install and three lines of code.
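
As a rough illustration of what that looks like, the sketch below uses FunASR's `AutoModel` interface after `pip install funasr`. The model aliases (`paraformer-zh`, `fsmn-vad`, `ct-punc`) follow the FunASR README and are assumptions here, not necessarily the configuration used in this benchmark. The `transcribe` wrapper is added only so the snippet can be loaded without FunASR installed; the three lines inside it are the actual usage.

```python
def transcribe(audio_path: str):
    """Transcribe one audio file with VAD and punctuation in a single call."""
    # Imported lazily so this sketch can be loaded without FunASR installed.
    from funasr import AutoModel

    # Model aliases are assumptions taken from the FunASR README.
    model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")
    return model.generate(input=audio_path)
```

Calling `transcribe("meeting.wav")` (a hypothetical file name) downloads the models on first use and returns the recognition results. Speaker diarization can reportedly be enabled by also passing a speaker model such as `spk_model="cam++"` to `AutoModel`; check the FunASR documentation for the version you install.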