Migrate from Whisper or Cloud ASR to FunASR
Use this guide when you already run a Whisper, OpenAI, cloud ASR, or custom speech pipeline and want to decide whether switching to FunASR is worthwhile. Compare quality, speed, cost, and deployment fit on audio that resembles your real workload.
When FunASR is a good fit
- Private or self-hosted transcription where audio should stay inside your environment.
- High-throughput long-form transcription for meetings, archives, media, or call recordings.
- Speaker-aware transcripts with VAD, punctuation, timestamps, and diarization in one pipeline.
- An OpenAI-compatible audio endpoint for agents, Dify, LangChain, AutoGen, or internal apps.
- Streaming ASR or live captions with WebSocket/runtime service support.
- CPU-viable smoke tests before moving to GPU deployment.
Stay on your current pipeline if you need a fully managed service or a vendor SLA, or if your own benchmark shows that FunASR does not yet handle your language or domain well enough.
Fast evaluation plan
- Pick 20-50 representative audio files, including short clips, long recordings, noisy samples, different speakers, and target languages or dialects.
- Run your current Whisper or cloud ASR pipeline exactly as you use it in production. Save transcripts, latency, cost, and failure cases.
- Run FunASR locally with the tutorial, then choose a deployment path from the deployment matrix.
- Compare outputs using human review or your normal WER/CER process. Do not base the decision on a single clean demo file.
- Run the migration benchmark example to write JSONL and Markdown summaries for your own audio folder.
- Run the OpenAI-compatible API Python smoke test if your application already uses OpenAI-style clients.
- Record warmup time, model download time, device, GPU/CPU type, batch size, and audio duration separately from steady-state throughput.
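The last step of the plan, keeping warmup apart from steady-state throughput, can be sketched as a small timing harness. This is a minimal illustration, not part of FunASR: `benchmark` and its `transcribe` argument are hypothetical names, where `transcribe` is any callable that wraps your own ASR call, and audio durations are assumed to be known in advance.

```python
import time
from statistics import mean


def benchmark(transcribe, files, durations_s, warmup_runs=1, clock=time.perf_counter):
    """Time `transcribe` per file and report warmup separately from steady state.

    transcribe  : hypothetical callable that takes an audio path and runs your ASR call
    files       : list of audio paths
    durations_s : dict mapping each path to its audio duration in seconds
    """
    if not files:
        raise ValueError("need at least one audio file")

    # Warmup: the first call(s) usually include model load and device initialization.
    start = clock()
    for _ in range(warmup_runs):
        transcribe(files[0])
    warmup_s = clock() - start

    # Steady state: time every file individually after warmup.
    latencies = []
    for path in files:
        start = clock()
        transcribe(path)
        latencies.append(clock() - start)

    total_audio_s = sum(durations_s[path] for path in files)
    return {
        "warmup_s": warmup_s,
        "mean_latency_s": mean(latencies),
        # Real-time factor: processing time / audio time; below 1 is faster than real time.
        "rtf": sum(latencies) / total_audio_s,
    }
```

Run the same harness against your current pipeline and against FunASR, with the same files and the same hardware, so the two sets of numbers are comparable.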
Feature mapping
| Existing workflow | FunASR path | What to validate |
|---|---|---|
| Whisper file transcription | Tutorial · Model selection | Transcript quality, timestamps, speed, model download, CPU/GPU behavior. |
| Whisper plus pyannote | spk_model="cam++" with VAD and punctuation | Speaker labels, speaker changes, overlapping speech, long silences. |
| OpenAI audio API or cloud batch ASR | OpenAI-compatible API · Kubernetes template · JS/TS recipes | /v1/audio/transcriptions, response format, client compatibility, upload limits. |
| Dify/LangChain/AutoGen agent audio | Agent and API recipes or MCP server | Tool latency, file handling, auth boundary, error reporting. |
| Live captions or call-center streaming | Realtime examples | Chunking, endpointing, reconnects, backpressure, partial/final result behavior. |
| Subtitle generation | Subtitle generator | Segment readability, line length, speaker labels, SRT/VTT compatibility. |
| Offline archive processing | Batch ASR example | Manifest handling, retries, progress logs, throughput, failed-file recovery. |
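For the OpenAI-compatible API row, client compatibility can be checked without extra dependencies. The sketch below builds a multipart POST to `/v1/audio/transcriptions` using only the standard library. The base URL, the `sensevoice` model name, and the form fields mirror the curl example in this guide; `build_transcription_request` and `send` are hypothetical helper names, and the response shape is whatever your server returns.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_transcription_request(audio_path, base_url="http://localhost:8000",
                                model="sensevoice", response_format="verbose_json"):
    """Build (but do not send) a multipart/form-data transcription request."""
    path = Path(audio_path)
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text form fields.
    for name, value in (("model", model), ("response_format", response_format)):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    # The audio file itself, sent as raw bytes.
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{path.name}"\r\nContent-Type: application/octet-stream'
        "\r\n\r\n".encode()
        + path.read_bytes()
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def send(request, timeout_s=300):
    """Send the request to a running server and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

With a server running, `send(build_transcription_request("sample.wav"))` returns the decoded response, which you can diff against what your current OpenAI-style client receives.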
Minimal local comparison
Install FunASR and run the same file you used for your baseline. For a folder-level evaluation, use benchmark_funasr.py to generate results.jsonl and summary.md.
pip install funasr

from funasr import AutoModel

model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",
    spk_model="cam++",
    device="cuda",  # use "cpu" for a portable smoke test
)
result = model.generate(input="sample.wav")
print(result)
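To compare this output with your baseline, it helps to flatten each result into one record per file. The sketch below assumes that `generate()` returns a list of dicts with a `"text"` key and, when a speaker model is enabled, an optional `"sentence_info"` list carrying `spk`, `start`, `end`, and `text`. Verify that shape against your FunASR version; `to_record` and `append_jsonl` are hypothetical helpers, not FunASR functions.

```python
import json


def to_record(audio_path, result, elapsed_s):
    """Flatten one generate() result into a JSON-serializable dict.

    Assumption: `result` is a list of dicts; missing keys fall back to empty values.
    """
    first = result[0] if result else {}
    segments = [
        {
            "spk": seg.get("spk"),
            "start": seg.get("start"),
            "end": seg.get("end"),
            "text": seg.get("text", ""),
        }
        for seg in first.get("sentence_info", [])
    ]
    return {
        "audio": str(audio_path),
        "text": first.get("text", ""),
        "segments": segments,
        "elapsed_s": round(elapsed_s, 3),
    }


def append_jsonl(path, record):
    """Append one record as a single JSON line, keeping non-ASCII text readable."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Writing the baseline pipeline's output in the same record format makes side-by-side review simpler.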
For a repeatable folder benchmark:
python examples/migration/benchmark_funasr.py \
    --input /path/to/audio_samples \
    --recursive \
    --model iic/SenseVoiceSmall \
    --device cuda \
    --spk-model cam++ \
    --output-dir outputs/funasr_migration_eval
For an API-style comparison:
pip install funasr fastapi uvicorn python-multipart
funasr-server --model sensevoice --device cuda
curl http://localhost:8000/v1/audio/transcriptions \
    -F file=@sample.wav \
    -F model=sensevoice \
    -F response_format=verbose_json
Quality and speed checklist
- Audio duration, language, domain, sample rate, channel count, and speaker count.
- Model name, model version, FunASR version, Python/PyTorch/CUDA versions, and Docker image tag if used.
- Hardware, device mode, batch size, streaming chunk size, and whether warmup/model download is excluded.
- WER/CER or human review notes for names, numbers, punctuation, diarization, timestamps, and domain terms.
- Latency, throughput, GPU/CPU memory, cost per hour of audio, and failed-file rate.
- Operational requirements: authentication, upload limits, TLS, logs, monitoring, retries, and retention rules.
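If you do not already have a WER/CER process, the metrics in the checklist above can be approximated with a plain edit-distance calculation. This is a minimal sketch rather than a replacement for an established scoring tool. It assumes both transcripts have already been normalized the same way (case, punctuation, number formatting), which is not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a two-row DP table."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces; useful for languages without word breaks."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)
```

Score both pipelines against the same human reference transcripts, and review names, numbers, and domain terms by hand, since a single aggregate rate can hide those errors.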
Rollout checklist
- Keep the old pipeline available until FunASR passes your representative benchmark.
- Start with an internal endpoint or batch job before exposing a public API.
- Add request IDs and log audio duration, model, device, latency, and error type for every request.
- Pin the model alias and deployment command in your runbook.
- Test noisy audio, silence, overlapping speakers, long files, non-UTF-8 filenames, and network interruptions.
- Open a Deployment Help issue with your command, logs, model, device, and sample characteristics if you hit a blocker.
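The request ID and per-request logging item in the checklist above can be sketched as a thin wrapper around the transcription call. The field names and the `log_request` helper are illustrative assumptions, not part of FunASR; `transcribe` stands for whatever callable your service uses.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("asr.requests")


def log_request(transcribe, audio_path, audio_duration_s, model, device):
    """Call `transcribe` and emit one JSON log line per request, on success or failure."""
    entry = {
        "request_id": uuid.uuid4().hex,
        "audio_duration_s": audio_duration_s,
        "model": model,
        "device": device,
        "error_type": None,
    }
    start = time.perf_counter()
    try:
        return transcribe(audio_path)
    except Exception as exc:
        # Record the exception class, then re-raise so callers still see the failure.
        entry["error_type"] = type(exc).__name__
        raise
    finally:
        entry["latency_s"] = round(time.perf_counter() - start, 3)
        logger.info(json.dumps(entry))
```

One JSON line per request keeps the logs easy to aggregate into failed-file rate and latency figures during rollout.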