Deployment Matrix

Choose the shortest path for a product, demo, benchmark, or internal workflow. Start with the smallest surface that satisfies the job, then move to heavier runtimes only when throughput, latency, or integration requirements demand it.


Quick decision table

Path	Best for	Start here	Operational notes
Colab notebook	Browser smoke tests, first evaluation, shareable demos	Colab quickstart	No local setup; the first run downloads model files, and a GPU runtime is faster.
Python API	Notebooks, offline jobs, first model evaluation	Tutorial	Lowest ceremony; caller owns batching, retries, and files.
llama.cpp / GGUF binary	CPU or edge transcription with no Python runtime; Linux Vulkan GPUs for SenseVoiceSmall	v0.1.8 binaries · Linux Vulkan tarball · llama.cpp docs	Download GGUF models with the bundled script. Use `--backend vulkan` on Linux Vulkan systems, or use the CPU packages for portable smoke tests.
OpenAI-compatible API	Private speech API, agents, Dify/LangChain/AutoGen/n8n-style clients	OpenAI API · Python smoke test · JS/TS recipes · Workflow recipes · Gradio demo · Security guide · Postman collection · OpenAPI spec	Easiest integration for apps and workflow engines that already support OpenAI audio APIs or multipart HTTP nodes.
Docker Compose API	Reproducible local smoke test or small internal service	OpenAI API Docker docs · Python smoke test	CPU by default; adapt the image before using CUDA in containers.
Kubernetes API	Internal speech API for cluster services	Kubernetes template · Python smoke test	Private `ClusterIP` by default; add auth, TLS, network policy, and GPU scheduling before broader exposure.
Runtime WebSocket service	Live captions, meetings, call-center streams	Runtime docs	Use when partial results, endpointing, or long-lived audio streams matter.
vLLM acceleration	Higher-throughput LLM-based ASR with Fun-ASR-Nano	vLLM guide	Use for LLM decoder throughput; does not apply to non-autoregressive Paraformer.
MCP server	Claude/Cursor/desktop agent speech tools	MCP example	Good when the ASR result should be exposed as a local tool.
Subtitle generator	SRT/VTT from long audio or video	Subtitle generator	Use verbose segments and speaker labels when readability matters.
Batch ASR script	Archives, meetings, datasets, repeated offline runs	Batch example	Add queueing, manifests, and retry logs for production use.
Triton runtime	Specialized high-performance serving	Triton runtime docs	Heavier setup; choose when your team already operates Triton/GPU serving.

Common choices

Try FunASR in five minutes

Use the Colab quickstart for a browser-only smoke test, or use the Python API from the tutorial for local work. Either one is the shortest route for validating installation, model download, device selection, and output shape. If you are unsure which model to start with, use the model selection guide.
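As a minimal sketch of the local Python route, the snippet below loads a model and checks the output shape. It assumes the `funasr` package is installed and that `AutoModel(...).generate(input=...)` returns a list of dicts with a `text` key; the model id `iic/SenseVoiceSmall`, the file name `sample.wav`, and the helper names are illustrative choices, not something this page prescribes.

```python
from typing import Any


def load_model(model_name: str = "iic/SenseVoiceSmall", device: str = "cpu"):
    """Load a FunASR model. The import is deferred so the helper below works without funasr."""
    from funasr import AutoModel  # assumption: installed via `pip install funasr`

    # The first call downloads model files, so expect a delay on a fresh machine.
    return AutoModel(model=model_name, device=device)


def check_output_shape(results: Any) -> list[str]:
    """Return transcript strings; raise if results is not a non-empty list of dicts with "text"."""
    if not isinstance(results, list) or not results:
        raise ValueError("expected a non-empty list of results")
    texts = []
    for item in results:
        if not isinstance(item, dict) or "text" not in item:
            raise ValueError(f"unexpected result item: {item!r}")
        texts.append(item["text"])
    return texts


# Usage (hypothetical local file):
#   model = load_model(device="cpu")
#   print(check_output_shape(model.generate(input="sample.wav")))
```

Keeping the shape check separate from model loading makes it easy to reuse the same assertion against other deployment paths later.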

Ship a no-Python edge binary

Use the llama.cpp / GGUF runtime when you need a self-contained SenseVoice, Paraformer, or Fun-ASR-Nano binary. Download the v0.1.8 binaries; Linux Vulkan users can select `--backend vulkan` with the Vulkan tarball.

Replace cloud transcription locally

Use the OpenAI-compatible API. Start with sensevoice, run the bash smoke test or Python smoke test, then connect existing SDK or HTTP clients using the client recipes or JavaScript/TypeScript recipes. For cluster rollout, use the Kubernetes template. For Dify, n8n, or webhook workers, use the workflow recipes. For GUI API checks, import the Postman collection or launch the Gradio demo; for gateways and developer portals, use the OpenAPI spec and the security guide.
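For clients that cannot use an SDK, a standard-library sketch of the multipart request might look like the following. It assumes the server mirrors the OpenAI audio route `POST /v1/audio/transcriptions` with `file` and `model` form fields and a JSON body containing `text`; the base URL `http://localhost:8000/v1` and the function names are assumptions, so check your own deployment for the real host and port.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields: dict[str, str], filename: str, audio: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one audio file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + audio
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost:8000/v1", model: str = "sensevoice") -> str:
    """POST an audio file to the (assumed) transcription route and return the text field."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart({"model": model}, os.path.basename(path), fh.read())
    request = urllib.request.Request(
        f"{base_url}/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["text"]
```

The same body works from workflow engines that expose a generic multipart HTTP node, since only the `file` and `model` fields are sent.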

Run a repeatable container demo

cd examples/openai_api
cp .env.example .env
docker compose up --build

Keep CPU mode until you have a CUDA-capable PyTorch/FunASR image.
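Because the image build and service startup can take a while, a script that follows `docker compose up` may want to wait until the port accepts connections before sending a smoke request. The helper below is a generic TCP readiness poll, not part of the example itself; the host and port in the usage comment are assumptions, so read the actual port from your `.env`.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection succeeds, or False if the deadline passes first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Not listening yet (or unreachable); pause before the next attempt.
            time.sleep(interval_s)
    return False


# Usage (hypothetical port):
#   if not wait_for_port("127.0.0.1", 8000, timeout_s=300):
#       raise SystemExit("API did not come up in time")
```

An open port only shows that the process is listening; follow it with a real transcription request to confirm the model loaded.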

Serve live audio

Use the runtime WebSocket service. Validate chunk size, VAD, endpointing, punctuation, speaker diarization, reconnect behavior, and client backpressure with real audio.
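To make chunk size and backpressure concrete while testing, the sketch below splits raw PCM into fixed-duration chunks and pushes them through a bounded queue so a slow sender stalls the producer instead of buffering without limit. It is transport-agnostic and does not implement the runtime's WebSocket protocol; the 16 kHz, 16-bit mono format, the 600 ms chunk length, and the `send` callback are all assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Iterator


def pcm_chunks(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 600, sample_width: int = 2) -> Iterator[bytes]:
    """Yield fixed-duration chunks of mono PCM audio; the last chunk may be shorter."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * sample_width
    if chunk_bytes <= 0:
        raise ValueError("chunk size must be positive")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


async def stream_with_backpressure(
    chunks: Iterable[bytes],
    send: Callable[[bytes], Awaitable[None]],
    max_pending: int = 4,
) -> int:
    """Send chunks through a bounded queue and return how many were sent."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_pending)

    async def producer() -> None:
        for chunk in chunks:
            await queue.put(chunk)  # waits while the queue is full
        await queue.put(None)  # sentinel marking the end of the stream

    async def consumer() -> int:
        sent = 0
        while (chunk := await queue.get()) is not None:
            await send(chunk)
            sent += 1
        return sent

    _, sent = await asyncio.gather(producer(), consumer())
    return sent
```

In a test harness, `send` can be a stub that sleeps to simulate a slow connection, which makes it easy to observe how the pipeline behaves when the network cannot keep up.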

Readiness checklist

Pick a model alias and pin it in deployment notes.
Record FunASR version, model version, device, CUDA/PyTorch version, Docker image tag, and command line.
Run a short public smoke sample with the Python smoke test and at least one realistic private sample; for Kubernetes, verify first through `kubectl port-forward` using the deployment template.
Log audio duration, model, device, latency, response format, and error type for every request.
Add upload-size limits, authentication, TLS, and rate limits before exposing an API outside a trusted network; use the security guide to plan the boundary.
For streaming, test silence, noise, overlapping speakers, long sessions, reconnects, and slow clients.
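The per-request logging item in the checklist above can be sketched as a small record plus a timing wrapper. The field names and the `timed_call` helper are hypothetical; adapt them to whatever logging pipeline you already run.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Callable, Optional


@dataclass
class RequestLog:
    audio_duration_s: float
    model: str
    device: str
    latency_s: float
    response_format: str
    error_type: Optional[str] = None

    def to_json(self) -> str:
        """Serialize as one JSON line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)


def timed_call(
    fn: Callable[[], Any],
    *,
    audio_duration_s: float,
    model: str,
    device: str,
    response_format: str,
) -> tuple[Any, RequestLog]:
    """Run fn() and return (result, log); on failure the result is None and error_type is set."""
    start = time.perf_counter()
    result, error_type = None, None
    try:
        result = fn()
    except Exception as exc:
        # Record the exception class name instead of re-raising.
        error_type = type(exc).__name__
    latency = time.perf_counter() - start
    return result, RequestLog(audio_duration_s, model, device, latency, response_format, error_type)
```

Logging audio duration next to latency lets you derive a real-time factor per request, which is usually more comparable across files than raw latency.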

When to open an issue

Use Deployment Help for runtime, Docker, vLLM, Triton, Android, browser, or agent integration problems. Include your deployment path, exact command/config, logs, model, device, and audio characteristics.