Developer Guide

From understanding the architecture to contributing your own model — a step-by-step guide.

1. Architecture Overview

FunASR is built around three core ideas: a registry for component discovery, AutoModel as the unified entry point, and config.yaml as the declarative model definition.

User Code                    FunASR Framework                    Model Hub
─────────                    ────────────────                    ─────────

AutoModel(model="name")
    │
    ├─→ download_model() ──────→ ModelScope / HuggingFace
    │       ↓                          ↓
    │   read config.yaml         download model.pt
    │       ↓
    ├─→ tables.model_classes["Name"]    ← @tables.register decorator
    │       ↓
    ├─→ model_class(**config)           ← __init__: build encoder/decoder
    │       ↓
    ├─→ load_pretrained_model()         ← load weights from model.pt
    │       ↓
    └─→ model.eval()                    ← ready for inference

generate(input="audio.wav")
    │
    ├─ No VAD → inference()             ← single utterance
    │       ↓
    │   model.inference(data_in, tokenizer, frontend, **kwargs)
    │       ↓
    │   return [{"key", "text", "timestamp", ...}]
    │
    └─ With VAD → inference_with_vad()  ← long audio
            ↓
        1. VAD: segment audio → [[start_ms, end_ms], ...]
        2. Sort segments by length (for efficient batching)
        3. ASR: recognize each segment
        4. Merge timestamps (add VAD offset)
        5. Punctuation (optional)
        6. Speaker diarization (optional)
            ↓
        return [{"key", "text", "timestamp", "sentence_info"}]
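Steps 2 to 4 of the long-audio path can be sketched in plain Python. This is a minimal illustration, not FunASR's implementation: fake_asr and recognize_long_audio are hypothetical names, the recognizer is a stub, and restoring chronological order after the length-sorted pass is an assumption about how results are reassembled.

```python
def fake_asr(segment_ms):
    """Hypothetical stand-in for an ASR model. Returns text and
    timestamps measured from the start of the segment, not the file."""
    start, end = segment_ms
    duration = end - start
    return {"text": f"seg{duration}", "timestamp": [[0, duration]]}


def recognize_long_audio(vad_segments):
    """vad_segments: [[start_ms, end_ms], ...] as produced by a VAD model."""
    # Step 2: visit segments ordered by length, remembering original positions.
    order = sorted(range(len(vad_segments)),
                   key=lambda i: vad_segments[i][1] - vad_segments[i][0])
    results = [None] * len(vad_segments)
    for i in order:
        segment = vad_segments[i]
        res = fake_asr(segment)  # Step 3: recognize one segment
        offset = segment[0]
        # Step 4: shift segment-relative timestamps by the VAD start offset.
        res["timestamp"] = [[s + offset, e + offset] for s, e in res["timestamp"]]
        results[i] = res  # write back at the original (chronological) index
    return {
        "text": " ".join(r["text"] for r in results),
        "timestamp": [ts for r in results for ts in r["timestamp"]],
    }


merged = recognize_long_audio([[0, 3000], [3500, 4200], [5000, 9000]])
print(merged["timestamp"])  # [[0, 3000], [3500, 4200], [5000, 9000]]
```

Sorting by length only changes processing order; the offsets are what make the final timestamps refer to positions in the full recording.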

Registry System

Every component in FunASR is registered by name. The registry is the lookup table that connects config strings to Python classes:

Registry            Purpose                      Example
------------------  ---------------------------  ---------------------------------
model_classes       ASR, VAD, PUNC, SPK models   "Paraformer", "FsmnVADStreaming"
encoder_classes     Encoder architectures        "SANMEncoder", "ConformerEncoder"
decoder_classes     Decoder architectures        "ParaformerSANMDecoder"
frontend_classes    Audio feature extraction     "WavFrontend", "WhisperFrontend"
tokenizer_classes   Text tokenization            "SentencepiecesTokenizer"
dataset_classes     Training data loading        "AudioDataset"

import torch.nn as nn
from funasr.register import tables

# Register a new model
@tables.register("model_classes", "MyModel")
class MyModel(nn.Module):
    ...

# View all registered models
tables.print("model")
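To make the lookup idea concrete, here is a toy registry written from scratch. It is a sketch of the general decorator-registry pattern, not the code behind funasr.register.tables: the Registry class, toy_tables, and ToyModel are all hypothetical names.

```python
from collections import defaultdict


class Registry:
    """Toy name -> class lookup table (illustrative only)."""

    def __init__(self):
        self._tables = defaultdict(dict)

    def register(self, table, name):
        def decorator(cls):
            self._tables[table][name] = cls
            return cls  # hand the class back unchanged so it stays usable directly
        return decorator

    def get(self, table, name):
        try:
            return self._tables[table][name]
        except KeyError:
            raise KeyError(f"{name!r} is not registered in {table!r}") from None


toy_tables = Registry()


@toy_tables.register("model_classes", "ToyModel")
class ToyModel:
    def __init__(self, hidden_size=512):
        self.hidden_size = hidden_size


# A config entry such as `model: ToyModel` resolves to the class, which is
# then instantiated with the values under `model_conf`.
config = {"model": "ToyModel", "model_conf": {"hidden_size": 256}}
model_class = toy_tables.get("model_classes", config["model"])
model = model_class(**config["model_conf"])
```

Because registration happens when the decorated class is defined, a class only becomes visible to the lookup once the module containing it has been imported.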

2. Development Setup

Clone and install in development mode

git clone https://github.com/modelscope/FunASR.git
cd FunASR
pip install -e .              # editable install
pip install -e ".[train]"     # with training dependencies

Verify installation

python -c "from funasr import AutoModel; print('OK')"
python -c "from funasr.register import tables; tables.print('model')"

Run existing tests

# Quick smoke test
python tests_models/test_fsmn_vad.py
python tests_models/test_paraformer.py

# Full test suite
cd tests_models && python run_all_tests.py

3. How Inference Works (Deep Dive)

Understanding the inference data flow is essential before adding a new model. Here's what happens when you call model.generate(input="audio.wav"):

Step 1: Input Preparation

prepare_data_iterator() normalizes any input type (file path, URL, numpy, bytes, list) into a uniform (key_list, data_list) format.
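The normalization idea can be illustrated with a small standalone function. This is not the real prepare_data_iterator(): normalize_inputs and its key-naming scheme (file stem for paths and URLs, otherwise a positional sample_N) are assumptions made for illustration, and no audio is actually loaded.

```python
import os


def normalize_inputs(data_in):
    """Hypothetical helper: returns (key_list, data_list) for any input shape."""
    # A single item is wrapped so the rest of the code only sees sequences.
    items = list(data_in) if isinstance(data_in, (list, tuple)) else [data_in]
    key_list, data_list = [], []
    for index, item in enumerate(items):
        if isinstance(item, str):
            # Path or URL: use the file name without its extension as the key.
            stem = os.path.splitext(os.path.basename(item))[0]
            key_list.append(stem or f"sample_{index}")
        else:
            # Raw bytes or arrays carry no name, so fall back to a positional key.
            key_list.append(f"sample_{index}")
        data_list.append(item)
    return key_list, data_list


keys, data = normalize_inputs([b"\x00\x01", "https://example.com/a.wav"])
print(keys)  # ['sample_0', 'a']
```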

Step 2: Model Inference

Each model's inference() method receives:

def inference(self, data_in, data_lengths=None, key=None,
              tokenizer=None, frontend=None, **kwargs):
    # data_in: list of audio samples (numpy arrays)
    # tokenizer: for decoding token IDs → text
    # frontend: for extracting fbank features
    # **kwargs: all config.yaml params + user runtime params

    # Must return: (results_list, meta_data_dict)
    return [{"key": "id", "text": "hello", "timestamp": [...]}], {"batch_data_time": 5.5}

Step 3: Output Format

The results_list must be a list of dicts. Required/optional fields:

Field          Type    Required   Description
-------------  ------  ---------  -----------------------------------------
key            str     Yes        Sample identifier
text           str     Yes (ASR)  Recognized text
timestamp      list    For SPK    [[start_ms, end_ms], ...] per character
value          list    VAD only   [[start_ms, end_ms], ...] speech segments
spk_embedding  Tensor  SPK only   Shape [N, 192]

Timestamp format matters! If your model outputs timestamps in another form (e.g., the dict format used by Fun-ASR-Nano), FunASR's inference_with_vad converts it automatically. The standard expected format is still [[start_ms, end_ms], ...]: a list of 2-element lists in milliseconds.
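While developing a model, a quick check of the returned list against the table above can catch format slips early. The function below is a hypothetical helper, not part of FunASR, and it only covers the key, text, timestamp, and value fields.

```python
def check_results(results_list, task="asr"):
    """Hypothetical validator for the documented output format."""
    if not isinstance(results_list, list):
        raise TypeError("results_list must be a list of dicts")
    for item in results_list:
        if not isinstance(item, dict):
            raise TypeError("each result must be a dict")
        if not isinstance(item.get("key"), str):
            raise ValueError("'key' must be a str")
        if task == "asr" and not isinstance(item.get("text"), str):
            raise ValueError("'text' must be a str for ASR models")
        # "timestamp" and "value" share the [[start_ms, end_ms], ...] layout.
        for field in ("timestamp", "value"):
            for pair in item.get(field, []):
                if len(pair) != 2 or pair[0] > pair[1]:
                    raise ValueError(f"bad {field} entry: {pair!r}")
    return True


ok = check_results([{"key": "id", "text": "hello",
                     "timestamp": [[0, 120], [120, 300]]}])
```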

4. Add a New Model

Create model directory

funasr/models/my_model/
├── __init__.py      # empty file
├── model.py         # main model class
├── encoder.py       # (optional) custom encoder
└── decoder.py       # (optional) custom decoder

Rule: Each model directory is self-contained. Never import from other model directories. Never modify existing models.

Implement the model class

import torch.nn as nn
from funasr.register import tables

@tables.register("model_classes", "MyModel")
class MyModel(nn.Module):

    def __init__(self, **kwargs):
        super().__init__()    # ← MUST call super().__init__()
        # Build your architecture from kwargs (comes from config.yaml)
        # kwargs includes: input_size, vocab_size, tokenizer, frontend, etc.

    def forward(self, speech, speech_lengths, text, text_lengths, **kwargs):
        """Training forward pass. Return (loss, stats_dict, weight)."""
        ...

    def inference(self, data_in, data_lengths=None, key=None,
                  tokenizer=None, frontend=None, **kwargs):
        """Inference. Return (results_list, meta_data)."""
        ...

Create config.yaml

# This file defines what components to use
model: MyModel              # matches @tables.register key
model_conf:
    hidden_size: 512

frontend: WavFrontend       # reuse existing frontend
frontend_conf:
    fs: 16000
    n_mels: 80
    frame_length: 25
    frame_shift: 10
    cmvn_file: null

tokenizer: SentencepiecesTokenizer
tokenizer_conf:
    bpemodel: null

Create configuration.json (for Hub upload)

{
  "framework": "pytorch",
  "task": "auto-speech-recognition",
  "model": {"type": "funasr"},
  "file_path_metas": {
    "init_param": "model.pt",
    "config": "config.yaml",
    "tokenizer_conf": {"bpemodel": "my_tokenizer.model"},
    "frontend_conf": {"cmvn_file": "am.mvn"}
  }
}

This file tells AutoModel how to resolve relative paths. When the model is downloaded, each path in file_path_metas gets the model directory prepended.
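The prepending step can be pictured as a recursive walk over file_path_metas. The sketch below assumes every string leaf is a relative path; resolve_paths is a hypothetical helper and the cache directory is made up, so treat it as an illustration of the idea only.

```python
import os


def resolve_paths(metas, model_dir):
    """Hypothetical helper: join model_dir onto every string leaf."""
    if isinstance(metas, dict):
        return {k: resolve_paths(v, model_dir) for k, v in metas.items()}
    if isinstance(metas, str):
        return os.path.join(model_dir, metas)
    return metas  # leave non-string values (numbers, None) untouched


file_path_metas = {
    "init_param": "model.pt",
    "config": "config.yaml",
    "tokenizer_conf": {"bpemodel": "my_tokenizer.model"},
    "frontend_conf": {"cmvn_file": "am.mvn"},
}
model_dir = "/cache/my_model"  # made-up download location
resolved = resolve_paths(file_path_metas, model_dir)
print(resolved["tokenizer_conf"]["bpemodel"])
```

This also explains why config.yaml can leave bpemodel and cmvn_file as null: the actual file names come from configuration.json.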

Test locally

from funasr import AutoModel

# From local directory
model = AutoModel(model="./my_model_dir")
res = model.generate(input="test.wav")
print(res)

5. Add a New Frontend / Tokenizer / Dataset

The same registry pattern applies to all components. Example — adding a new frontend:

import torch.nn as nn
from funasr.register import tables

@tables.register("frontend_classes", "MyFrontend")
class MyFrontend(nn.Module):
    def __init__(self, fs=16000, **kwargs):
        super().__init__()
        self.fs = fs

    def output_size(self):
        return 80  # feature dimension

    def forward(self, input, input_lengths):
        # input: raw waveform (batch, samples)
        # return: features (batch, frames, dim), lengths
        ...

Then reference it in config.yaml:

frontend: MyFrontend
frontend_conf:
    fs: 16000

Same pattern for tokenizer (tokenizer_classes), dataset (dataset_classes), encoder (encoder_classes), etc.

6. Standalone Repository Mode

Your model doesn't need to live inside the FunASR source tree. With trust_remote_code=True, FunASR dynamically loads your model class from an external file:

# User code — loads YOUR model.py from a separate repo
model = AutoModel(
    model="your-org/your-model",      # HuggingFace/ModelScope repo
    trust_remote_code=True,
    remote_code="./model.py",          # path to model class definition
    hub="hf",
)

How it works:

  1. FunASR downloads the model repo (weights + config + model.py)
  2. remote_code="./model.py" is dynamically imported
  3. The @tables.register decorator in that file registers the model class
  4. Normal build_model() flow proceeds with the registered class
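Steps 2 and 3 rely on the fact that importing a file runs its top-level code, including any registration decorators. The sketch below shows that mechanism with the standard library only. It is not FunASR's loader: toy_registry, load_remote_code, and the generated model.py are hypothetical, and the registry is injected into the module purely to keep the example self-contained.

```python
import importlib.util
import os
import tempfile

toy_registry = {}  # stand-in for a real registry object


def load_remote_code(path, registry):
    """Import the file at `path`; its top-level code registers classes."""
    spec = importlib.util.spec_from_file_location("remote_model", path)
    module = importlib.util.module_from_spec(spec)
    module.registry = registry  # injected so the toy file can register itself
    spec.loader.exec_module(module)  # runs the file, decorators included
    return module


# Contents of a toy model.py, as it might sit in an external repo.
REMOTE_SOURCE = '''
def register(name):
    def decorator(cls):
        registry[name] = cls
        return cls
    return decorator

@register("RemoteModel")
class RemoteModel:
    pass
'''

with tempfile.TemporaryDirectory() as repo_dir:
    model_py = os.path.join(repo_dir, "model.py")
    with open(model_py, "w", encoding="utf-8") as f:
        f.write(REMOTE_SOURCE)
    load_remote_code(model_py, toy_registry)

print(sorted(toy_registry))  # ['RemoteModel']
```

Since the imported file executes arbitrary code, enable trust_remote_code=True only for repositories you trust.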

Your repo structure:

your-model-repo/
├── model.py              # model class with @tables.register
├── config.yaml           # model architecture config
├── configuration.json    # path resolution
├── model.pt              # trained weights
└── example/test.wav      # demo audio

Examples: Fun-ASR-Nano, SenseVoice

7. Testing Your Model

Write a test script

# tests_models/test_my_model.py
import sys
from funasr import AutoModel

def main():
    model = AutoModel(model="path/to/model", device="cpu", disable_update=True)
    res = model.generate(input="test.wav")

    assert res and len(res) > 0, "empty result"
    assert "text" in res[0], "missing text field"
    print("PASSED")
    return 0

if __name__ == "__main__":
    sys.exit(main())

Test with VAD + SPK pipeline

# If your model should work with speaker diarization:
model = AutoModel(
    model="path/to/model",
    vad_model="fsmn-vad",
    spk_model="cam++",
)
res = model.generate(input="meeting.wav", cache={})
assert "sentence_info" in res[0]
assert "spk" in res[0]["sentence_info"][0]

Test streaming (if applicable)

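# audio, stride, and total_chunks come from your own chunking setup
cache = {}
for i in range(total_chunks):
    chunk = audio[i*stride:(i+1)*stride]
    # add any model-specific streaming options as further keyword arguments
    res = model.generate(input=chunk, cache=cache,
                         is_final=(i == total_chunks-1))
# Verify: same audio gives same result across multiple sessions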
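The chunk loop and the "same result across sessions" check can be tried without a real model. In the sketch below, toy_recognize is a hypothetical stand-in for a streaming recognizer, and iter_chunks and run_session are illustrative helpers that are not part of FunASR.

```python
import math


def iter_chunks(audio, stride):
    """Yield (chunk, is_final) pairs covering `audio` in order."""
    total_chunks = max(1, math.ceil(len(audio) / stride))
    for i in range(total_chunks):
        chunk = audio[i * stride:(i + 1) * stride]
        yield chunk, i == total_chunks - 1


def toy_recognize(chunk, cache, is_final):
    """Hypothetical recognizer: counts the samples seen so far via the cache."""
    cache["seen"] = cache.get("seen", 0) + len(chunk)
    return f"[{cache['seen']}]" + ("." if is_final else "")


def run_session(audio, stride, recognize):
    cache = {}  # a fresh cache per session, shared across its chunks
    pieces = []
    for chunk, is_final in iter_chunks(audio, stride):
        pieces.append(recognize(chunk, cache=cache, is_final=is_final))
    return "".join(pieces)


audio = list(range(10))
first = run_session(audio, stride=4, recognize=toy_recognize)
second = run_session(audio, stride=4, recognize=toy_recognize)
assert first == second  # same audio, fresh cache, same result
print(first)  # [4][8][10].
```

If the two sessions differ, state is probably surviving outside the cache dict, for example on the model object itself.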

8. Common Pitfalls

❌ Forgetting super().__init__()

# WRONG — causes "object has no attribute '_state_dict_pre_hooks'"
class MyEncoder(nn.Module):
    def __init__(self):
        pass

# CORRECT
class MyEncoder(nn.Module):
    def __init__(self):
        super().__init__()

❌ Checking kwargs["batch_size"] in your model

batch_size in kwargs is set by inference_with_vad for segment batching (a large number in ms). Don't use it to check actual data batch size. Use len(data_in) instead.

❌ Not handling empty/short input

VAD may produce empty segments. Your inference() should handle data_in = [] gracefully.
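A guard at the top of inference() covers this case and the previous pitfall together. The function below is a toy body showing only the guard logic. Returning an empty result list with empty metadata is an assumption about what "gracefully" should mean for your model.

```python
def inference(data_in, data_lengths=None, key=None, **kwargs):
    """Toy inference body: guard logic only, no real recognition."""
    if not data_in:
        # Nothing to recognize: return early instead of indexing into
        # an empty list and raising.
        return [], {}
    batch = len(data_in)  # the real batch size, whatever kwargs["batch_size"] says
    keys = key if key is not None else [f"utt_{i}" for i in range(batch)]
    results = [{"key": k, "text": ""} for k in keys]
    return results, {"batch": batch}


empty_results, empty_meta = inference([])
results, meta = inference([[0.1, 0.2], [0.3]], batch_size=300000)
```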

❌ Timestamp format mismatch

If your model returns timestamps as dicts ({"start_time": 0.5, "end_time": 0.8}), the pipeline handles conversion. But if you output [start_time, end_time, text] (3 elements), strip the text — downstream expects [start_ms, end_ms] (2 elements, in milliseconds).
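If you prefer to normalize inside your own model, a small converter is enough. The sketch below assumes that both the dict form and the 3-element form carry times in seconds. to_ms_pairs is a hypothetical helper and must not be applied to values that are already in milliseconds.

```python
def to_ms_pairs(timestamps):
    """Hypothetical normalizer to [[start_ms, end_ms], ...].

    Assumes every input time is in seconds."""
    pairs = []
    for ts in timestamps:
        if isinstance(ts, dict):
            start, end = ts["start_time"], ts["end_time"]
        else:
            start, end = ts[0], ts[1]  # ignores a trailing text element
        pairs.append([int(round(start * 1000)), int(round(end * 1000))])
    return pairs


print(to_ms_pairs([{"start_time": 0.5, "end_time": 0.8}]))  # [[500, 800]]
print(to_ms_pairs([[0.5, 0.8, "hi"]]))                      # [[500, 800]]
```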

❌ Importing from other model directories

# WRONG — creates tight coupling
from funasr.models.paraformer.model import Paraformer

# CORRECT — copy what you need into your own directory
# Or inherit via the registry name in config.yaml

❌ Modifying self.kwargs during inference

Don't mutate kwargs that came from AutoModel. The framework resets its own state between calls, but changes you make in place persist on the shared object and can leak between sessions. Copy what you need before modifying it.
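One way to stay safe is to merge stored and per-call options into a private copy on every call. The class below is a toy, not a FunASR model: ToyModel, the hotwords option, and the merge strategy are all illustrative assumptions.

```python
import copy


class ToyModel:
    def __init__(self, **kwargs):
        self.kwargs = kwargs  # configuration captured at construction time

    def inference(self, data_in, **runtime):
        # Deep-copy the merged options so nested lists and dicts are private
        # to this call; self.kwargs itself is never written to.
        opts = copy.deepcopy({**self.kwargs, **runtime})
        opts["hotwords"].append("per-call")  # safe: the list belongs to the copy
        return opts


model = ToyModel(hotwords=["funasr"])
first = model.inference([], language="en")
second = model.inference([])
print(model.kwargs)  # {'hotwords': ['funasr']}
```

Deep copies cost time for large objects, so in a real model it may be enough to copy only the entries you intend to change.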

9. Contributing

Code Style

PR Checklist

License

Code: MIT. Model weights: FunASR Model License (commercial use allowed with attribution).