Developer Guide
From understanding the architecture to contributing your own model — a step-by-step guide.
1. Architecture Overview
FunASR is built around three core ideas: a registry for component discovery, AutoModel as the unified entry point, and config.yaml as the declarative model definition.
Registry System
Every component in FunASR is registered by name. The registry is the lookup table that connects config strings to Python classes:
| Registry | Purpose | Example |
|---|---|---|
| model_classes | ASR, VAD, PUNC, SPK models | "Paraformer", "FsmnVADStreaming" |
| encoder_classes | Encoder architectures | "SANMEncoder", "ConformerEncoder" |
| decoder_classes | Decoder architectures | "ParaformerSANMDecoder" |
| frontend_classes | Audio feature extraction | "WavFrontend", "WhisperFrontend" |
| tokenizer_classes | Text tokenization | "SentencepiecesTokenizer" |
| dataset_classes | Training data loading | "AudioDataset" |
```python
import torch.nn as nn
from funasr.register import tables

# Register a new model
@tables.register("model_classes", "MyModel")
class MyModel(nn.Module):
    ...

# View all registered models
tables.print("model")
```
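To make the lookup idea concrete, here is a minimal sketch of a decorator-based registry. It is an illustration only, not FunASR's implementation: `Registry`, `get`, and `ToyModel` are hypothetical names, and the real `tables` object may be structured differently.

```python
# Illustrative sketch of a name -> class registry; NOT FunASR's actual code.
# `Registry`, `get`, and `ToyModel` are hypothetical names chosen for this example.
from collections import defaultdict


class Registry:
    def __init__(self):
        # One dict per table, e.g. "model_classes" -> {"ToyModel": <class>}
        self._tables = defaultdict(dict)

    def register(self, table, name):
        """Return a class decorator that stores the class under table[name]."""
        def decorator(cls):
            self._tables[table][name] = cls
            return cls  # the class itself is returned unchanged
        return decorator

    def get(self, table, name):
        """Look up a class by the string used in a config file."""
        try:
            return self._tables[table][name]
        except KeyError:
            raise KeyError(f"{name!r} is not registered in {table!r}") from None


registry = Registry()


@registry.register("model_classes", "ToyModel")
class ToyModel:
    pass
```

Because the decorator runs at import time, importing the module that defines a class is enough to make its name resolvable from a config string.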
2. Development Setup
Clone and install in development mode
```bash
git clone https://github.com/modelscope/FunASR.git
cd FunASR
pip install -e .             # editable install
pip install -e ".[train]"    # with training dependencies
```
Verify installation
```bash
python -c "from funasr import AutoModel; print('OK')"
python -c "from funasr.register import tables; tables.print('model')"
```
Run existing tests
```bash
# Quick smoke test
python tests_models/test_fsmn_vad.py
python tests_models/test_paraformer.py

# Full test suite
cd tests_models && python run_all_tests.py
```
3. How Inference Works (Deep Dive)
Understanding the inference data flow is essential before adding a new model. Here's what happens when you call model.generate(input="audio.wav"):
Step 1: Input Preparation
prepare_data_iterator() normalizes any input type (file path, URL, numpy, bytes, list) into a uniform (key_list, data_list) format.
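The normalization step can be pictured with the sketch below. It is not the real `prepare_data_iterator()`: `normalize_inputs` is a hypothetical helper that only distinguishes string paths from other inputs, and its key-naming scheme is an assumption made for the example.

```python
# Hypothetical helper illustrating the (key_list, data_list) idea.
# The real prepare_data_iterator() handles more input types and may name keys differently.
import os


def normalize_inputs(data_in):
    """Turn a single input or a list of inputs into (key_list, data_list)."""
    items = data_in if isinstance(data_in, (list, tuple)) else [data_in]
    key_list, data_list = [], []
    for index, item in enumerate(items):
        if isinstance(item, str):
            # Use the file name without its extension as the sample key.
            key = os.path.splitext(os.path.basename(item))[0]
        else:
            # Non-path inputs (bytes, arrays) get a positional key.
            key = f"sample_{index}"
        key_list.append(key)
        data_list.append(item)
    return key_list, data_list
```

Whatever the input type, downstream code then only has to deal with two parallel lists.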
Step 2: Model Inference
Each model's inference() method receives:
```python
def inference(self, data_in, data_lengths=None, key=None,
              tokenizer=None, frontend=None, **kwargs):
    # data_in: list of audio samples (numpy arrays)
    # tokenizer: for decoding token IDs → text
    # frontend: for extracting fbank features
    # **kwargs: all config.yaml params + user runtime params
    # Must return: (results_list, meta_data_dict)
    return [{"key": "id", "text": "hello", "timestamp": [...]}], {"batch_data_time": 5.5}
```
Step 3: Output Format
The results_list must be a list of dicts. Required/optional fields:
| Field | Type | Required | Description |
|---|---|---|---|
| key | str | Yes | Sample identifier |
| text | str | Yes (ASR) | Recognized text |
| timestamp | list | For SPK | [[start_ms, end_ms], ...] per character |
| value | list | VAD only | [[start_ms, end_ms], ...] speech segments |
| spk_embedding | Tensor | SPK only | Shape [N, 192] |
If your model returns timestamps in another form, such as dicts, inference_with_vad will auto-convert them. The standard expected format, however, is [[start_ms, end_ms], ...] (a list of 2-element lists in milliseconds).
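A quick way to catch format mistakes early is to validate the output before wiring the model into a pipeline. The checker below is a sketch under stated assumptions: `check_results` and its `task` parameter are hypothetical and not part of FunASR, and it only covers the `key`, `text`, `timestamp`, and `value` fields from the table.

```python
# Hypothetical validator for the documented result fields; not part of FunASR.
def check_results(results_list, task="asr"):
    """Return a list of problems; an empty list means the output looks valid."""
    if not isinstance(results_list, list):
        return ["results_list must be a list"]
    problems = []
    for i, item in enumerate(results_list):
        if not isinstance(item, dict):
            problems.append(f"item {i}: expected dict, got {type(item).__name__}")
            continue
        if not isinstance(item.get("key"), str):
            problems.append(f"item {i}: 'key' must be a str")
        if task == "asr" and not isinstance(item.get("text"), str):
            problems.append(f"item {i}: 'text' must be a str for ASR")
        # timestamp and value share the [[start_ms, end_ms], ...] layout
        for field in ("timestamp", "value"):
            for pair in item.get(field, []):
                if not (isinstance(pair, (list, tuple)) and len(pair) == 2):
                    problems.append(f"item {i}: '{field}' entries must be [start_ms, end_ms]")
                    break
    return problems
```

Calling it on the first element of `inference()`'s return value in a unit test gives a readable failure message instead of a downstream crash.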
4. Add a New Model
Create model directory
```
funasr/models/my_model/
├── __init__.py    # empty file
├── model.py       # main model class
├── encoder.py     # (optional) custom encoder
└── decoder.py     # (optional) custom decoder
```
Implement the model class
```python
import torch.nn as nn
from funasr.register import tables


@tables.register("model_classes", "MyModel")
class MyModel(nn.Module):
    def __init__(self, **kwargs):
        super().__init__()  # ← MUST call super().__init__()
        # Build your architecture from kwargs (comes from config.yaml)
        # kwargs includes: input_size, vocab_size, tokenizer, frontend, etc.

    def forward(self, speech, speech_lengths, text, text_lengths, **kwargs):
        """Training forward pass. Return (loss, stats_dict, weight)."""
        ...

    def inference(self, data_in, data_lengths=None, key=None,
                  tokenizer=None, frontend=None, **kwargs):
        """Inference. Return (results_list, meta_data)."""
        ...
```
Create config.yaml
```yaml
# This file defines what components to use
model: MyModel              # matches @tables.register key
model_conf:
  hidden_size: 512

frontend: WavFrontend       # reuse existing frontend
frontend_conf:
  fs: 16000
  n_mels: 80
  frame_length: 25
  frame_shift: 10
  cmvn_file: null

tokenizer: SentencepiecesTokenizer
tokenizer_conf:
  bpemodel: null
```
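The relationship between a component name and its `*_conf` section can be sketched as follows. This is an assumption-laden illustration, not FunASR's `build_model()`: `TABLES`, `ToyFrontend`, `ToyModel`, and `build_from_config` are hypothetical, and the config is written as a Python dict so the example needs no YAML parser.

```python
# Illustrative only: how a declarative config can drive component construction.
# TABLES, ToyFrontend, ToyModel and build_from_config are hypothetical names.
class ToyFrontend:
    def __init__(self, fs=16000, n_mels=80, **kwargs):
        self.fs, self.n_mels = fs, n_mels


class ToyModel:
    def __init__(self, hidden_size=256, **kwargs):
        self.hidden_size = hidden_size


TABLES = {
    "frontend_classes": {"ToyFrontend": ToyFrontend},
    "model_classes": {"ToyModel": ToyModel},
}


def build_from_config(config, component):
    """Instantiate config[component] using the matching '<component>_conf' section."""
    class_name = config[component]
    conf = config.get(f"{component}_conf") or {}
    cls = TABLES[f"{component}_classes"][class_name]
    return cls(**conf)


config = {
    "model": "ToyModel",
    "model_conf": {"hidden_size": 512},
    "frontend": "ToyFrontend",
    "frontend_conf": {"fs": 16000, "n_mels": 80},
}
model = build_from_config(config, "model")
frontend = build_from_config(config, "frontend")
```

The point of the pattern is that swapping a component means changing one string in the config rather than editing construction code.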
Create configuration.json (for Hub upload)
```json
{
  "framework": "pytorch",
  "task": "auto-speech-recognition",
  "model": {"type": "funasr"},
  "file_path_metas": {
    "init_param": "model.pt",
    "config": "config.yaml",
    "tokenizer_conf": {"bpemodel": "my_tokenizer.model"},
    "frontend_conf": {"cmvn_file": "am.mvn"}
  }
}
```
This file tells AutoModel how to resolve relative paths. When the model is downloaded, each path in file_path_metas gets the model directory prepended.
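The path-prepending rule described above might look roughly like this. `resolve_paths` is a hypothetical helper written for illustration. The actual resolution code in FunASR may differ in detail.

```python
# Hypothetical sketch of the path-prepending rule; not FunASR's actual code.
import os


def resolve_paths(metas, model_dir):
    """Recursively prepend model_dir to every path in file_path_metas."""
    resolved = {}
    for name, value in metas.items():
        if isinstance(value, dict):
            # Nested sections such as tokenizer_conf are resolved the same way.
            resolved[name] = resolve_paths(value, model_dir)
        else:
            resolved[name] = os.path.join(model_dir, value)
    return resolved
```

Nested entries keep their structure, so a resolved `tokenizer_conf` section can be merged directly over the matching section of config.yaml.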
Test locally
```python
from funasr import AutoModel

# From local directory
model = AutoModel(model="./my_model_dir")
res = model.generate(input="test.wav")
print(res)
```
5. Add a New Frontend / Tokenizer / Dataset
The same registry pattern applies to all components. Example — adding a new frontend:
```python
import torch.nn as nn
from funasr.register import tables


@tables.register("frontend_classes", "MyFrontend")
class MyFrontend(nn.Module):
    def __init__(self, fs=16000, **kwargs):
        super().__init__()
        self.fs = fs

    def output_size(self):
        return 80  # feature dimension

    def forward(self, input, input_lengths):
        # input: raw waveform (batch, samples)
        # return: features (batch, frames, dim), lengths
        ...
```
Then reference it in config.yaml:
```yaml
frontend: MyFrontend
frontend_conf:
  fs: 16000
```
Same pattern for tokenizer (tokenizer_classes), dataset (dataset_classes), encoder (encoder_classes), etc.
6. Standalone Repository Mode
Your model doesn't need to live inside the FunASR source tree. With trust_remote_code=True, FunASR dynamically loads your model class from an external file:
```python
from funasr import AutoModel

# User code: loads YOUR model.py from a separate repo
model = AutoModel(
    model="your-org/your-model",   # HuggingFace/ModelScope repo
    trust_remote_code=True,
    remote_code="./model.py",      # path to model class definition
    hub="hf",
)
```
How it works:
- FunASR downloads the model repo (weights + config + model.py)
- remote_code="./model.py" is dynamically imported
- The @tables.register decorator in that file registers the model class
- Normal build_model() flow proceeds with the registered class
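The import-then-register sequence in the steps above can be reproduced with the standard library alone. The sketch below is illustrative and does not show FunASR's loader: `toy_registry`, `load_remote_code`, and `RemoteToyModel` are hypothetical, and a temporary directory stands in for the downloaded repo.

```python
# Illustrative sketch of loading a model class from an external file.
# All names (toy_registry, load_remote_code, RemoteToyModel) are hypothetical.
import importlib.util
import os
import sys
import tempfile
import types

# A stand-in for a shared registry module that external code can import.
toy_registry = types.ModuleType("toy_registry")
toy_registry.MODEL_CLASSES = {}


def _register(name):
    def decorator(cls):
        toy_registry.MODEL_CLASSES[name] = cls
        return cls
    return decorator


toy_registry.register = _register
sys.modules["toy_registry"] = toy_registry


def load_remote_code(path):
    """Import a Python file by path; its decorators run as a side effect."""
    module_name = os.path.splitext(os.path.basename(path))[0]
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Simulate a downloaded repo that ships its own model.py.
source = (
    "from toy_registry import register\n"
    "\n"
    "@register('RemoteToyModel')\n"
    "class RemoteToyModel:\n"
    "    pass\n"
)
with tempfile.TemporaryDirectory() as repo_dir:
    model_file = os.path.join(repo_dir, "model.py")
    with open(model_file, "w") as handle:
        handle.write(source)
    load_remote_code(model_file)

registered = toy_registry.MODEL_CLASSES["RemoteToyModel"]
```

Executing the file is what makes this powerful and also why the `trust_remote_code=True` opt-in exists: importing a module runs arbitrary code from the repo.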
Your repo structure:
```
your-model-repo/
├── model.py             # model class with @tables.register
├── config.yaml          # model architecture config
├── configuration.json   # path resolution
├── model.pt             # trained weights
└── example/test.wav     # demo audio
```
Examples: Fun-ASR-Nano, SenseVoice
7. Testing Your Model
Write a test script
```python
# tests_models/test_my_model.py
import sys
from funasr import AutoModel


def main():
    model = AutoModel(model="path/to/model", device="cpu", disable_update=True)
    res = model.generate(input="test.wav")
    assert res and len(res) > 0, "empty result"
    assert "text" in res[0], "missing text field"
    print("PASSED")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```
Test with VAD + SPK pipeline
```python
from funasr import AutoModel

# If your model should work with speaker diarization:
model = AutoModel(
    model="path/to/model",
    vad_model="fsmn-vad",
    spk_model="cam++",
)
res = model.generate(input="meeting.wav", cache={})
assert "sentence_info" in res[0]
assert "spk" in res[0]["sentence_info"][0]
```
Test streaming (if applicable)
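```python
# `model`, `audio`, `stride`, and `total_chunks` are defined by the caller
cache = {}
for i in range(total_chunks):
    chunk = audio[i*stride:(i+1)*stride]
    res = model.generate(input=chunk, cache=cache,
                         is_final=(i == total_chunks-1))  # ... plus model-specific arguments
# Verify: same audio gives same result across multiple sessions
```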
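The chunk loop leaves `total_chunks` and `stride` undefined. One way to derive them is sketched below. `iter_chunks` is a hypothetical helper, and the right stride is model-specific, so the value is left to the caller here.

```python
# Hypothetical chunking helper; the stride value is model-specific and supplied by the caller.
def iter_chunks(audio, stride):
    """Yield (chunk, is_final) pairs covering the whole signal."""
    # Ceiling division, with at least one (possibly empty) final chunk.
    total_chunks = max(1, -(-len(audio) // stride))
    for i in range(total_chunks):
        chunk = audio[i * stride:(i + 1) * stride]
        yield chunk, i == total_chunks - 1
```

Each yielded pair maps directly onto the `input` and `is_final` arguments of the streaming loop, and the last chunk may be shorter than `stride`.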
8. Common Pitfalls
❌ Forgetting super().__init__()
```python
import torch.nn as nn

# WRONG: causes "object has no attribute '_state_dict_pre_hooks'"
class MyEncoder(nn.Module):
    def __init__(self):
        pass

# CORRECT
class MyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
```
❌ Checking kwargs["batch_size"] in your model
The batch_size in kwargs is set by inference_with_vad for segment batching (a large number in ms), so it does not tell you how many samples you received. Don't use it to check the actual data batch size; use len(data_in) instead.
❌ Not handling empty/short input
VAD may produce empty segments. Your inference() should handle data_in = [] gracefully.
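A guard along the following lines covers both this pitfall and the previous one. It is a standalone sketch, not a drop-in method: the decoding step is a placeholder string, and the fallback key names are an assumption.

```python
# Hypothetical inference skeleton showing the empty-input guard.
# The decoding step is a placeholder; a real model would run its network here.
def inference(data_in, data_lengths=None, key=None, **kwargs):
    """Return (results_list, meta_data) even when there is nothing to decode."""
    meta_data = {}
    if data_in is None or len(data_in) == 0:
        return [], meta_data
    batch = len(data_in)  # real batch size; do not read kwargs["batch_size"]
    keys = key if key is not None else [f"sample_{i}" for i in range(batch)]
    results = []
    for sample_key, sample in zip(keys, data_in):
        # An empty segment yields empty text instead of raising.
        text = "" if len(sample) == 0 else "<decoded text>"
        results.append({"key": sample_key, "text": text})
    return results, meta_data
```

Returning an empty list with a valid metadata dict keeps the return contract intact, so callers do not need a special case.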
❌ Timestamp format mismatch
If your model returns timestamps as dicts ({"start_time": 0.5, "end_time": 0.8}), the pipeline handles conversion. But if you output [start_time, end_time, text] (3 elements), strip the text — downstream expects [start_ms, end_ms] (2 elements, in milliseconds).
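If you prefer to normalize timestamps inside the model rather than rely on the pipeline, a converter could look like the sketch below. `to_ms_pairs` is hypothetical. It assumes dict timestamps are in seconds (as the 0.5 / 0.8 example suggests) and that list-style entries are already in milliseconds; adjust both assumptions to your model.

```python
# Hypothetical converter. Assumptions: dict timestamps are in seconds,
# list-style timestamps are already in milliseconds.
def to_ms_pairs(timestamps):
    """Normalize timestamps to [[start_ms, end_ms], ...]."""
    pairs = []
    for entry in timestamps:
        if isinstance(entry, dict):
            start, end = entry["start_time"] * 1000, entry["end_time"] * 1000
        else:
            # Keep only the first two elements, dropping any trailing text.
            start, end = entry[0], entry[1]
        pairs.append([int(round(start)), int(round(end))])
    return pairs
```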
❌ Importing from other model directories
```python
# WRONG: creates tight coupling
from funasr.models.paraformer.model import Paraformer

# CORRECT: copy what you need into your own directory,
# or inherit via the registry name in config.yaml
```
❌ Modifying self.kwargs during inference
Don't mutate kwargs that came from AutoModel. The framework resets state between calls, but persistent mutations can leak between sessions.
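One defensive pattern is to merge runtime options into a per-call copy. The class below is a toy illustration: `ToyModel`, `beam_size`, and `hotwords` are hypothetical names used only to show the copy-then-modify idea.

```python
# Illustrative sketch: work on a per-call copy instead of mutating shared kwargs.
# ToyModel, beam_size and hotwords are hypothetical names.
import copy


class ToyModel:
    def __init__(self, **kwargs):
        self.kwargs = kwargs  # shared configuration, treated as read-only

    def inference(self, data_in, **runtime_kwargs):
        # Deep copy so nested lists and dicts are not shared with self.kwargs.
        options = copy.deepcopy(self.kwargs)
        options.update(runtime_kwargs)
        options.setdefault("hotwords", []).append("per-call value")
        return options
```

A shallow `dict(self.kwargs)` would not be enough here, because appending to a nested list would still modify the shared object.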
9. Contributing
Code Style
- Follow existing patterns; look at paraformer/model.py as a reference
- Add docstrings to all public methods (Args, Returns)
- No comments explaining WHAT; use clear naming. Comments for WHY only.
PR Checklist
- New model: self-contained directory, no cross-model imports
- Include a test script in tests_models/
- Include a demo in examples/industrial_data_pretraining/
- All existing tests still pass
- Add entry to README What's New if user-facing
License
Code: MIT. Model weights: FunASR Model License (commercial use allowed with attribution).