Training & Fine-tuning

Fine-tune pretrained models on your own data using FunASR's training framework.


Overview

FunASR's training framework supports:

• Fine-tuning any pretrained model on custom domain data
• Multi-GPU training with PyTorch DDP (single/multi-node)
• DeepSpeed ZeRO Stage 1/2/3 for large model training
• Dynamic batching by token count or example count
• Checkpoint averaging for best performance
• Resume training from interruption

The training entry point is funasr-train-ds (or funasr/bin/train_ds.py), launched via torchrun for distributed training.

Data Preparation

Standard Format (Paraformer, SenseVoice)

Training data uses JSONL format — one JSON object per line:

{"key": "utt001", "source": "/path/to/audio.wav", "source_len": 90, "target": "这是转写文本", "target_len": 6}
{"key": "utt002", "source": "/path/to/audio2.wav", "source_len": 150, "target": "hello world", "target_len": 2}

Field	Type	Description
`key`	str	Unique utterance ID
`source`	str	Audio file path (local path or URL)
`source_len`	int	Audio length in fbank frames (1 frame = 10ms)
`target`	str	Transcription text
`target_len`	int	Number of text tokens
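
If you need to fill in `source_len` yourself, the following is a minimal sketch rather than FunASR's own tooling. It assumes PCM WAV input readable by Python's standard `wave` module and approximates the frame count as duration divided by the 10 ms frame shift; a real fbank front end may differ by a frame or two at the edges. The helper names `estimate_fbank_frames`, `make_record`, and `write_jsonl` are hypothetical, and `target_len` is taken from a token list you supply from your own tokenizer.

```python
import json
import wave


def estimate_fbank_frames(wav_path, frame_shift_ms=10):
    """Approximate the fbank frame count as duration / frame shift."""
    with wave.open(wav_path, "rb") as wf:
        duration_ms = 1000.0 * wf.getnframes() / wf.getframerate()
    return int(duration_ms // frame_shift_ms)


def make_record(key, wav_path, text, tokens):
    """Build one training record in the source/target JSONL layout."""
    return {
        "key": key,
        "source": wav_path,
        "source_len": estimate_fbank_frames(wav_path),
        "target": text,
        "target_len": len(tokens),  # tokens come from your own tokenizer
    }


def write_jsonl(records, out_path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Passing `ensure_ascii=False` keeps Chinese transcripts human-readable in the output file instead of escaping them.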

Generate from wav.scp + text.txt

If you have Kaldi-style data files, convert them:

# train_wav.scp (tab-separated: id \t path)
utt001  /data/audio/001.wav
utt002  /data/audio/002.wav

# train_text.txt (tab-separated: id \t text)
utt001  这是转写文本
utt002  hello world

# Convert to jsonl
scp2jsonl \
  ++scp_file_list='["/data/list/train_wav.scp", "/data/list/train_text.txt"]' \
  ++data_type_list='["source", "target"]' \
  ++jsonl_file_out="/data/list/train.jsonl"
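
Before converting, it can help to confirm that both files cover the same utterance IDs. The sketch below is a hypothetical pre-check and is not part of `scp2jsonl`; it assumes each line is an ID, then whitespace, then the path or text. The function names are illustrative.

```python
def read_kaldi_table(path):
    """Parse 'id<whitespace>value' lines into a dict, skipping blank lines."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            parts = line.split(maxsplit=1)
            table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table


def compare_ids(wav_scp, text_file):
    """Return (IDs with audio but no text, IDs with text but no audio)."""
    wavs = read_kaldi_table(wav_scp)
    texts = read_kaldi_table(text_file)
    missing_text = sorted(set(wavs) - set(texts))
    missing_audio = sorted(set(texts) - set(wavs))
    return missing_text, missing_audio
```

Two empty lists mean the files line up and the conversion should see one transcript per audio file.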

ChatML Format (Fun-ASR-Nano)

Fun-ASR-Nano uses ChatML conversation format:

{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "语音转写：<|startofspeech|>!/path/to/audio.wav<|endofspeech|>"},
  {"role": "assistant", "content": "几点了？"}
], "speech_length": 145, "text_length": 3}

Field	Description
`messages[0]`	System prompt (fixed: "You are a helpful assistant.")
`messages[1]`	User: prompt + audio path wrapped in `<|startofspeech|>!...<|endofspeech|>`
`messages[2]`	Assistant: transcription text
`speech_length`	Number of fbank frames (10ms each)
`text_length`	Number of tokens (tokenized by Qwen3-0.6B)

Prompt variations:
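• Chinese: 语音转写： ("Speech transcription:")
• English: Speech transcription:
• Cross-language: 语音转写成英文： ("Transcribe speech into English:")
• No ITN: 语音转写，不进行文本规整： ("Speech transcription, without text normalization:")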
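
As an illustration of how the fields fit together, here is a minimal sketch that assembles one ChatML record by hand. `build_chatml_record` is a hypothetical helper, not a FunASR function; the caller is assumed to supply `speech_length` and `text_length` already computed.

```python
import json


def build_chatml_record(audio_path, transcript, speech_length, text_length,
                        prompt="语音转写："):
    """Assemble one ChatML training record for a single utterance."""
    # The audio path sits between the speech markers, prefixed with "!".
    user = f"{prompt}<|startofspeech|>!{audio_path}<|endofspeech|>"
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user},
            {"role": "assistant", "content": transcript},
        ],
        "speech_length": speech_length,
        "text_length": text_length,
    }


# One JSONL line; ensure_ascii=False keeps the Chinese text readable.
record = build_chatml_record("/path/to/audio.wav", "几点了？", 145, 3)
jsonl_line = json.dumps(record, ensure_ascii=False)
```

Switching the `prompt` argument to one of the variations listed above changes the task without touching the rest of the record.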

Convert from wav.scp + text.txt:

python tools/scp2jsonl.py \
  ++scp_file=data/train_wav.scp \
  ++transcript_file=data/train_text.txt \
  ++jsonl_file=data/train_example.jsonl

Fine-tune Paraformer

cd examples/industrial_data_pretraining/paraformer
bash finetune.sh

Or customize the key parameters:

export CUDA_VISIBLE_DEVICES="0,1"
gpu_num=2

torchrun --nproc_per_node $gpu_num \
  funasr/bin/train_ds.py \
  ++model="iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" \
  ++train_data_set_list="data/train.jsonl" \
  ++valid_data_set_list="data/val.jsonl" \
  ++dataset_conf.batch_size=6000 \
  ++dataset_conf.batch_type="token" \
  ++dataset_conf.num_workers=4 \
  ++train_conf.max_epoch=50 \
  ++train_conf.validate_interval=2000 \
  ++train_conf.save_checkpoint_interval=2000 \
  ++train_conf.keep_nbest_models=20 \
  ++train_conf.avg_nbest_model=10 \
  ++optim_conf.lr=0.0002 \
  ++output_dir="./outputs"

Fine-tune SenseVoice

cd examples/industrial_data_pretraining/sense_voice
bash finetune.sh

SenseVoice uses the same data format as Paraformer (source/target JSONL). The key difference is that SenseVoice uses its own dataset class internally.

Fine-tune Fun-ASR-Nano

cd examples/industrial_data_pretraining/fun_asr_nano
bash finetune.sh

Key differences from Paraformer:

• Uses ChatML data format (see above)
• ++trust_remote_code=true required
• Supports selective freezing: freeze encoder/adaptor while training LLM decoder

# Freeze encoder + adaptor, only train LLM (recommended for domain adaptation)
++audio_encoder_conf.freeze=true
++audio_adaptor_conf.freeze=true
++llm_conf.freeze=false

# Full fine-tune (all parameters)
++audio_encoder_conf.freeze=false
++audio_adaptor_conf.freeze=false
++llm_conf.freeze=false

Recommended strategy: Start with LLM-only fine-tuning (faster, less data needed). If results are insufficient, unfreeze adaptor. Only unfreeze encoder with very large datasets (>1000h).
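
The staged strategy can be written down as three sets of overrides. The sketch below only generates the `freeze` flags shown above; the stage names and the `freeze_overrides` helper are hypothetical, and the middle stage (adaptor unfrozen, encoder still frozen) is inferred from the recommendation rather than taken from a shipped recipe.

```python
# Hypothetical stage names mapped to which components stay frozen.
FREEZE_STAGES = {
    "llm_only": {"audio_encoder": True, "audio_adaptor": True, "llm": False},
    "llm_and_adaptor": {"audio_encoder": True, "audio_adaptor": False, "llm": False},
    "full": {"audio_encoder": False, "audio_adaptor": False, "llm": False},
}


def freeze_overrides(stage):
    """Return the ++..._conf.freeze overrides for one fine-tuning stage."""
    flags = FREEZE_STAGES[stage]
    return [
        f"++{name}_conf.freeze={str(frozen).lower()}"
        for name, frozen in flags.items()
    ]


# Overrides for the first (LLM-only) stage, ready to append to a command.
llm_only_args = " ".join(freeze_overrides("llm_only"))
```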

Parameter Reference

Dataset Parameters

Parameter	Default	Description
`dataset_conf.batch_type`	"token"	`"token"`: dynamic batching by total token count. `"example"`: fixed number of samples per batch.
`dataset_conf.batch_size`	6000	Token mode: total frames per batch. Example mode: number of samples.
`dataset_conf.sort_size`	1024	Buffer size for length-based sorting (improves padding efficiency).
`dataset_conf.num_workers`	4	Number of data-loading worker processes.
`dataset_conf.data_split_num`	1	Split data into N groups for large-scale training (reduces memory).
`dataset_conf.max_token_length`	—	Filter: skip samples longer than this (in frames/tokens).
`dataset_conf.min_token_length`	—	Filter: skip samples shorter than this.
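
To make the interaction between `batch_size` and `sort_size` concrete, here is a simplified sketch of length-based dynamic batching. It is an illustration of the general technique, not FunASR's sampler: it assumes the cost of a batch is its longest sample times the number of samples (the padded size), and `token_batches` is a hypothetical name.

```python
def token_batches(lengths, batch_size, sort_size=1024):
    """Group sample indices so each batch's padded size stays within batch_size.

    lengths: per-sample lengths (frames or tokens).
    Samples are sorted by length inside buffers of sort_size to reduce padding.
    """
    batches = []
    for start in range(0, len(lengths), sort_size):
        end = min(start + sort_size, len(lengths))
        buffer = sorted(range(start, end), key=lambda i: lengths[i])
        batch, max_len = [], 0
        for i in buffer:
            new_max = max(max_len, lengths[i])
            # Padded size if sample i joined the current batch.
            if batch and new_max * (len(batch) + 1) > batch_size:
                batches.append(batch)
                batch, new_max = [], lengths[i]
            batch.append(i)
            max_len = new_max
        if batch:
            batches.append(batch)
    return batches
```

With this rule, short utterances pack many to a batch while long ones end up in small batches, which is why a single long outlier can dominate memory use.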

Training Parameters

Parameter	Default	Description
`train_conf.max_epoch`	50	Total training epochs.
`train_conf.log_interval`	1	Print loss every N steps.
`train_conf.validate_interval`	2000	Run validation every N steps.
`train_conf.save_checkpoint_interval`	2000	Save model every N steps.
`train_conf.keep_nbest_models`	20	Keep top N models (by validation accuracy).
`train_conf.avg_nbest_model`	10	Average top N models for final checkpoint.
`train_conf.resume`	true	Resume from last checkpoint if exists.
`train_conf.use_deepspeed`	false	Enable DeepSpeed ZeRO optimization.
`optim_conf.lr`	0.0002	Learning rate.
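
The relationship between `keep_nbest_models` and `avg_nbest_model` can be pictured as "rank checkpoints by validation accuracy, then average the parameters of the top N". The sketch below shows that idea on plain Python lists; it is not FunASR's implementation, which operates on model state dicts, and `average_top_checkpoints` is a hypothetical name.

```python
def average_top_checkpoints(checkpoints, n):
    """Average parameters of the n checkpoints with the highest accuracy.

    checkpoints: list of (val_acc, params), where params maps a parameter
    name to a list of floats. All checkpoints must share the same names
    and shapes.
    """
    best = sorted(checkpoints, key=lambda c: c[0], reverse=True)[:n]
    if not best:
        raise ValueError("no checkpoints to average")
    averaged = {}
    for name in best[0][1]:
        # zip(*) walks the same position across all selected checkpoints.
        columns = zip(*(params[name] for _, params in best))
        averaged[name] = [sum(col) / len(best) for col in columns]
    return averaged
```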

Multi-GPU Training

Single Machine, Multiple GPUs

export CUDA_VISIBLE_DEVICES="0,1,2,3"
gpu_num=4

torchrun --nnodes 1 --nproc_per_node $gpu_num \
  funasr/bin/train_ds.py ${train_args}

Multiple Machines

# Machine 1 (master, IP=192.168.1.1)
torchrun --nnodes 2 --node_rank 0 --nproc_per_node 4 \
  --master_addr=192.168.1.1 --master_port=12345 \
  funasr/bin/train_ds.py ${train_args}

# Machine 2
torchrun --nnodes 2 --node_rank 1 --nproc_per_node 4 \
  --master_addr=192.168.1.1 --master_port=12345 \
  funasr/bin/train_ds.py ${train_args}
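
When debugging a multi-node launch, it helps to know which global rank each process should report. The sketch below assumes the usual contiguous assignment, where ranks are numbered node by node; the helper names are hypothetical, and the values correspond to the `WORLD_SIZE`, `RANK`, and `LOCAL_RANK` environment variables that torchrun sets for each worker.

```python
def world_size(nnodes, nproc_per_node):
    """Total number of training processes across all machines."""
    return nnodes * nproc_per_node


def global_rank(node_rank, nproc_per_node, local_rank):
    """Global rank of a process, assuming ranks are assigned node by node."""
    return node_rank * nproc_per_node + local_rank


# The two-machine example above: 2 nodes x 4 GPUs each.
total = world_size(2, 4)
first_rank_on_machine_2 = global_rank(node_rank=1, nproc_per_node=4, local_rank=0)
```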

DeepSpeed

For large models such as Fun-ASR-Nano (800M parameters), enable DeepSpeed ZeRO:

++train_conf.use_deepspeed=true
++train_conf.deepspeed_config=./deepspeed_conf/ds_stage1.json

Stage 1 config (recommended starting point):

{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  }
}

When to use which stage:
• Stage 1: Optimizer state partitioned. Good for most cases.
• Stage 2: + Gradient partitioned. For larger models.
• Stage 3: + Parameter partitioned. Maximum memory savings but slower.
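
A rough way to compare the stages is to estimate model-state memory per GPU. The sketch below uses the standard ZeRO accounting for mixed-precision Adam: 2 bytes per parameter for 16-bit weights, 2 for gradients, and 12 for optimizer state (fp32 master weights, momentum, variance). It ignores activations, buffers, and fragmentation, so treat the result as a lower bound; `zero_model_state_gb` is a hypothetical helper.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Estimate per-GPU model-state memory (GB) for ZeRO stage 0-3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = 2.0 * num_params   # 16-bit weights
    grads = 2.0 * num_params     # 16-bit gradients
    optim = 12.0 * num_params    # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus        # Stage 1: optimizer state partitioned
    if stage >= 2:
        grads /= num_gpus        # Stage 2: gradients partitioned as well
    if stage >= 3:
        weights /= num_gpus      # Stage 3: parameters partitioned as well
    return (weights + grads + optim) / 1e9


# An 800M-parameter model on 4 GPUs, one estimate per stage.
estimates = {s: zero_model_state_gb(800e6, 4, s) for s in (0, 1, 2, 3)}
```

For 800M parameters on 4 GPUs this gives roughly 12.8 GB without ZeRO, 5.6 GB at Stage 1, 4.4 GB at Stage 2, and 3.2 GB at Stage 3, which shows why Stage 1 already captures most of the savings.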

Monitoring Training

Log file

tail -f outputs/log.txt

# Example output:
# train, rank: 0, epoch: 0/50, step: 6990, (loss_avg_rank: 0.327),
# (acc_avg_epoch: 0.795), (lr: 1.165e-04),
# GPU memory: usage: 3.8GB, peak: 18.3GB

Key metrics to watch:

• loss_avg_rank / loss_avg_epoch: should decrease over time
• acc_avg_epoch: should increase (most important metric)
• lr: learning rate at current step
• GPU memory: peak should not exceed your GPU VRAM
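
To track these metrics outside TensorBoard, you can pull them out of the log with a small parser. The sketch below assumes each metric appears as `(name: number)` on a single log line, as in the example output above; `parse_metrics` is a hypothetical helper and the exact log layout may vary between FunASR versions.

```python
import re

# Matches "(name: 0.327)" or "(lr: 1.165e-04)".
METRIC_RE = re.compile(r"\((\w+):\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\)")


def parse_metrics(line):
    """Extract parenthesized 'name: value' pairs from one log line."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}


sample = ("train, rank: 0, epoch: 0/50, step: 6990, (loss_avg_rank: 0.327), "
          "(acc_avg_epoch: 0.795), (lr: 1.165e-04),")
metrics = parse_metrics(sample)
```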

TensorBoard

tensorboard --logdir outputs/log/tensorboard
# Open http://localhost:6006

Use Your Fine-tuned Model

If outputs/ has configuration.json

from funasr import AutoModel
model = AutoModel(model="./outputs")
res = model.generate(input="test.wav")
print(res[0]["text"])

If no configuration.json

funasr ++model="./outputs" \
  ++config-path="./outputs" \
  ++config-name="config.yaml" \
  ++init_param="./outputs/model.pt" \
  ++input="test.wav"

Tips & Troubleshooting

OOM during training

• Reduce dataset_conf.batch_size
• Add dataset_conf.max_token_length=2000 to filter long utterances
• Enable DeepSpeed (partitions optimizer states)
• Reduce dataset_conf.num_workers
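
Before setting a length filter, it is worth checking how much data it would drop. The sketch below counts records whose `source_len` exceeds a threshold in a source/target JSONL file. It assumes the filter compares against `source_len`; FunASR's exact rule may combine source and target lengths, so use the result as an estimate. `count_over_limit` is a hypothetical helper.

```python
import json


def count_over_limit(jsonl_path, max_len, field="source_len"):
    """Return (records over max_len, total records) for one JSONL file."""
    total = over = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            total += 1
            if json.loads(line)[field] > max_len:
                over += 1
    return over, total
```

If a large share of the data is over the limit, raising the threshold and lowering `dataset_conf.batch_size` instead may preserve more training material.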

Training loss stuck / NaN gradients

• Reduce the learning rate (try 0.00005)
• Check data quality: corrupted audio files can cause NaN values
• For Fun-ASR-Nano: start with the encoder frozen

Validation accuracy not improving

• Increase training data (roughly 10 hours minimum for fine-tuning)
• Check the domain match: the model may not generalize to very different domains
• Try unfreezing more layers gradually

Large-scale data (>10,000 hours)

Use data splitting to avoid memory issues:

# Split data into 256 groups to reduce memory use
++dataset_conf.data_split_num=256
# data.list contains paths to split jsonl files:
# data/train.0.jsonl
# data/train.1.jsonl
# ...
++train_data_set_list="data/data.list"
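
If your data is in one large JSONL file, the split files and the list file can be produced with a short script. The sketch below distributes lines round-robin into `train.0.jsonl`, `train.1.jsonl`, and so on, then writes their paths to `data.list`. `split_jsonl` is a hypothetical helper; how the number of files should relate to `data_split_num` is not specified here, so check your FunASR version before relying on a particular count.

```python
import os


def split_jsonl(src_path, out_dir, num_splits, list_name="data.list"):
    """Split one JSONL file into num_splits shards and write a list file."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(src_path))[0]
    paths = [os.path.join(out_dir, f"{base}.{i}.jsonl") for i in range(num_splits)]
    outs = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        with open(src_path, encoding="utf-8") as src:
            for n, line in enumerate(src):
                outs[n % num_splits].write(line)  # round-robin assignment
    finally:
        for f in outs:
            f.close()
    list_path = os.path.join(out_dir, list_name)
    with open(list_path, "w", encoding="utf-8") as f:
        f.write("\n".join(paths) + "\n")
    return list_path
```

Round-robin assignment keeps the shards close to equal in size; shuffle the source file first if it is ordered by speaker or length.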

Resume after crash

Set ++train_conf.resume=true (default). Training automatically resumes from the latest checkpoint in output_dir.