Training & Fine-tuning

Fine-tune pretrained models on your own data using FunASR's training framework.

Overview

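This guide covers fine-tuning Paraformer, SenseVoice, and Fun-ASR-Nano, single-machine and multi-machine GPU training, DeepSpeed, and training monitoring.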

The training entry point is funasr-train-ds (or funasr/bin/train_ds.py), launched via torchrun for distributed training.

Data Preparation

Standard Format (Paraformer, SenseVoice)

Training data uses JSONL format — one JSON object per line:

{"key": "utt001", "source": "/path/to/audio.wav", "source_len": 90, "target": "这是转写文本", "target_len": 6}
{"key": "utt002", "source": "/path/to/audio2.wav", "source_len": 150, "target": "hello world", "target_len": 2}

Field       Type  Description
key         str   Unique utterance ID
source      str   Audio file path (local path or URL)
source_len  int   Audio length in fbank frames (1 frame = 10 ms)
target      str   Transcription text
target_len  int   Number of text tokens
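
The fields above can be checked before training. The sketch below is illustrative only: estimate_fbank_frames and check_record are hypothetical helper names, not FunASR functions. The frame estimate assumes the 10 ms frame shift stated in the table and ignores window-edge effects, so the true fbank frame count may differ slightly.

```python
import json

# Field names and types taken from the table above.
REQUIRED_FIELDS = {
    "key": str,
    "source": str,
    "source_len": int,
    "target": str,
    "target_len": int,
}


def estimate_fbank_frames(num_samples, sample_rate, frame_shift_ms=10):
    """Approximate frame count: duration in ms divided by the frame shift."""
    duration_ms = num_samples * 1000 / sample_rate
    return int(duration_ms // frame_shift_ms)


def check_record(line):
    """Parse one JSONL line and return a list of problems (empty if none)."""
    record = json.loads(line)
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    return problems
```

For instance, a 0.9 s clip sampled at 16 kHz (14400 samples) gives an estimate of 90 frames, which matches the source_len of the first example record.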

Generate from wav.scp + text.txt

If you have Kaldi-style data files, convert them:

# train_wav.scp (tab-separated: id \t path)
utt001  /data/audio/001.wav
utt002  /data/audio/002.wav

# train_text.txt (tab-separated: id \t text)
utt001  这是转写文本
utt002  hello world

# Convert to jsonl
scp2jsonl \
  ++scp_file_list='["/data/list/train_wav.scp", "/data/list/train_text.txt"]' \
  ++data_type_list='["source", "target"]' \
  ++jsonl_file_out="/data/list/train.jsonl"
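
Conceptually, this conversion joins the two files on the utterance ID. The sketch below illustrates that join on in-memory text; it is not the scp2jsonl implementation. It omits source_len (which requires reading the audio), and the count_tokens rule (one token per CJK character or per whitespace-separated word) is an assumption chosen to reproduce the target_len values in the JSONL example earlier; the real tokenizer is model-specific.

```python
import json
import re

# Assumed token rule: one CJK character, or one run of other non-space characters.
TOKEN_PATTERN = re.compile(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+")


def count_tokens(text):
    return len(TOKEN_PATTERN.findall(text))


def parse_kaldi_table(content):
    """Map utterance ID -> value for lines of the form '<id><whitespace><value>'."""
    table = {}
    for line in content.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            table[parts[0]] = parts[1]
    return table


def join_to_jsonl(wav_scp, text_txt):
    """Return one JSON string per utterance present in both inputs."""
    wavs = parse_kaldi_table(wav_scp)
    texts = parse_kaldi_table(text_txt)
    lines = []
    for utt_id, path in wavs.items():
        if utt_id not in texts:
            continue  # skip audio without a transcript
        target = texts[utt_id]
        record = {
            "key": utt_id,
            "source": path,
            "target": target,
            "target_len": count_tokens(target),
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines
```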

ChatML Format (Fun-ASR-Nano)

Fun-ASR-Nano uses ChatML conversation format:

{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "语音转写:<|startofspeech|>!/path/to/audio.wav<|endofspeech|>"},
  {"role": "assistant", "content": "几点了?"}
], "speech_length": 145, "text_length": 3}

Field          Description
messages[0]    System prompt (fixed: "You are a helpful assistant.")
messages[1]    User: prompt + audio path wrapped in <|startofspeech|>!...<|endofspeech|>
messages[2]    Assistant: transcription text
speech_length  Number of fbank frames (10 ms each)
text_length    Number of tokens (tokenized by Qwen3-0.6B)

Prompt variations:
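• Chinese: 语音转写: ("Speech transcription:")
• English: Speech transcription:
• Cross-language: 语音转写成英文: ("Transcribe speech into English:")
• No ITN: 语音转写,不进行文本规整: ("Speech transcription, without text normalization:")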
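
A record in this format can be assembled programmatically. The following is a minimal sketch that mirrors the example record shown earlier; make_chatml_record is a hypothetical helper name, and speech_length and text_length are passed in by the caller because computing them requires the audio front end and the tokenizer.

```python
import json

SYSTEM_PROMPT = "You are a helpful assistant."


def make_chatml_record(audio_path, transcript, speech_length, text_length,
                       prompt="语音转写:"):
    """Return one JSONL line in the ChatML layout described above."""
    # The audio path is wrapped in the speech markers, with "!" before the path.
    user_content = f"{prompt}<|startofspeech|>!{audio_path}<|endofspeech|>"
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": transcript},
        ],
        "speech_length": speech_length,
        "text_length": text_length,
    }
    return json.dumps(record, ensure_ascii=False)
```

Passing a different prompt string (for example "Speech transcription:") selects one of the prompt variations listed above.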

Convert from wav.scp + text.txt:

python tools/scp2jsonl.py \
  ++scp_file=data/train_wav.scp \
  ++transcript_file=data/train_text.txt \
  ++jsonl_file=data/train_example.jsonl

Fine-tune Paraformer

cd examples/industrial_data_pretraining/paraformer
bash finetune.sh

Or customize the key parameters:

export CUDA_VISIBLE_DEVICES="0,1"
gpu_num=2

torchrun --nproc_per_node $gpu_num \
  funasr/bin/train_ds.py \
  ++model="iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" \
  ++train_data_set_list="data/train.jsonl" \
  ++valid_data_set_list="data/val.jsonl" \
  ++dataset_conf.batch_size=6000 \
  ++dataset_conf.batch_type="token" \
  ++dataset_conf.num_workers=4 \
  ++train_conf.max_epoch=50 \
  ++train_conf.validate_interval=2000 \
  ++train_conf.save_checkpoint_interval=2000 \
  ++train_conf.keep_nbest_models=20 \
  ++train_conf.avg_nbest_model=10 \
  ++optim_conf.lr=0.0002 \
  ++output_dir="./outputs"
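
When launching from a script, it can be convenient to keep these settings in a nested dictionary and flatten them into ++key=value arguments. The sketch below shows one way to do that; to_overrides is a hypothetical helper, not part of FunASR, and the resulting list is meant to be appended to a torchrun command passed to subprocess (so string values are not shell-quoted).

```python
def to_overrides(config, prefix=""):
    """Flatten a nested dict into a list of '++dotted.key=value' strings."""
    args = []
    for name, value in config.items():
        dotted = f"{prefix}{name}"
        if isinstance(value, dict):
            # Recurse into sub-sections such as dataset_conf or train_conf.
            args.extend(to_overrides(value, prefix=dotted + "."))
        elif isinstance(value, bool):
            # Booleans are written in lowercase, as in the commands in this guide.
            args.append(f"++{dotted}={str(value).lower()}")
        else:
            args.append(f"++{dotted}={value}")
    return args


overrides = to_overrides({
    "train_data_set_list": "data/train.jsonl",
    "dataset_conf": {"batch_size": 6000, "batch_type": "token"},
    "optim_conf": {"lr": 0.0002},
})
```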

Fine-tune SenseVoice

cd examples/industrial_data_pretraining/sense_voice
bash finetune.sh

Same data format as Paraformer (source/target JSONL). Key difference: SenseVoice uses its own dataset class internally.

Fine-tune Fun-ASR-Nano

cd examples/industrial_data_pretraining/fun_asr_nano
bash finetune.sh

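Besides the ChatML data format, the key difference from Paraformer is that the audio encoder, the audio adaptor, and the LLM can each be frozen or trained independently: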

# Freeze encoder + adaptor, only train LLM (recommended for domain adaptation)
++audio_encoder_conf.freeze=true
++audio_adaptor_conf.freeze=true
++llm_conf.freeze=false

# Full fine-tune (all parameters)
++audio_encoder_conf.freeze=false
++audio_adaptor_conf.freeze=false
++llm_conf.freeze=false

Recommended strategy: Start with LLM-only fine-tuning (faster, and it needs less data). If results are insufficient, also unfreeze the adaptor. Unfreeze the encoder only with very large datasets (>1000 h).
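
The staged strategy can be expressed as a small lookup that returns the freeze flags for each stage. This is a sketch: the plan names (llm_only, llm_and_adaptor, full) and the helper freeze_overrides are made up for illustration; only the three ++..._conf.freeze options come from the commands above.

```python
# True means the module is frozen; the LLM is trained in every plan.
FREEZE_PLANS = {
    "llm_only":        {"audio_encoder": True,  "audio_adaptor": True,  "llm": False},
    "llm_and_adaptor": {"audio_encoder": True,  "audio_adaptor": False, "llm": False},
    "full":            {"audio_encoder": False, "audio_adaptor": False, "llm": False},
}


def freeze_overrides(plan):
    """Return the '++<module>_conf.freeze=<bool>' arguments for a plan."""
    flags = FREEZE_PLANS[plan]
    return [
        f"++{module}_conf.freeze={str(frozen).lower()}"
        for module, frozen in flags.items()
    ]
```

Starting with freeze_overrides("llm_only") and moving to the next plan only when validation results stall keeps the number of trainable parameters as small as the data allows.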

Parameter Reference

Dataset Parameters

Parameter                      Default  Description
dataset_conf.batch_type        "token"  "token": dynamic batching by total tokens. "example": fixed number of samples per batch.
dataset_conf.batch_size        6000     Token mode: total frames per batch. Example mode: number of samples.
dataset_conf.sort_size         1024     Buffer size for length-based sorting (improves padding efficiency).
dataset_conf.num_workers       4        Number of data-loading workers.
dataset_conf.data_split_num    1        Split data into N groups for large-scale training (reduces memory).
dataset_conf.max_token_length  -        Filter: skip samples longer than this (in frames/tokens).
dataset_conf.min_token_length  -        Filter: skip samples shorter than this.
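
To make the interaction between batch_type="token", batch_size, and sort_size concrete, the sketch below groups samples into batches under a frame budget after sorting them by length within a buffer. It is a simplified illustration, not FunASR's sampler: it assumes the budget applies to the sum of sample lengths, whereas a real sampler may count padded length instead.

```python
def token_batches(lengths, batch_size=6000, sort_size=1024):
    """Group sample indices into batches whose total length fits batch_size."""
    batches = []
    for start in range(0, len(lengths), sort_size):
        end = min(start + sort_size, len(lengths))
        # Sort indices inside the buffer by length so each batch holds
        # samples of similar size (less padding).
        buffer = sorted(range(start, end), key=lambda i: lengths[i])
        current, total = [], 0
        for i in buffer:
            # Close the batch when adding this sample would exceed the budget.
            if current and total + lengths[i] > batch_size:
                batches.append(current)
                current, total = [], 0
            current.append(i)
            total += lengths[i]
        if current:
            batches.append(current)
    return batches
```

With a budget of 6000 frames, short utterances produce batches with many samples and long utterances produce batches with few, which is why the number of samples per batch varies in token mode.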

Training Parameters

Parameter                            Default  Description
train_conf.max_epoch                 50       Total training epochs.
train_conf.log_interval              1        Print loss every N steps.
train_conf.validate_interval         2000     Run validation every N steps.
train_conf.save_checkpoint_interval  2000     Save model every N steps.
train_conf.keep_nbest_models         20       Keep top N models (by validation accuracy).
train_conf.avg_nbest_model           10       Average top N models for final checkpoint.
train_conf.resume                    true     Resume from last checkpoint if exists.
train_conf.use_deepspeed             false    Enable DeepSpeed ZeRO optimization.
optim_conf.lr                        0.0002   Learning rate.
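
The idea behind avg_nbest_model is parameter averaging over the best checkpoints. The sketch below shows the arithmetic with plain Python lists standing in for tensors; it is an illustration of the technique under that simplification, not FunASR's checkpoint code, and average_nbest is a hypothetical name.

```python
def average_nbest(checkpoints, n):
    """Average parameters of the n checkpoints with the highest accuracy.

    checkpoints: list of (validation_accuracy, {param_name: [floats]}).
    """
    best = sorted(checkpoints, key=lambda ckpt: ckpt[0], reverse=True)[:n]
    averaged = {}
    for name in best[0][1]:
        # Walk the same parameter across all selected checkpoints.
        columns = zip(*(params[name] for _, params in best))
        averaged[name] = [sum(column) / len(best) for column in columns]
    return averaged


checkpoints = [
    (0.70, {"w": [0.0, 0.0]}),
    (0.80, {"w": [1.0, 2.0]}),
    (0.90, {"w": [3.0, 4.0]}),
]
final = average_nbest(checkpoints, n=2)  # uses the 0.90 and 0.80 checkpoints
```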

Multi-GPU Training

Single Machine, Multiple GPUs

export CUDA_VISIBLE_DEVICES="0,1,2,3"
gpu_num=4

torchrun --nnodes 1 --nproc_per_node $gpu_num \
  funasr/bin/train_ds.py ${train_args}

Multiple Machines

# Machine 1 (master, IP=192.168.1.1)
torchrun --nnodes 2 --node_rank 0 --nproc_per_node 4 \
  --master_addr=192.168.1.1 --master_port=12345 \
  funasr/bin/train_ds.py ${train_args}

# Machine 2
torchrun --nnodes 2 --node_rank 1 --nproc_per_node 4 \
  --master_addr=192.168.1.1 --master_port=12345 \
  funasr/bin/train_ds.py ${train_args}
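
Because the per-machine commands differ only in --node_rank, they can be generated from one function. This sketch uses only the torchrun flags shown above; torchrun_command and world_size are hypothetical helper names.

```python
def torchrun_command(node_rank, nnodes, nproc_per_node,
                     master_addr, master_port, train_args):
    """Build the argument list for one machine of a multi-machine job."""
    return [
        "torchrun",
        "--nnodes", str(nnodes),
        "--node_rank", str(node_rank),
        "--nproc_per_node", str(nproc_per_node),
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        "funasr/bin/train_ds.py",
        *train_args,
    ]


def world_size(nnodes, nproc_per_node):
    """Total number of training processes across all machines."""
    return nnodes * nproc_per_node


# One command per machine; every machine points at the same master address.
commands = [
    torchrun_command(rank, 2, 4, "192.168.1.1", 12345, ["++output_dir=./outputs"])
    for rank in range(2)
]
```

Each list can be passed to subprocess.run on the corresponding machine. For the two-machine example above, world_size(2, 4) is 8 processes in total.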

DeepSpeed

For large models such as Fun-ASR-Nano (800M parameters), enable DeepSpeed ZeRO:

++train_conf.use_deepspeed=true
++train_conf.deepspeed_config=./deepspeed_conf/ds_stage1.json

Stage 1 config (recommended starting point):

{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  }
}

When to use which stage:
• Stage 1: Partitions optimizer state across GPUs. Good for most cases.
• Stage 2: Also partitions gradients. For larger models.
• Stage 3: Also partitions parameters. Maximum memory savings, but slower.
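
To try a different stage, the Stage 1 config can be regenerated with only the stage number changed. The sketch below does just that and nothing more: it reuses the keys from the config above, deepspeed_config is a hypothetical helper, and any stage-specific tuning keys (which Stage 3 in particular may need) are left out.

```python
import json


def deepspeed_config(stage, micro_batch_size=1, grad_accum_steps=1):
    """Return the Stage 1 config from this guide with a different ZeRO stage."""
    if stage not in (1, 2, 3):
        raise ValueError("ZeRO stage must be 1, 2, or 3")
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": stage,
            "reduce_bucket_size": 500_000_000,     # 5e8, as in the config above
            "allgather_bucket_size": 500_000_000,  # 5e8, as in the config above
        },
    }


# JSON text that can be saved and passed via ++train_conf.deepspeed_config.
stage2_json = json.dumps(deepspeed_config(2), indent=2)
```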

Monitoring Training

Log file

tail -f outputs/log.txt

# Example output:
# train, rank: 0, epoch: 0/50, step: 6990, (loss_avg_rank: 0.327),
# (acc_avg_epoch: 0.795), (lr: 1.165e-04),
# GPU memory: usage: 3.8GB, peak: 18.3GB

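Key metrics to watch in this output are the loss (loss_avg_rank), the accuracy (acc_avg_epoch), the learning rate (lr), and GPU memory usage.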
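
To track these values over time, log entries can be parsed into numbers. The sketch below is written against the example output shown above and assumes that format; the actual log layout may differ between FunASR versions, and parse_log_entry is a hypothetical helper.

```python
import re

# Matches "(name: number)" groups such as "(loss_avg_rank: 0.327)".
METRIC_PATTERN = re.compile(r"\((\w+): ([0-9.eE+-]+)\)")
# Matches "epoch: 0/50, step: 6990".
PROGRESS_PATTERN = re.compile(r"epoch: (\d+)/(\d+), step: (\d+)")


def parse_log_entry(text):
    """Extract metrics, epoch, and step from one training log entry."""
    metrics = {name: float(value) for name, value in METRIC_PATTERN.findall(text)}
    progress = PROGRESS_PATTERN.search(text)
    if progress:
        metrics["epoch"] = int(progress.group(1))
        metrics["step"] = int(progress.group(3))
    return metrics


entry = (
    "train, rank: 0, epoch: 0/50, step: 6990, (loss_avg_rank: 0.327), "
    "(acc_avg_epoch: 0.795), (lr: 1.165e-04), "
    "GPU memory: usage: 3.8GB, peak: 18.3GB"
)
parsed = parse_log_entry(entry)
```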

TensorBoard

tensorboard --logdir outputs/log/tensorboard
# Open http://localhost:6006

Use Your Fine-tuned Model

If outputs/ has configuration.json

from funasr import AutoModel
model = AutoModel(model="./outputs")
res = model.generate(input="test.wav")
print(res[0]["text"])

If no configuration.json

funasr ++model="./outputs" \
  ++config-path="./outputs" \
  ++config-name="config.yaml" \
  ++init_param="./outputs/model.pt" \
  ++input="test.wav"
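
The choice between the two cases can be automated by checking for configuration.json. The sketch below only decides which route applies and returns the corresponding arguments from this section; it does not load the model itself, and inference_plan is a hypothetical helper.

```python
import os


def inference_plan(output_dir, audio="test.wav"):
    """Pick the loading route based on whether configuration.json exists."""
    if os.path.exists(os.path.join(output_dir, "configuration.json")):
        # Route 1: AutoModel(model=output_dir), then model.generate(input=audio).
        return {"method": "AutoModel", "kwargs": {"model": output_dir}}
    # Route 2: the funasr command line with explicit config and weights.
    return {
        "method": "cli",
        "args": [
            "funasr",
            f"++model={output_dir}",
            f"++config-path={output_dir}",
            "++config-name=config.yaml",
            f"++init_param={os.path.join(output_dir, 'model.pt')}",
            f"++input={audio}",
        ],
    }
```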

Tips & Troubleshooting

OOM during training

  1. Reduce dataset_conf.batch_size
  2. Add dataset_conf.max_token_length=2000 to filter long utterances
  3. Enable DeepSpeed (partitions optimizer states)
  4. Reduce dataset_conf.num_workers

Training loss stuck / NaN gradients

Validation accuracy not improving

Large-scale data (>10,000 hours)

Use data splitting to avoid memory issues:

# Split data into chunks, load 2 at a time
++dataset_conf.data_split_num=256
# data.list contains paths to split jsonl files:
# data/train.0.jsonl
# data/train.1.jsonl
# ...
++train_data_set_list="data/data.list"
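
The split files and data.list can be produced with a short script. The sketch below follows the data/train.0.jsonl naming pattern shown above; the round-robin assignment of lines to shards and the helper name split_jsonl are assumptions for illustration, not FunASR behavior.

```python
def split_jsonl(jsonl_path, num_splits, list_path):
    """Split a JSONL file into num_splits shards and write their paths to list_path."""
    with open(jsonl_path, encoding="utf-8") as f:
        # Normalize line endings and drop blank lines.
        lines = [line.rstrip("\n") + "\n" for line in f if line.strip()]

    stem = jsonl_path[:-len(".jsonl")] if jsonl_path.endswith(".jsonl") else jsonl_path
    shard_paths = []
    for index in range(num_splits):
        shard_path = f"{stem}.{index}.jsonl"
        with open(shard_path, "w", encoding="utf-8") as out:
            # Round-robin: shard i gets lines i, i + num_splits, ...
            out.writelines(lines[index::num_splits])
        shard_paths.append(shard_path)

    with open(list_path, "w", encoding="utf-8") as out:
        out.write("\n".join(shard_paths) + "\n")
    return shard_paths
```

For example, split_jsonl("data/train.jsonl", 256, "data/data.list") would write data/train.0.jsonl through data/train.255.jsonl and list them in data/data.list.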

Resume after crash

Keep ++train_conf.resume=true (the default). Training then resumes automatically from the latest checkpoint in output_dir.