Training & Fine-tuning
Fine-tune pretrained models on your own data using FunASR's training framework.
Overview
FunASR's training framework supports:
- Fine-tuning any pretrained model on custom domain data
- Multi-GPU training with PyTorch DDP (single/multi-node)
- DeepSpeed ZeRO Stage 1/2/3 for large model training
- Dynamic batching by token count or example count
- Checkpoint averaging for best performance
- Resuming training after an interruption
The training entry point is funasr-train-ds (or funasr/bin/train_ds.py), launched via torchrun for distributed training.
Data Preparation
Standard Format (Paraformer, SenseVoice)
Training data uses JSONL format — one JSON object per line:
{"key": "utt001", "source": "/path/to/audio.wav", "source_len": 90, "target": "这是转写文本", "target_len": 6}
{"key": "utt002", "source": "/path/to/audio2.wav", "source_len": 150, "target": "hello world", "target_len": 2}
| Field | Type | Description |
|---|---|---|
| key | str | Unique utterance ID |
| source | str | Audio file path (local path or URL) |
| source_len | int | Audio length in fbank frames (1 frame = 10 ms) |
| target | str | Transcription text |
| target_len | int | Number of text tokens |
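Before training, it can help to check that every manifest line carries the five fields in the table above. The following is a minimal sketch using only the standard library. The helper names `validate_manifest_line` and `duration_to_frames` are hypothetical and not part of FunASR, and the frame estimate only applies the 10 ms rule from the table; the exact count produced by the real front end may differ.

```python
import json

# Field names and types taken from the table above.
REQUIRED_FIELDS = {
    "key": str,
    "source": str,
    "source_len": int,
    "target": str,
    "target_len": int,
}


def validate_manifest_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


def duration_to_frames(duration_seconds: float) -> int:
    """Approximate fbank frame count, assuming 1 frame = 10 ms."""
    return round(duration_seconds * 100)
```

Running the validator over each line of `train.jsonl` and printing any non-empty result is usually enough to catch truncated lines or string-typed lengths early.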
Generate from wav.scp + text.txt
If you have Kaldi-style data files, convert them:
# train_wav.scp (tab-separated: id \t path)
utt001	/data/audio/001.wav
utt002	/data/audio/002.wav
# train_text.txt (tab-separated: id \t text)
utt001	这是转写文本
utt002	hello world
# Convert to jsonl
scp2jsonl \
++scp_file_list='["/data/list/train_wav.scp", "/data/list/train_text.txt"]' \
++data_type_list='["source", "target"]' \
++jsonl_file_out="/data/list/train.jsonl"
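Because the two Kaldi-style files are joined by utterance ID, a mismatch between them is a common source of silently dropped samples. The sketch below is a pre-flight check, not a reimplementation of `scp2jsonl`; the function names `read_kaldi_table` and `unmatched_ids` are hypothetical.

```python
def read_kaldi_table(text: str) -> dict[str, str]:
    """Parse 'id<whitespace>value' lines; the value may itself contain spaces."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.split(maxsplit=1)
        table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table


def unmatched_ids(wav_scp: str, text_txt: str) -> tuple[set[str], set[str]]:
    """Return (IDs with audio but no text, IDs with text but no audio)."""
    wavs = read_kaldi_table(wav_scp)
    texts = read_kaldi_table(text_txt)
    return set(wavs) - set(texts), set(texts) - set(wavs)
```

Pass the contents of `train_wav.scp` and `train_text.txt` as strings; two empty sets mean every utterance has both an audio path and a transcript.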
ChatML Format (Fun-ASR-Nano)
Fun-ASR-Nano uses ChatML conversation format:
{"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "语音转写:<|startofspeech|>!/path/to/audio.wav<|endofspeech|>"},
{"role": "assistant", "content": "几点了?"}
], "speech_length": 145, "text_length": 3}
| Field | Description |
|---|---|
| messages[0] | System prompt (fixed: "You are a helpful assistant.") |
| messages[1] | User: prompt + audio path wrapped in <|startofspeech|>!...<|endofspeech|> |
| messages[2] | Assistant: transcription text |
| speech_length | Number of fbank frames (10 ms each) |
| text_length | Number of tokens (tokenized by Qwen3-0.6B) |
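Prompt prefixes for the user message:
• Chinese: 语音转写:
• English: Speech transcription:
• Cross-language (transcribe into English): 语音转写成英文:
• No ITN (no inverse text normalization): 语音转写,不进行文本规整:
Convert from wav.scp + text.txt: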
python tools/scp2jsonl.py \
++scp_file=data/train_wav.scp \
++transcript_file=data/train_text.txt \
++jsonl_file=data/train_example.jsonl
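The ChatML record layout described above can also be assembled by hand, for example when the data does not start from Kaldi-style files. This is a minimal sketch: `build_chatml_record` is a hypothetical helper, and the caller is assumed to supply `speech_length` and `text_length` already computed (the sketch does not run the fbank front end or the Qwen3-0.6B tokenizer).

```python
import json

SYSTEM_PROMPT = "You are a helpful assistant."


def build_chatml_record(audio_path, transcript, speech_length, text_length,
                        prompt="语音转写:"):
    """Build one training record in the ChatML layout shown above."""
    # The audio path is wrapped as <|startofspeech|>!PATH<|endofspeech|>.
    user_content = f"{prompt}<|startofspeech|>!{audio_path}<|endofspeech|>"
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": transcript},
        ],
        "speech_length": speech_length,
        "text_length": text_length,
    }


record = build_chatml_record("/path/to/audio.wav", "几点了?", 145, 3)
# ensure_ascii=False keeps Chinese text readable in the output file.
jsonl_line = json.dumps(record, ensure_ascii=False)
```

Each `jsonl_line` is written as one line of the training file; swapping `prompt` for one of the other prefixes selects English, cross-language, or no-ITN transcription.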
Fine-tune Paraformer
cd examples/industrial_data_pretraining/paraformer
bash finetune.sh
Or customize the key parameters:
export CUDA_VISIBLE_DEVICES="0,1"
gpu_num=2
torchrun --nproc_per_node $gpu_num \
funasr/bin/train_ds.py \
++model="iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" \
++train_data_set_list="data/train.jsonl" \
++valid_data_set_list="data/val.jsonl" \
++dataset_conf.batch_size=6000 \
++dataset_conf.batch_type="token" \
++dataset_conf.num_workers=4 \
++train_conf.max_epoch=50 \
++train_conf.validate_interval=2000 \
++train_conf.save_checkpoint_interval=2000 \
++train_conf.keep_nbest_models=20 \
++train_conf.avg_nbest_model=10 \
++optim_conf.lr=0.0002 \
++output_dir="./outputs"
Fine-tune SenseVoice
cd examples/industrial_data_pretraining/sense_voice
bash finetune.sh
Same data format as Paraformer (source/target JSONL). Key difference: SenseVoice uses its own dataset class internally.
Fine-tune Fun-ASR-Nano
cd examples/industrial_data_pretraining/fun_asr_nano
bash finetune.sh
Key differences from Paraformer:
- Uses ChatML data format (see above)
- ++trust_remote_code=true is required
- Supports selective freezing: freeze encoder/adaptor while training LLM decoder
# Freeze encoder + adaptor, only train LLM (recommended for domain adaptation)
++audio_encoder_conf.freeze=true
++audio_adaptor_conf.freeze=true
++llm_conf.freeze=false
# Full fine-tune (all parameters)
++audio_encoder_conf.freeze=false
++audio_adaptor_conf.freeze=false
++llm_conf.freeze=false
Parameter Reference
Dataset Parameters
| Parameter | Default | Description |
|---|---|---|
| dataset_conf.batch_type | "token" | "token": dynamic batching by total token count. "example": fixed number of samples per batch. |
| dataset_conf.batch_size | 6000 | Token mode: total frames per batch. Example mode: number of samples. |
| dataset_conf.sort_size | 1024 | Buffer size for length-based sorting (improves padding efficiency). |
| dataset_conf.num_workers | 4 | Number of data-loading workers. |
| dataset_conf.data_split_num | 1 | Split data into N groups for large-scale training (reduces memory). |
| dataset_conf.max_token_length | - | Filter: skip samples longer than this (in frames/tokens). |
| dataset_conf.min_token_length | - | Filter: skip samples shorter than this. |
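To make the interaction between `batch_type="token"`, `batch_size`, and `sort_size` concrete, here is a simplified sketch of length-bucketed dynamic batching. It is not FunASR's sampler: the function `token_batches` is hypothetical, and it assumes the batch cost is the padded size (longest sample times number of samples), which may differ from how the library counts frames.

```python
def token_batches(lengths, batch_size, sort_size):
    """Group sample indices so that max_len * num_samples <= batch_size.

    lengths: per-sample length in frames, in dataset order.
    sort_size: samples are sorted by length only within buffers of this size.
    """
    batches = []
    for start in range(0, len(lengths), sort_size):
        stop = min(start + sort_size, len(lengths))
        # Sorting within a bounded buffer puts similar lengths side by side,
        # which reduces padding without fully sorting the dataset.
        buffer = sorted(range(start, stop), key=lambda i: lengths[i])
        current, current_max = [], 0
        for idx in buffer:
            new_max = max(current_max, lengths[idx])
            if current and new_max * (len(current) + 1) > batch_size:
                batches.append(current)
                current, new_max = [], lengths[idx]
            current.append(idx)
            current_max = new_max
        if current:
            batches.append(current)
    return batches


# Four short utterances share a batch; the two long ones get their own.
print(token_batches([90, 150, 300, 100, 2900, 3100], batch_size=6000, sort_size=1024))
```

In this sketch a sample longer than `batch_size` still forms a batch of one, which is why `max_token_length` is useful as a separate filter.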
Training Parameters
| Parameter | Default | Description |
|---|---|---|
| train_conf.max_epoch | 50 | Total training epochs. |
| train_conf.log_interval | 1 | Print loss every N steps. |
| train_conf.validate_interval | 2000 | Run validation every N steps. |
| train_conf.save_checkpoint_interval | 2000 | Save model every N steps. |
| train_conf.keep_nbest_models | 20 | Keep top N models (by validation accuracy). |
| train_conf.avg_nbest_model | 10 | Average top N models for final checkpoint. |
| train_conf.resume | true | Resume from last checkpoint if exists. |
| train_conf.use_deepspeed | false | Enable DeepSpeed ZeRO optimization. |
| optim_conf.lr | 0.0002 | Learning rate. |
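The idea behind `keep_nbest_models` and `avg_nbest_model` (rank checkpoints by validation accuracy, then average the parameters of the best N) can be illustrated with plain Python. This is a conceptual sketch only: `average_top_checkpoints` is a hypothetical function, parameters are represented as lists of floats instead of tensors, and FunASR's own averaging code may differ in detail.

```python
def average_top_checkpoints(checkpoints, n):
    """Average the parameters of the n checkpoints with highest accuracy.

    checkpoints: list of (val_acc, params), where params maps a parameter
    name to a flat list of floats (a stand-in for a tensor).
    """
    best = sorted(checkpoints, key=lambda item: item[0], reverse=True)[:n]
    if not best:
        raise ValueError("no checkpoints to average")
    averaged = {}
    for name in best[0][1]:
        # zip(*) walks the same position across all selected checkpoints.
        columns = zip(*(params[name] for _, params in best))
        averaged[name] = [sum(values) / len(best) for values in columns]
    return averaged


checkpoints = [
    (0.70, {"w": [0.0, 0.0]}),
    (0.80, {"w": [1.0, 2.0]}),
    (0.90, {"w": [3.0, 4.0]}),
]
print(average_top_checkpoints(checkpoints, n=2))  # {'w': [2.0, 3.0]}
```

Averaging several good checkpoints tends to smooth out step-to-step noise, which is why the averaged model is used as the final checkpoint.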
Multi-GPU Training
Single Machine, Multiple GPUs
export CUDA_VISIBLE_DEVICES="0,1,2,3"
gpu_num=4
torchrun --nnodes 1 --nproc_per_node $gpu_num \
funasr/bin/train_ds.py ${train_args}
Multiple Machines
# Machine 1 (master, IP=192.168.1.1)
torchrun --nnodes 2 --node_rank 0 --nproc_per_node 4 \
--master_addr=192.168.1.1 --master_port=12345 \
funasr/bin/train_ds.py ${train_args}
# Machine 2
torchrun --nnodes 2 --node_rank 1 --nproc_per_node 4 \
--master_addr=192.168.1.1 --master_port=12345 \
funasr/bin/train_ds.py ${train_args}
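The relationship between the `torchrun` flags above and the process ranks can be written down as a small calculation. The sketch assumes the usual static launch in which each node starts `nproc_per_node` processes and ranks are assigned in node order; the function names are hypothetical.

```python
def world_size(nnodes: int, nproc_per_node: int) -> int:
    """Total number of training processes across all machines."""
    return nnodes * nproc_per_node


def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Rank of one process, given its node and its index within the node."""
    return node_rank * nproc_per_node + local_rank


# Two machines with four GPUs each, as in the commands above.
print(world_size(2, 4))       # 8
print(global_rank(1, 0, 4))   # 4: first process on machine 2
```

With this layout, machine 1 hosts ranks 0 to 3 and machine 2 hosts ranks 4 to 7, and both machines must use the same `--master_addr` and `--master_port`.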
DeepSpeed
For large models such as Fun-ASR-Nano (800M parameters), enable DeepSpeed ZeRO:
++train_conf.use_deepspeed=true ++train_conf.deepspeed_config=./deepspeed_conf/ds_stage1.json
Stage 1 config (recommended starting point):
{
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"bf16": {"enabled": true},
"zero_optimization": {
"stage": 1,
"reduce_bucket_size": 5e8,
"allgather_bucket_size": 5e8
}
}
• Stage 1: Optimizer state partitioned. Good for most cases.
• Stage 2: + Gradient partitioned. For larger models.
• Stage 3: + Parameter partitioned. Maximum memory savings but slower.
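A rough estimate of what each stage saves can be computed from the standard ZeRO accounting for mixed-precision Adam training: 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for optimizer states. The sketch below applies that accounting; it is an approximation that ignores activations and buffers, and it assumes an Adam-style optimizer, which may not match a given FunASR recipe. The function name is hypothetical.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB) for weights, gradients, optimizer states."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus    # Stage 1: optimizer states partitioned
    if stage >= 2:
        grads /= num_gpus    # Stage 2: gradients partitioned as well
    if stage >= 3:
        weights /= num_gpus  # Stage 3: parameters partitioned as well
    return num_params * (weights + grads + optim) / 1e9


# 800M parameters on 4 GPUs, without ZeRO (stage 0) and with stages 1 to 3.
for stage in range(4):
    print(stage, round(zero_model_state_gb(800e6, 4, stage), 1))
```

Under these assumptions the estimate drops from about 12.8 GB per GPU without ZeRO to about 5.6 GB at Stage 1, which is why Stage 1 is often enough for a model of this size.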
Monitoring Training
Log file
tail -f outputs/log.txt
# Example output:
# train, rank: 0, epoch: 0/50, step: 6990, (loss_avg_rank: 0.327),
# (acc_avg_epoch: 0.795), (lr: 1.165e-04),
# GPU memory: usage: 3.8GB, peak: 18.3GB
Key metrics to watch:
- loss_avg_epoch: should decrease over time
- acc_avg_epoch: should increase (most important metric)
- lr: learning rate at current step
- GPU memory: peak should not exceed your GPU VRAM
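To track these metrics over a long run, the parenthesized `(name: value)` pairs can be pulled out of the log with a regular expression. This sketch assumes log lines shaped like the example output above; the pattern and the function `parse_metrics` are illustrative and would need adjusting if the actual log layout differs.

```python
import re

# Matches "(name: number)" pairs such as "(acc_avg_epoch: 0.795)".
METRIC_PATTERN = re.compile(
    r"\((\w+): ([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\)"
)


def parse_metrics(log_line: str) -> dict[str, float]:
    """Extract all (name: value) metric pairs from one log line."""
    return {name: float(value) for name, value in METRIC_PATTERN.findall(log_line)}


line = (
    "train, rank: 0, epoch: 0/50, step: 6990, (loss_avg_rank: 0.327), "
    "(acc_avg_epoch: 0.795), (lr: 1.165e-04), "
    "GPU memory: usage: 3.8GB, peak: 18.3GB"
)
metrics = parse_metrics(line)
```

Collecting `metrics["acc_avg_epoch"]` across lines gives a quick accuracy curve without starting TensorBoard.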
TensorBoard
tensorboard --logdir outputs/log/tensorboard
# Open http://localhost:6006
Use Your Fine-tuned Model
If outputs/ has configuration.json
from funasr import AutoModel

model = AutoModel(model="./outputs")
res = model.generate(input="test.wav")
print(res[0]["text"])
If no configuration.json
funasr ++model="./outputs" \
++config-path="./outputs" \
++config-name="config.yaml" \
++init_param="./outputs/model.pt" \
++input="test.wav"
Tips & Troubleshooting
OOM during training
- Reduce dataset_conf.batch_size
- Add dataset_conf.max_token_length=2000 to filter long utterances
- Enable DeepSpeed (partitions optimizer states)
- Reduce dataset_conf.num_workers
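Before setting `max_token_length`, it is worth checking how much data a given cutoff would discard. The sketch below counts manifest entries whose `source_len` exceeds the cutoff. It assumes the standard source/target JSONL format and that the filter applies to the audio length in frames; `dropped_by_cutoff` is a hypothetical helper.

```python
import json


def dropped_by_cutoff(jsonl_lines, max_token_length):
    """Return (dropped, total) for a cutoff applied to source_len."""
    total = dropped = 0
    for line in jsonl_lines:
        if not line.strip():
            continue  # ignore blank lines
        total += 1
        if json.loads(line)["source_len"] > max_token_length:
            dropped += 1
    return dropped, total


sample = [
    '{"key": "utt001", "source_len": 90}',
    '{"key": "utt002", "source_len": 150}',
    '{"key": "utt003", "source_len": 2500}',
]
print(dropped_by_cutoff(sample, 2000))  # (1, 3)
```

In practice the lines would come from `open("data/train.jsonl", encoding="utf-8")`; if the cutoff removes more than a small fraction of the data, a higher value combined with a smaller `batch_size` may be the better trade-off.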
Training loss stuck / NaN gradients
- Reduce learning rate (try 0.00005)
- Check data quality — corrupted audio files cause NaN
- For Fun-ASR-Nano: start with encoder frozen
Validation accuracy not improving
- Increase training data (min ~10h for fine-tuning)
- Check domain match — model may not generalize to very different domains
- Try unfreezing more layers gradually
Large-scale data (>10,000 hours)
Use data splitting to avoid memory issues:
# Split data into chunks, load 2 at a time
++dataset_conf.data_split_num=256
# data.list contains paths to split jsonl files:
# data/train.0.jsonl
# data/train.1.jsonl
# ...
++train_data_set_list="data/data.list"
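The split files and `data.list` referenced above can be produced with a short script. This sketch distributes lines round-robin into `train.N.jsonl` shards and writes one shard path per line to `data.list`, following the naming shown above. The helper `shard_jsonl` is hypothetical, and the exact relationship FunASR expects between `data_split_num` and the number of shard files is not stated here, so treat the shard count as something to verify against your recipe.

```python
from pathlib import Path


def shard_jsonl(src, out_dir, num_shards):
    """Split one JSONL file into num_shards files and write data.list."""
    src, out_dir = Path(src), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # train.jsonl -> train.0.jsonl, train.1.jsonl, ...
    shard_paths = [out_dir / f"{src.stem}.{i}.jsonl" for i in range(num_shards)]
    handles = [path.open("w", encoding="utf-8") for path in shard_paths]
    try:
        with src.open(encoding="utf-8") as manifest:
            kept = 0
            for line in manifest:
                if not line.strip():
                    continue  # skip blank lines
                handles[kept % num_shards].write(line.rstrip("\n") + "\n")
                kept += 1
    finally:
        for handle in handles:
            handle.close()
    list_path = out_dir / "data.list"
    list_path.write_text(
        "".join(f"{path}\n" for path in shard_paths), encoding="utf-8"
    )
    return list_path
```

Calling `shard_jsonl("data/train.jsonl", "data", 256)` would write the shards next to the original file and return the path to pass as `++train_data_set_list`.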
Resume after crash
Set ++train_conf.resume=true (default). Training automatically restarts from the latest checkpoint in output_dir.