This folder contains scripts to train a Transformer-based speech recognizer, as well as scripts to fine-tune the Whisper encoder-decoder model.
You can download LibriSpeech at http://www.openslr.org/12
To run a training, execute one of the following commands:

```shell
python train_with_whisper.py hparams/train_hf_whisper.yaml
python train.py hparams/transformer.yaml
```

If you want to run on the test sets only, add the `--test_only` flag:

```shell
python train_with_whisper.py hparams/train_hf_whisper.yaml --test_only
python train.py hparams/transformer.yaml --test_only
```

If using a HuggingFace pre-trained model, please make sure you have `transformers` installed in your environment (see `extra-requirements.txt`).
Two SpeechLLM modes are supported:
- SpeechLLM with SSL features
- SpeechLLM with E2E features
In the first mode, the speech features are extracted from the audio waveforms with a pre-trained SSL model and projected into the LLM embedding space through a linear layer; the whole pipeline is trained jointly.
In the second mode, the speech features are extracted offline beforehand (see the extract_ssl_feats.py script), and the LLM is then trained on these frozen SSL representations. This mode is more efficient and faster to train, but less flexible, since the SSL model stays frozen.
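As a minimal sketch of the projection step described above: each SSL feature frame is mapped from the SSL dimension into the LLM embedding dimension by a single learned linear layer. The dimensions below (1024 for WavLM Large, 2048 for a small Llama-style embedding) and the function name are illustrative assumptions, not the recipe's actual hyperparameters:

```python
import numpy as np

# Illustrative dimensions (assumptions, not the recipe's real hyperparameters):
# WavLM Large emits 1024-dim frames; a small Llama-style LLM uses 2048-dim embeddings.
D_SSL, D_LLM = 1024, 2048

rng = np.random.default_rng(0)
W = rng.standard_normal((D_SSL, D_LLM)) * 0.02  # learned projection weights
b = np.zeros(D_LLM)                             # learned projection bias

def project_ssl_to_llm(ssl_feats: np.ndarray) -> np.ndarray:
    """Map a (num_frames, D_SSL) matrix of SSL features into the LLM embedding space."""
    return ssl_feats @ W + b

frames = rng.standard_normal((50, D_SSL))  # 50 frames of SSL features
llm_inputs = project_ssl_to_llm(frames)
print(llm_inputs.shape)  # (50, 2048)
```

In the joint (first) mode, `W` and `b` receive gradients together with the LLM (here typically via LoRA adapters); in the offline (second) mode, the same projection is trained on top of features that were precomputed once and kept fixed.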
| Release | Model | hyperparams file | Dev Clean WER | Dev Other WER | Test Clean WER | Test Other WER | HuggingFace link | Model link | GPUs |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 29-01-26 | WavLM Large + LLama 3.2 1B + LoRA | speechllm_e2e.yaml | 2.79 | 5.03 | 2.72 | 5.34 | HuggingFace | - | 1xA100 80GB |
The following table contains Whisper fine-tuning results after 1 epoch, freezing the encoder and fine-tuning the decoder.
| Release | Model | commit hash | hyperparams file | LM | Dev Clean WER | Test Clean WER | Test Other WER | HuggingFace link | Model link | GPUs |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024-03-28 | large-v3 | e4e2e13 | train_hf_whisper.yaml | No | 2.00% | 1.96% | 4.30% | Not Avail. | DropBox | 2xV100S 32GB |
| 2024-03-28 | medium.en | e4e2e13 | train_hf_whisper.yaml | No | 2.35% | 2.40% | 5.59% | Not Avail. | DropBox | 2xV100S 32GB |
| 2024-07-20 | small.en | 9864011 | train_whisper_lora.yaml | No | 2.81% | 2.90% | 6.57% | Not Avail. | DropBox | 1x1080Ti 12GB |
| Release | hyperparams file | Dev Clean WER (No LM, small beam) | Test Clean WER (Transformer LM) | Test Other WER (Transformer LM) | HuggingFace link | Model link | GPUs |
|---|---|---|---|---|---|---|---|
| 30-09-24 | conformer_large.yaml (new RoPE version) | 1.85 with LM | 1.96 | 4.50 | Not Avail. | Not Avail. | 4xA40 46GB |
| 23-05-23 | branchformer_large.yaml | 2.72 (1.9 with LM) | 2.04 | 4.13 | Not Avail. | DropBox | 4xA100 80GB |
| 10-02-25 | conformer_large.yaml | 1.85 with LM | 1.97 | 4.50 | N/A | N/A | 4xA100 80GB |
| 23-05-23 | conformer_large.yaml | 2.62 (1.9 with LM) | 2.01 | 4.52 | HuggingFace | DropBox | 4xA100 80GB |
| 24-03-22 | transformer.yaml | 3.32 | 2.27 | 5.53 | HuggingFace | DropBox | 4xV100 32GB |
| 24-03-22 | conformer_small.yaml | 4.05 | 2.49 | 6.1 (only 13.3M parameters) | HuggingFace | DropBox | 1xV100 32GB |
| 27-03-23 | hyperconformer_8M.yaml | 4.69 | 2.55 | 6.61 (only 7.9M parameters) | Not Avail. | DropBox | 1xP40 24GB |
| 27-03-23 | hyperconformer_22M.yaml | 3.19 | 2.23 | 5.54 (only 21.7M parameters) | Not Avail. | DropBox | 1xP40 24GB |
| 03-09-23 | hyperbranchformer_13M.yaml | NA | 2.54 | 6.58 | Not Avail. | Not Avail. | 1xP40 24GB |
| 03-09-23 | hyperbranchformer_25M.yaml | NA | 2.36 | 5.89 | Not Avail. | Not Avail. | 1xP40 24GB |
| 05-01-24 | bayesspeech.yaml | 4.28 | 2.84 | 6.27 | Not Avail. | DropBox | 1xV100 32GB |
- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/
- HuggingFace: https://huggingface.co/speechbrain/
Please cite SpeechBrain if you use it for your research or business.
```bibtex
@misc{speechbrainV1,
  title={Open-Source Conversational AI with SpeechBrain 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}

@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```