# LibriSpeech ASR with Transformers or Whisper models

This folder contains scripts to train a Transformer-based speech recognizer and to fine-tune the Whisper encoder-decoder model.

You can download LibriSpeech at http://www.openslr.org/12.

# How to run

```shell
python train_with_whisper.py hparams/train_hf_whisper.yaml
python train.py hparams/transformer.yaml
```

# How to run on test sets only

If you want to run the recipe on the test sets only, add the `--test_only` flag:

```shell
python train_with_whisper.py hparams/train_hf_whisper.yaml --test_only
python train.py hparams/transformer.yaml --test_only
```

If using a HuggingFace pre-trained model, please make sure you have `transformers` installed in your environment (see `extra-requirements.txt`).

# Results

## SpeechLLM

Two SpeechLLM modes are supported:

- SpeechLLM with SSL features
- SpeechLLM with E2E features

In the first mode, speech features are extracted from the audio waveforms with a pre-trained SSL model and then projected into the LLM embedding space through a linear projection layer; the whole pipeline is trained jointly.

In the second mode, the speech features are extracted offline beforehand (see the `extract_ssl_feats.py` script), and the LLM is trained on the frozen SSL representations. This mode is faster and more efficient to train, but at the cost of flexibility, since the SSL model stays frozen.
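The joint-projection mode can be sketched as follows. This is a minimal illustration, assuming PyTorch and made-up dimensions; the actual recipe uses a pre-trained SSL encoder (e.g. WavLM Large) and an LLM (e.g. Llama 3.2 1B), not the random tensors shown here:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 1024-dim SSL features (e.g. WavLM Large)
# projected into a 2048-dim LLM embedding space (e.g. Llama 3.2 1B).
SSL_DIM, LLM_DIM = 1024, 2048

# Linear projection from the SSL feature space to the LLM embedding space.
projector = nn.Linear(SSL_DIM, LLM_DIM)

# Stand-in for features produced by the SSL model:
# a batch of 2 utterances, 50 frames each, SSL_DIM features per frame.
ssl_feats = torch.randn(2, 50, SSL_DIM)

# The projected features can then be concatenated with the LLM's text
# embeddings and fed through the LLM; here we only show the projection.
llm_inputs = projector(ssl_feats)
print(llm_inputs.shape)  # torch.Size([2, 50, 2048])
```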

| Release | Model | Hyperparams file | Dev Clean WER | Dev Other WER | Test Clean WER | Test Other WER | HuggingFace link | Model link | GPUs |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 29-01-26 | WavLM Large + LLama 3.2 1B + LoRA | speechllm_e2e.yaml | 2.79 | 5.03 | 2.72 | 5.34 | HuggingFace | - | 1xA100 80GB |

## Whisper Fine-tuning Results

The following table contains Whisper fine-tuning results for 1 epoch, with the encoder frozen and only the decoder fine-tuned.

| Release | Model | Commit hash | Hyperparams file | LM | Dev Clean WER | Test Clean WER | Test Other WER | HuggingFace link | Model link | GPUs |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 2024-03-28 | large-v3 | e4e2e13 | train_hf_whisper.yaml | No | 2.00% | 1.96% | 4.30% | Not Avail. | DropBox | 2xV100S 32GB |
| 2024-03-28 | medium.en | e4e2e13 | train_hf_whisper.yaml | No | 2.35% | 2.40% | 5.59% | Not Avail. | DropBox | 2xV100S 32GB |
| 2024-07-20 | small.en | 9864011 | train_whisper_lora.yaml | No | 2.81% | 2.90% | 6.57% | Not Avail. | DropBox | 1x1080Ti 12GB |
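The frozen-encoder setup described above can be sketched as follows. This is a toy PyTorch illustration, not the recipe's actual code: the recipe loads a pre-trained Whisper model from HuggingFace, while here a dummy encoder-decoder module stands in for it:

```python
import torch.nn as nn

# Toy encoder-decoder model standing in for Whisper.
class ToySeq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(80, 256)    # stand-in for the audio encoder
        self.decoder = nn.Linear(256, 1000)  # stand-in for the text decoder

model = ToySeq2Seq()

# Freeze the encoder: its parameters are excluded from gradient updates.
for p in model.encoder.parameters():
    p.requires_grad = False

# Only the decoder parameters remain trainable and receive updates.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['decoder.weight', 'decoder.bias']
```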

## Transformers

| Release | Hyperparams file | Dev Clean WER (No LM, small beam) | Test Clean WER (Transformer LM) | Test Other WER (Transformer LM) | HuggingFace link | Model link | GPUs |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 30-09-24 | conformer_large.yaml (new RoPE version) | 1.85 with LM | 1.96 | 4.50 | Not Avail. | Not Avail. | 4xA40 46GB |
| 23-05-23 | branchformer_large.yaml | 2.72 (1.9 with LM) | 2.04 | 4.13 | Not Avail. | DropBox | 4xA100 80GB |
| 10-02-25 | conformer_large.yaml | 1.85 with LM | 1.97 | 4.50 | N/A | N/A | 4xA100 80GB |
| 23-05-23 | conformer_large.yaml | 2.62 (1.9 with LM) | 2.01 | 4.52 | HuggingFace | DropBox | 4xA100 80GB |
| 24-03-22 | transformer.yaml | 3.32 | 2.27 | 5.53 | HuggingFace | DropBox | 4xV100 32GB |
| 24-03-22 | conformer_small.yaml (only 13.3M parameters) | 4.05 | 2.49 | 6.1 | HuggingFace | DropBox | 1xV100 32GB |
| 27-03-23 | hyperconformer_8M.yaml (only 7.9M parameters) | 4.69 | 2.55 | 6.61 | Not Avail. | DropBox | 1xP40 24GB |
| 27-03-23 | hyperconformer_22M.yaml (only 21.7M parameters) | 3.19 | 2.23 | 5.54 | Not Avail. | DropBox | 1xP40 24GB |
| 03-09-23 | hyperbranchformer_13M.yaml | NA | 2.54 | 6.58 | Not Avail. | Not Avail. | 1xP40 24GB |
| 03-09-23 | hyperbranchformer_25M.yaml | NA | 2.36 | 5.89 | Not Avail. | Not Avail. | 1xP40 24GB |
| 05-01-24 | bayesspeech.yaml | 4.28 | 2.84 | 6.27 | Not Avail. | DropBox | 1xV100 32GB |
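The WER figures in the tables above are word error rates: the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal sketch of the computation (not the recipe's own scoring code, which handles alignment details and corpus-level aggregation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One deletion over six reference words -> 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```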

# About SpeechBrain

# Citing SpeechBrain

Please cite SpeechBrain if you use it for your research or business.

```bibtex
@misc{speechbrainV1,
  title={Open-Source Conversational AI with SpeechBrain 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}

@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```