Commit 96b00ca

SpeechLLM (with LLaMA) and Conformer recipe for speech translation on CoVoST (Code from Samsung AI Center Cambridge) (#2865)
Co-authored-by: Adel Moumen <adelmoumen.pro@gmail.com>
Co-authored-by: Adel Moumen <88119391+Adel-Moumen@users.noreply.github.com>
1 parent 33474cb commit 96b00ca

14 files changed

Lines changed: 2156 additions & 3 deletions

File tree

recipes/CoVoST/AST/README.md

Lines changed: 44 additions & 0 deletions
# CoVoST speech-to-text translation

This folder contains the scripts necessary to run automatic speech translation on the [CoVoST dataset](https://github.com/facebookresearch/covost), which is based on [CommonVoice](https://commonvoice.mozilla.org/en/datasets).

Two approaches are available:
1. Training from scratch with a Conformer encoder-decoder model and multitask training (speech recognition plus speech translation).
2. SpeechLLM fine-tuning based on SSL speech encoders and LLaMA large language models (with and without adapters).

# How to run
```shell
python train{_xlsr_llama}.py hparams/{hparam_file}.yaml
```

# Data preparation
Note that CommonVoice distributes its audio as mp3 files. These can be converted to .wav during data preparation: this speeds up training, but makes the first data preparation run quite slow. Audio files are downsampled on the fly within the dataio function of the training script.

# Languages
While CoVoST offers multiple language pairs, this recipe was only tested on English-to-German translation. However, nothing special is required to select another language pair, aside from adding proper text normalisation in the `covost_prepare.py` file.

# Results
| Language | hyperparams file | Encoder | LLM | Test BLEU | Hugging Face link | Model link | GPUs |
| ------------- |:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|
| English - German | conformer.yaml | conformer | None | 13.9 | None | None | 2x A40 |
| English - German | w2v2_llama3.yaml | wavlm-large | LLaMA 3.1 7B | 27.2 | None | None | 2x A100 |

# **About SpeechBrain**
- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/
- HuggingFace: https://huggingface.co/speechbrain/

# **Citing SpeechBrain**
Please cite SpeechBrain if you use it for your research or business.

```bibtex
@misc{speechbrainV1,
  title={Open-Source Conversational AI with SpeechBrain 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
```
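The Test BLEU figures in this recipe are computed with `sacrebleu`. As a rough illustration of what the metric measures, here is a minimal, self-contained sketch of corpus-level BLEU (clipped n-gram precision plus a brevity penalty); the function names are ours, and a real evaluation should use `sacrebleu` itself rather than this simplification.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(hypotheses, references, max_n=4):
    """Minimal corpus BLEU: clipped n-gram precisions + brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp, ref = hyp.split(), ref.split()
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = Counter(ngrams(hyp, n))
            ref_counts = Counter(ngrams(ref, n))
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += max(len(hyp) - n + 1, 0)
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: punish hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)

print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))  # → 1.0
```

Note that `sacrebleu` also applies its own tokenization and reports scores scaled to 0-100, so its numbers are not directly comparable with this toy function.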
Lines changed: 1 addition & 0 deletions
../covost_prepare.py

Lines changed: 1 addition & 0 deletions
sacrebleu

Lines changed: 268 additions & 0 deletions
# ############################################################################
# Model: E2E AST with Transformer
# Encoder: Conformer Encoder
# Decoder: Transformer Decoder
# Tokens: unigram
# Losses: CTC + KLdiv (label smoothing loss)
# Authors: Titouan Parcollet
# ############################################################################

# Seed needs to be set at the top of the yaml, before objects with parameters are made
seed: 3407
__set_seed: !apply:speechbrain.utils.seed_everything [!ref <seed>]
output_folder: !ref results/conformer_en/<seed>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

# Data files
data_folder: !PLACEHOLDER # e.g., /localscratch/cv-corpus-4.0-en/fr
train_tsv_file: !PLACEHOLDER # Standard CoVoST .tsv files
dev_tsv_file: !PLACEHOLDER # Standard CoVoST .tsv files
test_tsv_file: !PLACEHOLDER # Standard CoVoST .tsv files
src_language: en
tgt_language: de
train_csv: !ref <output_folder>/train.csv
valid_csv: !ref <output_folder>/dev.csv
test_csv: !ref <output_folder>/test.csv
skip_prep: False # Skip data preparation
convert_to_wav: True # Set to True to convert all mp3 files to wav.

# We remove utterances longer than 10 s from the train/dev/test sets, as
# longer sentences most likely correspond to "open microphones".
avoid_if_longer_than: 10.0
avoid_if_shorter_than: 1.0

# THIS IS TERRIBLE, BUT WE HAVE NO CHOICE.
# Some versions of the CV dataset may contain one or two files of more than
# 2 min in the validation and/or test sets. This is an error by design of the
# dataset, as these files contain 90% silence. We exclude them.
avoid_if_longer_than_val_test: 90.0

ckpt_interval_minutes: 15 # save a checkpoint every N minutes
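The duration thresholds above amount to a simple filter applied during data preparation. Here is a minimal sketch of one plausible reading of those thresholds (the function and argument names are ours, not the recipe's; the actual filtering lives in the data preparation and dataio code):

```python
def keep_utterance(duration_s, is_train,
                   min_s=1.0, max_s=10.0, max_val_test_s=90.0):
    """Mirror the avoid_if_* thresholds: keep 1-10 s clips for training,
    and drop only the rare >90 s outliers from dev/test."""
    if is_train:
        return min_s <= duration_s <= max_s
    return duration_s <= max_val_test_s

# Example: an 11 s clip is dropped from train but kept in dev/test.
print(keep_utterance(11.0, is_train=True), keep_utterance(11.0, is_train=False))  # → False True
```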
####################### Training Parameters ####################################
number_of_epochs: 200
optimizer_step_limit: 150000
batch_size: 32 # Only used if dynamic batching is disabled.
ctc_weight: 0.3
grad_accumulation_factor: 1
loss_reduction: 'batchmean'
sorting: random
num_workers: 4
precision: fp16 # bf16, fp16 or fp32

# Stage-related parameters
lr_adam: 0.0008
weight_decay: 0.01
asr_warmup_steps: !ref <optimizer_step_limit>
warmup_steps: 20000
augment_warmup: 25000

# BPE parameters
token_type: unigram # ["unigram", "bpe", "char"]
character_coverage: 1.0

# Feature parameters
sample_rate: 16000
n_fft: 400
n_mels: 80
# This setup works well on an A40 46GB GPU; adapt it to your needs,
# or turn dynamic batching off (but training speed will decrease).
dynamic_batching: True
max_batch_length_train: 300
max_batch_length_val: 300
num_bucket: 200
shuffle: True # if True, re-creates batches at each epoch, shuffling examples.
batch_ordering: random
max_batch_ex: 256

dynamic_batch_sampler_train:
  max_batch_length: !ref <max_batch_length_train>
  num_buckets: !ref <num_bucket>
  shuffle: !ref <shuffle>
  batch_ordering: !ref <batch_ordering>
  max_batch_ex: !ref <max_batch_ex>

dynamic_batch_sampler_valid:
  max_batch_length: !ref <max_batch_length_val>
  num_buckets: !ref <num_bucket>
  shuffle: !ref <shuffle>
  batch_ordering: !ref <batch_ordering>
  max_batch_ex: !ref <max_batch_ex>

# Dataloader options
train_dataloader_opts:
  batch_size: !ref <batch_size>
  shuffle: True
  num_workers: !ref <num_workers>

valid_dataloader_opts:
  batch_size: 8

test_dataloader_opts:
  batch_size: 8
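Dynamic batching groups utterances of similar duration into buckets and fills each batch up to a total-duration budget, which keeps padding (and wasted GPU time) low. SpeechBrain's actual `DynamicBatchSampler` is more elaborate; the following is a minimal sketch of the idea behind the options above, with names of our own choosing:

```python
import random

def make_dynamic_batches(durations, max_batch_length=300.0, num_buckets=4,
                         max_batch_ex=256, seed=3407):
    """Group indices by duration bucket, then pack each bucket into
    batches whose total duration stays under max_batch_length."""
    rng = random.Random(seed)
    lo, hi = min(durations), max(durations)
    width = (hi - lo) / num_buckets or 1.0
    buckets = [[] for _ in range(num_buckets)]
    for idx, dur in enumerate(durations):
        b = min(int((dur - lo) / width), num_buckets - 1)
        buckets[b].append(idx)
    batches = []
    for bucket in buckets:
        rng.shuffle(bucket)
        batch, total = [], 0.0
        for idx in bucket:
            if batch and (total + durations[idx] > max_batch_length
                          or len(batch) >= max_batch_ex):
                batches.append(batch)
                batch, total = [], 0.0
            batch.append(idx)
            total += durations[idx]
        if batch:
            batches.append(batch)
    rng.shuffle(batches)  # batch_ordering: random
    return batches

durs = [1.2, 9.8, 2.0, 5.5, 4.1, 8.7, 1.5, 3.3]  # utterance durations in seconds
batches = make_dynamic_batches(durs, max_batch_length=10.0)
```

Because similar-length utterances end up in the same bucket, each batch wastes little compute on padding frames, which is why turning this off slows training.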
####################### Model Parameters ###########################
# Transformer
d_model: 256
nhead: 4
num_encoder_layers: 12
num_decoder_layers: 6
d_ffn: 2048
transformer_dropout: 0.1
activation: !name:torch.nn.GELU
output_neurons: 2048
asr_output_neurons: 1024

# Outputs
blank_index: 0
label_smoothing: 0.0
pad_index: 1
bos_index: 2
eos_index: 3

# Decoding parameters
min_decode_ratio: 0.0
max_decode_ratio: 1.0
valid_search_interval: 10
valid_beam_size: 5 # Kept small so validation decoding stays fast ...
test_beam_size: 80
############################## models ################################
133+
134+
CNN: !new:speechbrain.lobes.models.convolution.ConvolutionFrontEnd
135+
input_shape: (8, 10, 80)
136+
num_blocks: 2
137+
num_layers_per_block: 1
138+
out_channels: (64, 32)
139+
kernel_sizes: (3, 3)
140+
strides: (2, 2)
141+
residuals: (False, False)
142+
143+
Transformer: !new:speechbrain.lobes.models.transformer.TransformerASR.TransformerASR # yamllint disable-line rule:line-length
144+
input_size: 640
145+
tgt_vocab: !ref <output_neurons>
146+
d_model: !ref <d_model>
147+
nhead: !ref <nhead>
148+
num_encoder_layers: !ref <num_encoder_layers>
149+
num_decoder_layers: !ref <num_decoder_layers>
150+
d_ffn: !ref <d_ffn>
151+
dropout: !ref <transformer_dropout>
152+
conformer_activation: !ref <activation>
153+
activation: !ref <activation>
154+
encoder_module: conformer
155+
attention_type: RelPosMHAXL
156+
normalize_before: True
157+
causal: False
158+
159+
ctc_lin: !new:speechbrain.nnet.linear.Linear
160+
input_size: !ref <d_model>
161+
n_neurons: !ref <asr_output_neurons>
162+
163+
seq_lin: !new:speechbrain.nnet.linear.Linear
164+
input_size: !ref <d_model>
165+
n_neurons: !ref <output_neurons>
166+
167+
modules:
168+
CNN: !ref <CNN>
169+
Transformer: !ref <Transformer>
170+
seq_lin: !ref <seq_lin>
171+
ctc_lin: !ref <ctc_lin>
172+
173+
model: !new:torch.nn.ModuleList
174+
- [!ref <CNN>, !ref <Transformer>, !ref <seq_lin>, !ref <ctc_lin>]
175+
# We define two optimizers as we have two stages (training + finetuning)
Adam: !name:torch.optim.AdamW
  lr: !ref <lr_adam>
  weight_decay: !ref <weight_decay>

valid_search: !new:speechbrain.decoders.S2STransformerBeamSearcher
  modules: [!ref <Transformer>, !ref <seq_lin>]
  bos_index: !ref <bos_index>
  eos_index: !ref <eos_index>
  min_decode_ratio: !ref <min_decode_ratio>
  max_decode_ratio: !ref <max_decode_ratio>
  beam_size: !ref <valid_beam_size>
  using_eos_threshold: False
  length_normalization: True

test_search: !new:speechbrain.decoders.S2STransformerBeamSearcher
  modules: [!ref <Transformer>, !ref <seq_lin>]
  bos_index: !ref <bos_index>
  eos_index: !ref <eos_index>
  min_decode_ratio: !ref <min_decode_ratio>
  max_decode_ratio: !ref <max_decode_ratio>
  beam_size: !ref <test_beam_size>
  temperature: 1.15
  using_eos_threshold: True

log_softmax: !new:torch.nn.LogSoftmax
  dim: -1

ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
  blank_index: !ref <blank_index>
  reduction: !ref <loss_reduction>

seq_cost: !name:speechbrain.nnet.losses.kldiv_loss
  label_smoothing: !ref <label_smoothing>
  reduction: !ref <loss_reduction>

noam_annealing: !new:speechbrain.nnet.schedulers.NoamScheduler
  lr_initial: !ref <lr_adam>
  n_warmup_steps: !ref <warmup_steps>
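The Noam schedule used above warms the learning rate up linearly over `warmup_steps` optimizer steps and then decays it roughly as the inverse square root of the step count. A small sketch of the usual formula (SpeechBrain's `NoamScheduler` may differ in implementation details, e.g. how `lr_initial` is scaled):

```python
def noam_lr(step, lr_initial=0.0008, n_warmup_steps=20000):
    """lr(step) = lr_initial * sqrt(n_warmup) * min(step^-0.5, step * n_warmup^-1.5).
    Linear warmup reaching lr_initial at step == n_warmup_steps,
    then ~1/sqrt(step) decay."""
    scale = n_warmup_steps ** 0.5
    return lr_initial * scale * min(step ** -0.5, step * n_warmup_steps ** -1.5)

# The peak learning rate is reached exactly at the end of warmup.
peak = noam_lr(20000)
```

The warmup protects the randomly initialized Conformer from large early updates, while the slow decay keeps training stable over the 150k-step budget.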
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
  checkpoints_dir: !ref <save_folder>
  recoverables:
    model: !ref <model>
    noam_scheduler: !ref <noam_annealing>
    normalizer: !ref <normalize>
    counter: !ref <epoch_counter>

epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
  limit: !ref <number_of_epochs>

normalize: !new:speechbrain.processing.features.InputNormalization
  norm_type: sentence
############################## Augmentations ###################################

# Time drop
time_drop: !new:speechbrain.augment.freq_domain.SpectrogramDrop
  drop_length_low: 15
  drop_length_high: 25
  drop_count_low: 3
  drop_count_high: 3
  replace: "zeros"
  dim: 1

# Frequency drop
freq_drop: !new:speechbrain.augment.freq_domain.SpectrogramDrop
  drop_length_low: 25
  drop_length_high: 35
  drop_count_low: 2
  drop_count_high: 2
  replace: "zeros"
  dim: 2

fea_augment: !new:speechbrain.augment.augmenter.Augmenter
  min_augmentations: 3
  max_augmentations: 3
  augment_prob: 1.0
  augmentations: [
    !ref <time_drop>,
    !ref <freq_drop>]

compute_features: !new:speechbrain.lobes.features.Fbank
  sample_rate: !ref <sample_rate>
  n_fft: !ref <n_fft>
  n_mels: !ref <n_mels>

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
  save_file: !ref <train_log>

acc_computer: !name:speechbrain.utils.Accuracy.AccuracyStats
bleu_computer: !name:speechbrain.utils.bleu.BLEUStats
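The two drop augmentations above implement SpecAugment-style masking: `time_drop` zeroes random stripes along the time axis (dim 1) and `freq_drop` along the mel axis (dim 2). A minimal dependency-free sketch of the idea on a (time, mels) matrix of nested lists (parameter names mirror the yaml; the implementation is ours, not SpeechBrain's, which operates on batched tensors):

```python
import random

def spectrogram_drop(spec, drop_length_low, drop_length_high,
                     drop_count, dim, rng=random):
    """Zero `drop_count` random stripes of width in [low, high] along `dim`
    (1 = time frames, 2 = frequency bins) of a (time, mels) matrix in place."""
    n_time, n_mels = len(spec), len(spec[0])
    size = n_time if dim == 1 else n_mels
    for _ in range(drop_count):
        length = rng.randint(drop_length_low, drop_length_high)
        start = rng.randint(0, max(size - length, 0))
        for t in range(n_time):
            for f in range(n_mels):
                pos = t if dim == 1 else f
                if start <= pos < start + length:
                    spec[t][f] = 0.0
    return spec

rng = random.Random(0)
spec = [[1.0] * 80 for _ in range(100)]  # 100 frames x 80 mel bins
spec = spectrogram_drop(spec, 15, 25, drop_count=3, dim=1, rng=rng)
spec = spectrogram_drop(spec, 25, 35, drop_count=2, dim=2, rng=rng)
```

Masking whole time and frequency stripes forces the model to rely on surrounding context instead of any single region of the spectrogram, which is why it regularizes so effectively for speech tasks.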
