
Commit f7ede69 (1 parent: d086cde)

Update CommonVoice transformer recipes (code from Samsung AI Center Cambridge) (#2465)

* Update CV transformer recipes to match latest results with conformer.

Authored by: Titouan Parcollet, Mirco Ravanelli, Adel Moumen
Co-authored-by: Titouan Parcollet/Embedded AI /SRUK/Engineer/Samsung Electronics <t.parcollet@sruk-ccn4.eu.corp.samsungelectronics.net>
Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>
Co-authored-by: Adel Moumen <adelmoumen.pro@gmail.com>

7 files changed: 273 additions, 677 deletions


recipes/CommonVoice/ASR/transformer/README.md (6 additions, 7 deletions)
@@ -21,9 +21,8 @@ It is important to note that CommonVoice initially offers mp3 audio files at 42H
 # Languages
 Here is a list of the different languages that we tested within the CommonVoice dataset
 with our transformers:
-- French
 - Italian
-- German
+- French
 
 For Whisper-large-v2 and medium finetuning, here is list of the different language that we tested within the CommonVoice.14_0 dataset:
 - Hindi
@@ -36,12 +35,12 @@ For Whisper-large-v2 and medium finetuning, here is list of the different langua
 
 
 # Results
-
-| Language | Release | hyperparams file | LM | Val. CER | Val. WER | Test CER | Test WER | Hugging Face link | Model link | GPUs |
+# Transformer Results:
+| Language | CV version | hyperparams file | Flags | LM | Val. CER | Val. WER | Test CER | Test WER | Hugging Face link | Model link | GPUs |
 | ------------- |:-------------:|:---------------------------:| -----:| -----:| -----:| -----:| -----:|:-----------:| :-----------:| :-----------:|
-| French | 2023-08-15 | train_fr.yaml | No | 5.41 | 16.00 | 5.41 | 17.61 | - | [model](https://www.dropbox.com/sh/zvu9h9pctksnuvp/AAD1kyS3-N0YtmcoMgjM-_Tba?dl=0) | 1xV100 32GB |
-| Italian | 2023-08-15 | train_it.yaml | No | 3.72 | 16.31 | 4.01 | 16.80 | - | [model](https://www.dropbox.com/sh/yy8du12jgbkm3qe/AACBHhTCM-cU-oGvAKJ9kTtaa?dl=0) | 1xV100 32GB |
-| German | 2023-08-15 | train_de.yaml | No | 3.60 | 15.33 | 4.22 | 16.76 |- | [model](https://www.dropbox.com/sh/umfq986o3d9o1px/AAARNF2BFYELOWx3xhIOEoZka?dl=0) | 1xV100 32GB |
+| Italian | 14.0 | conformer_large.yaml | No | 2.91 | 9.79 | 2.68 | 9.27 | - | [model](https://www.dropbox.com/scl/fo/tf44itp8f4icf2z5qlxpm/AIOYS_CMov5ss5Q9AonFEno?rlkey=xek5ikbhqoovcao31iniqimrr&dl=0) | 2xV100 32GB |
+| French | 14.0 | conformer_large.yaml | No | 2.64 | 7.62 | 3.55 | 9.48 | - | [model](https://www.dropbox.com/scl/fo/y862nl95zoe4sj3347095/ACxmT3_uw1ScLoYs0DSbGRM?rlkey=q66dk13w5nu1lkphtdinnnigm&dl=0) | 2xV100 32GB |
+
 
 ## Whisper Finetuning Result:
 Following table contains whisper-finetuning results for 1 epoch using whisper_medium model, freezing encoder and finetuning decoder.

recipes/CommonVoice/ASR/transformer/hparams/train_de.yaml renamed to recipes/CommonVoice/ASR/transformer/hparams/conformer_large.yaml (68 additions, 43 deletions)
@@ -7,11 +7,10 @@
 # Authors: Titouan Parcollet and Jianyuan Zhong
 # ############################################################################
 # Seed needs to be set at top of yaml, before objects with parameters are made
-seed: 1234
+seed: 3407
 __set_seed: !apply:torch.manual_seed [!ref <seed>]
-output_folder: !ref results/transformer_de/<seed>
-test_wer_file: !ref <output_folder>/wer_test.txt
-valid_wer_file: !ref <output_folder>/wer_valid.txt
+output_folder: !ref results/conformer_en/<seed>
+output_wer_folder: !ref <output_folder>/
 save_folder: !ref <output_folder>/save
 train_log: !ref <output_folder>/train_log.txt

@@ -20,12 +19,13 @@ data_folder: !PLACEHOLDER # e.g, /localscratch/cv-corpus-5.1-2020-06-22/fr
 train_tsv_file: !ref <data_folder>/train.tsv # Standard CommonVoice .tsv files
 dev_tsv_file: !ref <data_folder>/dev.tsv # Standard CommonVoice .tsv files
 test_tsv_file: !ref <data_folder>/test.tsv # Standard CommonVoice .tsv files
-accented_letters: True
-language: de # use 'it' for Italian, 'rw' for Kinyarwanda, 'en' for english
-train_csv: !ref <save_folder>/train.csv
-valid_csv: !ref <save_folder>/dev.csv
-test_csv: !ref <save_folder>/test.csv
+accented_letters: False
+language: en # use 'it' for Italian, 'rw' for Kinyarwanda, 'en' for english
+train_csv: !ref <output_folder>/train.csv
+valid_csv: !ref <output_folder>/dev.csv
+test_csv: !ref <output_folder>/test.csv
 skip_prep: False # Skip data preparation
+convert_to_wav: False # Switch this to True to convert all mp3 files to wav.
 
 # We remove utterance slonger than 10s in the train/dev/test sets as
 # longer sentences certainly correspond to "open microphones".
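The new `convert_to_wav` and CSV options above feed the recipe's data preparation, which turns the standard CommonVoice `.tsv` manifests into CSV files. A minimal standard-library sketch of that parsing step (the helper name is ours; `path` and `sentence` are standard CommonVoice columns):

```python
import csv
import io

def parse_cv_tsv(tsv_text):
    """Read a CommonVoice-style .tsv and keep the fields the recipe needs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [{"path": row["path"], "sentence": row["sentence"]} for row in reader]

# A tiny two-line manifest in the CommonVoice .tsv layout:
tsv = "client_id\tpath\tsentence\nabc\tclip1.mp3\thello world\n"
rows = parse_cv_tsv(tsv)
```

The real preparation additionally filters long utterances and optionally converts mp3 to wav.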
@@ -40,12 +40,14 @@ ctc_weight: 0.3
 grad_accumulation_factor: 4
 loss_reduction: 'batchmean'
 sorting: random
+num_workers: 4
 precision: fp32 # bf16, fp16 or fp32
 
 # stages related parameters
-stage_one_epochs: 40
-lr_adam: 1.0
-lr_sgd: 0.00003
+lr_adam: 0.0008
+weight_decay: 0.01
+warmup_steps: 25000
+augment_warmup: 8000
 
 # BPE parameters
 token_type: unigram # ["unigram", "bpe", "char"]
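The retained `grad_accumulation_factor: 4` means the optimizer steps once per four micro-batches, so with `loss_reduction: 'batchmean'` each update approximates one four-times-larger batch. A toy sketch with scalar gradients standing in for tensors:

```python
def accumulated_grad(micro_batch_grads, accumulation_factor):
    """Average per-micro-batch mean gradients, mimicking the gradient of one
    large batch under 'batchmean' loss reduction."""
    assert len(micro_batch_grads) == accumulation_factor
    return sum(micro_batch_grads) / accumulation_factor

# Four micro-batch mean gradients accumulated into a single update direction:
g = accumulated_grad([1.0, 2.0, 3.0, 2.0], accumulation_factor=4)
```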
@@ -56,30 +58,53 @@ sample_rate: 16000
 n_fft: 400
 n_mels: 80
 
+# This setup works well for A100 80GB GPU, adapts it to your needs.
+# Or turn it off (but training speed will decrease)
+dynamic_batching: True
+max_batch_length_train: 500
+max_batch_length_val: 100 # we reduce it as the beam is much wider (VRAM)
+num_bucket: 200
+shuffle: True # if true re-creates batches at each epoch shuffling examples.
+batch_ordering: random
+max_batch_ex: 256
+
+dynamic_batch_sampler_train:
+    max_batch_length: !ref <max_batch_length_train>
+    num_buckets: !ref <num_bucket>
+    shuffle: !ref <shuffle>
+    batch_ordering: !ref <batch_ordering>
+    max_batch_ex: !ref <max_batch_ex>
+
+dynamic_batch_sampler_valid:
+    max_batch_length: !ref <max_batch_length_val>
+    num_buckets: !ref <num_bucket>
+    shuffle: !ref <shuffle>
+    batch_ordering: !ref <batch_ordering>
+    max_batch_ex: !ref <max_batch_ex>
+
 # Dataloader options
 train_dataloader_opts:
     batch_size: !ref <batch_size>
     shuffle: True
-    num_workers: 6
+    num_workers: !ref <num_workers>
 
 valid_dataloader_opts:
-    batch_size: !ref <batch_size>
-    num_workers: 6
+    batch_size: 1
 
 test_dataloader_opts:
-    batch_size: !ref <batch_size>
-    num_workers: 6
+    batch_size: 1
+
 
 ####################### Model Parameters ###########################
 # Transformer
-d_model: 768
+d_model: 512
 nhead: 8
 num_encoder_layers: 12
 num_decoder_layers: 6
-d_ffn: 3072
-transformer_dropout: 0.0
+d_ffn: 2048
+transformer_dropout: 0.1
 activation: !name:torch.nn.GELU
-output_neurons: 500
+output_neurons: 5120
 
 # Outputs
 blank_index: 0
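The added dynamic-batching block replaces fixed-size batches with batches capped by total audio duration (`max_batch_length_train: 500`). SpeechBrain's `DynamicBatchSampler` also buckets utterances by length (`num_bucket: 200`) and can reshuffle each epoch; the greedy packing idea alone can be sketched as (function name is ours):

```python
def make_dynamic_batches(durations, max_batch_length, max_batch_ex):
    """Greedy sketch of dynamic batching: pack utterances (durations in
    seconds) into batches whose total duration stays within max_batch_length,
    with at most max_batch_ex examples per batch."""
    batches, current, total = [], [], 0.0
    for idx, dur in enumerate(durations):
        if current and (total + dur > max_batch_length or len(current) >= max_batch_ex):
            batches.append(current)  # close the batch before it overflows
            current, total = [], 0.0
        current.append(idx)
        total += dur
    if current:
        batches.append(current)
    return batches

# Five utterances packed under an 8-second budget:
batches = make_dynamic_batches([4.0, 3.0, 5.0, 6.0, 2.0],
                               max_batch_length=8.0, max_batch_ex=256)
```

Packing by duration keeps padding low, which is why the comment notes a speed penalty when it is turned off.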
@@ -91,8 +116,8 @@ eos_index: 2
 # Decoding parameters
 min_decode_ratio: 0.0
 max_decode_ratio: 1.0
-valid_search_interval: 5
-valid_beam_size: 10
+valid_search_interval: 10
+valid_beam_size: 1 # We do greedy here so it's faster to decode ...
 test_beam_size: 80
 ctc_weight_decode: 0.3
 scorer_beam_scale: 0.3
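Setting `valid_beam_size: 1` turns validation decoding into greedy search: at each step only the single best-scoring token is kept instead of a beam of hypotheses, which is why it is faster. A toy illustration:

```python
def greedy_decode(step_scores):
    """Beam size 1 amounts to greedy search: take the argmax token at each
    decoding step instead of tracking multiple hypotheses."""
    return [max(range(len(scores)), key=scores.__getitem__)
            for scores in step_scores]

# Toy per-step scores over a 3-token vocabulary:
tokens = greedy_decode([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])
```

The wider `test_beam_size: 80` is kept for final evaluation, where accuracy matters more than speed.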
@@ -101,24 +126,28 @@ scorer_beam_scale: 0.3
 
 CNN: !new:speechbrain.lobes.models.convolution.ConvolutionFrontEnd
     input_shape: (8, 10, 80)
-    num_blocks: 3
+    num_blocks: 2
     num_layers_per_block: 1
-    out_channels: (128, 200, 256)
-    kernel_sizes: (3, 3, 1)
-    strides: (2, 2, 1)
-    residuals: (False, False, False)
+    out_channels: (64, 32)
+    kernel_sizes: (3, 3)
+    strides: (2, 2)
+    residuals: (False, False)
 
 Transformer: !new:speechbrain.lobes.models.transformer.TransformerASR.TransformerASR # yamllint disable-line rule:line-length
-    input_size: 5120
+    input_size: 640
     tgt_vocab: !ref <output_neurons>
     d_model: !ref <d_model>
     nhead: !ref <nhead>
     num_encoder_layers: !ref <num_encoder_layers>
     num_decoder_layers: !ref <num_decoder_layers>
     d_ffn: !ref <d_ffn>
     dropout: !ref <transformer_dropout>
+    conformer_activation: !ref <activation>
     activation: !ref <activation>
-    normalize_before: False
+    encoder_module: conformer
+    attention_type: RelPosMHAXL
+    normalize_before: True
+    causal: False
 
 ctc_lin: !new:speechbrain.nnet.linear.Linear
     input_size: !ref <d_model>
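The new `input_size: 640` follows from the reworked front end: two stride-2 blocks divide the 80 mel bins by 4, and the flattened output multiplies the remaining 20 bins by the last block's 32 channels. The same arithmetic gave 20 × 256 = 5120 for the old three-block front end. A quick check (assuming the usual channels-times-reduced-features flattening):

```python
def frontend_output_size(n_mels, strides, last_out_channels):
    """Flattened feature size after the conv front end: each block's stride
    divides the mel axis, and the result is flattened with the channels."""
    reduced = n_mels
    for s in strides:
        reduced //= s  # integer downsampling per block
    return reduced * last_out_channels

# New conformer front end (this commit) vs. the old transformer front end:
new_size = frontend_output_size(n_mels=80, strides=(2, 2), last_out_channels=32)
old_size = frontend_output_size(n_mels=80, strides=(2, 2, 1), last_out_channels=256)
```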
@@ -138,15 +167,9 @@ model: !new:torch.nn.ModuleList
     - [!ref <CNN>, !ref <Transformer>, !ref <seq_lin>, !ref <ctc_lin>]
 
 # We define two optimizers as we have two stages (training + finetuning)
-Adam: !name:torch.optim.Adam
+Adam: !name:torch.optim.AdamW
     lr: !ref <lr_adam>
-    betas: (0.9, 0.98)
-    eps: 0.000000001
-
-SGD: !name:torch.optim.SGD
-    lr: !ref <lr_sgd>
-    momentum: 0.99
-    nesterov: True
+    weight_decay: !ref <weight_decay>
 
 # Scorer
 ctc_scorer: !new:speechbrain.decoders.scorer.CTCScorer
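Switching `Adam` to `torch.optim.AdamW` (with the new `weight_decay: 0.01`) decouples weight decay from the gradient: the decay shrinks the parameter directly instead of being folded into the gradient before Adam's adaptive rescaling. One scalar update step, with `grad_step` standing in for the Adam-preconditioned direction:

```python
def adamw_style_update(param, grad_step, lr, weight_decay):
    """One decoupled-weight-decay step (AdamW): the lr * wd * param term is
    subtracted from the parameter itself, not added to the gradient."""
    return param - lr * grad_step - lr * weight_decay * param

p = adamw_style_update(param=1.0, grad_step=0.5, lr=0.1, weight_decay=0.01)
```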
@@ -195,8 +218,7 @@ seq_cost: !name:speechbrain.nnet.losses.kldiv_loss
 
 noam_annealing: !new:speechbrain.nnet.schedulers.NoamScheduler
     lr_initial: !ref <lr_adam>
-    n_warmup_steps: 25000
-    model_size: !ref <d_model>
+    n_warmup_steps: !ref <warmup_steps>
 
 checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
     checkpoints_dir: !ref <save_folder>
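Dropping `model_size` from the Noam scheduler makes `lr_adam: 0.0008` an explicit peak learning rate: the schedule warms up linearly over `warmup_steps` and then decays as the inverse square root of the step. A sketch of that common formulation (SpeechBrain's exact scaling may differ slightly):

```python
def noam_lr(step, lr_initial, n_warmup_steps):
    """Noam-style schedule without a model_size term: linear warmup reaching
    lr_initial at step == n_warmup_steps, then inverse-sqrt decay."""
    scale = n_warmup_steps ** 0.5  # normalizes the peak to lr_initial
    return lr_initial * scale * min(step ** -0.5, step * n_warmup_steps ** -1.5)

peak = noam_lr(step=25000, lr_initial=0.0008, n_warmup_steps=25000)
early = noam_lr(step=1, lr_initial=0.0008, n_warmup_steps=25000)
```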
@@ -211,23 +233,26 @@ epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
 
 normalize: !new:speechbrain.processing.features.InputNormalization
     norm_type: global
-    update_until_epoch: 3
+    update_until_epoch: 4
 
 ############################## Augmentations ###################################
 
 # Time Drop
 time_drop: !new:speechbrain.augment.freq_domain.SpectrogramDrop
     drop_length_low: 15
     drop_length_high: 25
-    drop_count_low: 5
-    drop_count_high: 5
+    drop_count_low: 3
+    drop_count_high: 3
+    replace: "zeros"
+    dim: 1
 
 # Frequency Drop
 freq_drop: !new:speechbrain.augment.freq_domain.SpectrogramDrop
     drop_length_low: 25
     drop_length_high: 35
     drop_count_low: 2
     drop_count_high: 2
+    replace: "zeros"
     dim: 2
 
 # Time warp
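The updated `SpectrogramDrop` settings zero out 3 time bands of 15-25 frames (`dim: 1`) and 2 frequency bands of 25-35 bins (`dim: 2`). A list-based sketch of the masking (SpeechBrain operates on batched tensors; this toy version assumes a `[time][freq]` layout):

```python
import random

def spectrogram_drop(spec, drop_count, length_low, length_high, dim, seed=0):
    """Zero out drop_count random bands along dim (1 = time rows, 2 =
    frequency columns) of a [time][freq] spectrogram; band widths are drawn
    uniformly from [length_low, length_high]."""
    rng = random.Random(seed)
    n_time, n_freq = len(spec), len(spec[0])
    axis_len = n_time if dim == 1 else n_freq
    out = [row[:] for row in spec]  # copy so the input is untouched
    for _ in range(drop_count):
        width = rng.randint(length_low, length_high)
        start = rng.randrange(axis_len - width)
        for t in range(n_time):
            for f in range(n_freq):
                pos = t if dim == 1 else f
                if start <= pos < start + width:
                    out[t][f] = 0.0
    return out

# 50 time frames x 8 mel bins of ones; drop 3 time bands of 15-25 frames.
dropped = spectrogram_drop([[1.0] * 8 for _ in range(50)],
                           drop_count=3, length_low=15, length_high=25, dim=1)
```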
