Add Marathi LREC2020 ASR recipe (ESPnet bootcamp)#6274
sw005320 merged 7 commits into espnet:master
Conversation
for more information, see https://pre-commit.ci
Code Review
This PR adds a new ESPnet2 recipe for the Marathi LREC2020 dataset. The overall structure and implementation follow the standard ESPnet recipe format. I've found one configuration issue that needs to be addressed to ensure the training runs with the intended hyperparameters.
```yaml
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false

# Optimizer
optim: adam
optim_conf:
    lr: 0.0005
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 20000

# SpecAugment
specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range: [0, 30]
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_range: [0, 40]
    num_time_mask: 2

# Reporting
model_conf:
    ctc_weight: 0.3  # hybrid CTC/attention (default)
    report_cer: true
    report_wer: true
```
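For intuition on what the SpecAugment parameters above control, a rough NumPy sketch of frequency masking (illustrative only — not ESPnet's actual implementation; the function name and defaults are hypothetical, mirroring `freq_mask_width_range` and `num_freq_mask`):

```python
import numpy as np

def freq_mask(spec, width_range=(0, 30), num_masks=2, rng=None):
    """Zero out `num_masks` random frequency bands of a (time, freq) feature
    matrix, as configured by freq_mask_width_range / num_freq_mask above."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq = out.shape[1]
    for _ in range(num_masks):
        width = int(rng.integers(width_range[0], width_range[1] + 1))
        start = int(rng.integers(0, max(1, n_freq - width + 1)))
        out[:, start:start + width] = 0.0
    return out
```

Time masking works the same way along the first axis, with `time_mask_width_range` bounding the mask width.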
The model_conf key is defined twice in this configuration file (lines 47 and 74). In YAML, this will result in the second definition overwriting the first, causing the loss of lsm_weight and length_normalized_loss settings. These configurations should be merged into a single model_conf block to ensure all settings are applied correctly.
```yaml
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
    report_cer: true
    report_wer: true

# Optimizer
optim: adam
optim_conf:
    lr: 0.0005
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 20000

# SpecAugment
specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range: [0, 30]
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_range: [0, 40]
    num_time_mask: 2
```
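To see why the duplicate key matters: most YAML loaders silently keep only the last occurrence of a repeated mapping key. A minimal demonstration (assuming PyYAML is installed; some stricter loaders reject duplicates instead):

```python
import yaml

# Two model_conf blocks in one document: PyYAML keeps only the last one,
# so lsm_weight and length_normalized_loss silently disappear.
doc = """
model_conf:
  ctc_weight: 0.3
  lsm_weight: 0.1
  length_normalized_loss: false
model_conf:
  ctc_weight: 0.3
  report_cer: true
  report_wer: true
"""
cfg = yaml.safe_load(doc)
print(sorted(cfg["model_conf"]))  # lsm_weight is gone
```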
Hello @sw005320, I'd love your feedback on how I can further tune this recipe. Do you have suggestions?

Also, I had a small doubt: I'm happy to run more experiments on Delta and update the PR.
First, the learning rate: monitor the learning curve and decide for yourself. Also, check the other recipes using xlsr.
If you use a config with a similar amount of training data to the other recipes, you don't need to change the architecture. Please focus on the optimization hyperparameters.
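For reference when tuning `lr` against `warmup_steps`: ESPnet's `warmuplr` follows a Noam-style schedule, roughly as sketched below (a hedged approximation; the exact scaling in ESPnet may differ by version):

```python
def warmup_lr(base_lr: float, step: int, warmup_steps: int = 20000) -> float:
    # Linear ramp up to warmup_steps, then inverse-square-root decay;
    # the effective rate peaks at base_lr when step == warmup_steps.
    return base_lr * warmup_steps**0.5 * min(step**-0.5, step * warmup_steps**-1.5)
```

Under this schedule, raising `lr` and shortening `warmup_steps` both raise the peak rate and move it earlier, which is why the two should be tuned together.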
Good catch.

I want to make sure that you use a single GPU.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff            @@
##           master    #6274     +/-   ##
=========================================
+ Coverage        0   70.16%   +70.16%
=========================================
  Files           0      787      +787
  Lines           0    73367    +73367
=========================================
+ Hits            0    51477    +51477
- Misses          0    21890    +21890
```

Flags with carried forward coverage won't be shown.
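As a quick sanity check, the reported percentage is just hits over lines:

```python
hits, lines = 51477, 73367
coverage = round(100 * hits / lines, 2)
print(coverage)  # 70.16
```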
ftshijt left a comment:
Please also add the data entry at egs2/README.md
> - **XLSR + conformer (BPE and Char)** underperformed in this setup, likely due to limited fine-tuning (also, sub-sampling conv2d was disabled; I used linear for this).
> All the above training was done without any LM model.
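On the XLSR setup: ESPnet2 recipes typically wire XLSR through the s3prl frontend plus a linear preencoder in place of conv2d sub-sampling. A hedged sketch based on other recipes — the upstream name, sizes, and key nesting are assumptions and may differ by version:

```yaml
frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: wav2vec2_xlsr   # upstream name is an assumption; check s3prl's model list
    download_dir: ./hub
    multilayer_feature: true

preencoder: linear
preencoder_conf:
    input_size: 1024   # XLSR hidden size (assumed)
    output_size: 80
```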
In these cases, we may not need to include the LM-related YAML in conf/tuning.
```
# Automatic Speech Recognition (Multi-tasking)
```
I do not see the necessity of this readme. If there is no additional reason, I would recommend we safely remove this readme.
```yaml
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.5
lm_weight: 0.3
```
We can set lm_weight to 0, given that no LM is included.
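For context, the decoding weights combine per-hypothesis scores roughly as below; with lm_weight at 0.0, the LM term vanishes entirely (a simplified sketch of hybrid CTC/attention one-pass scoring, not ESPnet's exact code):

```python
def joint_score(att_score, ctc_score, lm_score,
                ctc_weight=0.5, lm_weight=0.0):
    # Log-domain combination used in hybrid CTC/attention decoding;
    # lm_weight=0.0 makes the LM score irrelevant to the ranking.
    return (1 - ctc_weight) * att_score + ctc_weight * ctc_score + lm_weight * lm_score
```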
```yaml
batch_size: 16
```
I do not think we support a batch size of 16 now. Please double-check.
```yaml
# ==========================================
```
Since the transducer is not included, please remove these settings.
Hi @sw005320, I needed your opinion and guidance on this. I've added everything I tried over these weeks in detail.

**Overall Performance Across Learning Rates**

**CTC and Attention Loss Trends (Epoch 6–10)**

It seemed that 0.0004 gave the most stable training curve and the lowest WER/CER overall. Last week, I also tried Macaron with the lower lr for the first 5 epochs. I observed that the loss_ctc and loss_att values were at the lower end compared to macaron: off, but they were not going down as the epochs increased.

**Macaron Results (No SpecAug, No LM)**

These runs used the Macaron-style Conformer variant (same settings, lower learning rates).

**Macaron CTC and Attention Loss Trends (Epoch 1–5)**
I've also added the LR results where I went above 0.0005.
**CTC and Attention Loss (High Learning Rates, Epoch 1–5)**

The 5th epoch for 0.0030 was incomplete, so I averaged the values from the mini-batches of that epoch (hence "avg train batch").

**Current vs. New Proposed Model Configurations**

Previous (Baseline):

New Proposed Model:

**Changes Made (Need Guidance On)**

Would you recommend keeping this configuration, or should I change something? Answers to the above queries would help me a lot in deciding the final new architecture before training it. Thanks again for your guidance.
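For reading the loss_ctc / loss_att trends side by side: the training objective interpolates the two with ctc_weight (a minimal sketch of the multi-task loss, assuming the ctc_weight: 0.3 from the config above):

```python
def hybrid_loss(loss_ctc: float, loss_att: float, ctc_weight: float = 0.3) -> float:
    # L = w * L_ctc + (1 - w) * L_att, the hybrid CTC/attention objective;
    # with w = 0.3, attention loss dominates the total.
    return ctc_weight * loss_ctc + (1 - ctc_weight) * loss_att
```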
The XLSR results seem worse due to a lack of tuning.
for more information, see https://pre-commit.ci
Thanks, @Aniket-Tathe!
Summary
This PR adds a new ESPnet2 recipe for the Marathi LREC2020 dataset as part of the WavLab Bootcamp.
Details

conf/, local/, and symbolic links to template scripts

Results
Notes
Dataset Reference