
Add Marathi LREC2020 ASR recipe (ESPnet bootcamp)#6274

Merged
sw005320 merged 7 commits into espnet:master from Aniket-Tathe:marathi_asr_bootcamp
Apr 20, 2026
Conversation

@Aniket-Tathe
Contributor

Summary

This PR adds a new ESPnet2 recipe for the Marathi LREC2020 dataset as part of the WavLab Bootcamp.

Details

  • Models: Conformer with Character, BPE-150, and BPE-2000 tokenization; XLSR-Conformer with Character and BPE-2000 tokenization
  • Framework: ESPnet2
  • Dataset: IndicCorpora Marathi subset (~109 hours)
  • Training platform: NCSA Delta (A100 GPU)
  • Includes conf/, local/, and symbolic links to template scripts
  • No LM is used for training or decoding.

Results

| Model | Token | Epochs | Train Acc (%) | Val Loss | WER (%) | CER (%) |
|---|---|---|---|---|---|---|
| Char Conformer | Char | 31 / 50 | 98.3 | 47.75 | 45.2 | 22.0 |
| BPE-150 Conformer | BPE-150 | 10 / 30 | 96.8 | 52.96 | 90.1 | 26.5 |
| BPE-2000 Conformer | BPE-2000 | 25 / 30 | 99.6 | 53.6 | 89.2 | 42.1 |
| XLSR-Conformer (BPE-2000) | BPE-2000 | 22 / 60 | 97.8 | 66.83 | 99.2 | 51.0 |
| XLSR-Conformer (Char) | Char | 13 / 60 | 82.4 | 75.15 | 78.7 | 43.3 |

Notes

  • Character model achieved the best generalization.
  • BPE models showed overfitting.
  • XLSR + Conformer underperformed due to limited fine-tuning (conv2d subsampling replaced with linear).
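For reference, the WER/CER numbers above are standard edit-distance metrics. A minimal sketch of how they are computed (not the ESPnet scoring script, just the underlying idea):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(ref, hyp):
    """Word error rate: edit distance over word sequences."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate: edit distance over character sequences."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

This also shows why character and BPE models are compared on both metrics: CER is computed on the raw character sequence regardless of the tokenization used for training.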

Dataset Reference

P. Jyothi et al., “IndicCorpora: A Large Multilingual Corpus for Indic Languages.”
IIT Bombay IndicCorpora Project – Marathi

@dosubot dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. ASR Automatic speech recogntion ESPnet2 Recipe labels Oct 25, 2025
@mergify mergify Bot added the README label Oct 25, 2025
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This PR adds a new ESPnet2 recipe for the Marathi LREC2020 dataset. The overall structure and implementation follow the standard ESPnet recipe format. I've found one configuration issue that needs to be addressed to ensure the training runs with the intended hyperparameters.

Comment on lines +47 to +77
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false

# Optimizer
optim: adam
optim_conf:
    lr: 0.0005
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 20000

# SpecAugment
specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range: [0, 30]
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_range: [0, 40]
    num_time_mask: 2

# Reporting
model_conf:
    ctc_weight: 0.3 # hybrid CTC/attention (default)
    report_cer: true
    report_wer: true
Contributor


high

The model_conf key is defined twice in this configuration file (lines 47 and 74). In YAML, this will result in the second definition overwriting the first, causing the loss of lsm_weight and length_normalized_loss settings. These configurations should be merged into a single model_conf block to ensure all settings are applied correctly.
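The overwrite is easy to demonstrate: PyYAML keeps the last duplicate key without raising an error. A minimal sketch:

```python
import yaml

# Two top-level model_conf blocks, mimicking the flagged config.
doc = """
model_conf:
  ctc_weight: 0.3
  lsm_weight: 0.1
model_conf:
  report_cer: true
"""

cfg = yaml.safe_load(doc)
# The first model_conf block is silently dropped; only report_cer survives.
print(cfg["model_conf"])
```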

model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
    report_cer: true
    report_wer: true

# Optimizer
optim: adam
optim_conf:
    lr: 0.0005
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 20000

# SpecAugment
specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range: [0, 30]
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_range: [0, 40]
    num_time_mask: 2

# Reporting

@Aniket-Tathe
Contributor Author

Aniket-Tathe commented Oct 25, 2025

Hello @sw005320

I’d love your feedback on how I can further tune this recipe.
Currently, the Character Conformer model achieves the best generalization (Val WER ≈ 45.2, CER ≈ 22.0),
but I’d like to lower the validation loss and WER further because I think the current model is overfitting.

Do you have suggestions on:

  • Optimal learning rate range or warmup schedule to try?
  • How long (in epochs) should I monitor a given learning rate (e.g., should I change the learning rate if the loss doesn't go down within ~3-5 epochs)?
  • Recommended fine-tuning strategy for XLSR conformer (since it is underperforming right now, I kept conv2d as linear to avoid sub-sampling error)? What should I change, maybe in architecture?
  • Any specific hyperparameters in current train_asr_conformer_xlsr.yaml and train_asr_transformer.yaml I should focus on?

Also, I had a small doubt:
In the transcription, I saw that it literally contains " " (a plain space) in place of <space> everywhere.
For example, it's like "गाण्याचा आवाज थोडा कमी कर ना !"
instead of "गाण्याचा<space>आवाज<space>थोडा<space>कमी<space>कर<space>ना<space>!"
For some reason, during preprocessing, the literal space was not being replaced by <space>. But I see that in tokens.txt of char (i.e., the token list), both " " and <space> are present. Does this create an issue during training that can affect performance, or is this fine? Let me know your thoughts, as I don't know whether this is significant.

I’m happy to run more experiments on Delta and update the RESULTS.md accordingly.
Thanks again for your guidance and your help!

@sw005320
Contributor

  • Optimal learning rate range or warmup schedule to try?

First, the learning rate
Second, warmup
But mostly, the learning rate is fine.
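For context, ESPnet's warmuplr scheduler follows a Noam-style schedule: linear ramp up to the base lr at warmup_steps, then inverse-square-root decay. A minimal sketch of the shape being tuned here (the formula matches my understanding of ESPnet2's WarmupLR; treat it as illustrative):

```python
def warmup_lr(step: int, base_lr: float = 0.0005, warmup_steps: int = 20000) -> float:
    # Linear warmup to base_lr at warmup_steps, then ~1/sqrt(step) decay.
    step = max(step, 1)
    return base_lr * warmup_steps**0.5 * min(step**-0.5, step * warmup_steps**-1.5)
```

The peak lr is reached exactly at warmup_steps, so changing either the base lr or warmup_steps changes both the peak value and how fast it is reached.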

  • For how long (in epochs) should I monitor a specific learning rate (for eg, should I change the learning rate if the loss doesn't go down in around ~5 epochs or ~3 epochs?)

Monitor the learning curve and decide for yourself.
It depends on the data.
You should monitor it more.
There is no answer.

  • Recommended fine-tuning strategy for XLSR conformer (since it is underperforming right now, I kept conv2d as linear to avoid sub-sampling error)? What should I change, maybe in architecture?

Check the other recipes using xlsr
https://github.com/search?q=repo%3Aespnet%2Fespnet%20xlsr&type=code

  • Any specific hyperparameters in current train_asr_conformer_xlsr.yaml and train_asr_transformer.yaml I should focus on?

If you use a config with a similar amount of training data to the other recipes, you don't need to change the architecture. Please focus on the optimization hyperparameters.

Also, I had a small doubt: In the transcription, I saw that it literally contains " " (a plain space) in place of <space> everywhere. For example, it's like "गाण्याचा आवाज थोडा कमी कर ना !" instead of "गाण्याचा<space>आवाज<space>थोडा<space>कमी<space>कर<space>ना<space>!" For some reason, during preprocessing, the literal space was not being replaced by <space>. But I see that in tokens.txt of char (i.e., the token list), both " " and <space> are present. Does this create an issue during training that can affect performance, or is this fine?

Good catch.
It would probably be better to normalize it for either of them.
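A minimal normalization sketch along those lines (normalize_spaces is a hypothetical helper, not part of the recipe; it forces the transcript to use exactly one of the two representations):

```python
def normalize_spaces(text: str, to_token: bool = True) -> str:
    """Normalize a transcript to use either the literal space or <space>, not both."""
    if to_token:
        # Collapse any existing <space> tokens first, then re-tokenize all spaces.
        return text.replace("<space>", " ").replace(" ", "<space>")
    return text.replace("<space>", " ")
```

Running this over the text files before token-list generation would leave only one of " " / <space> in tokens.txt.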

@sw005320
Contributor

I want to make sure that you use a single GPU.
This task does not require multiple GPUs.
Instead, please run several jobs with different learning rates and monitor the learning curves to understand the behavior.
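One way to set up such a sweep is to stamp out one config per learning rate from a base config. A minimal sketch (the base dict and file names are illustrative; in practice the base would come from yaml.safe_load on the recipe's training yaml):

```python
import copy
import yaml

# Illustrative base config; normally loaded from the recipe's training yaml.
base = {
    "optim": "adam",
    "optim_conf": {"lr": 0.0005},
    "scheduler": "warmuplr",
    "scheduler_conf": {"warmup_steps": 20000},
}

for lr in (0.0001, 0.0002, 0.0003, 0.0004):
    cfg = copy.deepcopy(base)
    cfg["optim_conf"]["lr"] = lr
    # One yaml per learning rate, each launched as a separate single-GPU job.
    with open(f"train_asr_lr{lr}.yaml", "w") as f:
        yaml.safe_dump(cfg, f)
```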

@Fhrozen Fhrozen added this to the v.202512 milestone Oct 26, 2025
@codecov

codecov Bot commented Oct 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.16%. Comparing base (9938f5e) to head (09af4c1).
⚠️ Report is 528 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6274       +/-   ##
===========================================
+ Coverage        0   70.16%   +70.16%     
===========================================
  Files           0      787      +787     
  Lines           0    73367    +73367     
===========================================
+ Hits            0    51477    +51477     
- Misses          0    21890    +21890     
| Flag | Coverage Δ |
|---|---|
| test_integration_espnet2 | 46.78% <ø> (?) |
| test_python_espnet2 | 61.21% <ø> (?) |
| test_python_espnet3 | 17.45% <ø> (?) |

Flags with carried forward coverage won't be shown.


Collaborator

@ftshijt ftshijt left a comment


Please also add the data entry at egs2/README.md

Comment thread egs2/marathi_lrec2020/asr1/README.md Outdated
Comment on lines +37 to +38
- **XLSR + conformer (BPE and Char)** underperformed in this setup, likely due to limited fine-tuning (also sub-sampling conv2d was disabled. I used linear for this.)
All the above training was done without any LM model.
Collaborator


In these cases, we may not need to include lm related yaml in conf/tuning

@@ -0,0 +1,205 @@
# Automatic Speech Recognition (Multi-tasking)
Collaborator


I do not see the necessity of this readme. If there is no additional reason, I would recommend we safely remove this readme.

maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.5
lm_weight: 0.3
Collaborator


We can set it to 0 given that no lm is included.

@@ -0,0 +1,7 @@
batch_size: 16
Collaborator


I do not think we support 16 batch size now. Please double check

@@ -0,0 +1,40 @@
# ==========================================
Collaborator


Since transducer is not included, please remove them.

@Aniket-Tathe
Contributor Author

Hi @sw005320

I needed your opinion and guidance for this:

I've written up everything I tried over the past few weeks in detail.
Below are the results after lowering the learning rate.
I've also included the CTC and Attention losses for each learning rate (across the last 5 epochs: 6th–10th) for your reference.


Overall Performance Across Learning Rates

| Learning Rate | Train Acc (%) | Valid Acc (%) | WER (%) | CER (%) |
|---|---|---|---|---|
| 0.0001 | 94.2 | 85.9 | 41.3 | 17.3 |
| 0.0002 | 96.6 | 85.7 | 41.6 | 17.2 |
| 0.0003 | 97.2 | 87.9 | 37.1 | 14.9 |
| 0.0004 | 97.4 | 88.6 | 35.4 | 14.1 |
| 0.0005 (old) | 89.2 | 77.6 | 54.9 | 27.9 |

CTC and Attention Loss Trends (Epoch 6–10)

| Learning Rate | Loss Type | Epoch 6 | Epoch 7 | Epoch 8 | Epoch 9 | Epoch 10 |
|---|---|---|---|---|---|---|
| 0.0001 | loss_ctc | 27.125 | 27.533 | 28.420 | 28.031 | 28.668 |
| 0.0001 | loss_att | 27.089 | 29.896 | 27.088 | 28.365 | 28.832 |
| 0.0002 | loss_ctc | 28.755 | 30.057 | 31.230 | 31.385 | 32.294 |
| 0.0002 | loss_att | 28.663 | 30.295 | 29.980 | 31.343 | 32.441 |
| 0.0003 | loss_ctc | 28.025 | 30.177 | 31.800 | 31.508 | 32.631 |
| 0.0003 | loss_att | 27.661 | 29.601 | 30.393 | 30.115 | 31.611 |
| 0.0004 | loss_ctc | 28.352 | 29.341 | 30.410 | 31.211 | 32.024 |
| 0.0004 | loss_att | 26.322 | 28.351 | 28.652 | 29.488 | 30.311 |
| 0.0005 (old) | loss_ctc | 46.582 | 42.896 | 39.874 | 40.816 | 37.439 |
| 0.0005 (old) | loss_att | 49.149 | 41.999 | 45.019 | 47.217 | 41.953 |

It seemed that 0.0004 gave the most stable training curve and the lowest WER / CER overall.


Last week, I also tried Macaron with the lower learning rates for the first 5 epochs. The loss_ctc and loss_att values were lower than with macaron off, but they did not keep going down as the epochs increased.

Macaron Results (No SpecAug, No LM)

These runs used the Macaron-style Conformer variant (same settings, lower learning rates).

| Learning Rate | Train Acc (%) | Valid Acc (%) | WER (%) | CER (%) |
|---|---|---|---|---|
| 0.0001 | 94.9 | 87.7 | 38.6 | 15.0 |
| 0.0002 | 97.1 | 89.0 | 33.8 | 13.4 |
| 0.0003 | 97.6 | 89.0 | 33.1 | 13.6 |
| 0.0004 | 97.8 | 88.6 | 34.4 | 13.8 |

Macaron CTC and Attention Loss Trends (Epoch 1–5)

| Learning Rate | Loss Type | Epoch 1 | Epoch 2 | Epoch 3 | Epoch 4 | Epoch 5 |
|---|---|---|---|---|---|---|
| 0.0001 | loss_ctc | 27.790 | 25.751 | 25.281 | 25.187 | 26.846 |
| 0.0001 | loss_att | 21.624 | 20.325 | 19.744 | 20.473 | 21.509 |
| 0.0002 | loss_ctc | 24.305 | 23.761 | 25.166 | 25.773 | 27.759 |
| 0.0002 | loss_att | 18.216 | 18.666 | 20.150 | 21.415 | 22.514 |
| 0.0003 | loss_ctc | 23.671 | 23.770 | 25.264 | 25.869 | 28.357 |
| 0.0003 | loss_att | 19.363 | 20.545 | 22.431 | 22.719 | 25.047 |
| 0.0004 | loss_ctc | 23.461 | 23.853 | 25.332 | 26.138 | 28.339 |
| 0.0004 | loss_att | 19.382 | 21.385 | 22.637 | 24.487 | 26.252 |

I've also added the results where I went above a learning rate of 0.0005.
These runs used SpecAug, no Macaron, and no LM.

| Learning Rate | Train Acc (%) | Valid Acc (%) | WER (%) | CER (%) |
|---|---|---|---|---|
| 0.0005 | 89.2 | 77.6 | 54.9 | 27.9 |
| 0.0010 | 78.1 | 65.8 | 74.8 | 40.9 |
| 0.0015 | 70.9 | 64.0 | 78.0 | 42.8 |
| 0.0020 | 58.9 | 57.8 | 88.5 | 49.5 |
| 0.0030 | 54.2 | 54.6 | 92.1 | 53.0 |

CTC and Attention Loss (High Learning Rates, Epoch 1–5)

| Learning Rate | Loss Type | Epoch 1 | Epoch 2 | Epoch 3 | Epoch 4 | Epoch 5 |
|---|---|---|---|---|---|---|
| 0.0010 | loss_ctc | 146.339 | 163.942 | 160.610 | 156.909 | 134.504 |
| 0.0010 | loss_att | 54.609 | 56.884 | 60.222 | 62.448 | 65.292 |
| 0.0015 | loss_ctc | 153.727 | 220.865 | 238.517 | 206.666 | 240.485 |
| 0.0015 | loss_att | 60.036 | 56.813 | 59.422 | 60.553 | 62.000 |
| 0.0020 | loss_ctc | 153.202 | 187.536 | 227.860 | 230.023 | 233.365 |
| 0.0020 | loss_att | 66.843 | 63.605 | 64.502 | 64.107 | 63.166 |
| 0.0030 | loss_ctc | 172.739 | 192.682 | 207.425 | 180.629 | ≈117.5 (avg train batch) |
| 0.0030 | loss_att | 70.300 | 67.575 | 67.511 | 65.386 | ≈56.0 (avg train batch) |

The 5th epoch for 0.0030 was incomplete, so I averaged the values across that epoch's mini-batches (hence "avg train batch").


Current vs New Proposed Model Configurations

Previous (Baseline):

  • 3-block Conformer encoder (256-dim, 4 heads, 1024 FFN, no macaron, CNN kernel 17)
  • 3-block Transformer decoder (4 heads, 1024 FFN)
  • Adam (LR = 0.0005), warmup 20k, no SpecAug, no LM

New Proposed Model:

  • 12-block Conformer encoder (512-dim, 4 heads, 2048 FFN, macaron on, CNN kernel 15)
  • 6-block Transformer decoder (4 heads, 2048 FFN)
  • Adam (LR = 0.0004), warmup 20k, SpecAug enabled, no LM

Changes Made (Need Guidance On)

  1. conformer out dim from 256 to 512
  2. conformer block from 3 to 12
  3. decoder block from 3 to 6
  4. FFN from 1024 to 2048 (I'm following FFN ~ 4x conf out dim)
  5. macaron on (not sure about this)
  6. init: xavier_uniform (it was used in all the experiments above, but I don't know whether it's still needed for a bigger Conformer; from what I've read, it's mainly useful for small-to-medium transformers)
  7. Yes/No specaug
  8. Yes/No Lm
  9. I remember you said to keep the learning rate 0.0004, but I've added detailed results of all lr and configs above, so I would just like to confirm again, just to be sure.
  10. Warmup: 20000 (default) or should I increase/decrease it?
  11. CNN kernel (was 17, made it 15) or should I change it?

Would you recommend keeping this configuration, or should I change something? Answers to the above queries would help me a lot in deciding the final new architecture before training it.

Thanks again for your guidance.

@sw005320
Contributor

The XLSR results seem worse due to a lack of tuning.
We may remove them since they are confusing.

@Fhrozen Fhrozen modified the milestones: v.202604, v.202607 Apr 7, 2026
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Apr 19, 2026
@sw005320 sw005320 merged commit 433b308 into espnet:master Apr 20, 2026
109 checks passed
@sw005320
Contributor

Thanks, @Aniket-Tathe!


Labels

ASR Automatic speech recogntion ESPnet2 README Recipe size:L This PR changes 100-499 lines, ignoring generated files.
