
Add Marathi LREC2020 ASR recipe (ESPnet bootcamp)#6274

Merged
sw005320 merged 7 commits into espnet:master from Aniket-Tathe:marathi_asr_bootcamp
Apr 20, 2026
Conversation

@Aniket-Tathe
Contributor

Summary

This PR adds a new ESPnet2 recipe for the Marathi LREC2020 dataset as part of the WavLab Bootcamp.

Details

  • Models: Conformer with Character, BPE-150, and BPE-2000 tokenization; XLSR-Conformer with Character and BPE-2000 tokenization
  • Framework: ESPnet2
  • Dataset: IndicCorpora Marathi subset (~109 hours)
  • Training platform: NCSA Delta (A100 GPU)
  • Includes conf/, local/, and symbolic links to template scripts
  • No LM is used for training or decoding.

Results

| Model | Token | Epochs | Train Acc (%) | Val Loss | WER (%) | CER (%) |
|---|---|---|---|---|---|---|
| Char Conformer | Char | 31 / 50 | 98.3 | 47.75 | 45.2 | 22.0 |
| BPE-150 Conformer | BPE-150 | 10 / 30 | 96.8 | 52.96 | 90.1 | 26.5 |
| BPE-2000 Conformer | BPE-2000 | 25 / 30 | 99.6 | 53.6 | 89.2 | 42.1 |
| XLSR-Conformer (BPE-2000) | BPE-2000 | 22 / 60 | 97.8 | 66.83 | 99.2 | 51.0 |
| XLSR-Conformer (Char) | Char | 13 / 60 | 82.4 | 75.15 | 78.7 | 43.3 |

Notes

  • Character model achieved the best generalization.
  • BPE models showed overfitting.
  • XLSR + Conformer underperformed due to limited fine-tuning (conv2d subsampling replaced with linear).
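For reference, the WER/CER numbers above are standard edit-distance metrics. A minimal sketch of how they are computed (not the ESPnet scoring script, just the underlying idea):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(ref, hyp):
    """Word error rate: edit distance over word sequences."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate: edit distance over character sequences."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

This also shows why character and BPE models are compared on both metrics: CER is computed on the raw character sequence regardless of the tokenization used for training.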

Dataset Reference

P. Jyothi et al., “IndicCorpora: A Large Multilingual Corpus for Indic Languages.”
IIT Bombay IndicCorpora Project – Marathi

@dosubot dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. ASR Automatic speech recogntion ESPnet2 Recipe labels Oct 25, 2025
@mergify mergify Bot added the README label Oct 25, 2025
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This PR adds a new ESPnet2 recipe for the Marathi LREC2020 dataset. The overall structure and implementation follow the standard ESPnet recipe format. I've found one configuration issue that needs to be addressed to ensure the training runs with the intended hyperparameters.

Comment on lines +47 to +77
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false

# Optimizer
optim: adam
optim_conf:
    lr: 0.0005
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 20000

# SpecAugment
specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range: [0, 30]
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_range: [0, 40]
    num_time_mask: 2

# Reporting
model_conf:
    ctc_weight: 0.3 # hybrid CTC/attention (default)
    report_cer: true
    report_wer: true
Contributor


high

The model_conf key is defined twice in this configuration file (lines 47 and 74). In YAML, this will result in the second definition overwriting the first, causing the loss of lsm_weight and length_normalized_loss settings. These configurations should be merged into a single model_conf block to ensure all settings are applied correctly.
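The overwrite is easy to demonstrate: PyYAML keeps the last duplicate key without raising an error. A minimal sketch:

```python
import yaml

# Two top-level model_conf blocks, mimicking the flagged config.
doc = """
model_conf:
  ctc_weight: 0.3
  lsm_weight: 0.1
model_conf:
  report_cer: true
"""

cfg = yaml.safe_load(doc)
# The first model_conf block is silently dropped; only report_cer survives.
print(cfg["model_conf"])
```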

model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
    report_cer: true
    report_wer: true

# Optimizer
optim: adam
optim_conf:
    lr: 0.0005
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 20000

# SpecAugment
specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range: [0, 30]
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_range: [0, 40]
    num_time_mask: 2

# Reporting

@Aniket-Tathe
Contributor Author

Aniket-Tathe commented Oct 25, 2025

Hello @sw005320

I’d love your feedback on how I can further tune this recipe.
Currently, the Character Conformer model achieves the best generalization (Val WER ≈ 45.2, CER ≈ 22.0),
but I’d like to lower the validation loss and WER further because I think the current model is overfitting.

Do you have suggestions on:

  • Optimal learning rate range or warmup schedule to try?
  • How long (in epochs) should I monitor a given learning rate (e.g., should I change the learning rate if the loss doesn't go down within ~3-5 epochs)?
  • Recommended fine-tuning strategy for XLSR conformer (since it is underperforming right now, I kept conv2d as linear to avoid sub-sampling error)? What should I change, maybe in architecture?
  • Any specific hyperparameters in current train_asr_conformer_xlsr.yaml and train_asr_transformer.yaml I should focus on?

Also, I had a small doubt:
In the transcription, I saw that it literally contains " " (a plain space) in place of <space> everywhere.
For example, it's like "गाण्याचा आवाज थोडा कमी कर ना !"
instead of "गाण्याचा<space>आवाज<space>थोडा<space>कमी<space>कर<space>ना<space>!"
For some reason, during preprocessing, the literal space was not being replaced by <space>. But I see that in tokens.txt of char (i.e., the token list), both " " and <space> are present. Does this create an issue during training that can affect performance, or is this fine? Let me know your thoughts, as I don't know whether this is significant.

I’m happy to run more experiments on Delta and update the RESULTS.md accordingly.
Thanks again for your guidance and your help!

@sw005320
Contributor

  • Optimal learning rate range or warmup schedule to try?

First, the learning rate
Second, warmup
But mostly, the learning rate is fine.
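For context, ESPnet's warmuplr scheduler follows a Noam-style schedule: linear ramp up to the base lr at warmup_steps, then inverse-square-root decay. A minimal sketch of the shape being tuned here (the formula matches my understanding of ESPnet2's WarmupLR; treat it as illustrative):

```python
def warmup_lr(step: int, base_lr: float = 0.0005, warmup_steps: int = 20000) -> float:
    # Linear warmup to base_lr at warmup_steps, then ~1/sqrt(step) decay.
    step = max(step, 1)
    return base_lr * warmup_steps**0.5 * min(step**-0.5, step * warmup_steps**-1.5)
```

The peak lr is reached exactly at warmup_steps, so changing either the base lr or warmup_steps changes both the peak value and how fast it is reached.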

  • For how long (in epochs) should I monitor a specific learning rate (for eg, should I change the learning rate if the loss doesn't go down in around ~5 epochs or ~3 epochs?)

Monitor the learning curve and decide for yourself.
It depends on the data.
You should monitor it more.
There is no answer.

  • Recommended fine-tuning strategy for XLSR conformer (since it is underperforming right now, I kept conv2d as linear to avoid sub-sampling error)? What should I change, maybe in architecture?

Check the other recipes using xlsr
https://github.com/search?q=repo%3Aespnet%2Fespnet%20xlsr&type=code

  • Any specific hyperparameters in current train_asr_conformer_xlsr.yaml and train_asr_transformer.yaml I should focus on?

If you use a config with a similar amount of training data to the other recipes, you don't need to change the architecture. Please focus on the optimization hyperparameters.

Also, I had a small doubt: In the transcription, I saw that it literally contains " " (a plain space) in place of <space> everywhere. For example, it's like "गाण्याचा आवाज थोडा कमी कर ना !" instead of "गाण्याचा<space>आवाज<space>थोडा<space>कमी<space>कर<space>ना<space>!" For some reason, during preprocessing, the literal space was not being replaced by <space>. But I see that in tokens.txt of char (i.e., the token list), both " " and <space> are present. Does this create an issue during training that can affect performance, or is this fine?

Good catch.
It would probably be better to normalize it for either of them.
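A minimal normalization sketch along those lines (normalize_spaces is a hypothetical helper, not part of the recipe; it forces the transcript to use exactly one of the two representations):

```python
def normalize_spaces(text: str, to_token: bool = True) -> str:
    """Normalize a transcript to use either the literal space or <space>, not both."""
    if to_token:
        # Collapse any existing <space> tokens first, then re-tokenize all spaces.
        return text.replace("<space>", " ").replace(" ", "<space>")
    return text.replace("<space>", " ")
```

Running this over the text files before token-list generation would leave only one of " " / <space> in tokens.txt.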

@sw005320
Contributor

I want to make sure that you use a single GPU.
This task does not require multiple GPUs.
Instead, please run several jobs with different learning rates and monitor the learning curves to understand the behavior.
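One way to set up such a sweep is to stamp out one config per learning rate from a base config. A minimal sketch (the base dict and file names are illustrative; in practice the base would come from yaml.safe_load on the recipe's training yaml):

```python
import copy
import yaml

# Illustrative base config; normally loaded from the recipe's training yaml.
base = {
    "optim": "adam",
    "optim_conf": {"lr": 0.0005},
    "scheduler": "warmuplr",
    "scheduler_conf": {"warmup_steps": 20000},
}

for lr in (0.0001, 0.0002, 0.0003, 0.0004):
    cfg = copy.deepcopy(base)
    cfg["optim_conf"]["lr"] = lr
    # One yaml per learning rate, each launched as a separate single-GPU job.
    with open(f"train_asr_lr{lr}.yaml", "w") as f:
        yaml.safe_dump(cfg, f)
```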

@Fhrozen Fhrozen added this to the v.202512 milestone Oct 26, 2025
@codecov

codecov Bot commented Oct 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.16%. Comparing base (9938f5e) to head (09af4c1).
⚠️ Report is 528 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6274       +/-   ##
===========================================
+ Coverage        0   70.16%   +70.16%     
===========================================
  Files           0      787      +787     
  Lines           0    73367    +73367     
===========================================
+ Hits            0    51477    +51477     
- Misses          0    21890    +21890     
| Flag | Coverage Δ |
|---|---|
| test_integration_espnet2 | 46.78% <ø> (?) |
| test_python_espnet2 | 61.21% <ø> (?) |
| test_python_espnet3 | 17.45% <ø> (?) |

Flags with carried forward coverage won't be shown.


Collaborator

@ftshijt ftshijt left a comment


Please also add the data entry at egs2/README.md

Comment thread egs2/marathi_lrec2020/asr1/README.md Outdated
Comment on lines +37 to +38
- **XLSR + conformer (BPE and Char)** underperformed in this setup, likely due to limited fine-tuning (also sub-sampling conv2d was disabled. I used linear for this.)
All the above training was done without any LM model.
Collaborator


In these cases, we may not need to include lm related yaml in conf/tuning

@@ -0,0 +1,205 @@
# Automatic Speech Recognition (Multi-tasking)
Collaborator


I do not see the necessity of this readme. If there is no additional reason, I would recommend we safely remove this readme.

maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.5
lm_weight: 0.3
Collaborator


We can set it to 0 given that no lm is included.

@@ -0,0 +1,7 @@
batch_size: 16
Collaborator


I do not think we support 16 batch size now. Please double check

@@ -0,0 +1,40 @@
# ==========================================
Collaborator


Since transducer is not included, please remove them.

@Aniket-Tathe
Contributor Author

Hi @sw005320

I needed your opinion and guidance for this:

I've written up everything I tried over the past few weeks in detail.
Below are the results after lowering the learning rate.
I've also included the CTC and Attention losses for each learning rate (across the last 5 epochs: 6th–10th) for your reference.


Overall Performance Across Learning Rates

| Learning Rate | Train Acc (%) | Valid Acc (%) | WER (%) | CER (%) |
|---|---|---|---|---|
| 0.0001 | 94.2 | 85.9 | 41.3 | 17.3 |
| 0.0002 | 96.6 | 85.7 | 41.6 | 17.2 |
| 0.0003 | 97.2 | 87.9 | 37.1 | 14.9 |
| 0.0004 | 97.4 | 88.6 | 35.4 | 14.1 |
| 0.0005 (old) | 89.2 | 77.6 | 54.9 | 27.9 |

CTC and Attention Loss Trends (Epoch 6–10)

| Learning Rate | Loss Type | Epoch 6 | Epoch 7 | Epoch 8 | Epoch 9 | Epoch 10 |
|---|---|---|---|---|---|---|
| 0.0001 | loss_ctc | 27.125 | 27.533 | 28.420 | 28.031 | 28.668 |
| 0.0001 | loss_att | 27.089 | 29.896 | 27.088 | 28.365 | 28.832 |
| 0.0002 | loss_ctc | 28.755 | 30.057 | 31.230 | 31.385 | 32.294 |
| 0.0002 | loss_att | 28.663 | 30.295 | 29.980 | 31.343 | 32.441 |
| 0.0003 | loss_ctc | 28.025 | 30.177 | 31.800 | 31.508 | 32.631 |
| 0.0003 | loss_att | 27.661 | 29.601 | 30.393 | 30.115 | 31.611 |
| 0.0004 | loss_ctc | 28.352 | 29.341 | 30.410 | 31.211 | 32.024 |
| 0.0004 | loss_att | 26.322 | 28.351 | 28.652 | 29.488 | 30.311 |
| 0.0005 (old) | loss_ctc | 46.582 | 42.896 | 39.874 | 40.816 | 37.439 |
| 0.0005 (old) | loss_att | 49.149 | 41.999 | 45.019 | 47.217 | 41.953 |

It seemed that 0.0004 gave the most stable training curve and the lowest WER / CER overall.


Last week, I also tried Macaron with the lower learning rates for the first 5 epochs. The loss_ctc and loss_att values were lower than with macaron off, but they did not keep going down as the epochs increased.

Macaron Results (No SpecAug, No LM)

These runs used the Macaron-style Conformer variant (same settings, lower learning rates).

| Learning Rate | Train Acc (%) | Valid Acc (%) | WER (%) | CER (%) |
|---|---|---|---|---|
| 0.0001 | 94.9 | 87.7 | 38.6 | 15.0 |
| 0.0002 | 97.1 | 89.0 | 33.8 | 13.4 |
| 0.0003 | 97.6 | 89.0 | 33.1 | 13.6 |
| 0.0004 | 97.8 | 88.6 | 34.4 | 13.8 |

Macaron CTC and Attention Loss Trends (Epoch 1–5)

| Learning Rate | Loss Type | Epoch 1 | Epoch 2 | Epoch 3 | Epoch 4 | Epoch 5 |
|---|---|---|---|---|---|---|
| 0.0001 | loss_ctc | 27.790 | 25.751 | 25.281 | 25.187 | 26.846 |
| 0.0001 | loss_att | 21.624 | 20.325 | 19.744 | 20.473 | 21.509 |
| 0.0002 | loss_ctc | 24.305 | 23.761 | 25.166 | 25.773 | 27.759 |
| 0.0002 | loss_att | 18.216 | 18.666 | 20.150 | 21.415 | 22.514 |
| 0.0003 | loss_ctc | 23.671 | 23.770 | 25.264 | 25.869 | 28.357 |
| 0.0003 | loss_att | 19.363 | 20.545 | 22.431 | 22.719 | 25.047 |
| 0.0004 | loss_ctc | 23.461 | 23.853 | 25.332 | 26.138 | 28.339 |
| 0.0004 | loss_att | 19.382 | 21.385 | 22.637 | 24.487 | 26.252 |

I've also added the results where I went above a learning rate of 0.0005.
These runs used SpecAug, no Macaron, and no LM.

| Learning Rate | Train Acc (%) | Valid Acc (%) | WER (%) | CER (%) |
|---|---|---|---|---|
| 0.0005 | 89.2 | 77.6 | 54.9 | 27.9 |
| 0.0010 | 78.1 | 65.8 | 74.8 | 40.9 |
| 0.0015 | 70.9 | 64.0 | 78.0 | 42.8 |
| 0.0020 | 58.9 | 57.8 | 88.5 | 49.5 |
| 0.0030 | 54.2 | 54.6 | 92.1 | 53.0 |

CTC and Attention Loss (High Learning Rates, Epoch 1–5)

| Learning Rate | Loss Type | Epoch 1 | Epoch 2 | Epoch 3 | Epoch 4 | Epoch 5 |
|---|---|---|---|---|---|---|
| 0.0010 | loss_ctc | 146.339 | 163.942 | 160.610 | 156.909 | 134.504 |
| 0.0010 | loss_att | 54.609 | 56.884 | 60.222 | 62.448 | 65.292 |
| 0.0015 | loss_ctc | 153.727 | 220.865 | 238.517 | 206.666 | 240.485 |
| 0.0015 | loss_att | 60.036 | 56.813 | 59.422 | 60.553 | 62.000 |
| 0.0020 | loss_ctc | 153.202 | 187.536 | 227.860 | 230.023 | 233.365 |
| 0.0020 | loss_att | 66.843 | 63.605 | 64.502 | 64.107 | 63.166 |
| 0.0030 | loss_ctc | 172.739 | 192.682 | 207.425 | 180.629 | ≈117.5 (avg train batch) |
| 0.0030 | loss_att | 70.300 | 67.575 | 67.511 | 65.386 | ≈56.0 (avg train batch) |

The 5th epoch for 0.0030 was incomplete, so I averaged the values across that epoch's mini-batches (hence "avg train batch").


Current vs New Proposed Model Configurations

Previous (Baseline):

  • 3-block Conformer encoder (256-dim, 4 heads, 1024 FFN, no macaron, CNN kernel 17)
  • 3-block Transformer decoder (4 heads, 1024 FFN)
  • Adam (LR = 0.0005), warmup 20k, no SpecAug, no LM

New Proposed Model:

  • 12-block Conformer encoder (512-dim, 4 heads, 2048 FFN, macaron on, CNN kernel 15)
  • 6-block Transformer decoder (4 heads, 2048 FFN)
  • Adam (LR = 0.0004), warmup 20k, SpecAug enabled, no LM

Changes Made (Need Guidance On)

  1. conformer out dim from 256 to 512
  2. conformer block from 3 to 12
  3. decoder block from 3 to 6
  4. FFN from 1024 to 2048 (I'm following FFN ~ 4x conf out dim)
  5. macaron on (not sure about this)
  6. init: xavier_uniform (it was used in all the experiments above, but I don't know whether it's still needed for a bigger Conformer; from what I've read, it's mainly useful for small-to-medium transformers)
  7. Yes/No specaug
  8. Yes/No Lm
  9. I remember you said to keep the learning rate 0.0004, but I've added detailed results of all lr and configs above, so I would just like to confirm again, just to be sure.
  10. Warmup: 20000 (default) or should I increase/decrease it?
  11. CNN kernel (was 17, made it 15) or should I change it?

Would you recommend keeping this configuration, or should I change something? Answers to the above queries would help me a lot in deciding the final new architecture before training it.

Thanks again for your guidance.

@sw005320
Contributor

The XLSR results seem worse due to a lack of tuning.
We may remove them since they are confusing.

@Fhrozen Fhrozen modified the milestones: v.202604, v.202607 Apr 7, 2026
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Apr 19, 2026
@sw005320 sw005320 merged commit 433b308 into espnet:master Apr 20, 2026
109 checks passed
@sw005320
Contributor

Thanks, @Aniket-Tathe!


Labels

ASR Automatic speech recogntion ESPnet2 README Recipe size:L This PR changes 100-499 lines, ignoring generated files.
