Add Myst children speech recipe #2997
@@ -0,0 +1,91 @@
# MyST ASR with Transformer or Whisper models

This folder contains scripts to train a Transformer-based speech recognizer from scratch, or to fine-tune a Whisper model, on the My Science Tutor (MyST) corpus.
MyST is one of the largest publicly accessible collections of English children's speech, comprising approximately 400 hours. It encompasses dialogues between
children and a virtual tutor across eight scientific domains, involving 1,372 students in grades three to five. The corpus is pre-partitioned, ensuring equitable
representation of scientific domains and unique student occurrences within each partition. However, only 45% of the utterances are transcribed at the word level.

You can find the MyST dataset at https://catalog.ldc.upenn.edu/LDC2021S05
# How to run

```shell
python train_with_whisper.py hparams/train_hf_whisper.yaml    # Fine-tune Whisper
python train_with_whisper.py hparams/train_whisper_lora.yaml  # Fine-tune Whisper with LoRA
python train.py hparams/transformer.yaml                      # Train a Transformer model from scratch
```
# How to run on test sets only

If you want to run on the test sets only, add the `--test_only` flag to any of the commands above:

```shell
python train_with_whisper.py hparams/train_hf_whisper.yaml --test_only
python train_with_whisper.py hparams/train_whisper_lora.yaml --test_only
python train.py hparams/transformer.yaml --test_only
```
**If using a HuggingFace pre-trained model, please make sure you have "transformers" installed in your environment (see extra-requirements.txt).**
# Note about data preparation

In accordance with the methodology presented in [1], we offer an optional WER filtering mechanism. It discards every utterance whose WER against a pre-trained Whisper hypothesis exceeds a specified threshold, which lengthens data preparation because every file must first be decoded with that model. We highly recommend running the data preparation process only once and saving the resulting CSV files for future use.

Note that this data filtering takes a couple of hours to run.

[1] A. A. Attia et al., "Kid-Whisper: Towards bridging the performance gap in automatic speech recognition for children vs. adults," in *Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society*, vol. 7, 2024, pp. 74–80.
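As an illustration of the filtering described above, here is a minimal sketch (not the recipe's actual implementation): `transcribe` is a hypothetical stand-in for decoding a wav file with the pre-trained Whisper model, and the WER is the usual word-level edit distance.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference length, in percent."""
    r, h = ref.split(), hyp.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return 100.0 * d[len(r)][len(h)] / max(len(r), 1)


def filter_utterances(utterances, transcribe, wer_threshold=50.0):
    """Keep only utterances whose Whisper hypothesis stays under the WER threshold."""
    return [u for u in utterances
            if wer(u["transcript"], transcribe(u["wav"])) <= wer_threshold]
```

Because `transcribe` runs a full model decode per file, this is the step that dominates preparation time, hence the advice to cache the resulting CSVs.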
# Results

## Whisper Fine-tuning Results

The following table reports results for one epoch of Whisper fine-tuning under different configurations.
| Release | Model | Configuration | Hyperparams file | LM | WER | Model link |
|---|---|---|---|---|---|---|
| 2025-11-13 | large-v3 | Decoder | train_hf_whisper.yaml | No | 8.36% | [Save](https://cloud.inesc-id.pt/s/eknR4y73RHKSB7F) |
| 2025-11-13 | medium.en | Decoder | train_hf_whisper.yaml | No | 8.50% | [Save](https://cloud.inesc-id.pt/s/oJeyJCM7R2tGmPG) |
| 2025-11-13 | medium.en | Encoder + Decoder | train_hf_whisper.yaml | No | 8.75% | [Save](https://cloud.inesc-id.pt/s/px3KWAditRo7wHH) |
| 2025-11-13 | medium.en | LoRA (r=16) in Decoder | train_whisper_lora.yaml | No | 9.38% | [Save](https://cloud.inesc-id.pt/s/6YrRKPjNpKdMgoW) |
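For context on the LoRA row: LoRA freezes the original weights and learns a low-rank update ΔW = B·A per adapted matrix, so the trainable parameters for a `d_out × d_in` matrix drop from `d_out·d_in` to `r·(d_in + d_out)`. A back-of-the-envelope check (the `d_model = 1024` hidden size assumed here for whisper-medium is illustrative, not taken from the recipe):

```python
def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters of a LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return d_out * r + r * d_in

d_model = 1024                       # assumed hidden size for whisper-medium
full = d_model * d_model             # parameters of one full weight matrix
lora = lora_params(d_model, d_model, r=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.1%}")
# A rank-16 adapter trains ~3% of the parameters of one full matrix.
```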
> **Comment on lines +48 to +51**
>
> **Collaborator:** May I ask how competitive your results are? By the way, thanks for uploading the models. I will transfer them to Dropbox so that we can host them ourselves.
>
> **Author:** SOTA for MyST is approximately 8-9% WER, which varies significantly with the data preparation and filtering method, so direct comparison is challenging. This was my motivation for this PR: to provide a standardised data preparation method that facilitates comparison among works.
## Transformers

| Release | Model | Hyperparams file | LM | WER | Model link |
|---|---|---|---|---|---|
| 2025-11-15 | Transformer | transformer.yaml | LibriSpeech LM | 12.95% | [Save](https://cloud.inesc-id.pt/s/ooG53HSjsTJTZPY) |
# **About SpeechBrain**
- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/
- HuggingFace: https://huggingface.co/speechbrain/
# **Citing SpeechBrain**
Please cite SpeechBrain if you use it for your research or business.

```bibtex
@misc{speechbrainV1,
  title={Open-Source Conversational AI with SpeechBrain 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```
@@ -0,0 +1,166 @@
# ################################
# Model: Whisper (Encoder-Decoder) + NLL
# Augmentation: TimeDomainSpecAugment
# Authors: Adel Moumen 2022 & 2024, Titouan Parcollet 2022, Thomas Rolland 2025
# ################################

# Seed needs to be set at top of yaml, before objects with parameters are made
seed: 1986
__set_seed: !apply:speechbrain.utils.seed_everything [!ref <seed>]
output_folder: !ref results/whisper/<seed>
output_wer_folder: !ref <output_folder>/
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

# HuggingFace hub ID of the largest OpenAI Whisper model.
whisper_hub: openai/whisper-large-v3
whisper_folder: !ref <save_folder>/whisper_checkpoint

# Normalize the English transcripts with
# the same normalization done in the Whisper paper
normalized_transcripts: True
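The `!ref .../<seed>` entries above are HyperPyYAML references resolved at load time. Purely as an illustration of that substitution idea (this is not HyperPyYAML's actual code, just a toy resolver):

```python
import re

def resolve_refs(hparams: dict) -> dict:
    """Repeatedly substitute <name> placeholders with the referenced values."""
    resolved = dict(hparams)
    changed = True
    while changed:
        changed = False
        for key, val in resolved.items():
            if isinstance(val, str) and "<" in val:
                new = re.sub(r"<(\w+)>",
                             lambda m: str(resolved[m.group(1)]), val)
                if new != val:
                    resolved[key] = new
                    changed = True
    return resolved

cfg = {
    "seed": 1986,
    "output_folder": "results/whisper/<seed>",
    "save_folder": "<output_folder>/save",
}
print(resolve_refs(cfg)["save_folder"])  # results/whisper/1986/save
```

This is why `seed` must appear before anything that references it: every later value can be expressed in terms of earlier ones.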
# Data files
data_folder: !PLACEHOLDER # e.g., /path/to/myst/data/
train_splits: ["train"]
dev_splits: ["development"]
test_splits: ["test"]
skip_prep: False
train_csv: !ref <output_folder>/train.csv
valid_csv: !ref <output_folder>/development.csv
test_csv: !ref <output_folder>/test.csv

# Data preparation
enable_wer_filter: True
wer_threshold: 50.0
asr_model: "openai/whisper-large-v3"

ckpt_interval_minutes: 10 # save checkpoint every N min
############################## Training Parameters #############################
freeze_encoder: True
number_of_epochs: 1
weight_decay: 0.01
lr_whisper: 1e-5
warmup_steps: 500
max_grad_norm: 2.0
sorting: ascending
precision: fp16 # bf16, fp16 or fp32
eval_precision: fp16
sampling_rate: 16_000

# With data_parallel batch_size is split into N jobs
# With DDP batch_size is multiplied by N jobs
# This setup works well with 1x 32GB GPU
batch_size: 16
test_batch_size: 4
grad_accumulation_factor: 1
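The batching comments above imply that the effective batch size depends on the parallelism strategy and on gradient accumulation; a quick sketch of the arithmetic (the multi-GPU counts are illustrative, not part of this recipe):

```python
def effective_batch_size(batch_size, grad_accumulation_factor, n_jobs, strategy):
    """Effective optimizer batch size under the two strategies described above."""
    if strategy == "data_parallel":
        # batch_size is split across the N jobs, so the total stays the same
        return batch_size * grad_accumulation_factor
    if strategy == "ddp":
        # batch_size is per process, so the total is multiplied by N jobs
        return batch_size * n_jobs * grad_accumulation_factor
    raise ValueError(f"unknown strategy: {strategy}")

print(effective_batch_size(16, 1, 1, "ddp"))  # 16: the 1x 32GB GPU setup above
print(effective_batch_size(16, 1, 4, "ddp"))  # 64 with 4 GPUs under DDP
```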
# Decoding parameters
min_decode_ratio: 0.0
max_decode_ratio: 1.0
test_beam_size: 8

####################### Model Parameters #######################################

train_loader_kwargs:
    batch_size: !ref <batch_size>

valid_loader_kwargs:
    batch_size: !ref <test_batch_size>

test_loader_kwargs:
    batch_size: !ref <test_batch_size>

epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>
############################## Augmentations ###################################

# UNCOMMENT THIS SECTION TO ADD AUGMENTATIONS
# speed_perturb: !new:speechbrain.augment.time_domain.SpeedPerturb
#     orig_freq: !ref <sampling_rate>
#     speeds: [95, 100, 105]

# # Frequency drop: randomly drops a number of frequency bands to zero.
# drop_freq: !new:speechbrain.augment.time_domain.DropFreq
#     drop_freq_low: 0 # Min frequency band dropout probability
#     drop_freq_high: 1 # Max frequency band dropout probability
#     drop_freq_count_low: 1 # Min number of frequency bands to drop
#     drop_freq_count_high: 3 # Max number of frequency bands to drop
#     drop_freq_width: 0.05 # Width of frequency bands to drop

# # Time drop: randomly drops a number of temporal chunks.
# drop_chunk: !new:speechbrain.augment.time_domain.DropChunk
#     drop_length_low: 1
#     drop_length_high: 5
#     drop_count_low: 1000
#     drop_count_high: 2000

# # Augmenter: Combines previously defined augmentations to perform data augmentation
# wav_augment: !new:speechbrain.augment.augmenter.Augmenter
#     concat_original: True
#     min_augmentations: 3
#     max_augmentations: 3
#     augment_prob: 1.0
#     augmentations: [
#         !ref <speed_perturb>,
#         !ref <drop_freq>,
#         !ref <drop_chunk>]
############################## Models ##########################################

whisper: !new:speechbrain.integrations.huggingface.whisper.Whisper
    source: !ref <whisper_hub>
    freeze_encoder: !ref <freeze_encoder>
    save_path: !ref <whisper_folder>
    language: "english"
    task: "transcribe"
    sampling_rate: !ref <sampling_rate>

log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: True

nll_loss: !name:speechbrain.nnet.losses.nll_loss

modules:
    whisper: !ref <whisper>
############################## Decoding & optimiser ############################

whisper_opt_class: !name:torch.optim.AdamW
    lr: !ref <lr_whisper>
    weight_decay: !ref <weight_decay>

valid_search: !new:speechbrain.decoders.seq2seq.S2SWhisperGreedySearcher
    model: !ref <whisper>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>

test_search: !new:speechbrain.decoders.seq2seq.S2SWhisperBeamSearcher
    module: [!ref <whisper>]
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <test_beam_size>
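Validation uses a greedy searcher while testing uses beam search with `beam_size: 8`. The difference can be illustrated on a toy scoring function (the `step_fn` interface here is hypothetical and unrelated to the SpeechBrain searcher API): greedy commits to the single best token at each step, while a beam keeps several hypotheses and can recover a globally better sequence.

```python
import math

def greedy_search(step_fn, max_steps, bos=0):
    """Greedy decoding: take the single best next token at every step."""
    prefix = (bos,)
    for _ in range(max_steps):
        probs = step_fn(prefix)                     # {token: logprob}
        prefix += (max(probs, key=probs.get),)
    return prefix

def beam_search(step_fn, beam_size, max_steps, bos=0):
    """Keep the beam_size best partial hypotheses; return the best sequence."""
    beams = [((bos,), 0.0)]
    for _ in range(max_steps):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_fn(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(beams, key=lambda c: c[1])

# Toy model where the locally best first token leads to a weak continuation.
table = {
    (0,):    {1: math.log(0.6), 2: math.log(0.4)},
    (0, 1):  {3: math.log(0.3)},   # greedy path: 0.6 * 0.3 = 0.18
    (0, 2):  {3: math.log(0.9)},   # beam finds:  0.4 * 0.9 = 0.36
}
step = lambda prefix: table[prefix]
```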
lr_annealing_whisper: !new:speechbrain.nnet.schedulers.NoamScheduler
    lr_initial: !ref <lr_whisper>
    n_warmup_steps: !ref <warmup_steps>
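`NoamScheduler` warms the learning rate up for `warmup_steps` optimizer steps and then decays it. Assuming the standard Noam formulation, normalized so the peak equals `lr_whisper` exactly at the end of warmup (a sketch of the shape, not SpeechBrain's code):

```python
def noam_lr(step: int, lr_initial: float, n_warmup_steps: int) -> float:
    """Noam schedule: linear warmup to lr_initial, then 1/sqrt(step) decay."""
    return lr_initial * n_warmup_steps ** 0.5 * min(
        step ** -0.5, step * n_warmup_steps ** -1.5
    )

for s in (1, 250, 500, 2000):           # warmup_steps: 500, lr_whisper: 1e-5
    print(s, noam_lr(s, 1e-5, 500))
```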
############################## Logging and Pretrainer ##########################

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        whisper: !ref <whisper>
        scheduler_whisper: !ref <lr_annealing_whisper>
        counter: !ref <epoch_counter>

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>

error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats

cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
    split_tokens: True