Add Myst children speech recipe #2997
@@ -0,0 +1,91 @@
# MyST ASR with Transformer or Whisper models

This folder contains scripts to train a Transformer-based speech recognizer from scratch, or to fine-tune a Whisper model, on the My Science Tutor (MyST) corpus.
MyST is one of the largest publicly accessible collections of English children's speech, comprising approximately 400 hours. It encompasses dialogues between
children and a virtual tutor across eight scientific domains, involving 1,372 students in grades three to five. The corpus is pre-partitioned, ensuring equitable
representation of scientific domains and unique student occurrences within each partition. However, only 45% of the utterances are transcribed at the word level.

You can find the MyST dataset at https://catalog.ldc.upenn.edu/LDC2021S05
# How to run

```shell
python train_with_whisper.py hparams/train_hf_whisper.yaml    # Fine-tune Whisper
python train_with_whisper.py hparams/train_whisper_lora.yaml  # Fine-tune Whisper with LoRA
python train.py hparams/transformer.yaml                      # Train a Transformer model from scratch
```
# How to run on test sets only

If you want to run on the test sets only, add the `--test_only` flag to any of the commands above:

```shell
python train_with_whisper.py hparams/train_hf_whisper.yaml --test_only
python train_with_whisper.py hparams/train_whisper_lora.yaml --test_only
python train.py hparams/transformer.yaml --test_only
```
**If using a HuggingFace pre-trained model, please make sure you have "transformers" installed in your environment (see extra-requirements.txt).**
# Note about data preparation

In accordance with the methodology presented in [1], we offer an optional WER filtering mechanism. It discards every utterance whose WER against a pre-trained Whisper hypothesis exceeds a specified threshold, which lengthens data preparation because every file must first be decoded with that model. We highly recommend running the data preparation process only once and saving the resulting CSV files for future use.

Note that this data filtering takes a couple of hours to run.

[1] A. A. Attia et al., "Kid-Whisper: Towards bridging the performance gap in automatic speech recognition for children vs. adults," in *Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society*, vol. 7, 2024, pp. 74–80.
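As an illustration of the filtering described above, here is a minimal sketch (not the recipe's actual implementation): `transcribe` is a hypothetical stand-in for decoding a wav file with the pre-trained Whisper model, and the WER is the usual word-level edit distance.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference length, in percent."""
    r, h = ref.split(), hyp.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return 100.0 * d[len(r)][len(h)] / max(len(r), 1)


def filter_utterances(utterances, transcribe, wer_threshold=50.0):
    """Keep only utterances whose Whisper hypothesis stays under the WER threshold."""
    return [u for u in utterances
            if wer(u["transcript"], transcribe(u["wav"])) <= wer_threshold]
```

Because `transcribe` runs a full model decode per file, this is the step that dominates preparation time, hence the advice to cache the resulting CSVs.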
# Results

## Whisper Fine-tuning Results

The following table reports results for one epoch of Whisper fine-tuning under different configurations.
| Release | Model | Configuration | Hyperparams file | LM | WER | Model link |
|---|---|---|---|---|---|---|
| 2025-11-13 | large-v3 | Decoder | train_hf_whisper.yaml | No | 8.36% | [Save](https://cloud.inesc-id.pt/s/eknR4y73RHKSB7F) |
| 2025-11-13 | medium.en | Decoder | train_hf_whisper.yaml | No | 8.50% | [Save](https://cloud.inesc-id.pt/s/oJeyJCM7R2tGmPG) |
| 2025-11-13 | medium.en | Encoder + Decoder | train_hf_whisper.yaml | No | 8.75% | [Save](https://cloud.inesc-id.pt/s/px3KWAditRo7wHH) |
| 2025-11-13 | medium.en | LoRA (r=16) in Decoder | train_whisper_lora.yaml | No | 9.38% | [Save](https://cloud.inesc-id.pt/s/6YrRKPjNpKdMgoW) |
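For context on the LoRA row: LoRA freezes the original weights and learns a low-rank update ΔW = B·A per adapted matrix, so the trainable parameters for a `d_out × d_in` matrix drop from `d_out·d_in` to `r·(d_in + d_out)`. A back-of-the-envelope check (the `d_model = 1024` hidden size assumed here for whisper-medium is illustrative, not taken from the recipe):

```python
def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters of a LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return d_out * r + r * d_in

d_model = 1024                       # assumed hidden size for whisper-medium
full = d_model * d_model             # parameters of one full weight matrix
lora = lora_params(d_model, d_model, r=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.1%}")
# A rank-16 adapter trains ~3% of the parameters of one full matrix.
```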
> **Comment on lines +48 to +51**
>
> **Collaborator:** May I ask how competitive your results are? By the way, thanks for uploading the models. I will transfer them to Dropbox so that we can host them ourselves.
>
> **Author:** SOTA for MyST is approximately 8-9% WER, which varies significantly with the data preparation and filtering method, so direct comparison is challenging. This was my motivation for this PR: to provide a standardised data preparation method that facilitates comparison among works.
## Transformers

| Release | Model | Hyperparams file | LM | WER | Model link |
|---|---|---|---|---|---|
| 2025-11-15 | Transformer | transformer.yaml | LibriSpeech LM | 12.95% | [Save](https://cloud.inesc-id.pt/s/ooG53HSjsTJTZPY) |
# **About SpeechBrain**
- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/
- HuggingFace: https://huggingface.co/speechbrain/
# **Citing SpeechBrain**
Please cite SpeechBrain if you use it for your research or business.

```bibtex
@misc{speechbrainV1,
  title={Open-Source Conversational AI with SpeechBrain 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```
@@ -0,0 +1,166 @@
# ################################
# Model: Whisper (Encoder-Decoder) + NLL
# Augmentation: TimeDomainSpecAugment
# Authors: Adel Moumen 2022 & 2024, Titouan Parcollet 2022, Thomas Rolland 2025
# ################################

# Seed needs to be set at top of yaml, before objects with parameters are made
seed: 1986
__set_seed: !apply:speechbrain.utils.seed_everything [!ref <seed>]
output_folder: !ref results/whisper/<seed>
output_wer_folder: !ref <output_folder>/
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

# HuggingFace hub ID of the largest OpenAI Whisper model.
whisper_hub: openai/whisper-large-v3
whisper_folder: !ref <save_folder>/whisper_checkpoint

# Normalize the English transcripts with
# the same normalization done in the Whisper paper
normalized_transcripts: True
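The `!ref .../<seed>` entries above are HyperPyYAML references resolved at load time. Purely as an illustration of that substitution idea (this is not HyperPyYAML's actual code, just a toy resolver):

```python
import re

def resolve_refs(hparams: dict) -> dict:
    """Repeatedly substitute <name> placeholders with the referenced values."""
    resolved = dict(hparams)
    changed = True
    while changed:
        changed = False
        for key, val in resolved.items():
            if isinstance(val, str) and "<" in val:
                new = re.sub(r"<(\w+)>",
                             lambda m: str(resolved[m.group(1)]), val)
                if new != val:
                    resolved[key] = new
                    changed = True
    return resolved

cfg = {
    "seed": 1986,
    "output_folder": "results/whisper/<seed>",
    "save_folder": "<output_folder>/save",
}
print(resolve_refs(cfg)["save_folder"])  # results/whisper/1986/save
```

This is why `seed` must appear before anything that references it: every later value can be expressed in terms of earlier ones.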
# Data files
data_folder: !PLACEHOLDER # e.g., /path/to/myst/data/
train_splits: ["train"]
dev_splits: ["development"]
test_splits: ["test"]
skip_prep: False
train_csv: !ref <output_folder>/train.csv
valid_csv: !ref <output_folder>/development.csv
test_csv: !ref <output_folder>/test.csv

# Data preparation
enable_wer_filter: True
wer_threshold: 50.0
asr_model: "openai/whisper-large-v3"

ckpt_interval_minutes: 10 # save checkpoint every N min
############################## Training Parameters #############################
freeze_encoder: True
number_of_epochs: 1
weight_decay: 0.01
lr_whisper: 1e-5
warmup_steps: 500
max_grad_norm: 2.0
sorting: ascending
precision: fp16 # bf16, fp16 or fp32
eval_precision: fp16
sampling_rate: 16_000

# With data_parallel batch_size is split into N jobs
# With DDP batch_size is multiplied by N jobs
# This setup works well with 1x 32GB GPU
batch_size: 16
test_batch_size: 4
grad_accumulation_factor: 1
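The batching comments above imply that the effective batch size depends on the parallelism strategy and on gradient accumulation; a quick sketch of the arithmetic (the multi-GPU counts are illustrative, not part of this recipe):

```python
def effective_batch_size(batch_size, grad_accumulation_factor, n_jobs, strategy):
    """Effective optimizer batch size under the two strategies described above."""
    if strategy == "data_parallel":
        # batch_size is split across the N jobs, so the total stays the same
        return batch_size * grad_accumulation_factor
    if strategy == "ddp":
        # batch_size is per process, so the total is multiplied by N jobs
        return batch_size * n_jobs * grad_accumulation_factor
    raise ValueError(f"unknown strategy: {strategy}")

print(effective_batch_size(16, 1, 1, "ddp"))  # 16: the 1x 32GB GPU setup above
print(effective_batch_size(16, 1, 4, "ddp"))  # 64 with 4 GPUs under DDP
```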
# Decoding parameters
min_decode_ratio: 0.0
max_decode_ratio: 1.0
test_beam_size: 8

####################### Model Parameters #######################################

train_loader_kwargs:
    batch_size: !ref <batch_size>

valid_loader_kwargs:
    batch_size: !ref <test_batch_size>

test_loader_kwargs:
    batch_size: !ref <test_batch_size>

epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>
############################## Augmentations ###################################

# UNCOMMENT THIS SECTION TO ADD AUGMENTATIONS
# speed_perturb: !new:speechbrain.augment.time_domain.SpeedPerturb
#     orig_freq: !ref <sampling_rate>
#     speeds: [95, 100, 105]

# # Frequency drop: randomly drops a number of frequency bands to zero.
# drop_freq: !new:speechbrain.augment.time_domain.DropFreq
#     drop_freq_low: 0 # Min frequency band dropout probability
#     drop_freq_high: 1 # Max frequency band dropout probability
#     drop_freq_count_low: 1 # Min number of frequency bands to drop
#     drop_freq_count_high: 3 # Max number of frequency bands to drop
#     drop_freq_width: 0.05 # Width of frequency bands to drop

# # Time drop: randomly drops a number of temporal chunks.
# drop_chunk: !new:speechbrain.augment.time_domain.DropChunk
#     drop_length_low: 1
#     drop_length_high: 5
#     drop_count_low: 1000
#     drop_count_high: 2000

# # Augmenter: Combines previously defined augmentations to perform data augmentation
# wav_augment: !new:speechbrain.augment.augmenter.Augmenter
#     concat_original: True
#     min_augmentations: 3
#     max_augmentations: 3
#     augment_prob: 1.0
#     augmentations: [
#         !ref <speed_perturb>,
#         !ref <drop_freq>,
#         !ref <drop_chunk>]
############################## Models ##########################################

whisper: !new:speechbrain.integrations.huggingface.whisper.Whisper
    source: !ref <whisper_hub>
    freeze_encoder: !ref <freeze_encoder>
    save_path: !ref <whisper_folder>
    language: "english"
    task: "transcribe"
    sampling_rate: !ref <sampling_rate>

log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: True

nll_loss: !name:speechbrain.nnet.losses.nll_loss

modules:
    whisper: !ref <whisper>
############################## Decoding & optimiser ############################

whisper_opt_class: !name:torch.optim.AdamW
    lr: !ref <lr_whisper>
    weight_decay: !ref <weight_decay>

valid_search: !new:speechbrain.decoders.seq2seq.S2SWhisperGreedySearcher
    model: !ref <whisper>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>

test_search: !new:speechbrain.decoders.seq2seq.S2SWhisperBeamSearcher
    module: [!ref <whisper>]
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <test_beam_size>
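Validation uses a greedy searcher while testing uses beam search with `beam_size: 8`. The difference can be illustrated on a toy scoring function (the `step_fn` interface here is hypothetical and unrelated to the SpeechBrain searcher API): greedy commits to the single best token at each step, while a beam keeps several hypotheses and can recover a globally better sequence.

```python
import math

def greedy_search(step_fn, max_steps, bos=0):
    """Greedy decoding: take the single best next token at every step."""
    prefix = (bos,)
    for _ in range(max_steps):
        probs = step_fn(prefix)                     # {token: logprob}
        prefix += (max(probs, key=probs.get),)
    return prefix

def beam_search(step_fn, beam_size, max_steps, bos=0):
    """Keep the beam_size best partial hypotheses; return the best sequence."""
    beams = [((bos,), 0.0)]
    for _ in range(max_steps):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_fn(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(beams, key=lambda c: c[1])

# Toy model where the locally best first token leads to a weak continuation.
table = {
    (0,):    {1: math.log(0.6), 2: math.log(0.4)},
    (0, 1):  {3: math.log(0.3)},   # greedy path: 0.6 * 0.3 = 0.18
    (0, 2):  {3: math.log(0.9)},   # beam finds:  0.4 * 0.9 = 0.36
}
step = lambda prefix: table[prefix]
```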
lr_annealing_whisper: !new:speechbrain.nnet.schedulers.NoamScheduler
    lr_initial: !ref <lr_whisper>
    n_warmup_steps: !ref <warmup_steps>
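`NoamScheduler` warms the learning rate up for `warmup_steps` optimizer steps and then decays it. Assuming the standard Noam formulation, normalized so the peak equals `lr_whisper` exactly at the end of warmup (a sketch of the shape, not SpeechBrain's code):

```python
def noam_lr(step: int, lr_initial: float, n_warmup_steps: int) -> float:
    """Noam schedule: linear warmup to lr_initial, then 1/sqrt(step) decay."""
    return lr_initial * n_warmup_steps ** 0.5 * min(
        step ** -0.5, step * n_warmup_steps ** -1.5
    )

for s in (1, 250, 500, 2000):           # warmup_steps: 500, lr_whisper: 1e-5
    print(s, noam_lr(s, 1e-5, 500))
```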
############################## Logging and Pretrainer ##########################

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        whisper: !ref <whisper>
        scheduler_whisper: !ref <lr_annealing_whisper>
        counter: !ref <epoch_counter>

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>

error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats

cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
    split_tokens: True