
Commit 9eadc9e

add transformers
1 parent 07e2458 commit 9eadc9e

3 files changed

Lines changed: 12 additions & 12 deletions


recipes/RescueSpeech/README.md

Lines changed: 9 additions & 9 deletions
@@ -23,15 +23,15 @@ During training, both speech enhancement and ASR is kept unfrozen- i.e. both ASR
 
 
 ## Fine-tuned models
-1. Firstly, the SepFormer model is trained on the Microsoft-DNS dataset. Subsequently, it undergoes fine-tuning with our RescueSpeech dataset (first row in the table below).
-2. The Whisper ASR is fine-tuned on the RescueSpeech dataset (second row in the table below).
-3. Finally, the fine-tuned SepFormer and Whisper ASR models are jointly fine-tuned using our RescueSpeech dataset. This represents the best model reported in the table above, with its pretrained models and logs accessible in the third row of the table below.
-
-| Model | HuggingFace link | Full Model link |
-|----------------|------------------------------------------------|------------------------------------------------|
-| Whisper ASR | [HuggingFace](https://huggingface.co/speechbrain/whisper_rescuespeech) | [Dropbox](https://www.dropbox.com/sh/45wk44h8e0wkc5f/AABjEJJJ_OJp2fDYz3zEihmPa?dl=0) |
-| Sepformer Enhancement | [HuggingFace](https://huggingface.co/speechbrain/sepformer_rescuespeech) | [Dropbox](https://www.dropbox.com/sh/02c3wesc65402f6/AAApoxBApft-JwqHK-bddedBa?dl=0) |
-| Sepformer + Whisper ASR (fine-tuned) | [HuggingFace](https://huggingface.co/sangeet2020/noisy-whisper-resucespeech) | [Dropbox](https://www.dropbox.com/sh/7tryj6n7cfy0poe/AADpl4b8rGRSnoQ5j6LCj9tua?dl=0) |
+1. Firstly, the SepFormer enhancement model is trained on the Microsoft-DNS dataset. Subsequently, it undergoes fine-tuning with our RescueSpeech *enhancement* dataset (first row in the table below).
+2. The Whisper ASR is fine-tuned on the RescueSpeech *ASR* dataset (second row in the table below).
+3. Finally, the fine-tuned SepFormer and Whisper ASR models are jointly fine-tuned using our RescueSpeech *ASR* dataset. This represents the best model reported in the table above, with its pretrained models and logs accessible in the third row of the table below.
+
+|S. No. | Model | HuggingFace link | Full Model link |
+|---|----------------|------------------------------------------------|------------------------------------------------|
+| 1. | Whisper ASR | [HuggingFace](https://huggingface.co/speechbrain/whisper_rescuespeech) | [Dropbox](https://www.dropbox.com/sh/45wk44h8e0wkc5f/AABjEJJJ_OJp2fDYz3zEihmPa?dl=0) |
+| 2. | Sepformer Enhancement | [HuggingFace](https://huggingface.co/speechbrain/sepformer_rescuespeech) | [Dropbox](https://www.dropbox.com/sh/02c3wesc65402f6/AAApoxBApft-JwqHK-bddedBa?dl=0) |
+| 3. | Sepformer + Whisper ASR (fine-tuned) | [HuggingFace](https://huggingface.co/sangeet2020/noisy-whisper-resucespeech) | [Dropbox](https://www.dropbox.com/sh/7tryj6n7cfy0poe/AADpl4b8rGRSnoQ5j6LCj9tua?dl=0) |
 
 
 # **About SpeechBrain**

recipes/RescueSpeech/dataset.md

Lines changed: 2 additions & 2 deletions
@@ -4,9 +4,9 @@ We are thrilled to introduce our latest release - the **RescueSpeech** audio dat
 
 The RescueSpeech dataset is divided into two sets, each designed for different tasks: Automatic Speech Recognition (ASR) and Speech Enhancement.
 
-1. For the ASR task, the dataset spans a duration of 1 hour and 36 minutes. It comprises a collection of clean-noisy pairs, where the noisy utterances are created by introducing contaminations from five different noise types sourced from the AudioSet dataset. These noise types include emergency vehicle siren, breathing, engine, chopper, and static radio noise. To match the 2412 clean utterances in the dataset, we have synthesized an equal number of corresponding noisy utterances. Additionally, we have provided the noise waveform files used to create the noisy utterances, ensuring transparency and reproducibility in the research community.
+1. `Task_ASR.tar.gz`: For the ASR task, the dataset spans a duration of 1 hour and 36 minutes. It comprises a collection of clean-noisy pairs, where the noisy utterances are created by introducing contaminations from five different noise types sourced from the AudioSet dataset. These noise types include emergency vehicle siren, breathing, engine, chopper, and static radio noise. To match the 2412 clean utterances in the dataset, we have synthesized an equal number of corresponding noisy utterances. Additionally, we have provided the noise waveform files used to create the noisy utterances, ensuring transparency and reproducibility in the research community.
 
-2. The Speech Enhancement task dataset is larger in size compared to the ASR dataset. The primary objective of this dataset is to facilitate the fine-tuning of speech enhancement models, particularly for the five SAR noise types mentioned earlier: emergency vehicle siren, breathing, engine, chopper, and static radio noise. Given the limited duration of clean audio available (1 hour and 36 minutes), we have synthesized multiple noisy utterances with varying noise types and signal-to-noise ratio (SNR) levels, all derived from a single clean utterance. This augmentation approach allows us to generate a more extensive dataset for speech enhancement purposes while preserving the original speaker distribution.
+2. `Task_enhancement.tar.gz`: The Speech Enhancement task dataset is larger in size compared to the ASR dataset. The primary objective of this dataset is to facilitate the fine-tuning of speech enhancement models, particularly for the five SAR noise types mentioned earlier: emergency vehicle siren, breathing, engine, chopper, and static radio noise. Given the limited duration of clean audio available (1 hour and 36 minutes), we have synthesized multiple noisy utterances with varying noise types and signal-to-noise ratio (SNR) levels, all derived from a single clean utterance. This augmentation approach allows us to generate a more extensive dataset for speech enhancement purposes while preserving the original speaker distribution.
 
 By providing these diverse datasets, we aim to support advancements in ASR and Speech Enhancement research, enabling the development and evaluation of robust systems that can handle real-world scenarios encountered during search and rescue operations.
 
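
The `dataset.md` text in this diff describes deriving several noisy utterances from one clean utterance by varying the noise type and the SNR. The sketch below illustrates that idea only; it is not the script used to build RescueSpeech. The function names `mix_at_snr` and `augment`, the SNR grid, and the synthetic signals are all assumptions made for illustration.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the requested SNR in dB."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Loop the noise if it is shorter than the speech, then trim to equal length.
    if len(noise) < len(clean):
        repeats = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    if clean_power == 0.0 or noise_power == 0.0:
        raise ValueError("clean and noise signals must both be non-silent")

    # SNR_dB = 10 * log10(P_clean / (scale^2 * P_noise)), solved for scale.
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise


def augment(clean, noises, snrs_db):
    """Build one noisy copy of `clean` per (noise type, SNR) combination."""
    return {
        (name, snr): mix_at_snr(clean, noise, snr)
        for name, noise in noises.items()
        for snr in snrs_db
    }


# Synthetic stand-ins for real waveforms (hypothetical data, not from the dataset).
rng = np.random.default_rng(0)
clean_utt = rng.standard_normal(16000)
noise_bank = {
    "siren": rng.standard_normal(4000),
    "engine": rng.standard_normal(20000),
}
# The SNR grid is an assumption; the source does not list the levels used.
noisy_set = augment(clean_utt, noise_bank, snrs_db=[-5.0, 0.0, 5.0])
```

Under these assumptions, one clean utterance yields one noisy variant per noise type and SNR level, which is how a small amount of clean speech can be expanded into a larger enhancement set without changing the speaker distribution.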

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-librosa
 mir_eval
 pesq
 pystoi
+transformers
