add transformers

sangeet2020 · sangeet2020 · commit 9eadc9e5984a · 2023-07-03T00:58:08.000+02:00
diff --git a/recipes/RescueSpeech/README.md b/recipes/RescueSpeech/README.md
@@ -23,15 +23,15 @@ During training, both speech enhancement and ASR is kept unfrozen- i.e. both ASR
 
 
 ## Fine-tuned models
-1. Firstly, the SepFormer model is trained on the Microsoft-DNS dataset. Subsequently, it undergoes fine-tuning with our RescueSpeech dataset (first row in the table below).
-2. The Whisper ASR is fine-tuned on the RescueSpeech dataset (second row in the table below).
-3. Finally, the fine-tuned SepFormer and Whisper ASR models are jointly fine-tuned using our RescueSpeech dataset. This represents the best model reported in the table above, with its pretrained models and logs accessible in the third row of the table below.
-
-|  Model        | HuggingFace link                               | Full Model link                                |
-|----------------|------------------------------------------------|------------------------------------------------|
-| Whisper ASR    | [HuggingFace](https://huggingface.co/speechbrain/whisper_rescuespeech)             | [Dropbox](https://www.dropbox.com/sh/45wk44h8e0wkc5f/AABjEJJJ_OJp2fDYz3zEihmPa?dl=0)             |
-| Sepformer Enhancement   | [HuggingFace](https://huggingface.co/speechbrain/sepformer_rescuespeech)            | [Dropbox](https://www.dropbox.com/sh/02c3wesc65402f6/AAApoxBApft-JwqHK-bddedBa?dl=0)            |
-| Sepformer +  Whisper ASR  (fine-tuned)  |  [HuggingFace](https://huggingface.co/sangeet2020/noisy-whisper-resucespeech)            | [Dropbox](https://www.dropbox.com/sh/7tryj6n7cfy0poe/AADpl4b8rGRSnoQ5j6LCj9tua?dl=0)            |
+1. First, the SepFormer enhancement model is trained on the Microsoft-DNS dataset. It is then fine-tuned on our RescueSpeech *enhancement* dataset (row 2 in the table below).
+2. The Whisper ASR model is fine-tuned on the RescueSpeech *ASR* dataset (row 1 in the table below).
+3. Finally, the fine-tuned SepFormer and Whisper ASR models are jointly fine-tuned using our RescueSpeech *ASR* dataset. This represents the best model reported in the table above, with its pretrained models and logs accessible in the third row of the table below.
+
+|S. No. |  Model        | HuggingFace link                               | Full Model link                                |
+|---|----------------|------------------------------------------------|------------------------------------------------|
+| 1. | Whisper ASR    | [HuggingFace](https://huggingface.co/speechbrain/whisper_rescuespeech)             | [Dropbox](https://www.dropbox.com/sh/45wk44h8e0wkc5f/AABjEJJJ_OJp2fDYz3zEihmPa?dl=0)             |
+| 2. | Sepformer Enhancement   | [HuggingFace](https://huggingface.co/speechbrain/sepformer_rescuespeech)            | [Dropbox](https://www.dropbox.com/sh/02c3wesc65402f6/AAApoxBApft-JwqHK-bddedBa?dl=0)            |
+| 3. | Sepformer +  Whisper ASR  (fine-tuned)  |  [HuggingFace](https://huggingface.co/sangeet2020/noisy-whisper-resucespeech)            | [Dropbox](https://www.dropbox.com/sh/7tryj6n7cfy0poe/AADpl4b8rGRSnoQ5j6LCj9tua?dl=0)            |
 
 
 # **About SpeechBrain**
diff --git a/recipes/RescueSpeech/dataset.md b/recipes/RescueSpeech/dataset.md
@@ -4,9 +4,9 @@ We are thrilled to introduce our latest release - the **RescueSpeech** audio dat
 
 The RescueSpeech dataset is divided into two sets, each designed for different tasks: Automatic Speech Recognition (ASR) and Speech Enhancement.
 
-1. For the ASR task, the dataset spans a duration of 1 hour and 36 minutes. It comprises a collection of clean-noisy pairs, where the noisy utterances are created by introducing contaminations from five different noise types sourced from the AudioSet dataset. These noise types include emergency vehicle siren, breathing, engine, chopper, and static radio noise. To match the 2412 clean utterances in the dataset, we have synthesized an equal number of corresponding noisy utterances. Additionally, we have provided the noise waveform files used to create the noisy utterances, ensuring transparency and reproducibility in the research community.
+1. `Task_ASR.tar.gz`: For the ASR task, the dataset spans a duration of 1 hour and 36 minutes. It comprises a collection of clean-noisy pairs, where the noisy utterances are created by introducing contaminations from five different noise types sourced from the AudioSet dataset. These noise types include emergency vehicle siren, breathing, engine, chopper, and static radio noise. To match the 2412 clean utterances in the dataset, we have synthesized an equal number of corresponding noisy utterances. Additionally, we have provided the noise waveform files used to create the noisy utterances, ensuring transparency and reproducibility in the research community.
 
-2. The Speech Enhancement task dataset is larger in size compared to the ASR dataset. The primary objective of this dataset is to facilitate the fine-tuning of speech enhancement models, particularly for the five SAR noise types mentioned earlier: emergency vehicle siren, breathing, engine, chopper, and static radio noise. Given the limited duration of clean audio available (1 hour and 36 minutes), we have synthesized multiple noisy utterances with varying noise types and signal-to-noise ratio (SNR) levels, all derived from a single clean utterance. This augmentation approach allows us to generate a more extensive dataset for speech enhancement purposes while preserving the original speaker distribution.
+2. `Task_enhancement.tar.gz`: The Speech Enhancement task dataset is larger in size compared to the ASR dataset. The primary objective of this dataset is to facilitate the fine-tuning of speech enhancement models, particularly for the five SAR noise types mentioned earlier: emergency vehicle siren, breathing, engine, chopper, and static radio noise. Given the limited duration of clean audio available (1 hour and 36 minutes), we have synthesized multiple noisy utterances with varying noise types and signal-to-noise ratio (SNR) levels, all derived from a single clean utterance. This augmentation approach allows us to generate a more extensive dataset for speech enhancement purposes while preserving the original speaker distribution.
 
 By providing these diverse datasets, we aim to support advancements in ASR and Speech Enhancement research, enabling the development and evaluation of robust systems that can handle real-world scenarios encountered during search and rescue operations.
 
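Note on the archive names added above: a minimal sketch of unpacking one of the two archives with Python's standard `tarfile` module. Only the file names `Task_ASR.tar.gz` and `Task_enhancement.tar.gz` come from the diff; the helper name `extract_archive`, the output directory, and the archive's internal layout are assumptions made for illustration.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, out_dir):
    """Unpack a .tar.gz archive into out_dir and return its member names.

    Hypothetical helper: the internal layout of the RescueSpeech archives
    is not described in the diff, so nothing is assumed about it here.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Only extract archives obtained from a trusted source.
        tar.extractall(out_dir)
    return names


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the archive was downloaded.
    archive = Path("Task_ASR.tar.gz")
    if archive.exists():
        members = extract_archive(archive, "data/Task_ASR")
        print(f"Extracted {len(members)} entries")
    else:
        print(f"{archive} not found")
```

The same call should work for `Task_enhancement.tar.gz` by changing the two paths.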
diff --git a/recipes/RescueSpeech/extra_requirements.txt b/recipes/RescueSpeech/extra_requirements.txt
@@ -1,4 +1,4 @@
-librosa
 mir_eval
 pesq
 pystoi
+transformers
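
Note on the requirements change: after this commit the recipe's extra requirements are `mir_eval`, `pesq`, `pystoi`, and `transformers`, with `librosa` removed. A minimal sketch for checking that they are importable in the current environment follows. The helper name `missing_packages` is hypothetical, and the check assumes each pip package name matches its import name, which holds for these four.

```python
from importlib.util import find_spec

# Packages listed in extra_requirements.txt after this commit.
# Assumption: each pip name is also the top-level import name.
RECIPE_REQUIREMENTS = ["mir_eval", "pesq", "pystoi", "transformers"]


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(RECIPE_REQUIREMENTS)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All recipe requirements are importable.")
```

Anything reported as missing can be installed with `pip install -r extra_requirements.txt` from the recipe directory.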
