
Commit ead6ab3

Merge branch 'develop' into develop
2 parents 16fd082 + 0e8b81e commit ead6ab3

445 files changed

Lines changed: 24113 additions & 2300 deletions


.github/workflows/pythonapp.yml
Lines changed: 1 addition & 1 deletion

@@ -14,7 +14,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: [3.7, 3.8, 3.9]
+        python-version: [3.9, "3.10", 3.11]
     steps:
     - uses: actions/checkout@v2
     - name: Set up Python ${{ matrix.python-version }}

.gitignore
Lines changed: 2 additions & 1 deletion

@@ -52,6 +52,7 @@ coverage.xml
 .pytest_cache/
 cover/
 tests/tmp/
+tests/download/

 # Translations
 *.mo
@@ -157,4 +158,4 @@ dmypy.json
 **/log/

 # Mac OS
-.DS_Store
\ No newline at end of file
+.DS_Store

README.md
Lines changed: 10 additions & 9 deletions

@@ -72,7 +72,7 @@ SpeechBrain provides different models for speaker recognition, identification, a
 - Libraries to extract speaker embeddings with a pre-trained model on your data.

 ### Text-to-Speech (TTS) and Vocoders
-- Recipes for training TTS systems such as [Tacotron2](https://github.com/speechbrain/speechbrain/tree/develop/recipes/LJSpeech) with LJSpeech.
+- Recipes for training TTS systems such as [Tacotron2](https://github.com/speechbrain/speechbrain/tree/develop/recipes/LJSpeech/) and [FastSpeech2](https://github.com/speechbrain/speechbrain/tree/develop/recipes/LJSpeech/) with LJSpeech.
 - Recipes for training Vocoders such as [HiFIGAN](https://github.com/speechbrain/speechbrain/tree/develop/recipes/LJSpeech).

 ### Grapheme-to-Phoneme (G2P)
@@ -95,28 +95,29 @@ Combining multiple microphones is a powerful approach to achieving robustness in
 - Speaker localization.

 ### Emotion Recognition
-- Recipes for emotion recognition using SSL and ECAPA-TDNN models.
+- Recipes for emotion recognition using SSL and ECAPA-TDNN models on the [IEMOCAP](https://sail.usc.edu/iemocap/iemocap_release.htm) dataset.
+- Recipe for emotion diarization using SSL models on the [ZaionEmotionDataset](https://zaion.ai/en/resources/zaion-lab-blog/zaion-emotion-dataset/).

 ### Interpretability
 - Recipes for various interpretability techniques on the ESC50 dataset.

 ### Spoken Language Understanding
-- Recipes for training wav2vec 2.0 models with the [MEDIA](https://catalogue.elra.info/en-us/repository/browse/ELRA-E0024/) dataset.
+- Recipes for training wav2vec 2.0 models on the [SLURP](https://zenodo.org/record/4274930#.YEFCYHVKg5k), [MEDIA](https://catalogue.elra.info/en-us/repository/browse/ELRA-E0024/), and [timers-and-such](https://zenodo.org/record/4623772#.YGeMMHVKg5k) datasets.

 ### Performance
 The recipes released with SpeechBrain implement speech processing systems with competitive or state-of-the-art performance. Below, we report the best performance achieved on some popular benchmarks:

 | Dataset | Task | System | Performance |
 | ------------- |:-------------:| -----:|-----:|
 | LibriSpeech | Speech Recognition | wav2vec2 | WER=1.90% (test-clean) |
-| LibriSpeech | Speech Recognition | CNN + Conformer | WER=2.2% (test-clean) |
+| LibriSpeech | Speech Recognition | CNN + Conformer | WER=2.0% (test-clean) |
 | TIMIT | Speech Recognition | CRDNN + distillation | PER=13.1% (test) |
 | TIMIT | Speech Recognition | wav2vec2 + CTC/Att. | PER=8.04% (test) |
 | CommonVoice (English) | Speech Recognition | wav2vec2 + CTC | WER=15.69% (test) |
 | CommonVoice (French) | Speech Recognition | wav2vec2 + CTC | WER=9.96% (test) |
 | CommonVoice (Italian) | Speech Recognition | wav2vec2 + seq2seq | WER=9.86% (test) |
 | CommonVoice (Kinyarwanda) | Speech Recognition | wav2vec2 + seq2seq | WER=18.91% (test) |
-| AISHELL (Mandarin) | Speech Recognition | wav2vec2 + seq2seq | CER=5.58% (test) |
+| AISHELL (Mandarin) | Speech Recognition | wav2vec2 + CTC | CER=5.06% (test) |
 | Fisher-callhome (spanish) | Speech translation | conformer (ST + ASR) | BLEU=48.04 (test) |
 | VoxCeleb2 | Speaker Verification | ECAPA-TDNN | EER=0.80% (vox1-test) |
 | AMI | Speaker Diarization | ECAPA-TDNN | DER=3.01% (eval) |
@@ -128,10 +129,10 @@ The recipes released with speechbrain implement speech processing systems with c
 | Libri2Mix | Speech Separation | SepFormer | SDRi=20.6 dB (test-clean) |
 | Libri3Mix | Speech Separation | SepFormer | SDRi=18.7 dB (test-clean) |
 | LibryParty | Voice Activity Detection | CRDNN | F-score=0.9477 (test) |
-| IEMOCAP | Emotion Recognition | wav2vec | Accuracy=79.8% (test) |
+| IEMOCAP | Emotion Recognition | wav2vec2 | Accuracy=79.8% (test) |
 | CommonLanguage | Language Recognition | ECAPA-TDNN | Accuracy=84.9% (test) |
 | Timers and Such | Spoken Language Understanding | CRDNN | Intent Accuracy=89.2% (test) |
-| SLURP | Spoken Language Understanding | CRDNN | Intent Accuracy=87.54% (test) |
+| SLURP | Spoken Language Understanding | HuBERT | Intent Accuracy=87.54% (test) |
 | VoxLingua 107 | Identification | ECAPA-TDNN | Sentence Accuracy=93.3% (test) |

 For more details, take a look at the corresponding implementation in recipes/dataset/.
@@ -148,7 +149,7 @@ Beyond providing recipes for training the models from scratch, SpeechBrain share
 | Speech Recognition | CommonVoice(French) | [wav2vec + CTC](https://huggingface.co/speechbrain/asr-crdnn-commonvoice-fr) |
 | Speech Recognition | CommonVoice(Italian) | [wav2vec + CTC](https://huggingface.co/speechbrain/asr-wav2vec2-commonvoice-it) |
 | Speech Recognition | CommonVoice(Kinyarwanda) | [wav2vec + CTC](https://huggingface.co/speechbrain/asr-wav2vec2-commonvoice-rw) |
-| Speech Recognition | AISHELL(Mandarin) | [wav2vec + CTC](https://huggingface.co/speechbrain/asr-wav2vec2-transformer-aishell) |
+| Speech Recognition | AISHELL(Mandarin) | [wav2vec + seq2seq](https://huggingface.co/speechbrain/asr-wav2vec2-transformer-aishell) |
 | Text-to-Speech | LJSpeech | [Tacotron2](https://huggingface.co/speechbrain/tts-tacotron2-ljspeech) |
 | Speaker Recognition | Voxceleb | [ECAPA-TDNN](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb) |
 | Speech Separation | WHAMR! | [SepFormer](https://huggingface.co/speechbrain/sepformer-whamr) |
@@ -210,7 +211,7 @@ We are currently implementing speech synthesis pipelines and real-time speech pr

 # Conference Tutorials
 SpeechBrain has been presented at Interspeech 2021 and 2022 as well as ASRU 2021. When possible, we will provide some resources here:
-- [Interspeech 2022 slides.](https://drive.google.com/drive/folders/1d6GAquxw6rZBI-7JvfUQ_-upeiKstJEo?usp=sharing)
+- [Interspeech 2022 slides.](https://drive.google.com/drive/folders/1d6GAquxw6rZBI-7JvfUQ_-upeiKstJEo)
 - [Interspeech 2021 YouTube recordings.](https://www.youtube.com/results?search_query=Interspeech+speechbrain+)

 # Quick installation

SECURITY.md
Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
+# Security Policy
+
+## Supported Versions
+
+Since SpeechBrain is a beta-release, research-oriented toolkit, it aims to support the latest major version (at the x.y level, e.g. 0.5 until 0.6 is released) with security updates, but it unfortunately cannot promise long-term security updates for older versions.
+
+## Reporting a Vulnerability
+
+Vulnerabilities may be reported confidentially to speechbrainproject@gmail.com.
+
docs/compilation.md
Lines changed: 53 additions & 0 deletions

@@ -0,0 +1,53 @@
+# Compilation
+
+Compiling your models in SpeechBrain can potentially improve their speed and reduce memory demand. SpeechBrain inherits the compilation methods supported by PyTorch, including the just-in-time compiler (JIT) and the `torch.compile` function introduced in PyTorch 2.0.
+
+## Compile with `torch.compile`
+The `torch.compile` feature was introduced in PyTorch 2.0 to gradually replace JIT. Although this feature is valuable, it is still in the beta phase, and improvements are ongoing. Please have a look at the [PyTorch documentation](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for more information.
+
+### How to use `torch.compile`
+Compiling all modules in SpeechBrain is straightforward. You can enable compilation with the `--compile` flag on the command line when running a training recipe. For example:
+
+```bash
+python train.py train.yaml --data_folder=your/data/folder --compile
+```
+
+This will automatically compile all the modules declared in the YAML file under the `modules` section.
+
+Note that you might need to configure additional compilation flags correctly (e.g., `--compile_mode`, `--compile_using_fullgraph`, `--compile_using_dynamic_shape_tracing`) to ensure successful compilation or to achieve the best performance. For a deeper understanding of their roles, refer to the [PyTorch documentation](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).
+
+In some cases, you may want to compile only specific modules. To achieve this, add a list of the module keys you want to compile in the YAML file using `compile_module_keys`. For instance:
+
+```yaml
+compile_module_keys: [encoder, decoder]
+```
+
+This will compile only the encoder and decoder models, which must be declared in the YAML file under those keys.
+
+Remember to call the training script with the `--compile` flag.
+
+**Note of caution**: Compiling a model can be a complex process and may take some time. Additionally, it may fail in certain cases. The speed-up achieved through compilation is highly dependent on the system and GPU being used. For example, higher-end GPUs like the A100 tend to yield better speed-ups, while you may not observe significant improvements with V100 GPUs. We support this feature with the hope that `torch.compile` will keep improving over time.
+
+## Compile with JIT
+JIT was the first compilation method supported by PyTorch. Note that JIT is expected to be replaced by `torch.compile`. Please have a look at the [PyTorch documentation](https://pytorch.org/docs/stable/jit.html) for more information.
+
+### How to use JIT
+To compile all modules in SpeechBrain using JIT, use the `--jit` flag on the command line when running a training recipe:
+
+```bash
+python train.py train.yaml --data_folder=your/data/folder --jit
+```
+
+This will automatically compile all the modules declared in the YAML file under the `modules` section.
+
+If you only want to compile specific modules, add a list of the module keys you want to compile in the YAML file using `jit_module_keys`. For example:
+
+```yaml
+jit_module_keys: [encoder, decoder]
+```
+This will compile only the encoder and decoder models, provided they are declared in the YAML file under the specified keys.
+
+Remember to call the training script with the `--jit` flag.
+
+**Note of caution**: JIT has specific requirements for supported syntax, and many popular Python constructs are not supported. Therefore, when designing a model with JIT in mind, ensure that it meets the necessary syntax requirements for successful compilation. Additionally, the speed-up achieved through JIT compilation varies depending on the model type. We found it most beneficial for custom RNNs, such as the Li-GRU used in SpeechBrain's TIMIT/ASR/CTC recipe. Custom RNNs often require "for loops", which can be slow in Python; compilation with JIT provides a significant speed-up in such cases.
+
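The two flags described in docs/compilation.md boil down to wrapping selected modules with `torch.compile` (or `torch.jit.script`). The sketch below is illustrative only, not SpeechBrain's internal implementation; the `encoder`/`decoder` modules are hypothetical, and `backend="eager"` is chosen purely to keep the example portable (in practice the default inductor backend is what delivers speed-ups).

```python
import torch
import torch.nn as nn

# Hypothetical modules standing in for entries under the `modules`
# section of a recipe's YAML file.
modules = {
    "encoder": nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 128)),
    "decoder": nn.Linear(128, 30),
}

# Roughly what `--compile` with `compile_module_keys: [encoder, decoder]`
# amounts to: wrap each selected module with torch.compile (PyTorch >= 2.0).
# backend="eager" skips inductor code generation so this sketch runs anywhere.
compile_module_keys = ["encoder", "decoder"]
if hasattr(torch, "compile"):
    for key in compile_module_keys:
        modules[key] = torch.compile(modules[key], backend="eager")

x = torch.randn(4, 80)
out = modules["decoder"](modules["encoder"](x))
print(out.shape)  # torch.Size([4, 30])
```

Compilation happens lazily on the first forward pass, which is why the first training batch after enabling `--compile` is typically much slower than the rest.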
docs/experiment.md
Lines changed: 10 additions & 2 deletions

@@ -15,6 +15,14 @@ The YAML syntax offers an elegant way to specify the hyperparameters of a recipe
 In SpeechBrain, the YAML file is not a plain list of parameters, but for each parameter, we specify the function (or class) that is using it.
 This not only makes the specification of the parameters more transparent but also allows us to properly initialize all the entries by simply calling `load_extended_yaml` (in `speechbrain.utils.data_utils`).

+### Security note
+Loading HyperPyYAML allows arbitrary code execution.
+This is a feature: HyperPyYAML allows you to construct *anything* and *everything*
+you need in your experiment.
+However, take care to verify any untrusted recipes' YAML files just as you would verify the Python code.
+
+### Features
+
 Let's now take a quick look at the extended YAML features, using an example:

 ```
@@ -38,7 +46,7 @@ model: !new:speechbrain.lobes.models.CRDNN.CRDNN
 every user either by editing the yaml, or with an override (passed to
 `load_extended_yaml`).

-For more details on YAML and our extensions, please see our dedicated [tutorial](https://colab.research.google.com/drive/1Pg9by4b6-8QD2iC0U7Ic3Vxq4GEwEdDz?usp=sharing).
+For more details on YAML and our extensions, please see our dedicated [tutorial](https://colab.research.google.com/drive/1Pg9by4b6-8QD2iC0U7Ic3Vxq4GEwEdDz).

 ## Running arguments
 SpeechBrain defines a set of running arguments that can be set from the command line args (or within the YAML file).
@@ -50,7 +58,7 @@ SpeechBrain defines a set of running arguments that can be set from the command
 - `distributed_backend`: default "nccl", options: `["nccl", "gloo", "mpi"]`, this backend will be used as a DDP communication protocol. See PyTorch documentation for more details.
 - Additional runtime arguments are documented in the Brain class.

-Please note that we provide a dedicated [tutorial](https://colab.research.google.com/drive/13pBUacPiotw1IvyffvGZ-HrtBr9T6l15?usp=sharing) to document the different multi-gpu training strategies:
+Please note that we provide a dedicated [tutorial](https://colab.research.google.com/drive/13pBUacPiotw1IvyffvGZ-HrtBr9T6l15) to document the different multi-gpu training strategies:

 You can also override parameters in YAML in this way:
lint-requirements.txt
Lines changed: 1 addition & 1 deletion

@@ -2,5 +2,5 @@ black==19.10b0
 click==8.0.4
 flake8==3.7.9
 pycodestyle==2.5.0
-pytest==5.4.1
+pytest==7.4.0
 yamllint==1.23.0

recipes/AISHELL-1/ASR/CTC/README.md
Lines changed: 1 addition & 1 deletion

@@ -24,7 +24,7 @@ Results are reported in terms of Character Error Rate (CER).
 |:--------------------------:|:-----:| :-----:| :-----:| :-----: |
 | train_with_wav2vec.yaml | No | 5.06 | 4.52 | 1xRTX 8000 Ti 48GB |

-You can checkout our results (models, training logs, etc,) [here](https://drive.google.com/drive/folders/1GTB5IzQPl57j-0I1IpmvKg722Ti4ahLz?usp=sharing)
+You can check out our results (models, training logs, etc.) [here](https://www.dropbox.com/sh/e4bth1bylk7c6h8/AADFq3cWzBBKxuDv09qjvUMta?dl=0)

 # Training Time
 It takes about 2 hours on 1 RTX 8000 (48GB).

recipes/AISHELL-1/ASR/CTC/extra_requirements.txt

Lines changed: 0 additions & 2 deletions
This file was deleted.

recipes/AISHELL-1/ASR/seq2seq/README.md
Lines changed: 1 addition & 1 deletion

@@ -30,7 +30,7 @@ Results are reported in terms of Character Error Rate (CER). It is not clear fro
 | Base (keep spaces) | 7.51 |

 You can check out our results (models, training logs, etc.) here:
-https://drive.google.com/drive/folders/1zlTBib0XEwWeyhaXDXnkqtPsIBI18Uzs?usp=sharing
+https://www.dropbox.com/sh/kefuzzf6jaljqbr/AADBRWRzHz74GCMDqJY9BES4a?dl=0

 # Training Time
 It takes about 1 hour 30 minutes on an NVIDIA V100 (32GB).
