
Commit ead6ab3

Merge branch 'develop' into develop
2 parents 16fd082 + 0e8b81e commit ead6ab3

445 files changed

Lines changed: 24113 additions & 2300 deletions


.github/workflows/pythonapp.yml
Lines changed: 1 addition & 1 deletion

@@ -14,7 +14,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: [3.7, 3.8, 3.9]
+        python-version: [3.9, "3.10", 3.11]
     steps:
     - uses: actions/checkout@v2
     - name: Set up Python ${{ matrix.python-version }}

.gitignore
Lines changed: 2 additions & 1 deletion

@@ -52,6 +52,7 @@ coverage.xml
 .pytest_cache/
 cover/
 tests/tmp/
+tests/download/

 # Translations
 *.mo
@@ -157,4 +158,4 @@ dmypy.json
 **/log/

 # Mac OS
-.DS_Store
\ No newline at end of file
+.DS_Store

README.md
Lines changed: 10 additions & 9 deletions

@@ -72,7 +72,7 @@ SpeechBrain provides different models for speaker recognition, identification, a
 - Libraries to extract speaker embeddings with a pre-trained model on your data.

 ### Text-to-Speech (TTS) and Vocoders
-- Recipes for training TTS systems such as [Tacotron2](https://github.com/speechbrain/speechbrain/tree/develop/recipes/LJSpeech) with LJSpeech.
+- Recipes for training TTS systems such as [Tacotron2](https://github.com/speechbrain/speechbrain/tree/develop/recipes/LJSpeech/) and [FastSpeech2](https://github.com/speechbrain/speechbrain/tree/develop/recipes/LJSpeech/) with LJSpeech.
 - Recipes for training Vocoders such as [HiFIGAN](https://github.com/speechbrain/speechbrain/tree/develop/recipes/LJSpeech).

 ### Grapheme-to-Phoneme (G2P)
@@ -95,28 +95,29 @@ Combining multiple microphones is a powerful approach to achieving robustness in
 - Speaker localization.

 ### Emotion Recognition
-- Recipes for emotion recognition using SSL and ECAPA-TDNN models.
+- Recipes for emotion recognition using SSL and ECAPA-TDNN models on the [IEMOCAP](https://sail.usc.edu/iemocap/iemocap_release.htm) dataset.
+- Recipe for emotion diarization using SSL models on the [ZaionEmotionDataset](https://zaion.ai/en/resources/zaion-lab-blog/zaion-emotion-dataset/).

 ### Interpretability
 - Recipes for various interpretability techniques on the ESC50 dataset.

 ### Spoken Language Understanding
-- Recipes for training wav2vec 2.0 models with the [MEDIA](https://catalogue.elra.info/en-us/repository/browse/ELRA-E0024/) dataset.
+- Recipes for training wav2vec 2.0 models on the [SLURP](https://zenodo.org/record/4274930#.YEFCYHVKg5k), [MEDIA](https://catalogue.elra.info/en-us/repository/browse/ELRA-E0024/), and [timers-and-such](https://zenodo.org/record/4623772#.YGeMMHVKg5k) datasets.

 ### Performance
 The recipes released with SpeechBrain implement speech processing systems with competitive or state-of-the-art performance. Below, we report the best performance achieved on some popular benchmarks:

 | Dataset | Task | System | Performance |
 | ------------- |:-------------:| -----:|-----:|
 | LibriSpeech | Speech Recognition | wav2vec2 | WER=1.90% (test-clean) |
-| LibriSpeech | Speech Recognition | CNN + Conformer | WER=2.2% (test-clean) |
+| LibriSpeech | Speech Recognition | CNN + Conformer | WER=2.0% (test-clean) |
 | TIMIT | Speech Recognition | CRDNN + distillation | PER=13.1% (test) |
 | TIMIT | Speech Recognition | wav2vec2 + CTC/Att. | PER=8.04% (test) |
 | CommonVoice (English) | Speech Recognition | wav2vec2 + CTC | WER=15.69% (test) |
 | CommonVoice (French) | Speech Recognition | wav2vec2 + CTC | WER=9.96% (test) |
 | CommonVoice (Italian) | Speech Recognition | wav2vec2 + seq2seq | WER=9.86% (test) |
 | CommonVoice (Kinyarwanda) | Speech Recognition | wav2vec2 + seq2seq | WER=18.91% (test) |
-| AISHELL (Mandarin) | Speech Recognition | wav2vec2 + seq2seq | CER=5.58% (test) |
+| AISHELL (Mandarin) | Speech Recognition | wav2vec2 + CTC | CER=5.06% (test) |
 | Fisher-callhome (spanish) | Speech translation | conformer (ST + ASR) | BLEU=48.04 (test) |
 | VoxCeleb2 | Speaker Verification | ECAPA-TDNN | EER=0.80% (vox1-test) |
 | AMI | Speaker Diarization | ECAPA-TDNN | DER=3.01% (eval) |
@@ -128,10 +129,10 @@ The recipes released with speechbrain implement speech processing systems with c
 | Libri2Mix | Speech Separation | SepFormer | SDRi=20.6 dB (test-clean) |
 | Libri3Mix | Speech Separation | SepFormer | SDRi=18.7 dB (test-clean) |
 | LibryParty | Voice Activity Detection | CRDNN | F-score=0.9477 (test) |
-| IEMOCAP | Emotion Recognition | wav2vec | Accuracy=79.8% (test) |
+| IEMOCAP | Emotion Recognition | wav2vec2 | Accuracy=79.8% (test) |
 | CommonLanguage | Language Recognition | ECAPA-TDNN | Accuracy=84.9% (test) |
 | Timers and Such | Spoken Language Understanding | CRDNN | Intent Accuracy=89.2% (test) |
-| SLURP | Spoken Language Understanding | CRDNN | Intent Accuracy=87.54% (test) |
+| SLURP | Spoken Language Understanding | HuBERT | Intent Accuracy=87.54% (test) |
 | VoxLingua 107 | Identification | ECAPA-TDNN | Sentence Accuracy=93.3% (test) |

 For more details, take a look at the corresponding implementation in recipes/dataset/.
@@ -148,7 +149,7 @@ Beyond providing recipes for training the models from scratch, SpeechBrain share
 | Speech Recognition | CommonVoice(French) | [wav2vec + CTC](https://huggingface.co/speechbrain/asr-crdnn-commonvoice-fr) |
 | Speech Recognition | CommonVoice(Italian) | [wav2vec + CTC](https://huggingface.co/speechbrain/asr-wav2vec2-commonvoice-it) |
 | Speech Recognition | CommonVoice(Kinyarwanda) | [wav2vec + CTC](https://huggingface.co/speechbrain/asr-wav2vec2-commonvoice-rw) |
-| Speech Recognition | AISHELL(Mandarin) | [wav2vec + CTC](https://huggingface.co/speechbrain/asr-wav2vec2-transformer-aishell) |
+| Speech Recognition | AISHELL(Mandarin) | [wav2vec + seq2seq](https://huggingface.co/speechbrain/asr-wav2vec2-transformer-aishell) |
 | Text-to-Speech | LJSpeech | [Tacotron2](https://huggingface.co/speechbrain/tts-tacotron2-ljspeech) |
 | Speaker Recognition | Voxceleb | [ECAPA-TDNN](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb) |
 | Speech Separation | WHAMR! | [SepFormer](https://huggingface.co/speechbrain/sepformer-whamr) |
@@ -210,7 +211,7 @@ We are currently implementing speech synthesis pipelines and real-time speech pr

 # Conference Tutorials
 SpeechBrain has been presented at Interspeech 2021 and 2022 as well as ASRU 2021. When possible, we will provide some resources here:
-- [Interspeech 2022 slides.](https://drive.google.com/drive/folders/1d6GAquxw6rZBI-7JvfUQ_-upeiKstJEo?usp=sharing)
+- [Interspeech 2022 slides.](https://drive.google.com/drive/folders/1d6GAquxw6rZBI-7JvfUQ_-upeiKstJEo)
 - [Interspeech 2021 YouTube recordings.](https://www.youtube.com/results?search_query=Interspeech+speechbrain+)

 # Quick installation

SECURITY.md
Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
+# Security Policy
+
+## Supported Versions
+
+Since SpeechBrain is a beta-release, research-oriented toolkit, it aims to support the latest major version (at the x.y level, e.g. 0.5 until 0.6 is released) with security updates, but it unfortunately cannot promise long-term security updates for older versions.
+
+## Reporting a Vulnerability
+
+Vulnerabilities may be reported confidentially to speechbrainproject@gmail.com.
+
docs/compilation.md
Lines changed: 53 additions & 0 deletions

@@ -0,0 +1,53 @@
+# Compilation
+
+Compiling your models in SpeechBrain can potentially improve their speed and reduce memory demand. SpeechBrain inherits the compilation methods supported by PyTorch, including the just-in-time compiler (JIT) and the `torch.compile` function introduced in PyTorch 2.0.
+
+## Compile with `torch.compile`
+The `torch.compile` feature was introduced in PyTorch 2.0 to gradually replace JIT. Although this feature is valuable, it is still in the beta phase, and improvements are ongoing. Please have a look at the [PyTorch documentation](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for more information.
+
+### How to use `torch.compile`
+Compiling all modules in SpeechBrain is straightforward. You can enable compilation with the `--compile` flag on the command line when running a training recipe. For example:
+
+```bash
+python train.py train.yaml --data_folder=your/data/folder --compile
+```
+
+This will automatically compile all the modules declared in the YAML file under the `modules` section.
+
+Note that you might need to configure additional compilation flags correctly (e.g., `--compile_mode`, `--compile_using_fullgraph`, `--compile_using_dynamic_shape_tracing`) to ensure successful compilation or to achieve the best performance. For a deeper understanding of their roles, refer to the [PyTorch documentation](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).
+
+In some cases, you may want to compile only specific modules. To achieve this, add a list of the module keys you want to compile in the YAML file using `compile_module_keys`. For instance:
+
+```yaml
+compile_module_keys: [encoder, decoder]
+```
+
+This will compile only the encoder and decoder models, which must be declared in the YAML file under those keys.
+
+Remember to call the training script with the `--compile` flag.
+
+**Note of caution**: Compiling a model can be a complex process and may take some time. Additionally, it may fail in certain cases. The speed-up achieved through compilation is highly dependent on the system and GPU being used. For example, higher-end GPUs like the A100 tend to yield better speed-ups, while you may not observe significant improvements with V100 GPUs. We support this feature with the hope that `torch.compile` will keep improving over time.
+
+## Compile with JIT
+JIT was the first compilation method supported by PyTorch. Note that JIT is expected to be replaced by `torch.compile`. Please have a look at the [PyTorch documentation](https://pytorch.org/docs/stable/jit.html) for more information.
+
+### How to use JIT
+To compile all modules in SpeechBrain using JIT, use the `--jit` flag on the command line when running a training recipe:
+
+```bash
+python train.py train.yaml --data_folder=your/data/folder --jit
+```
+
+This will automatically compile all the modules declared in the YAML file under the `modules` section.
+
+If you only want to compile specific modules, add a list of the module keys you want to compile in the YAML file using `jit_module_keys`. For example:
+
+```yaml
+jit_module_keys: [encoder, decoder]
+```
+This will compile only the encoder and decoder models, provided they are declared in the YAML file under the specified keys.
+
+Remember to call the training script with the `--jit` flag.
+
+**Note of caution**: JIT has specific requirements for supported syntax, and many popular Python constructs are not supported. Therefore, when designing a model with JIT in mind, ensure that it meets the necessary syntax requirements for successful compilation. Additionally, the speed-up achieved through JIT compilation varies depending on the model type. We found it most beneficial for custom RNNs, such as the Li-GRU used in SpeechBrain's TIMIT/ASR/CTC recipe. Custom RNNs often require "for loops", which can be slow in Python; compilation with JIT provides a significant speed-up in such cases.
+
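The two flags described in docs/compilation.md boil down to wrapping selected modules with `torch.compile` (or `torch.jit.script`). The sketch below is illustrative only, not SpeechBrain's internal implementation; the `encoder`/`decoder` modules are hypothetical, and `backend="eager"` is chosen purely to keep the example portable (in practice the default inductor backend is what delivers speed-ups).

```python
import torch
import torch.nn as nn

# Hypothetical modules standing in for entries under the `modules`
# section of a recipe's YAML file.
modules = {
    "encoder": nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 128)),
    "decoder": nn.Linear(128, 30),
}

# Roughly what `--compile` with `compile_module_keys: [encoder, decoder]`
# amounts to: wrap each selected module with torch.compile (PyTorch >= 2.0).
# backend="eager" skips inductor code generation so this sketch runs anywhere.
compile_module_keys = ["encoder", "decoder"]
if hasattr(torch, "compile"):
    for key in compile_module_keys:
        modules[key] = torch.compile(modules[key], backend="eager")

x = torch.randn(4, 80)
out = modules["decoder"](modules["encoder"](x))
print(out.shape)  # torch.Size([4, 30])
```

Compilation happens lazily on the first forward pass, which is why the first training batch after enabling `--compile` is typically much slower than the rest.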
docs/experiment.md
Lines changed: 10 additions & 2 deletions

@@ -15,6 +15,14 @@ The YAML syntax offers an elegant way to specify the hyperparameters of a recipe
 In SpeechBrain, the YAML file is not a plain list of parameters, but for each parameter, we specify the function (or class) that is using it.
 This not only makes the specification of the parameters more transparent but also allows us to properly initialize all the entries by simply calling `load_extended_yaml` (in `speechbrain.utils.data_utils`).

+### Security note
+Loading HyperPyYAML allows arbitrary code execution.
+This is a feature: HyperPyYAML allows you to construct *anything* and *everything*
+you need in your experiment.
+However, take care to verify any untrusted recipes' YAML files just as you would verify the Python code.
+
+### Features
+
 Let's now take a quick look at the extended YAML features, using an example:

 ```
@@ -38,7 +46,7 @@ model: !new:speechbrain.lobes.models.CRDNN.CRDNN
 every user either by editing the yaml, or with an override (passed to
 `load_extended_yaml`).

-For more details on YAML and our extensions, please see our dedicated [tutorial](https://colab.research.google.com/drive/1Pg9by4b6-8QD2iC0U7Ic3Vxq4GEwEdDz?usp=sharing).
+For more details on YAML and our extensions, please see our dedicated [tutorial](https://colab.research.google.com/drive/1Pg9by4b6-8QD2iC0U7Ic3Vxq4GEwEdDz).

 ## Running arguments
 SpeechBrain defines a set of running arguments that can be set from the command line args (or within the YAML file).
@@ -50,7 +58,7 @@ SpeechBrain defines a set of running arguments that can be set from the command
 - `distributed_backend`: default "nccl", options: `["nccl", "gloo", "mpi"]`, this backend will be used as a DDP communication protocol. See PyTorch documentation for more details.
 - Additional runtime arguments are documented in the Brain class.

-Please note that we provide a dedicated [tutorial](https://colab.research.google.com/drive/13pBUacPiotw1IvyffvGZ-HrtBr9T6l15?usp=sharing) to document the different multi-gpu training strategies:
+Please note that we provide a dedicated [tutorial](https://colab.research.google.com/drive/13pBUacPiotw1IvyffvGZ-HrtBr9T6l15) to document the different multi-gpu training strategies:

 You can also override parameters in YAML in this way:
lint-requirements.txt
Lines changed: 1 addition & 1 deletion

@@ -2,5 +2,5 @@ black==19.10b0
 click==8.0.4
 flake8==3.7.9
 pycodestyle==2.5.0
-pytest==5.4.1
+pytest==7.4.0
 yamllint==1.23.0

recipes/AISHELL-1/ASR/CTC/README.md
Lines changed: 1 addition & 1 deletion

@@ -24,7 +24,7 @@ Results are reported in terms of Character Error Rate (CER).
 |:--------------------------:|:-----:| :-----:| :-----:| :-----: |
 | train_with_wav2vec.yaml | No | 5.06 | 4.52 | 1xRTX 8000 Ti 48GB |

-You can checkout our results (models, training logs, etc,) [here](https://drive.google.com/drive/folders/1GTB5IzQPl57j-0I1IpmvKg722Ti4ahLz?usp=sharing)
+You can check out our results (models, training logs, etc.) [here](https://www.dropbox.com/sh/e4bth1bylk7c6h8/AADFq3cWzBBKxuDv09qjvUMta?dl=0)

 # Training Time
 It takes about 2 hours on 1 RTX 8000 (48GB).

recipes/AISHELL-1/ASR/CTC/extra_requirements.txt

Lines changed: 0 additions & 2 deletions
This file was deleted.

recipes/AISHELL-1/ASR/seq2seq/README.md
Lines changed: 1 addition & 1 deletion

@@ -30,7 +30,7 @@ Results are reported in terms of Character Error Rate (CER). It is not clear fro
 | Base (keep spaces) | 7.51 |

 You can check out our results (models, training logs, etc.) here:
-https://drive.google.com/drive/folders/1zlTBib0XEwWeyhaXDXnkqtPsIBI18Uzs?usp=sharing
+https://www.dropbox.com/sh/kefuzzf6jaljqbr/AADBRWRzHz74GCMDqJY9BES4a?dl=0

 # Training Time
 It takes about 1 hour 30 minutes on an NVIDIA V100 (32GB).
