Commit 637f0a5
FocalCodec [NeurIPS 2025] (#3000)
Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>
1 parent db2db45 commit 637f0a5

14 files changed

Lines changed: 2467 additions & 0 deletions

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
# FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

**Project Page**: https://lucadellalib.github.io/focalcodec-web/

This folder contains recipes for training FocalCodec on LibriTTS. You can download LibriTTS from https://www.openslr.org/60/.
FocalCodec is a low-bitrate single-codebook speech codec based on [focal modulation](https://arxiv.org/abs/2203.11926).

For more information, check our papers:

- [FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks](https://arxiv.org/abs/2502.04465)

- [FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation](https://arxiv.org/abs/2509.16195)

<img src="https://raw.githubusercontent.com/lucadellalib/focalcodec/refs/heads/main/focalcodec.png" width="700">

---------------------------------------------------------------------------------------------------------

## Installing Extra Dependencies

Before proceeding, ensure you have installed the necessary additional dependencies.
To do so, simply run the following command in your terminal:

```bash
pip install -r extra_requirements.txt
```

---------------------------------------------------------------------------------------------------------

## Running an Experiment

Training FocalCodec is a two-stage process:

1. **Train the decoder** to reconstruct waveforms from continuous speech representations (WavLM6 in our case).
2. **Train the quantization pipeline** (compressor, quantizer, decompressor) using the same representations.

---------------------------------------------------------------------------------------------------------

### 1. Train the Decoder

```bash
python train_decoder.py hparams/vocos.yaml --data_folder <path-to-dataset>
```

This step trains a decoder to map encoder features back into high-quality audio.
UTMOS, dWER, and speaker similarity are computed on the test set to assess resynthesis performance.
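
Of these metrics, dWER (differential word error rate) is commonly defined as the WER between an ASR transcript of the resynthesized audio and an ASR transcript of the original audio, so it reflects intelligibility lost in resynthesis rather than errors of the ASR model itself. The following is a minimal sketch of the word-level edit-distance computation behind it, assuming both transcripts are already available as strings. The function name `word_error_rate` and the example sentences are illustrative and are not part of the recipe.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length, in percent."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Hypothetical ASR outputs for the original and the resynthesized utterance.
transcript_original = "the quick brown fox jumps over the lazy dog"
transcript_resynth = "the quick brown box jumps over lazy dog"
dwer = word_error_rate(transcript_original, transcript_resynth)
print(f"dWER: {dwer:.2f}%")
```

In this example one word is substituted and one is deleted, giving 2 errors over 9 reference words.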

---------------------------------------------------------------------------------------------------------

### 2. Train the Quantization Pipeline

```bash
python train_quantizer.py hparams/bsq.yaml --data_folder <path-to-dataset>
```

This stage trains the compressor, quantizer, and decompressor.
Note that it can be run in parallel with decoder training, since both stages operate on the same continuous encoder representations.
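
The `bsq.yaml` name points to binary spherical quantization (BSQ), a lookup-free scheme in which each latent dimension contributes one bit to the token. The sketch below illustrates that idea in plain Python and is not the recipe's implementation. It assumes a 13-dimensional latent, chosen so that 2^13 = 8192 matches the codebook size listed under Results, and the function name `bsq_quantize` is hypothetical.

```python
import math


def bsq_quantize(latent):
    """Project a latent vector onto the unit sphere and binarize each dimension."""
    dim = len(latent)
    norm = math.sqrt(sum(x * x for x in latent))
    if norm == 0.0:
        raise ValueError("latent vector must be non-zero")
    unit = [x / norm for x in latent]
    # One bit per dimension: 1 for non-negative coordinates, 0 otherwise.
    bits = [1 if x >= 0.0 else 0 for x in unit]
    # The quantized vector lies on the unit sphere, with entries +-1/sqrt(dim).
    scale = 1.0 / math.sqrt(dim)
    quantized = [scale if b else -scale for b in bits]
    # Read the bits as a binary number (dimension 0 is the least significant bit).
    index = sum(b << i for i, b in enumerate(bits))
    return quantized, index


# Assumed 13-dimensional latent, so the index falls in [0, 8192).
latent = [0.3, -1.2, 0.8, 0.1, -0.5, 2.0, -0.7, 0.4, -0.9, 1.1, -0.2, 0.6, -1.5]
quantized, index = bsq_quantize(latent)
print(index, 2 ** len(latent))
```

Because the codes are determined by signs alone, no codebook lookup or nearest-neighbor search is needed in this sketch.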

To monitor the end-to-end resynthesis performance during training, you can provide the previously trained decoder checkpoint:

```bash
python train_quantizer.py hparams/bsq.yaml --data_folder <path-to-dataset> --decoder_checkpoint <path-to-decoder-checkpoint>
```
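
Since both stages take the same data folder, the commands above can be assembled by a small launcher. The following is only a sketch: the helper `build_commands` is hypothetical, it uses only the scripts, YAML files, and flags shown above, and it prints the commands rather than executing them. The paths are placeholders.

```python
import shlex


def build_commands(data_folder, decoder_checkpoint=None):
    """Return the argv lists for the decoder and quantizer training stages."""
    decoder_cmd = [
        "python", "train_decoder.py", "hparams/vocos.yaml",
        "--data_folder", data_folder,
    ]
    quantizer_cmd = [
        "python", "train_quantizer.py", "hparams/bsq.yaml",
        "--data_folder", data_folder,
    ]
    # Optional: pass a trained decoder to monitor end-to-end resynthesis.
    if decoder_checkpoint is not None:
        quantizer_cmd += ["--decoder_checkpoint", decoder_checkpoint]
    return [decoder_cmd, quantizer_cmd]


# Placeholder paths; replace them with your own dataset and checkpoint locations.
commands = build_commands("/path/to/LibriTTS", "/path/to/decoder_checkpoint")
for cmd in commands:
    print(shlex.join(cmd))
```

To actually launch a stage, each list could be passed to `subprocess.run(cmd, check=True)` from the recipe folder.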

---------------------------------------------------------------------------------------------------------

## Results

Note that this is a SpeechBrain adaptation of the original training code.
Some implementation details may differ, which can lead to slightly different results compared to the original implementation.

For reference, we include the resynthesis results from the paper, obtained on **LibriSpeech test-clean**:

| Checkpoint                                                                              |  Train Data  | Sample<br/>Rate (kHz) | Token<br/>Rate (Hz) | Codebooks | Bitrate<br/>(kbps) | UTMOS | dWER (%) | Sim  |
| :-------------------------------------------------------------------------------------: | :----------: |:---------------------:|:-------------------:| :-------: |:------------------:| :---: | :------: |:----:|
| [lucadellalib/focalcodec_50hz](https://huggingface.co/lucadellalib/focalcodec_50hz)     | LibriTTS-960 |          16           |        50.0         |  1x8192   |        0.65        | 4.05  |   2.18   | 97.4 |
| [lucadellalib/focalcodec_25hz](https://huggingface.co/lucadellalib/focalcodec_25hz)     | LibriTTS-960 |          16           |        25.0         |  1x8192   |        0.33        | 4.14  |   3.30   | 96.3 |
| [lucadellalib/focalcodec_12_5hz](https://huggingface.co/lucadellalib/focalcodec_12_5hz) | LibriTTS-960 |          16           |        12.5         |  1x8192   |        0.16        | 4.22  |   7.94   | 93.9 |
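
The bitrate column follows from the token rate and the codebook size: a single codebook with 8192 entries costs log2(8192) = 13 bits per token, so the bitrate is the token rate multiplied by 13. The short check below reproduces the three rows. The helper name `bitrate_bps` is illustrative.

```python
import math


def bitrate_bps(token_rate_hz, codebook_size, num_codebooks=1):
    """Bits per second = tokens per second x bits per token x number of codebooks."""
    return token_rate_hz * math.log2(codebook_size) * num_codebooks


for token_rate in (50.0, 25.0, 12.5):
    bps = bitrate_bps(token_rate, 8192)
    print(f"{token_rate:>5} Hz -> {bps:.1f} bps ({bps / 1000:.4f} kbps)")
```

The exact values are 0.65, 0.325, and 0.1625 kbps, which the table reports to two decimal places.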

The original training logs can be found at: [https://www.dropbox.com/scl/fo/o652m0qow1hs428ppocx3/ABiZp8xIK4d6iTcl-JXbn0s?rlkey=6cka0iabo2kzjg44if2kdgsvu&st=yqwv7x0w&dl=0](https://www.dropbox.com/scl/fo/o652m0qow1hs428ppocx3/ABiZp8xIK4d6iTcl-JXbn0s?rlkey=6cka0iabo2kzjg44if2kdgsvu&st=yqwv7x0w&dl=0).

The original checkpoints can be found at: [https://huggingface.co/collections/lucadellalib/focalcodec](https://huggingface.co/collections/lucadellalib/focalcodec).

The inference code can be found at: [https://github.com/lucadellalib/focalcodec](https://github.com/lucadellalib/focalcodec).

---------------------------------------------------------------------------------------------------------

## About SpeechBrain

- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/
- HuggingFace: https://huggingface.co/speechbrain/

---------------------------------------------------------------------------------------------------------

## Citing FocalCodec

Please cite FocalCodec if you use it for your research or business.

```bibtex
@inproceedings{dellalibera2025focalcodec,
    title     = {{FocalCodec}: Low-Bitrate Speech Coding via Focal Modulation Networks},
    author    = {Luca {Della Libera} and Francesco Paissan and Cem Subakan and Mirco Ravanelli},
    booktitle = {Advances in Neural Information Processing Systems},
    year      = {2025},
}
```

```bibtex
@article{dellalibera2025focalcodecstream,
    title   = {{FocalCodec-Stream}: Streaming Low-Bitrate Speech Coding via Causal Distillation},
    author  = {Luca {Della Libera} and Cem Subakan and Mirco Ravanelli},
    journal = {arXiv preprint arXiv:2509.16195},
    year    = {2025},
}
```

---------------------------------------------------------------------------------------------------------

## Citing SpeechBrain

Please cite SpeechBrain if you use it for your research or business.

```bibtex
@article{speechbrainV1,
    author  = {Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca {Della Libera} and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Ha Nguyen and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Ga{{\"e}}lle Laperri{{\`e}}re and Mickael Rouvier and Renato De Mori and Yannick Est{{\`e}}ve},
    title   = {Open-Source Conversational {AI} with {SpeechBrain} 1.0},
    journal = {Journal of Machine Learning Research},
    year    = {2024},
    volume  = {25},
    number  = {333},
    pages   = {1--11},
    url     = {http://jmlr.org/papers/v25/24-0991.html}
}
```

```bibtex
@article{ravanelli2021speechbrain,
    author  = {Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
    title   = {{SpeechBrain}: A General-Purpose Speech Toolkit},
    journal = {arXiv preprint arXiv:2106.04624},
    year    = {2021},
    url     = {https://arxiv.org/abs/2106.04624},
}
```
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
focalcodec@git+https://github.com/lucadellalib/focalcodec.git@main#egg=focalcodec
transformers