# FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

**Project Page**: https://lucadellalib.github.io/focalcodec-web/

This folder contains recipes for training FocalCodec on LibriTTS. You can download LibriTTS from https://www.openslr.org/60/.
FocalCodec is a low-bitrate single-codebook speech codec based on [focal modulation](https://arxiv.org/abs/2203.11926).

For more information, check our papers:

- [FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks](https://arxiv.org/abs/2502.04465)

- [FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation](https://arxiv.org/abs/2509.16195)

<img src="https://raw.githubusercontent.com/lucadellalib/focalcodec/refs/heads/main/focalcodec.png" width="700">

---------------------------------------------------------------------------------------------------------
| 17 | + |
## Installing Extra Dependencies

Before proceeding, ensure you have installed the necessary additional dependencies.
To do so, simply run the following command in your terminal:

```bash
pip install -r extra_requirements.txt
```

---------------------------------------------------------------------------------------------------------
| 28 | + |
## Running an Experiment

Training FocalCodec is a two-stage process:

1. **Train the decoder** to reconstruct waveforms from continuous speech representations (WavLM6 in our case).
2. **Train the quantization pipeline** (compressor, quantizer, decompressor) using the same representations.
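Schematically, stage 2 learns a bottleneck around the encoder's feature space, while stage 1's decoder maps those same features to waveforms. The data flow can be sketched in a few lines; every module and dimension below is an illustrative placeholder (random linear maps), not the recipe's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, CODE_DIM = 1024, 13  # illustrative sizes, not the recipe's

# Placeholder stand-ins for the trained modules (random linear maps).
W_comp = rng.standard_normal((FEAT_DIM, CODE_DIM)) * 0.01
W_decomp = rng.standard_normal((CODE_DIM, FEAT_DIM)) * 0.01

def compress(feats):
    return feats @ W_comp   # down-project to the code dimension

def quantize(z):
    return z                # placeholder: the recipe quantizes at this point

def decompress(codes):
    return codes @ W_decomp  # up-project back to the feature dimension

feats = rng.standard_normal((50, FEAT_DIM))    # ~1 s of continuous encoder features
recon = decompress(quantize(compress(feats)))  # stage-2 pipeline output
# Stage 1's decoder (not sketched) maps feature-space tensors like `recon`
# back to waveforms. Both stages consume the same continuous features,
# which is why they can be trained in parallel.
print(recon.shape)
```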

---------------------------------------------------------------------------------------------------------
| 37 | + |
### 1. Train the Decoder

```bash
python train_decoder.py hparams/vocos.yaml --data_folder <path-to-dataset>
```

This step trains a decoder to map encoder features back into high-quality audio.
UTMOS, dWER, and speaker similarity are computed on the test set to assess resynthesis performance.
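dWER is typically computed by transcribing the resynthesized audio with an ASR model and measuring the word error rate against the reference transcript, so lower values mean the codec better preserves linguistic content. As a minimal illustration of the metric itself (the ASR front-end is assumed and not shown), a pure-Python WER:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[-1][-1] / len(ref)

# dWER = wer(reference_transcript, asr_transcript_of_resynthesized_audio)
print(wer("the cat sat on the mat", "the bat sat on a mat"))  # 2 errors / 6 words
```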

---------------------------------------------------------------------------------------------------------
| 48 | + |
### 2. Train the Quantization Pipeline

```bash
python train_quantizer.py hparams/bsq.yaml --data_folder <path-to-dataset>
```

This stage trains the compressor, quantizer, and decompressor.
Note that it can be run in parallel with decoder training, since both stages operate on the same continuous encoder representations.

To monitor the end-to-end resynthesis performance during training, you can provide the previously trained decoder checkpoint:

```bash
python train_quantizer.py hparams/bsq.yaml --data_folder <path-to-dataset> --decoder_checkpoint <path-to-decoder-checkpoint>
```
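The `bsq.yaml` name refers to binary spherical quantization (BSQ), which replaces an explicit codebook with the vertices of a binary hypercube on the unit sphere: a 13-dimensional code then implicitly indexes 2^13 = 8192 entries, matching the 1x8192 codebook in the results table. A toy forward-pass sketch (dimensions illustrative; the recipe's actual quantizer, auxiliary losses, and straight-through gradients are not reproduced here):

```python
import numpy as np

def bsq_quantize(z):
    """Snap each vector (last axis) to the nearest unit-norm binary vertex,
    i.e. every entry becomes +/- 1/sqrt(d)."""
    d = z.shape[-1]
    u = z / np.linalg.norm(z, axis=-1, keepdims=True)  # project onto unit sphere
    return np.sign(u) / np.sqrt(d)                     # nearest hypercube vertex

def token_ids(q):
    """Read each quantized vector's sign pattern as an integer in [0, 2**d)."""
    bits = (q > 0).astype(np.int64)
    return bits @ (2 ** np.arange(q.shape[-1], dtype=np.int64))

codes = bsq_quantize(np.random.randn(50, 13))  # 1 s at a 50 Hz token rate
ids = token_ids(codes)
print(ids.shape, int(ids.min()) >= 0, int(ids.max()) < 8192)  # (50,) True True
```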

---------------------------------------------------------------------------------------------------------

## Results

Note that this is a SpeechBrain adaptation of the original training code.
Some implementation details may differ, which can lead to slightly different results compared to the original implementation.

For reference, we include the resynthesis results from the paper, obtained on **LibriSpeech test-clean**:

| Checkpoint | Train Data | Sample<br/>Rate (kHz) | Token<br/>Rate (Hz) | Codebooks | Bitrate<br/>(kbps) | UTMOS | dWER (%) | Sim (%) |
| :-------------------------------------------------------------------------------------: | :----------: |:---------------------:|:-------------------:| :-------: |:------------------:| :---: | :------: |:-------:|
| [lucadellalib/focalcodec_50hz](https://huggingface.co/lucadellalib/focalcodec_50hz) | LibriTTS-960 | 16 | 50.0 | 1x8192 | 0.65 | 4.05 | 2.18 | 97.4 |
| [lucadellalib/focalcodec_25hz](https://huggingface.co/lucadellalib/focalcodec_25hz) | LibriTTS-960 | 16 | 25.0 | 1x8192 | 0.33 | 4.14 | 3.30 | 96.3 |
| [lucadellalib/focalcodec_12_5hz](https://huggingface.co/lucadellalib/focalcodec_12_5hz) | LibriTTS-960 | 16 | 12.5 | 1x8192 | 0.16 | 4.22 | 7.94 | 93.9 |
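The bitrate column is simple arithmetic over the other columns: each token drawn from an 8192-entry codebook carries log2(8192) = 13 bits, so bitrate = token rate × number of codebooks × bits per token:

```python
import math

def bitrate_kbps(token_rate_hz, codebook_size, num_codebooks=1):
    """Bitrate = token rate * codebooks * bits per token (log2 of codebook size)."""
    return token_rate_hz * num_codebooks * math.log2(codebook_size) / 1000

for rate in (50.0, 25.0, 12.5):
    print(rate, "Hz ->", bitrate_kbps(rate, 8192), "kbps")
# 0.65, 0.325, and 0.1625 kbps; the table rounds to two decimals.
```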

The original training logs can be found at: [https://www.dropbox.com/scl/fo/o652m0qow1hs428ppocx3/ABiZp8xIK4d6iTcl-JXbn0s?rlkey=6cka0iabo2kzjg44if2kdgsvu&st=yqwv7x0w&dl=0](https://www.dropbox.com/scl/fo/o652m0qow1hs428ppocx3/ABiZp8xIK4d6iTcl-JXbn0s?rlkey=6cka0iabo2kzjg44if2kdgsvu&st=yqwv7x0w&dl=0).

The original checkpoints can be found at: [https://huggingface.co/collections/lucadellalib/focalcodec](https://huggingface.co/collections/lucadellalib/focalcodec).

The inference code can be found at: [https://github.com/lucadellalib/focalcodec](https://github.com/lucadellalib/focalcodec).

---------------------------------------------------------------------------------------------------------

## About SpeechBrain

- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/
- HuggingFace: https://huggingface.co/speechbrain/

---------------------------------------------------------------------------------------------------------

## Citing FocalCodec

Please cite FocalCodec if you use it for your research or business.

```bibtex
@inproceedings{dellalibera2025focalcodec,
  title = {{FocalCodec}: Low-Bitrate Speech Coding via Focal Modulation Networks},
  author = {Luca {Della Libera} and Francesco Paissan and Cem Subakan and Mirco Ravanelli},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2025},
}
```

```bibtex
@article{dellalibera2025focalcodecstream,
  title = {{FocalCodec-Stream}: Streaming Low-Bitrate Speech Coding via Causal Distillation},
  author = {Luca {Della Libera} and Cem Subakan and Mirco Ravanelli},
  journal = {arXiv preprint arXiv:2509.16195},
  year = {2025},
}
```

---------------------------------------------------------------------------------------------------------

## Citing SpeechBrain

Please cite SpeechBrain if you use it for your research or business.

```bibtex
@article{speechbrainV1,
  author = {Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca {Della Libera} and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Ha Nguyen and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Ga{{\"e}}lle Laperri{{\`e}}re and Mickael Rouvier and Renato De Mori and Yannick Est{{\`e}}ve},
  title = {Open-Source Conversational {AI} with {SpeechBrain} 1.0},
  journal = {Journal of Machine Learning Research},
  year = {2024},
  volume = {25},
  number = {333},
  pages = {1--11},
  url = {http://jmlr.org/papers/v25/24-0991.html}
}
```

```bibtex
@article{ravanelli2021speechbrain,
  author = {Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  title = {{SpeechBrain}: A General-Purpose Speech Toolkit},
  journal = {arXiv preprint arXiv:2106.04624},
  year = {2021},
  url = {https://arxiv.org/abs/2106.04624},
}
```