
Commit 39ef358

SpeechLLM LibriSpeech recipe (#2885)
1 parent de1c94d

21 files changed: 4615 additions & 2216 deletions

File tree

docs/tutorials/basics/data-loading-pipeline.ipynb

Lines changed: 2666 additions & 2163 deletions
Large diffs are not rendered by default.

pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -114,7 +114,7 @@ ignore = [
 combine-as-imports = true
 force-wrap-aliases = true
 known-first-party = ["speechbrain"]
-known-third-party = ["torch", "torchaudio", "numpy", "scipy", "hyperpyyaml", "joblib", "packaging", "sentencepiece", "tqdm", "huggingface_hub"]
+known-third-party = ["torch", "torchaudio", "numpy", "scipy", "hyperpyyaml", "joblib", "packaging", "requests", "sentencepiece", "tqdm", "huggingface_hub"]
 split-on-trailing-comma = false

 [tool.ruff.lint.per-file-ignores]
```

recipes/LibriSpeech/ASR/transformer/README.md

Lines changed: 14 additions & 20 deletions
````diff
@@ -7,7 +7,6 @@ You can download LibriSpeech at http://www.openslr.org/12
 ```shell
 python train_with_whisper.py hparams/train_hf_whisper.yaml
 python train.py hparams/transformer.yaml
-
 ```

 # How to run on test sets only
@@ -23,6 +22,20 @@ installed in your environment (see extra-requirements.txt)**

 # Results

+## SpeechLLM
+
+Two SpeechLLM modes are supported:
+- SpeechLLM with SSL features
+- SpeechLLM with E2E features
+
+In the first mode, speech features are extracted from the audio waveforms with a pre-trained SSL model and projected into the LLM embedding space by a linear layer; the whole system is trained jointly.
+
+In the second mode, the speech features are extracted offline beforehand (see the `extract_ssl_feats.py` script) and the LLM is trained on the frozen SSL representations. This mode is faster and cheaper to train, at the cost of flexibility, since the SSL model stays frozen.
+
+| Release | Model | hyperparams file | Dev Clean WER | Dev Other WER | Test Clean WER | Test Other WER | HuggingFace link | Model link | GPUs |
+|:-------:|:-----:|:----------------:|:-------------:|:-------------:|:--------------:|:--------------:|:----------------:|:----------:|:----:|
+| 29-01-26 | WavLM Large + LLama 3.2 1B + LoRA | speechllm_e2e.yaml | 2.79 | 5.03 | 2.72 | 5.34 | [HuggingFace](https://huggingface.co/speechbrain/asr-wavlm-large-llama3.2-1b-lora-librispeech) | - | 1xA100 80GB |
+
 ## Whisper Finetuning Result:

 Following table contains whisper-finetuning results for 1 epoch using Whisper model, freezing encoder and finetuning decoder.
@@ -49,25 +62,6 @@ Following table contains whisper-finetuning results for 1 epoch using Whisper mo
 | 03-09-23 | hyperbranchformer_25M.yaml | NA | 2.36 | 5.89 | Not Avail. | Not Avail. | 1xP40 24GB
 | 05-01-24 | bayesspeech.yaml | 4.28 | 2.84 | 6.27 | Not Avail. | [DropBox](https://www.dropbox.com/scl/fo/cdken4jqfj96ev1v84jxm/h?rlkey=25eu1ytgm5ac51zqj8p65zwxd&dl=0) | 1xV100 32GB |

-# **About HyperConformer**
-HyperConformer is a new architecture, which replaces the self-attention mechanism of Conformer with the linear-time token mixing architecture HyperMixer.
-It achieves competitive or better results than Conformer while requiring less memory and compute.
-
-- Paper: https://arxiv.org/abs/2305.18281
-- HyperMixer code: https://github.com/idiap/hypermixing
-
-Please cite HyperConformer if you use it for your research or business.
-
-```bibtex
-@inproceedings{mai23_interspeech,
-  author={Florian Mai and Juan Zuluaga-Gomez and Titouan Parcollet and Petr Motlicek},
-  title={{HyperConformer}: Multi-head HyperMixer for Efficient Speech Recognition},
-  year=2023,
-  booktitle={Proc. Interspeech 2023},
-  pages={2213--2217},
-  doi={10.21437/Interspeech.2023-1611}
-}
-```

 # **About SpeechBrain**
 - Website: https://speechbrain.github.io/
````
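The linear-layer projection described for the first mode above can be sketched as follows. This is a minimal illustration, not the recipe's actual code: the feature dimension (1024 for WavLM Large) and LLM embedding dimension (2048) are assumptions for the sake of the example.

```python
import torch
import torch.nn as nn

# Assumed dimensions: WavLM Large frames (1024-d) into a 2048-d LLM
# embedding space; both values are illustrative.
ssl_dim, llm_dim = 1024, 2048

# A single linear layer maps each SSL frame into the LLM embedding space.
projector = nn.Linear(ssl_dim, llm_dim)

# Fake batch: 1 utterance, 50 SSL frames.
ssl_feats = torch.randn(1, 50, ssl_dim)
speech_embeds = projector(ssl_feats)  # shape (1, 50, llm_dim)

# The projected frames are then concatenated with the text token
# embeddings before being fed to the LLM (hypothetical text embeddings).
text_embeds = torch.randn(1, 10, llm_dim)
llm_inputs = torch.cat([speech_embeds, text_embeds], dim=1)  # (1, 60, llm_dim)
```

Training jointly then simply means the projector (and optionally the SSL encoder and LoRA adapters) receive gradients from the LLM's language-modeling loss.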
extract_ssl_feats.py

Lines changed: 139 additions & 0 deletions

```python
#!/usr/bin/env python3
"""Script to extract SSL features from the audio waveforms.

The script uses the `speechbrain.integrations.hdf5.cached_item` module to cache the features.
The cached features are used in the `train_speechllm.py` script to train the SpeechLLM ASR system.

Since we do the extraction within the pipeline in the dataloader, we must place
our hparams elements directly on device, and use a default batch size of 1.

Example
-------
python extract_ssl_feats.py hparams/extract_ssl_feats.yaml \
    --data_folder path/to/LibriSpeech \
    --output_folder path/to/feats_cache \
    --ssl_hub path/to/wavlm-large \
    --feats_cache_dir path/to/feats_cache \
    ...other_hparams...

Authors
-------
 * Adel Moumen, 2025
"""

import sys
from pathlib import Path

import torch
from hyperpyyaml import load_hyperpyyaml

import speechbrain as sb
from speechbrain.integrations.hdf5.cached_item import CachedHDF5DynamicItem
from speechbrain.utils.distributed import run_on_main
from speechbrain.utils.logger import get_logger

logger = get_logger(__name__)


def dataio_prepare(hparams):
    """This function prepares the datasets to be used in the brain class.
    It also defines the data processing pipeline through user-defined functions.
    """
    data_folder = hparams["data_folder"]

    # Define the audio pipeline: read the waveform from disk
    @sb.utils.data_pipeline.takes("wav")
    @sb.utils.data_pipeline.provides("sig")
    def audio_pipeline(wav):
        sig = sb.dataio.dataio.read_audio(wav)
        return sig

    normalizer = hparams["normalize"].to(hparams["device"]).eval()
    ssl_encoder = hparams["ssl"].to(hparams["device"]).eval()

    # Compute function shared by all datasets; outputs are cached to HDF5
    @CachedHDF5DynamicItem.cache(hparams["feats_cache_dir"], compression="gzip")
    @sb.utils.data_pipeline.takes("id", "sig")
    @sb.utils.data_pipeline.provides("feats")
    def compute_feats(uid, sig):
        sig = sig.to(hparams["device"]).unsqueeze(0)
        length = torch.ones(1, device=hparams["device"])
        with torch.no_grad(), torch.autocast("cuda", dtype=hparams["dtype"]):
            feats = normalizer(sig, length)
            feats = ssl_encoder(feats, length)
        return feats.squeeze(0).cpu()

    dynamic_items = [audio_pipeline, compute_feats]
    output_keys = ["id", "sig", "feats"]

    train_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["train_csv"],
        replacements={"data_root": data_folder},
        dynamic_items=dynamic_items,
        output_keys=output_keys,
    )

    valid_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["valid_csv"],
        replacements={"data_root": data_folder},
        dynamic_items=dynamic_items,
        output_keys=output_keys,
    )

    # Each test split gets its own dataset
    test_datasets = {}
    for csv_file in hparams["test_csv"]:
        name = Path(csv_file).stem
        test_datasets[name] = sb.dataio.dataset.DynamicItemDataset.from_csv(
            csv_path=csv_file,
            replacements={"data_root": data_folder},
            dynamic_items=dynamic_items,
            output_keys=output_keys,
        )

    datasets = {"train": train_data, "valid": valid_data} | test_datasets

    for stage, dataset in datasets.items():
        logger.info(f"Iterating {stage} dataset to warm the cache.")
        dataset.iterate_once(output_keys=["feats"])


if __name__ == "__main__":
    # CLI:
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
    with open(hparams_file, encoding="utf-8") as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    # Create ddp_group with the right communication protocol
    sb.utils.distributed.ddp_init_group(run_opts)

    # Dataset prep (parsing LibriSpeech)
    from librispeech_prepare import prepare_librispeech  # noqa

    # Create experiment directory
    sb.create_experiment_directory(
        experiment_directory=hparams["output_folder"],
        hyperparams_to_save=hparams_file,
        overrides=overrides,
    )

    # Multi-GPU (DDP) safe data preparation
    run_on_main(
        prepare_librispeech,
        kwargs={
            "data_folder": hparams["data_folder"],
            "tr_splits": hparams["train_splits"],
            "dev_splits": hparams["dev_splits"],
            "te_splits": hparams["test_splits"],
            "save_folder": hparams["output_folder"],
            "merge_lst": hparams["train_splits"],
            "merge_name": "train.csv",
            "skip_prep": hparams["skip_prep"],
        },
    )
    logger.info("Preparing data...")
    dataio_prepare(hparams)
    logger.info("Done preparing data")
```
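The key idea in the script above is the compute-once cache: the decorated `compute_feats` writes each utterance's features to disk on first access, so later epochs (and `train_speechllm.py`) read them back instead of re-running the SSL encoder. The sketch below illustrates that pattern with a stdlib pickle-backed cache; it is a simplified stand-in, not SpeechBrain's actual HDF5 implementation (`disk_cache` is a hypothetical name).

```python
import pickle
import tempfile
from pathlib import Path

def disk_cache(cache_dir):
    """Simplified stand-in for an HDF5-backed feature cache: the first
    call computes and stores the result keyed by uid; later calls for
    the same uid reload the stored result instead of recomputing."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)

    def decorator(compute_fn):
        def wrapper(uid, *args):
            path = cache / f"{uid}.pkl"
            if path.exists():
                with open(path, "rb") as f:
                    return pickle.load(f)  # cache hit: skip computation
            result = compute_fn(uid, *args)
            with open(path, "wb") as f:
                pickle.dump(result, f)  # cache miss: compute and store
            return result
        return wrapper
    return decorator

cache_dir = tempfile.mkdtemp()

@disk_cache(cache_dir)
def compute_feats(uid, sig):
    # Stand-in for normalization + SSL encoding.
    return [x * 2.0 for x in sig]

first = compute_feats("utt1", [1.0, 2.0])   # computed and written
second = compute_feats("utt1", [9.9, 9.9])  # served from cache; args ignored
```

Note the hit path deliberately ignores the input signal, which is why the recipe iterates every dataset once ("warming" the cache) before training starts.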
hparams/extract_ssl_feats.yaml

Lines changed: 42 additions & 0 deletions

```yaml
# ############################################################################
# Task : Extraction of self-supervised (SSL) speech features from LibriSpeech
# Usage: Precompute and cache SSL representations for downstream SpeechLLM ASR
# Authors:
#  * Adel Moumen, 2025
# ############################################################################

# Seed needs to be set at top of yaml, before objects with parameters are made
seed: 3407
__set_seed: !apply:speechbrain.utils.seed_everything [!ref <seed>]
experiment_name: ssl_feats_extraction
output_folder: !ref results/<experiment_name>/<seed>
save_folder: !ref <output_folder>/save
feats_cache_dir: !ref <output_folder>/feats_cache

# Data files
data_folder: !PLACEHOLDER
train_splits: ["train-clean-100", "train-clean-360", "train-other-500"]
dev_splits: ["dev-clean"]
test_splits: ["test-clean", "test-other"]
skip_prep: False
train_csv: !ref <output_folder>/train.csv
valid_csv: !ref <output_folder>/dev-clean.csv
test_csv:
    - !ref <output_folder>/test-clean.csv
    - !ref <output_folder>/test-other.csv
dtype: !name:torch.bfloat16
device: cuda

####################### Training Parameters ####################################
ssl_hub: !PLACEHOLDER
ssl_folder: !ref <save_folder>/ssl_checkpoint
ssl_frozen: True

####################### Model Components ####################################
normalize: !new:speechbrain.processing.features.InputNormalization
    norm_type: sentence

ssl: !new:speechbrain.integrations.huggingface.wav2vec2.Wav2Vec2
    source: !ref <ssl_hub>
    output_norm: True
    freeze: !ref <ssl_frozen>
    save_path: !ref <ssl_folder>
    device_map: !ref <device>
```
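The `!ref <key>` entries in the YAML above are resolved by HyperPyYAML against earlier keys before any object is built. The snippet below is a deliberately simplified illustration of just that substitution step (HyperPyYAML itself also handles `!new:`, `!name:`, `!apply:`, and command-line overrides); the `resolve` helper is invented for this sketch.

```python
import re

# A few keys mirroring the hparams file above, with HyperPyYAML-style
# <key> placeholders still unresolved.
hparams = {
    "seed": "3407",
    "experiment_name": "ssl_feats_extraction",
    "output_folder": "results/<experiment_name>/<seed>",
    "train_csv": "<output_folder>/train.csv",
}

def resolve(value, table):
    """Recursively substitute <key> placeholders with resolved values."""
    def sub(match):
        return resolve(table[match.group(1)], table)
    return re.sub(r"<([^>]+)>", sub, value)

resolved = {k: resolve(v, hparams) for k, v in hparams.items()}
# resolved["train_csv"] -> "results/ssl_feats_extraction/3407/train.csv"
```

This is why overriding a single key such as `--output_folder` on the command line transparently relocates every derived path (`save_folder`, `feats_cache_dir`, the CSVs) in one go.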