Commit 2b3f4f4

SGMSE Voicebank Speech Enhancement Recipe (#2947)

Co-authored-by: Jonas Rochdi <1rochdi@informatik.uni-hamburg.de>
Co-authored-by: Peter Plantinga <plantinga.peter@protonmail.com>

1 parent 2427785 commit 2b3f4f4

11 files changed

Lines changed: 2006 additions & 2 deletions

Lines changed: 90 additions & 0 deletions
# VoiceBank Speech Enhancement with SGMSE

This recipe implements a speech enhancement system based on the SGMSE architecture using the VoiceBank dataset (based on the paper: [https://arxiv.org/abs/2208.05830](https://arxiv.org/abs/2208.05830)).

## Results

| Experiment Date | PESQ | SI-SDR | STOI |
|-|-|-|-|
| 2025-07-24 | 2.78 | 17.8 | 95.7 |

You can find the full experiment folder (i.e., checkpoints, logs, etc.) here:
https://www.dropbox.com/scl/fo/bi8sln2de6ep8nrv38jt5/ACWQAOAIsYSMyjhcu2ZSavc?rlkey=xtqlon9xjcy43ghncnlbtruii&st=sql8s5r8&dl=0
## How to Run

### Training

To train the SGMSE speech enhancement model, execute:

```bash
python recipes/Voicebank/enhance/SGMSE/train.py recipes/Voicebank/enhance/SGMSE/hparams.yaml
```

This will:

* Prepare the VoiceBank dataset automatically (if not already prepared).
* Train the model with the hyperparameters defined in `hparams.yaml`.
* Create a `run_name` unique to each run.
* Store checkpoints, logs, and validation/testing samples in `output_dir/run_name` (specified in the `hparams.yaml` file).
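
The generated `run_name` appears to follow a timestamp pattern (the resume example in this README uses `run_YYYY-MM-DD_HH-MM-SS`). A minimal sketch of how such a name could be built; `make_run_name` is a hypothetical helper, not the recipe's actual function:

```python
from datetime import datetime


def make_run_name(prefix: str = "run") -> str:
    """Build a timestamped run name such as 'run_2025-07-24_13-05-42'."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    return f"{prefix}_{stamp}"


print(make_run_name())
```

Because the name encodes the start time, every invocation gets its own output directory under `output_dir`.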
### Resume Training from a Previous Run

Point `--resume` to the existing run directory (the folder that contains `hyperparams.yaml` and checkpoints):

```bash
python recipes/Voicebank/enhance/SGMSE/train.py --resume path/to/results/run_YYYY-MM-DD_HH-MM-SS
```

When `--resume` is provided:

* The script loads `hyperparams.yaml` from the given run directory and uses that saved configuration.
* Training continues from the latest checkpoint in that directory (if present), keeping the same `run_name`.
* CLI overrides still work, but a new `run_name` is not generated.
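
The resume behavior above can be sketched as follows; `resolve_config` is a hypothetical helper illustrating the decision, not code from `train.py`:

```python
from pathlib import Path


def resolve_config(resume_dir=None, default_hparams="hparams.yaml"):
    """Pick the hyperparameter file and run name for this invocation.

    When resuming, the saved hyperparams.yaml inside the run directory is
    used, so the original configuration and run_name are kept.
    """
    if resume_dir is not None:
        run_dir = Path(resume_dir)
        hparams_file = run_dir / "hyperparams.yaml"
        if not hparams_file.exists():
            raise FileNotFoundError(f"No hyperparams.yaml found in {run_dir}")
        return hparams_file, run_dir.name  # reuse the existing run_name
    # Fresh run: use the default hparams; a new run_name is generated later
    return Path(default_hparams), None
```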
### Inference (Speech Enhancement)

You can enhance single audio files or entire directories using a trained model:

* **Single-file enhancement:**

```bash
python recipes/Voicebank/enhance/SGMSE/enhancement.py --run_dir /path/to/trained_model noisy_audio.wav
```

* **Batch enhancement (whole directory):**

```bash
python recipes/Voicebank/enhance/SGMSE/enhancement.py --run_dir /path/to/trained_model /path/to/noisy_directory
```

Enhanced audio files are stored in a newly created subdirectory given by `inference_dir` in the `hparams.yaml` file, preserving the original filenames.
## Results and Outputs

During training, all results and model checkpoints are saved in:

```
<output_dir>/<run_name>/
```

During inference, enhanced audio outputs are saved in:

```
<output_dir>/<run_name>/<inference_dir>/
```

## About SpeechBrain

* Website: [https://speechbrain.github.io/](https://speechbrain.github.io/)
* Code: [https://github.com/speechbrain/speechbrain/](https://github.com/speechbrain/speechbrain/)
* HuggingFace: [https://huggingface.co/speechbrain/](https://huggingface.co/speechbrain/)

## Citing SGMSE

```bibtex
@article{richter2023speech,
    title={Speech enhancement and dereverberation with diffusion-based generative models},
    author={Richter, Julius and Welker, Simon and Lemercier, Jean-Marie and Lay, Bunlong and Gerkmann, Timo},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    volume={31},
    pages={2351--2364},
    year={2023},
    publisher={IEEE}
}
```
Lines changed: 118 additions & 0 deletions
"""
Single-file or batch speech enhancement with SGMSE.

Single file:
    python enhance.py --run_dir /path/to/run noisy.wav

Whole directory:
    python enhance.py --run_dir /path/to/run /path/to/noisy_dir
"""

import argparse
import sys
from pathlib import Path

import torch
import torchaudio
from hyperpyyaml import load_hyperpyyaml
from train import SGMSEBrain

from speechbrain.utils.checkpoints import Checkpointer


# Helpers
def is_audio_file(path):
    return path.suffix.lower() in {".wav", ".flac", ".ogg"}


def collect_audio_files(src):
    return [p for p in src.iterdir() if p.is_file() and is_audio_file(p)]


def main():
    parser = argparse.ArgumentParser(
        description="Run SGMSE enhancement (torchaudio I/O)"
    )
    parser.add_argument(
        "--run_dir",
        "-r",
        type=Path,
        required=True,
        help="Path to the trained run directory (the folder that "
        "contains hyperparams.yaml and checkpoints/).",
    )
    parser.add_argument(
        "input",
        type=Path,
        help="Path to a noisy audio file OR a directory of audio files.",
    )
    args = parser.parse_args()

    run_dir = args.run_dir.expanduser().resolve()
    if not run_dir.exists():
        sys.exit(f"--run_dir '{run_dir}' does not exist.")

    hparams_file = run_dir / "hyperparams.yaml"
    checkpoints_dir = run_dir / "checkpoints"

    with open(hparams_file, encoding="utf-8") as f:
        hparams = load_hyperpyyaml(f)

    target_sr = hparams["sample_rate"]
    inference_dir = run_dir / "enhanced_inference"
    inference_dir.mkdir(parents=True, exist_ok=True)

    modules = hparams["modules"]
    brain = SGMSEBrain(
        modules=modules,
        hparams=hparams,
        run_opts={"device": "cuda" if torch.cuda.is_available() else "cpu"},
        checkpointer=Checkpointer(
            checkpoints_dir=checkpoints_dir,
            recoverables={"score_model": modules["score_model"]},
        ),
    )
    brain.setup_inference()  # Loads the latest checkpoint, EMA weights, etc.

    # Enhancement routine
    def enhance_file(noisy_path, dst_dir):
        wav, sr = torchaudio.load(noisy_path)
        if sr != target_sr:
            wav = torchaudio.functional.resample(wav, sr, target_sr)

        # Downmix multi-channel audio to mono
        if wav.shape[0] > 1:
            wav = wav.mean(0, keepdim=True)

        with torch.no_grad():
            wav = wav.to(brain.device)
            enhanced = brain.enhance(wav).cpu()

        # Keep the original suffix and let torchaudio infer the format from it
        out_path = dst_dir / f"{noisy_path.stem}_enhanced{noisy_path.suffix}"
        torchaudio.save(out_path.as_posix(), enhanced, target_sr)
        return out_path

    src = args.input.expanduser().resolve()

    if src.is_file():
        if not is_audio_file(src):
            sys.exit(f"{src} is not a supported audio file.")
        out_path = enhance_file(src, inference_dir)
        print(f"Enhanced file written to {out_path}")

    elif src.is_dir():
        files = collect_audio_files(src)
        if not files:
            sys.exit(f"{src} contains no enhanceable audio files.")

        batch_out_dir = inference_dir / f"{src.name}_enhanced"
        batch_out_dir.mkdir(parents=True, exist_ok=True)

        print(f"Enhancing {len(files)} file(s) > {batch_out_dir}")
        for idx, fpath in enumerate(files, 1):
            out_path = enhance_file(fpath, batch_out_dir)
            print(f"[{idx}/{len(files)}] > {out_path}")
    else:
        sys.exit(f"{src} is neither a file nor a directory.")


if __name__ == "__main__":
    main()
Lines changed: 29 additions & 0 deletions
gdown
h5py
hyperpyyaml
ipympl
librosa
ninja
numpy<2.0
pandas
pesq
pillow
protobuf
pyarrow
pyroomacoustics
pystoi
scipy
sdeint
seaborn
setuptools
git+https://github.com/sp-uhh/sgmse.git@main#egg=sgmse
tensorboard
torch
torch-ema
torch-pesq
torchaudio
torchinfo
torchmetrics
torchsde
torchvision
tqdm
Lines changed: 88 additions & 0 deletions
output_folder: results # Main directory to store experiment results
run_name: "RUN_NAME" # Will be updated with a unique name at runtime

save_dir: !ref <output_folder>/<run_name>/checkpoints # Directory to save checkpoints
enhanced_dir: !ref <output_folder>/<run_name>/enhanced_training # Directory to store waveforms enhanced at validation during training

data_dir: !PLACEHOLDER # Root dir for the dataset
train_annotation: !ref <data_dir>/train.json # JSON file listing training samples
valid_annotation: !ref <data_dir>/valid.json # JSON file listing validation samples
test_annotation: !ref <data_dir>/test.json # JSON file listing test samples

skip_prep: False # If True, skip data preparation steps
segment_frames: 256 # Number of STFT frames fed into the model; must match the input size the U-Net architecture expects
random_crop: True # Whether to crop segments randomly from longer waveforms in training
random_crop_valid: False # Whether to crop segments randomly from longer waveforms in validation
random_crop_test: False # Whether to crop segments randomly from longer waveforms in testing

normalize: noisy # Waveforms are normalized with respect to ... (noisy / clean / not)
sample_rate: 16000 # Sampling rate (in Hz) for audio data
batch_size: 8 # Batch size for the training set
number_of_epochs: 160 # Total epochs to train
num_to_keep: 2 # Number of checkpoints to keep
lr: 0.0001 # Learning rate
sorting: ascending # Sorting strategy for data loading (e.g., ascending, descending)

n_fft: 510 # FFT size for STFT
hop_length: 128 # Hop length (stride) for STFT
window_type: hann # Type of window function for STFT

transform_type: exponent # Type of spectral transform (log, exponent, none)
spec_factor: 0.15 # Factor to scale the transformed spectrogram
spec_abs_exponent: 0.5 # Exponent to apply to spectrogram magnitude if needed

train_dataloader_opts:
    batch_size: !ref <batch_size>
    shuffle: True # Shuffle training data each epoch

valid_dataloader_opts:
    batch_size: 1 # Validation batch size

test_dataloader_opts:
    batch_size: 1 # Test batch size

sampling:
    sampler_type: pc
    predictor: reverse_diffusion
    corrector: ald
    N: 30
    corrector_steps: 1
    snr: 0.5

epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs> # Sets the upper bound on training epochs

modules:
    score_model: !new:speechbrain.integrations.models.sgmse_plus.ScoreModel
        backbone: ncsnpp_v2 # Name of the backbone neural network architecture
        sde: ouve # Which SDE to use (Ornstein-Uhlenbeck VE SDE)
        theta: 1.5 # Stiffness parameter for the OU SDE
        sigma_min: 0.05 # Minimum sigma value for the OU SDE
        sigma_max: 0.5 # Maximum sigma value for the OU SDE
        lr: !ref <lr> # Learning rate for the model
        ema_decay: 0.999 # Decay factor for EMA of model parameters
        t_eps: 0.03 # Minimum time step to avoid zero in continuous diffusion
        num_eval_files: 5 # Number of files to process for evaluation
        loss_type: score_matching # Which loss approach to use (score matching, etc.)
        loss_weighting: sigma^2 # Weighting in the loss function
        network_scaling: 1/t # Scaling strategy (if any) for network outputs
        c_in: "1" # Input scaling scheme
        c_out: "1" # Output scaling scheme
        c_skip: "0" # Skip connection scaling scheme
        sigma_data: 0.1 # Data STD for EDM-based parameterizations
        l1_weight: 0.001 # Weight factor for the L1 (time-domain) loss
        pesq_weight: 0.0 # Weight factor for the PESQ-based loss (0 = disabled)
        N: !ref <sampling[N]> # Sampler steps
        corrector_steps: !ref <sampling[corrector_steps]> # Corrector updates per step
        sampler_type: !ref <sampling[sampler_type]> # SDE sampler type
        snr: !ref <sampling[snr]> # SNR for sampler
        sr: !ref <sample_rate> # Sample rate for model references

opt_class: !name:torch.optim.Adam
    lr: !ref <lr> # LR used in the Adam optimizer

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_dir> # Directory to store checkpoint files
    recoverables:
        score_model: !ref <modules[score_model]> # Model parameters to be saved
        counter: !ref <epoch_counter> # Epoch counter to be saved
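
The STFT and transform settings in this file interact: with `n_fft: 510`, each frame has 510 // 2 + 1 = 256 frequency bins, so a crop of `segment_frames: 256` yields the square 256×256 complex spectrogram the U-Net backbone expects, and the `exponent` transform compresses magnitudes while keeping phase. A PyTorch sketch of these two steps; `transform_spec` is an illustrative helper, not the recipe's code:

```python
import torch


def transform_spec(spec, spec_factor=0.15, spec_abs_exponent=0.5):
    """Magnitude-compressed spectrogram: factor * |X|**a * exp(i * angle(X))."""
    return spec_factor * torch.polar(spec.abs() ** spec_abs_exponent, spec.angle())


# n_fft=510 gives 510 // 2 + 1 = 256 frequency bins per frame.
wav = torch.randn(1, 256 * 128)  # dummy waveform
X = torch.stft(
    wav,
    n_fft=510,
    hop_length=128,
    window=torch.hann_window(510),
    return_complex=True,
)
print(X.shape[1])  # 256 frequency bins
X_t = transform_spec(X)
```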
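
The `sampling` block selects a predictor-corrector (`pc`) sampler: `N` reverse-diffusion predictor steps, each followed by `corrector_steps` annealed-Langevin (`ald`) corrector updates whose step size is set via `snr`. A toy 1-D sketch of the general predictor-corrector scheme, using a standard-normal target whose score is known in closed form; this illustrates the idea only and is not the recipe's sampler:

```python
import torch


def pc_sample(score_fn, x, sigmas, corrector_steps=1, snr=0.5):
    """Toy VE-SDE predictor-corrector sampler.

    Predictor: one Euler-Maruyama reverse-diffusion step per noise level.
    Corrector: annealed Langevin dynamics with an snr-scaled step size.
    """
    for i in range(len(sigmas) - 1):
        s, s_next = sigmas[i], sigmas[i + 1]
        dvar = torch.clamp(s**2 - s_next**2, min=0.0)
        # Predictor step at noise level s
        x = x + dvar * score_fn(x, s) + torch.sqrt(dvar) * torch.randn_like(x)
        # Corrector step(s) at the new noise level s_next
        for _ in range(corrector_steps):
            g = score_fn(x, s_next)
            z = torch.randn_like(x)
            step = 2 * (snr * z.norm() / (g.norm() + 1e-12)) ** 2
            x = x + step * g + torch.sqrt(2 * step) * z
    return x


# Toy target N(0, 1): the perturbed score is -x / (1 + sigma**2), known exactly.
score = lambda x, s: -x / (1 + s**2)
sigmas = torch.linspace(0.5, 0.05, 31)  # N = 30 steps from sigma_max to sigma_min
out = pc_sample(score, torch.randn(1000), sigmas)
```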
