Commit 181e8df

Documentation fixes and removing/fixing hard-coded numbers
1 parent 3a7da12 commit 181e8df

3 files changed

Lines changed: 48 additions & 24 deletions

docs/tutorials/preprocessing/voice-analysis.ipynb

Lines changed: 37 additions & 11 deletions
@@ -25,7 +25,9 @@
 "source": [
 "# Analyzing Vocal Features for Pathology\n",
 "\n",
-"This notebook goes through a simple voice analysis of a few speech samples. First we download a public Parkinson's dataset and cut to just the sustained phonation."
+"This notebook goes through a simple voice analysis of a few speech samples. If you are new to speech feature extraction, we recommend reading through [Aalto Speech Processing Ch. 3 Basic Representations](https://speechprocessingbook.aalto.fi/Representations/Representations.html) before going through the notebook to understand the background and theory behind the signal processing techniques used here.\n",
+"\n",
+"As a sample vocalization for demonstration purposes, we first download a public sample from a person with Parkinson's disease and cut to just the sustained phonation."
 ]
 },
 {
@@ -155,10 +157,14 @@
 "source": [
 "## Compute autocorrelation and related features\n",
 "\n",
-"Autocorrelation is the cross-correlation of a signal with itself at various lags.\n",
-"For harmonic signals, there are peaks at regular lag intervals corresponding to the period.\n",
+"Autocorrelation is the cross-correlation of a signal with itself at each lag from min_lag to max_lag.\n",
+"For periodic/harmonic signals, there are peaks at regular lag intervals corresponding to the period.\n",
 "The autocorrelation ratio is the ratio of the strongest peak against the theoretical maximum\n",
-"which occurs when the lag is zero."
+"which occurs when the lag is zero.\n",
+"\n",
+"For animations which may be helpful for understanding the concept, see the following:\n",
+"* [https://tahull.github.io/blog/2020/08/acf-animated](https://tahull.github.io/blog/2020/08/acf-animated)\n",
+"* [https://github.com/chautruonglong/Fundamental-Frequency](https://github.com/chautruonglong/Fundamental-Frequency)"
 ]
 },
 {
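The autocorrelation ratio described in this hunk can be demonstrated with a minimal, torch-free sketch. All values here are illustrative (the notebook operates on framed audio rather than one long signal): a 100 Hz sine at 8 kHz should produce its strongest peak at a lag of 80 samples, and the ratio of that peak to the lag-zero value should approach 1 for a clean harmonic signal.

```python
import math

def autocorr(x, lag):
    """Cross-correlation of x with itself at the given lag."""
    return sum(x[i] * x[i + lag] for i in range(len(x) - lag))

# A 100 Hz sine sampled at 8 kHz has a period of 80 samples.
sr, f0 = 8000, 100
signal = [math.sin(2 * math.pi * f0 * n / sr) for n in range(2000)]

min_lag, max_lag = 20, 400  # search range for the period, in samples
scores = {lag: autocorr(signal, lag) for lag in range(min_lag, max_lag)}
best_lag = max(scores, key=scores.get)

# Autocorrelation ratio: strongest peak vs. the theoretical maximum at lag 0.
ratio = scores[best_lag] / autocorr(signal, 0)

print(best_lag)             # 80 samples, i.e. f0 = sr / best_lag = 100 Hz
print(round(ratio, 2))      # close to 1 for a clean periodic signal
```

For noisy or aperiodic signals the peak flattens and the ratio drops, which is why it works as a harmonicity measure.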
@@ -193,9 +199,11 @@
 "frames = audio.unfold(-1, window_samples, step_samples)\n",
 "autocorrelation = autocorrelate(frames)\n",
 "\n",
-"# Use autocorrelation to estimate harmonicity and best lags\n",
+"# Use autocorrelation maxima to estimate harmonicity and lags corresponding to period\n",
 "harmonicity, lags = autocorrelation[:, :, min_lag:max_lag].max(dim=-1)\n",
-"lags = torch.nn.functional.pad(lags, pad=(3, 3)) \n",
+"lags = torch.nn.functional.pad(lags, pad=(3, 3))\n",
+"\n",
+"# Take the median of 7 frames to avoid short octave jumps\n",
 "best_lags, _ = lags.unfold(-1, 7, 1).median(dim=-1)\n",
 "\n",
 "# Re-add the min_lag back in after previous step removed it\n",
@@ -212,7 +220,7 @@
 "xticks = (torch.arange(1, 7) / 2 / step_size).int().tolist()\n",
 "plt.xticks(xticks, xs[xticks].tolist())\n",
 "yticks = torch.linspace(0, max_lag - min_lag, 5).int()\n",
-"plt.yticks(yticks.tolist(), ((yticks + min_lag) / 441).numpy().round(decimals=2))\n",
+"plt.yticks(yticks.tolist(), ((yticks + min_lag) / step_samples).numpy().round(decimals=2))\n",
 "plt.show()\n",
 "\n",
 "# Show autocorrelation-based features, harmonicity (usually represented in log scale as HNR) and f0\n",
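The median-of-7 step added above (`unfold` + `median`) is easy to see in isolation. A plain-Python sketch with made-up lag values follows; note the notebook pads with zeros via `torch.nn.functional.pad`, whereas this sketch repeats edge values to keep the toy example simple.

```python
import statistics

# A lag track of 88 samples with two single-frame octave errors (44 and 176).
lags = [88, 88, 44, 88, 88, 88, 176, 88, 88]

# Pad 3 frames on each side so every frame gets a full 7-frame window.
pad = [lags[0]] * 3 + lags + [lags[-1]] * 3

# Running median over 7 neighbouring frames discards the brief jumps.
smoothed = [statistics.median(pad[i:i + 7]) for i in range(len(lags))]
print(smoothed)   # every frame comes out as 88: the octave jumps are gone
```

A single wrong frame can never be the median of a 7-frame window, so isolated halving/doubling errors in the f0 track are suppressed without smearing the overall contour.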
@@ -505,7 +513,20 @@
 "source": [
 "## Compute GNE step-by-step\n",
 "\n",
-"The algorithm is best described in \"The Effectiveness of the Glottal to Noise Excitation Ratio for the Screening of Voice Disorders\" by Godino-Llorente et al.\n"
+"An algorithm for GNE computation from the original paper:\n",
+"\n",
+"\"Glottal-to-Noise Excitation Ratio - a New Measure for Describing\n",
+"Pathological Voices\" by D. Michaelis, T. Gramss, and H. W. Strube.\n",
+"\n",
+"This algorithm divides the signal into frequency bands, and compares\n",
+"the correlation between the bands. High correlation indicates a\n",
+"relatively low amount of noise in the signal, whereas lower correlation\n",
+"could be a sign of pathology in the vocal signal.\n",
+"\n",
+"Godino-Llorente et al. in \"The Effectiveness of the Glottal to Noise\n",
+"Excitation Ratio for the Screening of Voice Disorders\" explore the\n",
+"goodness of the bandwidth and frequency shift parameters, and write out\n",
+"a clear description of how to compute the measure, used here."
 ]
 },
 {
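The intuition behind GNE can be shown with a toy sketch. This is not the algorithm itself (no inverse filtering, band-splitting, or Hilbert envelopes): it only illustrates that bands driven by one shared excitation envelope correlate strongly, while additive noise lowers that correlation. All signals and parameters below are made up for the demonstration.

```python
import math
import random

random.seed(0)

# A shared "excitation envelope" that, in voiced speech, drives all bands.
envelope = [1 + math.sin(2 * math.pi * n / 50) for n in range(500)]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

clean_band = [e * 0.8 for e in envelope]                   # same shape, scaled
noisy_band = [e + random.gauss(0, 2.0) for e in envelope]  # noise dominates

print(round(pearson(envelope, clean_band), 2))   # ~1.0: a scaled copy
print(round(pearson(envelope, noisy_band), 2))   # well below 1
```

In the real measure, the maximum of such cross-band envelope correlations (over band pairs with sufficient centre-frequency separation) is the GNE score.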
@@ -700,7 +721,10 @@
 "source": [
 "## PRAAT-Parselmouth\n",
 "\n",
-"We'll run a similar analysis to verify that our numbers look accurate."
+"The following is a side-by-side analysis with PRAAT, a commonly used voice analysis tool, to verify that our numbers look accurate. To read more about PRAAT and Parselmouth, see here:\n",
+"\n",
+"* [https://www.fon.hum.uva.nl/praat/](https://www.fon.hum.uva.nl/praat/)\n",
+"* [https://parselmouth.readthedocs.io/en/stable/](https://parselmouth.readthedocs.io/en/stable/)"
 ]
 },
 {
@@ -868,7 +892,9 @@
 "source": [
 "## Comparison with OpenSMILE\n",
 "\n",
-"Unlike PRAAT, we can do a frame-by-frame comparison with OpenSMILE."
+"Unlike PRAAT, we can do a frame-by-frame comparison with OpenSMILE, which is helpful for further verification of our approach.\n",
+"\n",
+"* [https://www.audeering.com/opensmile/](https://www.audeering.com/opensmile/)"
 ]
 },
 {
@@ -1276,7 +1302,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.12.7"
+"version": "3.13.1"
 }
 },
 "nbformat": 4,

speechbrain/lobes/features.py

Lines changed: 10 additions & 9 deletions
@@ -34,8 +34,6 @@
 from speechbrain.utils.autocast import fwd_default_precision
 from speechbrain.utils.filter_analysis import FilterProperties
 
-VOICE_EPSILON = 1e-3
-
 
 class Fbank(torch.nn.Module):
     """Generate features for input to the speech pipeline.
@@ -695,12 +693,17 @@ class VocalFeatures(torch.nn.Module):
     sample_rate: int
         The number of samples in a second.
     log_scores: bool
-        Whether to represent the jitter/shimmer/hnr on a log scale.
+        Whether to represent the jitter/shimmer/hnr/gne on a log scale,
+        as these features are typically close to zero.
     eps: float
         The minimum value before log transformation, default of
         1e-3 results in a maximum value of 30 dB.
     sma_neighbors: int
         Number of frames to average -- default 3
+    n_mels: int (default: 23)
+        Number of filters to use for creating the filterbank.
+    n_mfcc: int (default: 4)
+        Number of output coefficients.
 
     Example
     -------
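The "1e-3 results in a maximum value of 30 dB" claim in the `eps` docstring can be sanity-checked with one line, assuming the transform is a `10 * log10` of a ratio whose denominator is floored at `eps` (the exact form used in the library may differ):

```python
import math

eps = 1e-3
# Flooring the denominator of a ratio at eps caps the dB value at -10*log10(eps).
max_db = -10 * math.log10(eps)
print(max_db)   # 30.0
```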
@@ -721,6 +724,8 @@ def __init__(
         log_scores: bool = True,
         eps: float = 1e-3,
         sma_neighbors: int = 3,
+        n_mels: int = 23,
+        n_mfcc: int = 4,
     ):
         super().__init__()

@@ -738,7 +743,6 @@
             self.max_lag * PERIODIC_NEIGHBORS <= self.window_samples
         ), f"Need at least {PERIODIC_NEIGHBORS} periods in a window"
 
-        n_mels, n_mfcc = 23, 4
         self.compute_fbanks = Filterbank(
             sample_rate=sample_rate,
             n_fft=self.window_samples,
@@ -760,7 +764,7 @@ def forward(self, audio: torch.Tensor):
         Returns
         -------
         features: torch.Tensor
-            A [batch, frame, 17] tensor with the following features per-frame.
+            A [batch, frame, 13+n_mfcc] tensor with the following features per-frame.
             * autocorr_f0: A per-frame estimate of the f0 in Hz.
             * autocorr_hnr: harmonicity-to-noise ratio for each frame.
             * periodic_jitter: Average deviation in period length.
@@ -774,10 +778,7 @@
             * spectral_flatness: The ratio of geometric mean to arithmetic mean.
             * spectral_crest: The ratio of spectral maximum to arithmetic mean.
             * spectral_flux: The 2-normed diff between successive spectral values.
-            * mfcc_0: The first mel cepstral coefficient.
-            * mfcc_1: The second mel cepstral coefficient.
-            * mfcc_2: The third mel cepstral coefficient.
-            * mfcc_3: The fourth mel cepstral coefficient.
+            * mfcc_0 through mfcc_{n_mfcc-1}: The first n_mfcc mel cepstral coefficients.
         """
         assert (
             audio.dim() == 2

speechbrain/processing/vocal_features.py

Lines changed: 1 addition & 4 deletions
@@ -11,8 +11,6 @@
 import torch
 import torchaudio
 
-# Minimum value for log measures, results in max of 30dB
-EPSILON = 10**-3
 PERIODIC_NEIGHBORS = 4

@@ -285,8 +283,7 @@ def compute_gne(
     Godino-Llorente et al. in "The Effectiveness of the Glottal to Noise
     Excitation Ratio for the Screening of Voice Disorders." explore the
     goodness of the bandwidth and frequency shift parameters, the defaults
-    here are the ones recommended in that work. They also suggest using
-    log( 1 - GNE ), which they called GNE_L as the final score, as done here.
+    here are the ones recommended in that work.
 
     Arguments
     ---------
