Commit 181e8df

Documentation fixes and removing/fixing hard-coded numbers
1 parent 3a7da12 commit 181e8df

3 files changed

Lines changed: 48 additions & 24 deletions

docs/tutorials/preprocessing/voice-analysis.ipynb

Lines changed: 37 additions & 11 deletions
@@ -25,7 +25,9 @@
 "source": [
 "# Analyzing Vocal Features for Pathology\n",
 "\n",
-"This notebook goes through a simple voice analysis of a few speech samples. First we download a public Parkinson's dataset and cut to just the sustained phonation."
+"This notebook goes through a simple voice analysis of a few speech samples. If you are new to speech feature extraction, we recommend reading through [Aalto Speech Processing Ch. 3 Basic Representations](https://speechprocessingbook.aalto.fi/Representations/Representations.html) before going through the notebook to understand the background and theory behind the signal processing techniques used here.\n",
+"\n",
+"As a sample vocalization for demonstration purposes, we first download a public sample from a person with Parkinson's disease and cut to just the sustained phonation."
 ]
 },
 {
@@ -155,10 +157,14 @@
 "source": [
 "## Compute autocorrelation and related features\n",
 "\n",
-"Autocorrelation is the cross-correlation of a signal with itself at various lags.\n",
-"For harmonic signals, there are peaks at regular lag intervals corresponding to the period.\n",
+"Autocorrelation is the cross-correlation of a signal with itself at each lag from min_lag to max_lag.\n",
+"For periodic/harmonic signals, there are peaks at regular lag intervals corresponding to the period.\n",
 "The autocorrelation ratio is the ratio of the strongest peak against the theoretical maximum\n",
-"which occurs when the lag is zero."
+"which occurs when the lag is zero.\n",
+"\n",
+"For animations which may be helpful for understanding the concept, see the following:\n",
+"* [https://tahull.github.io/blog/2020/08/acf-animated](https://tahull.github.io/blog/2020/08/acf-animated)\n",
+"* [https://github.com/chautruonglong/Fundamental-Frequency](https://github.com/chautruonglong/Fundamental-Frequency)"
 ]
 },
 {
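The autocorrelation ratio described in this hunk can be demonstrated with a minimal, torch-free sketch. All values here are illustrative (the notebook operates on framed audio rather than one long signal): a 100 Hz sine at 8 kHz should produce its strongest peak at a lag of 80 samples, and the ratio of that peak to the lag-zero value should approach 1 for a clean harmonic signal.

```python
import math

def autocorr(x, lag):
    """Cross-correlation of x with itself at the given lag."""
    return sum(x[i] * x[i + lag] for i in range(len(x) - lag))

# A 100 Hz sine sampled at 8 kHz has a period of 80 samples.
sr, f0 = 8000, 100
signal = [math.sin(2 * math.pi * f0 * n / sr) for n in range(2000)]

min_lag, max_lag = 20, 400  # search range for the period, in samples
scores = {lag: autocorr(signal, lag) for lag in range(min_lag, max_lag)}
best_lag = max(scores, key=scores.get)

# Autocorrelation ratio: strongest peak vs. the theoretical maximum at lag 0.
ratio = scores[best_lag] / autocorr(signal, 0)

print(best_lag)             # 80 samples, i.e. f0 = sr / best_lag = 100 Hz
print(round(ratio, 2))      # close to 1 for a clean periodic signal
```

For noisy or aperiodic signals the peak flattens and the ratio drops, which is why it works as a harmonicity measure.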
@@ -193,9 +199,11 @@
 "frames = audio.unfold(-1, window_samples, step_samples)\n",
 "autocorrelation = autocorrelate(frames)\n",
 "\n",
-"# Use autocorrelation to estimate harmonicity and best lags\n",
+"# Use autocorrelation maxima to estimate harmonicity and lags corresponding to period\n",
 "harmonicity, lags = autocorrelation[:, :, min_lag:max_lag].max(dim=-1)\n",
-"lags = torch.nn.functional.pad(lags, pad=(3, 3)) \n",
+"lags = torch.nn.functional.pad(lags, pad=(3, 3))\n",
+"\n",
+"# Take the median of 7 frames to avoid short octave jumps\n",
 "best_lags, _ = lags.unfold(-1, 7, 1).median(dim=-1)\n",
 "\n",
 "# Re-add the min_lag back in after previous step removed it\n",
@@ -212,7 +220,7 @@
 "xticks = (torch.arange(1, 7) / 2 / step_size).int().tolist()\n",
 "plt.xticks(xticks, xs[xticks].tolist())\n",
 "yticks = torch.linspace(0, max_lag - min_lag, 5).int()\n",
-"plt.yticks(yticks.tolist(), ((yticks + min_lag) / 441).numpy().round(decimals=2))\n",
+"plt.yticks(yticks.tolist(), ((yticks + min_lag) / step_samples).numpy().round(decimals=2))\n",
 "plt.show()\n",
 "\n",
 "# Show autocorrelation-based features, harmonicity (usually represented in log scale as HNR) and f0\n",
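The median-of-7 step added above (`unfold` + `median`) is easy to see in isolation. A plain-Python sketch with made-up lag values follows; note the notebook pads with zeros via `torch.nn.functional.pad`, whereas this sketch repeats edge values to keep the toy example simple.

```python
import statistics

# A lag track of 88 samples with two single-frame octave errors (44 and 176).
lags = [88, 88, 44, 88, 88, 88, 176, 88, 88]

# Pad 3 frames on each side so every frame gets a full 7-frame window.
pad = [lags[0]] * 3 + lags + [lags[-1]] * 3

# Running median over 7 neighbouring frames discards the brief jumps.
smoothed = [statistics.median(pad[i:i + 7]) for i in range(len(lags))]
print(smoothed)   # every frame comes out as 88: the octave jumps are gone
```

A single wrong frame can never be the median of a 7-frame window, so isolated halving/doubling errors in the f0 track are suppressed without smearing the overall contour.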
@@ -505,7 +513,20 @@
 "source": [
 "## Compute GNE step-by-step\n",
 "\n",
-"The algorithm is best described in \"The Effectiveness of the Glottal to Noise Excitation Ratio for the Screening of Voice Disorders\" by Godino-Llorente et al.\n"
+"An algorithm for GNE computation from the original paper:\n",
+"\n",
+"\"Glottal-to-Noise Excitation Ratio - a New Measure for Describing\n",
+"Pathological Voices\" by D. Michaelis, T. Gramss, and H. W. Strube.\n",
+"\n",
+"This algorithm divides the signal into frequency bands, and compares\n",
+"the correlation between the bands. High correlation indicates a\n",
+"relatively low amount of noise in the signal, whereas lower correlation\n",
+"could be a sign of pathology in the vocal signal.\n",
+"\n",
+"Godino-Llorente et al. in \"The Effectiveness of the Glottal to Noise\n",
+"Excitation Ratio for the Screening of Voice Disorders\" explore the\n",
+"goodness of the bandwidth and frequency shift parameters, and write out\n",
+"a clear description of how to compute the measure, used here."
 ]
 },
 {
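The intuition behind GNE can be shown with a toy sketch. This is not the algorithm itself (no inverse filtering, band-splitting, or Hilbert envelopes): it only illustrates that bands driven by one shared excitation envelope correlate strongly, while additive noise lowers that correlation. All signals and parameters below are made up for the demonstration.

```python
import math
import random

random.seed(0)

# A shared "excitation envelope" that, in voiced speech, drives all bands.
envelope = [1 + math.sin(2 * math.pi * n / 50) for n in range(500)]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

clean_band = [e * 0.8 for e in envelope]                   # same shape, scaled
noisy_band = [e + random.gauss(0, 2.0) for e in envelope]  # noise dominates

print(round(pearson(envelope, clean_band), 2))   # ~1.0: a scaled copy
print(round(pearson(envelope, noisy_band), 2))   # well below 1
```

In the real measure, the maximum of such cross-band envelope correlations (over band pairs with sufficient centre-frequency separation) is the GNE score.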
@@ -700,7 +721,10 @@
 "source": [
 "## PRAAT-Parselmouth\n",
 "\n",
-"We'll run a similar analysis to verify that our numbers look accurate."
+"The following is a side-by-side analysis with PRAAT, a commonly used voice analysis tool, to verify that our numbers look accurate. To read more about PRAAT and Parselmouth, see here:\n",
+"\n",
+"* [https://www.fon.hum.uva.nl/praat/](https://www.fon.hum.uva.nl/praat/)\n",
+"* [https://parselmouth.readthedocs.io/en/stable/](https://parselmouth.readthedocs.io/en/stable/)"
 ]
 },
 {
@@ -868,7 +892,9 @@
 "source": [
 "## Comparison with OpenSMILE\n",
 "\n",
-"Unlike PRAAT, we can do a frame-by-frame comparison with OpenSMILE."
+"Unlike PRAAT, we can do a frame-by-frame comparison with OpenSMILE, which is helpful for further verification of our approach.\n",
+"\n",
+"* [https://www.audeering.com/opensmile/](https://www.audeering.com/opensmile/)"
 ]
 },
 {
@@ -1276,7 +1302,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.12.7"
+"version": "3.13.1"
 }
 },
 "nbformat": 4,

speechbrain/lobes/features.py

Lines changed: 10 additions & 9 deletions
@@ -34,8 +34,6 @@
 from speechbrain.utils.autocast import fwd_default_precision
 from speechbrain.utils.filter_analysis import FilterProperties
 
-VOICE_EPSILON = 1e-3
-
 
 class Fbank(torch.nn.Module):
     """Generate features for input to the speech pipeline.
@@ -695,12 +693,17 @@ class VocalFeatures(torch.nn.Module):
     sample_rate: int
         The number of samples in a second.
     log_scores: bool
-        Whether to represent the jitter/shimmer/hnr on a log scale.
+        Whether to represent the jitter/shimmer/hnr/gne on a log scale,
+        as these features are typically close to zero.
     eps: float
         The minimum value before log transformation, default of
         1e-3 results in a maximum value of 30 dB.
     sma_neighbors: int
         Number of frames to average -- default 3
+    n_mels: int (default: 23)
+        Number of filters to use for creating the filterbank.
+    n_mfcc: int (default: 4)
+        Number of output coefficients.
 
     Example
     -------
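The "1e-3 results in a maximum value of 30 dB" claim in the `eps` docstring can be sanity-checked with one line, assuming the transform is a `10 * log10` of a ratio whose denominator is floored at `eps` (the exact form used in the library may differ):

```python
import math

eps = 1e-3
# Flooring the denominator of a ratio at eps caps the dB value at -10*log10(eps).
max_db = -10 * math.log10(eps)
print(max_db)   # 30.0
```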
@@ -721,6 +724,8 @@ def __init__(
         log_scores: bool = True,
         eps: float = 1e-3,
         sma_neighbors: int = 3,
+        n_mels: int = 23,
+        n_mfcc: int = 4,
     ):
         super().__init__()

@@ -738,7 +743,6 @@
             self.max_lag * PERIODIC_NEIGHBORS <= self.window_samples
         ), f"Need at least {PERIODIC_NEIGHBORS} periods in a window"
 
-        n_mels, n_mfcc = 23, 4
         self.compute_fbanks = Filterbank(
             sample_rate=sample_rate,
             n_fft=self.window_samples,
@@ -760,7 +764,7 @@ def forward(self, audio: torch.Tensor):
         Returns
         -------
         features: torch.Tensor
-            A [batch, frame, 17] tensor with the following features per-frame.
+            A [batch, frame, 13+n_mfcc] tensor with the following features per-frame.
             * autocorr_f0: A per-frame estimate of the f0 in Hz.
             * autocorr_hnr: harmonicity-to-noise ratio for each frame.
             * periodic_jitter: Average deviation in period length.
@@ -774,10 +778,7 @@
             * spectral_flatness: The ratio of geometric mean to arithmetic mean.
             * spectral_crest: The ratio of spectral maximum to arithmetic mean.
             * spectral_flux: The 2-normed diff between successive spectral values.
-            * mfcc_0: The first mel cepstral coefficient.
-            * mfcc_1: The second mel cepstral coefficient.
-            * mfcc_2: The third mel cepstral coefficient.
-            * mfcc_3: The fourth mel cepstral coefficient.
+            * mfcc_0 through mfcc_{n_mfcc-1}: The first n_mfcc mel cepstral coefficients.
         """
         assert (
             audio.dim() == 2

speechbrain/processing/vocal_features.py

Lines changed: 1 addition & 4 deletions
@@ -11,8 +11,6 @@
 import torch
 import torchaudio
 
-# Minimum value for log measures, results in max of 30dB
-EPSILON = 10**-3
 PERIODIC_NEIGHBORS = 4

@@ -285,8 +283,7 @@ def compute_gne(
     Godino-Llorente et al. in "The Effectiveness of the Glottal to Noise
     Excitation Ratio for the Screening of Voice Disorders." explore the
     goodness of the bandwidth and frequency shift parameters, the defaults
-    here are the ones recommended in that work. They also suggest using
-    log( 1 - GNE ), which they called GNE_L as the final score, as done here.
+    here are the ones recommended in that work.
 
     Arguments
     ---------
