Speech-Emotion-Recognition

Problem Statement:

Why: Communication lost in technology. In the waves of the AI technological tsunami, many of the nuances of human communication are diminished as we translate our beings into computers.

93% of communication in nonverbal and 38% of that is noverbal but is audatory. Source

Ideally stacked with video image recognition of body language and computers could become more human communicators.

In order for future technologies such as Agentic Agents to be truly effective and possibly ease the burden for the crucial roles of nurses, caretakers, teachers ect agentic agents will need to be able to 'intuit' its audiences emotions inorder to best direct itself, call for help and know when it's up to the task at hand (not miss crucial cues!) Thus the refinement of models that are up to the task of emotion recognition is crucial for the developing landscape of human-computer interaction.

Other applications/implications: agentic-agents, call centers, customer churn, healthcare service providers...

Goal: To accurately assess the emotional state of speakers in audio recordings.

How: Using the librosa library standardize and extract features voice signals from a dataset created for SER modeling and then build, train and test the SER model.

Deliverable: MVP - A functioning LSTM model

Outline

Problem statement

Description

Installation

Contents

Data Sources

Code Structure

Results and Evaluation

Future Work

Acknowledgements & References

Licenses

Description

The aim of this project is to construct and employ a Long Short Term Memory (LSTM) classification model for Speech Emotion Recognition (SER) trained on a hybrid dataset.

Installation

Please begin by opening 01_EDA_Dataset_prep_firstModel.ipynb with Colab

Follow the instructions at the beginning of each notebook. Alternatively install all libraries in the requirements.txt file in your terminal and then proceed with only the imports and file uploads as outlined in each notebook.

.
├── ```
├── 01_EDA_Dataset_prep_firstModel.ipynb	
├── 02_LSTM1-4.ipynb			
├── images/
│   ├── EDA/
│   │   ├── Count of Samples per Emotion Label.png                          Example_1_Male_Actor_1_Fourier_Spectogram_Neutral.png
│   │   ├── Example_1_Male_ChromaSTFT_Emotion_ Neutral.png
│   │   ├── Example_1_Male_Mel_Spectogram_Neutral.png
│   │   ├── Example_1_Male_Spectogram_Emotion_ Neutral.png
│   │   ├── Example_2_Female_ChromaSTFT_Suprise.png
│   │   ├── Example_2_Female_Fourier_Spectogram_Suprise.png
│   │   ├── Example_2_Female_Mel_Spectogram_Suprise.png
│   │   ├── Example_2_Female_Spectogram_Suprise.png
│   │   ├── Example_3_Male_Mel_Spectogram_Sad.png
│   │   ├── Example_3_Male_Spectogram_Sad.png
│   │   ├── Example_4_Male_Mel_Spectogram_Happy.png
│   │   └── Example_4_Male_Spectogram_Happy.png
│   └── VAL/
│       ├── Confusion_Matrix_1sr_LSTM.png	
│       ├── LSTM3_Accuracy.png
│       ├── LSTM1_Accuracy.png		
│       ├── LSTM3_Loss.png
│       ├── LSTM1_Loss.png
│       ├── LSTM4_Accuracy.png			    
│       ├── LSTM2_Accuracy.png		
│       ├── LSTM4_Loss.png
│       └── LSTM2_Loss.png
├── LICENSE	  
├── models/
│   ├── lstm_model1.h5		
│   ├── lstm1_model.h5		
│   ├── lstm4_model.h5
│   ├── lstm_model2.h5		
│   ├── lstm3_model.h5		
│   └── my_lstm_model.h5
├── prepped_data/
│   ├── av-angry.m4a			
│   ├── ser-labels-paths (1).csv
│   └── mfccs3_data.npy
├── README.md
├── requirements.txt
└── Speech_emotion_recognition_presentation.pdf

Data and Sources

Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

Toronto emotional speech set (TESS)

Surrey Audio-Visual Expressed Emotion (SAVEE)

Crowd Sourced Emotional Multimodal Actors Dataset (CREMA-D)

Librosa Documentation

tree.nathanfriend.com

Code Structure

The first notebook called 01_EDA_Dataset_prep_firstModel is structured in the following fashion: Imports: import these as they are required for the notebook to run. If needed, un-comment installations and install as well. You may need to restart your notebook after installation but not imports.

Import the dataset. Imported single files to show some different types of extractions that are possible with this type of data and necessary to transform them for modeling. Example files illustrated the actual audio file for listening, a spectogram, a ChromaSTFT, a mel spectogram and a fourier-spectogram.

Next data labels were extracted. The labels were encoded into the file names, differently for each dataset. This required a specific function for each dataset. At the same time we concatenated the labels and files paths, which are later used to locate the files during extraction, onto lists which would then compose our dataframe.

Next a calculation of the average, min and max length of each data file was calculated and returned to determine the best length, padding and trimming for the normalization of the data. Later it was discovered that this was not necessary as librosa.load handles a lot of this work under the hood.

Analysis and visualization of our label classes.

Performed feature extraction of Mel-Frequency Cepstral Coefficients (MFCCs) as numerical features representing the spectral shape of sound.

Reshaped data

Encoded labels

Train test split

Built, trained and fitted on test data: Model the LSTM: model_LSTM, 'first_lstm_model.h5 - this was the best performing model.

Stored predictions

Visualized predictions accuracy and loss against testing data

Imported a new raw audio file for classification, put through model which misclassified the audio file.

Results and Evaluation

As can be see in Count_of_Samples_per_Emotion.png the classes are unbalanced. It is important to note that despite stratifying in the train-test-split, the class imbalanaces negativley impacted performance.

The first model: model_LSTM, 'first_lstm_model.h5 - performed the best. It had the highest Accuracy to Validation Loss (our key metrics) ratio and also the best %'s of those metrics. The score was 67% Accuracy and 1.22 Validation Loss (66% and 1.15 Validation Loss if the model was retrained and stopped at its best Epoch 90). Most of the tuned models, even with early stopping added, did not improve in performance. This is clearly illustrated by the accuracy test/pred graphics as well as the loss test/pred graphic contained both in the presentation pdf and the images folder, which were created for each iteration of the model.

Future Work

Decide whether to balance the data set by finding, imputing more values for or dropping the calm, suprised and neutral label categories.

Further tuning of the current model could help the user familiarize themself with the process of building an LSTM and iterating on it for performance improvements.

Both early stopping and LS regularization were implemented to improve performance but in fact both worsened the models performance this indicates it is advisable to use a simpler model.

Accessing other possible standardisations techniques, feature extraction and tuning methods would be the next step.

Utilize pretrained models such as Whisper, WavLM, and Wav2Vec 2.0, which can be fine-tuned for SER tasks. Any of these hould return greater results as they have been trained on large datasets and have achieved very high accuracy with SER modeling.

Transfer learning (TL) for Speaker Emotion Recognition (SER) involves taking a pre-trained machine learning model, typically trained on a related task like general audio understanding or a large-scale SER dataset, and fine-tuning it on a smaller, specific SER dataset.

Acknowledgments & References

Many thanks to contributors to the source material. Many thanks to the instructors and staff at GA.

Research paper outlining the functionality, primarily signal processing and feature extraction, and use cases of the audio interpretation library Limbros: Speech Emotion Recognition Using Librosa A wonderful article by Rohit Bohra outlining a basic potential workflow for Emotion Detection in audio using Python — Part 1 A helpful note: Review of many articles on Medium and repos on Github all of which had a unique approach and sometimes similar struggles, were very helpful in understanding workflow and potential routes. It is highly recommended to browse both of these sources when partaking in any modeling endeavor!

Licenses

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)" by Livingstone & Russo is licensed under CC BY-NA-SC 4.0. Here is a research paper describing the data set:The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English Toronto emotional speech database license: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) CREMA-D license SURREY SAVEE

(did not implement but is in futurework notebook) model1: @misc{speech-emotion-recognition, author = {JagjeevanAK}, title = {Speech Emotion Recognition Model}, year = {2025}, publisher = {Hugging Face}, url = {https://huggingface.co/JagjeevanAK/Speech-emotion-detection} }

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
images		images
models		models
prepped_data		prepped_data
.DS_Store		.DS_Store
.gitattributes		.gitattributes
.gitignore		.gitignore
.python-version		.python-version
01_EDA_Dataset_prep_firstModel.ipynb		01_EDA_Dataset_prep_firstModel.ipynb
02_LSTM1-4.ipynb		02_LSTM1-4.ipynb
Count_of_Samples_per_Emotion.png		Count_of_Samples_per_Emotion.png
Example_1_Male_Actor_1_Fourier_Spectrogram_Neutral.png		Example_1_Male_Actor_1_Fourier_Spectrogram_Neutral.png
Example_1_Male_ChromaSTFT_Emotion: Neutral.png		Example_1_Male_ChromaSTFT_Emotion: Neutral.png
Example_1_Male_ChromaSTFT_Neutral.png		Example_1_Male_ChromaSTFT_Neutral.png
Example_1_Male_Fourier_Spectrogram_Neutral.png		Example_1_Male_Fourier_Spectrogram_Neutral.png
Example_1_Male_Mel_Spectrogram_Neutral.png		Example_1_Male_Mel_Spectrogram_Neutral.png
Example_1_Male_Spectrogram_Emotion: Neutral.png		Example_1_Male_Spectrogram_Emotion: Neutral.png
Example_1_Male_Spectrogram_Neutral.png		Example_1_Male_Spectrogram_Neutral.png
Example_2_Female_ChromaSTFT_Suprise.png		Example_2_Female_ChromaSTFT_Suprise.png
Example_2_Female_ChromaSTFT_Surprise.png		Example_2_Female_ChromaSTFT_Surprise.png
Example_2_Female_Fourier_Spectrogram_Suprise.png		Example_2_Female_Fourier_Spectrogram_Suprise.png
Example_2_Female_Fourier_Spectrogram_Surprise.png		Example_2_Female_Fourier_Spectrogram_Surprise.png
Example_2_Female_Mel_Spectrogram_Suprise.png		Example_2_Female_Mel_Spectrogram_Suprise.png
Example_2_Female_Mel_Spectrogram_Surprise.png		Example_2_Female_Mel_Spectrogram_Surprise.png
Example_2_Female_Spectrogram_Suprise.png		Example_2_Female_Spectrogram_Suprise.png
Example_2_Female_Spectrogram_Surprise.png		Example_2_Female_Spectrogram_Surprise.png
LICENSE		LICENSE
README.md		README.md
Speech_emotion_recognition_presentation.pdf		Speech_emotion_recognition_presentation.pdf
confusion_matrix.png		confusion_matrix.png
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Speech-Emotion-Recognition

Problem Statement:

Outline

Description

Installation

Contents

Data and Sources

Code Structure

Results and Evaluation

Future Work

Acknowledgments & References

Licenses

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Speech-Emotion-Recognition

Problem Statement:

Outline

Description

Installation

Contents

Data and Sources

Code Structure

Results and Evaluation

Future Work

Acknowledgments & References

Licenses

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages