Research Article

AI-Driven Speech Emotion Detection: A Systematic Approach to Voice-based Sentiment Analysis

by Srijen Mishra, Syed Wajahat Abbas Rizvi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 5
Year of Publication: 2025
Authors: Srijen Mishra, Syed Wajahat Abbas Rizvi
DOI: 10.5120/ijca2025924877

Srijen Mishra, Syed Wajahat Abbas Rizvi. AI-Driven Speech Emotion Detection: A Systematic Approach to Voice-based Sentiment Analysis. International Journal of Computer Applications 187, 5 (May 2025), 43-48. DOI=10.5120/ijca2025924877

@article{10.5120/ijca2025924877,
author = {Srijen Mishra and Syed Wajahat Abbas Rizvi},
title = {AI-Driven Speech Emotion Detection: A Systematic Approach to Voice-based Sentiment Analysis},
journal = {International Journal of Computer Applications},
issue_date = {May 2025},
volume = {187},
number = {5},
month = {May},
year = {2025},
issn = {0975-8887},
pages = {43-48},
numpages = {6},
url = {https://ijcaonline.org/archives/volume187/number5/ai-driven-speech-emotion-detection-a-systematic-approach-to-voice-based-sentiment-analysis/},
doi = {10.5120/ijca2025924877},
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Srijen Mishra
%A Syed Wajahat Abbas Rizvi
%T AI-Driven Speech Emotion Detection: A Systematic Approach to Voice-based Sentiment Analysis
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 5
%P 43-48
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This research presents a speech emotion recognition (SER) system utilizing deep learning techniques, specifically Long Short-Term Memory (LSTM) networks, to classify emotions from audio signals. The system leverages Mel-Frequency Cepstral Coefficients (MFCC) with delta and delta-delta features for robust temporal feature extraction. Two widely used emotional speech datasets, TESS and RAVDESS, were combined to enhance model generalization across diverse voices and expressions. The audio data was preprocessed to standardize sampling rates and durations, followed by MFCC feature extraction with mean pooling over time. The LSTM model, trained on the combined dataset, classifies seven emotion classes: angry, calm, disgust, fear, happy, sad, and surprise. The proposed system achieved high accuracy, demonstrating the effectiveness of temporal feature modeling in capturing emotional cues from speech. This study highlights the significance of deep learning in voice-based sentiment analysis, with potential applications in human-computer interaction, virtual assistants, and mental health monitoring.
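
The abstract outlines a three-stage pipeline: standardize each clip's sampling rate and duration, extract MFCCs with delta and delta-delta features mean-pooled over time, and classify the result with an LSTM. Below is a minimal sketch of that pipeline, assuming librosa for feature extraction and Keras for the network; the sampling rate, clip duration, layer sizes, and the reshaping of the pooled vector into an LSTM input are illustrative assumptions, not values taken from the paper.

    # Sketch of the SER pipeline described in the abstract.
    # All constants and the model architecture are assumptions for
    # illustration; the paper's exact settings are not reproduced here.
    import numpy as np
    import librosa
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense, Dropout

    EMOTIONS = ["angry", "calm", "disgust", "fear", "happy", "sad", "surprise"]
    SR, DURATION, N_MFCC = 22050, 3.0, 40  # assumed preprocessing constants

    def extract_features(path: str) -> np.ndarray:
        """Load a clip at a fixed rate/length, stack MFCC + delta +
        delta-delta, then mean-pool each coefficient over time."""
        y, _ = librosa.load(path, sr=SR, duration=DURATION)
        y = librosa.util.fix_length(y, size=int(SR * DURATION))  # pad short clips
        mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=N_MFCC)
        feats = np.vstack([mfcc,
                           librosa.feature.delta(mfcc),
                           librosa.feature.delta(mfcc, order=2)])
        return feats.mean(axis=1)  # mean pooling over the time axis -> (120,)

    def build_model(n_features: int = 3 * N_MFCC) -> Sequential:
        """LSTM classifier. The pooled vector is fed as a sequence of
        n_features one-dimensional steps, a common arrangement in SER
        tutorials; the paper's exact input shaping is an assumption."""
        model = Sequential([
            LSTM(128, input_shape=(n_features, 1)),
            Dropout(0.3),
            Dense(64, activation="relu"),
            Dense(len(EMOTIONS), activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

To train on the combined TESS+RAVDESS corpus under these assumptions, each clip's 120-dimensional pooled vector would be reshaped to (120, 1), stacked into an array X with integer labels y indexing EMOTIONS, and passed to model.fit(X, y).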

References
  1. Schuller, B., Steidl, S., & Batliner, A. (2009). The INTERSPEECH 2009 Emotion Challenge. Proceedings of Interspeech 2009, 312–315.
  2. Eyben, F., Wöllmer, M., & Schuller, B. (2010). Opensmile: The Munich versatile and fast open-source audio feature extractor. Proceedings of the 18th ACM International Conference on Multimedia, 1459–1462.
  3. Ververidis, D., & Kotropoulos, C. (2006). Emotional speech recognition: Resources, features, and methods. Speech Communication, 48(9), 1162–1181.
  4. Sahu, S., & Rao, K. S. (2018). Speech emotion recognition using DNN-HMM hybrid models. 2018 25th International Conference on Systems, Signals and Image Processing (IWSSIP), 1–4.
  5. Latif, S., Rana, R., Qadir, J., & Epps, J. (2019). Direct modelling of speech emotion from raw speech. Interspeech 2019, 3920–3924.
  6. Zhang, Z., Han, J., Deng, J., & Schuller, B. (2018). Leveraging adversarial learning for domain adaptation in speech emotion recognition. Interspeech 2018, 1116–1120.
  7. Chowdhury, R., Reza, S., & Hossain, M. S. (2021). Speech emotion recognition using LSTM network with hybrid feature extraction. IEEE Access, 9, 123479–123489.
  8. Trigeorgis, G., Nicolaou, M. A., Zafeiriou, S., & Schuller, B. W. (2016). Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5200–5204.
  9. Satt, A., Rozenberg, S., & Hoory, R. (2017). Efficient emotion recognition from speech using deep learning on spectrograms. Interspeech 2017, 1089–1093.
  10. Han, K., Yu, D., & Tashev, I. (2014). Speech emotion recognition using deep neural network and extreme learning machine. Interspeech 2014, 223–227.
  11. Xie, Z., Peng, S., & Li, W. (2020). Speech emotion recognition using MFCC and LSTM. 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), 16–20.
  12. Fayek, H. M., Lech, M., & Cavedon, L. (2017). Evaluating deep learning architectures for Speech Emotion Recognition. Neural Networks, 92, 60–68.
  13. Zhang, X., Zhao, J., & Lei, L. (2020). Speech emotion recognition based on CNN and BiLSTM. 2020 13th International Symposium on Computational Intelligence and Design (ISCID), 361–364.
  14. Neumann, M., & Vu, N. T. (2017). Attentive convolutional neural network based speech emotion recognition: A study on the IEMOCAP database. Interspeech 2017, 1263–1267.
  15. Wang, Y., & Guan, Y. (2021). A hybrid CNN-LSTM model for speech emotion recognition. 2021 International Joint Conference on Neural Networks (IJCNN), 1–7.
  16. Chen, S., & Zhao, G. (2020). Multi-modal speech emotion recognition using deep learning. IEEE Transactions on Multimedia, 22(7), 1923–1936.
  17. Tao, J., & Tan, T. (2005). Affective computing: A review. In International Conference on Affective Computing and Intelligent Interaction, 981–995.
Index Terms

Computer Science
Information Sciences

Keywords

Speech Emotion Recognition, LSTM, MFCC, Deep Learning, TESS, RAVDESS, Sentiment Analysis