Research Article

AI Powered Speech Recognition System using Wavelet Multi-Resolution Analysis with One-Dimensional CNN-LSTM

by Rekha S. Kotwal, Geetanjali Jindal
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 2
Year of Publication: 2025
Authors: Rekha S. Kotwal, Geetanjali Jindal
DOI: 10.5120/ijca2025924807

Rekha S. Kotwal, Geetanjali Jindal. AI Powered Speech Recognition System using Wavelet Multi-Resolution Analysis with One-Dimensional CNN-LSTM. International Journal of Computer Applications. 187, 2 (May 2025), 72-81. DOI=10.5120/ijca2025924807

@article{ 10.5120/ijca2025924807,
author = { Rekha S. Kotwal, Geetanjali Jindal },
title = { AI Powered Speech Recognition System using Wavelet Multi-Resolution Analysis with One-Dimensional CNN-LSTM },
journal = { International Journal of Computer Applications },
issue_date = { May 2025 },
volume = { 187 },
number = { 2 },
month = { May },
year = { 2025 },
issn = { 0975-8887 },
pages = { 72-81 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume187/number2/ai-powered-speech-recognition-system/ },
doi = { 10.5120/ijca2025924807 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Rekha S. Kotwal
%A Geetanjali Jindal
%T AI Powered Speech Recognition System using Wavelet Multi-Resolution Analysis with One-Dimensional CNN-LSTM
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 2
%P 72-81
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The objective of this work is to develop a deep learning (DL)-based speech emotion detection system that can identify and categorize emotional states such as happiness and sadness. To capture both spatial and temporal patterns in the audio input, the system uses mel-spectrogram features, which are processed by a hybrid model combining convolutional neural networks (CNNs) and long short-term memory networks (LSTMs). The efficacy of pre-trained models in this field is further demonstrated by fine-tuning the transformer-based Wav2Vec2 model for emotion classification. The proposed methods identify speech emotions accurately, making them useful for customer service, healthcare, and human-computer interaction applications.
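The wavelet multi-resolution analysis named in the title can be illustrated with a minimal sketch. The paper does not specify its wavelet family or decomposition depth, so the example below assumes the simplest case, a Haar wavelet, implemented in plain NumPy: each level splits the current approximation band into a coarser approximation and a detail band, and the resulting bands could then be fed as channels into a one-dimensional CNN-LSTM.

```python
import numpy as np

def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform:
    returns (approximation, detail) coefficient arrays."""
    x = np.asarray(signal, dtype=float)
    if len(x) % 2:                 # pad to even length by repeating the last sample
        x = np.append(x, x[-1])
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

def multiresolution(signal, levels=3):
    """Multi-resolution analysis: repeatedly decompose the
    approximation band, keeping each level's detail band.
    Returns [detail_1, ..., detail_L, approx_L]."""
    bands = []
    approx = np.asarray(signal, dtype=float)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        bands.append(detail)
    bands.append(approx)           # coarsest approximation last
    return bands
```

Each successive band is half the length of the previous one, so a three-level decomposition of a 16-sample frame yields detail bands of lengths 8, 4, and 2 plus a final 2-sample approximation. A production system would more likely use a library such as PyWavelets and a richer wavelet (e.g. Daubechies), but the structure of the analysis is the same.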

References
  1. M. Xu, F. Zhang, and W. Zhang, “Head fusion: Improving the accuracy and robustness of speech emotion recognition on the IEMOCAP and RAVDESS dataset,” IEEE Access, vol. 9, pp. 74539–74549, 2021.
  2. M. Farooq, F. Hussain, N. K. Baloch, F. R. Raja, H. Yu, and Y. B. Zikria, “Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network,” Sensors, vol. 20, art. no. 6008, 2020. https://doi.org/10.3390/s20216008
  3. K. Aghajani and I. E. P. Afrakoti, “Speech emotion recognition using scalogram based deep structure,” Int. J. Eng., vol. 33, no. 2, pp. 285–292, 2020.
  4. Mustaqeem and S. Kwon, “A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition,” Sensors, vol. 20, art. no. 183, 2020. https://doi.org/10.3390/s20010183
  5. D. Issa, M. F. Demirci, and A. Yazici, “Speech emotion recognition with deep convolutional neural networks,” Biomed. Signal Process. Control, vol. 59, art. no. 101894, 2020. https://doi.org/10.1016/j.bspc.2020.101894
  6. H. Sak, A. Senior, and F. Beaufays, “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition,” 2014, arXiv:1402.1128.
  7. K. K. Kishore and P. K. Satish, “Emotion recognition in speech using MFCC and wavelet features,” in Proc. IEEE 3rd Int. Adv. Comput. Conf., 2013, pp. 842–847.
  8. A. A. Alnuaim, M. Zakariah, P. K. Shukla, A. Alhadlaq, W. A. Hatamleh, H. Tarazi, R. Sureshbabu, and R. Ratna, “Human-Computer Interaction for Recognizing Speech Emotions Using Multilayer Perceptron Classifier,” Neural Comput. Appl., 2023.
  9. S. Upadhyay, V. Kumar, and R. Singh, “Cross-corpus Speech Emotion Recognition using Self-supervised Learning Models,” IEEE Trans. Affect. Comput., vol. 14, no. 2, pp. 489–500.
  10. W. Chen, J. Wu, Z. Zhang, and Y. Wang, “Deep learning-based speech emotion recognition with multi-scale feature fusion,” Neural Networks, vol. 136, pp. 20–30.
  11. B. Liu, J. Tao, Z. Lian, and Z. Wen, “Exploiting Label Dependency for Speech Emotion Recognition Using Graph Neural Networks,” IEEE Trans. Affect. Comput., vol. 13, no. 4, pp. 1849–1862.
  12. Y. Yang et al., “Attention-based Convolutional Recurrent Neural Networks for Speech Emotion Recognition,” IEEE Trans. Affect. Comput., vol. 13, no. 2, pp. 1016–1027.
  13. W.-C. Tsai et al., “Multimodal Speech Emotion Recognition with Transformer-based Audio-Text Fusion,” in Proc. Interspeech 2022, pp. 2340–2344.
  14. L. Feng et al., “Contrastive Learning for Speech Emotion Recognition,” in Proc. IEEE ICASSP, pp. 11201–11205.
Index Terms

Computer Science
Information Sciences

Keywords

Mel-spectrogram, deep learning, speech emotion detection, CNN, LSTM, Wav2Vec2, emotion classification, human-computer interaction.