
Image and Signal Processing of Mel-Spectrograms in Isolated Speech Recognition

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2021
Atharva Bankar, Aryan Gandhi, Dipali Baviskar

Atharva Bankar, Aryan Gandhi and Dipali Baviskar. Image and Signal Processing of Mel-Spectrograms in Isolated Speech Recognition. International Journal of Computer Applications 183(25):11-17, September 2021.

BibTeX

	author = {Atharva Bankar and Aryan Gandhi and Dipali Baviskar},
	title = {Image and Signal Processing of Mel-Spectrograms in Isolated Speech Recognition},
	journal = {International Journal of Computer Applications},
	issue_date = {September 2021},
	volume = {183},
	number = {25},
	month = {Sep},
	year = {2021},
	issn = {0975-8887},
	pages = {11-17},
	numpages = {7},
	url = {},
	doi = {10.5120/ijca2021921625},
	publisher = {Foundation of Computer Science (FCS), NY, USA},
	address = {New York, USA}


Speech is one of the fundamental modes of communication. Over the past decade, speech recognition systems have advanced considerably. The basic idea behind these systems is the conversion of acoustic waveforms into human-readable text. In this paper, an automatic speech recognition (speech-to-text) system is modelled that recognizes isolated words, one at a time. Word predictions are made using two methods: Image Processing and Signal Processing. The paper outlines the design of such a system and gives an overview of the techniques used at each stage of speech recognition. It also compares the two methods on the basis of accuracy and computation time. The techniques showcased in this study are used for feature extraction, and the extracted features are then used to identify 30 spoken commands with convolutional neural networks (CNNs).
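To make the feature-extraction step concrete, the following is a minimal NumPy sketch of a log mel-spectrogram, the representation named in the title and keywords. It is not the authors' implementation: this page does not state their parameters, so the 16 kHz sample rate, 512-point FFT, 160-sample hop, 40 mel bands, and all function names here are assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(f):
    # Common (HTK-style) mel-scale formula; an assumed choice
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are equally spaced on the mel scale
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    # All default parameters are illustrative assumptions
    signal = np.asarray(signal, dtype=float)
    if len(signal) < n_fft:
        signal = np.pad(signal, (0, n_fft - len(signal)))
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_fft//2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T      # (n_frames, n_mels)
    return 10.0 * np.log10(mel + 1e-10).T                  # (n_mels, n_frames), in dB

# Usage: one second of a 440 Hz tone standing in for a spoken word
tone = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000.0)
spec = log_mel_spectrogram(tone)
print(spec.shape)  # (40, 97)
```

The resulting 2-D array can be treated either as an image (for example, rendered and passed to an image classifier) or as a signal-derived feature matrix fed directly to a CNN, which is one plausible reading of the two approaches the abstract compares.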




Mel-Spectrogram, Feature Extraction, Image Processing, Signal Processing, Transfer Learning, CNNs