Image and Signal Processing of Mel-Spectrograms in Isolated Speech Recognition

Atharva Bankar; Aryan Gandhi; Dipali Baviskar

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 183 - Number 25

Year of Publication: 2021

Authors: Atharva Bankar, Aryan Gandhi, Dipali Baviskar

10.5120/ijca2021921625

Atharva Bankar, Aryan Gandhi, Dipali Baviskar. Image and Signal Processing of Mel-Spectrograms in Isolated Speech Recognition. International Journal of Computer Applications. 183, 25 (Sep 2021), 11-17. DOI=10.5120/ijca2021921625

@article{ 10.5120/ijca2021921625,
author = { Atharva Bankar and Aryan Gandhi and Dipali Baviskar },
title = { Image and Signal Processing of Mel-Spectrograms in Isolated Speech Recognition },
journal = { International Journal of Computer Applications },
issue_date = { Sep 2021 },
volume = { 183 },
number = { 25 },
month = { Sep },
year = { 2021 },
issn = { 0975-8887 },
pages = { 11-17 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume183/number25/32082-2021921625/ },
doi = { 10.5120/ijca2021921625 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2024-02-07T01:17:51.833617+05:30
%A Atharva Bankar
%A Aryan Gandhi
%A Dipali Baviskar
%T Image and Signal Processing of Mel-Spectrograms in Isolated Speech Recognition
%J International Journal of Computer Applications
%@ 0975-8887
%V 183
%N 25
%P 11-17
%D 2021
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Speech is one of the fundamental modes of communication, and speech recognition systems have advanced considerably over the past decade. The basic idea behind these systems is the conversion of acoustic waveforms into human-readable text. This paper models an automatic speech recognition (speech-to-text) system that recognizes isolated words, one at a time. Word predictions are made using two methods, namely image processing and signal processing. The paper presents the design of such a system, gives an overview of the techniques used at each stage of speech recognition, and compares the two methods on the basis of accuracy and computation time. The techniques showcased in this study are used for feature extraction, and the extracted features are then used to identify 30 spoken commands with convolutional neural networks (CNNs).
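As an illustration of the feature-extraction stage named in the abstract, the following is a minimal sketch of how a log mel-spectrogram can be computed from a waveform. It is not the authors' implementation: the frame length, hop size, number of mel bands, sample rate, and the HTK-style mel formula are all assumptions chosen for the example, and the function names are hypothetical. The CNN classifier is not shown.

```python
import cmath
import math


def hz_to_mel(hz):
    """Convert Hz to mel (HTK-style formula; an assumption, other variants exist)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters with edges evenly spaced on the mel scale.

    Returns n_mels rows, each holding one weight per FFT bin 0..n_fft//2.
    """
    n_bins = n_fft // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2.0)
    # n_mels filters need n_mels + 2 edge frequencies (low, centre, high).
    edges_hz = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame via a naive O(n^2) DFT."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_spectrogram(signal, sample_rate, n_fft=256, hop=128, n_mels=20):
    """Return a list of frames, each a list of n_mels log-energies in dB."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        power = power_spectrum(signal[start:start + n_fft])
        mel_energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
        # Small constant avoids log(0) for empty bands.
        frames.append([10.0 * math.log10(e + 1e-10) for e in mel_energies])
    return frames


# Example: a synthetic 1 kHz tone sampled at 8 kHz (values chosen for illustration).
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * t / sample_rate) for t in range(1024)]
spec = mel_spectrogram(tone, sample_rate)
```

The resulting two-dimensional array (time frames by mel bands) can be treated either as a numeric feature matrix or rendered as an image, which is one way to read the distinction between the signal-processing and image-processing routes described above. In practice an FFT-based library routine would replace the naive DFT used here for clarity.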

Index Terms

Computer Science

Information Sciences

Keywords

Mel-Spectrogram, Feature Extraction, Image Processing, Signal Processing, Transfer Learning, CNNs