Research Article

SPEAKNet: Spectrogram-Phoneme Embedding Architecture for Knowledge-enhanced Speech Command Recognition

by Sunakshi Mehra
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 14
Year of Publication: 2025
Authors: Sunakshi Mehra
DOI: 10.5120/ijca2025925146

Sunakshi Mehra. SPEAKNet: Spectrogram-Phoneme Embedding Architecture for Knowledge-enhanced Speech Command Recognition. International Journal of Computer Applications. 187, 14 (Jun 2025), 27-37. DOI=10.5120/ijca2025925146

@article{10.5120/ijca2025925146,
  author = {Sunakshi Mehra},
  title = {SPEAKNet: Spectrogram-Phoneme Embedding Architecture for Knowledge-enhanced Speech Command Recognition},
  journal = {International Journal of Computer Applications},
  issue_date = {Jun 2025},
  volume = {187},
  number = {14},
  month = {Jun},
  year = {2025},
  issn = {0975-8887},
  pages = {27-37},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume187/number14/speaknet-spectrogram-phoneme-embedding-architecture-for-knowledge-enhanced-speech-command-recognition/},
  doi = {10.5120/ijca2025925146},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}
%0 Journal Article
%A Sunakshi Mehra
%T SPEAKNet: Spectrogram-Phoneme Embedding Architecture for Knowledge-enhanced Speech Command Recognition
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 14
%P 27-37
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This research aims to enhance automatic speech recognition (ASR) by integrating multimodal data, specifically text transcripts and Mel spectrograms generated from raw audio signals. The study explores the often-overlooked role of phonological features and spectrogram-based representations in improving the accuracy of spoken word recognition. A dual-path approach is adopted: EfficientNetV2 extracts features from spectrogram images, while a Speech2Text transformer model generates text transcripts. For evaluation, the study uses ten word categories from version 2 of the Google Speech Commands dataset. To reduce noise in the audio samples, a Kalman filter is applied, ensuring cleaner signal processing. The resulting Mel spectrograms are resized to 256×256 pixels to produce two-dimensional visual representations of the audio data. These images are then classified using EfficientNetV2, pre-trained on the ImageNet dataset. In parallel, a grapheme-to-phoneme (G2P) model converts the Speech2Text outputs into phonemes. These are further processed through a technique called phoneme slicing, which extracts core phonological units (fricatives, nasals, liquids, glides, plosives, approximants, taps/flaps, trills, and vowels) based on articulatory features such as manner and place of articulation. The proposed system employs a late fusion strategy that combines phoneme embeddings with image-based embeddings to achieve high classification accuracy. This fusion not only boosts ASR performance but also underscores the value of incorporating linguistic and phonological knowledge into spoken language understanding. Through comprehensive ablation analysis, the study demonstrates that the integration of spectrograms and phonological analysis sets a new benchmark, outperforming existing models in accuracy and interpretability.
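
For readers who want a concrete picture of the dual-path pipeline, the sketch below wires the abstract's steps together in Python: Kalman-filter denoising, a 256×256 log-Mel spectrogram fed to an ImageNet-pretrained EfficientNetV2, a Speech2Text transcript converted to phonemes with a G2P model and sliced into articulatory classes, and a small feedforward head for late fusion. It is a minimal illustration under stated assumptions, not the paper's implementation: the library choices (librosa, timm, Hugging Face transformers, g2p_en), the "facebook/s2t-small-librispeech-asr" checkpoint, the scalar Kalman filter, the simplified phoneme-class grouping, the omitted input normalization, and all layer sizes and file names are assumptions made for the example.

```python
# Minimal late-fusion sketch of the pipeline described in the abstract.
# Assumed (not from the paper): library choices, checkpoint names, phoneme grouping,
# embedding scheme, and layer sizes. Input normalization for the CNN is omitted.

import numpy as np
import librosa
import timm
import torch
import torch.nn as nn
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from g2p_en import G2p

SR = 16_000  # Speech Commands audio is sampled at 16 kHz

def kalman_denoise(x, q=1e-5, r=1e-2):
    """Scalar Kalman filter run sample-by-sample (illustrative noise reduction)."""
    xhat, p = 0.0, 1.0
    out = np.empty_like(x)
    for i, z in enumerate(x):
        p += q                      # predict step: grow state uncertainty
        k = p / (p + r)             # Kalman gain
        xhat += k * (z - xhat)      # update with observation z
        p *= (1 - k)
        out[i] = xhat
    return out

def mel_image(wave, sr=SR, size=256):
    """Log-Mel spectrogram resized to size x size, replicated to 3 channels for the CNN."""
    s = librosa.feature.melspectrogram(y=wave, sr=sr, n_mels=128)
    s_db = librosa.power_to_db(s, ref=np.max)
    img = torch.tensor(s_db).unsqueeze(0).unsqueeze(0).float()   # (1, 1, n_mels, frames)
    img = nn.functional.interpolate(img, size=(size, size), mode="bilinear", align_corners=False)
    return img.repeat(1, 3, 1, 1)                                 # (1, 3, 256, 256)

# Image branch: ImageNet-pretrained EfficientNetV2 as a pooled feature extractor.
cnn = timm.create_model("tf_efficientnetv2_s", pretrained=True, num_classes=0).eval()

# Text branch: Speech2Text transcript -> grapheme-to-phoneme -> phoneme slicing.
asr_proc = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
asr = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr").eval()
g2p = G2p()

# Illustrative articulatory grouping for "phoneme slicing" (ARPAbet, simplified).
PHONE_CLASSES = {
    "fricative": {"F", "V", "S", "Z", "SH", "ZH", "TH", "DH", "HH"},
    "nasal": {"M", "N", "NG"},
    "liquid_glide": {"L", "R", "W", "Y"},
    "plosive": {"P", "B", "T", "D", "K", "G"},
}

def phoneme_embedding(wave, sr=SR):
    """Transcribe, convert to phonemes, and count phonemes per articulatory class."""
    inputs = asr_proc(wave, sampling_rate=sr, return_tensors="pt")
    ids = asr.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])
    text = asr_proc.batch_decode(ids, skip_special_tokens=True)[0]
    phones = [p.rstrip("012") for p in g2p(text) if p.strip()]    # drop stress marks and spaces
    vec = [sum(p in members for p in phones) for members in PHONE_CLASSES.values()]
    return torch.tensor(vec, dtype=torch.float32).unsqueeze(0)    # (1, 4)

class LateFusionHead(nn.Module):
    """Feedforward classifier over concatenated image and phoneme embeddings."""
    def __init__(self, img_dim, pho_dim, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim + pho_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_classes))
    def forward(self, img_emb, pho_emb):
        return self.net(torch.cat([img_emb, pho_emb], dim=-1))

# Usage on one clip (hypothetical file name from the Speech Commands "yes" class):
# wave, _ = librosa.load("yes_0001.wav", sr=SR)
# wave = kalman_denoise(wave)
# with torch.no_grad():
#     img_emb = cnn(mel_image(wave))      # pooled CNN features, (1, 1280) for EfficientNetV2-S
#     pho_emb = phoneme_embedding(wave)   # phoneme-class counts, (1, 4)
# head = LateFusionHead(img_emb.shape[-1], pho_emb.shape[-1])
# logits = head(img_emb, pho_emb)         # scores over 10 command classes
```

Because the fusion head only sees concatenated embeddings, either branch can be swapped or retrained independently, which is the practical appeal of late fusion over mixing features earlier in the pipeline.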

References
  1. Baltrušaitis, Tadas, Chaitanya Ahuja, and Louis-Philippe Morency. "Multimodal machine learning: A survey and taxonomy." IEEE Transactions on Pattern Analysis and Machine Intelligence 41, no. 2 (2018): 423-443.
  2. Zhu, Hao, Man-Di Luo, Rui Wang, Ai-Hua Zheng, and Ran He. "Deep audio-visual learning: A survey." International Journal of Automation and Computing 18 (2021): 351-376.
  3. Sumby, William H., and Irwin Pollack. "Visual contribution to speech intelligibility in noise." The Journal of the Acoustical Society of America 26, no. 2 (1954): 212-215.
  4. Lai, Kuo-Wei Kyle, and Hao-Jan Howard Chen. "An exploratory study on the accuracy of three speech recognition software programs for young Taiwanese EFL learners." Interactive Learning Environments (2022): 1-15.
  5. Nijhawan, Tanya, Girija Attigeri, and T. Ananthakrishna. "Stress detection using natural language processing and machine learning over social interactions." Journal of Big Data 9, no. 1 (2022): 1-24.
  6. Paula, Amauri J., Odair Pastor Ferreira, Antonio G. Souza Filho, Francisco Nepomuceno Filho, Carlos E. Andrade, and Andreia F. Faria. "Machine learning and natural language processing enable a data-oriented experimental design approach for producing biochar and hydrochar from biomass." Chemistry of Materials 34, no. 3 (2022): 979-990.
  7. Bensalah, Rana Fadia, and Achouak Betta. "The Impact of the Mother Tongue on the Phonetic Realization of Foreign Language Allophones: Algerian Arabic vs. Received Pronunciation English." PhD diss., Université Ibn Khaldoun, Tiaret, 2022.
  8. Syafrizal, Syafrizal, Sri Wahyuni, and Tosi Rut Syamsun. "Pronunciation Errors of the Silent Consonants of Pariskian Junior High School Students." Journal of English Language Teaching and English Linguistics 7, no. 2 (2022): 155-165.
  9. Li, Bo, Ruoming Pang, Yu Zhang, Tara N. Sainath, Trevor Strohman, Parisa Haghani, Yun Zhu, Brian Farris, Neeraj Gaur, and Manasa Prasad. "Massively multilingual asr: A lifelong learning solution." In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6397-6401. IEEE, 2022.
  10. Li, Bo, Shuo-yiin Chang, Tara N. Sainath, Ruoming Pang, Yanzhang He, Trevor Strohman, and Yonghui Wu. "Towards fast and accurate streaming end-to-end ASR." In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6069-6073. IEEE, 2020.
  11. Baevski, Alexei, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. "wav2vec 2.0: A framework for self-supervised learning of speech representations." Advances in neural information processing systems 33 (2020): 12449-12460.
  12. Chung, Yu-An, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. "W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training." In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 244-250. IEEE, 2021.
  13. Zhang, Yu, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen et al. "Bigssl: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition." IEEE Journal of Selected Topics in Signal Processing 16, no. 6 (2022): 1519-1532.
  14. Graves, Alex. "Sequence transduction with recurrent neural networks." arXiv preprint arXiv:1211.3711 (2012).
  15. Hu, Ke, Antoine Bruguier, Tara N. Sainath, Rohit Prabhavalkar, and Golan Pundak. "Phoneme-based contextualization for cross-lingual speech recognition in end-to-end models." arXiv preprint arXiv:1906.09292 (2019).
  16. Reddy, Sravana, and James N. Stanford. "Toward completely automated vowel extraction: Introducing DARLA." Linguistics Vanguard 1, no. 1 (2015): 15-28.
  17. Jimerson, Robbie, and Emily Prud’Hommeaux. "ASR for documenting acutely under-resourced indigenous languages." In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). 2018.
  18. Michaud, Alexis, Oliver Adams, Trevor Anthony Cohn, Graham Neubig, and Séverine Guillaume. "Integrating automatic transcription into the language documentation workflow: Experiments with Na data and the Persephone toolkit." (2018).
  19. Omachi, Motoi, Yuya Fujita, Shinji Watanabe, and Matthew Wiesner. "End-to-end ASR to jointly predict transcriptions and linguistic annotations." In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1861-1871. 2021.
  20. Li, Jinyu. "Recent advances in end-to-end automatic speech recognition." APSIPA Transactions on Signal and Information Processing 11, no. 1 (2022).
  21. Prabhavalkar, Rohit, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, and Shinji Watanabe. "End-to-End Speech Recognition: A Survey." arXiv preprint arXiv:2303.03329 (2023).
  22. Chiu, Chung-Cheng, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan et al. "State-of-the-art speech recognition with sequence-to-sequence models." In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4774-4778. IEEE, 2018.
  23. Chen, Xie, Yu Wu, Zhenghao Wang, Shujie Liu, and Jinyu Li. "Developing real-time streaming transformer transducer for speech recognition on large-scale dataset." In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5904-5908. IEEE, 2021.
  24. Wang, Yuxuan, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang et al. "Tacotron: Towards end-to-end speech synthesis." arXiv preprint arXiv:1703.10135 (2017).
  25. Shen, Jonathan, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen et al. "Natural tts synthesis by conditioning wavenet on mel spectrogram predictions." In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4779-4783. IEEE, 2018.
  26. Li, Naihan, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. "Neural speech synthesis with transformer network." In Proceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, pp. 6706-6713. 2019.
  27. Ren, Yi, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. "Fastspeech: Fast, robust and controllable text to speech." Advances in neural information processing systems 32 (2019).
  28. Ren, Yi, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. "Fastspeech 2: Fast and high-quality end-to-end text to speech." arXiv preprint arXiv:2006.04558 (2020).
  29. Li, Naihan, Yanqing Liu, Yu Wu, Shujie Liu, Sheng Zhao, and Ming Liu. "Robutrans: A robust transformer-based text-to-speech model." In Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 05, pp. 8228-8235. 2020.
  30. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
  31. Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. "Improving language understanding by generative pre-training." (2018).
  32. Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems 27 (2014).
  33. Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. "Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension." arXiv preprint arXiv:1910.13461 (2019).
  34. Yi, Cheng, Jianzhong Wang, Ning Cheng, Shiyu Zhou, and Bo Xu. "Applying wav2vec2.0 to speech recognition in various low-resource languages." arXiv preprint arXiv:2012.12121 (2020).
  35. Gao, Heting, Junrui Ni, Yang Zhang, Kaizhi Qian, Shiyu Chang, and Mark Hasegawa-Johnson. "Zero-Shot Cross-Lingual Phonetic Recognition with External Language Embedding." In Interspeech, pp. 1304-1308. 2021.
  36. Chen, Tianlong, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. "The lottery ticket hypothesis for pre-trained bert networks." Advances in neural information processing systems 33 (2020): 15834-15846.
  37. Lai, Cheng-I. Jeff, Yang Zhang, Alexander H. Liu, Shiyu Chang, Yi-Lun Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer Khurana, David Cox, and Jim Glass. "Parp: Prune, adjust and re-prune for self-supervised speech recognition." Advances in Neural Information Processing Systems 34 (2021): 21256-21272.
  38. Newell, Alejandro, and Jia Deng. "How useful is self-supervised pretraining for visual tasks?." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7345-7354. 2020.
  39. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
  40. Hinton, Geoffrey, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." IEEE Signal Processing Magazine 29, no. 6 (2012): 82-97.
  41. Chen, Guoguo, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su et al. "Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio." arXiv preprint arXiv:2106.06909 (2021).
  42. Mohamed, Abdelrahman, Hung-yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff et al. "Self-supervised speech representation learning: A review." IEEE Journal of Selected Topics in Signal Processing (2022).
  43. Chung, Yu-An, Wei-Ning Hsu, Hao Tang, and James Glass. "An unsupervised autoregressive model for speech representation learning." arXiv preprint arXiv:1904.03240 (2019).
  44. Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. "Representation learning with contrastive predictive coding." arXiv preprint arXiv:1807.03748 (2018).
  45. Hsu, Wei-Ning, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. "Hubert: Self-supervised speech representation learning by masked prediction of hidden units." IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021): 3451-3460.
  46. Mehra, Sunakshi, Virender Ranga, and Ritu Agarwal. "Dhivehi Speech Recognition: A Multimodal Approach for Dhivehi Language in Resource-Constrained Settings." Circuits, Systems, and Signal Processing (2024): 1-21.
  47. Zhang, Qiuju, Hongtao Zhang, Keming Zhou, and Le Zhang. "Developing a Physiological Signal-Based, Mean Threshold and Decision-Level Fusion Algorithm (PMD) for Emotion Recognition." Tsinghua Science and Technology 28, no. 4 (2023): 673-685.
  48. Hazen, Timothy J. "Automatic alignment and error correction of human generated transcripts for long speech recordings." In Ninth International Conference on Spoken Language Processing. 2006.
  49. Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. "Xlnet: Generalized autoregressive pretraining for language understanding." Advances in neural information processing systems 32 (2019).
  50. Yenkimaleki, Mahmood, and Vincent J. van Heuven. "Effects of attention to segmental vs. suprasegmental features on the speech intelligibility and comprehensibility of the EFL learners targeting the perception or production-focused practice." System 100 (2021): 102557.
  51. Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12, no. 7 (2011).
  52. Warden, Pete. "Speech commands: A dataset for limited-vocabulary speech recognition." arXiv preprint arXiv:1804.03209 (2018).
  53. Haque, Md Amaan, Abhishek Verma, John Sahaya Rani Alex, and Nithya Venkatesan. "Experimental evaluation of CNN architecture for speech recognition." In First International Conference on Sustainable Technologies for Computational Intelligence: Proceedings of ICTSCI 2019, pp. 507-514. Springer Singapore, 2020.
  54. Abdelmaksoud, Engy Ragaei, Arafa Hassen, Nabila Hassan, and Mohamed Hesham. "Convolutional Neural Network for Arabic Speech Recognition." The Egyptian Journal of Language Engineering 8, no. 1 (2021): 27-38.
  55. Wazir, Abdulaziz Saleh Mahfoudh Ba, and Joon Huang Chuah. "Spoken Arabic digits recognition using deep learning." In 2019 IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS), pp. 339-344. IEEE, 2019.
  56. Zia, Tehseen, and Usman Zahid. "Long short-term memory recurrent neural network architectures for Urdu acoustic modeling." International Journal of Speech Technology 22 (2019): 21-30.
  57. Lezhenin, I., N. Bogach, and E. Pyshkin. "Urban sound classification using long short-term memory neural network." In 2019 Federated Conference on Computer Science and Information Systems (FedCSIS), pp. 57-60. 2019.
  58. Zeng, Mengjun, and Nanfeng Xiao. "Effective combination of DenseNet and BiLSTM for keyword spotting." IEEE Access 7 (2019): 10767-10775.
  59. De Andrade, Douglas Coimbra, Sabato Leo, Martin Loesener Da Silva Viana, and Christoph Bernkopf. "A neural attention model for speech command recognition." arXiv preprint arXiv:1808.08929 (2018).
  60. Wei, Yungen, Zheng Gong, Shunzhi Yang, Kai Ye, and Yamin Wen. "EdgeCRNN: an edge-computing oriented model of acoustic feature enhancement for keyword spotting." Journal of Ambient Intelligence and Humanized Computing (2022): 1-11.
  61. Cances, Léo, and Thomas Pellegrini. "Comparison of Deep Co-Training and Mean-Teacher approaches for semi-supervised audio tagging." In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 361-365. IEEE, 2021.
  62. Higy, Bertrand, and Peter Bell. "Few-shot learning with attention-based sequence-to-sequence models." arXiv preprint arXiv:1811.03519 (2018).
  63. Vygon, Roman, and Nikolay Mikhaylovskiy. "Learning efficient representations for keyword spotting with triplet loss." In Speech and Computer: 23rd International Conference, SPECOM 2021, St. Petersburg, Russia, September 27–30, 2021, Proceedings 23, pp. 773-785. Springer International Publishing, 2021.
  64. Kim, Byeonggeun, Simyung Chang, Jinkyu Lee, and Dooyong Sung. "Broadcasted residual learning for efficient keyword spotting." arXiv preprint arXiv:2106.04140 (2021).
  65. Berg, Axel, Mark O'Connor, and Miguel Tairum Cruz. "Keyword transformer: A self-attention model for keyword spotting." arXiv preprint arXiv:2104.00769 (2021).
  66. Majumdar, Somshubra, and Boris Ginsburg. "Matchboxnet: 1d time-channel separable convolutional neural network architecture for speech commands recognition." arXiv preprint arXiv:2004.08531 (2020).
  67. Ng, Dianwen, Yunqi Chen, Biao Tian, Qiang Fu, and Eng Siong Chng. "Convmixer: Feature interactive convolution with curriculum learning for small footprint and noisy far-field keyword spotting." In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3603-3607. IEEE, 2022.
  68. Lin, James, Kevin Kilgour, Dominik Roblek, and Matthew Sharifi. "Training keyword spotters with limited and synthesized speech data." In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7474-7478. IEEE, 2020.
  69. Seo, Deokjin, Heung-Seon Oh, and Yuchul Jung. "Wav2kws: Transfer learning from speech representations for keyword spotting." IEEE Access 9 (2021): 80682-80691.
  70. Mehra, Sunakshi, Virender Ranga, and Ritu Agarwal. "A deep learning approach to dysarthric utterance classification with BiLSTM-GRU, speech cue filtering, and log mel spectrograms." The Journal of Supercomputing (2024): 1-28.
  71. Mehra, Sunakshi, Virender Ranga, and Ritu Agarwal. "Improving speech command recognition through decision-level fusion of deep filtered speech cues." Signal, Image and Video Processing 18, no. 2 (2024): 1365-1373.
  72. Mehra, Sunakshi, Virender Ranga, Ritu Agarwal, and Seba Susan. "Speaker independent recognition of low-resourced multilingual Arabic spoken words through hybrid fusion." Multimedia Tools and Applications 83, no. 35 (2024): 82533-82561.
  73. Mehra, Sunakshi, and Seba Susan. "Early fusion of phone embeddings for recognition of low-resourced accented speech." In 2022 4th International Conference on Artificial Intelligence and Speech Technology (AIST), pp. 1-5. IEEE, 2022.
  74. Mehra, Sunakshi, Virender Ranga, and Ritu Agarwal. "Multimodal Integration of Mel Spectrograms and Text Transcripts for Enhanced Automatic Speech Recognition: Leveraging Extractive Transformer‐Based Approaches and Late Fusion Strategies." Computational Intelligence 40, no. 6 (2024): e70012.
Index Terms

Computer Science
Information Sciences

Keywords

Speech Command Recognition, EfficientNetV2, Speech Filtering Techniques, Transformer Models, Feedforward Neural Network (FNN), Multimodal Speech Processing, Mel Spectrogram, Phoneme Analysis