Research Article

Audio-Visual Speech Recognition for People with Speech Disorders

by Elham S. Salama, Reda A. El-khoribi, Mahmoud E. Shoman
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 96 - Number 2
Year of Publication: 2014
Authors: Elham S. Salama, Reda A. El-khoribi, Mahmoud E. Shoman

Elham S. Salama, Reda A. El-khoribi, Mahmoud E. Shoman. Audio-Visual Speech Recognition for People with Speech Disorders. International Journal of Computer Applications. 96, 2 (June 2014), 51-56. DOI=10.5120/16770-6337

@article{ 10.5120/16770-6337,
author = { Elham S. Salama and Reda A. El-khoribi and Mahmoud E. Shoman },
title = { Audio-Visual Speech Recognition for People with Speech Disorders },
journal = { International Journal of Computer Applications },
issue_date = { June 2014 },
volume = { 96 },
number = { 2 },
month = { June },
year = { 2014 },
issn = { 0975-8887 },
pages = { 51-56 },
numpages = {9},
url = { },
doi = { 10.5120/16770-6337 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T22:21:15.709521+05:30
%A Elham S. Salama
%A Reda A. El-khoribi
%A Mahmoud E. Shoman
%T Audio-Visual Speech Recognition for People with Speech Disorders
%J International Journal of Computer Applications
%@ 0975-8887
%V 96
%N 2
%P 51-56
%D 2014
%I Foundation of Computer Science (FCS), NY, USA

Recognizing the speech of people with speech disorders is difficult because these speakers lack full motor control of the speech articulators. Multimodal speech recognition can make the recognition of disordered speech more robust. This paper introduces an automatic speech recognition system for people with dysarthria, a speech disorder, that uses both acoustic and visual components. Mel-Frequency Cepstral Coefficients (MFCCs) are used as features representing the acoustic speech signal. For the visual counterpart, Discrete Cosine Transform (DCT) coefficients are extracted from the speaker's mouth region. The face and mouth regions are detected using the Viola-Jones algorithm. The acoustic and visual input features are then concatenated into one feature vector, and a Hidden Markov Model (HMM) classifier is applied to the combined vector. The system is tested on isolated English words spoken by speakers with speech disorders from the UA-Speech database. The results indicate that visual features are highly effective and can improve accuracy by up to 7.91% in speaker-dependent experiments and by up to 3% in speaker-independent experiments.
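
The feature-level fusion described in the abstract can be illustrated with a minimal sketch. It assumes the mouth region has already been cropped (for example by a Viola-Jones detector) and that acoustic features such as MFCCs are already computed per audio frame. The function names `dct2`, `visual_features`, and `fuse`, the choice of the top-left `k` x `k` low-frequency coefficients, and the nearest-frame alignment of video to audio frames are all assumptions made for illustration; the abstract does not state how the authors select coefficients or align the two streams.

```python
import math


def dct2(block):
    """Orthonormal 2-D DCT-II of a grayscale block given as a list of rows."""
    rows, cols = len(block), len(block[0])
    out = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0.0
            for x in range(rows):
                for y in range(cols):
                    total += (block[x][y]
                              * math.cos(math.pi * (2 * x + 1) * u / (2 * rows))
                              * math.cos(math.pi * (2 * y + 1) * v / (2 * cols)))
            # Orthonormal scaling: sqrt(1/N) for the DC term, sqrt(2/N) otherwise.
            scale_u = math.sqrt((1 if u == 0 else 2) / rows)
            scale_v = math.sqrt((1 if v == 0 else 2) / cols)
            out[u][v] = scale_u * scale_v * total
    return out


def visual_features(mouth_roi, k=3):
    """Keep the k x k lowest-frequency DCT coefficients (hypothetical selection rule)."""
    coeffs = dct2(mouth_roi)
    return [coeffs[u][v] for u in range(k) for v in range(k)]


def fuse(audio_frames, visual_frames):
    """Concatenate each audio frame with the visual frame nearest in time.

    Assumes both streams span the same utterance; video usually has fewer
    frames than audio, so each visual frame is reused for several audio frames.
    """
    n_a, n_v = len(audio_frames), len(visual_frames)
    fused = []
    for i, audio in enumerate(audio_frames):
        j = min(n_v - 1, i * n_v // n_a)
        fused.append(list(audio) + list(visual_frames[j]))
    return fused


# Toy data: 6 audio frames of 13 placeholder "MFCC" values, 2 video frames of 8x8 pixels.
audio = [[0.1 * (i + c) for c in range(13)] for i in range(6)]
video = [[[float((x + y + t) % 4) for y in range(8)] for x in range(8)] for t in range(2)]

visual = [visual_features(frame) for frame in video]
observations = fuse(audio, visual)
print(len(observations), len(observations[0]))  # 6 frames, 13 + 9 = 22 values each
```

Each fused vector would then serve as one observation in the sequence passed to a word-level HMM; training and decoding are left out because they depend on the toolkit used.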

  1. A. N. Mishra, Mahesh Chandra, Astik Biswas, and S. N. Sharan. 2013. Hindi phoneme-viseme recognition from continuous speech. International Journal of Signal and Imaging Systems Engineering (IJSISE).
  2. Virginia Estellers and Jean-Philippe Thiran. 2012. Multi-pose lipreading and audio-visual speech recognition. EURASIP Journal on Advances in Signal Processing.
  3. H. Kim, M. Hasegawa-Johnson, A. Perlman, J. Gunderson, T. Huang, K. Watkin, and S. Frame. 2008. Dysarthric speech database for universal access research. In Proceedings of Interspeech. Brisbane, Australia.
  4. H. V. Sharma and M. Hasegawa-Johnson. 2010. State transition interpolation and MAP adaptation for HMM-based dysarthric speech recognition. NAACL HLT Workshop on Speech and Language Processing for Assistive Technologies (SLPAT).
  5. G. Jayaram and K. Abdelhamied. 1995. Experiments in dysarthric speech recognition using artificial neural networks. Journal of Rehabilitation Research and Development.
  6. Heidi Christensen, Stuart Cunningham, Charles Fox, Phil Green, and Thomas Hain. 2012. A comparative study of adaptive, automatic recognition of disordered speech. INTERSPEECH. ISCA.
  7. F. Rudzicz. 2011. Production knowledge in the recognition of dysarthric speech. Ph.D. thesis. University of Toronto, Department of Computer Science.
  8. G. Potamianos, C. Neti, J. Luettin, and I. Matthews. 2004. Audio-visual automatic speech recognition: an overview. Issues in Visual and Audio-Visual Speech Processing. MIT Press, Cambridge, MA.
  9. H. McGurk and J. W. MacDonald. 1976. Hearing lips and seeing voices. Nature.
  10. A. Q. Summerfield. 1987. Some preliminaries to a comprehensive account of audio-visual speech perception. In Hearing by eye. The psychology of lip-reading.
  11. D. W. Massaro and D. G. Stork. 1998. Speech recognition and sensory integration. American Scientist.
  12. Chikoto Miyamoto, Yuto Komai, Tetsuya Takiguchi, Yasuo Ariki, and Ichao Li. 2010. Multimodal speech recognition of a person with articulation disorders using AAM and MAF. In Proceedings of Multimedia Signal Processing (MMSP).
  13. Ahmed Farag, Mohamed El Adawy, and Ahmed Ismail. 2013. A robust speech disorders correction system for Arabic language using visual speech recognition. Biomedical Research.
  14. S. B. Davis and P. Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing (ASSP).
  15. Paul Viola and Michael J. Jones. 2001. Rapid Object Detection using a Boosted Cascade of Simple Features. IEEE CVPR.
  16. G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior. 2003. Recent advances in the automatic recognition of audiovisual speech.
  17. G. Potamianos, H. P. Graf, and E. Cosatto. 1998. An image transform approach for HMM based automatic lipreading. In IEEE International Conference on Image Processing.
  18. P. Scanlon, D. Ellis, and R. Reilly. 2003. Using mutual information to design class specific phone recognizers. In Proceedings of Eurospeech.
  19. P. Scanlon and G. Potamianos. 2005. Exploiting lower face symmetry in appearance-based automatic speechreading. Proc. Works. Audio-Visual Speech Process. (AVSP).
  20. L. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE.
  21. Y. Tian, J. L. Zhou, H. Lin, and H. Jiang. 2006. Tree-Based Covariance Modeling of Hidden Markov Models. IEEE Transactions on Audio, Speech and Language Processing.
  22. S. Young, D. Kershaw, J. Odell, V. Valtchev, and P. Woodland. 2006. The HTK Book Version 3.4. Cambridge University Press.
  23. 2013. The OpenCV Reference Manual. Release 2.4.6.0. [Online]. Available: http://docs.opencv.org/opencv2refman.pdf.
Index Terms

Computer Science
Information Sciences