Research Article

A Large Multilingual Corpus of Pashto, Urdu, English for Automatic Spoken Language Identification

by Aamer Zahoor, Nasir Ahmad
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 177 - Number 10
Year of Publication: 2019
Authors: Aamer Zahoor, Nasir Ahmad

Aamer Zahoor, Nasir Ahmad. A Large Multilingual Corpus of Pashto, Urdu, English for Automatic Spoken Language Identification. International Journal of Computer Applications. 177, 10 (Oct 2019), 42-45. DOI=10.5120/ijca2019919522

@article{ 10.5120/ijca2019919522,
author = { Aamer Zahoor and Nasir Ahmad },
title = { A Large Multilingual Corpus of Pashto, Urdu, English for Automatic Spoken Language Identification },
journal = { International Journal of Computer Applications },
issue_date = { Oct 2019 },
volume = { 177 },
number = { 10 },
month = { Oct },
year = { 2019 },
issn = { 0975-8887 },
pages = { 42-45 },
numpages = {9},
url = { },
doi = { 10.5120/ijca2019919522 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T00:45:32.248700+05:30
%A Aamer Zahoor
%A Nasir Ahmad
%T A Large Multilingual Corpus of Pashto, Urdu, English for Automatic Spoken Language Identification
%J International Journal of Computer Applications
%@ 0975-8887
%V 177
%N 10
%P 42-45
%D 2019
%I Foundation of Computer Science (FCS), NY, USA

The availability of a standard, phonetically rich speech corpus provides a common platform for comparing the performance of different speech recognition approaches and is therefore the first step for research in a language. This work presents the development of a large multilingual speech corpus of Pashto, Urdu and English. Recordings were made from a total of 194 speakers in the three languages, covering diverse dialects, age groups, genders and professions. For Pashto and Urdu, both native and non-native speakers were recorded, while for English all the speakers were non-native. The corpus comprises three categories of phonetically rich spoken data in each language: short questions regarding the speaker’s personal information, read speech, and spontaneous speech from the domain of tourism. Although the corpus was developed primarily for research on Automatic Spoken Language Identification, it can also be used for research on other topics such as Automatic Speech Recognition, Accent Recognition, Automatic Speaker Identification, and the study of the effects of non-nativeness on Language and Speaker Identification.

Index Terms

Computer Science
Information Sciences


Corpus Development, Pashto Language Corpus, Urdu Language Corpus, Pashto, Automatic Spoken Language Identification, Automatic Speaker Identification, Automatic Speech Recognition.