A Large Multilingual Corpus of Pashto, Urdu, English for Automatic Spoken Language Identification

Aamer Zahoor; Nasir Ahmad

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 177 - Number 10

Year of Publication: 2019

Authors: Aamer Zahoor, Nasir Ahmad

10.5120/ijca2019919522

Aamer Zahoor, Nasir Ahmad. A Large Multilingual Corpus of Pashto, Urdu, English for Automatic Spoken Language Identification. International Journal of Computer Applications. 177, 10 (Oct 2019), 42-45. DOI=10.5120/ijca2019919522

@article{ 10.5120/ijca2019919522,
author = { Aamer Zahoor and Nasir Ahmad },
title = { A Large Multilingual Corpus of Pashto, Urdu, English for Automatic Spoken Language Identification },
journal = { International Journal of Computer Applications },
issue_date = { Oct 2019 },
volume = { 177 },
number = { 10 },
month = { Oct },
year = { 2019 },
issn = { 0975-8887 },
pages = { 42-45 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume177/number10/30937-2019919522/ },
doi = { 10.5120/ijca2019919522 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2024-02-07T00:45:32.248700+05:30
%A Aamer Zahoor
%A Nasir Ahmad
%T A Large Multilingual Corpus of Pashto, Urdu, English for Automatic Spoken Language Identification
%J International Journal of Computer Applications
%@ 0975-8887
%V 177
%N 10
%P 42-45
%D 2019
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The availability of a standard, phonetically rich speech corpus provides a common platform for comparing the performance of different speech recognition approaches and is therefore the first step for research on a language. This work presents the development of a large multilingual speech corpus of Pashto, Urdu and English. Recordings were made from a total of 194 speakers across the three languages, covering diverse dialects, age groups, genders and professions. For Pashto and Urdu, both native and non-native speakers were recorded, while for English all the speakers were non-native. The corpus comprises three categories of phonetically rich spoken data in each language: short questions regarding the speaker’s personal information, read speech and spontaneous speech from the domain of tourism. Although the corpus was developed primarily for research on Automatic Spoken Language Identification, it can also be used for research on other topics such as Automatic Speech Recognition, Accent Recognition, Automatic Speaker Identification and the study of the effects of non-nativeness on Language and Speaker Identification.
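As an illustration only, the corpus organization described in the abstract (three languages, three categories of spoken data, and native or non-native speakers) could be modeled as below. This is a minimal sketch, not the authors' actual file layout or metadata schema: the `Utterance` record, its field names, the category labels and the speaker identifiers are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

# Languages and data categories as named in the abstract.
# The string labels themselves are assumptions made for this sketch.
LANGUAGES = ("Pashto", "Urdu", "English")
CATEGORIES = ("short_questions", "read_speech", "spontaneous_speech")


@dataclass(frozen=True)
class Utterance:
    """One recording in a hypothetical metadata index of the corpus."""
    speaker_id: str   # hypothetical identifier, e.g. "spk001"
    language: str     # one of LANGUAGES
    category: str     # one of CATEGORIES
    native: bool      # True if the speaker is native in `language`

    def __post_init__(self):
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language: {self.language}")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        # The abstract states that all English speakers were non-native.
        if self.language == "English" and self.native:
            raise ValueError("English recordings are non-native only")


def count_by_language_and_category(utterances):
    """Count recordings for each (language, category) pair."""
    return Counter((u.language, u.category) for u in utterances)


# Made-up sample entries, for demonstration only.
sample = [
    Utterance("spk001", "Pashto", "read_speech", native=True),
    Utterance("spk001", "Urdu", "read_speech", native=False),
    Utterance("spk001", "English", "spontaneous_speech", native=False),
    Utterance("spk002", "Urdu", "short_questions", native=True),
]
counts = count_by_language_and_category(sample)
```

A per-language, per-category count of this kind is one simple way to check that the classes are balanced before training a language identification model on such data.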

Index Terms

Computer Science

Information Sciences

Keywords

Corpus Development, Pashto Language Corpus, Urdu Language Corpus, Pashto Automatic Spoken Language Identification, Automatic Speaker Identification, Automatic Speech Recognition.