Research Article

A Rule based Stemming Method for Multilingual Urdu Text

by Mubashir Ali, Shehzad Khalid, M. Haneef Saleemi, Waheed Iqbal, Armughan Ali, Ghayur Naqvi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 134 - Number 8
Year of Publication: 2016
10.5120/ijca2016907784

Mubashir Ali, Shehzad Khalid, M. Haneef Saleemi, Waheed Iqbal, Armughan Ali, Ghayur Naqvi. A Rule based Stemming Method for Multilingual Urdu Text. International Journal of Computer Applications. 134, 8 (January 2016), 10-18. DOI=10.5120/ijca2016907784

@article{ 10.5120/ijca2016907784,
author = { Mubashir Ali and Shehzad Khalid and M. Haneef Saleemi and Waheed Iqbal and Armughan Ali and Ghayur Naqvi },
title = { A Rule based Stemming Method for Multilingual Urdu Text },
journal = { International Journal of Computer Applications },
issue_date = { January 2016 },
volume = { 134 },
number = { 8 },
month = { January },
year = { 2016 },
issn = { 0975-8887 },
pages = { 10-18 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume134/number8/23933-2016907784/ },
doi = { 10.5120/ijca2016907784 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:33:36.784255+05:30
%A Mubashir Ali
%A Shehzad Khalid
%A M. Haneef Saleemi
%A Waheed Iqbal
%A Armughan Ali
%A Ghayur Naqvi
%T A Rule based Stemming Method for Multilingual Urdu Text
%J International Journal of Computer Applications
%@ 0975-8887
%V 134
%N 8
%P 10-18
%D 2016
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Urdu is the national language of Pakistan, and more than 200 million people use it for spoken and written communication. A large amount of unstructured Urdu text exists, from which useful information can be extracted by applying data mining techniques. However, Urdu seriously lacks the language processing capabilities needed to develop innovative systems. In this paper, the authors present a rule-based stemming method for Urdu that can cope with the challenges of Urdu infix stemming. The proposed method generates the stem of an Urdu word by removing its prefix, infix and postfix. The authors introduce two novel classes of Urdu infix words and a new minimum word length rule, and develop infix stripping rules to generate the stems of words that belong to the proposed infix classes. The technique can also generate the stems of borrowed words and compound words. The proposed approach is evaluated on Urdu headline news datasets and compared with an existing state-of-the-art technique (A Light Weight Urdu Stemmer) to demonstrate its effectiveness. The proposed method achieves 90% to 95% accuracy and shows significant improvement compared with the existing Urdu stemming technique.
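The general idea of rule-based affix stripping with a minimum word length check can be sketched in Python as follows. This is only an illustration of the technique the abstract describes, not the authors' rule set: the affix lists, the single infix rule, the threshold value and all function names are assumptions made for this example.

```python
# Illustrative affix lists only; these are NOT the rules from the paper.
PREFIXES = ("بد", "بے", "نا")
SUFFIXES = ("یں", "وں", "ات")
MIN_STEM_LENGTH = 3  # assumed threshold, not the value used by the authors


def strip_prefix(word):
    """Remove the first matching prefix if the remainder is long enough."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            return word[len(prefix):]
    return word


def strip_suffix(word):
    """Remove the first matching suffix if the remainder is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[:-len(suffix)]
    return word


def strip_infix(word):
    """Hypothetical infix rule: a five-letter word that starts with the
    letter meem and has alif as its third letter drops that alif."""
    if len(word) == 5 and word[0] == "م" and word[2] == "ا":
        return word[:2] + word[3:]
    return word


def stem(word):
    """Apply prefix, suffix and infix stripping, in that order."""
    word = strip_prefix(word)
    word = strip_suffix(word)
    return strip_infix(word)


if __name__ == "__main__":
    for sample in ("کتابیں", "بدصورت", "مساجد", "بات"):
        print(sample, "->", stem(sample))
```

With these assumed rules, a plural ending or a prefix is removed only when at least three letters remain, so a short word such as the last sample is left unchanged even though it ends in a listed suffix. A real stemmer would need far larger affix lists and exception handling for borrowed and compound words.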

References
  1. Bowman, Q. Akram, A. Naseer and S. Hussain. Assas-band, an affix-exception-list based Urdu stemmer. Proceedings of the 7th Workshop on Asian Language Resources. Singapore. pages 40–47. (2009).
  2. M. Al-Khuli. A dictionary of theoretical linguistics: English-Arabic with an Arabic-English glossary. Published by Library of Lebanon. (1991).
  3. K. Riaz. Challenges in Urdu Stemming (A Progress Report). BCS IRSG Symposium: Future Directions in Information Access (FDIA). (2007).
  4. S. Ahmad, W. Anwar, U.I. Bajwa. Challenges in Developing a Rule based Urdu Stemmer. Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (WSSANLP). Chiang Mai, Thailand. pages 46–51. (2011).
  5. J.B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics. vol. 11, no. 1/2, pages 22-31. (1968).
  6. C.D. Paice. Another stemmer. ACM SIGIR Forum. Volume 24, No. 3, pages 56-61. (1990).
  7. M.F. Porter. An algorithm for suffix stripping. Program. 14: 130-137. (1980).
  8. M.F. Porter. Snowball: A language for stemming algorithms. (2001).
  9. N. Thabet. Stemming the Qur’an. Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages. pages 85-88. (2004).
  10. M. Tashakori, M. Meybodi & F. Oroumchian. Bon: first Persian stemmer. Lecture Notes on Information and Communication Technology. pages 487-494. (2002).
  11. S. Ahmad, W. Anwar, U.I. Bajwa, X. Wang. A Light Weight Stemmer for Urdu Language: A Scarce Resourced Language. Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP). Mumbai. pages 69–78. (2012).
  12. J. Mayfield and P. McNamee. Single Ngram stemming. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pages 415-416. (2003).
  13. M. Melucci and N. Orio. A novel method for stemmer generation based on hidden Markov models. Proceedings of the Twelfth International Conference on Information and Knowledge Management. pages 131-138. (2003).
  14. P. Majumder, M. Mitra, S.K. Parui, G. Kole, P. Mitra and K. Datta. YASS: Yet another suffix stripper. ACM Transactions on Information Systems. Volume 25, Issue 4, Article No. 18. (2007).
  15. S. Hussain. Finite-State Morphological Analyzer for Urdu. Unpublished MS thesis, Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences, Pakistan. (2004).
  16. S. Sabzwari. Urdu Quwaid. Sang-e-Meel Publication. (2002).
  17. M. Ali, S. Khalid, M.H. Saleemi. A Novel Stemming Approach for Urdu language. Journal of Applied Environmental and Biological Sciences. 4(7S) 436-443, (2014).
Index Terms

Computer Science
Information Sciences

Keywords

Urdu stemming, stemming rules, infix stemming, stemming lists, Urdu infix classes