Research Article

Tigrinya Part-of-Speech Tagging with Morphological Patterns and the New Nagaoka Tigrinya Corpus

by Yemane Keleta Tedla, Kazuhide Yamamoto, Ashuboda Marasinghe
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 146 - Number 14
Year of Publication: 2016
Authors: Yemane Keleta Tedla, Kazuhide Yamamoto, Ashuboda Marasinghe
DOI: 10.5120/ijca2016910943

Yemane Keleta Tedla, Kazuhide Yamamoto, Ashuboda Marasinghe. Tigrinya Part-of-Speech Tagging with Morphological Patterns and the New Nagaoka Tigrinya Corpus. International Journal of Computer Applications. 146, 14 (Jul 2016), 33-41. DOI=10.5120/ijca2016910943

@article{ 10.5120/ijca2016910943,
author = { Yemane Keleta Tedla and Kazuhide Yamamoto and Ashuboda Marasinghe },
title = { Tigrinya Part-of-Speech Tagging with Morphological Patterns and the New Nagaoka Tigrinya Corpus },
journal = { International Journal of Computer Applications },
issue_date = { Jul 2016 },
volume = { 146 },
number = { 14 },
month = { Jul },
year = { 2016 },
issn = { 0975-8887 },
pages = { 33-41 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume146/number14/25468-2016910943/ },
doi = { 10.5120/ijca2016910943 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:50:28.855427+05:30
%A Yemane Keleta Tedla
%A Kazuhide Yamamoto
%A Ashuboda Marasinghe
%T Tigrinya Part-of-Speech Tagging with Morphological Patterns and the New Nagaoka Tigrinya Corpus
%J International Journal of Computer Applications
%@ 0975-8887
%V 146
%N 14
%P 33-41
%D 2016
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper presents the first part-of-speech (POS) tagging research for Tigrinya, a Semitic language, based on the newly constructed Nagaoka Tigrinya Corpus. The raw text was extracted from a Tigrinya-language newspaper published in Eritrea. This initial corpus was cleaned and formatted as plain text and in the Text Encoding Initiative (TEI) XML format. A tagset of 73 tags was designed, and the corpus was manually annotated with POS tags. The tagset encodes three levels of grammatical information: main POS categories, subcategories, and POS clitics. The POS-tagged corpus contains 72,080 tokens. Tigrinya has a distinctive root-template morphology whose patterns can be used to infer POS categories. To exploit this, a supervised learning approach based on conditional random fields (CRFs) and support vector machines (SVMs) was applied, trained on contextual features of words and POS tags, morphological patterns, and affixes. Parameters were rigorously optimized, and different combinations of features, data sizes, and tagsets were evaluated to improve overall accuracy, particularly the prediction of POS for unknown words. For a reduced tagset of 20 tags, an overall accuracy of 90.89% was obtained with stratified 10-fold cross-validation. Enriching the contextual features with morphological and affix features improved performance by up to 41.01 percentage points, a substantial gain.
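The three feature families named in the abstract (context, affixes, and morphological patterns) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names `vowel_template` and `token_features`, the fixed affix length of 2, the use of romanized input, and the choice to represent a word's pattern by keeping its vowels and masking every other character as `C`. The sample tokens are illustrative romanized strings, not items from the Nagaoka Tigrinya Corpus.

```python
# Illustrative sketch only: hypothetical feature extraction for a
# CRF/SVM-style POS tagger. Names and design choices are assumptions.

VOWELS = set("aeiou")


def vowel_template(word):
    """Keep vowels and map every non-vowel character to 'C'.

    This approximates a root-template pattern on romanized text:
    the consonantal root is masked, the vocalic template remains.
    """
    return "".join(ch if ch in VOWELS else "C" for ch in word.lower())


def token_features(tokens, i, affix_len=2):
    """Build a feature dict for the token at position i.

    Combines contextual features (neighbouring words), affix features
    (leading and trailing character n-grams), and the vowel template.
    """
    word = tokens[i]
    return {
        "word": word,
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
        "template": vowel_template(word),
    }


if __name__ == "__main__":
    sentence = ["sebere", "bet"]  # hypothetical romanized tokens
    for idx in range(len(sentence)):
        print(token_features(sentence, idx))
```

Feature dictionaries of this shape are a common input format for sequence-labelling toolkits. The point of the template feature is that an unseen word with a familiar vowel pattern can still provide evidence about its POS category, which is the intuition the abstract gives for the gain on unknown words.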

Index Terms

Computer Science
Information Sciences

Keywords

Semitic languages, Tigrinya corpus, Tigrinya part-of-speech tagging, morphological patterns