Research Article

Automatic Generation of Stopwords in the Amharic Text

by Sileshi Girmaw Miretie, Vijayshri Khedkar
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 180 - Number 10
Year of Publication: 2018
Authors: Sileshi Girmaw Miretie, Vijayshri Khedkar
10.5120/ijca2018916161

Sileshi Girmaw Miretie, Vijayshri Khedkar. Automatic Generation of Stopwords in the Amharic Text. International Journal of Computer Applications. 180, 10 (Jan 2018), 19-22. DOI=10.5120/ijca2018916161

@article{ 10.5120/ijca2018916161,
author = { Sileshi Girmaw Miretie and Vijayshri Khedkar },
title = { Automatic Generation of Stopwords in the Amharic Text },
journal = { International Journal of Computer Applications },
issue_date = { Jan 2018 },
volume = { 180 },
number = { 10 },
month = { Jan },
year = { 2018 },
issn = { 0975-8887 },
pages = { 19-22 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume180/number10/28898-2018916161/ },
doi = { 10.5120/ijca2018916161 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:00:17.564972+05:30
%A Sileshi Girmaw Miretie
%A Vijayshri Khedkar
%T Automatic Generation of Stopwords in the Amharic Text
%J International Journal of Computer Applications
%@ 0975-8887
%V 180
%N 10
%P 19-22
%D 2018
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Pre-processing is a central task when retrieving information from documents written in different natural languages. During pre-processing, words that occur very frequently but carry little semantic content should be identified; such words are called Stopwords. Stopword lists have been identified for many of the world's languages, such as English, Chinese, Hindi, Arabic, and Sanskrit. To the best of our knowledge, however, there is no standard method for identifying these words in the Amharic language. In this paper, we propose the automatic identification of Stopwords in Amharic text using an aggregate-based methodology that combines word frequency, inverse document frequency, and an entropy measure. Existing Stopword identification techniques rely on static or dictionary-based Stopword lists. This approach is inefficient, expensive, and time-consuming, because the search process takes a long time. The proposed work overcomes these problems by aggregating frequency measures and entropy measures of words in Amharic text to identify Stopwords automatically.
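
The aggregation idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas or the aggregation rule, so everything below is an assumption made for illustration: the function names `word_statistics`, `rank_positions`, and `candidate_stopwords` are hypothetical, the entropy is taken over a word's distribution across documents, and the three measures are combined by summing rank positions. Input documents are assumed to be already tokenized.

```python
import math
from collections import Counter


def word_statistics(docs):
    """Return {word: (tf, idf, entropy)} for a list of tokenized documents.

    Assumed definitions (not taken from the paper):
      tf      = corpus-wide relative frequency of the word
      idf     = log(number of documents / documents containing the word)
      entropy = Shannon entropy of the word's distribution across documents
    """
    doc_counts = [Counter(doc) for doc in docs]
    corpus_counts = Counter()
    for counts in doc_counts:
        corpus_counts.update(counts)
    total_tokens = sum(corpus_counts.values())
    n_docs = len(docs)

    stats = {}
    for word, freq in corpus_counts.items():
        # Occurrence counts in each document that contains the word.
        per_doc = [counts[word] for counts in doc_counts if word in counts]
        tf = freq / total_tokens
        idf = math.log(n_docs / len(per_doc))
        entropy = -sum((c / freq) * math.log(c / freq) for c in per_doc)
        stats[word] = (tf, idf, entropy)
    return stats


def rank_positions(values, descending):
    """Map each word to its position (0 = most stopword-like) under one measure."""
    ordered = sorted(values, key=lambda w: values[w], reverse=descending)
    return {word: position for position, word in enumerate(ordered)}


def candidate_stopwords(docs, top_k):
    """Hypothetical aggregation: sum the rank positions of the three measures.

    A stopword is expected to have high frequency, low IDF, and high entropy
    (it is spread evenly over many documents).
    """
    stats = word_statistics(docs)
    tf_rank = rank_positions({w: s[0] for w, s in stats.items()}, descending=True)
    idf_rank = rank_positions({w: s[1] for w, s in stats.items()}, descending=False)
    ent_rank = rank_positions({w: s[2] for w, s in stats.items()}, descending=True)
    combined = {w: tf_rank[w] + idf_rank[w] + ent_rank[w] for w in stats}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(combined, key=lambda w: (combined[w], w))[:top_k]
```

On a toy corpus such as `[["the", "cat", "the"], ["the", "dog"], ["the", "bird", "the"]]`, calling `candidate_stopwords(docs, 1)` ranks `"the"` first, because it has the highest frequency, an IDF of zero, and the highest entropy. For Amharic text the same functions would apply to token lists produced by an Amharic-aware tokenizer, which this sketch does not include.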

Index Terms

Computer Science
Information Sciences

Keywords

Natural language processing, information retrieval, document pre-processing, Stopwords, Amharic