Research Article

Design and Evaluation of a Parallel Classifier for Large-Scale Arabic Text

by Mohammed M. Abu Tair, Rebhi S. Baraka
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 75 - Number 3
Year of Publication: 2013
Authors: Mohammed M. Abu Tair, Rebhi S. Baraka
10.5120/13090-0370

Mohammed M. Abu Tair, Rebhi S. Baraka. Design and Evaluation of a Parallel Classifier for Large-Scale Arabic Text. International Journal of Computer Applications. 75, 3 (August 2013), 13-20. DOI=10.5120/13090-0370

@article{ 10.5120/13090-0370,
author = { Mohammed M. Abu Tair and Rebhi S. Baraka },
title = { Design and Evaluation of a Parallel Classifier for Large-Scale Arabic Text },
journal = { International Journal of Computer Applications },
issue_date = { August 2013 },
volume = { 75 },
number = { 3 },
month = { August },
year = { 2013 },
issn = { 0975-8887 },
pages = { 13-20 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume75/number3/13090-0370/ },
doi = { 10.5120/13090-0370 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:43:16.536804+05:30
%A Mohammed M. Abu Tair
%A Rebhi S. Baraka
%T Design and Evaluation of a Parallel Classifier for Large-Scale Arabic Text
%J International Journal of Computer Applications
%@ 0975-8887
%V 75
%N 3
%P 13-20
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Text classification has become one of the most important techniques in text mining, and a number of machine learning algorithms have been introduced for automatic text classification. One of the most common is the k-NN algorithm, which is known to be among the best classifiers applied to different languages, including Arabic. However, k-NN is inefficient because it requires a large amount of computational power. This drawback makes it unsuitable for handling large volumes of high-dimensional text documents, particularly in Arabic. This paper introduces a high-performance parallel classifier for large-scale Arabic text that achieves improved speedup, scalability, and accuracy. The parallel classifier is based on the sequential k-NN algorithm. It was tested using the OSAC corpus, and its performance was studied on a multicomputer cluster. The results indicate that the parallel classifier has very good speedup and scalability and is capable of handling large document collections with better classification results.
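
The abstract does not say how the k-NN computation is decomposed, so the following is only a minimal Python sketch of one common data-parallel scheme, not the authors' design. It assumes the training set is partitioned across workers, each worker finds its local k nearest neighbours by cosine similarity, and a merge step takes a majority vote over the global top k. The function names, the toy term-weight vectors, and the use of a thread pool in place of a multicomputer cluster are all illustrative assumptions.

```python
import heapq
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def local_top_k(partition, query, k):
    """One worker's share: the k most similar training documents in its partition."""
    scored = ((cosine(vec, query), label) for vec, label in partition)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def parallel_knn(train, query, k=3, workers=2):
    """Classify `query` by majority vote over the global k nearest neighbours."""
    # Round-robin split of the training set, one partition per worker (assumed scheme).
    partitions = [train[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_results = pool.map(lambda p: local_top_k(p, query, k), partitions)
        candidates = [pair for result in local_results for pair in result]
    # Merge step: the global top k is contained in the union of the local top-k lists.
    nearest = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (term-weight vector, category label).
train = [
    ({"goal": 1.0, "match": 1.0}, "sports"),
    ({"team": 1.0, "goal": 1.0}, "sports"),
    ({"market": 1.0, "price": 1.0}, "economy"),
    ({"bank": 1.0, "price": 1.0}, "economy"),
]
predicted = parallel_knn(train, {"goal": 1.0, "team": 1.0}, k=3, workers=2)
print(predicted)
```

On a real cluster the thread pool would typically be replaced by message passing between nodes, with only the small local top-k lists sent to the merging process. Speedup on p processors is conventionally measured as the sequential run time divided by the parallel run time on p processors.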

Index Terms

Computer Science
Information Sciences

Keywords

Arabic text classification, k-NN algorithm, parallel classifier, multicomputer cluster