
Combination of K-Nearest Neighbor and K-Means based on Term Re-weighting for Classify Indonesian News

International Journal of Computer Applications
© 2012 by IJCA Journal
Volume 50 - Number 11
Year of Publication: 2012
Authors:
Putu Wira Buana
Sesaltina Jannet D. R. M.
I Ketut Gede Darma Putra
DOI: 10.5120/7817-1105

Putu Wira Buana, Sesaltina Jannet D. R. M. and I Ketut Gede Darma Putra. Article: Combination of K-Nearest Neighbor and K-Means based on Term Re-weighting for Classify Indonesian News. International Journal of Computer Applications 50(11):37-42, July 2012. Full text available. BibTeX

@article{key:article,
	author = {Putu Wira Buana and Sesaltina Jannet D. R. M. and I Ketut Gede Darma Putra},
	title = {Article: Combination of K-Nearest Neighbor and K-Means based on Term Re-weighting for Classify Indonesian News},
	journal = {International Journal of Computer Applications},
	year = {2012},
	volume = {50},
	number = {11},
	pages = {37-42},
	month = {July},
	note = {Full text available}
}

Abstract

KNN is a widely accepted classification tool, but it uses every training sample at classification time, which leads to high computational complexity. To address this problem, this paper proposes combining the traditional KNN algorithm with the K-Means clustering algorithm. After the preprocessing step, each word (term) is first weighted using Term Frequency-Inverse Document Frequency (TF-IDF), which weights a term by how often it appears in a document, discounted by how many documents contain it. Second, the training samples of each category are clustered with the K-Means algorithm, and the cluster centers are taken as the new training samples. Third, the reduced training set is used for classification with the KNN algorithm. Finally, the result is evaluated using precision, recall and F-measure. The simulation results show that the proposed combination reaches an accuracy of 87% and an average F-measure of 0.8029 at the best value of k = 5, with a computation time of 55 seconds per document.
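The four steps in the abstract can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the authors' implementation: the abstract does not give the TF-IDF formula, the distance measures, or the number of clusters per category, so the smoothed IDF, Euclidean K-Means, cosine-similarity KNN, the function names, and the toy tokenised documents below are all assumptions.

```python
import math
import random
from collections import Counter, defaultdict

def fit_idf(docs):
    """Learn a vocabulary and IDF weights from tokenised training documents."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    # The "+ 1" smoothing is an assumption; the abstract gives no formula.
    return {t: math.log(n / c) + 1.0 for t, c in sorted(df.items())}

def vectorize(doc, idf):
    """TF-IDF vector of one document; terms outside the vocabulary are ignored."""
    tf = Counter(doc)
    return [tf[t] * w for t, w in idf.items()]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Euclidean K-Means; returns the k cluster centers."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def condense(vectors, labels, clusters_per_class):
    """Step 2: replace each category's samples with its K-Means centers."""
    by_class = defaultdict(list)
    for v, y in zip(vectors, labels):
        by_class[y].append(v)
    return [(c, y) for y, pts in by_class.items()
            for c in kmeans(pts, min(clusters_per_class, len(pts)))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, reduced, k):
    """Step 3: majority vote among the k most similar cluster centers."""
    ranked = sorted(reduced, key=lambda cy: cosine(query, cy[0]), reverse=True)[:k]
    return Counter(y for _, y in ranked).most_common(1)[0][0]

def precision_recall_f1(y_true, y_pred, label):
    """Step 4: per-category precision, recall and F-measure."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical, already preprocessed toy documents (not from the paper's data set).
train_docs = [["bola", "gol", "tim"], ["tim", "pemain", "gol"], ["bola", "pemain", "liga"],
              ["saham", "pasar", "bank"], ["bank", "rupiah", "pasar"], ["saham", "rupiah", "investor"]]
labels = ["sport"] * 3 + ["economy"] * 3

idf = fit_idf(train_docs)
reduced = condense([vectorize(d, idf) for d in train_docs], labels, clusters_per_class=2)
print(knn_predict(vectorize(["gol", "tim", "bola"], idf), reduced, k=3))
```

The saving comes from the condensing step: KNN compares a new document against a handful of cluster centers per category instead of against every training document, so the cost of classification grows with the number of clusters rather than with the size of the training set.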
