Research Article

A Cluster based Approach with N-grams at Word Level for Document Classification

by Apeksha Khabia, M. B. Chandak
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 117 - Number 23
Year of Publication: 2015
Authors: Apeksha Khabia, M. B. Chandak
DOI: 10.5120/20697-3599

Apeksha Khabia, M. B. Chandak. A Cluster based Approach with N-grams at Word Level for Document Classification. International Journal of Computer Applications. 117, 23 (May 2015), 38-42. DOI=10.5120/20697-3599

@article{ 10.5120/20697-3599,
author = { Apeksha Khabia and M. B. Chandak },
title = { A Cluster based Approach with N-grams at Word Level for Document Classification },
journal = { International Journal of Computer Applications },
issue_date = { May 2015 },
volume = { 117 },
number = { 23 },
month = { May },
year = { 2015 },
issn = { 0975-8887 },
pages = { 38-42 },
numpages = {5},
url = { https://ijcaonline.org/archives/volume117/number23/20697-3599/ },
doi = { 10.5120/20697-3599 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:00:13.255808+05:30
%A Apeksha Khabia
%A M. B. Chandak
%T A Cluster based Approach with N-grams at Word Level for Document Classification
%J International Journal of Computer Applications
%@ 0975-8887
%V 117
%N 23
%P 38-42
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The rapid progress of computers and the web makes it easy to collect and store large amounts of information in the form of text, e.g., reviews, forum postings, blogs, web pages, news articles, and email messages. In text mining, the growing size of text datasets and the high dimensionality associated with natural language are major challenges that make it difficult to classify documents into categories and sub-categories. This paper focuses on a cluster-based document classification technique in which the data inside each cluster share some common trait. The common approach to document clustering is the bag-of-words (BOW) model, where individual words are the features. Because only isolated words are considered, some semantic information is lost. We therefore use a vector space model based on N-grams at word level, which helps reduce this loss. High dimensionality is addressed by feature selection: a threshold is applied to the feature values of the vector space model. The vector space is then mapped into a modified one with latent semantic analysis (LSA), and the documents are clustered with the k-means algorithm. Experiments are performed on a Stack Exchange data set covering several categories, with R as the text mining tool. The results show that tri-grams give better clustering results than single words and bi-grams.
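
As a rough illustration of the first two steps described in the abstract (word-level N-gram features and threshold-based feature selection), the following Python sketch may help. It is not the authors' R implementation. The function names `word_ngrams` and `build_vector_space`, the smoothed TF-IDF weighting, and the rule of keeping features whose summed weight reaches `min_total_weight` are all assumptions made for this example; the abstract does not say which feature values are thresholded or how.

```python
from collections import Counter
import math


def word_ngrams(text, n):
    """Return the word-level n-grams of `text`, each joined by single spaces."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vector_space(docs, n, min_total_weight):
    """Build TF-IDF vectors over word n-grams (hypothetical helper).

    Assumption: a feature is kept only if its TF-IDF weight, summed over
    all documents, is at least `min_total_weight`.
    """
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    # Document frequency: number of documents containing each n-gram
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    # Smoothed IDF, so n-grams found in every document keep a nonzero weight
    idf = {g: math.log((1 + n_docs) / (1 + df[g])) + 1.0 for g in df}
    totals = Counter()
    for c in counts:
        for g, tf in c.items():
            totals[g] += tf * idf[g]
    # Feature selection: drop n-grams below the threshold
    vocab = sorted(g for g, w in totals.items() if w >= min_total_weight)
    matrix = [[c[g] * idf[g] for g in vocab] for c in counts]
    return vocab, matrix


docs = [
    "how to sort a list in python",
    "sort a list of numbers in python",
    "best way to bake bread",
]
vocab, matrix = build_vector_space(docs, n=2, min_total_weight=2.0)
print(vocab)
```

With `n=1` the same function yields the plain bag-of-words features, and with `n=3` the tri-gram features that the abstract compares. The resulting matrix is what would then be passed to LSA and k-means, both of which are omitted here.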

Index Terms

Computer Science
Information Sciences

Keywords

Document clustering, N-grams at word level, dimensionality reduction, Latent Semantic Analysis