Research Article

Performance Improvement of Web Page Genre Classification

by K. Pranitha Kumari, A. Venugopal Reddy
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 53 - Number 10
Year of Publication: 2012
Authors: K. Pranitha Kumari, A. Venugopal Reddy
10.5120/8457-2265

K. Pranitha Kumari, A. Venugopal Reddy. Performance Improvement of Web Page Genre Classification. International Journal of Computer Applications. 53, 10 (September 2012), 24-27. DOI=10.5120/8457-2265

@article{ 10.5120/8457-2265,
author = { K. Pranitha Kumari and A. Venugopal Reddy },
title = { Performance Improvement of Web Page Genre Classification },
journal = { International Journal of Computer Applications },
issue_date = { September 2012 },
volume = { 53 },
number = { 10 },
month = { September },
year = { 2012 },
issn = { 0975-8887 },
pages = { 24-27 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume53/number10/8457-2265/ },
doi = { 10.5120/8457-2265 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A K. Pranitha Kumari
%A A. Venugopal Reddy
%T Performance Improvement of Web Page Genre Classification
%J International Journal of Computer Applications
%@ 0975-8887
%V 53
%N 10
%P 24-27
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Because the web is dynamic and the number of web pages keeps growing, it is difficult to find the required pages quickly among the thousands that a search engine retrieves. One solution is to classify web pages according to their genre. Automatic genre identification has become an important area of web page classification because it can improve the quality of web search results and reduce search time. This paper proposes a Combined Stemming Approach (CSA) to extract genre-relevant words and to classify web pages by genre (non-topical) based on word-level and linguistic features. Experiments were performed on the 7-genre corpus. To improve the accuracy of the results, combined stemming and stop-word elimination techniques were applied. The proposed feature extraction approach discriminates web pages by genre. The classification results obtained with a Random Forest classifier were compared with the results of other researchers who worked on the same corpus. The proposed method is shown to outperform those results in terms of accuracy.
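
The abstract does not spell out how the stemmers are combined, so the following is only a minimal sketch of the general idea: remove stop words, then pass each remaining word through two suffix-stripping steps in sequence and count the resulting stems as word-level features. The stop-word list, the two rule sets (`RULES_A`, `RULES_B`), and the function names are all hypothetical stand-ins, not the authors' CSA.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the paper's list is not given here).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "by", "with"}

# Two hypothetical suffix rule sets standing in for two real stemmers.
RULES_A = ("ing", "ed", "s")
RULES_B = ("ation", "ness", "ly")


def strip_suffix(word, suffixes, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def combined_stem(word):
    """Apply rule set A, then rule set B to its output (one possible way to combine)."""
    return strip_suffix(strip_suffix(word, RULES_A), RULES_B)


def extract_features(text):
    """Lowercase, tokenize, drop stop words, and count the combined stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return Counter(combined_stem(t) for t in kept)


if __name__ == "__main__":
    sample = "The classifications of pages and the searching of pages"
    print(extract_features(sample))
```

Stem counts of this kind could then be turned into feature vectors for a classifier such as a Random Forest; that step is omitted here to keep the sketch dependency-free.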

Index Terms

Computer Science
Information Sciences

Keywords

Web page classification
Genre
Corpus
Feature Extraction
Combined Stemming Approach