Research Article

A Neural Network Language Document Representation Technique for Web-Page Classification

by Osanyin Quadri A., Ajose-Ismail B. M.
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 176 - Number 14
Year of Publication: 2020
Authors: Osanyin Quadri A., Ajose-Ismail B. M.
DOI: 10.5120/ijca2020920071

Osanyin Quadri A., Ajose-Ismail B. M. A Neural Network Language Document Representation Technique for Web-Page Classification. International Journal of Computer Applications. 176, 14 (Apr 2020), 38-43. DOI=10.5120/ijca2020920071

@article{ 10.5120/ijca2020920071,
author = { Osanyin Quadri A. and Ajose-Ismail B. M. },
title = { A Neural Network Language Document Representation Technique for Web-Page Classification },
journal = { International Journal of Computer Applications },
issue_date = { Apr 2020 },
volume = { 176 },
number = { 14 },
month = { Apr },
year = { 2020 },
issn = { 0975-8887 },
pages = { 38-43 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume176/number14/31273-2020920071/ },
doi = { 10.5120/ijca2020920071 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T00:42:34.342147+05:30
%A Osanyin Quadri A.
%A Ajose-Ismail B. M.
%T A Neural Network Language Document Representation Technique for Web-Page Classification
%J International Journal of Computer Applications
%@ 0975-8887
%V 176
%N 14
%P 38-43
%D 2020
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Assigning a web page to the correct category is becoming increasingly cumbersome because of the influx of digital documents on the World Wide Web. The performance of applications such as web directories, question answering systems, and web content filtering systems depends on the performance of automatic web page classification systems. According to the extant literature, the performance of a web page classification system depends on an adequate textual representation of the web content. Several statistical document representation techniques, such as bag-of-words models, n-gram models, and topic models, have been proposed to capture the real semantics of web documents, but they are fraught with challenges such as semantic mismatch and words with multiple meanings. Thus, this paper proposes using a recent neural network language model (Doc2Vec), which utilizes document embeddings, to solve the document representation problem of web page classification systems. The results obtained confirm the earlier assumption that Doc2Vec performs robustly on very high-dimensional text such as web documents and that it also captures the real semantics of the web document.
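
The semantic mismatch problem that the abstract attributes to count-based representations can be illustrated with a small sketch. This is not the authors' code or data: the three page snippets are hypothetical, and the tokenizer is a deliberately naive whitespace split. It shows that a bag-of-words model scores two pages with the same meaning but different vocabulary as completely dissimilar, while two pages that merely share words score as similar.

```python
import math
from collections import Counter


def bow_vector(text):
    """Count lowercase whitespace-separated tokens (a toy bag-of-words model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical page snippets, invented for illustration only.
page_a = "cheap automobile rental"
page_b = "inexpensive car hire"    # same meaning as page_a, no shared words
page_c = "cheap automobile parts"  # different topic, two shared words

sim_ab = cosine(bow_vector(page_a), bow_vector(page_b))
sim_ac = cosine(bow_vector(page_a), bow_vector(page_c))

print(f"similar meaning, different words: {sim_ab:.2f}")
print(f"different meaning, shared words:  {sim_ac:.2f}")
```

A document embedding model such as Doc2Vec is intended to reduce this gap by learning dense vectors in which pages like the first two end up close together. How well it does so depends on the training corpus and model settings, which this landing page does not specify.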

Index Terms

Computer Science
Information Sciences

Keywords

Classification, Document embeddings, Machine learning, Document representation, Web page classification, Doc2Vec