Research Article

Web Document Segmentation for Better Extraction of Information: A Review

by Hassan F. Eldirdiery, A. H. Ahmed
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 110 - Number 3
Year of Publication: 2015
Authors: Hassan F. Eldirdiery, A. H. Ahmed
10.5120/19297-0734

Hassan F. Eldirdiery, A. H. Ahmed. Web Document Segmentation for Better Extraction of Information: A Review. International Journal of Computer Applications. 110, 3 (January 2015), 24-28. DOI=10.5120/19297-0734

@article{ 10.5120/19297-0734,
author = { Hassan F. Eldirdiery and A. H. Ahmed },
title = { Web Document Segmentation for Better Extraction of Information: A Review },
journal = { International Journal of Computer Applications },
issue_date = { January 2015 },
volume = { 110 },
number = { 3 },
month = { January },
year = { 2015 },
issn = { 0975-8887 },
pages = { 24-28 },
numpages = {5},
url = { https://ijcaonline.org/archives/volume110/number3/19297-0734/ },
doi = { 10.5120/19297-0734 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:45:24.893361+05:30
%A Hassan F. Eldirdiery
%A A. H. Ahmed
%T Web Document Segmentation for Better Extraction of Information: A Review
%J International Journal of Computer Applications
%@ 0975-8887
%V 110
%N 3
%P 24-28
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper reviews the problem of web page segmentation. Recent studies have proposed several approaches for segmenting a web page into multiple blocks. Segmenting a web document is an essential step for many applications, such as text classification, clustering, information extraction, and search. The paper describes each approach in full and shows its contribution to the research area. It also discusses the differences between these approaches, explaining the benefits and limitations of each. In addition, it explores most of the effective algorithms based on these approaches and explains the application area of each algorithm.

Index Terms

Computer Science
Information Sciences

Keywords

Web page segmentation, DOM tree, Information Extraction.