Research Article

A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix

by Midhun Mathew, Shine N Das, T R Lakshmi Narayanan, Pramod K Vijayaraghavan
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 19 - Number 7
Year of Publication: 2011
DOI: 10.5120/2374-3128

Midhun Mathew, Shine N Das, T R Lakshmi Narayanan, Pramod K Vijayaraghavan. A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix. International Journal of Computer Applications. 19, 7 (April 2011), 16-21. DOI=10.5120/2374-3128

@article{ 10.5120/2374-3128,
author = { Midhun Mathew, Shine N Das, T R Lakshmi Narayanan, Pramod K Vijayaraghavan },
title = { A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix },
journal = { International Journal of Computer Applications },
issue_date = { April 2011 },
volume = { 19 },
number = { 7 },
month = { April },
year = { 2011 },
issn = { 0975-8887 },
pages = { 16-21 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume19/number7/2374-3128/ },
doi = { 10.5120/2374-3128 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:06:21.809351+05:30
%A Midhun Mathew
%A Shine N Das
%A T R Lakshmi Narayanan
%A Pramod K Vijayaraghavan
%T A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix
%J International Journal of Computer Applications
%@ 0975-8887
%V 19
%N 7
%P 16-21
%D 2011
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The sheer volume of web documents has weakened the performance and reliability of web search engines. The existence of near-duplicate data is an issue that accompanies the growing need to incorporate heterogeneous data. Web content mining faces serious problems due to duplicate and near-duplicate web pages. These pages increase either the index storage space or the serving costs, and they irritate users. Near-duplicate detection has been recognized as an important problem in plagiarism detection, spam detection and focused web crawling. Here we propose a novel approach for finding near-duplicates of an input web page in a huge repository. We propose a TDW matrix based algorithm with three phases: rendering, filtering and verification. The rendering phase receives an input web page and a threshold. The filtering phase applies prefix filtering and positional filtering to reduce the size of the record set. The verification phase calculates similarity and returns an optimal set of near-duplicate web pages. The experimental results show improved precision and recall, the two benchmark measures used, along with a reduction in the size of the competing record set.
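The filtering and verification phases described above can be illustrated with a minimal sketch. It is not the authors' TDW matrix algorithm: it assumes pages are already reduced to lists of words, uses Jaccard similarity over word sets as a stand-in similarity measure, applies the standard prefix-filtering bound for Jaccard, and omits positional filtering and any term weighting. The function names (`filter_candidates`, `find_near_duplicates`) and the sample pages are hypothetical.

```python
import math


def jaccard(a, b):
    """Jaccard similarity of two token collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prefix(tokens, threshold):
    """Prefix of a globally ordered token list.

    For Jaccard threshold t, two records can reach similarity t only if
    their prefixes of length |x| - ceil(t * |x|) + 1 share a token.
    The small epsilon guards against floating-point overshoot in ceil.
    """
    n = len(tokens)
    return tokens[: n - math.ceil(threshold * n - 1e-9) + 1]


def filter_candidates(query, repository, threshold):
    """Filtering phase (prefix filtering only): names of surviving records."""
    # Global token order: rarest tokens first, ties broken alphabetically.
    df = {}
    for doc in list(repository.values()) + [query]:
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1

    def ordered(doc):
        return sorted(set(doc), key=lambda t: (df[t], t))

    query_prefix = set(prefix(ordered(query), threshold))
    return [
        name
        for name, doc in repository.items()
        if not query_prefix.isdisjoint(prefix(ordered(doc), threshold))
    ]


def find_near_duplicates(query, repository, threshold):
    """Verification phase: compute similarity only for filtered candidates."""
    results = []
    for name in filter_candidates(query, repository, threshold):
        sim = jaccard(query, repository[name])
        if sim >= threshold:
            results.append((name, sim))
    return sorted(results, key=lambda item: -item[1])


if __name__ == "__main__":
    # Hypothetical sample pages, already tokenized.
    repository = {
        "a": "the quick brown fox jumps over the lazy dog".split(),
        "b": "the quick brown fox jumped over the lazy dog".split(),
        "c": "an entirely different page about cooking pasta".split(),
    }
    query = "the quick brown fox jumps over the lazy dog".split()
    for name, sim in find_near_duplicates(query, repository, 0.7):
        print(name, round(sim, 3))
```

In this sketch the filter discards page "c" without computing its similarity, which is the sense in which filtering shrinks the set of records that reach verification.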

Index Terms

Computer Science
Information Sciences

Keywords

Near-Duplicate Detection, Term-Document-Weight Matrix, Prefix filtering, Positional filtering, Singular Value Decomposition