A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix

Midhun Mathew; Shine N Das; T R Lakshmi Narayanan; Pramod K Vijayaraghavan

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 19 - Number 7

Year of Publication: 2011

DOI: 10.5120/2374-3128

Midhun Mathew, Shine N Das, T R Lakshmi Narayanan, Pramod K Vijayaraghavan. A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix. International Journal of Computer Applications. 19, 7 (April 2011), 16-21. DOI=10.5120/2374-3128

@article{10.5120/2374-3128,
  author = { Midhun Mathew and Shine N Das and T R Lakshmi Narayanan and Pramod K Vijayaraghavan },
  title = { A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix },
  journal = { International Journal of Computer Applications },
  issue_date = { April 2011 },
  volume = { 19 },
  number = { 7 },
  month = { April },
  year = { 2011 },
  issn = { 0975-8887 },
  pages = { 16-21 },
  numpages = {9},
  url = { https://ijcaonline.org/archives/volume19/number7/2374-3128/ },
  doi = { 10.5120/2374-3128 },
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

Abstract

The sheer volume of web documents has weakened the performance and reliability of web search engines. Near-duplicate data is a problem that accompanies the growing need to integrate heterogeneous data. Web content mining faces serious problems because of duplicate and near-duplicate web pages, which either increase the index storage space or increase serving costs, thereby irritating users. Near-duplicate detection is also recognized as an important task in plagiarism detection, spam detection, and focused web crawling. Here we propose a novel approach for finding near-duplicates of an input web page in a huge repository. We propose a TDW (Term-Document-Weight) matrix based algorithm with three phases: rendering, filtering, and verification. The rendering phase receives an input web page and a threshold; the filtering phase applies prefix filtering and positional filtering to reduce the number of candidate records; and the verification phase computes similarity and returns an optimal set of near-duplicate web pages. The experimental results show that our algorithm performs well in terms of two benchmark measures, precision and recall, and reduces the size of the competing record set.
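
The filtering and verification phases described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain whitespace token sets in place of the TDW matrix, Jaccard similarity as the similarity measure, and a global token order by ascending document frequency. The names `tokenize`, `prefix_len`, and `near_duplicates` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter

EPS = 1e-9  # guards math.ceil against floating-point round-up


def tokenize(text):
    """Assumed stand-in for the rendering phase: lowercase whitespace tokens."""
    return set(text.lower().split())


def prefix_len(n, threshold):
    """Prefix length for a record of n tokens under a Jaccard threshold."""
    return n - math.ceil(threshold * n - EPS) + 1


def near_duplicates(query, repository, threshold):
    """Return (doc_id, jaccard) pairs with similarity >= threshold.

    repository maps doc_id -> text. Records are pruned by prefix filtering
    and positional filtering before the exact similarity is computed.
    """
    docs = {doc_id: tokenize(text) for doc_id, text in repository.items()}
    q = tokenize(query)

    # Global order: rare tokens first, ties broken alphabetically.
    df = Counter(tok for toks in list(docs.values()) + [q] for tok in toks)

    def order(tok):
        return (df[tok], tok)

    qs = sorted(q, key=order)
    q_prefix = {tok: i for i, tok in enumerate(qs[:prefix_len(len(qs), threshold)])}

    results = []
    for doc_id, toks in docs.items():
        ds = sorted(toks, key=order)
        d_prefix = ds[:prefix_len(len(ds), threshold)]

        # Prefix filtering: similar records must share a prefix token.
        hit = next(((q_prefix[t], j) for j, t in enumerate(d_prefix) if t in q_prefix), None)
        if hit is None:
            continue

        # Positional filtering: tokens before the first shared token cannot
        # match, so the overlap is bounded by the remaining suffix lengths.
        i, j = hit
        alpha = math.ceil(threshold / (1 + threshold) * (len(qs) + len(ds)) - EPS)
        if min(len(qs) - i, len(ds) - j) < alpha:
            continue

        # Verification: exact Jaccard similarity on the surviving candidate.
        sim = len(q & toks) / len(q | toks)
        if sim >= threshold:
            results.append((doc_id, sim))

    return sorted(results, key=lambda r: -r[1])
```

Both filters only discard records that cannot reach the threshold, so under these assumptions the verification phase sees a smaller candidate set without losing true near-duplicates.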

Index Terms

Computer Science

Information Sciences

Keywords

Near-Duplicate Detection; Term-Document-Weight Matrix; Prefix filtering; Positional filtering; Singular Value Decomposition