Web Spam Detection by Learning from Small Labeled Samples

Jaber Karimpour; Ali A. Noroozi; Somayeh Alizadeh

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 50 - Number 21

Year of Publication: 2012

10.5120/7924-0993

Jaber Karimpour, Ali A. Noroozi, Somayeh Alizadeh. Web Spam Detection by Learning from Small Labeled Samples. International Journal of Computer Applications. 50, 21 (July 2012), 1-5. DOI=10.5120/7924-0993

@article{ 10.5120/7924-0993,

author = { Jaber Karimpour, Ali A. Noroozi, Somayeh Alizadeh },

title = { Web Spam Detection by Learning from Small Labeled Samples },

journal = { International Journal of Computer Applications },

issue_date = { July 2012 },

volume = { 50 },

number = { 21 },

month = { July },

year = { 2012 },

issn = { 0975-8887 },

pages = { 1-5 },

numpages = {9},

url = { https://ijcaonline.org/archives/volume50/number21/7924-0993/ },

doi = { 10.5120/7924-0993 },

publisher = {Foundation of Computer Science (FCS), NY, USA},

address = {New York, USA}

}

%0 Journal Article

%1 2024-02-06T20:48:52.518053+05:30

%A Jaber Karimpour

%A Ali A. Noroozi

%A Somayeh Alizadeh

%T Web Spam Detection by Learning from Small Labeled Samples

%J International Journal of Computer Applications

%@ 0975-8887

%V 50

%N 21

%P 1-5

%D 2012

%I Foundation of Computer Science (FCS), NY, USA

Abstract

Web spamming attempts to deceive search engines into ranking some pages higher than they deserve. Many methods have been proposed to combat web spamming and to detect spam pages. One basic method is classification, i.e., learning a classification model from previously labeled training data and using this model to classify web pages as spam or non-spam. A drawback of this method is that manually labeling a large number of web pages to generate the training data can be biased, inaccurate, labor-intensive, and time-consuming. In this paper, we propose a new method that addresses this drawback by using semi-supervised learning to label the training data automatically. To do this, we incorporate the Expectation-Maximization (EM) algorithm, an efficient and important semi-supervised learning algorithm. Experiments carried out on real web spam data show that the new method performs very well in practice.
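
The general idea of labeling data with Expectation-Maximization from a small labeled sample can be illustrated with a minimal sketch. The abstract does not say which base classifier or features the paper uses, so everything below is an assumption made for illustration: a Gaussian naive Bayes model, the hypothetical function names `fit_gaussian_nb`, `posterior`, and `em_self_label`, and a made-up two-feature toy dataset. It also assumes every class has at least one labeled example. It should not be read as the authors' implementation.

```python
import math

def fit_gaussian_nb(X, resp):
    """M-step: weighted priors, per-feature means and variances.
    resp[i][c] is the (soft) probability that sample i belongs to class c."""
    n_classes, n_features = len(resp[0]), len(X[0])
    total = sum(sum(r) for r in resp)
    params = []
    for c in range(n_classes):
        w = [r[c] for r in resp]
        wc = sum(w)  # assumed > 0: each class has a labeled example
        means = [sum(wi * x[j] for wi, x in zip(w, X)) / wc
                 for j in range(n_features)]
        # A small constant keeps variances strictly positive.
        variances = [sum(wi * (x[j] - means[j]) ** 2
                         for wi, x in zip(w, X)) / wc + 1e-6
                     for j in range(n_features)]
        params.append((wc / total, means, variances))
    return params

def posterior(x, params):
    """E-step for one sample: class posteriors under the current model."""
    log_p = []
    for prior, means, variances in params:
        lp = math.log(prior)
        for xj, m, v in zip(x, means, variances):
            lp += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        log_p.append(lp)
    top = max(log_p)  # subtract the maximum for numerical stability
    exp_p = [math.exp(lp - top) for lp in log_p]
    z = sum(exp_p)
    return [p / z for p in exp_p]

def em_self_label(X_lab, y_lab, X_unlab, n_classes=2, n_iter=20):
    """Train on the labeled sample, then alternate E- and M-steps
    over labeled plus unlabeled data; return the model and hard labels."""
    hard = [[1.0 if y == c else 0.0 for c in range(n_classes)]
            for y in y_lab]
    params = fit_gaussian_nb(X_lab, hard)
    for _ in range(n_iter):
        soft = [posterior(x, params) for x in X_unlab]            # E-step
        params = fit_gaussian_nb(X_lab + X_unlab, hard + soft)    # M-step
    labels = []
    for x in X_unlab:
        p = posterior(x, params)
        labels.append(p.index(max(p)))
    return params, labels

# Toy data (hypothetical features); 0 = non-spam, 1 = spam.
X_lab = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y_lab = [0, 0, 1, 1]
X_unlab = [[0.15, 0.15], [0.05, 0.25], [0.85, 0.85],
           [0.95, 0.75], [0.2, 0.2], [0.8, 0.8]]
params, labels = em_self_label(X_lab, y_lab, X_unlab)
print(labels)
```

In this sketch the labeled pages keep their fixed labels in every M-step, while the unlabeled pages contribute through soft class probabilities that are re-estimated in each E-step. The hard labels returned at the end could then serve as automatically generated training data for a downstream classifier.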

Index Terms

Computer Science

Information Sciences

Keywords

Adversarial Information Retrieval, Web Search, Web Spam Detection, Semi-supervised Learning, Expectation Maximization Algorithm