Research Article

Web Spam Detection by Learning from Small Labeled Samples

by Jaber Karimpour, Ali A. Noroozi, Somayeh Alizadeh
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 50 - Number 21
Year of Publication: 2012
Authors: Jaber Karimpour, Ali A. Noroozi, Somayeh Alizadeh
DOI: 10.5120/7924-0993

Jaber Karimpour, Ali A. Noroozi, Somayeh Alizadeh. Web Spam Detection by Learning from Small Labeled Samples. International Journal of Computer Applications. 50, 21 (July 2012), 1-5. DOI=10.5120/7924-0993

@article{ 10.5120/7924-0993,
author = { Jaber Karimpour and Ali A. Noroozi and Somayeh Alizadeh },
title = { Web Spam Detection by Learning from Small Labeled Samples },
journal = { International Journal of Computer Applications },
issue_date = { July 2012 },
volume = { 50 },
number = { 21 },
month = { July },
year = { 2012 },
issn = { 0975-8887 },
pages = { 1-5 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume50/number21/7924-0993/ },
doi = { 10.5120/7924-0993 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Jaber Karimpour
%A Ali A. Noroozi
%A Somayeh Alizadeh
%T Web Spam Detection by Learning from Small Labeled Samples
%J International Journal of Computer Applications
%@ 0975-8887
%V 50
%N 21
%P 1-5
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Web spamming attempts to deceive search engines into ranking some pages higher than they deserve. Many methods have been proposed to combat web spamming and to detect spam pages. One basic method is classification, i.e., learning a classification model from previously labeled training data and using this model to classify web pages as spam or non-spam. A drawback of this method is that manually labeling a large number of web pages to generate the training data can be biased, inaccurate, labor-intensive, and time-consuming. In this paper, we propose a new method that addresses this drawback by using semi-supervised learning to label the training data automatically. To do this, we incorporate the Expectation-Maximization algorithm, an efficient and important semi-supervised learning algorithm. Experiments on real web spam data show that the new method performs very well in practice.
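
The general idea of labeling unlabeled pages with Expectation-Maximization can be illustrated with a minimal sketch. The abstract does not state which base classifier or features the authors use, so everything below is an assumption made for illustration: a Gaussian naive Bayes model, two hypothetical numeric page features, toy data, and the function names `fit_gaussian_nb`, `posterior`, and `em_label`. It should not be read as the paper's implementation.

```python
import math

def fit_gaussian_nb(X, resp):
    """M-step: weighted priors, means and variances per class.
    resp[i][c] is the (soft) probability that sample i belongs to class c."""
    n_classes, n_feats = len(resp[0]), len(X[0])
    params = []
    for c in range(n_classes):
        w = [r[c] for r in resp]
        total = sum(w)  # > 0 as long as class c has a labeled sample
        prior = total / len(X)
        means = [sum(wi * x[j] for wi, x in zip(w, X)) / total
                 for j in range(n_feats)]
        # Small constant keeps variances away from zero (assumed smoothing).
        variances = [sum(wi * (x[j] - means[j]) ** 2
                         for wi, x in zip(w, X)) / total + 1e-3
                     for j in range(n_feats)]
        params.append((prior, means, variances))
    return params

def posterior(params, x):
    """E-step for one sample: class posteriors under the current model."""
    logs = []
    for prior, means, variances in params:
        lp = math.log(prior)
        for xj, m, v in zip(x, means, variances):
            lp += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        logs.append(lp)
    top = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logs]
    z = sum(exps)
    return [e / z for e in exps]

def em_label(X_lab, y_lab, X_unlab, n_classes=2, n_iter=20):
    """Train on the small labeled set, then alternate E and M steps
    over labeled plus unlabeled samples; return model and hard labels."""
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in y_lab]
    params = fit_gaussian_nb(X_lab, hard)
    for _ in range(n_iter):
        soft = [posterior(params, x) for x in X_unlab]           # E-step
        params = fit_gaussian_nb(X_lab + X_unlab, hard + soft)   # M-step
    labels = []
    for x in X_unlab:
        p = posterior(params, x)
        labels.append(max(range(n_classes), key=lambda c: p[c]))
    return params, labels

# Toy data (hypothetical features scaled to [0, 1]); 0 = non-spam, 1 = spam.
X_lab = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y_lab = [0, 0, 1, 1]
X_unlab = [[0.15, 0.15], [0.3, 0.25], [0.85, 0.85], [0.7, 0.75]]

params, labels = em_label(X_lab, y_lab, X_unlab)
print(labels)
```

In this sketch the labeled pages keep their fixed labels in every M-step, while the unlabeled pages contribute with soft weights that are re-estimated in each E-step. This is one common way to combine a small labeled sample with a larger unlabeled one.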

Index Terms

Computer Science
Information Sciences

Keywords

Adversarial Information Retrieval, Web Search, Web Spam Detection, Semi-supervised Learning, Expectation Maximization Algorithm