Research Article

Approaches for Web Spam Detection

by Kanchan Hans, Laxmi Ahuja, S. K. Muttoo
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 101 - Number 1
Year of Publication: 2014
10.5120/17655-8467

Kanchan Hans, Laxmi Ahuja, S. K. Muttoo. Approaches for Web Spam Detection. International Journal of Computer Applications. 101, 1 (September 2014), 38-44. DOI=10.5120/17655-8467

@article{ 10.5120/17655-8467,
author = { Kanchan Hans, Laxmi Ahuja, S. K. Muttoo },
title = { Approaches for Web Spam Detection },
journal = { International Journal of Computer Applications },
issue_date = { September 2014 },
volume = { 101 },
number = { 1 },
month = { September },
year = { 2014 },
issn = { 0975-8887 },
pages = { 38-44 },
numpages = {7},
url = { https://ijcaonline.org/archives/volume101/number1/17655-8467/ },
doi = { 10.5120/17655-8467 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:30:35.717182+05:30
%A Kanchan Hans
%A Laxmi Ahuja
%A S. K. Muttoo
%T Approaches for Web Spam Detection
%J International Journal of Computer Applications
%@ 0975-8887
%V 101
%N 1
%P 38-44
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Spam is a major threat to web security. Spammers abuse the web of trust for personal gain through ever-evolving tactics. In fact, a long chain of spammers runs large business campaigns on the web. Spam causes underutilization of search engine resources and creates dissatisfaction among the web community. Because web security is a prime challenge for search engines, researchers in academia and industry have been motivated to devise new techniques for web spam detection. In this paper, we present a comprehensive survey of web spam detection techniques and discuss their applicability and performance in the scenarios where each outperformed the others. We categorize web spam detection work by the approach used for detection. The paper also suggests possible directions for future work.

Index Terms

Computer Science
Information Sciences

Keywords

Anti-Spam, web security, spam detection approaches, search engines