Search Engine Spam Detection using an Integrated Hybrid Genetic Algorithm based Decision Tree

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2016
Authors:
D. Saraswathi, A. Vijaya
DOI: 10.5120/ijca2016908027

D Saraswathi and A Vijaya. Article: Search Engine Spam Detection using an Integrated Hybrid Genetic Algorithm based Decision Tree. International Journal of Computer Applications 133(10):20-27, January 2016. Published by Foundation of Computer Science (FCS), NY, USA.

BibTeX

@article{key:article,
	author = {D. Saraswathi and A. Vijaya},
	title = {Article: Search Engine Spam Detection using an Integrated Hybrid Genetic Algorithm based Decision Tree},
	journal = {International Journal of Computer Applications},
	year = {2016},
	volume = {133},
	number = {10},
	pages = {20-27},
	month = {January},
	note = {Published by Foundation of Computer Science (FCS), NY, USA}
}

Abstract

Search engine spam poisons search engines. Spammers create it for commercial benefit, and it degrades the quality of search results. Many algorithms already exist for filtering search engine spam, but spammers frequently change their strategies for creating it, so an efficient way to detect it is needed. The proposed system detects search engine spam using an integrated hybrid genetic algorithm based decision tree. The proposed system is evaluated against other methods on several criteria and shows the best performance.

Keywords

Search Engine Spam, Decision Tree, Genetic Algorithm, Tabu Search, Spamdexing, Feature Selection, Metaheuristic Approach