
Crawling the Hidden Web: An Approach to Dynamic Web Indexing

International Journal of Computer Applications
© 2012 by IJCA Journal
Volume 55 - Number 1
Year of Publication: 2012
Authors:
Moumie Soulemane
Mohammad Rafiuzzaman
Hasan Mahmud
DOI: 10.5120/8717-7290

Moumie Soulemane, Mohammad Rafiuzzaman and Hasan Mahmud. Crawling the Hidden Web: An Approach to Dynamic Web Indexing. International Journal of Computer Applications 55(1):7-15, October 2012. Full text available.

BibTeX

@article{key:article,
	author = {Moumie Soulemane and Mohammad Rafiuzzaman and Hasan Mahmud},
	title = {Crawling the Hidden Web: An Approach to Dynamic Web Indexing},
	journal = {International Journal of Computer Applications},
	year = {2012},
	volume = {55},
	number = {1},
	pages = {7-15},
	month = {October},
	note = {Full text available}
}

Abstract

Most websites that publish information online are dynamic, and their content is therefore too sophisticated for many traditional search engines to index. As the number of such hidden web pages keeps growing, the problem continues to divide opinion among researchers and practitioners in the web mining community. The features that enrich these dynamic pages also make them progressively harder to index. In this paper we explain these features and the challenges they raise, and present a framework for dynamic web indexing. We describe the experimental setup and development process behind our implementation of the framework, along with the results we obtained from it. We conclude by outlining possible future work: integrating Hadoop MapReduce with the framework to update and maintain the index.
