Research Article

Trinity for Web Data Extraction using Efficient Algorithm

Published in December 2015 by Sayali Khodade, Roshani Ade
National Conference on Advances in Computing
Foundation of Computer Science USA
NCAC2015 - Number 1
December 2015

Sayali Khodade, Roshani Ade. Trinity for Web Data Extraction using Efficient Algorithm. National Conference on Advances in Computing. NCAC2015, 1 (December 2015), 18-22.

@article{
author = { Sayali Khodade and Roshani Ade },
title = { Trinity for Web Data Extraction using Efficient Algorithm },
journal = { National Conference on Advances in Computing },
issue_date = { December 2015 },
volume = { NCAC2015 },
number = { 1 },
month = { December },
year = { 2015 },
issn = { 0975-8887 },
pages = { 18-22 },
numpages = 5,
url = { /proceedings/ncac2015/number1/23356-5014/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Proceeding Article
%1 National Conference on Advances in Computing
%A Sayali Khodade
%A Roshani Ade
%T Trinity for Web Data Extraction using Efficient Algorithm
%J National Conference on Advances in Computing
%@ 0975-8887
%V NCAC2015
%N 1
%P 18-22
%D 2015
%I International Journal of Computer Applications
Abstract

Nowadays the number of internet users keeps increasing, and the internet holds a huge collection of web data that is very useful to them. Web data extractors are used to extract this data from web documents. The proposed approach operates on two or more web records at once, all generated by the same server-side template, and learns a regular expression that models the template and can later be used to retrieve information from similar records. The template introduces shared patterns that do not provide any relevant data and can thus be disregarded. The technique gives better results for multiword queries than other existing techniques, and errors in the input do not negatively affect its effectiveness.
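The idea in the abstract can be illustrated with a minimal Python sketch. It is not the Trinity algorithm itself (which, per the first reference, builds trinary trees); it is a simplified stand-in that assumes the shared patterns can be approximated by the common substrings reported by the standard library's `difflib.SequenceMatcher`. The names `shared_blocks`, `learn_regex`, `extract`, the `min_len` threshold, and the sample pages are all hypothetical and chosen only for illustration.

```python
import re
from difflib import SequenceMatcher


def shared_blocks(a, b, min_len=3):
    """Return, in order, the substrings that documents a and b have in common.

    Assumption: blocks shorter than min_len are coincidental overlaps
    between data values, not part of the template.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [a[m.a:m.a + m.size]
            for m in matcher.get_matching_blocks()
            if m.size >= min_len]


def learn_regex(docs, min_len=3):
    """Build a regex from two or more documents assumed to share a template."""
    if len(docs) < 2:
        raise ValueError("need at least two documents")
    blocks = shared_blocks(docs[0], docs[1], min_len)
    # Keep only the blocks that also occur, in the same order, in the rest.
    for doc in docs[2:]:
        kept, pos = [], 0
        for block in blocks:
            idx = doc.find(block, pos)
            if idx != -1:
                kept.append(block)
                pos = idx + len(block)
        blocks = kept
    # Shared blocks become literals; the gaps between them become capture groups.
    pattern = "(.*?)".join(re.escape(block) for block in blocks)
    return re.compile(pattern, re.DOTALL)


def extract(pattern, doc):
    """Return the captured data fields of doc, or None if it does not match."""
    match = pattern.search(doc)
    return match.groups() if match else None


# Hypothetical pages generated by the same template.
page_1 = "<html><h1>Dune</h1><p>Herbert</p></html>"
page_2 = "<html><h1>Emma</h1><p>Austen</p></html>"
page_3 = "<html><h1>Ulysses</h1><p>Joyce</p></html>"

template = learn_regex([page_1, page_2])
print(extract(template, page_3))
```

The shared markup is treated as template text carrying no relevant data, and only the variable parts between shared blocks are captured. A known weakness of this simplification is that data values which happen to share a substring of at least `min_len` characters would be mistaken for template text.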

Index Terms

Computer Science
Information Sciences

Keywords

Web Data Extraction, Automatic Wrapper Generation, Wrappers, Unsupervised Learning