Research Article

A Classifier for Schema Types Generated by Web Data Extraction Systems

by Mohammed Kayed, Awny Sayed, Marwa Hashem
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 107 - Number 1
Year of Publication: 2014
Authors: Mohammed Kayed, Awny Sayed, Marwa Hashem
10.5120/18716-9936

Mohammed Kayed, Awny Sayed, Marwa Hashem. A Classifier for Schema Types Generated by Web Data Extraction Systems. International Journal of Computer Applications. 107, 1 (December 2014), 27-36. DOI=10.5120/18716-9936

@article{ 10.5120/18716-9936,
author = { Mohammed Kayed and Awny Sayed and Marwa Hashem },
title = { A Classifier for Schema Types Generated by Web Data Extraction Systems },
journal = { International Journal of Computer Applications },
issue_date = { December 2014 },
volume = { 107 },
number = { 1 },
month = { December },
year = { 2014 },
issn = { 0975-8887 },
pages = { 27-36 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume107/number1/18716-9936/ },
doi = { 10.5120/18716-9936 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:39:56.477663+05:30
%A Mohammed Kayed
%A Awny Sayed
%A Marwa Hashem
%T A Classifier for Schema Types Generated by Web Data Extraction Systems
%J International Journal of Computer Applications
%@ 0975-8887
%V 107
%N 1
%P 27-36
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Generating a web site schema is a core step for value-added services on the web, such as comparative shopping and information integration systems. Several approaches have been developed to detect this schema. For a real web site, the complexity of the site schema makes post-processing a challenge: labeling the schema types, comparing different schema types, and generating an extractor for instances of a schema type. In this paper, a new tree structure called the schema-type semantic model is proposed as a classifier for a schema type. Given some instances of a schema type, the HTML tag contents, DOM tree structural information, and visual information of these instances are exploited to construct the classifier. Using a multivariate normal distribution, the classifier can compare two different schema types; i.e., it can be used for schema mapping, which is a core step of information integration. The classifier can also detect and extract instances of a schema type; i.e., it can serve as an extractor for web data extraction systems. Furthermore, the classifier can improve the schema generated by web data extraction systems; i.e., it can be used to bring that schema as close as possible to a perfect one. Experiments on the schemas of the test web sites (a data set of 40 web sites) show encouraging results.
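
The idea of scoring instances of a schema type with a normal distribution can be illustrated with a minimal sketch. It is not the authors' schema-type semantic model: the three features (text length, DOM depth, rendered width), the function names, the `slack` threshold, and the sample data are all hypothetical, and the covariance is assumed to be diagonal to keep the example dependency-free.

```python
import math
from statistics import mean, pvariance

def fit_gaussian(instances, min_var=1e-3):
    """Fit a per-feature mean and variance (diagonal-covariance normal).

    Assumption: features are treated as independent; min_var is a floor
    that avoids zero variance for constant features.
    """
    columns = list(zip(*instances))
    means = [mean(c) for c in columns]
    variances = [max(pvariance(c), min_var) for c in columns]
    return means, variances

def log_density(x, model):
    """Log-density of feature vector x under the fitted model."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )

def is_instance(x, model, training, slack=5.0):
    """Accept x if it scores no worse than the weakest training instance
    minus a hypothetical slack margin."""
    threshold = min(log_density(t, model) for t in training) - slack
    return log_density(x, model) >= threshold

def mean_log_density(instances, model):
    """Average score of a set of instances under another type's model,
    usable as a rough similarity between two schema types."""
    return mean(log_density(x, model) for x in instances)

# Hypothetical feature vectors: (text length, DOM depth, rendered width).
price_instances = [(5, 4, 60), (6, 4, 62), (5, 4, 58), (7, 4, 65)]
title_instances = [(40, 3, 300), (55, 3, 320), (48, 3, 310)]

price_model = fit_gaussian(price_instances)
accepts_price = is_instance((6, 4, 61), price_model, price_instances)
accepts_title = is_instance((48, 3, 310), price_model, price_instances)
```

In this sketch `accepts_price` is true and `accepts_title` is false, and `mean_log_density(title_instances, price_model)` is far lower than the same score for `price_instances`, which is the kind of signal a schema-mapping step could threshold on.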

Index Terms

Computer Science
Information Sciences

Keywords

Schema Mapping, Schema Type Classifier, Schema Filtration, Web Data Extraction