Comparative Cost Analysis Of Template Extraction from Heterogeneous Web Documents

Research Article

Comparative Cost Analysis Of Template Extraction from Heterogeneous Web Documents

Published in December 2014 by Jyoti Mhaske

Innovations and Trends in Computer and Communication Engineering

Foundation of Computer Science USA

ITCCE - Number 2

December 2014

Authors: Jyoti Mhaske

Jyoti Mhaske . Comparative Cost Analysis Of Template Extraction from Heterogeneous Web Documents. Innovations and Trends in Computer and Communication Engineering. ITCCE, 2 (December 2014), 16-18.

@article{

author = { Jyoti Mhaske },

title = { Comparative Cost Analysis Of Template Extraction from Heterogeneous Web Documents },

journal = { Innovations and Trends in Computer and Communication Engineering },

issue_date = { December 2014 },

volume = { ITCCE },

number = { 2 },

month = { December },

year = { 2014 },

issn = { 0975-8887 },

pages = { 16-18 },

numpages = 3,

url = { /proceedings/itcce/number2/19048-2013/ },

publisher = {Foundation of Computer Science (FCS), NY, USA},

address = {New York, USA}

}

%0 Proceeding Article

%1 Innovations and Trends in Computer and Communication Engineering

%A Jyoti Mhaske

%T Comparative Cost Analysis Of Template Extraction from Heterogeneous Web Documents

%J Innovations and Trends in Computer and Communication Engineering

%@ 0975-8887

%V ITCCE

%N 2

%P 16-18

%D 2014

%I International Journal of Computer Applications

Abstract

Automatically extracting structured information from unstructured and semi-structured machine-readable documents plays a vital role today, and the Internet is the major source of such information. Most websites populate common templates with content to achieve good publishing productivity. Template detection has recently received considerable attention because it can improve search engine performance and the clustering and classification of web documents: irrelevant template terms degrade the performance and accuracy of web applications that process pages by machine. Novel algorithms are therefore useful for extracting templates from a large number of web documents generated from heterogeneous templates. The web documents are clustered by the similarity of their underlying template structures, so that the template for each cluster is extracted simultaneously.
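The idea of clustering pages by structural similarity and extracting one template per cluster can be illustrated with a minimal Python sketch. This is not the algorithm evaluated in the paper: it swaps in a simple greedy grouping by Jaccard similarity over DOM tag paths instead of a cost-based (Minimum Description Length) criterion. The names `dom_paths`, `cluster_and_extract`, the `threshold` value, and the sample pages are all hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in one HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_paths(html):
    """Return the set of tag paths (e.g. 'html/body/div') in a page."""
    collector = PathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two path sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_and_extract(pages, threshold=0.6):
    """Greedy clustering (an assumption, not the paper's method).

    Each page joins the first cluster whose current template is similar
    enough; the cluster template is the set of paths shared by all members,
    so templates are refined while the clustering proceeds.
    """
    clusters = []
    for index, html in enumerate(pages):
        paths = dom_paths(html)
        for cluster in clusters:
            if jaccard(paths, cluster["template"]) >= threshold:
                cluster["members"].append(index)
                cluster["template"] &= paths
                break
        else:
            clusters.append({"template": set(paths), "members": [index]})
    return clusters


# Hypothetical sample pages generated from two different templates.
pages = [
    "<html><body><div><h1>A</h1><p>x</p></div></body></html>",
    "<html><body><div><h1>B</h1><p>y</p><span>z</span></div></body></html>",
    "<html><body><table><tr><td>1</td></tr></table></body></html>",
    "<html><body><table><tr><td>2</td></tr><tr><td>3</td></tr></table></body></html>",
]
clusters = cluster_and_extract(pages)
for cluster in clusters:
    print(cluster["members"], sorted(cluster["template"]))
```

In this sketch the first two pages end up in one cluster and the last two in another, and the path `html/body/div/span`, which occurs in only one page, is dropped from the first cluster's template. A greedy pass depends on page order and on the chosen threshold, which is one reason a cost-based criterion is attractive for this problem.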

References

Chulyun Kim and Kyuseok Shim, "TEXT: Automatic Template Extraction from Heterogeneous Web Pages," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 4, April 2011.
Document Object Model (DOM) Level 1 Specification Version 1.0, http://www.w3.org/TR/REC-DOM-Level-1, 2010.
XPath Specification, http://www.w3.org/TR/xpath, 2010.
D. Chakrabarti, R. Kumar, and K. Punera, "Page-Level Template Detection via Isotonic Smoothing," Proc. 16th Int'l Conf. World Wide Web (WWW), 2007.
M. D. Plumbley, "Clustering of Sparse Binary Data Using a Minimum Description Length Approach," http://www.elec.qmul.ac.uk/staffinfo/markp/, 2002.
Chang and S. Lui, "IEPAD: Information Extraction Based on Pattern Discovery," Proc. 2001 Int'l World Wide Web Conf., pp. 681–688, 2001.
M. N. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim, "Xtract: A System for Extracting Document Type Descriptors from XML Documents," Proc. ACM SIGMOD, 2000.
K. Vieira, A. S. da Silva, N. Pinto, E. S. de Moura, J. M. B. Cavalcanti, and J. Freire, "A Fast and Robust Method for Web Page Template Detection and Removal," Proc. 15th ACM Int'l Conf. Information and Knowledge Management, 2006.
T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley Interscience, 1991.
F. Pan, X. Zhang, and W. Wang, "Crd: Fast Co-Clustering on Large Data Sets Utilizing Sampling-Based Matrix Decomposition," Proc. ACM SIGMOD, 2008.
J. Rissanen, "Modeling by Shortest Data Description," Automatica, vol. 14, pp. 465-471, 1978.
H. Zhao, W. Meng, and C. Yu, "Automatic Extraction of Dynamic Record Sections from Search Engine Result Pages," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB), 2006.

Index Terms

Computer Science

Information Sciences

Keywords

Web Template Extraction, Clustering Documents, Minimum Description Length Principle