Research Article

Text Extraction from PDF document

Published in January 2013 by D. Sasirekha, E. Chandra
Amrita International Conference of Women in Computing - 2013
Foundation of Computer Science USA
AICWIC - Number 3
January 2013
Authors: D. Sasirekha, E. Chandra

D. Sasirekha, E. Chandra. Text Extraction from PDF document. Amrita International Conference of Women in Computing - 2013. AICWIC, 3 (January 2013), 17-19.

author = { D. Sasirekha, E. Chandra },
title = { Text Extraction from PDF document },
journal = { Amrita International Conference of Women in Computing - 2013 },
issue_date = { January 2013 },
volume = { AICWIC },
number = { 3 },
month = { January },
year = { 2013 },
issn = 0975-8887,
pages = { 17-19 },
numpages = 3,
url = { /proceedings/aicwic/number3/9876-1318/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}

PDF is now widely regarded as a universal document format. Converting a PDF document to speech involves many steps, and extracting the text from the PDF is the first step on which all further processing depends. This paper begins with a brief discussion of the steps involved in extracting text from PDF documents. Its aim is to introduce the basic concepts of PDF and of text extraction for readers who are less familiar with this area of research.

Index Terms

Computer Science
Information Sciences


Text Extraction, PDF, Text Extraction Technique