
Line and Word Segmentation Approach for Printed Documents

RTIPPR
© 2010 by IJCA Journal
Number 1 - Article 4
Year of Publication: 2010
Authors:
Nallapareddy Priyanka
Srikanta Pal
Ranju Mandal

Nallapareddy Priyanka, Srikanta Pal and Ranju Mandal. Line and Word Segmentation Approach for Printed Documents. IJCA, Special Issue on RTIPPR (1):30–36, 2010. Published by Foundation of Computer Science.

BibTeX

@article{key:article,
	author = {Nallapareddy Priyanka and Srikanta Pal and Ranju Mandal},
	title = {Line and Word Segmentation Approach for Printed Documents},
	journal = {IJCA, Special Issue on RTIPPR},
	year = {2010},
	number = {1},
	pages = {30--36},
	note = {Published By Foundation of Computer Science}
}

Abstract

Line and word segmentation is one of the important steps of OCR systems. In this paper we propose a robust method for segmenting individual text lines based on a modified histogram obtained from run-length-based smearing. We present a complete line and word segmentation system for some popular printed Indian languages. Both foreground and background information are used for accurate line segmentation. Characters in two consecutive text lines may touch or overlap, and most line segmentation errors are caused by such touching and overlapping characters. Interline spacing and noise can also make line segmentation difficult. Our method handles these situations accurately. Word segmentation from individual lines is also discussed. We tested our method on documents in Bangla, Devnagari, Kannada and Telugu scripts, as well as on some multi-script documents, and obtained encouraging results.
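To make the ideas in the abstract concrete, the following is a minimal Python sketch of generic run-length smearing followed by projection-histogram segmentation. It is not the authors' method: the abstract does not specify how their histogram is modified or how touching and overlapping lines are resolved, so none of that is modelled here. The function names (`smear_rows`, `runs_above`, `segment_lines`, `segment_words`) and the parameters `max_gap`, `min_ink` and `min_gap` are hypothetical, and the image is assumed to be a binary grid with 1 for ink and 0 for background.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed convention: 1 = ink, 0 = background


def smear_rows(img: Image, max_gap: int) -> Image:
    """Horizontal run-length smearing: fill background runs of length
    <= max_gap that lie between two ink pixels on the same row."""
    out = [row[:] for row in img]
    for row in out:
        last_ink = None  # column of the most recent ink pixel
        for x, v in enumerate(row):
            if v:
                if last_ink is not None and 0 < x - last_ink - 1 <= max_gap:
                    for k in range(last_ink + 1, x):
                        row[k] = 1
                last_ink = x
    return out


def runs_above(profile: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Half-open (start, end) index ranges where profile > threshold."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(img: Image, max_gap: int, min_ink: int = 0) -> List[Tuple[int, int]]:
    """Row ranges of text lines, taken from the horizontal projection
    histogram (ink count per row) of the smeared image."""
    smeared = smear_rows(img, max_gap)
    profile = [sum(row) for row in smeared]
    return runs_above(profile, min_ink)


def segment_words(img: Image, line: Tuple[int, int], min_gap: int) -> List[Tuple[int, int]]:
    """Column ranges of words inside one line: ink runs in the vertical
    projection are merged when the gap between them is below min_gap."""
    top, bottom = line
    width = len(img[0])
    profile = [sum(img[y][x] for y in range(top, bottom)) for x in range(width)]
    words: List[Tuple[int, int]] = []
    for start, end in runs_above(profile):
        if words and start - words[-1][1] < min_gap:
            words[-1] = (words[-1][0], end)  # small gap: same word
        else:
            words.append((start, end))
    return words


# Tiny synthetic page: two text lines, the first containing two words.
page = [
    "0000000000000",
    "0110110001110",
    "0110110001110",
    "0000000000000",
    "0011100110000",
    "0000000000000",
]
img = [[int(c) for c in row] for row in page]
lines = segment_lines(img, max_gap=3)
words = segment_words(img, lines[0], min_gap=2)
```

In this sketch, smearing widens the text rows in the histogram, so a nonzero `min_ink` threshold can separate them from sparse noise rows. Lines whose characters touch or overlap vertically would be merged into one range by this simple profile cut, which is exactly the case the paper reports addressing with additional foreground and background analysis.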
