Research Article

Development of Nepali Character Database for Character Recognition based on Clustering

by Aadesh Neupane
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 107 - Number 11
Year of Publication: 2014
Authors: Aadesh Neupane

Aadesh Neupane. Development of Nepali Character Database for Character Recognition based on Clustering. International Journal of Computer Applications. 107, 11 (December 2014), 42-46. DOI=10.5120/18799-0315

@article{ 10.5120/18799-0315,
author = { Aadesh Neupane },
title = { Development of Nepali Character Database for Character Recognition based on Clustering },
journal = { International Journal of Computer Applications },
issue_date = { December 2014 },
volume = { 107 },
number = { 11 },
month = { December },
year = { 2014 },
issn = { 0975-8887 },
pages = { 42-46 },
numpages = {5},
url = { },
doi = { 10.5120/18799-0315 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:40:50.556578+05:30
%A Aadesh Neupane
%T Development of Nepali Character Database for Character Recognition based on Clustering
%J International Journal of Computer Applications
%@ 0975-8887
%V 107
%N 11
%P 42-46
%D 2014
%I Foundation of Computer Science (FCS), NY, USA

Character recognition requires a large, reliable dataset on which to apply recognition algorithms and build efficient models. For the Nepali language, no such character dataset exists for character recognition research, at least in the public domain. Nepali has 36 consonant characters and 12 vowel characters, and each vowel can modify each consonant. Counting these combinations together with the Nepali numerals gives a total of 446 characters, so creating a Nepali character dataset manually demands considerable effort, cost, and time. This paper describes a semi-supervised clustering approach to creating a Nepali character dataset that reduces this effort and time. It also optimizes an existing segmentation algorithm [1] to segment Nepali characters in both handwritten and scanned Nepali text. Complex features are extracted from the segmented characters using the Discrete Cosine Transform and the wavelet transform, and these features are then used to build a database of Nepali characters with pHash and k-means clustering. At present, the database contains 38,493 characters distributed among 52 clusters.
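The abstract names the building blocks (DCT-based features, pHash, k-means) but not their implementation. The following is a minimal sketch of how such a pipeline could look, not the author's code. It assumes each segmented character has already been resized to a square grayscale grid (32x32 here), uses a common pHash variant (the 8x8 low-frequency DCT block thresholded at its median), and clusters the resulting bit vectors with plain Lloyd's k-means under Euclidean distance. All function names (`dct_2d`, `phash_bits`, `hamming`, `kmeans`) are hypothetical, the farthest-first seeding is an assumption chosen only to keep the sketch deterministic, and the wavelet features are left out.

```python
import math


def dct_2d(block):
    """Unnormalised 2-D DCT-II of a square image given as a list of rows."""
    n = len(block)
    # basis[k][i] = cos(pi * (2i + 1) * k / (2n)), the DCT-II kernel.
    basis = [[math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)]
             for k in range(n)]
    # Transform every row, then every column of the row-transformed result.
    rows = [[sum(row[i] * basis[k][i] for i in range(n)) for k in range(n)]
            for row in block]
    return [[sum(rows[i][c] * basis[k][i] for i in range(n)) for c in range(n)]
            for k in range(n)]


def phash_bits(image, keep=8):
    """keep*keep low-frequency DCT coefficients thresholded at their median."""
    coeffs = dct_2d(image)
    low = [coeffs[r][c] for r in range(keep) for c in range(keep)]
    ac = sorted(low[1:])  # leave the DC term out of the median
    median = ac[len(ac) // 2]
    return [1 if v > median else 0 for v in low]


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-first seeding (assumed)."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing centre.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    # Synthetic stand-in for a segmented glyph: a vertical stroke.
    glyph = [[200 if 12 <= c < 20 else 0 for c in range(32)] for _ in range(32)]
    dimmer = [[v / 2 for v in row] for row in glyph]
    print(hamming(phash_bits(glyph), phash_bits(dimmer)))
```

In a semi-supervised setting like the one the abstract describes, a person would then label each cluster once rather than each character individually. With 446 target characters and 52 reported clusters, the cluster count `k` and any later merging or splitting of clusters are choices the abstract does not specify.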

  1. Bal Krishna Bal and Prajwal Rupakheti, Research Report on the Nepali OCR, PANL10n Admin Reports, September 2009
  2. Eugene Borovikov, A survey of modern optical character recognition techniques (DRAFT), February 2004
  3. Vijay Kumar and Pankaj K Sengar, Segmentation of Printed Text in Devanagari Script and Gurmukhi Script, International Journal of Computer Applications, vol. 3, No. 8, pp 30–33, June 2010.
  4. Mitrakshi B. Patil and Vaibhav Narawade, Recognition of Handwritten Devnagari Characters through Segmentation and Artificial Neural Networks, International Journal of Engineering Research & Technology (IJERT), vol. 1, No. 6, August 2012.
  5. Veena Bansal and R. M. K. Sinha, Segmentation of Touching and Fused Devanagari Characters, Indian Institute of Technology, Kanpur.
  6. Ratnashil N Khobragade, Dr. Nitin A. Koli and Mahendra S Makesar, A Survey of Recognition of Devnagari Script, International Journal of Computer Applications and Information Technology (IJCAIT), vol. 2, No. 1, January 2013.
  7. Richard G. Casey and Eric Lecolinet, A survey of Methods and Strategies in Character Segmentation, IEEE Transactions on PAMI, pp 690-706, 1996.
  8. Mudit Agrawal, Huanfeng Ma, and David Doermann, Generalization of Hindi OCR Using Adaptive Segmentation and Font Files, 2009.
  9. Sanjeev Maharjan, MPP Nepali OCR Report, PANL10n Admin Reports, July 2010.
  10. Anilkumar N Holambe, Ravindra C Thool, Combining Multiple Feature Extraction Technique and Classifiers for Increasing Accuracy for Devanagari OCR, IJSCE, Vol. 3, No. 4, September 2013.
  11. Sheetal Dabra, Sunil Agrawal, and Rama Krishna Challa, A Novel Feature Set for Recognition of Similar Shaped Handwritten Hindi Characters Using Machine Learning, CCSEA 2011, Vol. 02, pp. 25-35, 2011.
  12. Andrew B. Watson, Image Compression Using the Discrete Cosine Transform, Mathematica Journal, Vol. 4, No. 1, pp. 81-88, 1994.
  13. Bian Yang, Fan Gu, and XiaMu Niu, Image Perceptual Hashing, IIH-MSP, pp. 167-172, December 2006.
  14. Christoph Zauner, Implementation and Benchmarking of Perceptual Image Hash Functions, Ph.D. Thesis, University of Sichere Informationssysteme, Hagenberg, July 2010.
  15. Lewis A. S and Knowles G, Image Compression using the 2-D wavelet transform, Image Processing, IEEE Transactions, Vol. 1, No. 2, pp. 244-250.
  16. Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten, The WEKA data mining software: an update, SIGKDD Explorations, Vol. 11, No. 1, pp. 10-18.
  17. Jacob Goldberger, Shiri Gordon, and Hayit Greenspan, Unsupervised Image-Set Clustering Using an Information Theoretic Framework, IEEE Transactions on Image Processing, Vol. 15, No. 2, pp. 449-458, February 2006.
  18. Venkat Rasagna, Anand Kumar, C. V. Jawahar, and R. Manmatha, Robust Recognition of Documents by Fusing Results of Word Clusters.
  19. John W. Eaton, David Bateman, and Soren Hauberg, GNU Octave version 3.0.1 manual: a high-level interactive language for numerical computations, CreateSpace Independent Publishing Platform, 2009.
  20. A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1, pp. 1-38, 1977.
Index Terms

Computer Science
Information Sciences


Keywords

Nepali Character Segmentation, Nepali Character Database, Nepali Character Recognition, Nepali Character Clustering