Research Article

Machine Learning based Multilingual OCR

by Chandrahas Gaikwad, Satish Akolkar, Reshma Khodade, Deepali Dalal, Smita S. Pawar
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 117 - Number 7
Year of Publication: 2015
10.5120/20568-2963

Chandrahas Gaikwad, Satish Akolkar, Reshma Khodade, Deepali Dalal, Smita S. Pawar. Machine Learning based Multilingual OCR. International Journal of Computer Applications. 117, 7 (May 2015), 27-31. DOI=10.5120/20568-2963

@article{ 10.5120/20568-2963,
author = { Chandrahas Gaikwad and Satish Akolkar and Reshma Khodade and Deepali Dalal and Smita S. Pawar },
title = { Machine Learning based Multilingual OCR },
journal = { International Journal of Computer Applications },
issue_date = { May 2015 },
volume = { 117 },
number = { 7 },
month = { May },
year = { 2015 },
issn = { 0975-8887 },
pages = { 27-31 },
numpages = {5},
url = { https://ijcaonline.org/archives/volume117/number7/20568-2963/ },
doi = { 10.5120/20568-2963 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:58:43.763962+05:30
%A Chandrahas Gaikwad
%A Satish Akolkar
%A Reshma Khodade
%A Deepali Dalal
%A Smita S. Pawar
%T Machine Learning based Multilingual OCR
%J International Journal of Computer Applications
%@ 0975-8887
%V 117
%N 7
%P 27-31
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Paperless business has driven rapid advances in technology, making the storage, processing and retrieval of data nearly effortless. To prevent unintended alterations during these phases, documents are stored as images or in Portable Document Format (PDF). When real-time modifications are required, however, platform and script dependency become barriers and lead to complications. This project presents a generic way to overcome the problem through machine learning. The input consists of a learning character set and a PDF in the same script. The machine learns the distinguishing features of each character in the set through several classifiers, searches the PDF for matching patterns, and generates the corresponding profiles. The classifiers distinguish characters by the number of ripples in their patterns, the number of regions and other parameters. The two sets of profiles are compared, and an exact match is declared as the result. Because of its 'classifier reuse' strategy, the project removes the need to 'start from scratch' when processing a newly encountered script, as conventional software must. It also has social value in situations where the user holds data in a format that is tedious to manipulate: the user can simply supply the PDF and its character set as input and obtain an editable version as output.
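
The profile-matching idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define its features, so two assumptions are made here: "regions" is taken to mean the number of connected background areas of a binarised glyph (the outside plus any enclosed holes), and "ripples" is taken to mean the largest number of separate ink runs crossed by a horizontal scan line. All function names and the tiny hand-drawn glyphs are hypothetical, and segmenting glyphs out of a PDF page is outside the sketch.

```python
from collections import deque


def parse(art):
    """Turn ASCII art ('#' = ink) into a binary grid. Hypothetical helper."""
    return [[1 if ch == "#" else 0 for ch in line] for line in art]


def count_regions(grid):
    """Count 4-connected background regions (assumed meaning of 'regions')."""
    h, w = len(grid), len(grid[0])
    # Pad with a background border so the outside forms a single region.
    padded = [[0] * (w + 2)] + [[0] + list(row) + [0] for row in grid] + [[0] * (w + 2)]
    seen = set()
    regions = 0
    for r in range(h + 2):
        for c in range(w + 2):
            if padded[r][c] == 0 and (r, c) not in seen:
                regions += 1
                seen.add((r, c))
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill of one region
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h + 2 and 0 <= nx < w + 2
                                and padded[ny][nx] == 0 and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
    return regions


def count_ripples(grid):
    """Largest number of ink runs on any row (assumed meaning of 'ripples')."""
    best = 0
    for row in grid:
        runs, prev = 0, 0
        for v in row:
            if v == 1 and prev == 0:
                runs += 1
            prev = v
        best = max(best, runs)
    return best


def profile(grid):
    """Feature profile of one glyph: (regions, ripples)."""
    return (count_regions(grid), count_ripples(grid))


def learn(char_set):
    """Map each learnt profile to its character label."""
    return {profile(grid): label for label, grid in char_set.items()}


def recognise(glyphs, profiles):
    """Declare a result only on an exact profile match; '?' otherwise."""
    return "".join(profiles.get(profile(g), "?") for g in glyphs)


# Hypothetical learning character set, hand-drawn for illustration only.
LEARNING_SET = {
    "O": parse(["###", "#.#", "###"]),
    "I": parse([".#.", ".#.", ".#."]),
    "8": parse(["###", "#.#", "###", "#.#", "###"]),
    "U": parse(["#.#", "#.#", "###"]),
}

profiles = learn(LEARNING_SET)
page_glyphs = [LEARNING_SET["U"], LEARNING_SET["O"], LEARNING_SET["I"]]
print(recognise(page_glyphs, profiles))
```

Two features are enough to separate these four glyphs, but distinct characters in a real script will often share a profile, which is presumably why the abstract mentions "other parameters". Because the feature functions do not depend on any particular script, supporting a new script in this sketch only means supplying a new learning set, which mirrors the 'classifier reuse' idea.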

Index Terms

Computer Science
Information Sciences

Keywords

Multilingual, Optical Character Recognition, Machine Learning, Classifiers