Building English-Punjabi Parallel corpus for Machine Translation

Shishpal Jindal; Vishal Goyal; Jaskarn Singh Bhullar

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 180 - Number 8

Year of Publication: 2017

Authors: Shishpal Jindal, Vishal Goyal, Jaskarn Singh Bhullar

DOI: 10.5120/ijca2017916036

Shishpal Jindal, Vishal Goyal, Jaskarn Singh Bhullar. Building English-Punjabi Parallel corpus for Machine Translation. International Journal of Computer Applications. 180, 8 (Dec 2017), 26-29. DOI=10.5120/ijca2017916036

@article{ 10.5120/ijca2017916036,

author = { Shishpal Jindal and Vishal Goyal and Jaskarn Singh Bhullar },

title = { Building English-Punjabi Parallel corpus for Machine Translation },

journal = { International Journal of Computer Applications },

issue_date = { Dec 2017 },

volume = { 180 },

number = { 8 },

month = { Dec },

year = { 2017 },

issn = { 0975-8887 },

pages = { 26-29 },

numpages = {4},

url = { https://ijcaonline.org/archives/volume180/number8/28821-2017916036/ },

doi = { 10.5120/ijca2017916036 },

publisher = {Foundation of Computer Science (FCS), NY, USA},

address = {New York, USA}

}

%0 Journal Article

%1 2024-02-07T01:00:06.651482+05:30

%A Shishpal Jindal

%A Vishal Goyal

%A Jaskarn Singh Bhullar

%T Building English-Punjabi Parallel corpus for Machine Translation

%J International Journal of Computer Applications

%@ 0975-8887

%V 180

%N 8

%P 26-29

%D 2017

%I Foundation of Computer Science (FCS), NY, USA

Abstract

Objective: A parallel corpus is the key resource for English-Punjabi machine translation, and training a statistical machine translation system requires one. No large-scale English-Punjabi corpus is currently available.

Methods/Analysis: In this paper, the authors focus on building an English-Punjabi corpus at large scale, a task that proved difficult and labor-intensive. We describe in detail the collection of text and the workflow for constructing the parallel corpus. Once the raw text has been collected, the corpus must be refined so that every source-language sentence has a corresponding target-language sentence.

Findings: The paper explores existing tools and also builds new ones. One of the goals is the alignment of the bilingual corpus, and alignment algorithms are used to align the sentences. The accuracy depends on the type of corpus.

Novelty/Improvement: A careful effort has been made to capture different types of text.
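The abstract does not name the alignment algorithm used, so the following is only an illustrative sketch of the refinement step it describes, in which each source sentence is paired with a corresponding target sentence. It assumes a simplified length-based approach in the spirit of Gale and Church: a dynamic program chooses among 1-1, 1-0, 0-1, 2-1 and 1-2 pairings by comparing character lengths. The names `BEADS`, `PENALTY`, `bead_cost` and `align`, and the penalty values, are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Tuple

# Allowed pairings ("beads"): (source sentences consumed, target sentences consumed).
BEADS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
# Hypothetical fixed penalties per bead type; a real aligner would estimate these.
PENALTY = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 1.0, (1, 2): 1.0}


def bead_cost(src_len: int, tgt_len: int, bead: Tuple[int, int]) -> float:
    """Relative character-length difference plus the penalty for this bead type."""
    longest = max(src_len, tgt_len, 1)  # guard against division by zero
    return abs(src_len - tgt_len) / longest + PENALTY[bead]


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return the lowest-cost sequence of beads covering both sentence lists."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int]]]] = [
        [None] * (m + 1) for _ in range(n + 1)
    ]
    cost[0][0] = 0.0
    # Forward pass: extend every reachable cell (i, j) by each allowed bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + bead_cost(s_len, t_len, (di, dj))
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (di, dj)
    # Backward pass: follow the stored beads from the end to the start.
    pairs: List[Tuple[List[str], List[str]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        pairs.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs
```

Calling `align(english_sentences, punjabi_sentences)` returns a list of paired sentence groups. Pairs in which one side is empty mark sentences with no counterpart, which could then be reviewed or dropped. Character length is only a rough signal across scripts, so a production aligner would likely add lexical cues.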

Index Terms

Computer Science

Information Sciences

Keywords

Bilingual corpora, Machine translation, English, Punjabi, NLP