Research Article

Improving Transliteration Mining by Integrating Expert Knowledge with Statistical Approaches

by Ohnmar Htun, Andrew Finch, Eiichiro Sumita, Yoshiki Mikami
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 58 - Number 17
Year of Publication: 2012
Authors: Ohnmar Htun, Andrew Finch, Eiichiro Sumita, Yoshiki Mikami
10.5120/9373-3821

Ohnmar Htun, Andrew Finch, Eiichiro Sumita, Yoshiki Mikami. Improving Transliteration Mining by Integrating Expert Knowledge with Statistical Approaches. International Journal of Computer Applications. 58, 17 (November 2012), 12-22. DOI=10.5120/9373-3821

@article{ 10.5120/9373-3821,
author = { Ohnmar Htun and Andrew Finch and Eiichiro Sumita and Yoshiki Mikami },
title = { Improving Transliteration Mining by Integrating Expert Knowledge with Statistical Approaches },
journal = { International Journal of Computer Applications },
issue_date = { November 2012 },
volume = { 58 },
number = { 17 },
month = { November },
year = { 2012 },
issn = { 0975-8887 },
pages = { 12-22 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume58/number17/9373-3821/ },
doi = { 10.5120/9373-3821 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:02:45.387832+05:30
%A Ohnmar Htun
%A Andrew Finch
%A Eiichiro Sumita
%A Yoshiki Mikami
%T Improving Transliteration Mining by Integrating Expert Knowledge with Statistical Approaches
%J International Journal of Computer Applications
%@ 0975-8887
%V 58
%N 17
%P 12-22
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper studies methods for integrating human expert knowledge with machine learning approaches to determine the phonetic similarity of word pairs. In the proposed method, a human provides a structure for the edit costs, based on a phonetically motivated model of phoneme sound groups, and the machine determines precise values for these costs within two different frameworks based on stochastic edit distance: a one-to-one expectation-maximization (EM) alignment method and a Bayesian many-to-many alignment approach. This preliminary study is set within the context of cross-language word similarity in transliteration mining. The experiments were performed on a Myanmar-English mining task; in principle, the approach is expected to be most useful for low-resource language pairs, where human expert knowledge can compensate for a lack of data resources. The results show that the approach outperforms baseline systems based only on human knowledge or only on machine learning. They also show that the choice of edit cost is a strong factor in determining the performance of the edit-distance-based techniques used in these experiments. The learned edit costs consistently outperformed a simple set of plausible costs selected by a human expert. Furthermore, providing a structure to the weights for the machine learning process reduced the number of parameters to be learned, simplifying and speeding up the learning task. This method is expected to mitigate issues with data sparseness when learning models for low-resource languages. The reduction in the number of model parameters led to improvements in recall in these experiments, even though the model was considerably smaller, validating the choice of structure.
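
The idea of tying edit costs to phoneme sound groups can be illustrated with a minimal sketch. It is not the authors' implementation: the sound groups, the Latin-letter stand-ins for phonemes, and the three cost values below are hypothetical placeholders chosen for illustration, and the costs are set by hand here rather than learned by EM or Bayesian alignment as in the paper. The sketch only shows how a group structure reduces a full per-pair substitution table to a handful of shared parameters inside a standard weighted edit distance.

```python
from typing import Dict

# Hypothetical sound groups (illustrative only, not the paper's grouping).
_GROUP_LETTERS: Dict[str, str] = {
    "vowel": "aeiou",
    "labial": "pbmfv",
    "dental": "tdnszl",
    "velar": "kgqcx",
}
GROUP_OF: Dict[str, str] = {
    ch: name for name, letters in _GROUP_LETTERS.items() for ch in letters
}

# Hypothetical tied costs: three parameters instead of one per symbol pair.
# In the paper these values would be learned; here they are hand-set.
COSTS: Dict[str, float] = {"same_group": 0.5, "cross_group": 1.0, "indel": 1.0}


def substitution_cost(a: str, b: str, costs: Dict[str, float]) -> float:
    """Cost of replacing symbol a with b under the group structure."""
    if a == b:
        return 0.0
    group_a = GROUP_OF.get(a, "other")
    group_b = GROUP_OF.get(b, "other")
    if group_a == group_b:
        return costs["same_group"]
    return costs["cross_group"]


def weighted_edit_distance(s: str, t: str, costs: Dict[str, float]) -> float:
    """Dynamic-programming edit distance with group-tied substitution costs."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + costs["indel"]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + costs["indel"]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + costs["indel"],  # delete s[i-1]
                d[i][j - 1] + costs["indel"],  # insert t[j-1]
                d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1], costs),
            )
    return d[m][n]


if __name__ == "__main__":
    # "p" and "b" share a group, so the pair is cheaper than "p" versus "k".
    print(weighted_edit_distance("pat", "bat", COSTS))
    print(weighted_edit_distance("pat", "kat", COSTS))
```

With this structure, a candidate word pair whose mismatches fall within the same sound group scores as more similar than a pair whose mismatches cross groups, which is the kind of signal a mining system could threshold on.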

Index Terms

Computer Science
Information Sciences

Keywords

Machine Learning
Transliteration Mining
Cross Language Information Retrieval
Phonetic Similarity
Statistical Approaches
Stochastic Edit Distance