Research Article

A Statistical Approach for Estimating Language Model Reliability with Effective Smoothing Technique

by Gend Lal Prajapati, Rekha Saha
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 123 - Number 16
Year of Publication: 2015
Authors: Gend Lal Prajapati, Rekha Saha
DOI: 10.5120/ijca2015905763

Gend Lal Prajapati, Rekha Saha. A Statistical Approach for Estimating Language Model Reliability with Effective Smoothing Technique. International Journal of Computer Applications 123, 16 (August 2015), 31-35. DOI=10.5120/ijca2015905763

@article{10.5120/ijca2015905763,
author = {Gend Lal Prajapati and Rekha Saha},
title = {A Statistical Approach for Estimating Language Model Reliability with Effective Smoothing Technique},
journal = {International Journal of Computer Applications},
issue_date = {August 2015},
volume = {123},
number = {16},
month = {August},
year = {2015},
issn = {0975-8887},
pages = {31-35},
numpages = {5},
url = {https://ijcaonline.org/archives/volume123/number16/22046-2015905763/},
doi = {10.5120/ijca2015905763},
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Gend Lal Prajapati
%A Rekha Saha
%T A Statistical Approach for Estimating Language Model Reliability with Effective Smoothing Technique
%J International Journal of Computer Applications
%@ 0975-8887
%V 123
%N 16
%P 31-35
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Language model smoothing is an essential technique for handling unseen test data: it re-estimates zero-probability n-grams and assigns them small non-zero values. A variety of smoothing techniques exist that trim a small amount of probability mass from the observed n-grams and redistribute it to the zero-probability n-grams within a language model. Kneser-Ney smoothing and Latent Dirichlet Allocation are two established approaches to effective smoothing. In this paper, a scheme is proposed for effective smoothing by combining the Kneser-Ney and Latent Dirichlet Allocation approaches. A second scheme is proposed to measure the reliability of a language model and to determine the relationship between entropy and perplexity. Both schemes are demonstrated with appropriate examples.
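
For orientation on the first component the abstract names, below is a minimal Python sketch of standard interpolated Kneser-Ney smoothing for bigrams. This is not the combined Kneser-Ney/Latent Dirichlet Allocation scheme the paper proposes; the discount d = 0.75 and the corpus format (a list of tokenized sentences) are illustrative assumptions.

from collections import Counter

def kneser_ney_bigram(corpus, d=0.75):
    # corpus: list of tokenized sentences (lists of strings), an assumed input format
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    # continuation count: in how many distinct left contexts each word appears
    continuation = Counter(w2 for (_, w2) in bigrams)
    # fertility of each context: how many distinct words follow it
    follow_types = Counter(w1 for (w1, _) in bigrams)
    n_bigram_types = len(bigrams)

    def prob(w1, w2):
        if unigrams[w1] == 0:  # unseen context: fall back to continuation probability
            return continuation[w2] / n_bigram_types
        lam = d * follow_types[w1] / unigrams[w1]  # interpolation weight
        p_cont = continuation[w2] / n_bigram_types
        return max(bigrams[(w1, w2)] - d, 0) / unigrams[w1] + lam * p_cont

    return prob

For example, prob("the", "cat") is non-zero even if that bigram never occurred in training, because mass discounted from seen bigrams is redistributed through the continuation probability; this is exactly the "trim and redistribute" behaviour the abstract describes.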
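The abstract also relates entropy and perplexity. Under the textbook definitions (cross-entropy H in bits per token, perplexity PP = 2^H), the relationship can be checked directly; the short sketch below uses those standard definitions, not the paper's specific reliability measure.

import math

def cross_entropy_and_perplexity(token_probs):
    # token_probs: probabilities the model assigns to each test token
    # (assumed strictly positive, which is what smoothing guarantees)
    h = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return h, 2.0 ** h

h, pp = cross_entropy_and_perplexity([0.25, 0.1, 0.5, 0.05])
print(h, pp)  # perplexity equals 2 raised to the cross-entropy

Lower entropy implies lower perplexity, and a model that smooths well assigns no zero probabilities on test data, so its perplexity stays finite; this is the connection on which the abstract's reliability scheme builds.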

References
  1. Teemu, V.H. and Virpioja, S. 2003. On Growing and Pruning Kneser-Ney Smoothed N-gram Models. IEEE Transactions on Audio, Speech, and Language Processing. 1617-1624.
  2. Sethy, A., Georgiou, P., Ramabhadran, B. and Narayanan, S. 2007. An Iterative Relative Entropy Minimization-Based Data Selection Approach for N-gram Model Adaptation. IEEE Transactions on Audio, Speech, and Language Processing. 13-23.
  3. Blei, D.M., Ng, A.Y. and Jordan, M.I. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research. 993-1022.
  4. Witten, I.H. and Bell, T.C. 1991. The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression. IEEE Transactions on Information Theory. 1085-1094.
  5. Gao, J. and Lee, K.F. 2000. Distribution-Based Pruning of Backoff Language Models. Association for Computational Linguistics. 579-588.
  6. Yuret, D. 2008. Smoothing a Tera-word Language Model. Association for Computational Linguistics. 141-144.
  7. Chen, S.F. and Goodman, J.T. 1999. An Empirical Study of Smoothing Techniques for Language Modeling. Computer Speech & Language. 359-394.
  8. Hazem, A. and Morin, E. 2013. A Comparison of Smoothing Techniques for Bilingual Lexicon Extraction from Comparable Corpora. Association for Computational Linguistics. 24-33.
  9. Shen, Z.Y., Sun, J. and Shen, Y.D. 2008. Collective Latent Dirichlet Allocation. IEEE International Conference on Data Mining (ICDM). 1019-1024.
  10. Chen, S., Beeferman, D. and Rosenfeld, R. 2002. Evaluation Metrics for Language Models. Association for Computational Linguistics. 176-182.
  11. Kim, W., Khudanpur, S. and Wu, J. 2001. Smoothing Issues in the Structured Language Model. EuroSpeech. 717-720.
  12. Gao, J. and Zhang, M. 2002. Improving Language Model Size Reduction Using Better Pruning Criteria. Association for Computational Linguistics. 176-182.
  13. Taraba, B. 2007. Kneser-Ney Smoothing with a Correcting Transformation for Small Data Sets. IEEE Transactions on Audio, Speech, and Language Processing. 1912-1921.
  14. Zhai, C. and Lafferty, J. 2001. A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. SIGIR Conference on Research and Development in Information Retrieval. 334-342.
  15. Wei, X. and Croft, W.B. 2006. LDA-Based Document Models for Ad-hoc Retrieval. SIGIR Conference on Research and Development in Information Retrieval. 178-185.
  16. Chung, Y.M. and Lee, J.E. 2001. A Corpus-Based Approach to Comparative Evaluation of Statistical Term Association Measures. Journal of the American Society for Information Science and Technology. 283-296.
  17. Huang, F.L., Yu, M.S. and Hwang, C.Y. 2013. An Empirical Study of Good-Turing Smoothing for Language Models on Different Size Corpora of Chinese. Journal of Computer and Communications. 14-19.
  18. Ding, G. and Wang, B. 2005. GJM-2: A Special Case of General Jelinek-Mercer Smoothing Method. In G.G. Lee et al. (Eds.): AIRS, Vol. 3689. Springer-Verlag Berlin Heidelberg. 491-496.
  19. Sundermeyer, M., Schlüter, R. and Ney, H. 2011. On the Estimation of Discount Parameters for Language Model Smoothing. Interspeech, Florence, Italy. 1433-1436.
Index Terms

Computer Science
Information Sciences

Keywords

Smoothing, Pruning, Entropy, Perplexity, Data Sparsity, Statistical Control, Information Retrieval