Concatenative Speech Synthesis: A Review

Rubeena A. Khan; J. S. Chitode

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 136 - Number 3

Year of Publication: 2016

DOI: 10.5120/ijca2016907992

Rubeena A. Khan, J. S. Chitode. Concatenative Speech Synthesis: A Review. International Journal of Computer Applications. 136, 3 (February 2016), 1-6. DOI=10.5120/ijca2016907992

@article{10.5120/ijca2016907992,
  author = {Rubeena A. Khan and J. S. Chitode},
  title = {Concatenative Speech Synthesis: A Review},
  journal = {International Journal of Computer Applications},
  issue_date = {February 2016},
  volume = {136},
  number = {3},
  month = {February},
  year = {2016},
  issn = {0975-8887},
  pages = {1-6},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume136/number3/24130-2016907992/},
  doi = {10.5120/ijca2016907992},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T23:36:00.132091+05:30
%A Rubeena A. Khan
%A J. S. Chitode
%T Concatenative Speech Synthesis: A Review
%J International Journal of Computer Applications
%@ 0975-8887
%V 136
%N 3
%P 1-6
%D 2016
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The primary objective of this paper is to provide an overview of existing concatenative text-to-speech (TTS) synthesis techniques. Concatenative speech synthesis can be broadly divided into three categories: diphone-based, corpus-based and hybrid. Diphone-based synthesis relies on signal processing techniques such as PSOLA and FD-PSOLA to modify the prosody of the stored units, and these techniques introduce unwanted artifacts into the synthesized speech. The most widely used method is unit selection synthesis, a corpus-based method, which produces the most natural-sounding synthetic speech.
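
Unit selection is commonly framed as a search over the speech database: for each target position there are several candidate units, and the system picks the sequence that minimizes the sum of target costs (how well each unit matches the desired specification) and concatenation costs (how smoothly adjacent units join), usually with Viterbi dynamic programming. The sketch below illustrates that search under loudly stated assumptions: units are toy records with a label, mean F0, duration and boundary F0 values, and the cost functions and weights (`target_cost`, `join_cost`, `w_pitch`, `w_dur`) are hypothetical choices made for illustration. Real systems use richer spectral and prosodic features with tuned weights.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Unit:
    """Hypothetical database unit (e.g. a diphone) with a few prosodic features."""
    name: str           # phonetic label, e.g. "h-e"
    pitch: float        # mean F0 in Hz
    duration: float     # duration in ms
    start_pitch: float  # F0 at the left boundary, in Hz
    end_pitch: float    # F0 at the right boundary, in Hz


@dataclass(frozen=True)
class Target:
    """Hypothetical target specification produced by the text/prosody front end."""
    name: str
    pitch: float
    duration: float


def target_cost(t: Target, u: Unit, w_pitch: float = 1.0, w_dur: float = 0.1) -> float:
    # Assumed target cost: weighted distance between desired and candidate prosody.
    return w_pitch * abs(t.pitch - u.pitch) + w_dur * abs(t.duration - u.duration)


def join_cost(prev: Unit, nxt: Unit) -> float:
    # Assumed concatenation cost: F0 mismatch at the joint between two units.
    return abs(prev.end_pitch - nxt.start_pitch)


def select_units(targets: Sequence[Target],
                 candidates: Sequence[Sequence[Unit]]) -> Tuple[List[Unit], float]:
    """Viterbi search for the unit sequence with minimum total target + join cost.

    Assumes every target position has at least one candidate unit.
    """
    # best[i][j]: lowest cost of any path that ends in candidates[i][j]
    best: List[List[float]] = [[target_cost(targets[0], u) for u in candidates[0]]]
    # back[i][j]: index of the best predecessor in candidates[i - 1]
    back: List[List[int]] = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row: List[float] = []
        ptr: List[int] = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            costs = [best[i - 1][k] + join_cost(p, u)
                     for k, p in enumerate(candidates[i - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + tc)
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)
    # Start from the cheapest final unit and follow the back-pointers.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path: List[Unit] = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


if __name__ == "__main__":
    targets = [Target("h-e", 120, 80), Target("e-l", 115, 90)]
    candidates = [
        [Unit("h-e", 120, 80, 118, 122), Unit("h-e", 140, 80, 138, 142)],
        [Unit("e-l", 116, 90, 140, 112), Unit("e-l", 115, 90, 123, 110)],
    ]
    path, total = select_units(targets, candidates)
    print([u.pitch for u in path], total)  # prints: [120, 115] 1.0
```

With these toy numbers the search keeps the first "h-e" candidate and the second "e-l" candidate: the first "e-l" candidate is closer in mean pitch to nothing better, but its 18 Hz jump at the joint makes it far more expensive than the 1 Hz joint of the second one. This trade-off between matching the target and joining smoothly is what lets unit selection avoid most of the signal processing, and hence most of the artifacts, that diphone-based PSOLA synthesis needs.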

Index Terms

Computer Science

Information Sciences

Keywords

TTS, PSOLA, TD-PSOLA, FD-PSOLA, ESNOLA, MOS, SUS, DRT, HMM