An Overview of Speaker Recognition and Implementation of Speaker Diarization with Transcription

Arthav Mane; Janhavi Bhopale; Ria Motghare; Priya Chimurkar

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 175 - Number 31

Year of Publication: 2020

Authors: Arthav Mane, Janhavi Bhopale, Ria Motghare, Priya Chimurkar

DOI: 10.5120/ijca2020920867

Arthav Mane, Janhavi Bhopale, Ria Motghare, Priya Chimurkar. An Overview of Speaker Recognition and Implementation of Speaker Diarization with Transcription. International Journal of Computer Applications. 175, 31 (Nov 2020), 1-6. DOI=10.5120/ijca2020920867

@article{ 10.5120/ijca2020920867,
  author = { Arthav Mane and Janhavi Bhopale and Ria Motghare and Priya Chimurkar },
  title = { An Overview of Speaker Recognition and Implementation of Speaker Diarization with Transcription },
  journal = { International Journal of Computer Applications },
  issue_date = { Nov 2020 },
  volume = { 175 },
  number = { 31 },
  month = { Nov },
  year = { 2020 },
  issn = { 0975-8887 },
  pages = { 1-6 },
  numpages = {9},
  url = { https://ijcaonline.org/archives/volume175/number31/31646-2020920867/ },
  doi = { 10.5120/ijca2020920867 },
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-07T00:39:56.666252+05:30
%A Arthav Mane
%A Janhavi Bhopale
%A Ria Motghare
%A Priya Chimurkar
%T An Overview of Speaker Recognition and Implementation of Speaker Diarization with Transcription
%J International Journal of Computer Applications
%@ 0975-8887
%V 175
%N 31
%P 1-6
%D 2020
%I Foundation of Computer Science (FCS), NY, USA

Abstract

This paper presents an overview of the generic speaker recognition process and an implementation that applies it to speaker diarization. The motivation is to present a simple speaker diarization system that incorporates speaker recognition, speech segmentation and speech transcription. Speaker modelling draws on speech features and modelling techniques such as Mel Frequency Cepstral Coefficients (MFCCs), Joint Factor Analysis (JFA), i-vectors and Probabilistic Linear Discriminant Analysis (PLDA) to train Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs) and to perform clustering. Speaker diarization is then carried out to obtain each speaker's speech segments, which are converted into text for the user. The methods discussed and implemented emphasize a maximum identification rate and minimal error in delivering speaker diarization and audio transcription, and are aimed at helping the user create a transcript of conversations between multiple people.
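The diarization step described in the abstract (cluster per-frame features by speaker, then merge consecutive frames into speaker segments) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes toy 2-D vectors in place of real MFCC frames, uses plain k-means as a stand-in for the GMM/HMM-based clustering, and the names `cluster_frames`, `labels_to_segments` and `diarize` are hypothetical. Transcribing each segment is left out.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Segment = Tuple[int, float, float]  # (speaker label, start in seconds, end in seconds)


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_frames(frames: Sequence[Vector], k: int, iterations: int = 20) -> List[int]:
    """Assign each frame to one of k clusters with plain k-means.

    Assumption: k-means stands in for the GMM/HMM clustering named in the abstract.
    """
    # Deterministic farthest-first initialisation, starting from the first frame.
    centroids = [list(frames[0])]
    while len(centroids) < k:
        farthest = max(frames, key=lambda f: min(squared_distance(f, c) for c in centroids))
        centroids.append(list(farthest))

    def assign() -> List[int]:
        return [min(range(k), key=lambda j: squared_distance(f, centroids[j])) for f in frames]

    for _ in range(iterations):
        labels = assign()
        for j in range(k):
            members = [f for f, label in zip(frames, labels) if label == j]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign()


def labels_to_segments(labels: Sequence[int], hop_sec: float) -> List[Segment]:
    """Merge runs of identical per-frame labels into (label, start, end) segments."""
    segments: List[Segment] = []
    if not labels:
        return segments
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start * hop_sec, i * hop_sec))
            start = i
    return segments


def diarize(frames: Sequence[Vector], k: int, hop_sec: float) -> List[Segment]:
    """Cluster frames by speaker, then turn the frame labels into time segments."""
    return labels_to_segments(cluster_frames(frames, k), hop_sec)


# Toy "feature frames": two well-separated groups standing in for two speakers.
FRAMES = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],  # speaker A
    [5.0, 5.1], [5.1, 4.9],                # speaker B
    [0.0, 0.0], [0.1, 0.1],                # speaker A again
]

if __name__ == "__main__":
    for label, start, end in diarize(FRAMES, k=2, hop_sec=0.5):
        print(f"speaker {label}: {start:.1f}s to {end:.1f}s")
```

In a fuller pipeline, each resulting segment would be cut from the audio and passed to a speech-to-text engine, so that the output reads as a speaker-labelled transcript.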

Index Terms

Computer Science

Information Sciences

Keywords

Speaker Recognition, Speaker Modelling, Audio Segmentation, Speaker Diarization, Speech Processing, Speech Recognition, Transcription