| International Journal of Computer Applications |
| Foundation of Computer Science (FCS), NY, USA |
| Volume 187 - Number 77 |
| Year of Publication: 2026 |
| Authors: Sayyada Sara Banu, Ratnadeep R. Deshmukh |
DOI: 10.5120/ijca2026926227
Sayyada Sara Banu, Ratnadeep R. Deshmukh. Experimental Analysis of an Interactive MFCC + AHC Speaker Diarization Framework Across Multi-Domain Audio Conditions. International Journal of Computer Applications 187, 77 (Jan 2026), 35-43. DOI=10.5120/ijca2026926227
Automatic Speaker Diarization (ASD)—the process of determining “who spoke when”—is essential for transcription, conversational analytics, call-center monitoring, courtroom recordings, and multilingual human–computer interaction. Classical systems based on MFCCs, GMMs, and hierarchical clustering are interpretable but struggle in noisy, overlapping, and diverse audio conditions, while modern deep-learning approaches like x-vectors, ECAPA-TDNN, and Wav2Vec 2.0 offer higher accuracy but lack transparency. This study evaluates a visualization-enhanced MFCC–GMM–AHC diarization framework across AMI, VoxCeleb, CALLHOME, Mozilla Common Voice, and a custom English–Hindi dataset. The system integrates adaptive VAD, MFCC + Δ + Δ² features, GMM modeling, AHC clustering, and Viterbi re-segmentation with rich diagnostic tools. Results show strong segmentation quality and speaker separability, with DER improving from 12.8% (MFCC–GMM) to 4.7% (Wav2Vec 2.0). The framework demonstrates robust, interpretable, and multi-domain performance.
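The AHC stage of the pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each speech segment has already been summarized as a fixed-length feature vector (e.g. mean MFCC + Δ + Δ² statistics), and it substitutes synthetic two-speaker data for real audio so the example stays self-contained. The choice of average linkage and Euclidean distance is likewise an assumption for illustration.

```python
# Sketch of the agglomerative-hierarchical-clustering (AHC) stage of a
# diarization pipeline. Real MFCC + delta + delta-delta extraction is
# omitted; each segment is represented by a synthetic 13-dim feature vector.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Two synthetic "speakers": 10 segments each, with well-separated means
# standing in for per-segment mean-MFCC statistics.
speaker_a = rng.normal(loc=0.0, scale=0.3, size=(10, 13))
speaker_b = rng.normal(loc=3.0, scale=0.3, size=(10, 13))
segments = np.vstack([speaker_a, speaker_b])

# Build the dendrogram with average linkage on Euclidean distance, then
# cut it into 2 clusters = 2 hypothesized speakers. In practice the cut
# is chosen by a distance threshold or a stopping criterion, not a fixed k.
Z = linkage(segments, method="average", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")

print(labels)  # segments from the same speaker end up with the same label
```

In a full system these cluster labels would then seed per-speaker GMMs and a Viterbi re-segmentation pass to refine the segment boundaries.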