An Interactive MFCC-Driven Hierarchical Clustering Framework for Automatic Speaker Diarization with Visual Analytics

Sayyada Sara Banu; Ratnadeep R. Deshmukh


Research Article


International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 187 - Number 77

Year of Publication: 2026

Authors: Sayyada Sara Banu, Ratnadeep R. Deshmukh

DOI: 10.5120/ijca2026926226

Sayyada Sara Banu, Ratnadeep R. Deshmukh. An Interactive MFCC-Driven Hierarchical Clustering Framework for Automatic Speaker Diarization with Visual Analytics. International Journal of Computer Applications. 187, 77 (Jan 2026), 28-34. DOI=10.5120/ijca2026926226

@article{10.5120/ijca2026926226,
  author = {Sayyada Sara Banu and Ratnadeep R. Deshmukh},
  title = {An Interactive MFCC-Driven Hierarchical Clustering Framework for Automatic Speaker Diarization with Visual Analytics},
  journal = {International Journal of Computer Applications},
  issue_date = {Jan 2026},
  volume = {187},
  number = {77},
  month = {Jan},
  year = {2026},
  issn = {0975-8887},
  pages = {28-34},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume187/number77/an-interactive-mfcc-driven-hierarchical-clustering-framework-for-automatic-speaker-diarization-with-visual-analytics/},
  doi = {10.5120/ijca2026926226},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2026-02-01T00:33:39.939904+05:30
%A Sayyada Sara Banu
%A Ratnadeep R. Deshmukh
%T An Interactive MFCC-Driven Hierarchical Clustering Framework for Automatic Speaker Diarization with Visual Analytics
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 77
%P 28-34
%D 2026
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Automatic Speaker Diarization (ASD) is the task of determining “who spoke when” in multi-speaker audio recordings without prior speaker labels. This paper presents a transparent, tunable, GUI-driven diarization framework that integrates MFCC + Δ + Δ² embeddings, adaptive percentile-based Voice Activity Detection (VAD), and Agglomerative Hierarchical Clustering (AHC) with configurable distance metrics and linkage strategies. The system gives the user complete control over preprocessing, segmentation, clustering, and post-processing, and offers rich visual analytics, including waveform-aligned speaker timelines, spectrograms, MFCC heatmaps, PCA-based embedding scatter plots, Silhouette-driven cluster diagnostics, and conversational metrics. Experimental evaluation shows that the proposed MFCC + AHC pipeline produces stable speaker grouping with clear cluster separation and reduced fragmentation after post-processing, with a diarization error rate between 5.8% and 8.1% on test recordings. The tool supports RTTM, CSV, and JSON export and is suited to research, education, conversational analysis, and domain-specific diarization studies that require interpretability and flexibility.
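
The stages named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the default percentile, the choice of cosine distance with average linkage, and the toy two-dimensional vectors standing in for MFCC + Δ + Δ² embeddings are all assumptions made for illustration. Feature extraction is omitted, and per-segment embeddings are assumed to be available already.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def adaptive_vad(frame_energies, q=40.0):
    """Mark a frame as speech when its energy exceeds the q-th percentile
    of the recording's own energies (q=40 is an assumed default)."""
    threshold = percentile(frame_energies, q)
    return [e > threshold for e in frame_energies]


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def ahc(vectors, n_clusters, distance=cosine_distance):
    """Average-linkage agglomerative clustering; returns one label per vector."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(distance(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (a < b) and merge b into a.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[a] += clusters.pop(b)

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def rttm_line(file_id, start, duration, speaker):
    """Format one segment as a SPEAKER record in the RTTM field layout."""
    return f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} <NA> <NA> {speaker} <NA> <NA>"


# Toy data: eight frame energies and four hypothetical segment embeddings.
energies = [0.01, 0.02, 0.9, 1.1, 0.03, 0.8, 1.0, 0.02]
speech = adaptive_vad(energies, q=50.0)
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = ahc(embeddings, n_clusters=2)
record = rttm_line("rec1", 0.5, 1.25, f"spk{labels[0]}")
```

Because the VAD threshold is taken from the recording's own energy distribution, it adapts to the overall loudness of each file, which is the idea behind a percentile-based detector. In practice the number of clusters would be chosen by a stopping criterion or a diagnostic such as the Silhouette score rather than fixed in advance as it is here.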


Index Terms

Computer Science

Information Sciences

Keywords

Speaker diarization, MFCC, hierarchical clustering, adaptive VAD, Silhouette score, PCA, UMAP, speech segmentation, RTTM, conversational analytics, acoustic feature visualization, clustering diagnostics