Research Article

Detection of Synthetic or Cloned Voices using Deep Learning and Acoustic Feature Analysis

by Kaushik Sinha, Debalina Sinha Jana
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 17
Year of Publication: 2025
10.5120/ijca2025925228

Kaushik Sinha, Debalina Sinha Jana. Detection of Synthetic or Cloned Voices using Deep Learning and Acoustic Feature Analysis. International Journal of Computer Applications 187, 17 (Jul 2025), 47-52. DOI=10.5120/ijca2025925228

@article{10.5120/ijca2025925228,
  author = {Kaushik Sinha and Debalina Sinha Jana},
  title = {Detection of Synthetic or Cloned Voices using Deep Learning and Acoustic Feature Analysis},
  journal = {International Journal of Computer Applications},
  issue_date = {Jul 2025},
  volume = {187},
  number = {17},
  month = {Jul},
  year = {2025},
  issn = {0975-8887},
  pages = {47-52},
  numpages = {6},
  url = {https://ijcaonline.org/archives/volume187/number17/detection-of-synthetic-or-cloned-voices-using-deep-learning-and-acoustic-feature-analysis/},
  doi = {10.5120/ijca2025925228},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}
%0 Journal Article
%A Kaushik Sinha
%A Debalina Sinha Jana
%T Detection of Synthetic or Cloned Voices using Deep Learning and Acoustic Feature Analysis
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 17
%P 47-52
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The advancement of generative deep learning models has enabled the creation of synthetic and cloned voices that are increasingly indistinguishable from genuine human speech. While these innovations provide numerous benefits in accessibility and personalized services, they also raise serious concerns in the realms of cybersecurity, misinformation, and digital forensics. This paper proposes a robust detection framework that leverages deep neural networks combined with advanced spectro-temporal acoustic features. A hybrid CNN-BiLSTM model is used for binary classification between real and synthetic speech. The model is evaluated on a comprehensive dataset that includes a wide range of synthesized voices generated using state-of-the-art voice cloning technologies. The proposed system achieves a detection accuracy of 96.4% and exhibits strong generalizability across synthesis methods and audio compression formats. The findings underscore the model's potential as a vital tool in multimedia forensics and digital authentication.
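The spectro-temporal acoustic features the abstract refers to can be illustrated with a minimal log-magnitude spectrogram computed in NumPy. This is a sketch only, not the authors' actual front end: the paper does not specify its framing parameters, so the FFT size, hop length, and Hann window below are common illustrative defaults, and a sine tone stands in for a speech clip.

```python
import numpy as np

def log_spectrogram(signal, n_fft=512, hop=160):
    """Frame the signal, apply a Hann window, and take the
    log-magnitude STFT -- a simple spectro-temporal feature map
    of shape (frames, frequency bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spectrum + 1e-8)  # small offset avoids log(0)

# One second of a 440 Hz tone at 16 kHz as a stand-in for audio.
sr = 16000
t = np.arange(sr) / sr
clip = np.sin(2 * np.pi * 440 * t)

feats = log_spectrogram(clip)
print(feats.shape)  # (97, 257): 97 frames x 257 frequency bins
```

A feature map like this would be fed to the CNN front of a CNN-BiLSTM classifier, with the convolutional layers summarizing local spectral patterns per frame and the bidirectional LSTM modeling their evolution over time.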

References
  1. A. van den Oord et al., "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
  2. J. Shen et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), 2018, pp. 4779–4783.
  3. Y. Ren et al., "FastSpeech: Fast, robust and controllable text to speech," in Adv. Neural Inf. Process. Syst. (NeurIPS), 2019, vol. 32.
  4. B. Zhang et al., "Voice synthesis for accessibility: Opportunities and risks," ACM Trans. Comput.-Hum. Interact., vol. 29, no. 3, pp. 1–31, 2022.
  5. J. Kreps et al., "The threat of synthetic media and deepfakes in digital forensics," J. Digit. Forensics, Secur. Law, vol. 17, no. 1, pp. 1–17, 2022.
  6. S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357–366, Aug. 1980.
  7. D. Reynolds et al., "Speaker verification using adapted Gaussian mixture models," Digital Signal Process., vol. 10, no. 1–3, pp. 19–41, 2000.
  8. X. Yang et al., "CNN-based detection of synthetic speech using a short-term spectral feature," in Proc. INTERSPEECH, 2019, pp. 1078–1082.
  9. M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, Nov. 1997.
  10. A. Baevski et al., "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Adv. Neural Inf. Process. Syst. (NeurIPS), 2020, vol. 33, pp. 12449–12460.
  11. A. Radford et al., "Whisper: Robust speech recognition via large-scale weak supervision," OpenAI, 2022. [Online]. Available: https://openai.com/research/whisper
  12. M. Todisco et al., "ASVspoof 2019: Future horizons in spoofed and fake audio detection," Comput. Speech Lang., vol. 63, pp. 101075, Mar. 2020.
  13. M. Chettri et al., "WaveFake: A dataset to facilitate audio deepfake detection," arXiv preprint arXiv:2010.09245, 2020.
  14. K. Ahmed et al., "Fake or Real? Detecting AI-generated audio with raw waveforms," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), 2021, pp. 636–640.
  15. T. Kinnunen et al., "Vulnerability of speaker verification systems to spoofing," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), 2008, pp. 4825–4828.
  16. B. Wang et al., "VALL-E: Neural codec language models are zero-shot text to speech synthesizers," Microsoft Research, 2023. [Online]. Available: https://arxiv.org/abs/2301.02111
  17. Suno AI, "Bark: Transformer-based text-to-audio model," GitHub, 2023. [Online]. Available: https://github.com/suno-ai/bark
  18. J. Ho et al., "Cascaded diffusion models for high fidelity text-to-speech synthesis," in Adv. Neural Inf. Process. Syst. (NeurIPS), 2022, vol. 35, pp. 16445–16459.
  19. Z. Wu and H. Li, "Detecting converted speech and natural speech for anti-spoofing attack in speaker recognition," in Proc. INTERSPEECH, 2013, pp. 715–719.
  20. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Adv. Neural Inf. Process. Syst. (NeurIPS), 2012, vol. 25, pp. 1097–1105.
Index Terms

Computer Science
Information Sciences

Keywords

Multimedia forensics, synthetic voice detection, cloned voice, deep learning, CNN-BiLSTM, spectrogram, audio forensics, GAN-generated speech