Research Article

Detection of Synthetic or Cloned Voices using Deep Learning and Acoustic Feature Analysis

by Kaushik Sinha, Debalina Sinha Jana
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 17
Year of Publication: 2025
10.5120/ijca2025925228

Kaushik Sinha, Debalina Sinha Jana. Detection of Synthetic or Cloned Voices using Deep Learning and Acoustic Feature Analysis. International Journal of Computer Applications 187, 17 (Jul 2025), 47-52. DOI=10.5120/ijca2025925228

@article{10.5120/ijca2025925228,
  author = {Kaushik Sinha and Debalina Sinha Jana},
  title = {Detection of Synthetic or Cloned Voices using Deep Learning and Acoustic Feature Analysis},
  journal = {International Journal of Computer Applications},
  issue_date = {Jul 2025},
  volume = {187},
  number = {17},
  month = {Jul},
  year = {2025},
  issn = {0975-8887},
  pages = {47-52},
  numpages = {6},
  url = {https://ijcaonline.org/archives/volume187/number17/detection-of-synthetic-or-cloned-voices-using-deep-learning-and-acoustic-feature-analysis/},
  doi = {10.5120/ijca2025925228},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}
%0 Journal Article
%A Kaushik Sinha
%A Debalina Sinha Jana
%T Detection of Synthetic or Cloned Voices using Deep Learning and Acoustic Feature Analysis
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 17
%P 47-52
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The advancement of generative deep learning models has enabled the creation of synthetic and cloned voices that are increasingly indistinguishable from genuine human speech. While these innovations provide numerous benefits in accessibility and personalized services, they also raise serious concerns in the realms of cybersecurity, misinformation, and digital forensics. This paper proposes a robust detection framework that leverages deep neural networks combined with advanced spectro-temporal acoustic features. A hybrid CNN-BiLSTM model is used for binary classification between real and synthetic speech. The model is evaluated on a comprehensive dataset that includes a wide range of synthesized voices generated using state-of-the-art voice cloning technologies. The proposed system achieves a detection accuracy of 96.4% and exhibits strong generalizability across synthesis methods and audio compression formats. The findings underscore the model's potential as a vital tool in multimedia forensics and digital authentication.
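The spectro-temporal acoustic features the abstract refers to can be illustrated with a minimal log-magnitude spectrogram computed in NumPy. This is a sketch only, not the authors' actual front end: the paper does not specify its framing parameters, so the FFT size, hop length, and Hann window below are common illustrative defaults, and a sine tone stands in for a speech clip.

```python
import numpy as np

def log_spectrogram(signal, n_fft=512, hop=160):
    """Frame the signal, apply a Hann window, and take the
    log-magnitude STFT -- a simple spectro-temporal feature map
    of shape (frames, frequency bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spectrum + 1e-8)  # small offset avoids log(0)

# One second of a 440 Hz tone at 16 kHz as a stand-in for audio.
sr = 16000
t = np.arange(sr) / sr
clip = np.sin(2 * np.pi * 440 * t)

feats = log_spectrogram(clip)
print(feats.shape)  # (97, 257): 97 frames x 257 frequency bins
```

A feature map like this would be fed to the CNN front of a CNN-BiLSTM classifier, with the convolutional layers summarizing local spectral patterns per frame and the bidirectional LSTM modeling their evolution over time.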

References
  1. A. van den Oord et al., "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
  2. J. Shen et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), 2018, pp. 4779–4783.
  3. Y. Ren et al., "FastSpeech: Fast, robust and controllable text to speech," in Adv. Neural Inf. Process. Syst. (NeurIPS), 2019, vol. 32.
  4. B. Zhang et al., "Voice synthesis for accessibility: Opportunities and risks," ACM Trans. Comput.-Hum. Interact., vol. 29, no. 3, pp. 1–31, 2022.
  5. J. Kreps et al., "The threat of synthetic media and deepfakes in digital forensics," J. Digit. Forensics, Secur. Law, vol. 17, no. 1, pp. 1–17, 2022.
  6. S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357–366, Aug. 1980.
  7. D. Reynolds et al., "Speaker verification using adapted Gaussian mixture models," Digital Signal Process., vol. 10, no. 1–3, pp. 19–41, 2000.
  8. X. Yang et al., "CNN-based detection of synthetic speech using a short-term spectral feature," in Proc. INTERSPEECH, 2019, pp. 1078–1082.
  9. M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, Nov. 1997.
  10. A. Baevski et al., "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Adv. Neural Inf. Process. Syst. (NeurIPS), 2020, vol. 33, pp. 12449–12460.
  11. A. Radford et al., "Whisper: Robust speech recognition via large-scale weak supervision," OpenAI, 2022. [Online]. Available: https://openai.com/research/whisper
  12. M. Todisco et al., "ASVspoof 2019: Future horizons in spoofed and fake audio detection," Comput. Speech Lang., vol. 63, pp. 101075, Mar. 2020.
  13. M. Chettri et al., "WaveFake: A dataset to facilitate audio deepfake detection," arXiv preprint arXiv:2010.09245, 2020.
  14. K. Ahmed et al., "Fake or Real? Detecting AI-generated audio with raw waveforms," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), 2021, pp. 636–640.
  15. T. Kinnunen et al., "Vulnerability of speaker verification systems to spoofing," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), 2008, pp. 4825–4828.
  16. B. Wang et al., "VALL-E: Neural codec language models are zero-shot text to speech synthesizers," Microsoft Research, 2023. [Online]. Available: https://arxiv.org/abs/2301.02111
  17. Suno AI, "Bark: Transformer-based text-to-audio model," GitHub, 2023. [Online]. Available: https://github.com/suno-ai/bark
  18. J. Ho et al., "Cascaded diffusion models for high fidelity text-to-speech synthesis," in Adv. Neural Inf. Process. Syst. (NeurIPS), 2022, vol. 35, pp. 16445–16459.
  19. Z. Wu and H. Li, "Detecting converted speech and natural speech for anti-spoofing attack in speaker recognition," in Proc. INTERSPEECH, 2013, pp. 715–719.
  20. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Adv. Neural Inf. Process. Syst. (NeurIPS), 2012, vol. 25, pp. 1097–1105.
Index Terms

Computer Science
Information Sciences

Keywords

Multimedia forensics, synthetic voice detection, cloned voice, deep learning, CNN-BiLSTM, spectrogram, audio forensics, GAN-generated speech