Subtitle Generating Media Player using Mozilla DeepSpeech Model

Waat Perera; B. Hettige

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 185 - Number 28

Year of Publication: 2023

Authors: Waat Perera, B. Hettige

10.5120/ijca2023923033

Waat Perera, B. Hettige. Subtitle Generating Media Player using Mozilla DeepSpeech Model. International Journal of Computer Applications. 185, 28 (Aug 2023), 34-42. DOI=10.5120/ijca2023923033

@article{10.5120/ijca2023923033,
  author = {Waat Perera and B. Hettige},
  title = {Subtitle Generating Media Player using Mozilla DeepSpeech Model},
  journal = {International Journal of Computer Applications},
  issue_date = {Aug 2023},
  volume = {185},
  number = {28},
  month = {Aug},
  year = {2023},
  issn = {0975-8887},
  pages = {34-42},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume185/number28/32871-2023923033/},
  doi = {10.5120/ijca2023923033},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article

%1 2024-02-07T01:27:19.143799+05:30

%A Waat Perera

%A B. Hettige

%T Subtitle Generating Media Player using Mozilla DeepSpeech Model

%J International Journal of Computer Applications

%@ 0975-8887

%V 185

%N 28

%P 34-42

%D 2023

%I Foundation of Computer Science (FCS), NY, USA

Abstract

Subtitles play a major role in media consumption. Most media content comes either without any subtitles or with basic subtitles in its native language only, so finding subtitles in another language, or creating subtitles for new content, is not an easy task. Subtitles in more than one language can be found for famous films, TV shows, and sometimes songs, but the majority of content is not exposed to the internet. To address this issue, this paper proposes a method for generating real-time subtitles in selected languages from English-language media files, using the existing Mozilla DeepSpeech model and the Google Cloud Platform Translation API. The proposed system takes English media content in .mp4 format as input and generates subtitles in the user's preferred language as a .srt file. The paper also gives an overview of existing speech-to-text methods and compares their advantages and disadvantages with those of the Mozilla DeepSpeech model. The system has been tested with human evaluation methods as well as an automated evaluation method, namely BLEU.
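
The abstract does not show how the .srt output is produced, so the following is only a minimal sketch of the final step of such a pipeline, not the authors' implementation. It assumes the speech-to-text stage has already yielded timed text segments; the `Segment` type, the function names, and the `translate` callable (a stand-in for a wrapper around a translation service) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Segment:
    """A recognized chunk of speech (hypothetical type); times are in seconds."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Segment], translate: Callable[[str], str]) -> str:
    """Render timed segments as SRT text, translating each caption.

    `translate` is an assumed placeholder for the translation step; any
    function that maps an English string to the target language fits here.
    """
    blocks: List[str] = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{translate(seg.text)}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [Segment(0.0, 1.25, "hello"), Segment(1.25, 2.0, "world")]
    # The identity function stands in for a real translator in this demo.
    print(to_srt(demo, translate=lambda text: text))
```

Passing the translator in as a callable keeps the formatting logic independent of any particular recognition or translation service, so each stage can be swapped or tested separately.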

Index Terms

Computer Science

Information Sciences

Keywords

Deep Learning, DeepSpeech, Language Translation, Media Player, Mozilla, Speech Recognition, Speech to Text, Subtitle Generation