TubeExtractor: A Crawler and Converter for Generating Research DataSet from YouTube Videos

Pooja Ajwani; Harshal Arolkar

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 182 - Number 50

Year of Publication: 2019

Authors: Pooja Ajwani, Harshal Arolkar

10.5120/ijca2019918738

Pooja Ajwani, Harshal Arolkar. TubeExtractor: A Crawler and Converter for Generating Research DataSet from YouTube Videos. International Journal of Computer Applications. 182, 50 (Apr 2019), 14-17. DOI=10.5120/ijca2019918738

@article{ 10.5120/ijca2019918738,

author = { Pooja Ajwani and Harshal Arolkar },

title = { TubeExtractor: A Crawler and Converter for Generating Research DataSet from YouTube Videos },

journal = { International Journal of Computer Applications },

issue_date = { Apr 2019 },

volume = { 182 },

number = { 50 },

month = { Apr },

year = { 2019 },

issn = { 0975-8887 },

pages = { 14-17 },

numpages = {9},

url = { https://ijcaonline.org/archives/volume182/number50/30537-2019918738/ },

doi = { 10.5120/ijca2019918738 },

publisher = {Foundation of Computer Science (FCS), NY, USA},

address = {New York, USA}

}

%0 Journal Article

%1 2024-02-07T01:14:52.721605+05:30

%A Pooja Ajwani

%A Harshal Arolkar

%T TubeExtractor: A Crawler and Converter for Generating Research DataSet from YouTube Videos

%J International Journal of Computer Applications

%@ 0975-8887

%V 182

%N 50

%P 14-17

%D 2019

%I Foundation of Computer Science (FCS), NY, USA

Abstract

With the advent of the internet and e-resources, the amount of data available to users has grown exponentially. Among the many content providers, YouTube ranks as the second most popular website in the world. Because YouTube data is easily accessible, many researchers collect YouTube videos as datasets for their research. Searching YouTube for the videos required for data analysis is cumbersome, however, because the site hosts an enormous number of videos, and researchers therefore spend a great deal of time assembling the required dataset. To reduce this time, this paper proposes an open source application, “TubeExtractor”. TubeExtractor allows researchers to download videos and their metadata from YouTube based on parameters supplied by the researcher. It also outputs a plain text file for each downloaded video, which researchers can use for further processing of their choice. The keywords used to download the videos are provided to the crawler as a document generated by a keyphrase extractor algorithm. If the VTT (Video Text Tracks) file of a video to be downloaded is available, a plain text file is created from it using a two-step parser. TubeExtractor can thus save researchers considerable time.
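The abstract does not describe the two steps of the parser, so the following is only a minimal sketch of how a VTT file might be reduced to plain text, under stated assumptions. It assumes step one discards the header, NOTE/STYLE blocks, cue identifiers and timing lines, and step two strips inline tags and drops consecutive repeated lines. The function names `extract_cue_text`, `clean_lines` and `vtt_to_text` are hypothetical and are not taken from TubeExtractor.

```python
import html
import re
from typing import List

# Cue timing line, e.g. "00:00:01.000 --> 00:00:04.000" (the hours part is optional).
TIMING = re.compile(r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+")
# Inline markup such as <c>, </c> or <00:00:01.500>.
TAG = re.compile(r"<[^>]+>")


def extract_cue_text(vtt: str) -> List[str]:
    """Assumed step 1: keep only the payload lines that follow a timing line."""
    payload: List[str] = []
    blocks = re.split(r"\n\s*\n", vtt.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        # Skip the file header and non-cue blocks.
        if lines[0].strip().startswith(("WEBVTT", "NOTE", "STYLE", "REGION")):
            continue
        for i, line in enumerate(lines):
            if TIMING.match(line.strip()):
                payload.extend(lines[i + 1:])
                break
    return payload


def clean_lines(lines: List[str]) -> List[str]:
    """Assumed step 2: strip tags and entities, drop consecutive duplicates."""
    cleaned: List[str] = []
    for line in lines:
        text = html.unescape(TAG.sub("", line))
        text = " ".join(text.split())
        if text and (not cleaned or cleaned[-1] != text):
            cleaned.append(text)
    return cleaned


def vtt_to_text(vtt: str) -> str:
    """Run both steps and join the result into one plain-text string."""
    return " ".join(clean_lines(extract_cue_text(vtt)))


SAMPLE = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:02.000 align:start position:0%
hello<00:00:00.500><c> world</c>

00:00:02.000 --> 00:00:04.000
hello world
this is &amp; a test
"""

print(vtt_to_text(SAMPLE))  # hello world this is & a test
```

Dropping consecutive duplicates is included because automatically generated captions often repeat the previous line in the next cue; a parser for manually authored subtitles may not need it.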

References

Aliaa A.A. Youssif, Atef Z. Ghalwash, Islam A. Amer, “KPE: An Automatic Keyphrase Extraction Algorithm”, International Conference on Information Systems and Computational Intelligence (ICISCI 2011), 2011.
Ashish Sureka, Ponnurangam Kumaraguru, Atul Goyal, and Sidharth Chhabra, “Mining YouTube to Discover Extremist Videos, Users and Hidden Communities”, Information Retrieval Technology, 6458, 13-24.
Chirag Shah, “Supporting Research Data Collection from YouTube with TubeKit”, Journal of Information Technology & Politics (JITP), 7(2-3), 226-240.
Egor Lakomkin, Sven Magg, Cornelius Weber, Stefan Wermter, “KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos”, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (System Demonstrations), pages 90–95.
Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin and Craig G. Nevill-Manning, “KEA: Practical Automatic Keyphrase Extraction”, in Proceedings of DL '99, the fourth ACM conference on Digital libraries, pages 254-255, Berkeley, California, USA, August 11-14, 1999.
Kayvan Kousha, Mike Thelwall, Mahshid Abdoli, “The role of online videos in research communication: A content analysis of YouTube videos cited in academic publications”, Journal of the American Society for Information Science and Technology, 63(9):1710-1727, September 2012.
Letian Wang, Fang Li, “SJTULTLAB: Chunk Based Method for Keyphrase Extraction”, Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010, pages 158–161.
Luca Rossetto and Heiko Schuldt, “Web Video in Numbers An Analysis of Web-Video Metadata”, arXiv preprint arXiv:1707.01340 (2017).
Nirmala Pudota, Antonina Dattolo, Andrea Baruzzo, Felice Ferrara, Carlo Tasso, “Automatic keyphrase extraction and ontology mining for content-based tag recommendation”, International Journal of Intelligent Systems - New Trends for Ontology-Based Knowledge Discovery, Volume 25 Issue 12, December 2010, Pages 1158-1186.
Rada Mihalcea and Paul Tarau, “TextRank: Bringing Order into Texts”, Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004.
Stuart Rose, Dave Engel, Nick Cramer and Wendy Cowley, “Automatic keyword extraction from individual documents”, Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010, pages 158–161, Uppsala, Sweden, 15-16 July 2010. 2010 Association for Computational Linguistics.
Thomas Steiner, Hannes Mühleisen, Ruben Verborgh, Pierre-Antoine Champin, Benoît Encelle, Yannick Prié, “Weaving the Web(VTT) of Data”, LDOW2014 (7th International Workshop about Linked Data on the Web), April 8, 2014, Seoul, Korea.
Wang Bingwei, Yu Su, “The Research on Related Technologies of Web Crawler”, International Refereed Journal of Engineering and Science (IRJES), ISSN (Online) 2319-183X, (Print) 2319-1821, Volume 6, Issue 4 (April 2017), PP.16-19.
Yuhao Fan, “Design and Implementation of Distributed Crawler System Based on Scrapy”, IOP Conf. Series: Earth and Environmental Science 108 (2018) 042086, doi:10.1088/1755-1315/108/4/042086
www.wikipedia.org
www.alexa.com/siteinfo/youtube.com
https://github.com/rg3/youtube-dl/blob/master/README.md
https://en.wikipedia.org/wiki/Web_crawler
https://www.youtube.com/

Index Terms

Computer Science

Information Sciences

Keywords

Crawler, Keyphrase extractor, parser, youtube-dl, vtt, RAKE