Research Article

An Efficient Approach for Storing and Accessing Small Files with Big Data Technology

by Bharti Gupta, Rajender Nath, Girdhar Gopal, Kartik
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 146 - Number 1
Year of Publication: 2016
Authors: Bharti Gupta, Rajender Nath, Girdhar Gopal, Kartik
10.5120/ijca2016910611

Bharti Gupta, Rajender Nath, Girdhar Gopal, Kartik. An Efficient Approach for Storing and Accessing Small Files with Big Data Technology. International Journal of Computer Applications. 146, 1 (Jul 2016), 36-39. DOI=10.5120/ijca2016910611

@article{ 10.5120/ijca2016910611,
author = { Bharti Gupta and Rajender Nath and Girdhar Gopal and Kartik },
title = { An Efficient Approach for Storing and Accessing Small Files with Big Data Technology },
journal = { International Journal of Computer Applications },
issue_date = { Jul 2016 },
volume = { 146 },
number = { 1 },
month = { Jul },
year = { 2016 },
issn = { 0975-8887 },
pages = { 36-39 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume146/number1/25365-2016910611/ },
doi = { 10.5120/ijca2016910611 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:49:08.921319+05:30
%A Bharti Gupta
%A Rajender Nath
%A Girdhar Gopal
%A Kartik
%T An Efficient Approach for Storing and Accessing Small Files with Big Data Technology
%J International Journal of Computer Applications
%@ 0975-8887
%V 146
%N 1
%P 36-39
%D 2016
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Hadoop is an open-source Apache project and a software framework for distributed processing of large datasets across large clusters of commodity hardware. Large datasets comprise terabytes or petabytes of data, whereas large clusters comprise hundreds or thousands of nodes. Hadoop follows a master-slave architecture with one master node and thousands of slave nodes. The NameNode acts as the master node and stores all file metadata, while the DataNodes are the slave nodes and store all application data. Processing a large number of small files becomes a bottleneck because the NameNode uses more memory to store the file metadata and the DataNodes consume more CPU time to process the files. This paper presents a novel technique for handling the small file problem in Hadoop, based on file merging, caching and correlation strategies. The experimental results show that the proposed technique reduces the amount of data stored at the NameNode and the average memory usage of the DataNodes, and improves the access efficiency of small files in the Hadoop Distributed File System by up to 88.57% compared with the general solution, Hadoop Archive.
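
The file-merging idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which is not described on this page; it only shows the general pattern of packing many small files into one container and keeping an index of offsets. The function names, the in-memory layout, and the figure of roughly 150 bytes of NameNode heap per file or block object are all assumptions made for illustration, and caching and correlation are not modelled.

```python
import io
from typing import Dict, Iterable, Tuple

# Hypothetical index type: file name -> (offset, length) inside the merged blob.
Index = Dict[str, Tuple[int, int]]

# Assumed rule-of-thumb heap cost per NameNode object; not a measured value.
BYTES_PER_OBJECT = 150


def merge_small_files(files: Iterable[Tuple[str, bytes]]) -> Tuple[bytes, Index]:
    """Concatenate small files into one blob and record where each one starts."""
    buffer = io.BytesIO()
    index: Index = {}
    for name, data in files:
        # The current write position is the offset of this file in the blob.
        index[name] = (buffer.tell(), len(data))
        buffer.write(data)
    return buffer.getvalue(), index


def read_small_file(merged: bytes, index: Index, name: str) -> bytes:
    """Recover one original file from the merged blob using the index."""
    offset, length = index[name]
    return merged[offset:offset + length]


def estimated_namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Rough estimate: one metadata object per file plus one per block."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT
```

Under these assumptions, 10,000 single-block small files cost about `estimated_namenode_bytes(10000)` = 3,000,000 bytes of metadata, whereas one merged single-block file costs `estimated_namenode_bytes(1)` = 300 bytes. The index still has to be stored somewhere, so the real saving depends on where and how it is kept.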

References
  1. Shvachko, K., Hairong, K., Radia, S., Chansler, R. 2010. The Hadoop Distributed File System. In proceedings of IEEE 26th Symposium in Mass Storage Systems and Technologies (MSST). 1-10.
  2. Yuan, Yu, Cui, C., Wu, Y., and Chen, Z. 2013. Performance analysis of Hadoop for handling small files in single node. Computer Engineering and Application. Vol. 49, no. 3. 57-60.
  3. White, T. 2009. The Small Files Problem. http://www.cloudera.com/blog/2009/02/the-small-files-problem.
  4. Dong, B., Qiu J., and Zheng Q. 2010. A Novel Approach to improving the Efficiency of Storing and Accessing Small Files on Hadoop: a Case Study by PowerPoint Files. IEEE International Conference on Services Computing, 978-0-7695-4126-6/10. 65-72.
  5. White, T. 2010. Hadoop: The Definitive Guide. 2nd ed. O'Reilly Media, Sebastopol, CA. 41-45.
  6. Jiang, L., Li, B., and Song, M. 2010. The optimization of hdfs based on small files. In 3rd IEEE International Conference on Broadband Network and Multimedia Technology, IC-BNMT. 912-915.
  7. Mackey, G., Sehrish, S., and Wang, J. Aug 31- Sep 4, 2009. Improving metadata management for small files in HDFS. In proceedings of IEEE International Conference on Cluster Computing and Workshops. New Orleans, USA. 1-4.
  8. Min, L., and Yokota, H. 2010. Comparing Hadoop and Fat-Btree based access method for small file I/O applications. Web-age information management, Lecture notes in computer science. Vol. 6184, Springer. 182-193.
  9. Shen, C., Lu, W., Wu, J., and Wei, B. 2010. A digital library architecture supporting massive small files and efficient replica maintenance. In Proceedings of the 10th annual joint conference on digital libraries. ACM. 391-2.
  10. Liu, X., Han, J., Zhong, Y., Han, C., and He, X. 2009. Implementing webgis on hadoop: a case study of improving small file i/o performance on HDFS. In IEEE international conference on cluster computing and workshops, CLUSTER'09. 1-8.
  11. Shvachko, K. 2007. Name-node memory size estimates and optimization proposal. https://issues.apache.org/jira/browse/HADOOP-17S.
  12. Dong, B., Zheng, Q., Tian, F., Chao, K.M., Ma, R., Anane, R. July 2012. An optimized approach for storing and accessing small files on cloud storage. In Proceedings of Journal of Network and Computer Applications 35. 1847-1862.
  13. Gupta, B., Nath, R., Gopal, G. April, 2016. A Novel Techniques to Handle Small Files with Big Data Technology. In Proceedings of Vivechana : A National Conference on Advances in Computer Science and Engineering (ACSE) held at Department of Computer Science & Applications, Kurukshetra University, Kurukshetra, Haryana, India on 29-30 April 2016.
Index Terms

Computer Science
Information Sciences

Keywords

Hadoop, HDFS, Map Reduce, small files in Hadoop, small files storage in Hadoop