Research Article

An Approach to mining massive Data

Published on May 2012 by Reena Bharathi, Nitin N Keswani, Siddesh D Shinde
National Conference on Recent Trends in Computing
Foundation of Computer Science USA
NCRTC - Number 4
May 2012

Reena Bharathi, Nitin N Keswani, Siddesh D Shinde . An Approach to mining massive Data. National Conference on Recent Trends in Computing. NCRTC, 4 (May 2012), 32-36.

@article{
author = { Reena Bharathi, Nitin N Keswani, Siddesh D Shinde },
title = { An Approach to mining massive Data },
journal = { National Conference on Recent Trends in Computing },
issue_date = { May 2012 },
volume = { NCRTC },
number = { 4 },
month = { May },
year = { 2012 },
issn = { 0975-8887 },
pages = { 32-36 },
numpages = 5,
url = { /proceedings/ncrtc/number4/6542-1032/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Proceeding Article
%1 National Conference on Recent Trends in Computing
%A Reena Bharathi
%A Nitin N Keswani
%A Siddesh D Shinde
%T An Approach to mining massive Data
%J National Conference on Recent Trends in Computing
%@ 0975-8887
%V NCRTC
%N 4
%P 32-36
%D 2012
%I International Journal of Computer Applications
Abstract

Modern internet and scientific applications have created a need to manage immense amounts of data quickly. According to one study, the amount of information created and replicated is forecast to reach 35 zettabytes (trillion gigabytes) by the end of this decade. Such exponentially growing datasets are known as Big Data. Big Data is generated by a number of sources, such as social networking and media, mobile devices, internet transactions, and networked devices and sensors. Data mining is the process of extracting interesting, non-trivial, implicit, previously unknown and potentially useful patterns or knowledge from huge amounts of data [9]. Traditional mining algorithms are not applicable to Big Data because they do not scale. In many of these applications the data is extremely large, and hence there is ample opportunity to exploit parallelism in its management and analysis. Earlier methods of dealing with massive data relied on parallel processing across a setup of multiple nodes or processors. With the advent of the Internet, distributed processing using the power of multiple servers located across the network became popular. This led to the development of software frameworks for the analysis and management of massive datasets. These frameworks use the concept of a distributed file system, in which both the data and the computations on it can be distributed across a large collection of processors. In this paper, we propose a method for dealing with large data sets using a distributed file system and its related distributed processing: Apache Hadoop and the Hadoop Distributed File System (HDFS). Hadoop is a software framework that supports data-intensive distributed applications and enables applications to work with thousands of nodes and petabytes of data.

Hadoop MapReduce [1] is a software framework for the distributed processing of large data sets on compute clusters. It enables most common computations on large-scale data to be performed efficiently on large collections of machines, while tolerating hardware failures during computation. We include in this paper a case study of a mining application that processes a large data set (an email log) using the Apache Hadoop framework to preprocess the data and convert it into a form acceptable as input to traditional mining algorithms.
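The MapReduce model described above can be illustrated with a small sketch. This is not the paper's implementation: the log format, the field layout, and the counting task (emails per sender) are all hypothetical, and the map, shuffle, and reduce phases that Hadoop distributes across a cluster are simulated here in plain Python on a handful of in-memory lines.

```python
from collections import defaultdict

# Hypothetical email-log records: date, sender, recipient.
# The actual log schema used in the paper's case study is not shown here.
log_lines = [
    "2001-05-10 alice@example.com bob@example.com",
    "2001-05-11 alice@example.com carol@example.com",
    "2001-05-12 bob@example.com alice@example.com",
]

def map_phase(line):
    """Map: emit a (sender, 1) pair for each log record."""
    _, sender, _ = line.split()
    yield (sender, 1)

def shuffle(pairs):
    """Shuffle: group all intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts emitted for one sender."""
    return (key, sum(values))

intermediate = [pair for line in log_lines for pair in map_phase(line)]
results = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(results)  # {'alice@example.com': 2, 'bob@example.com': 1}
```

In Hadoop itself, the map and reduce functions run as tasks on many nodes, the shuffle is performed by the framework between the two phases, and a failed task is simply re-executed on another node — which is what makes the computation tolerant of hardware failures.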

References
  1. Tom White. 2010. Hadoop: The Definitive Guide, 2nd Edition. O'Reilly.
  2. Brian F. Cooper, Eric Baldeschwieler, Rodrigo Fonseca, James J. Kistler, P. P. S. Narayan, Chuck Neerdaels, Toby Negrin, Raghu Ramakrishnan, Adam Silberstein, Utkarsh Srivastava, Raymie Stata. 2009. Building a Cloud for Yahoo! IEEE.
  3. Jitesh Shetty and Jafar Adibi. The Enron Email Dataset: Database Schema and Brief Statistical Report.
  4. Dhruba Borthakur. 2009. Hadoop Architecture and Its Usage at Facebook. Microsoft Research, Seattle.
  5. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker. 2009. A Comparison of Approaches to Large-Scale Data Analysis. ACM.
  6. Siyang Dai, Jinxiong Tan, Zhi Zhang, Zeyang Yu, Shuai Yuan. MR Language Reference Manual.
  7. Jimmy Lin and Chris Dyer. 2010. Data-Intensive Text Processing with MapReduce.
  8. Dell | Cloudera Solution for Apache Hadoop Deployment Guide. www.dell.com.
  9. Jiawei Han and Micheline Kamber. 2006. Data Mining: Concepts and Techniques. Morgan Kaufmann.
Index Terms

Computer Science
Information Sciences

Keywords

Hadoop, HDFS, MapReduce, Dendrograms, Clusters, Web Services, Cloud