
An Approach to Mining Massive Data

IJCA Proceedings on National Conference on Recent Trends in Computing
© 2012 by IJCA Journal
NCRTC - Number 4
Year of Publication: 2012
Reena Bharathi
Nitin N Keswani
Siddesh D Shinde

Reena Bharathi, Nitin N Keswani and Siddesh D Shinde. An Approach to Mining Massive Data. IJCA Proceedings on National Conference on Recent Trends in Computing NCRTC(4):32-36, May 2012.

@article{bharathi2012mining,
	author = {Reena Bharathi and Nitin N Keswani and Siddesh D Shinde},
	title = {An Approach to Mining Massive Data},
	journal = {IJCA Proceedings on National Conference on Recent Trends in Computing},
	year = {2012},
	volume = {NCRTC},
	number = {4},
	pages = {32-36},
	month = {May},
	note = {Full text available}
}


Modern internet and scientific applications have created a need to manage immense amounts of data quickly. According to one study, the amount of information created and replicated is forecast to reach 35 zettabytes (trillion gigabytes) by the end of this decade. Such exponentially growing datasets are known as Big Data. Big Data is generated by a number of sources such as social networking and media, mobile devices, internet transactions, and networked devices and sensors.

Data mining is the process of extracting interesting, non-trivial, implicit, previously unknown and potentially useful patterns or knowledge from huge amounts of data [9]. Traditional mining algorithms are not applicable to Big Data because they do not scale. In many of these applications the data is extremely large, so there is ample opportunity to exploit parallelism in its management and analysis. Earlier methods of dealing with massive data relied on parallel processing with a setup of multiple nodes or processors. With the advent of the Internet, distributed processing using the power of multiple servers located on the internet became popular. This led to the development of software frameworks for the analysis and management of massive datasets. These frameworks use the concept of a distributed file system, in which both the data and the computations on it can be distributed across a large collection of processors.

In this paper, we propose a method for dealing with large data sets using the concept of distributed file systems and the related distributed processing framework, Apache Hadoop, with its Hadoop Distributed File System (HDFS). Hadoop is a software framework that supports data-intensive distributed applications and enables applications to work with thousands of nodes and petabytes of data.
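To make the programming model concrete, the following is a minimal, single-process sketch of the map/shuffle/reduce flow that Hadoop MapReduce distributes across a cluster. It is an illustration of the model only, not actual Hadoop API code; the word-count task and all function names are chosen for the example.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) for every word in one input record,
    # analogous to a Mapper's map() calls.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group intermediate values by key, as the framework's
    # shuffle/sort step does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one key, analogous to a Reducer's reduce().
    return (key, sum(values))

records = ["big data needs big clusters", "data mining on big data"]
intermediate = [pair for r in records for pair in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts["big"])  # "big" occurs three times across the two records
```

In a real Hadoop job, the records would be blocks of an HDFS file, the map and reduce functions would run on different cluster nodes, and the framework would handle the shuffle and any node failures.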
Hadoop MapReduce [1] is a software framework for distributed processing of large data sets on compute clusters. It enables most common computations on large-scale data to be performed efficiently on large collections of computers, while tolerating hardware failures during computation. We include in this paper a case study of a mining application that mines a large data set (an email log), using the Apache Hadoop framework to preprocess the data and convert it into a form acceptable as input to traditional mining algorithms.
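The preprocessing step in such a case study can be sketched as a map step that extracts a field from each raw log line and a reduce step that aggregates per key, emitting a flat table for a conventional mining algorithm. The log format below is hypothetical (simple "From:" header lines); the actual email log schema used in the paper may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical raw log lines; real email logs carry more fields.
log_lines = [
    "From: alice@example.com",
    "Subject: quarterly report",
    "From: bob@example.com",
    "From: alice@example.com",
]

def map_senders(lines):
    # Map step: emit (sender, 1) for each message header line.
    for line in lines:
        if line.startswith("From: "):
            yield line[len("From: "):].strip(), 1

# Reduce step: aggregate message counts per sender.
counts = Counter()
for sender, one in map_senders(log_lines):
    counts[sender] += one

# Emit a flat CSV table acceptable to a traditional mining tool.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["sender", "messages"])
for sender, n in sorted(counts.items()):
    writer.writerow([sender, n])
csv_table = buf.getvalue().strip()
print(csv_table)
```

On a cluster, the map step would run in parallel over HDFS blocks of the log, and the aggregation would be a reduce phase; the final CSV is what a traditional, non-distributed mining algorithm would then consume.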


  • Tom White, Hadoop: The Definitive Guide, 2nd Edition, O'Reilly, 2010.
  • Brian F. Cooper, Eric Baldeschwieler, Rodrigo Fonseca, James J. Kistler, P. P. S. Narayan, Chuck Neerdaels, Toby Negrin, Raghu Ramakrishnan, Adam Silberstein, Utkarsh Srivastava, Raymie Stata, Building a Cloud for Yahoo!, IEEE, 2009.
  • Jitesh Shetty, Jafar Adibi, The Enron Email Dataset Database Schema and Brief Statistical Report.
  • Dhruba Borthakur, Hadoop Architecture and Its Usage at Facebook, Microsoft Research Seattle, 2009.
  • Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker, A Comparison of Approaches to Large-Scale Data Analysis, ACM, 2009.
  • Siyang Dai, Jinxiong Tan, Zhi Zhang, Zeyang Yu, Shuai Yuan, MR Language Reference Manual.
  • Jimmy Lin, Chris Dyer, Data-Intensive Text Processing with MapReduce, 2010.
  • Dell | Cloudera Solution for Apache Hadoop Deployment Guide, www.dell.com
  • Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.