Research Article

An Approach to mining massive Data

Published on May 2012 by Reena Bharathi, Nitin N Keswani, Siddesh D Shinde
National Conference on Recent Trends in Computing
Foundation of Computer Science USA
NCRTC - Number 4
May 2012

Reena Bharathi, Nitin N Keswani, Siddesh D Shinde . An Approach to mining massive Data. National Conference on Recent Trends in Computing. NCRTC, 4 (May 2012), 32-36.

@article{
author = { Reena Bharathi, Nitin N Keswani, Siddesh D Shinde },
title = { An Approach to mining massive Data },
journal = { National Conference on Recent Trends in Computing },
issue_date = { May 2012 },
volume = { NCRTC },
number = { 4 },
month = { May },
year = { 2012 },
issn = { 0975-8887 },
pages = { 32-36 },
numpages = 5,
url = { /proceedings/ncrtc/number4/6542-1032/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Proceeding Article
%1 National Conference on Recent Trends in Computing
%A Reena Bharathi
%A Nitin N Keswani
%A Siddesh D Shinde
%T An Approach to mining massive Data
%J National Conference on Recent Trends in Computing
%@ 0975-8887
%V NCRTC
%N 4
%P 32-36
%D 2012
%I International Journal of Computer Applications
Abstract

Modern internet and scientific applications have created a need to manage immense amounts of data quickly. According to one study, the amount of information created and replicated is forecast to reach 35 zettabytes (trillion gigabytes) by the end of this decade. Such exponentially growing datasets are known as Big Data. Big Data is generated by a number of sources, such as social networking and media, mobile devices, internet transactions, and networked devices and sensors. Data mining is the process of extracting interesting, non-trivial, implicit, previously unknown and potentially useful patterns or knowledge from huge amounts of data [9]. Traditional mining algorithms are not applicable to Big Data because they do not scale. In many of these applications the data is extremely large, and hence there is ample opportunity to exploit parallelism in its management and analysis. Earlier methods of dealing with massive data relied on parallel processing across a setup of multiple nodes or processors. With the advent of the Internet, distributed processing using the power of multiple servers located across the network became popular. This led to the development of software frameworks for the analysis and management of massive datasets. These frameworks use the concept of a distributed file system, in which both the data and the computations on it can be distributed across a large collection of processors. In this paper, we propose a method for dealing with large data sets using a distributed file system and its related distributed processing: Apache Hadoop and the Hadoop Distributed File System (HDFS). Hadoop is a software framework that supports data-intensive distributed applications and enables applications to work with thousands of nodes and petabytes of data.

Hadoop MapReduce [1] is a software framework for the distributed processing of large data sets on compute clusters. It enables most common computations on large-scale data to be performed efficiently on large collections of machines, while tolerating hardware failures during computation. We include in this paper a case study of a mining application that processes a large data set (an email log) using the Apache Hadoop framework to preprocess the data and convert it into a form acceptable as input to traditional mining algorithms.
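The MapReduce model described above can be illustrated with a small sketch. This is not the paper's implementation: the log format, the field layout, and the counting task (emails per sender) are all hypothetical, and the map, shuffle, and reduce phases that Hadoop distributes across a cluster are simulated here in plain Python on a handful of in-memory lines.

```python
from collections import defaultdict

# Hypothetical email-log records: date, sender, recipient.
# The actual log schema used in the paper's case study is not shown here.
log_lines = [
    "2001-05-10 alice@example.com bob@example.com",
    "2001-05-11 alice@example.com carol@example.com",
    "2001-05-12 bob@example.com alice@example.com",
]

def map_phase(line):
    """Map: emit a (sender, 1) pair for each log record."""
    _, sender, _ = line.split()
    yield (sender, 1)

def shuffle(pairs):
    """Shuffle: group all intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts emitted for one sender."""
    return (key, sum(values))

intermediate = [pair for line in log_lines for pair in map_phase(line)]
results = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(results)  # {'alice@example.com': 2, 'bob@example.com': 1}
```

In Hadoop itself, the map and reduce functions run as tasks on many nodes, the shuffle is performed by the framework between the two phases, and a failed task is simply re-executed on another node — which is what makes the computation tolerant of hardware failures.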

References
  1. Tom White. 2010. Hadoop: The Definitive Guide, 2nd Edition. O'Reilly.
  2. Brian F. Cooper, Eric Baldeschwieler, Rodrigo Fonseca, James J. Kistler, P. P. S. Narayan, Chuck Neerdaels, Toby Negrin, Raghu Ramakrishnan, Adam Silberstein, Utkarsh Srivastava, Raymie Stata. 2009. Building a Cloud for Yahoo! IEEE.
  3. Jitesh Shetty and Jafar Adibi. The Enron Email Dataset: Database Schema and Brief Statistical Report.
  4. Dhruba Borthakur. 2009. Hadoop Architecture and Its Usage at Facebook. Microsoft Research, Seattle.
  5. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker. 2009. A Comparison of Approaches to Large-Scale Data Analysis. ACM.
  6. Siyang Dai, Jinxiong Tan, Zhi Zhang, Zeyang Yu, Shuai Yuan. MR Language Reference Manual.
  7. Jimmy Lin and Chris Dyer. 2010. Data-Intensive Text Processing with MapReduce.
  8. Dell | Cloudera Solution for Apache Hadoop Deployment Guide. www.dell.com.
  9. Jiawei Han and Micheline Kamber. 2006. Data Mining: Concepts and Techniques. Morgan Kaufmann.
Index Terms

Computer Science
Information Sciences

Keywords

Hadoop, HDFS, MapReduce, Dendrograms, Clusters, Web Services, Cloud