Research Article

Spotting Outliers in Large Distributed Datasets using Cell Density based Approach

by A. Rama Satish, P. Bala Krishna Prasad
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 122 - Number 8
Year of Publication: 2015
Authors: A. Rama Satish, P. Bala Krishna Prasad
10.5120/21717-4858

A. Rama Satish, P. Bala Krishna Prasad. Spotting Outliers in Large Distributed Datasets using Cell Density based Approach. International Journal of Computer Applications. 122, 8 (July 2015), 1-7. DOI=10.5120/21717-4858

@article{ 10.5120/21717-4858,
author = { A. Rama Satish and P. Bala Krishna Prasad },
title = { Spotting Outliers in Large Distributed Datasets using Cell Density based Approach },
journal = { International Journal of Computer Applications },
issue_date = { July 2015 },
volume = { 122 },
number = { 8 },
month = { July },
year = { 2015 },
issn = { 0975-8887 },
pages = { 1-7 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume122/number8/21717-4858/ },
doi = { 10.5120/21717-4858 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:09:59.846262+05:30
%A A. Rama Satish
%A P. Bala Krishna Prasad
%T Spotting Outliers in Large Distributed Datasets using Cell Density based Approach
%J International Journal of Computer Applications
%@ 0975-8887
%V 122
%N 8
%P 1-7
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Outliers are abnormal instances or observations, and detecting them is an important task in knowledge discovery in databases (KDD). Outlier detection has been studied in many research areas, including large distributed systems, data mining, wireless sensor networks (WSN), health monitoring, environmental science, and statistics. Density-based (DB) outlier detection techniques are robust in detecting outliers. In many applications, large volumes of distributed data are generated every day. Finding deviating observations across a large distributed database, rather than within any individual database, is not a simple task. Integrating distributed databases causes two major problems. First, massive amounts of data must be brought together from the different databases. Second, data integration may violate data security and leak sensitive information. In this work we propose a cell density based mechanism for outlier detection (CDOD) in large distributed databases. A centralized detection paradigm is used, which avoids both the expensive data integration and the information leakage. The experimental results show that the approach is robust for finding outliers across large numbers of databases, instances, and attributes.
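
The abstract does not spell out the CDOD algorithm, so the following is only a minimal sketch of the general idea of cell-density-based detection over distributed data, not the authors' method. It assumes a uniform grid with a fixed cell width, a simple count threshold for "sparse" cells, and two hypothetical sites; the names `WIDTH`, `MIN_COUNT`, `site_a`, and `site_b` are illustrative. Each site shares only per-cell counts with a central coordinator, so no raw records are integrated.

```python
from collections import Counter
from math import floor


def cell_of(point, width):
    """Map a point to the index tuple of the grid cell that contains it."""
    return tuple(floor(x / width) for x in point)


def local_cell_counts(points, width):
    """Summarise one site's data as per-cell counts (raw rows stay on site)."""
    return Counter(cell_of(p, width) for p in points)


def merge_counts(site_counts):
    """Coordinator step: add up per-site summaries into global cell densities."""
    total = Counter()
    for counts in site_counts:
        total.update(counts)
    return total


def sparse_cells(global_counts, min_count):
    """Cells whose global count is below the assumed density threshold."""
    return {cell for cell, n in global_counts.items() if n < min_count}


def local_outliers(points, width, sparse):
    """Points at one site that fall into globally sparse cells."""
    return [p for p in points if cell_of(p, width) in sparse]


# Two hypothetical sites holding 2-D records (illustrative data only).
site_a = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (5.5, 5.5)]
site_b = [(0.4, 0.3), (0.15, 0.35), (9.2, 0.1)]
WIDTH, MIN_COUNT = 1.0, 2  # assumed cell width and density threshold

summaries = [local_cell_counts(s, WIDTH) for s in (site_a, site_b)]
sparse = sparse_cells(merge_counts(summaries), MIN_COUNT)
outliers_a = local_outliers(site_a, WIDTH, sparse)
outliers_b = local_outliers(site_b, WIDTH, sparse)
print(outliers_a, outliers_b)
```

In this sketch the coordinator sees only cell indices and counts, which is one way the cost of shipping raw data and the exposure of sensitive values could both be reduced. A fixed grid is a simplifying assumption, and the paper may choose cells and thresholds differently.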

Index Terms

Computer Science
Information Sciences

Keywords

Data Mining, KDD, Large distributed databases, Density based outlier detection