Research Article

A Novel Approach to Translate Structural Aggregation Queries to MapReduce Code

by Ahmed M. Abdelmoniem, Sameh Abdulah, Walid Atwa
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 33
Year of Publication: 2024
Authors: Ahmed M. Abdelmoniem, Sameh Abdulah, Walid Atwa
10.5120/ijca2024923879

Ahmed M. Abdelmoniem, Sameh Abdulah, Walid Atwa. A Novel Approach to Translate Structural Aggregation Queries to MapReduce Code. International Journal of Computer Applications. 186, 33 (Aug 2024), 1-10. DOI=10.5120/ijca2024923879

@article{ 10.5120/ijca2024923879,
author = { Ahmed M. Abdelmoniem and Sameh Abdulah and Walid Atwa },
title = { A Novel Approach to Translate Structural Aggregation Queries to MapReduce Code },
journal = { International Journal of Computer Applications },
issue_date = { Aug 2024 },
volume = { 186 },
number = { 33 },
month = { Aug },
year = { 2024 },
issn = { 0975-8887 },
pages = { 1-10 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume186/number33/a-novel-approach-to-translate-structural-aggregation-queries-to-mapreduce-code/ },
doi = { 10.5120/ijca2024923879 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-08-11T02:24:58.532626+05:30
%A Ahmed M. Abdelmoniem
%A Sameh Abdulah
%A Walid Atwa
%T A Novel Approach to Translate Structural Aggregation Queries to MapReduce Code
%J International Journal of Computer Applications
%@ 0975-8887
%V 186
%N 33
%P 1-10
%D 2024
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Data management applications are growing in number and scale and demand more attention, especially in the “big data” era. Supporting them with novel, efficient algorithms that deliver higher performance is therefore critical. Array database management systems support these applications by operating on data represented in n-dimensional structures. For instance, systems such as SciDB and RasDaMan can deliver the required performance on large-scale problems with multidimensional data. Like their relational counterparts, these systems expose dedicated array query languages as the user interface. MapReduce, a popular programming model, enables large-scale data analysis, facilitates query processing, and is used as a database engine. Nevertheless, one major obstacle is the low productivity of developing MapReduce applications. Unlike queries in high-level declarative languages such as SQL, MapReduce jobs are written in a low-level language, often requiring substantial programming effort and complicated debugging. This work presents a system that translates array queries expressed in SciDB's Array Query Language (AQL) into MapReduce jobs. We focus on translating several structural aggregations: circular, grid, hierarchical, and sliding. Unlike traditional aggregations in relational databases, these structural aggregations are designed explicitly for array manipulation. Our work can thus be considered an array-oriented counterpart of existing SQL-to-MapReduce translators such as HiveQL and YSmart. The translator supports structural aggregations over arrays to cover a variety of array manipulations, and it also supports user-defined aggregation functions with minimal user effort. We also show that the translator can generate optimized MapReduce code that outperforms short handwritten code by up to 10.84x.
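
As an illustration of what a structural aggregation looks like when phrased as map and reduce steps, the following is a minimal single-process sketch of a grid aggregation over a 2-D array. It is not the translator described in the abstract and does not reproduce its generated code; the function names `map_cell`, `reduce_tile`, and `grid_aggregate`, the tile-size parameters, and the sample array are all hypothetical, and a dictionary stands in for the shuffle phase of a real MapReduce runtime.

```python
from collections import defaultdict


def map_cell(i, j, value, tile_h, tile_w):
    """Map step (assumed form): key each cell by the grid tile containing it."""
    return (i // tile_h, j // tile_w), value


def reduce_tile(values, agg=sum):
    """Reduce step: fold all values that share one tile key."""
    return agg(values)


def grid_aggregate(array, tile_h, tile_w, agg=sum):
    """Aggregate a 2-D list over non-overlapping tile_h x tile_w tiles."""
    groups = defaultdict(list)  # stands in for the shuffle/group-by-key phase
    for i, row in enumerate(array):
        for j, value in enumerate(row):
            key, emitted = map_cell(i, j, value, tile_h, tile_w)
            groups[key].append(emitted)
    return {key: reduce_tile(vals, agg) for key, vals in groups.items()}


# Hypothetical 4 x 4 input, aggregated over 2 x 2 tiles.
array = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
result = grid_aggregate(array, 2, 2)
# result == {(0, 0): 14, (0, 1): 22, (1, 0): 46, (1, 1): 54}
```

Passing a different callable as `agg` (for example `max`) plays the role of a user-defined aggregation function in this sketch. A sliding aggregation would differ mainly in the map step, which would emit each cell under every window key that covers it.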

Index Terms

Computer Science
Information Sciences
Big data
MapReduce
Database Management Systems (DBMS)

Keywords

Array Query Language (AQL), Data Management Applications, MapReduce, Multidimensional Data, SQL-to-MapReduce