International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 25
Year of Publication: 2025
Authors: Dimitrios Papakyriakou, Ioannis S. Barbounakis
Dimitrios Papakyriakou, Ioannis S. Barbounakis. Performance Analysis of Raspberry Pi 4B (8GB) Beowulf Cluster: HPCG Benchmarking. International Journal of Computer Applications. 187, 25 (Jul 2025), 49-64. DOI=10.5120/ijca2025925449
The High-Performance Conjugate Gradient (HPCG) benchmark has emerged as a complementary metric to the High-Performance LINPACK (HPL) [1], aiming to evaluate real-world high-performance computing (HPC) workloads that stress memory access patterns, cache behavior, and sparse matrix operations. Unlike HPL, which reflects peak floating-point capability, HPCG simulates practical scientific computations involving iterative solvers and irregular memory access, offering a more realistic performance indicator. This study investigates the implementation and analysis of the HPCG benchmark on a 24-node Beowulf cluster built from Raspberry Pi 4B devices, each equipped with 8GB of LPDDR4 RAM and an ARM Cortex-A72 processor. Both strong scaling (fixed problem size with increasing node count) and weak scaling (problem size increased in proportion to node count) methodologies were applied to assess system performance across various configurations. Metrics such as median execution time, floating-point throughput (GFLOP/s), and memory bandwidth (GB/s) were collected and analyzed. The results reveal that HPCG performance on this ARM-based cluster is primarily constrained by memory bandwidth saturation, the lack of hardware-level floating-point acceleration, and network communication bottlenecks. Strong scaling experiments show minimal performance gains beyond 4–8 nodes, while weak scaling maintains computational stability up to moderate cluster sizes. Notably, the absence of measurable MPI communication overhead (ExchangeHalo time) reflects the limited halo data exchange under small subdomain decompositions and short runtimes. This study highlights both the limitations and the potential of energy-efficient, low-cost single-board clusters for realistic HPC workloads.
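The strong- and weak-scaling definitions used above can be expressed as simple efficiency formulas. The sketch below illustrates them in Python; the timings are hypothetical placeholders for illustration only, not measurements from this study:

```python
# Illustrative strong- vs. weak-scaling efficiency calculations.
# All timing values below are hypothetical placeholders, NOT data
# from the Raspberry Pi cluster experiments.

def strong_scaling_efficiency(t1: float, tn: float, n: int) -> float:
    """Fixed total problem size: speedup = t1 / tn, efficiency = speedup / n."""
    return (t1 / tn) / n

def weak_scaling_efficiency(t1: float, tn: float) -> float:
    """Per-node problem size fixed: ideal runtime stays flat, so efficiency = t1 / tn."""
    return t1 / tn

if __name__ == "__main__":
    # Placeholder wall-clock times (seconds) for 1 node and 8 nodes.
    t1, t8_strong, t8_weak = 120.0, 22.0, 135.0
    print(f"strong efficiency at 8 nodes: {strong_scaling_efficiency(t1, t8_strong, 8):.2f}")
    print(f"weak efficiency at 8 nodes:   {weak_scaling_efficiency(t1, t8_weak):.2f}")
```

An efficiency near 1.0 indicates near-ideal scaling; the paper's observation of minimal strong-scaling gains beyond 4–8 nodes corresponds to this metric dropping well below 1.0 as nodes are added.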
The findings provide a methodological basis for benchmarking sparse solvers on ARM systems and inform future efforts in optimizing parallelism, memory access, and interconnect efficiency in edge computing, education, and embedded HPC environments.