Research Article

Deep Learning for Edge AI: MobileNetV2 CNN Training over ARM-based Clusters

by Dimitrios Papakyriakou, Ioannis S. Barbounakis
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 41
Year of Publication: 2025
Authors: Dimitrios Papakyriakou, Ioannis S. Barbounakis
10.5120/ijca2025925727

Dimitrios Papakyriakou, Ioannis S. Barbounakis. Deep Learning for Edge AI: MobileNetV2 CNN Training over ARM-based Clusters. International Journal of Computer Applications. 187, 41 (Sep 2025), 43-57. DOI=10.5120/ijca2025925727

@article{ 10.5120/ijca2025925727,
author = { Dimitrios Papakyriakou, Ioannis S. Barbounakis },
title = { Deep Learning for Edge AI: MobileNetV2 CNN Training over ARM-based Clusters },
journal = { International Journal of Computer Applications },
issue_date = { Sep 2025 },
volume = { 187 },
number = { 41 },
month = { Sep },
year = { 2025 },
issn = { 0975-8887 },
pages = { 43-57 },
numpages = { 15 },
url = { https://ijcaonline.org/archives/volume187/number41/deep-learning-for-edge-ai-mobilenetv2-cnn-training-over-arm-based-clusters/ },
doi = { 10.5120/ijca2025925727 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Dimitrios Papakyriakou
%A Ioannis S. Barbounakis
%T Deep Learning for Edge AI: MobileNetV2 CNN Training over ARM-based Clusters
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 41
%P 43-57
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper presents a comprehensive investigation into the strong-scaling performance of distributed training for Convolutional Neural Networks (CNNs) using the MobileNetV2 architecture on a resource-constrained Beowulf cluster composed of 24 Raspberry Pi 4B nodes (8 GB RAM each). The training system employs the Message Passing Interface (MPI) via MPICH with synchronous data parallelism, running two processes per node across 2 to 48 total MPI processes. A fixed CIFAR-10 dataset was used, and all experiments were standardized to 10 epochs to maintain memory stability. The study jointly evaluates execution-time scaling, training/test accuracy, and convergence loss to assess both computational performance and learning quality under increasing parallelism. Training time decreased nearly ten-fold at cluster scale, reaching a maximum speedup of ≈9.99× with ≈41.6% parallel efficiency at 48 processes. Efficiency remained very high at small scales (≈90.9% at np=4) and moderate at np=8 (≈52.3%), confirming that MPI scaling itself is effective up to this range. However, while single-node and small-scale runs (up to 4–8 MPI processes) preserved strong generalization ability, larger scales suffered from sharply reduced per-rank dataset sizes, causing gradient noise and an eventual collapse of test accuracy to the random-guess baseline (10%). These results demonstrate that, although ARM-based Raspberry Pi clusters can support feasible small-scale distributed deep learning, strong scaling beyond an optimal process count leads to “fast but wrong” training in which wall-clock performance improves but model utility on unseen data is lost. This work provides the first detailed end-to-end evaluation of MPI-based synchronous CNN training across an ARM-based edge cluster, and outlines future research including comparative scaling with SqueezeNet and exploration of ultra-low-power Spiking Neural Networks (SNNs) for neuromorphic edge learning.
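
For context on the scaling figures quoted above, the sketch below applies the standard strong-scaling definitions, speedup S(p) = T(baseline)/T(p) and parallel efficiency E(p) = S(p)/p. The abstract does not state whether the baseline run uses a single process or a single node, so the timing values and the per-node reading of the ≈41.6% figure are assumptions made purely for illustration.

# Hypothetical illustration of the strong-scaling metrics quoted in the abstract.
# The timing values are invented; only the formulas are standard.

def speedup(t_baseline, t_parallel):
    # S = T(baseline) / T(p)
    return t_baseline / t_parallel

def efficiency(t_baseline, t_parallel, p):
    # E = S / p, where p is the number of workers the speedup is charged to
    return speedup(t_baseline, t_parallel) / p

# Made-up baseline and cluster-scale timings (seconds per epoch):
t_base, t_48 = 1000.0, 100.1
print(f"speedup ≈ {speedup(t_base, t_48):.2f}x")                      # ≈ 9.99x
print(f"efficiency over 24 nodes ≈ {efficiency(t_base, t_48, 24):.1%}")  # ≈ 41.6%

Note that the reported ≈41.6% matches E(p) only when p counts the 24 nodes rather than the 48 MPI processes; that interpretation is an inference here, not something the abstract confirms.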
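
The abstract describes MPI-based synchronous data parallelism: every rank trains on its own data shard, and gradients are averaged with an allreduce before each weight update, so all ranks stay in lockstep. Below is a minimal mpi4py sketch of that pattern under stated assumptions; the toy linear model and synthetic shards stand in for MobileNetV2 and CIFAR-10, and none of the names come from the paper's actual code.

# Minimal sketch of synchronous data-parallel SGD with mpi4py (hypothetical).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank works on its own shard of (synthetic) training data.
rng = np.random.default_rng(seed=rank)
X = rng.standard_normal((1000, 32))                  # per-rank shard
y = X @ np.arange(32, dtype=float) + rng.standard_normal(1000)

w = np.zeros(32)
comm.Bcast(w, root=0)                                # start from identical weights

lr = 1e-3
for step in range(100):
    # Local gradient of mean-squared error on this rank's shard.
    grad_local = 2.0 * X.T @ (X @ w - y) / len(y)

    # Synchronous step: average gradients across all ranks before updating.
    grad_global = np.empty_like(grad_local)
    comm.Allreduce(grad_local, grad_global, op=MPI.SUM)
    grad_global /= size

    w -= lr * grad_global                            # identical update on every rank

if rank == 0:
    print("final weights (first 5):", w[:5])

Launched with, for example, mpiexec -n 4 python sync_sgd_sketch.py under MPICH, every rank finishes with identical weights because each applies the same averaged gradient; growing the process count shrinks each rank's effective shard, which is the mechanism the abstract identifies as the source of gradient noise at large scale.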

Index Terms

Computer Science
Information Sciences

Keywords

Convolutional Neural Networks (CNNs), Distributed Deep Learning, Beowulf Cluster, ARM Architecture, Raspberry Pi Cluster, Parallel Computing, Message Passing Interface (MPI), MPICH, Low-Cost Clusters, Distributed Systems, HPC, MobileNet, Edge Computing, Parallel Efficiency, Edge Deep Learning