Research Article

Multimodal Gesture Recognition using CNN-GCN-LSTM with RGB, Depth, and Skeleton Data

by Md. Asraful Islam Khan, Syful Islam
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 15
Year of Publication: 2025
Authors: Md. Asraful Islam Khan, Syful Islam
10.5120/ijca2025925191

Md. Asraful Islam Khan, Syful Islam. Multimodal Gesture Recognition using CNN-GCN-LSTM with RGB, Depth, and Skeleton Data. International Journal of Computer Applications. 187, 15 (Jun 2025), 19-26. DOI=10.5120/ijca2025925191

@article{ 10.5120/ijca2025925191,
author = { Md. Asraful Islam Khan and Syful Islam },
title = { Multimodal Gesture Recognition using CNN-GCN-LSTM with RGB, Depth, and Skeleton Data },
journal = { International Journal of Computer Applications },
issue_date = { Jun 2025 },
volume = { 187 },
number = { 15 },
month = { Jun },
year = { 2025 },
issn = { 0975-8887 },
pages = { 19-26 },
numpages = { 8 },
url = { https://ijcaonline.org/archives/volume187/number15/multimodal-gesture-recognition-using-cnn-gcn-lstm-with-rgb-depthand-skeleton-data/ },
doi = { 10.5120/ijca2025925191 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
Abstract

Recognizing hand gestures is essential to human-computer interaction because it enables natural, intuitive interaction in virtual reality, robotics, and assistive technologies. In this work, we propose a novel multimodal fusion framework that integrates RGB images, depth information, and skeleton-based GCN features to enhance gesture recognition under realistic, noisy data conditions. Our architecture leverages MobileNetV3-Small-based CNN backbones for visual feature extraction, GCNs for modeling skeletal relationships, and LSTM-attention modules for capturing temporal dynamics. Unlike previous works that rely on large curated datasets, our approach is evaluated on a challenging low-sample, high-noise dataset derived from real-world video recordings. Through systematic ablation studies, we demonstrate that incorporating depth and skeleton features incrementally improves performance, validating the strength of our fusion strategy. Despite operating under small and noisy data regimes, our model achieves meaningful accuracy, and our analysis provides insights into modality-specific failure cases. The proposed system paves the way for robust gesture recognition solutions deployable in real-world environments with minimal data preprocessing.
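
To make the three-stream design concrete, the following is a minimal, hypothetical PyTorch sketch: MobileNetV3-Small backbones for the RGB and depth clips, a two-layer GCN over the skeleton joints, and an LSTM with additive attention over time, fused by concatenation. All layer sizes, the number of joints, the identity-matrix adjacency placeholder, and concatenation-based fusion are illustrative assumptions; the paper's exact configuration may differ.

# A minimal sketch of the CNN-GCN-LSTM fusion described in the abstract.
# Dimensions, the fusion-by-concatenation choice, and the GestureFusionNet
# name are illustrative assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small

class GCNLayer(nn.Module):
    """One graph convolution: X' = ReLU(A_hat @ X @ W) over skeleton joints."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, a_hat):          # x: (B, T, J, C), a_hat: (J, J)
        return torch.relu(self.linear(torch.einsum("ij,btjc->btic", a_hat, x)))

class GestureFusionNet(nn.Module):
    def __init__(self, num_classes, num_joints=21, hidden=256):
        super().__init__()
        # Shared-architecture (not shared-weight) MobileNetV3-Small backbones.
        self.rgb_cnn = mobilenet_v3_small(weights=None).features
        self.depth_cnn = mobilenet_v3_small(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gcn1 = GCNLayer(3, 64)       # each joint carries (x, y, z)
        self.gcn2 = GCNLayer(64, 128)
        fused_dim = 576 + 576 + 128       # MobileNetV3-Small emits 576 channels
        self.lstm = nn.LSTM(fused_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)  # additive temporal attention scores
        self.head = nn.Linear(hidden, num_classes)

    def _frames(self, cnn, clip):         # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        f = self.pool(cnn(clip.flatten(0, 1))).flatten(1)   # (B*T, 576)
        return f.view(b, t, -1)

    def forward(self, rgb, depth, skel, a_hat):
        v = self._frames(self.rgb_cnn, rgb)                        # (B, T, 576)
        d = self._frames(self.depth_cnn, depth)                    # (B, T, 576)
        s = self.gcn2(self.gcn1(skel, a_hat), a_hat).mean(dim=2)   # pool joints
        h, _ = self.lstm(torch.cat([v, d, s], dim=-1))             # (B, T, H)
        w = torch.softmax(self.attn(h), dim=1)                     # (B, T, 1)
        return self.head((w * h).sum(dim=1))                       # (B, classes)

# Smoke test with hypothetical shapes: 2 clips, 8 frames, 21 joints.
if __name__ == "__main__":
    net = GestureFusionNet(num_classes=10)
    rgb = torch.randn(2, 8, 3, 96, 96)
    depth = torch.randn(2, 8, 3, 96, 96)   # depth map replicated to 3 channels
    skel = torch.randn(2, 8, 21, 3)
    a_hat = torch.eye(21)                  # normalized adjacency placeholder
    print(net(rgb, depth, skel, a_hat).shape)  # torch.Size([2, 10])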

Index Terms

Computer Science
Information Sciences

Keywords

Hand Gesture Recognition, Multimodal Fusion, GCN, LSTM, Depth, Skeleton, CNN