International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 15
Year of Publication: 2025
Authors: Md. Asraful Islam Khan, Syful Islam
Md. Asraful Islam Khan, Syful Islam. Multimodal Gesture Recognition using CNN-GCN-LSTM with RGB, Depth, and Skeleton Data. International Journal of Computer Applications. 187, 15 (Jun 2025), 19-26. DOI=10.5120/ijca2025925191
Recognizing hand gestures is essential to human-computer interaction because it enables natural, intuitive control in virtual reality, robotics, and assistive technologies. In this work, we propose a novel multimodal fusion architecture that integrates RGB images, depth information, and skeleton-based GCN features to improve gesture recognition under realistic, noisy data conditions. Our architecture leverages MobileNetV3Small-based CNN backbones for visual feature extraction, GCNs for modeling skeletal joint relationships, and LSTM-attention modules for capturing temporal dynamics. Unlike previous works that rely on large curated datasets, our approach is evaluated on a challenging low-sample, high-noise dataset derived from real-world video recordings. Through systematic ablation studies, we demonstrate that incorporating depth and skeleton features incrementally improves performance, validating the strength of our fusion strategy. Despite operating under small and noisy data regimes, our model achieves meaningful accuracy, and our analysis provides insights into modality-specific failure cases. The proposed system paves the way for robust gesture recognition solutions deployable in real-world environments with minimal data preprocessing.
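The abstract describes a three-branch CNN-GCN-LSTM pipeline. The following is a minimal PyTorch sketch of that general shape, not the authors' implementation: a tiny CNN stands in for the MobileNetV3Small backbones, the single GCN layer, joint count, feature widths, and additive attention are all illustrative assumptions, and the adjacency matrix is a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGCNLayer(nn.Module):
    """One graph convolution over skeleton joints: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, a_hat):  # x: (N, J, in_dim), a_hat: (J, J)
        return F.relu(self.lin(torch.matmul(a_hat, x)))

class GestureFusionNet(nn.Module):
    """Hypothetical CNN-GCN-LSTM fusion: per-frame RGB, depth, and skeleton
    features are concatenated, then pooled over time with additive attention."""
    def __init__(self, num_joints=21, num_classes=10, feat=32):
        super().__init__()
        # Tiny CNN standing in for a MobileNetV3Small backbone (assumption)
        self.rgb_cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat))
        self.depth_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat))
        self.gcn = SimpleGCNLayer(3, feat)   # each joint carries (x, y, z)
        self.lstm = nn.LSTM(3 * feat, 64, batch_first=True)
        self.attn = nn.Linear(64, 1)         # additive temporal attention
        self.head = nn.Linear(64, num_classes)

    def forward(self, rgb, depth, skel, a_hat):
        # rgb: (B, T, 3, H, W), depth: (B, T, 1, H, W), skel: (B, T, J, 3)
        B, T = rgb.shape[:2]
        fr = self.rgb_cnn(rgb.flatten(0, 1)).view(B, T, -1)
        fd = self.depth_cnn(depth.flatten(0, 1)).view(B, T, -1)
        fs = self.gcn(skel.flatten(0, 1), a_hat).mean(dim=1).view(B, T, -1)
        h, _ = self.lstm(torch.cat([fr, fd, fs], dim=-1))   # fuse modalities
        w = torch.softmax(self.attn(h), dim=1)              # (B, T, 1) weights
        ctx = (w * h).sum(dim=1)                            # attention pooling
        return self.head(ctx)

# Smoke test on random tensors
B, T, J = 2, 8, 21
model = GestureFusionNet(num_joints=J, num_classes=10)
a_hat = torch.eye(J)  # placeholder adjacency; a real one encodes hand topology
logits = model(torch.randn(B, T, 3, 32, 32),
               torch.randn(B, T, 1, 32, 32),
               torch.randn(B, T, J, 3), a_hat)
print(tuple(logits.shape))  # (2, 10)
```

Late fusion by concatenation before the LSTM, as sketched here, is one common choice; the paper's actual fusion point and layer sizes may differ.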