International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 19
Year of Publication: 2025
Authors: Sai Teja Erukude, Suhasnadh Reddy Veluru, Viswa Chaitanya Marella
Sai Teja Erukude, Suhasnadh Reddy Veluru, Viswa Chaitanya Marella. Multimodal Deep Learning: A Survey of Models, Fusion Strategies, Applications, and Research Challenges. International Journal of Computer Applications 187, 19 (Jul 2025), 1-7. DOI=10.5120/ijca2025925264
Multimodal deep learning has become a primary methodological framework in artificial intelligence, enabling models to learn from, and reason over, heterogeneous data types such as text, images, audio, and video. By using multiple modalities simultaneously, systems gain contextual understanding, noise resilience, and generalization that more closely resemble human perception. This survey offers a comprehensive overview of the field, examining the fundamentals of modality integration, fusion strategies (early, late, and hybrid), and key architectural advances in models such as CLIP, Flamingo, GPT-4V, Gemini 1.5, and AudioCLIP. It also surveys real-world applications in healthcare, autonomous systems, robotics, and education, along with the benchmark datasets and metrics essential for evaluating performance. Notable challenges, including modality imbalance, scalability, and interoperability, are highlighted, alongside emerging areas of interest such as long-context modeling and embodied intelligence. The goal is to provide a map of options for researchers and practitioners seeking to apply multimodal AI systems in both research and deployment.
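The early and late fusion strategies mentioned above can be contrasted in a minimal sketch. This is not code from the paper; the feature vectors, classifier weights, and class count are illustrative stand-ins: early fusion concatenates per-modality features before a single classifier, while late fusion scores each modality separately and combines the decisions.

```python
# Toy contrast of early (feature-level) vs. late (decision-level) fusion.
# All names and dimensions are illustrative, not from the surveyed paper.
import random

random.seed(0)
NUM_CLASSES = 3

def rand_vec(n):
    """A random vector, standing in for learned features or weights."""
    return [random.gauss(0, 1) for _ in range(n)]

def linear(weights, x):
    """Apply a linear classifier: one logit per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

# Stand-ins for encoder outputs of two modalities.
image_feat = rand_vec(8)  # e.g., from a vision encoder
text_feat = rand_vec(8)   # e.g., from a text encoder

def early_fusion(feats):
    """Concatenate modality features into one joint vector,
    then classify the joint representation."""
    joint = [v for f in feats for v in f]
    weights = [rand_vec(len(joint)) for _ in range(NUM_CLASSES)]
    return linear(weights, joint)

def late_fusion(feats):
    """Classify each modality independently, then average
    the per-modality logits."""
    per_modality = []
    for f in feats:
        weights = [rand_vec(len(f)) for _ in range(NUM_CLASSES)]
        per_modality.append(linear(weights, f))
    return [sum(ls) / len(ls) for ls in zip(*per_modality)]

early_logits = early_fusion([image_feat, text_feat])
late_logits = late_fusion([image_feat, text_feat])
```

A hybrid strategy, as surveyed in the paper, would mix the two: fuse some modalities at the feature level and combine the rest at the decision level.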