International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 34
Year of Publication: 2025
Authors: Sreepal Reddy Bolla
Sreepal Reddy Bolla. Multi-modal LLMs for NLP: Integrating Text, Image and Video. International Journal of Computer Applications 187, 34 (Aug 2025), 66-71. DOI=10.5120/ijca2025925480
The present study examines how integrating text, image, and video data through multi-modal learning can improve the capabilities of Large Language Models (LLMs). Current LLMs are highly effective at processing natural language, but they could be even more capable if they could handle more than one type of input. We propose a new framework that combines text-based LLMs, such as GPT-4, with image and video models built on transformers and convolutional neural networks (CNNs). The framework is applied to tasks such as visual question answering (VQA) and automated content generation, showing substantial gains in accuracy and contextual understanding. Compared to text-only models, our multi-modal model performed 25% better on VQA benchmarks. The system also improved content generation by producing richer, more context-aware outputs. The results show that multi-modal learning can advance LLMs by helping them understand and respond to diverse types of input more effectively.
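To make the described fusion of modalities concrete, the following is a minimal PyTorch sketch of a multi-modal VQA model in the spirit of the abstract: a transformer text encoder, a CNN image encoder, and a frame-averaging video encoder whose embeddings are fused before answer classification. All module names, dimensions, and the fusion strategy (concatenation followed by an MLP) are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch of the multi-modal framework described in the abstract.
# Assumptions: embedding dim, encoder depths, and concatenation-based fusion
# are placeholders; a real system would plug in pretrained GPT-style and
# vision backbones instead of these small stand-in encoders.
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Stand-in for an LLM text encoder (e.g., a frozen GPT-style backbone)."""
    def __init__(self, vocab_size=30_000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):                 # (B, T)
        h = self.encoder(self.embed(token_ids))   # (B, T, dim)
        return h.mean(dim=1)                      # pooled question embedding


class ImageEncoder(nn.Module):
    """Small CNN standing in for a pretrained vision backbone."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, images):                    # (B, 3, H, W)
        return self.proj(self.conv(images).flatten(1))


class VideoEncoder(nn.Module):
    """Encodes each frame with the image encoder and averages over time."""
    def __init__(self, image_encoder):
        super().__init__()
        self.image_encoder = image_encoder

    def forward(self, videos):                    # (B, F, 3, H, W)
        b, f = videos.shape[:2]
        frames = videos.flatten(0, 1)             # (B*F, 3, H, W)
        feats = self.image_encoder(frames).view(b, f, -1)
        return feats.mean(dim=1)                  # temporal average pooling


class MultiModalVQA(nn.Module):
    """Concatenates the three modality embeddings and classifies an answer."""
    def __init__(self, num_answers=1000, dim=256):
        super().__init__()
        self.text = TextEncoder(dim=dim)
        self.image = ImageEncoder(dim=dim)
        self.video = VideoEncoder(self.image)     # share the image backbone
        self.fusion = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def forward(self, token_ids, images, videos):
        fused = torch.cat(
            [self.text(token_ids), self.image(images), self.video(videos)], dim=-1
        )
        return self.fusion(fused)                 # answer logits


# Smoke test with random inputs: a batch of 2 questions, images, and 4-frame clips.
model = MultiModalVQA()
logits = model(
    torch.randint(0, 30_000, (2, 12)),            # question token ids
    torch.randn(2, 3, 64, 64),                    # images
    torch.randn(2, 4, 3, 64, 64),                 # video clips
)
print(logits.shape)                               # torch.Size([2, 1000])
```

Concatenation is only one plausible fusion choice; cross-attention between the question embedding and visual tokens is a common alternative and may better match the contextual gains reported in the abstract.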