International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 105
Year of Publication: 2026
Authors: Anurag Shrivastava, Shivang Agrawal, Sanjana Keshari, Mohd. Taukeer, Krishna Vishwakarma
DOI: 10.5120/ijcac86083229a5f
Anurag Shrivastava, Shivang Agrawal, Sanjana Keshari, Mohd. Taukeer, Krishna Vishwakarma. Vision Bridge: An Adaptive Serverless Architecture for Multimodal Heritage Tourism - CLIP-Based Visual Querying with Split-Horizon Delivery on Low-Bandwidth Networks. International Journal of Computer Applications. 187, 105 (May 2026), 51-60. DOI=10.5120/ijcac86083229a5f
Heritage tourism in India occupies a curious position: the sites themselves are extraordinary, yet the information infrastructure surrounding them remains thin, fragmented, and predominantly English-language, excluding most of the domestic visitors these sites serve. This paper describes Vision Bridge, a serverless multimodal chatbot for heritage tourists that operates entirely through the Telegram messaging platform. The system accepts photographs of architectural features and returns contextually accurate multilingual descriptions, as both text and synthesized audio, within two seconds on standard mobile connections, with no application installation required. The authors introduce three original contributions beyond the prior text-only serverless heritage chatbot architecture on which this work builds. First, the Adaptive Confidence-Gated Visual Query Module (ACVQM): a CLIP ViT-B/32 embedding retrieval system augmented with an image quality pre-filter and a query-adaptive threshold mechanism that adjusts matching confidence requirements based on estimated query ambiguity, improving identification robustness under real outdoor tourism conditions. Second, the Split-Horizon Delivery Protocol (SHDP): a formally defined two-phase asynchronous pipeline that decouples initial text delivery from background audio synthesis, achieving 620 ms perceived response latency while full audio narration completes within 2.0 seconds. Third, a theoretical grounding of the design in Cognitive Load Theory and Information Foraging Theory, providing a principled framework for understanding why multimodal, audio-visual delivery of heritage information outperforms text-only presentation for tourists navigating unfamiliar architectural environments. Experimental evaluation across 500+ interaction cycles at the Residency Complex, Lucknow, demonstrates 87.4% top-1 visual identification accuracy with sub-500 ms inference on CPU-only cloud hardware.
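The abstract describes ACVQM only at a high level. The following is a minimal sketch of the retrieval-with-adaptive-threshold idea it names, assuming precomputed CLIP ViT-B/32 image embeddings; the `identify` function, the `base_threshold` and `ambiguity_weight` parameters, and the margin-based ambiguity estimate are illustrative assumptions, not the authors' implementation (the image quality pre-filter and the CLIP encoder itself are omitted).

```python
import numpy as np

def identify(query_emb, gallery_embs, labels,
             base_threshold=0.60, ambiguity_weight=0.15):
    """Hypothetical confidence-gated retrieval sketch.

    query_emb   : (d,) embedding of the tourist's photo (assumed precomputed)
    gallery_embs: (n, d) embeddings of catalogued architectural features
    labels      : n human-readable feature names
    """
    # Cosine similarity via L2-normalised dot products
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q

    order = np.argsort(sims)[::-1]
    top, runner_up = sims[order[0]], sims[order[1]]

    # Query-adaptive threshold: a small top-1/top-2 margin suggests an
    # ambiguous query, so demand higher confidence before answering.
    ambiguity = 1.0 - (top - runner_up)
    threshold = base_threshold + ambiguity_weight * ambiguity

    if top >= threshold:
        return labels[order[0]], float(top)
    return None, float(top)  # abstain, e.g. ask for a clearer photo
```

A clear-cut query (large top-1 margin) is answered at the base threshold, while an ambiguous one must clear a raised bar; the gating therefore trades a small recall loss for robustness under the outdoor conditions the paper targets.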
A seven-day field pilot with 120 participants yielded a Cronbach's alpha of 0.89 for the TAM instrument, a visual utility mean score of 4.71/5 (SD = 0.39), and a statistically significant improvement over text-only baseline scores (t(119) = 3.47, p < 0.001, Cohen's d = 0.63). These results position Vision Bridge as a practically viable, replicable architectural blueprint for inclusive multimodal heritage information systems in resource-constrained deployments.
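The Split-Horizon Delivery Protocol summarized in the abstract (text first, audio synthesized in the background) can be sketched as a two-phase asyncio pipeline. Everything below is a hedged illustration: `synthesize_audio` and `deliver` are stand-ins for the paper's TTS backend and Telegram Bot API calls, which the abstract does not specify.

```python
import asyncio

async def synthesize_audio(text: str) -> bytes:
    # Stand-in for a TTS backend call (assumption, not the authors' code)
    await asyncio.sleep(0.05)  # simulate background synthesis time
    return text.encode()

async def handle_query(description: str, deliver) -> asyncio.Task:
    """Two-phase split-horizon delivery: text now, audio later."""
    # Phase 1: immediate text reply; perceived latency is this call alone
    await deliver("text", description)

    # Phase 2: schedule audio synthesis without blocking phase 1
    async def audio_phase():
        audio = await synthesize_audio(description)
        await deliver("audio", audio)

    return asyncio.create_task(audio_phase())

async def demo():
    log = []

    async def deliver(kind, payload):
        log.append(kind)  # stand-in for a Telegram send call

    task = await handle_query("sample heritage description", deliver)
    # Text has already been delivered before audio synthesis finishes
    assert log == ["text"]
    await task
    return log

log = asyncio.run(demo())
```

Running `demo()` yields `log == ["text", "audio"]`: the caller observes the text reply immediately, which is the decoupling the abstract credits for the 620 ms perceived latency versus the 2.0 s full-audio completion.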