IJSMT Journal

International Journal of Science, Strategic Management and Technology

An International, Peer-Reviewed, Open Access Scholarly Journal. Indexed in recognized academic databases · DOI via Crossref. The journal adheres to established scholarly publishing, peer-review, and research ethics guidelines set by the UGC.

ISSN: 3108-1762 (Online)

Plagiarism Passed
Peer reviewed
Open Access

VISION-BASED CONTEXT UNDERSTANDING USING MULTIMODAL AI

AUTHORS:
Rudra Gupta
Abdul Khalid
Mentor
Affiliation: B.Tech (Information Technology), NIET, Greater Noida
CC BY 4.0 License:
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract

Understanding an image in a meaningful way requires more than identifying isolated objects. A useful description of a scene must also capture actions, relations, setting, intent, and often a coarse sense of social or emotional context. This ability, which humans perform naturally, remains a difficult …
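The abstract's claim that a scene description needs more than an object list can be made concrete with a small data structure. The following is a minimal sketch under stated assumptions: `SceneContext` and `Relation` are hypothetical names whose fields simply mirror the layers the abstract lists (objects, actions, relations, setting, intent, social or emotional context). It is not the authors' model or output format, and the example values are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    """A (subject, predicate, object) triple such as ("person", "holding", "umbrella")."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneContext:
    """Hypothetical schema: one field per layer of context named in the abstract."""
    objects: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)
    setting: str = ""
    intent: str = ""
    social_context: str = ""  # coarse social or emotional cue

    def describe(self) -> str:
        """Join every non-empty layer into a single ' | '-separated description."""
        parts: list[str] = []
        if self.setting:
            parts.append(f"Setting: {self.setting}")
        if self.objects:
            parts.append("Objects: " + ", ".join(self.objects))
        if self.actions:
            parts.append("Actions: " + ", ".join(self.actions))
        if self.relations:
            triples = "; ".join(f"{r.subject} {r.predicate} {r.obj}" for r in self.relations)
            parts.append("Relations: " + triples)
        if self.intent:
            parts.append(f"Intent: {self.intent}")
        if self.social_context:
            parts.append(f"Social/emotional context: {self.social_context}")
        return " | ".join(parts)


# Illustrative values only: the same objects, with and without the higher layers.
objects_only = SceneContext(objects=["person", "umbrella", "street"])
full = SceneContext(
    objects=["person", "umbrella", "street"],
    actions=["walking"],
    relations=[Relation("person", "holding", "umbrella")],
    setting="rainy city street",
    intent="staying dry",
    social_context="alone, appears hurried",
)

print(objects_only.describe())  # Objects: person, umbrella, street
print(full.describe())
```

The first description contains only the object list, roughly what an object detector alone would supply; everything the abstract says a useful description must add comes from the remaining fields. In a real system those fields would be predicted by a multimodal model rather than filled in by hand.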

HOW TO CITE

APA: Gupta, R., & Khalid, A. (2026). Vision-Based Context Understanding using Multimodal AI. International Journal of Science, Strategic Management and Technology, 02(05). https://doi.org/10.55041/ijsmt.v2i5.160

MLA: Gupta, Rudra, and Abdul Khalid. "Vision-Based Context Understanding using Multimodal AI." International Journal of Science, Strategic Management and Technology, vol. 02, no. 05, 2026, https://doi.org/10.55041/ijsmt.v2i5.160.

Chicago: Gupta, Rudra, and Abdul Khalid. "Vision-Based Context Understanding using Multimodal AI." International Journal of Science, Strategic Management and Technology 02, no. 05 (2026). https://doi.org/10.55041/ijsmt.v2i5.160.

Ethics and Compliance
✓ All ethical standards met
This article has undergone plagiarism screening and double-blind peer review. Editorial policies have been followed. Authors retain copyright under CC BY-NC 4.0 license. The research complies with ethical standards and institutional guidelines.
Similar Articles
A Hybrid Machine Learning Framework for Intelligent Decision Support in Education
S.P.Maske (2026)
DOI: 10.55041/ijsmt.v2i3.278

Therapeutic Writing and Emotional Healing in Preeti Shenoy's Wake Up! Life is Calling and Life is What You Make it
Ayisha.B (2026)
DOI: 10.55041/ijsmt.v2i4.370

A Study on the Performance Appraisal of Employees at Sri Lakshmi Saraswathi Textile Mills Arni
Dr.A.Sivanandam (2026)
DOI: 10.55041/ijsmt.v2i5.384