IJSMT Journal

International Journal of Science, Strategic Management and Technology

An international, peer-reviewed, open access scholarly journal indexed in recognized academic databases, with DOIs assigned via Crossref. The journal adheres to established scholarly publishing, peer-review, and research ethics guidelines set by the UGC.

ISSN: 3108-1762 (Online)


MULTIMODAL FUSION ARCHITECTURE FOR REAL-TIME TASK AUTOMATION

Author: Shiva Singh
Mentor: Abdul Khalid
Affiliation: B.Tech (Information Technology), NIET, Greater Noida
CC BY 4.0 License:
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract

Real-time task automation has evolved from rule-driven scripts into multimodal decision systems that must jointly interpret text, images, audio, and contextual metadata under strict latency constraints. Despite rapid progress in large language models, vision-language models, and speech systems, many practical automation pipelines still process modalities independently, fuse them too late, or rely on brittle prompt templates that amplify upstream noise. This paper presents a multimodal fusion architecture for real-time task automation that aligns text, image, and audio representations into a shared latent space before decoding a response or structured action. The overall pipeline follows an Input → Prompt Generation → Fusion-Aware AI Model → Output → Evaluation → Refinement structure, with a bounded feedback loop that retries low-confidence cases and escalates uncertain outputs to a human operator. The proposed design combines modality-specific encoders, lightweight projection adapters, a shared fusion representation, prompt conditioning, schema-aware output validation, and confidence-based routing. We evaluate the framework on three representative automation tasks: document summarisation with embedded charts, voice-driven form filling, and image-grounded question answering. Compared with a manual baseline, the automated pipeline reduces average task completion time from 142.4 seconds to 31.2 seconds and improves output consistency from 64.0% to 81.3%. The gains are strongest when outputs have a verifiable structure, while tasks requiring open-ended judgement still benefit from human oversight. Beyond empirical results, we discuss robustness issues observed during deployment, including prompt drift, modality mismatch, chart parsing errors, and refinement-loop instability. The paper concludes that practical multimodal automation depends not only on model quality, but also on careful interface design between preprocessing, fusion, validation, and fallback stages.

Index Terms: multimodal fusion, real-time automation, transformer, prompt engineering, vision-language models, speech processing, confidence estimation, task automation
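The bounded feedback loop described in the abstract (validate the output, retry low-confidence cases, escalate to a human once the retry budget is spent) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `run_with_refinement`, `validate_schema`, and `stub_model`, the 0.8 confidence threshold, and the retry budget of 2 are all assumptions chosen for illustration, since the abstract does not state these details.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical settings: the abstract gives no threshold or retry budget.
CONFIDENCE_THRESHOLD = 0.8
MAX_RETRIES = 2

# A model is any callable mapping (prompt, attempt index) to (output, confidence).
Model = Callable[[str, int], Tuple[Dict[str, str], float]]


@dataclass
class Result:
    output: Dict[str, str]
    confidence: float
    attempts: int
    escalated: bool  # True means the case is handed to a human operator


def validate_schema(output: Dict[str, str], required_fields: List[str]) -> bool:
    """Schema-aware check: every required field is a non-empty string."""
    return all(
        isinstance(output.get(field), str) and bool(output[field].strip())
        for field in required_fields
    )


def run_with_refinement(
    model: Model,
    prompt: str,
    required_fields: List[str],
    threshold: float = CONFIDENCE_THRESHOLD,
    max_retries: int = MAX_RETRIES,
) -> Result:
    """Accept the first valid, confident output; escalate when the budget is spent."""
    output: Dict[str, str] = {}
    confidence = 0.0
    attempts = 0
    for attempt in range(max_retries + 1):  # one initial try plus bounded retries
        attempts = attempt + 1
        output, confidence = model(prompt, attempt)
        if validate_schema(output, required_fields) and confidence >= threshold:
            return Result(output, confidence, attempts, escalated=False)
    # Still invalid or low-confidence after all retries: route to a human.
    return Result(output, confidence, attempts, escalated=True)


def stub_model(prompt: str, attempt: int) -> Tuple[Dict[str, str], float]:
    """Toy stand-in for the fusion model: the first attempt misses a field."""
    if attempt == 0:
        return {"name": "Ada"}, 0.55
    return {"name": "Ada", "email": "ada@example.com"}, 0.91


result = run_with_refinement(stub_model, "Fill the form", ["name", "email"])
```

In this toy run the first attempt fails validation, the second passes, and nothing is escalated. Bounding the loop with a fixed retry budget is one simple way to guard against the refinement-loop instability that the abstract lists among its deployment issues.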

HOW TO CITE

APA: Singh, S. (2026). Multimodal Fusion Architecture for Real-Time Task Automation. International Journal of Science, Strategic Management and Technology, 02(05). https://doi.org/10.55041/ijsmt.v2i5.161

MLA: Singh, Shiva. "Multimodal Fusion Architecture for Real-Time Task Automation." International Journal of Science, Strategic Management and Technology, vol. 02, no. 05, 2026. https://doi.org/10.55041/ijsmt.v2i5.161.

Chicago: Singh, Shiva. "Multimodal Fusion Architecture for Real-Time Task Automation." International Journal of Science, Strategic Management and Technology 02, no. 05 (2026). https://doi.org/10.55041/ijsmt.v2i5.161.

Ethics and Compliance
✓ All ethical standards met
This article has undergone plagiarism screening and double-blind peer review. Editorial policies have been followed. Authors retain copyright under CC BY-NC 4.0 license. The research complies with ethical standards and institutional guidelines.