MULTIMODAL FUSION ARCHITECTURE FOR REAL-TIME TASK AUTOMATION
Real-time task automation has evolved from rule-driven scripts into multimodal decision systems that must jointly interpret text, images, audio, and contextual metadata under strict latency constraints. Despite rapid progress in large language models, vision-language models, and speech systems, many practical automation pipelines still process modalities independently, fuse them too late, or rely on brittle prompt templates that amplify upstream noise. This paper presents a multimodal fusion architecture for real-time task automation that aligns text, image, and audio representations into a shared latent space before decoding a response or structured action. The overall pipeline follows an Input → Prompt Generation → Fusion-Aware AI Model → Output → Evaluation → Refinement structure, with a bounded feedback loop that retries low-confidence cases and escalates uncertain outputs to a human operator. The proposed design combines modality-specific encoders, lightweight projection adapters, a shared fusion representation, prompt conditioning, schema-aware output validation, and confidence-based routing. We evaluate the framework on three representative automation tasks: document summarisation with embedded charts, voice-driven form filling, and image-grounded question answering. Compared with a manual baseline, the automated pipeline reduces average task completion time from 142.4 seconds to 31.2 seconds and improves output consistency from 64.0% to 81.3%. The gains are strongest when outputs have a verifiable structure, while tasks requiring open-ended judgement still benefit from human oversight. Beyond empirical results, we discuss robustness issues observed during deployment, including prompt drift, modality mismatch, chart parsing errors, and refinement-loop instability.
The paper concludes that practical multimodal automation depends not only on model quality, but also on careful interface design between preprocessing, fusion, validation, and fallback stages.
Index Terms: multimodal fusion, real-time automation, transformer, prompt engineering, vision-language models, speech processing, confidence estimation, task automation
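The bounded feedback loop with confidence-based routing described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function and class names, the 0.8 confidence threshold, the retry limit of 2, and the toy JSON schema check are all assumptions chosen for illustration, and the model is represented by any callable that returns an output string with a confidence score.

```python
import json
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RoutedResult:
    output: str
    confidence: float
    attempts: int
    escalated: bool  # True means the case is handed to a human operator


def run_with_bounded_refinement(
    task_input: str,
    build_prompt: Callable[[str, int], str],
    model: Callable[[str], Tuple[str, float]],
    validate: Callable[[str], bool],
    threshold: float = 0.8,  # assumed value, not taken from the paper
    max_retries: int = 2,    # assumed value, not taken from the paper
) -> RoutedResult:
    """Retry low-confidence or invalid outputs a bounded number of times,
    then escalate instead of looping indefinitely."""
    output, confidence = "", 0.0
    total_attempts = max_retries + 1
    for attempt in range(total_attempts):
        prompt = build_prompt(task_input, attempt)
        output, confidence = model(prompt)
        # Accept only outputs that pass validation AND clear the threshold.
        if validate(output) and confidence >= threshold:
            return RoutedResult(output, confidence, attempt + 1, escalated=False)
    # Retry budget exhausted: route the last output to a human.
    return RoutedResult(output, confidence, total_attempts, escalated=True)


def build_prompt(task_input: str, attempt: int) -> str:
    """Hypothetical prompt builder that tightens instructions on retries."""
    hint = "" if attempt == 0 else " Return strictly valid JSON."
    return f"Task: {task_input}.{hint}"


def is_valid_form(output: str) -> bool:
    """Hypothetical schema check: output must be a JSON object with a 'name' key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "name" in data
```

The retry budget is what keeps the loop bounded: a case that never clears both checks is escalated after `max_retries + 1` attempts, which is one way to limit the refinement-loop instability the abstract mentions.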
Singh, S. (2026). Multimodal Fusion Architecture for Real-Time Task Automation. International Journal of Science, Strategic Management and Technology, 02(05). https://doi.org/10.55041/ijsmt.v2i5.161
Singh, Shiva. "Multimodal Fusion Architecture for Real-Time Task Automation." International Journal of Science, Strategic Management and Technology, vol. 02, no. 05, 2026. https://doi.org/10.55041/ijsmt.v2i5.161.
Singh, Shiva. "Multimodal Fusion Architecture for Real-Time Task Automation." International Journal of Science, Strategic Management and Technology 02, no. 05 (2026). https://doi.org/10.55041/ijsmt.v2i5.161.