Multimodal Fusion Architecture for Real-Time Task Automation

Singh, Shiva

doi:https://doi.org/10.55041/ijsmt.v2i5.161

Plagiarism Passed

Peer reviewed

Open Access


AUTHORS:

Shiva Singh

Mentor

Abdul Khalid

Affiliation

B.Tech (Information Technology) NIET, Greater Noida


CC BY 4.0 License:

This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Abstract

Real-time task automation has evolved from rule-driven scripts into multimodal decision systems that must jointly interpret text, images, audio, and contextual metadata under strict latency constraints. Despite rapid progress in large language models, vision-language models, and speech systems, many practical automation pipelines still process modalities independently, fuse them too late, or rely on brittle prompt templates that amplify upstream noise. This paper presents a multimodal fusion architecture for real-time task automation that aligns text, image, and audio representations into a shared latent space before decoding a response or structured action. The overall pipeline follows an Input → Prompt Generation → Fusion-Aware AI Model → Output → Evaluation → Refinement structure, with a bounded feedback loop that retries low-confidence cases and escalates uncertain outputs to a human operator. The proposed design combines modality-specific encoders, lightweight projection adapters, a shared fusion representation, prompt conditioning, schema-aware output validation, and confidence-based routing. We evaluate the framework on three representative automation tasks: document summarisation with embedded charts, voice-driven form filling, and image-grounded question answering. Compared with a manual baseline, the automated pipeline reduces average task completion time from 142.4 seconds to 31.2 seconds and improves output consistency from 64.0% to 81.3%. The gains are strongest when outputs have a verifiable structure, while tasks requiring open-ended judgement still benefit from human oversight. Beyond empirical results, we discuss robustness issues observed during deployment, including prompt drift, modality mismatch, chart parsing errors, and refinement-loop instability. The paper concludes that practical multimodal automation depends not only on model quality, but also on careful interface design between preprocessing, fusion, validation, and fallback stages.

Index Terms: multimodal fusion, real-time automation, transformer, prompt engineering, vision-language models, speech processing, confidence estimation, task automation
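
The bounded feedback loop described in the abstract (retry low-confidence cases, then escalate to a human operator) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `run_with_refinement` and `Result`, the confidence threshold of 0.8, and the retry limit of 2 are all hypothetical choices made for illustration, and the model and validator are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Result:
    output: Optional[dict]  # validated structured output, or None if escalated
    confidence: float       # confidence of the last attempt
    attempts: int           # number of model calls made
    escalated: bool         # True when the case is handed to a human operator


def run_with_refinement(
    task: str,
    model: Callable[[str, Optional[str]], Tuple[dict, float]],
    validate: Callable[[dict], bool],
    threshold: float = 0.8,  # hypothetical value, not taken from the paper
    max_retries: int = 2,    # hypothetical value, not taken from the paper
) -> Result:
    """Call the model, check the output, retry a bounded number of times."""
    feedback: Optional[str] = None
    confidence = 0.0
    attempt = 0
    for attempt in range(1, max_retries + 2):  # one initial try plus retries
        output, confidence = model(task, feedback)
        # Accept only outputs that pass schema validation and are confident.
        if validate(output) and confidence >= threshold:
            return Result(output, confidence, attempt, escalated=False)
        # Otherwise pass a refinement hint to the next attempt.
        feedback = f"attempt {attempt} rejected (confidence={confidence:.2f})"
    # The loop is bounded: after the last retry, escalate instead of looping.
    return Result(None, confidence, attempt, escalated=True)
```

Bounding the number of retries is one simple way to avoid the refinement-loop instability the abstract mentions: a case that never reaches the threshold ends in escalation rather than repeated regeneration.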


Ethics and Compliance

✓ All ethical standards met

This article has undergone plagiarism screening and double-blind peer review. Editorial policies have been followed. Authors retain copyright under CC BY-NC 4.0 license. The research complies with ethical standards and institutional guidelines.


International Journal of Science, Strategic Management and Technology

ISSN: 3108-1762 (Online)
