JARVIS: A Just-in-Time AR Visual Instruction System for Completing Missions Easily Across Real and Virtual Worlds
JARVIS: A Just-in-Time AR Visual Instruction System for Cross-Reality Task Guidance
HCI Today summarized the key points
- This article presents research on the JARVIS system, which uses AR and AI to help users carry out real and virtual tasks together.
- Conventional instructions were inconvenient because users had to alternate between reading and doing, whereas AR overlays guidance directly onto the task at hand.
- The research team examined real and virtual tasks separately and organized the guidance scenarios into four categories: real-to-real, real-to-virtual, virtual-to-real, and virtual-to-virtual.
- With a single input, JARVIS provides step-by-step guidance, state checking, and error correction, along with visual indicators such as diagrams, videos, and arrows (see the sketch after this summary).
- At present, however, users must trigger state checks and video playback themselves, so faster automatic verification remains future work.
This summary was generated by an AI editor based on HCI expert perspectives.
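To make the four guidance categories and the per-step state check more concrete, here is a minimal sketch of how such steps might be represented. This is not from the paper; all class and field names are hypothetical, and it only illustrates the idea of tagging each step with a source and target reality plus an expected state.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Reality(Enum):
    """Where a step's action takes place."""
    REAL = "real"
    VIRTUAL = "virtual"


class Indicator(Enum):
    """Visual aids mentioned in the summary: diagrams, videos, arrows."""
    DIAGRAM = "diagram"
    VIDEO = "video"
    ARROW = "arrow"


@dataclass
class GuidanceStep:
    """One step of just-in-time guidance.

    The source/target pair encodes the four scenario categories:
    real-to-real, real-to-virtual, virtual-to-real, virtual-to-virtual.
    """
    instruction: str
    source: Reality
    target: Reality
    indicator: Indicator
    expected_state: Optional[str] = None  # what the scene should look like afterwards

    @property
    def category(self) -> str:
        return f"{self.source.value}-to-{self.target.value}"


# Hypothetical hybrid task: starts in the real world, ends on screen.
steps = [
    GuidanceStep("Plug the controller into the PC", Reality.REAL, Reality.REAL,
                 Indicator.ARROW, expected_state="controller cable attached to the PC"),
    GuidanceStep("Open the game's settings menu", Reality.REAL, Reality.VIRTUAL,
                 Indicator.DIAGRAM, expected_state="settings menu visible on screen"),
]

for step in steps:
    print(f"[{step.category}] {step.instruction}")
```

Keeping the expected state as a plain-language description is one way to let a multimodal model verify it later, which connects to the validation discussion in the commentary below.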
Why Read This from an HCI Perspective
This article focuses on AR interactions where AI is not merely a tool that ‘generates’ answers, but a system that helps users understand their current situation and decide their next action. For HCI/UX practitioners and researchers, it offers design cues on which modalities—text, images, or video—are effective at different times, and why state feedback is crucial. In particular, it discusses practical ways to reduce cognitive load during complex tasks.
CIT's Commentary
The core of this piece is less about whether the AI is ‘smart’ and more about whether the user can see what is in front of them right now, confirm that they are on the right track, and understand where they can intervene. AR tutorials should not end with a single arrow; to operate safely, they need supporting mechanisms such as cues about the current state, previews of target states, and error checks. What is interesting is that while images can be fast and impose little cognitive burden, in real products even small changes in the situation can cause the explanation to drift out of sync with what the user actually sees. That is why runtime state validation, and breaking instructions into finer steps when needed, becomes important.

The same question carries over when attaching AI assistants to Korean services such as Naver and Kakao. Especially in tasks that mix the real world and the screen, such as games, digital work, and hybrid office tasks, guidance that reduces misunderstandings is often more valuable than guidance that merely delivers the ‘correct answer.’ Moreover, the approach of using LLMs not only as a UX generation engine but also as a tool for measuring and validating state is methodologically meaningful.
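As a rough illustration of what runtime state validation and step decomposition could look like, here is a minimal sketch under assumed interfaces. `query_vision_model` and `split_step` are hypothetical stand-ins for calls to a vision-language model, not JARVIS's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    instruction: str
    expected_state: str  # natural-language description of the target state


def query_vision_model(image: bytes, question: str) -> str:
    """Stand-in for a multimodal LLM call; wire up a real vision-language API here."""
    raise NotImplementedError


def split_step(step: Step) -> List[Step]:
    """Stand-in for LLM-driven decomposition of one instruction into finer sub-steps."""
    raise NotImplementedError


def state_is_valid(image: bytes, step: Step,
                   ask: Callable[[bytes, str], str] = query_vision_model) -> bool:
    """Ask the model whether the captured frame matches the step's target state."""
    answer = ask(image, f"Does this image show: {step.expected_state}? Answer yes or no.")
    return answer.strip().lower().startswith("yes")


def run_guidance(steps: List[Step], capture_frame: Callable[[], bytes],
                 max_attempts: int = 3) -> None:
    """Show each instruction, validate the resulting state, and refine on repeated failure."""
    queue = list(steps)
    while queue:
        step = queue.pop(0)
        print(f"Guide: {step.instruction}")
        for _ in range(max_attempts):
            if state_is_valid(capture_frame(), step):
                break  # target state confirmed; move on to the next step
        else:
            # Validation kept failing: break the instruction into finer sub-steps
            # and push them to the front of the queue before continuing.
            queue = split_step(step) + queue
```

The trade-off raised in the first question below lives in `max_attempts` and in how often frames are captured: checking more frequently catches drift sooner but adds latency and model cost.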
Questions to Consider While Reading
- Q. In real products, running state validation more often increases response time and cost; at what point can the system be considered ‘sufficiently safe’?
- Q. Among combinations of images, video, and text, which is best suited to beginners versus experienced users, and how should the system adapt to user proficiency?
- Q. Korean mobile app and AI assistant environments may call for simpler interfaces than rich AR visualizations; how can these differences be turned into design principles?
This commentary was generated by an AI editor based on HCI expert perspectives.
Please refer to the original paper for accurate details.