JARVIS: A Just-in-Time AR Visual Instruction System That Seamlessly Helps Across Realities
HCI Today summarized the key points
- This article presents research on the JARVIS system, which uses AR and AI to help users perform tasks that span real and virtual environments.
- Conventional instructions were inconvenient: users had to read them and then repeatedly switch back to the task. The new guidance system, which combines AR and AI, aims to reduce this burden.
- The research team categorized tasks spanning real and virtual worlds into four types, and first examined the differences among photo-, video-, and text-based guidance.
- Photos and videos proved easier to understand than text, and guidance that communicates the current status was especially important for task success.
- With JARVIS, a single instruction provides step-by-step guidance along with status checks, leading to fewer mistakes and a higher success rate.
This summary was generated by an AI editor based on HCI expert perspectives.
Why Read This from an HCI Perspective
This article goes beyond simply pairing AR and AI to ‘show explanations.’ It expands into interaction design that helps users understand what they are doing right now. In particular, it shows which of text, images, or video is least confusing in real tasks, and why status checks and error recovery are crucial, making it highly relevant to both HCI/UX practice and research.
CIT's Commentary
The core of this study is not whether the AI is smart, but how well users can follow it and verify its guidance. Mechanisms that reveal ‘where you are right now’, such as a status panel, a preview of the target state, and error feedback, are especially important in systems where the cost of failure is high, like autonomous driving or remote control. In real products, however, the trade-off between VLM inference speed and accuracy translates directly into a trade-off in user experience: simplifying the model for faster responses increases incorrect answers, while improving accuracy slows reactions. Such systems should therefore be designed not only around model performance, but also around when users should intervene and when they can trust the system.
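The paper itself does not prescribe an escalation policy; as a minimal, hypothetical sketch of the intervene-or-trust decision described above, a two-tier status check might look like this (all names, thresholds, and the tiering itself are illustrative assumptions, not part of JARVIS):

```python
def next_action(fast_conf, slow_conf=None, threshold=0.8):
    """Decide how to verify the user's current step (illustrative sketch).

    fast_conf:  confidence of a cheap, low-latency status check.
    slow_conf:  confidence of an expensive, accurate check (None if not run yet).
    Returns one of: "proceed", "run_slow_check", "ask_user".
    """
    if fast_conf >= threshold:
        return "proceed"           # fast check is confident enough: stay responsive
    if slow_conf is None:
        return "run_slow_check"    # escalate: pay latency to buy accuracy
    if slow_conf >= threshold:
        return "proceed"           # careful check resolved the uncertainty
    return "ask_user"              # neither check is sure: the user intervenes
```

The point of the sketch is that the speed/accuracy trade-off need not be fixed at design time: the system can stay fast on the common path and spend latency, or ask the user, only when confidence drops.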
Questions to Consider While Reading
- Q. When shifting real-time status validation from a user-request model to automatic monitoring, what approaches can reduce latency and computational cost while maintaining safety?
- Q. When text, images, video, and a status panel are all present, how can we design criteria that adjust the optimal information density for novices versus experts?
- Q. In mobile- and social-centric environments such as Korea’s, where screens are small and context switches are frequent, how might AR-style guidance be scaled down or restructured to remain effective?
This commentary was generated by an AI editor based on HCI expert perspectives.
Please refer to the original for accurate details.
Subscribe to Newsletter
Get the weekly HCI highlights delivered to your inbox every Friday.