Beyond Descriptions: A Generative Scene2Audio Framework for Blind and Low-Vision Users to Experience Vista Landscapes
HCI Today's Summary of the Key Points
- This article introduces the Scene2Audio framework, which helps blind and low-vision (BLV) users experience distant landscapes through non-verbal sound.
- The system identifies the key sound-producing objects in a scene, generates a sound for each object, and composes the result using auditory scene analysis and Foley techniques (a minimal sketch of this pipeline appears below).
- In auditory experiments and a study with BLV users, this approach achieved higher scene understanding and stronger preference than conventional image-to-audio conversion, and it was especially effective when combined with speech.
- In a mobile-app study where participants used the system in daily life for one week, the detail mode was preferred. The results also showed the need to separate audio for relaxation from audio for practical use, depending on context.
- Ultimately, Scene2Audio demonstrated the potential to improve BLV users' access to distant landscapes, but reducing latency and mitigating hallucinations (incorrectly generated sounds) remain challenges.
This summary was generated by an AI editor based on HCI expert perspectives.
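To make the summarized pipeline concrete, here is a minimal, hypothetical sketch of a Scene2Audio-style flow in Python: detect the sound-producing objects, generate a placeholder Foley clip per object, and mix the clips into one stereo scene. Everything here is an illustrative assumption rather than the paper's implementation: `detect_sound_sources` stands in for the vision stage (e.g., a vision-language model), `foley_clip` stands in for a generative audio model, and the panning and distance-gain rules are invented.

```python
from dataclasses import dataclass

import numpy as np

SR = 16_000  # sample rate in Hz

@dataclass
class SceneObject:
    label: str       # e.g. "waterfall"
    azimuth: float   # -1.0 (far left) .. +1.0 (far right)
    distance: float  # relative distance; larger means quieter

def detect_sound_sources(image_path: str) -> list[SceneObject]:
    """Stand-in for the vision stage; a real system would ask a
    vision-language model to list the scene's sound-producing objects."""
    return [
        SceneObject("waterfall", azimuth=-0.6, distance=2.0),
        SceneObject("birds", azimuth=0.4, distance=1.0),
    ]

def foley_clip(obj: SceneObject, seconds: float = 3.0) -> np.ndarray:
    """Stand-in for per-object Foley generation; shaped noise serves as a
    placeholder for a generative audio model's output."""
    n = int(seconds * SR)
    rng = np.random.default_rng(hash(obj.label) % 2**32)
    noise = rng.standard_normal(n)
    # Crude spectral shaping: a longer smoothing kernel darkens the sound.
    k = 64 if obj.label == "waterfall" else 8
    return np.convolve(noise, np.ones(k) / k, mode="same")

def mix_scene(objects: list[SceneObject]) -> np.ndarray:
    """Pan and attenuate each clip, then sum into a stereo scene. Position
    and loudness differences give listeners the kind of auditory-scene-
    analysis cues they use to separate concurrent sources."""
    clips = [foley_clip(o) for o in objects]
    out = np.zeros((max(len(c) for c in clips), 2))
    for obj, clip in zip(objects, clips):
        gain = 1.0 / max(obj.distance, 1.0)        # farther -> quieter
        left = gain * (1.0 - obj.azimuth) / 2.0    # simple linear pan
        right = gain * (1.0 + obj.azimuth) / 2.0
        out[: len(clip), 0] += left * clip
        out[: len(clip), 1] += right * clip
    peak = np.abs(out).max()
    return out / peak if peak > 0 else out         # normalize to avoid clipping

scene = mix_scene(detect_sound_sources("vista.jpg"))
print(scene.shape)  # (48000, 2): three seconds of stereo audio
```

Even this toy version shows why the composition stage matters: without distinct spatial positions and levels, concurrently generated sounds collapse into an undifferentiated wash.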
Why Read This from an HCI Perspective
This article is worth reading because it treats accessibility technology not just as an information-delivery tool, but as an HCI challenge that extends to conveying the atmosphere and aesthetic experience of a scene. In particular, its design stands out for aiming to reduce the cognitive burden on BLV users while simultaneously improving both comprehension and engagement. It also presents both the potential and limitations of applying generative AI to accessibility. For practitioners, it offers criteria for designing multimodal feedback; for researchers, it provides an evaluation framework that connects sound synthesis, user experience, and real-world usability.
CIT's Commentary
What's especially interesting is that this work reframes the problem from a 'description'-centric approach into one that makes scenes something users can 'experience.' The results are quite convincing: speech can convey information but lacks emotional richness, while audio alone leaves too much room for interpretation and can feel unreliable. That is why Overlay, which layers speech over the non-verbal audio, reads as a sensible compromise. It suggests that the key is not so much the raw performance of the generative model as how the information is arranged: the interaction design that decides what to tell users first, and when. At the same time, the finding that the detail mode was preferred in the wild reinforces that what counts as a 'good experience' depends on context. In real use, it will be important to separate audio for enjoyment from audio for tasks, and to design safety mechanisms against latency, errors, and hallucinations.
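To illustrate the "what first, and when" point, here is a small, hypothetical sketch of an overlay scheduler: a short spoken label plays over the ambient scene, with the ambience "ducked" (attenuated) while the speech is active. The timings, gain values, and stand-in signals are all invented for illustration and are not the paper's actual Overlay design.

```python
import numpy as np

SR = 16_000  # sample rate in Hz

def overlay(ambience: np.ndarray, speech: np.ndarray,
            speech_start: int, duck_gain: float = 0.3) -> np.ndarray:
    """Mix a spoken label over the ambience, ducking the ambience while
    the speech is active so the label stays intelligible."""
    out = ambience.copy()
    lo = speech_start
    hi = min(lo + len(speech), len(out))
    out[lo:hi] *= duck_gain                 # duck the ambience under the speech
    out[lo:hi] += speech[: hi - lo]         # add the spoken label on top
    peak = np.abs(out).max()
    return out / peak if peak > 0 else out  # normalize to avoid clipping

# Mono stand-ins: noise for the scene ambience, a tone for a 1-second TTS label.
ambience = 0.5 * np.random.default_rng(0).standard_normal(SR * 5)
speech = 0.8 * np.sin(2 * np.pi * 440 * np.arange(SR) / SR)
mix = overlay(ambience, speech, speech_start=int(0.5 * SR))
print(mix.shape)  # (80000,): five seconds of overlaid audio
```

Ducking rather than muting keeps the scene's atmosphere continuous while still giving priority to the informational channel when the two compete.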
Questions to Consider While Reading
- Q. How can a system distinguish and adapt to situations where BLV users want to understand a scene versus when they want to enjoy it?
- Q. When the immersion provided by non-verbal audio conflicts with information reliability, what priority rules are needed?
- Q. For real-time use, what design approach is most effective at reducing generation latency and hallucinations while maintaining the quality of the current audio experience?
This commentary was generated by an AI editor based on HCI expert perspectives.
Please refer to the original for accurate details.