Beyond Words: Measuring User Experience through Speech Analysis in Voice User Interfaces
HCI Today summarized the key points
- This article reports research investigating whether user experience (UX) can be measured through speech analysis in Voice User Interfaces (VUIs).
- The researchers compared three voice assistant (VA) personas across three usage scenarios with 49 participants to examine the relationship between speech features and UX.
- Speech indicators such as prosody, speaking rate, pauses, audio quality, and disfluencies showed significant correlations with satisfaction, trust, and attractiveness.
- The study also showed that AI/ML models can classify good, neutral, and bad UX from speech features alone, suggesting that such models could serve as real-time measurement tools to complement surveys.
- The study points toward moving VUI evaluation away from post-hoc, survey-centered methods and toward adaptive interfaces that leverage speech signals during conversation.
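To make the classification idea concrete, here is a minimal, purely illustrative sketch of mapping turn-level speech features to a coarse UX label. The feature names, thresholds, and rule-based scoring are hypothetical stand-ins for the paper's trained ML models, chosen only to show the input/output shape of such a classifier.

```python
def classify_ux(speaking_rate_wps: float,
                pause_ratio: float,
                disfluencies_per_min: float) -> str:
    """Toy stand-in for a speech-based UX classifier.

    Maps three illustrative turn-level features to a coarse label
    ("good" / "neutral" / "bad"). All thresholds are assumptions
    for demonstration, not values from the study.
    """
    score = 0
    if 2.0 <= speaking_rate_wps <= 3.5:   # comfortable speaking pace (words/sec)
        score += 1
    if pause_ratio < 0.2:                 # little of the turn spent in silence
        score += 1
    if disfluencies_per_min < 3:          # fluent speech, few fillers/restarts
        score += 1
    if score == 3:
        return "good"
    if score == 2:
        return "neutral"
    return "bad"

# Example: fluent, steadily paced speech with few pauses
print(classify_ux(2.5, 0.1, 1.0))   # -> good
# Example: long pauses and frequent disfluencies
print(classify_ux(2.5, 0.4, 6.0))   # -> bad
```

In practice the study's models learn such mappings from labeled data rather than hand-set rules, but the interface is the same: speech features in, a UX class out, computable during the conversation.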
This summary was generated by an AI editor based on HCI expert perspectives.
Why Read This from an HCI Perspective
This article matters for both HCI practitioners and researchers because it moves beyond the practice of relying solely on post-task surveys for VUI evaluation, instead attempting to read users' utterances themselves as UX signals. If speech analysis can capture moments of discomfort, effort, and engagement in near real time, it becomes possible to design evaluation frameworks with fewer interaction drop-offs and lower operational costs. In particular, aligning system logs with utterances and interpreting them together is a practical approach.
CIT's Commentary
From a CIT perspective, the core contribution of this study is its empirical demonstration that speech is not merely an input channel but behavioral data that reveals interaction quality. That said, we believe these results are better viewed as a complement to UX measurement than a substitute. Speech characteristics are sensitive to confounding factors such as fatigue, intonation, vocal restraint, dialect, and microphone quality, so individual user baselines and contextual information should be accounted for together. Even so, connecting turn-level analysis to adaptive VUIs is a direction with strong potential for real-world deployment. In particular, the idea of reading good UX not as a judgment made after the fact but as signals observed during interaction aligns well with dynamic experience measurement, something CIT places great importance on.
Questions to Consider While Reading
- Q. How robust is speech-based UX inference to individual user differences and environmental noise, and how does performance change under subject-independent conditions?
- Q. When applying this approach to a real service, how should privacy and consent be designed for continuously analyzing users' speech?
- Q. Beyond classifying good, neutral, and bad UX, can specific speech features be interpreted more precisely as pointing to specific system problems (latency, errors, excessive verbosity)?
This commentary was generated by an AI editor based on HCI expert perspectives.
Please refer to the original for accurate details.