Does Explanation Correctness Matter? Linking Computational XAI Evaluation to Human Understanding
HCI Today summarized the key points
- This article reports a study testing whether the functional correctness of XAI explanations actually translates into human understanding.
- The research team manipulated explanation correctness in a classification task based solely on time-series information, at four levels: 100%, 85%, 70%, and 55%.
- In an experiment with 200 participants, understanding decreased as correctness dropped, but not proportionally at every step.
- At 70% and 55%, performance was significantly worse than at 100%, but the additional drop from 70% to 55% made little further difference.
- Overall, small differences in functional correctness scores do not directly translate into human understanding, so human-centered validation is needed.
This summary was generated by an AI editor based on HCI expert perspectives.
Why Read This from an HCI Perspective
This article matters for HCI/UX researchers because it experimentally tests how much the ‘correctness’ of XAI explanations actually changes human understanding. In particular, it uses a time-series task rather than an image-based one and measures understanding via forward simulation on a task designed to exclude prior human intuition, revealing a gap between functional metrics and human outcomes. Practically, it also highlights the limitations of approaches that optimize explanation quality purely as a numeric score.
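For readers less familiar with the measure, forward simulation scores understanding by how well a person can predict the model's output for new inputs after studying its explanations. A minimal sketch of that scoring idea, written in Python with purely illustrative names and data (not taken from the study), might look like this:

```python
# Illustrative sketch only: forward simulation asks participants to predict the
# model's output for new inputs after seeing explanations. Understanding is then
# scored as agreement between participant predictions and the model's predictions.

def forward_simulation_accuracy(participant_predictions, model_predictions):
    """Fraction of trials on which the participant predicted the model's label."""
    assert len(participant_predictions) == len(model_predictions)
    matches = sum(
        p == m for p, m in zip(participant_predictions, model_predictions)
    )
    return matches / len(model_predictions)

# Hypothetical example: a participant anticipates the model on 7 of 10 trials.
print(forward_simulation_accuracy(
    [1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
))  # 0.7
```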
CIT's Commentary
From a CIT perspective, this study carefully challenges an assumption often taken for granted in XAI evaluation: that ‘higher correctness leads to better understanding.’ The difference between 100% and 85% was not clear-cut, and understanding dropped only at 70% or below, suggesting that explanation quality may operate in a threshold-like manner rather than continuously. The more important point, however, is that even at the same correctness level, some participants ultimately failed to learn the pattern. In other words, what matters is not just the performance of the explanation itself; the design must also account for users’ learnability, task difficulty, and the feedback structure. From CIT’s viewpoint, this is not merely a question of having a ‘good explanation,’ but of designing the interaction conditions under which explanations translate into understanding.
Questions to Consider While Reading
- Q. If explanation correctness up to 85% does not produce a large difference in understanding, from what level onward would quality improvements meaningfully show up in user experience in real products?
- Q. Should the threshold effect observed in time-series tasks be expected to appear in other domains as well, such as images, text, or recommendation systems?
- Q. If we consider not only forward-simulation performance but also trust, decision quality, and error-detection ability, how would the effect of explanation correctness change?
This commentary was generated by an AI editor based on HCI expert perspectives.
Please refer to the original for accurate details.