Beyond Benchmarks: How Users Evaluate AI Chat Assistants
HCI Today summarized the key points:
- This study compares seven AI chat assistant platforms through user surveys, analyzing satisfaction levels and the reasons behind users’ choices.
- In a survey of 388 active users, the top three platforms (ChatGPT, Claude, and DeepSeek) showed nearly identical satisfaction, with no major performance gap.
- More than 80% of users reported using two or more platforms together, suggesting that AI chat tools function more like utilities users switch between than like a fixed ecosystem.
- The reasons for choosing platforms differed: ChatGPT was favored for UI/UX, Claude for response quality, DeepSeek for word of mouth, and Grok for content-policy preferences.
- Hallucinations and content filtering remain common sources of dissatisfaction, and the market is likely to sustain specialized competition rather than converge on a single winner.
This summary was generated by an AI editor based on HCI expert perspectives.
Why Read This from an HCI Perspective
This article is significant from an HCI perspective because it frames AI chat assistants not as a question of model performance alone, but as a matter of user experience (UX) and platform choice. Even where benchmark scores differ, user satisfaction may be nearly identical across platforms, and users often run two or more platforms in parallel, switching depending on the situation. These findings prompt us to revisit product design and evaluation frameworks. In particular, the way adoption drivers are separated into UI/UX, response quality, and content policy offers direct implications for practitioners.
CIT's Commentary
From a CIT perspective, what matters in this study is that it asks ‘why do users keep using this tool’ rather than ‘is the model smarter.’ The market is currently shaped more by low switching costs and multi-homing than by strong lock-in, which suggests that AI chatbots should be viewed not as a single product but as a portfolio of task-specific tools. Likewise, the finding that satisfaction differences among top platforms are small can be read as evidence that benchmark competition alone does not explain what drives user experience. However, because the sample skews toward tech-friendly communities, requirements for reliability, readability, and acceptable policy boundaries may differ when the findings are extended to the general user base.
Questions to Consider While Reading
- Q. Do the low switching-cost and multi-homing patterns observed in a tech-savvy sample also appear in the general user population? How should follow-up studies be designed to test this?
- Q. UI/UX, response quality, and content policy emerged as distinct adoption drivers across different platforms. How does the relative importance of these three factors vary by task type?
- Q. If satisfaction differences are small and top platforms appear substitutable, what weighting between benchmark metrics and user-experience metrics would be most appropriate when evaluating AI chat assistants? (A toy weighting sketch follows this list.)
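As a thought experiment for that last question, here is a minimal sketch of one possible evaluation blend. The linear weighting, the 0-to-1 scaling, and all example numbers are illustrative assumptions, not figures from the study:

```python
# Toy composite score: blend a benchmark score with a user-satisfaction score.
# Both inputs are assumed to be normalized to [0, 1]; the weight w sets how
# much benchmark performance counts relative to user experience.

def composite_score(benchmark: float, ux: float, w: float = 0.5) -> float:
    """Weighted blend of benchmark and UX scores (both scaled to [0, 1])."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * benchmark + (1.0 - w) * ux

# A hypothetical platform that benchmarks well but satisfies users less
# (0.92 / 0.70) outranks one at 0.80 / 0.85 only under benchmark-heavy weights.
print(f"{composite_score(0.92, 0.70, w=0.8):.3f}")  # 0.876
print(f"{composite_score(0.80, 0.85, w=0.8):.3f}")  # 0.810
print(f"{composite_score(0.92, 0.70, w=0.3):.3f}")  # 0.766
print(f"{composite_score(0.80, 0.85, w=0.3):.3f}")  # 0.835
```

Varying w makes the substitutability question concrete: the ranking between the "benchmark winner" and the "UX winner" flips as the weight shifts, so the choice of weighting is itself an evaluation-design decision.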
This commentary was generated by an AI editor based on HCI expert perspectives.
Please refer to the original for accurate details.