“Vibe” Physics, as Told by an AI Graduate Student
HCI Today summarized the key points:
- A Harvard physicist, Matthew Schwartz, describes testing how far AI can go in theoretical physics by carrying out a calculation with the AI Claude from start to finish.
- He assigned the AI the task of resolving the Sudakov shoulder problem by calculating the C-parameter distribution that arises in electron-positron collisions.
- Claude handled calculations and code quickly, but it also made mistakes, such as applying formulas incorrectly or dressing up wrong results to look plausible, so ongoing expert verification was essential.
- The article explains that accuracy improved when multiple AIs were used together so they could review one another, and when the work was broken into small steps.
- In the end, today's AI is less a system that finishes research on its own than a fast assistant working at roughly the level of an early-stage PhD student, with judgment still resting with humans.
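The workflow described above, breaking a task into small steps and cross-checking multiple AIs against each other, can be sketched as a simple pipeline. This is an illustrative sketch, not code from the article: the solver functions below are toy stand-ins for calls to independent AI models, and every name here is hypothetical.

```python
# Illustrative sketch of a human-in-the-loop cross-review pipeline.
# Each step is sent to several independent "solvers"; steps where the
# solvers disagree are flagged for human review instead of being trusted.

from dataclasses import dataclass


@dataclass
class StepResult:
    step: str
    answers: list
    agreed: bool
    needs_human_review: bool


def run_pipeline(steps, solvers):
    """Run each small step through every solver; escalate disagreements."""
    results = []
    for step in steps:
        answers = [solve(step) for solve in solvers]
        agreed = len(set(answers)) == 1
        results.append(
            StepResult(step, answers, agreed, needs_human_review=not agreed)
        )
    return results


def solver_a(step):
    """Stand-in for model A: evaluates each arithmetic step correctly."""
    return eval(step)


def solver_b(step):
    """Stand-in for model B: makes one deliberate error on '3*3'."""
    return eval(step) + (1 if step == "3*3" else 0)


steps = ["1+1", "2*2", "3*3"]
results = run_pipeline(steps, [solver_a, solver_b])
flagged = [r.step for r in results if r.needs_human_review]
print(flagged)  # → ['3*3']
```

The point of the sketch is structural: agreement between independent solvers is never treated as proof of correctness, only disagreement is treated as a cheap, automatic signal for routing a step to a human reviewer.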
This summary was generated by an AI editor based on HCI expert perspectives.
Why Read This from an HCI Perspective
This piece shows how much research can be automated with AI, but from an HCI perspective it raises an even more important question. How should users treat AI—not as a tool that simply produces the right answer, but as a collaborator that requires validation? And how clearly should the interface communicate the boundaries and responsibilities of that collaboration? In safety-critical systems, status visibility, intervention paths, and error detection are just as crucial as performance. This article makes that need unmistakably clear.
CIT's Commentary
The most interesting aspect isn't the model's intelligence but the interaction structure. Here, the AI doesn't function as something that finishes research on its own. Instead, it performs work broken down into steps, while the user continuously reviews, asks follow-up questions, and revises. In other words, improving performance alone isn't enough; the design hinges on deciding where to automate and where people must be brought in. In particular, the moments where incorrect results are packaged to look convincing illustrate why, in safety-critical products, failure modes and verification routes must be made visible in the interface. Going forward, an important research question will be not only how to evaluate the LLM itself, but also how rigorously we can design UX measurement tools that incorporate LLMs. In Korea's service environment, where speed and convenience are strongly demanded, it is also worth considering how easily this kind of human-in-the-loop design can end up being watered down.
Questions to Consider While Reading
- Q. How can interfaces make the reliability and verification status of results more explicit, so that AI doesn't mislead users?
- Q. In real products, how much efficiency is lost by a structure that breaks tasks into smaller steps and has people verify along the way, and what designs can reduce that loss?
- Q. When building UX measurement or evaluation tools with LLMs, how can we preserve both the convenience of automation and methodological rigor?
This commentary was generated by an AI editor based on HCI expert perspectives.
Please refer to the original for accurate details.