The real way to use “Claude” for long-running scientific computing
HCI Today summarized the key points
- This article explains how to tackle scientific computing problems by running an AI agent such as Claude continuously for days at a time.
- The author argues that if you first define the goals and rules in a document, the AI can work autonomously across multiple days and complete large tasks far faster than usual.
- As a worked example, the author builds a Boltzmann solver in JAX for computing the cosmic microwave background, targeting accuracy comparable to the existing CLASS code.
- The setup relies on a progress-log file, test criteria that act as a ground-truth oracle, frequent Git commits, and a repeated-execution loop that re-checks whether the work is actually finished.
- The author concludes that this approach can significantly speed up research coding without micromanagement, though it is not yet a fit for every situation.
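The workflow the bullets describe can be sketched as a simple loop: do a unit of work, check it against a ground-truth oracle, append to a progress log, and commit so any step can be rolled back. The sketch below is illustrative only, under assumed names (`oracle`, `agent_step`, `REFERENCE` are not from the article), with a toy refinement step standing in for a real agent session:

```python
# Minimal sketch of the repeated-execution pattern: work step -> oracle
# check -> progress log -> (commit) -> repeat until the oracle passes.
# All names here are illustrative, not the author's actual code.
import math

REFERENCE = [1.0, 2.0, 3.0]  # stands in for a CLASS benchmark spectrum

def oracle(result, reference, rtol=1e-3):
    """Ground-truth check: result must agree with the reference
    to within a relative tolerance."""
    return all(math.isclose(r, ref, rel_tol=rtol)
               for r, ref in zip(result, reference))

def agent_step(state):
    """Stand-in for one agent work session; here it just moves a
    crude approximation halfway toward the reference values."""
    return [x + 0.5 * (ref - x) for x, ref in zip(state, REFERENCE)]

def run(max_iters=50):
    state = [0.0, 0.0, 0.0]
    progress_log = []  # stands in for the progress-log file
    for i in range(max_iters):
        state = agent_step(state)
        passed = oracle(state, REFERENCE)
        progress_log.append(f"iter {i}: oracle={'PASS' if passed else 'FAIL'}")
        # in the real setup: `git commit` here so any step can be rolled back
        if passed:
            break
    return state, progress_log
```

The key design point the article emphasizes is that the oracle, the log, and the commits are what let a human audit and interrupt a multi-day run without watching it continuously.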
This summary was generated by an AI editor based on HCI expert perspectives.
Why Read This from an HCI Perspective
This article shows how to treat AI not as a ‘chatting tool,’ but as a worker that can carry out long, multi-day tasks alongside you. For HCI/UX practitioners and researchers, it highlights something more important than raw model performance: where users draw the line on what the system should handle versus what they should intervene in, how progress is surfaced, and how failures are detected. In particular, the idea that test criteria, progress records, and rollback paths are part of interaction design is especially useful.
CIT's Commentary
The core of this piece isn’t about how smart an agent is—it’s about designing when a person chooses to trust it and when they decide to stop it. The more you automate long-running work, the more burden shifts to users to validate outcomes unless you provide clear status indicators, failure modes, and intervention paths. In safety-critical systems, this problem becomes even more pronounced. Even if it looks like the system is acting autonomously, the structure still requires the user to keep watching in the background—which isn’t good automation. That’s why test oracles, progress logs, and a Git-based rollback structure should be read not as simple development tips, but as interaction mechanisms that help humans build trust. The article also sparks the idea of using LLMs to improve UX measurement tools or even the inspection routines themselves. Keeping methodological rigor while using AI to support the research process is likely to become increasingly important.
Questions to Consider While Reading
- Q. What is the minimum status information that lets users judge whether a long-running AI agent is 'on track'?
- Q. In real product environments without a test oracle, how should you design interaction structures so users can detect failures and intervene?
- Q. When automating UX measurement or usability checks with LLMs, how do you balance convenience with research rigor?
This commentary was generated by an AI editor based on HCI expert perspectives.
Please refer to the original for accurate details.