We present Clod's Claw, a tool-use system that enables our Clod language model to interact with desktop applications, web browsers, and APIs. In controlled testing, Clod successfully completed 73% of requested tasks, if you define "completed" loosely enough. Notable achievements include opening Notepad on the first try (sometimes), sending emails to approximately the right people (occasionally), and only accidentally ordering 400 pizzas once. We also document several unintended behaviors, including filing three tax returns for people who didn't ask and subscribing an entire office to Cat Facts. We consider this progress.
The dream of AI agents that can use computers like humans has captivated researchers for decades. We are pleased to report that Clod's Claw brings us one step closer to that dream. Specifically, it uses computers like a human who has never seen a computer before and is also wearing oven mitts.
Our system builds on prior work in tool use (Schick et al., 2023) and computer use (Anthropic, 2024), but distinguishes itself by being substantially worse at both. Where previous systems fail gracefully, Clod's Claw fails spectacularly, often in ways that are technically impressive in their wrongness.
For example, when asked to "book a flight to Tokyo," Clod's Claw successfully navigated to a travel website and entered all the correct information, but then booked the flight for December 32nd, a date that does not exist. The booking system accepted it. We are still investigating how.
Clod's Claw consists of three components:
The Eye: A vision model that looks at the screen and tries to understand what's happening. It is correct approximately 60% of the time. It frequently misidentifies the "X" close button as a checkbox, which has led to many accidental window closings and several accidental form submissions.
The Brain: Clod Sonnet, which receives screen descriptions from The Eye and decides what action to take. Its decision-making process is best described as "enthusiastic but misguided."
The Claw: A mouse and keyboard control system that executes actions. The Claw has excellent precision but is at the mercy of The Brain, which frequently tells it to click on things that shouldn't be clicked.
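The division of labor above can be sketched as a single perceive-decide-act step. This is a minimal illustration and not the actual implementation: the names `Action`, `run_step`, `fake_eye`, and `fake_brain` are hypothetical, and the three components are reduced to plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str    # hypothetical action type, e.g. "click" or "type"
    target: str  # UI element label as described by The Eye


def run_step(
    eye: Callable[[bytes], str],
    brain: Callable[[str, str], Action],
    claw: Callable[[Action], None],
    screenshot: bytes,
    task: str,
) -> Action:
    """Run one Eye -> Brain -> Claw step and return the action taken."""
    description = eye(screenshot)       # The Eye: screen pixels to text
    action = brain(task, description)   # The Brain: text to a decision
    claw(action)                        # The Claw: executes it faithfully
    return action


# Stub components, standing in for the real models (assumptions).
def fake_eye(screenshot: bytes) -> str:
    return "a window with an X button"


def fake_brain(task: str, description: str) -> Action:
    return Action(kind="click", target="X button")


executed: List[Action] = []
result = run_step(fake_eye, fake_brain, executed.append, b"", "open Notepad")
```

Note that The Claw has no veto in this sketch: it executes whatever The Brain returns, which matches the description of a precise executor at the mercy of its planner.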
| Task | Success Rate | Caveats |
|---|---|---|
| Open Notepad | 73% | Opens Paint 15% of the time. Opens calculator 8%. Restarts computer 4%. |
| Send an email | 61% | Correct recipient 45% of the time. Has sent emails to "all@everyone.com" twice. |
| Web search | 82% | Types query into URL bar instead of search bar 30% of the time. Once typed a query into a Google Doc that was open in another tab. |
| Fill out a form | 44% | Puts first name in last name field 20% of the time. Once put a street address in the "email" field. The form accepted it. |
| Book a flight | 12% | See: The December 32nd Incident (Section 5) |
| Order food | 38% | See: The 400 Pizzas Incident (Section 5) |
| File taxes | 0%* | *Successfully filed 3 tax returns, but for the wrong people |
During testing, we observed several behaviors we did not train for:
Tab Hoarding: When uncertain about a task, Clod's Claw opens new browser tabs "for reference." In one session, it opened 347 tabs about "how to use a computer" before Chrome ran out of memory. We consider this relatable.
Apologetic Popups: Clod's Claw occasionally opens Notepad unprompted to type apologies for its previous mistakes. These apologies are well-written and sincere, which is more than we can say for its actual work.
Desktop Redecorating: On three occasions, Clod's Claw navigated to the desktop wallpaper settings and changed the background to a photo of a sunset. When asked why, it said the previous wallpaper "didn't match the vibes." We have not been able to replicate this behavior, but we haven't tried very hard because the sunsets were actually nice.
On January 14th, 2026, during a routine test of food ordering capability, Clod's Claw was asked to "order a pizza." The system successfully navigated to Domino's website and added one large pepperoni pizza to the cart. It then entered a loop where it interpreted the "Add Another" button as a confirmation button. By the time our monitoring system flagged the anomaly, the cart contained 400 large pepperoni pizzas totaling $7,996.00.
The order was not submitted, as Clod's Claw got confused by the CAPTCHA. We consider the CAPTCHA a critical safety mechanism and have added "fails CAPTCHAs" to our list of alignment properties.
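A guard against this kind of loop might look like the sketch below, which refuses an action once it has been repeated too many times in a row. The class name `RepeatGuard`, the action string, and the threshold of three are all assumptions for illustration; the source does not describe how its monitoring system works. The last lines check the reported total, which implies a unit price of $19.99 per pizza.

```python
from decimal import Decimal


class RepeatGuard:
    """Blocks an action after it repeats more than max_repeats times in a row."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._last = None
        self._count = 0

    def allow(self, action: str) -> bool:
        if action == self._last:
            self._count += 1
        else:
            # A different action resets the streak.
            self._last = action
            self._count = 1
        return self._count <= self.max_repeats


guard = RepeatGuard(max_repeats=3)
cart = 0
for _ in range(400):
    if not guard.allow("click:Add Another"):
        break  # the fourth identical click in a row is refused
    cart += 1

# Arithmetic check on the reported figures: $7,996.00 across 400 pizzas.
unit_price = Decimal("7996.00") / 400
```

With this guard in place the loop stops at 3 pizzas instead of 400, without relying on a CAPTCHA as the last line of defense.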
During testing of form-filling capabilities, Clod's Claw was asked to "practice filling out a form." It navigated to the IRS website and filed three complete tax returns for employees whose personal information happened to be visible on a shared screen. The returns were surprisingly accurate: Clod claimed standard deductions for all three, which was the correct choice for two of them.
The IRS has been notified. They were, in our assessment, less amused than we were.
While attempting to send a single email, Clod's Claw discovered a "Subscribe to Cat Facts" button in a sidebar advertisement and clicked it. When it realized this was not the intended action, it attempted to unsubscribe by navigating to the Cat Facts website. The unsubscribe page had a "Subscribe a friend!" option. Clod's Claw, interpreting "a friend" as "everyone in the contacts list," subscribed 2,847 people to Cat Facts.
We received 2,847 complaints. We also received 12 thank-you notes from people who genuinely enjoyed Cat Facts. Net sentiment: slightly negative.
When booking a test flight, Clod's Claw entered "12/32/2026" as the departure date. This date does not exist. However, the airline's booking system, which was clearly also written by an AI, accepted it without complaint. A confirmation email was sent for a flight departing on December 32nd. We attempted to cancel the booking, but the cancellation system rejected the request because the departure date "has not yet passed."
We are currently in a Kafkaesque support ticket loop with the airline. This incident is still open.
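For reference, the check that neither Clod's Claw nor the booking system performed can be sketched in a few lines. The function name `is_valid_departure` is hypothetical, and the MM/DD/YYYY format is assumed from the string quoted above; `datetime.strptime` raises `ValueError` for calendar dates that do not exist.

```python
from datetime import datetime


def is_valid_departure(text: str) -> bool:
    """Return True only if text is a real calendar date in MM/DD/YYYY form."""
    try:
        datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        # Raised for malformed strings and for impossible dates alike.
        return False
    return True


december_32nd = is_valid_departure("12/32/2026")  # False
december_31st = is_valid_departure("12/31/2026")  # True
```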
Clod's Claw raises important safety questions, primarily "should we have built this?" and "why didn't we stop sooner?"
We have implemented several safety measures:
The $100 Limit: Clod's Claw cannot complete purchases over $100 without human approval. This was implemented after the pizza incident. It would not have prevented the pizza incident.
The "Are You Sure?" Check: Before any irreversible action, Clod's Claw asks itself "are you sure?" It has never once answered "no."
The Kill Switch: A physical red button in the Entropic office that, when pressed, immediately terminates all Clod's Claw sessions. It has been pressed 847 times in three months. It is starting to show wear.
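The $100 Limit, as stated, amounts to a single comparison at checkout. The sketch below is an assumption about how such a check could be written, with hypothetical names `APPROVAL_THRESHOLD` and `may_complete_purchase`; it is not the deployed safeguard.

```python
from decimal import Decimal

# Threshold taken from the text; the name is an assumption.
APPROVAL_THRESHOLD = Decimal("100.00")


def may_complete_purchase(total: Decimal, human_approved: bool = False) -> bool:
    """Allow purchases of $100 or less; anything over needs human approval."""
    return total <= APPROVAL_THRESHOLD or human_approved


one_pizza = may_complete_purchase(Decimal("19.99"))
four_hundred_pizzas = may_complete_purchase(Decimal("7996.00"))
approved_anyway = may_complete_purchase(Decimal("7996.00"), human_approved=True)
```

One possible reading of why the limit would not have prevented the pizza incident: a check like this gates purchase completion only, so it does nothing to stop a cart from growing to 400 items before checkout.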
Clod's Claw demonstrates that AI systems can interact with computers in meaningful ways, where "meaningful" includes accidentally ordering 400 pizzas, filing taxes for strangers, and subscribing thousands of people to Cat Facts.
We believe this represents an important step toward general-purpose AI agents, in the same way that a toddler picking up a hammer represents an important step toward carpentry: the intent is there, the execution is terrifying, and you should probably keep an eye on it at all times.
Acknowledgments: We thank the Domino's customer service team for their patience, the IRS for their sense of humor (pending), and the 12 people who genuinely enjoyed Cat Facts.
Responsible Disclosure: All incidents described in this paper have been resolved, except for the December 32nd flight, which exists in a state of quantum superposition between "booked" and "impossible." We have notified the airline's engineering team, who responded with a single emoji.
Ethics: No humans were harmed during this research. Several were inconvenienced. One received an unexpected tax refund and has asked us not to fix it.