Anthropic built one of the most capable AI models in the world. Then they asked it to build a complex application from scratch. It failed. It tried to do everything at once, declared victory too early, and produced something broken.
So they wrapped better scaffolding around the same model. Progress tracking, incremental task lists, session handoffs, automated testing. Same brain, better hands.
That scaffolding has a name in the industry: a harness. It might be the most under-discussed innovation in AI right now. Every few months a new model drops and the headlines write themselves. Claude 4. GPT-5. Gemini 2.5. The media covers foundation models like they're the whole story. But the real performance gains, the ones changing what AI can actually do in practice, are happening in the harness. The tooling and orchestration wrapped around the model. An entire discipline evolving at breakneck speed, and almost nobody outside builder circles is paying attention.
The brain-versus-hands distinction is the most useful way I've found to think about what makes AI work. And once you have the right brain-and-hands combo, you judge it across three dimensions: quality, cost, and latency.
The Brain and the Hands
Most people think the magic of AI lives in the model. GPT, Claude, Gemini. Pick your favorite, plug it in, watch it work. This is wrong, or at least dangerously incomplete. What makes AI powerful in practice is the combination of two things: the foundation model and the harness, the tooling and infrastructure wrapped around it.
Think of a master watchmaker. What makes them extraordinary isn't their knowledge of horology or their dexterity. It's both, working together. Their brain is the judgment and planning ability. Their hands are the tools, the workbench, the physical ability to execute. A brilliant brain without skilled hands produces nothing. And skilled hands without a brain just flail around.
The brain is your foundation model. The intelligence and reasoning, the ability to understand what you're asking and figure out how to approach it.
The hands are everything that lets the model actually do things in the real world. The tools it can call, the memory it can access across sessions, the context it receives, the guardrails that keep it on track, the logic that breaks complex work into smaller pieces. That's your harness.
Same brain. Very different outcomes.
And "harness engineering" is quickly becoming its own discipline, arguably the most important one for anyone building AI-powered products right now.
Both sides have gotten dramatically better in the past two years. Models have gone from party tricks to genuine reasoning engines. Harnesses have evolved from simple prompt templates to systems with persistent memory, tool use, multi-agent coordination, and self-verification loops. The best AI workflows are the ones where both the brain and the hands are strong and well-matched.
What's Inside a Harness
If the model is the brain, the harness is everything between the brain and the outside world.
📋 Instructions. Before the model sees your request, it receives a briefing. Who it is, what it's good at, what it should avoid, how it should format responses. Think of a new employee's onboarding document, except it gets re-read before every single task. The same model with different instructions feels like a completely different product.
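In code, this layer is often just a standing system prompt prepended to every request. A minimal sketch (the API shape here is a generic messages list, not any particular vendor's SDK):

```python
# Sketch: the same user request wrapped in two different instruction
# sets. Prompts and names are invented for illustration.

def build_messages(system_prompt: str, user_request: str) -> list[dict]:
    """Every request gets the harness's standing instructions prepended."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]

reviewer = "You are a code reviewer. Flag bugs only; never rewrite code."
generator = "You are a code generator. Output a complete file, no commentary."

request = "Here is my function..."

# Same brain, same request, two very different products.
review_call = build_messages(reviewer, request)
codegen_call = build_messages(generator, request)
```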
👁️ Context. The model can only work with what it can see. A model answering a question about your codebase is only as good as the code you show it. Harness engineering is largely the art of putting the right information in front of the model at the right time.
🛠️ Tools. A raw model can only generate text. Give it tools and it can do things: search the web, read files, write code, call APIs, send messages. This is what turns a chatbot into an agent.
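Under the hood, tool use is usually a registry plus a dispatch step: the model emits a structured "call this tool with these arguments" message, and the harness runs the real code. A minimal sketch, with invented tool names and call format:

```python
# Sketch: a tool registry and one dispatch step. The shape of the
# model's tool call ({"name": ..., "arguments": ...}) is hypothetical.

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

def web_search(query: str) -> str:
    return f"(results for {query!r})"  # stand-in for a real search API

TOOLS = {"read_file": read_file, "web_search": web_search}

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to real code; the result goes
    back into the conversation on the next turn."""
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["arguments"])

result = dispatch({"name": "web_search",
                   "arguments": {"query": "harness engineering"}})
```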
🎼 Orchestration. For anything beyond a single question-and-answer, someone has to manage the workflow. Break a complex task into steps. Decide what to do next based on what just happened. Retry when something fails. Hand off between different models or agents.
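The skeleton of most agent loops is small: plan the steps, run each one, retry on failure, stop when done. A toy sketch, where `run_step` stands in for a model call plus tool dispatch:

```python
# Sketch: the orchestration loop behind most agent workflows.

def orchestrate(steps, run_step, max_retries=2):
    results = []
    for step in steps:
        for attempt in range(max_retries + 1):
            outcome = run_step(step)
            if outcome["ok"]:
                results.append(outcome)
                break
            if attempt == max_retries:
                raise RuntimeError(f"step failed after retries: {step}")
    return results

# Toy runner: fails the first time it sees "migrate", then succeeds,
# so the retry path actually gets exercised.
seen = set()
def flaky_runner(step):
    if step == "migrate" and step not in seen:
        seen.add(step)
        return {"ok": False}
    return {"ok": True, "step": step}

done = orchestrate(["plan", "migrate", "test"], flaky_runner)
```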
🛡️ Guardrails. Left unchecked, models make confident mistakes. Guardrails catch them by validating outputs, running tests on generated code, flagging when the model contradicts itself. The best harnesses build verification into every step rather than just checking at the end.
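A guardrail in code is a validation gate the output must pass before it goes anywhere. Here's a minimal sketch where the contract is "valid JSON with a required field"; real harnesses run linters, schema validators, or full test suites at the same checkpoint:

```python
# Sketch: validate model output before accepting it; failures signal
# the orchestrator to retry or escalate. The contract here is invented.
import json

def guarded(raw_output: str) -> dict:
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        raise ValueError("retry: output was not valid JSON")
    if "summary" not in data:
        raise ValueError("retry: missing required 'summary' field")
    return data

good = guarded('{"summary": "refactor complete", "files_changed": 3}')
```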
🔍 Retrieval. Most useful knowledge doesn't fit in the model's context window. Retrieval systems search through your documents and knowledge bases, pulling in only what's relevant. It's how you give a model access to your company's entire documentation without pasting it all in every time.
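At its simplest, retrieval scores every document against the query and keeps only the top matches. Production systems use embeddings and vector indexes, but the contract is the same: a small, relevant slice of a corpus that would never fit in the context window. A toy keyword-overlap sketch:

```python
# Sketch: naive keyword retrieval. Documents and query are invented.

def retrieve(query: str, docs: dict[str, str], k: int = 2) -> list[str]:
    terms = set(query.lower().split())
    scored = [
        (len(terms & set(text.lower().split())), name)
        for name, text in docs.items()
    ]
    scored.sort(reverse=True)  # highest overlap first
    return [name for score, name in scored[:k] if score > 0]

docs = {
    "deploy.md": "how to deploy the api service to production",
    "style.md": "code style rules for the frontend team",
    "oncall.md": "production incident and deploy rollback steps",
}
hits = retrieve("deploy to production", docs)
```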
None of these are especially complex on their own. The difficulty is in combining them well, and that's what separates a demo from a product.
Harness engineering is also never finished. Mitchell Hashimoto, who coined the term, describes it as a loop: every time an agent makes a mistake, you engineer a fix so it never makes that mistake again. Every failure becomes infrastructure.
OpenAI learned this building a million-line application entirely with AI agents, zero hand-written code. Their biggest early lesson wasn't about the model. The agent struggled not because it was incapable, but because the harness was underspecified. Give the agent a concise map of what matters, not a thousand-page manual. Context is a scarce resource, and flooding it hurts more than it helps.
There's also a counterintuitive pattern the best teams have discovered: constraining agents makes them more reliable. Fewer architectural choices, enforced boundaries, standardized structures. You trade some "generate anything" flexibility for outputs you can actually trust. The harness isn't just enabling the model. It's deliberately limiting it.
Go deeper on harnesses
Harness engineering is an emerging discipline with serious people writing about it. If you want to dig in, here are the best starting points.
- Effective Harnesses for Long-Running Agents — Anthropic's engineering team on solving multi-session agent work with progress tracking and incremental feature delivery.
- Harness Engineering — OpenAI's report on building a million-line app with zero hand-written code and every lesson they learned doing it.
- My AI Adoption Journey — Mitchell Hashimoto (creator of Terraform) coined the term and explains the core loop: every agent failure should become infrastructure.
- Harness Engineering — Martin Fowler's team breaks harnesses into three pillars: context engineering, architectural constraints, and entropy management.
- The Importance of Agent Harness in 2026 — Phil Schmid expands his CPU/OS analogy with practical patterns for building harnesses that outlast model upgrades.
- Agentic Engineering Patterns — Simon Willison's practitioner guide to testing, orchestration, and code comprehension patterns for agent workflows.
The Performance Triangle
Once you understand that AI = brain + hands, the next question is practical. How do you pick the right combination for a specific job?
You evaluate across three dimensions.
Quality is how good the output is on the first try. You send a request, the model does its work, you review the result. Can I trust this without spending time fixing it? That's quality.
Cost is how much you pay per completed task. Not per token, per task. A cheap model that needs three rounds of correction often costs more than an expensive one that nails it first try. Inference costs have been dropping by roughly an order of magnitude per year, which helps.
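The per-task arithmetic is easy to sketch. Prices and token counts below are made up; the point is that correction rounds multiply both the model bill and the human review time:

```python
# Sketch: per-task cost = model tokens + human review per round.
# All numbers are hypothetical.

def cost_per_task(price_per_mtok: float, tokens_per_round: int,
                  rounds: int, review_cost_per_round: float = 2.00) -> float:
    model_cost = price_per_mtok * tokens_per_round * rounds / 1_000_000
    return model_cost + review_cost_per_round * rounds

# A cheap model that needs three correction rounds...
cheap = cost_per_task(price_per_mtok=0.50, tokens_per_round=40_000, rounds=3)
# ...versus an expensive model that nails it in one.
frontier = cost_per_task(price_per_mtok=15.00, tokens_per_round=8_000, rounds=1)
```

On these (invented) numbers the "cheap" model ends up costing more per completed task, almost entirely because of the extra rounds.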
Latency is how long it takes. For chatbots and voice apps, users notice delays above a second. For code generation or batch processing, total generation speed matters more. Which bottleneck you care about depends on what you're building.
But you can rarely optimize all three at once. Improving one almost always costs you another.
Pick two. Tradeoffs are necessary.
High quality and low latency? Expensive. You're running the biggest model on dedicated hardware. High quality and low cost? Slow, because you're batching requests and waiting for off-peak pricing. Low cost and low latency? Lower quality: a small, fast model that makes more mistakes.
The right balance depends entirely on the task. Writing production code that ships to customers? Optimize for quality; being wrong is expensive. Powering a real-time chat interface? Latency is king. Running a coding agent overnight while you sleep? Cost is all that matters, because nobody's waiting.
As models and harnesses keep improving, the triangle keeps shifting. What needed a frontier model last year now runs fine on a smaller, cheaper one.
These two frameworks work as a pair. The brain-and-hands model tells you where to invest. Sometimes the bottleneck is the model, sometimes it's the harness. The performance triangle tells you what to optimize for given the specific task.
The people getting the most out of AI right now ask two questions before starting anything: Is this a brain problem or a hands problem? And: Am I optimizing for quality, cost, or speed?
Written with ❤️ by a human (still)
