What's an Agent Harness? Understanding Claude Code and Codex CLI in 8 Minutes Using a Car Analogy

A harness is the outer-shell software that wraps an LLM and lets it actually act in the world. With the same Claude or GPT engine, a different harness can swing SWE-bench scores by 36 points. Claude Code, Codex CLI, and Devin are different chassis. This is a general-reader walkthrough using a car analogy.

中澤圭志

@keishi_nakazawa

Sales Claw maintainer

May 18, 2026·12 min

This English article is a concise version of the original. For the full Japanese deep-dive, see the Japanese original.

Key Facts

In one line: External-shell software that turns an LLM (engine) into a working AI agent (a car body)
Performance impact: Same model, different harness: 42% → 78% on SWE-bench (documented case)
Examples: Claude Code (Terminus-2) / Codex CLI / Devin / Aider
Where to start: Pick one pre-built harness (Claude Code or Codex CLI) and play for 30 minutes

DATA: First, the vocabulary (30-second glossary)

LLM (Large Language Model): the language model itself, the core of ChatGPT or Claude. Think of it as "the engine." Pour fuel in and it spins, but it cannot drive on its own.
Harness: the outer-shell software that makes an LLM usable in the real world. Think of it as "the chassis, the steering wheel, the tires, and the brakes", all together. It handles tool calls, memory of previous work, the loop that keeps going until the goal is met, and the safety guards that prevent dangerous operations.
Scaffolding: a part of the harness — specifically, the configuration assembled before the agent starts running. System prompts, the tool registry, sub-agent definitions. Like construction scaffolding: built first, then used.
Agent loop: the LLM's "think → use a tool → check the result → decide next" cycle, repeated autonomously. Think of it as "the driving action of steering until you reach the destination." This is the core of any harness.
Benchmark (SWE-bench / Terminal-Bench): tests that measure AI coding agents. SWE-bench uses real GitHub bug fixes; Terminal-Bench uses real terminal command tasks. These benchmarks are how the field figured out that swapping harnesses on the same model can move scores by tens of points.

"Why are AI people suddenly talking about harnesses?" "Isn't the difference between Claude Code and Codex just the model?" "What does it actually mean for a 'harness change' to dramatically change performance?" This article unpacks the term harness, which became an AI-industry buzzword in 2026, in language general readers can follow. We cite Anthropic Claude Code official documentation, the Anthropic System Card, the Terminal-Bench official article, and GitHub repositories as primary sources to explain why the harness matters now and how regular people can start touching one today.

Primary sources for this article: Anthropic Claude Code Docs / Anthropic System Card (Claude Opus 4.6 PDF) / Terminal-Bench official leaderboard / GitHub anthropics/claude-code / VentureBeat interview with Anthropic / OpenAI Codex CLI official changelog. We also reference a few third-party explainer articles for context, but every load-bearing number and spec ties back to official sources.

1. What a harness is — "the chassis the LLM rides on"

[Official] In the Anthropic System Card for Claude Opus 4.6 (February 2026), Anthropic explicitly names "Terminus-2" as the harness used when scoring the model on terminal benchmarks (System Card §4.2). The same Claude Opus 4.6 produces different numbers depending on the harness wrapping it, so Anthropic publishes "model-only score" and "harness-included score" separately. This is the strongest public signal that the harness matters as much as the model.

[Author's view] The car analogy is the most intuitive way in. Top-tier LLMs like Claude Opus 4.7 and GPT-5.5 are world-class engines, but an engine alone cannot drive on public roads. Steering, tires, brakes, navigation, and seatbelts have to be combined before you have a "car." In the AI world, that "chassis" is the harness, and Claude Code, Codex CLI, Devin, and Aider are different chassis (built by Anthropic, OpenAI, Cognition, and the OSS community respectively) sold to wrap the latest engines.

Anthropic's official engineering blog "Building Effective Agents" (December 2024) defines an agent as "a system where the LLM dynamically directs its own processes and tool usage, maintaining control over how it accomplishes the task." The harness is the outer-shell software that implements that "dynamically direct" and "maintain control" layer.

Figure: LLM (engine) + harness (chassis) = AI agent (running car), the full picture. The harness adds tool calling, memory, the loop, and safety guards; without it, file edits, command execution, and multi-step work cannot be done. Slogan: "Agent = Model + Harness".

2. Why "harness" suddenly became a 2026 buzzword

[Official] In April 2026, Anthropic told VentureBeat that the period during which users reported Claude quality degradation overlapped with internal changes to the harness and operating instructions (VentureBeat April 2026 article, Anthropic comment). This is, without exaggeration, the first time a major AI lab has publicly acknowledged that "changes to the harness alone can shift user-visible quality without touching the model."

[Official] In the same window, SWE-bench Pro leaderboard analyses surfaced the following numbers (multiple official engineering blogs, April–May 2026):

Same model, harness swap only: 42% → 78% (a ~36-point lift) on SWE-bench
SWE-bench Pro: scaffolding differences accounted for 22+ points
By contrast, the gap between top-of-frontier models is roughly 1 point

[Author's view] What these numbers actually say is that we have entered a phase where "harness choice" affects outcomes more than "model choice". Through 2024 the debate was "is GPT-4 or Claude 3 smarter?" In 2026, the lived truth is that the same Claude Opus 4.7 inside Claude Code behaves like a different product than the same model inside a homemade script.

3. What's inside — the three pillars of a harness

If the harness is a chassis, the three pillars are the engine mount, the steering wheel, and the fuel tank. We'll go through each in order.

(1) The autonomy loop — the Gather-Act-Verify cycle

[Official] Claude Code's official overview describes the agent's behavior as a repeating "Gather context → Act → Verify results" cycle. Concretely:

Gather: search files, read code, capture command output
Act: edit files, run commands, call external APIs
Verify: run tests, check output, look for errors
If the result doesn't meet the goal, return to Gather and try again

That, in essence, is the autonomy loop. An LLM by itself does "answer once and stop." The harness runs the three-step loop until the goal condition is satisfied, which is what makes the agent look like it "is thinking and acting on its own."
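To make the cycle concrete, here is a minimal sketch of a Gather-Act-Verify loop. It is not how Claude Code is implemented: the function names, the `LoopResult` type, and the toy callbacks are all assumptions for illustration, and a real harness would call a model and real tools where the stand-ins are.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    succeeded: bool
    turns: int
    history: list[str] = field(default_factory=list)


def run_agent_loop(
    gather: Callable[[list[str]], str],
    act: Callable[[str], str],
    verify: Callable[[str], bool],
    max_turns: int = 5,
) -> LoopResult:
    """Repeat Gather -> Act -> Verify until verify() passes or turns run out."""
    history: list[str] = []
    for turn in range(1, max_turns + 1):
        context = gather(history)  # Gather: collect what is known so far
        outcome = act(context)     # Act: do one step of work
        history.append(outcome)
        if verify(outcome):        # Verify: did this step reach the goal?
            return LoopResult(True, turn, history)
    return LoopResult(False, max_turns, history)


# Toy stand-ins for the model and tools: count upward until the value is 3.
def toy_gather(history: list[str]) -> str:
    return str(len(history))


def toy_act(context: str) -> str:
    return str(int(context) + 1)


def toy_verify(outcome: str) -> bool:
    return outcome == "3"


result = run_agent_loop(toy_gather, toy_act, toy_verify)
print(result.succeeded, result.turns)  # True 3
```

The `max_turns` cap is the important design choice: even in a toy loop, the harness rather than the model decides when to give up.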

(2) Tool calling — exposing functions to the AI

The harness gives the agent "tools" it can use. For Claude Code, the standard toolset includes:

Read / Write / Edit (file operations)
Bash (command execution)
Glob / Grep (search)
WebFetch / WebSearch (internet access)
Custom tools via MCP servers (Slack posting, GitHub PR creation, etc.)

These are exposed to the LLM as "JSON Schema function definitions." The LLM responds with a JSON payload saying "I want to call this function with these arguments," the harness parses that JSON, actually invokes the function, and returns the result back to the LLM. That feedback loop is what "tool calling (function calling)" really is. See the MCP (Model Context Protocol) complete guide for the protocol underneath.
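The parse-invoke-return step can be sketched in a few lines. Everything here is hypothetical: the `word_count` tool, the `TOOLS` registry layout, and the request format are illustrative and do not match any vendor's actual wire format.

```python
import json


# Hypothetical tool: the Python function that actually does the work.
def word_count(text: str) -> int:
    return len(text.split())


# Registry entry: a description and JSON Schema the model would see, plus the handler.
TOOLS = {
    "word_count": {
        "description": "Count whitespace-separated words in a string.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
        "handler": word_count,
    }
}


def dispatch(tool_call_json: str) -> str:
    """Parse the model's tool request, run the tool, return a JSON result string."""
    call = json.loads(tool_call_json)
    tool = TOOLS.get(call.get("name"))
    if tool is None:
        return json.dumps({"error": f"unknown tool: {call.get('name')}"})
    args = call.get("arguments", {})
    missing = [k for k in tool["parameters"]["required"] if k not in args]
    if missing:
        return json.dumps({"error": f"missing arguments: {missing}"})
    return json.dumps({"result": tool["handler"](**args)})


# What a model's request might look like (illustrative format only).
request = '{"name": "word_count", "arguments": {"text": "agent equals model plus harness"}}'
print(dispatch(request))  # {"result": 5}
```

Errors are returned as data instead of raised, so the model can read them on the next turn and correct its own call.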

(3) Memory / context management — "not forgetting long tasks"

Every LLM has a context window limit. As of May 2026, Claude Opus 4.7 is 1M tokens, GPT-5.5 is 272K tokens (or about 1M in long mode). In real coding work, repositories and logs routinely exceed even a 1M-token window.

The harness keeps you under the cap by silently doing things like (a) summarizing and compressing older turns, (b) paging important facts to side files and reading them back when needed, and (c) delegating sub-tasks to sub-agents and only ingesting their final result into the parent context. This is what "memory management" and "context management" refer to.
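Technique (a), summarizing older turns, might look roughly like the sketch below. The word-based token estimate and the placeholder `summarize` function are assumptions; a real harness would use the model's tokenizer and ask the model to write the summary.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: about one token per whitespace-separated word.
    return len(text.split())


def summarize(turns: list[str]) -> str:
    # Placeholder: a real harness would ask the model to write this summary.
    return f"[summary of {len(turns)} earlier turns]"


def compact(turns: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """Fold the oldest turns into one summary line when the history exceeds budget."""
    keep_recent = max(1, keep_recent)  # always keep at least the latest turn verbatim
    total = sum(estimate_tokens(t) for t in turns)
    if total <= budget or len(turns) <= keep_recent:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(older)] + recent


history = [
    "user: please fix the failing login test",
    "tool: pytest output with forty lines of traceback text",
    "assistant: the bug is in the session helper",
    "tool: applied patch to session helper",
    "assistant: rerunning the tests now",
]
for line in compact(history, budget=20):
    print(line)
```

With a budget of 20 estimated tokens, the first three turns collapse into a single summary line and the two most recent turns stay intact.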

Figure: The three pillars of the harness: the Gather-Act-Verify autonomy loop, tools (Read / Write / Bash / Grep / MCP), and memory (context compression and sub-agent delegation), with safety guards (approval, skip) above and the audit log (who, when, what) below.

4. Inside Claude Code's harness

[Official] Claude Code is the CLI agent shipped as the npm package @anthropic-ai/claude-code (GitHub: anthropics/claude-code). The official Claude Code Docs describe it as an "agentic coding tool that lives in your terminal, understands your codebase, and executes routine tasks via natural language commands."

What's distinctive about the Claude Code harness is "always plan first" and "approval-heavy."

Plan mode: doesn't jump straight into editing — it writes out "what it plans to do" in natural language and waits for the user's OK before executing
permission-mode: default / acceptEdits / bypassPermissions / plan (four tiers of operational autonomy you can switch between)
Sub-agent: delegate a large task to a child agent so the parent only ingests its final result, saving context
/goal command (2.1.140+): high-autonomy mode that keeps looping until a stated condition is satisfied

On the Terminal-Bench leaderboard, Claude Opus 4.6 + Terminus-2 scores 65.4% (max effort) (tbench.ai official). Claude Code doesn't use Terminus-2 byte-for-byte, but it's one of the few cases where an Anthropic-built harness baseline is public knowledge.

# Install Claude Code and start (Mac/Linux/Windows)
npm install -g @anthropic-ai/claude-code
claude

# Use Plan mode to just see "what it would do" first
claude --permission-mode plan "add a new blog entry to lib/blog.ts"

# Run the autonomy loop until the goal is met (Claude Code 2.1.140+)
claude /goal "keep editing until all tests pass"

For what changed in each Claude Code release, see the Claude Code 2.1.143 release notes.

5. Inside Codex CLI's harness

[Official] Codex CLI is OpenAI's Node.js CLI agent, backed by GPT-5.5 / GPT-5.3-Codex / GPT-5.4 (OpenAI Codex Changelog). Compared with Claude Code at the harness-personality level:

Item	Claude Code (Anthropic harness)	Codex CLI (OpenAI harness)
Model	Claude Opus 4.7 / Sonnet 4.6 / Haiku 4.5	GPT-5.5 / 5.3-Codex / 5.4
Context window	1M tokens (Opus/Sonnet/Haiku)	272K (~1M in long mode)
Planning	Plan mode writes a plan first	Model-discretion, execution-leaning
Remote control	Dedicated Remote Control UI (2026-02)	JSON-RPC 2.0 + mobile (2026-05)
Image generation	External tools only	Built-in image_generation (gpt-image-2)
Approval model	permission-mode (4 tiers)	per-site / per-command (Chrome ext.)
Best for	Long-context reading, sub-agent splits	Parallel headless batches, image generation

[Author's view] It's not a question of which is better; the design philosophies differ. Claude Code shines at "reading a lot while in dialog with a human." Codex CLI shines at "quietly running parallel batches overnight." In practice, the two coexist well. For a deeper breakdown, see the Codex CLI vs Claude Code benchmark comparison.

DATA: In-house verification notes (Sales Claw maintainer's field observations)

Put plainly, the same Claude Opus 4.7 feels completely different inside Claude Code than inside Aider: the harness's personality directly shapes the quality of what comes out.

Test conditions: Windows 11 / Claude Code 2.1.143 / Claude Opus 4.7 / same Sales Claw repository
Period: 2026-04-20 through 2026-05-17 (~4 weeks)
Sample size: 38 coding tasks + 18 blog-authoring tasks = 56 internal-bench runs
Observation 1: Plan mode in Claude Code reduced "accidentally edited without my OK" incidents to zero
Observation 2: raising permission-mode to acceptEdits doubled perceived speed, but caused 3 accidental edits
Observation 3: blog drafting that split work via sub-agents went ~1.4× faster than a single parent agent
Observation 4: Codex CLI's built-in image_generation rendered all 7 blog images in about 25 minutes
Observation 5: tested the same task on Aider (OSS); loop control felt thinner and completion rate looked lower (estimated)
Reproducibility caveat: 56 is a small sample. Different task mixes, repo sizes, or model unit prices may not reproduce these numbers

Note: These are one developer's field observations. Generalization requires more data.

Sales Claw applies Claude Code / Codex CLI harness thinking to BtoB sales-form automation.

Free, MIT licensed. You can also try the live demo without installing.


6. Choosing a harness — build your own or use one? (how to start)

[Author's view] For most people wanting to put an AI agent to work, "build one from scratch" is unrealistic. The three-step progression below is what we recommend.

Step 1: pick one pre-built harness and try it (free to low cost)

Claude Code: with a Claude Pro subscription (USD 20/month), one npm install -g @anthropic-ai/claude-code gets you running. The 1M-token context window and the clarity of Plan mode make it a great first AI agent.
Codex CLI: requires a ChatGPT Plus / Pro / Business plan. Install via npm install -g @openai/codex. GPT-5.5 responses are fast, and web-search tool integration is standard.
Gemini CLI: Google Cloud account gets you a free tier. Long context and search are strong out of the box.
Aider (OSS): free as long as you have an API key. For developers who want to read the harness internals.

Step 2: extend the harness via MCP

Once you're comfortable, layer on custom tools via MCP (Model Context Protocol). Tools that let the agent post to your internal Slack, read your Notion, or query your own database turn the harness into a business-specific instrument.

Step 3: adopt a domain-specific harness (e.g. Sales Claw)

In domains like outbound sales automation, where business-specific safety guards and audit logs are mandatory, general harnesses (Claude Code / Codex CLI) leave gaps. Sales Claw is a sales-specific harness with built-in policy controls, send-time auto-inspection, sales-NG detection, captcha-stop, rate limiting, audit logs, and AND-ed auto-stop conditions. To go further, the Sales Claw Quickstart Guide is the easiest entry point.

Figure: The three-step approach for regular users: pre-built → MCP extension → domain-specific. Step 1: try Claude Code / Codex CLI (free to low cost). Step 2: extend with MCP-served internal tools. Step 3: adopt a domain-specific harness like Sales Claw.

Figure: SWE-bench Pro impact of harness choice. The same model lifts from 42% to 78% just by changing the harness, while the model-to-model gap at the frontier is roughly 1 point.

Figure: Context-window comparison across major LLMs (GPT-5.4, GPT-5.5 standard, GPT-5.5 long mode, Gemini 2.5 Pro, and Claude Opus / Sonnet / Haiku, all at 1M). Sources: Anthropic, OpenAI, and Google official pricing and docs, retrieved May 2026. The harness has to know how to use that capacity.

Figure: Monthly cost estimate in JPY for major harnesses (individual / SMB usage): Aider (OSS) + Claude API, Gemini CLI (free tier), Claude Code (Claude Pro), Codex CLI (ChatGPT Plus), and Sales Claw (OSS) + Claude. Assumes 1 USD = 150 JPY and a single developer using about 100 hours/month. Roughly ¥0 to ¥4,500/month gets you started.

7. Risks and safety design

(1) Runaway autonomy loops

[Unverified] As of May 2026, no major AI agent has publicly reported a serious incident caused by a runaway loop. That said, social media and developer blogs document multiple personal cases: unexpected large API spend, git push --force wiping out history, and similar. The harness executes whatever the model decides next — which means it often won't stop in situations where a human's instinct would.
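One common defense is a hard budget checked before every turn. The sketch below is an assumption about how such a guard could be written, not the mechanism used by any named product: the `Budget` class and its field names are hypothetical. It continues only while every limit still holds, so breaching any single limit stops the loop.

```python
import time
from dataclasses import dataclass


@dataclass
class Budget:
    max_turns: int
    max_items: int
    max_seconds: float

    def allows(self, turns: int, items: int, started_at: float) -> bool:
        # Continue only while ALL limits hold; breaching any one stops the loop.
        return (
            turns < self.max_turns
            and items < self.max_items
            and (time.monotonic() - started_at) < self.max_seconds
        )


def run_with_budget(work_items: list[str], budget: Budget) -> list[str]:
    """Process one item per turn until the list is done or the budget is spent."""
    started_at = time.monotonic()
    done: list[str] = []
    turns = 0
    for item in work_items:
        if not budget.allows(turns, len(done), started_at):
            break
        done.append(item.upper())  # stand-in for one model/tool step
        turns += 1
    return done


processed = run_with_budget(
    ["a", "b", "c", "d", "e"],
    Budget(max_turns=3, max_items=10, max_seconds=5.0),
)
print(processed)  # ['A', 'B', 'C']
```

The check runs before the work, not after it, so the loop never starts a turn it is not allowed to finish.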

(2) Over-broad tool permissions

It's convenient to hand the harness every tool, but the blast radius of any failure scales with what it can touch. In business contexts, build in permission separation from day one: read-only tools (Read / Grep / WebFetch) run without approval, while write tools (Write / Bash / Delete) always require approval. Claude Code's permission-mode and Codex CLI's per-command approval exist exactly for this purpose.
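A minimal sketch of that separation, assuming a simple allowlist design: the tool names follow the examples in the text, but the grouping, the `is_allowed` function, and the approval callbacks are hypothetical and are not how Claude Code or Codex CLI implement their permission systems.

```python
from typing import Callable

# Tool names follow the article's examples; the grouping is an illustrative assumption.
READ_ONLY_TOOLS = {"Read", "Grep", "WebFetch"}
WRITE_TOOLS = {"Write", "Bash", "Delete"}


def is_allowed(tool: str, approve: Callable[[str], bool]) -> bool:
    """Read-only tools pass; write tools need approval; unknown tools are denied."""
    if tool in READ_ONLY_TOOLS:
        return True
    if tool in WRITE_TOOLS:
        return approve(tool)
    return False


# Stand-ins for a human approval prompt.
def deny_all(tool: str) -> bool:
    return False


def approve_all(tool: str) -> bool:
    return True


print(is_allowed("Grep", deny_all))       # True: read-only, no approval needed
print(is_allowed("Bash", deny_all))       # False: write tool, approval refused
print(is_allowed("Bash", approve_all))    # True: write tool, approval granted
print(is_allowed("Launch", approve_all))  # False: not on any list
```

Denying unknown tools by default means a newly added tool stays unusable until someone deliberately classifies it.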

(3) Missing audit logs

The harness must persist a record of "who, when, what, via which tool". When something goes wrong, no logs means no root cause, no prevention. Claude Code and Codex CLI both save session logs by default, but for business use it's realistic to operate a parallel audit log in a structured format (JSON-Lines, ISO 8601 timestamps, user ID, the full command text).
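A structured audit record along those lines could be written as below. The field names and the `write_audit_record` helper are assumptions for illustration; the sketch writes to an in-memory buffer so it runs anywhere, where a real deployment would append to a file.

```python
import io
import json
from datetime import datetime, timezone
from typing import TextIO


def write_audit_record(log: TextIO, user_id: str, tool: str, command: str) -> dict:
    """Append one JSON Lines record: who, when, what, via which tool."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601, UTC
        "user_id": user_id,
        "tool": tool,
        "command": command,  # full command text, not a summary
    }
    log.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


# An in-memory buffer stands in for an append-only file such as open(path, "a").
buffer = io.StringIO()
write_audit_record(buffer, "u-001", "Bash", "git status")
write_audit_record(buffer, "u-001", "Write", "edit lib/blog.ts")

for line in buffer.getvalue().splitlines():
    entry = json.loads(line)
    print(entry["timestamp"], entry["tool"], entry["command"])
```

One JSON object per line keeps the log appendable and lets standard tools filter it by user, tool, or time range.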

8. Business use and the Sales Claw angle

Sales Claw is an OSS tool that uses policy controls, send-time auto-inspection, sales-NG detection, captcha-stop, rate limiting, audit logs, and AND-ed auto-stop conditions to structurally reduce the risk of accidental sends and policy violations. General harnesses like Claude Code and Codex CLI are excellent for coding and research, but "submit a sales pitch through a company's contact form" is a task-specific job that needs task-specific guards.

[Author's view] The relationship between general-purpose and domain-specific harness is similar to that between a passenger sedan and an ambulance. Both are vehicles. But the ambulance ships with sirens, an ECG monitor, oxygen tanks, a stretcher built in — things a sedan does not have. In the same way, a sales agent needs send-time inspection, sales-NG detection, captcha non-bypass, audit logs built in, things Claude Code does not ship by default. Sales Claw is the ambulance for outbound sales work, putting a dedicated body on top of the latest Claude / Codex engines.

Checklist before adopting a harness for business use

Before putting a harness in production

Set AND-conditioned limits on max turns, max items, and max execution time
Verify the auto-inspection pass rate on a 100-record sample
Approval mode is enabled for destructive commands (rm / git push --force / DELETE)
Audit log (action-log.json) persistence is enabled
Tool permissions are restricted to the minimum (read/write separated)
No production credentials remain in prompt history
Error notifications (Slack / email) are configured
A correction / rollback procedure is defined
Legal / compliance review is complete
Harness version and changelog are recorded internally

Conclusion — chassis decides the race more than the engine

In 2026, AI agents have entered a phase where "which chassis (harness) you ride in" matters more for real-world performance than "which engine (LLM) you use." Claude Code, Codex CLI, Devin, and Aider are all ready-made chassis polished by Anthropic, OpenAI, Cognition, and the OSS community respectively.

Next action: pick Claude Code or Codex CLI and play with it for 30 minutes. Experiencing Plan mode or per-command approval is what makes "what is a harness" click in a single sitting. If you want a domain-specific harness for outbound sales, the Sales Claw Quickstart Guide or the free download page will get you started.

The engine doesn't decide the speed. The chassis does. Run Sales Claw on your own list.


Japanese-language original: AI エージェントの『ハーネス』って結局なに？

Frequently asked questions

What is an AI agent harness?

A harness is the external software that wraps an LLM (Claude, GPT, etc.) and turns it into a working AI agent. Think of the LLM as the engine and the harness as the chassis, steering, tires, brakes, navigation, and seatbelts combined. It bundles tool calling (function calling), memory and context management, the autonomy loop (Gather-Act-Verify), safety guards (approval and guardrails), and audit logs. In its official System Card for Claude Opus 4.6, Anthropic publishes both "model-only" and "harness-included" scores, naming "Terminus-2" as the harness used for benchmarking.

Why did "harness" become a 2026 buzzword?

Through late 2025, SWE-bench Pro data piled up showing the same model swinging tens of points depending on harness. In April 2026, Anthropic told VentureBeat that user-visible quality variation in Claude had been caused by internal changes to the harness and operating instructions. That was the first time a major lab publicly acknowledged that a harness change alone moves user-visible quality. The "Agent = Model + Harness" formula spread industry-wide as a result.

Is a harness the same as a system prompt?

No. A system prompt is one piece of the scaffolding layer of a harness. The harness as a whole also includes loop control (how Gather-Act-Verify is run), tool dispatch (which functions get called and how results return), context compression, error retries, approval flow, and audit logging. That is why "just tweak the prompt" cannot reproduce Claude Code-class behavior.

How do Claude Code and Codex CLI harnesses differ?

Different design philosophies. Claude Code (Anthropic harness) emphasizes "write a plan first, get a yes, then act" with four-tier permission-mode and sub-agent delegation — built for long-context reading and human-in-the-loop collaboration. Codex CLI (OpenAI harness) is JSON-RPC 2.0 based with built-in image_generation (gpt-image-2) and per-site / per-command approval — built for parallel headless batches and image-generation integration. Context window: 1M tokens (Claude) vs 272K-to-~1M long mode (GPT).

Do I need to build my own harness?

No. Just pick one of Claude Code, Codex CLI, Gemini CLI, or open-source Aider and play with it for 30 minutes. With Claude Pro (USD 20/month) or ChatGPT Plus (USD 20/month), a single npm install -g command gets you running. Once comfortable, extend the harness via MCP (Model Context Protocol) with Slack, Notion, or your own database. Adopt a domain-specific harness (like Sales Claw for outbound sales-form submission) only when generic-harness safety guards become insufficient.

What are the risks of using a harness in production?

Three big ones. (1) Runaway autonomy loops: the harness will keep executing whatever the model decides, even when a human would stop. Always AND together limits on item count, wall time, and turn count. (2) Over-broad tool permissions: separate read-only tools (Read / Grep) from write tools (Write / Bash / Delete), and always require approval for destructive commands (rm / git push --force / DELETE). (3) Missing audit logs: persist "who, when, what, via which tool" with ISO 8601 timestamps. Sales Claw ships these as defaults.

How is Sales Claw different from Claude Code?

Claude Code is a general-purpose coding-and-research harness; Sales Claw is a harness specialized for outbound sales-form submission. Think passenger sedan vs. ambulance — both are vehicles, but the equipment differs. Sales Claw ships policy controls, send-time auto-inspection, sales-NG detection (auto-skip pages that declare "no sales"), captcha-stop, rate limiting, audit logs, and AND-ed auto-stop conditions. It puts a job-specific body on top of generic engines (Claude / GPT).

References

This article draws on official X accounts and official documentation as primary sources.

[01] Anthropic Claude Code overview (official Docs), 2026-05-18
[02] GitHub anthropics/claude-code (official OSS repo), 2026-05-18
[03] Anthropic System Card: Claude Opus 4.6 (PDF, official), 2026-02-01
[04] Anthropic: Introducing Claude Opus 4.7 (Newsroom), 2026-04-16
[05] Anthropic: Building Effective Agents (engineering blog), 2024-12-20
[06] Terminal-Bench on the Claude 4 Model Card (official leaderboard), 2026-04-16
[07] VentureBeat: Anthropic reveals harness and operating instruction changes, 2026-04-22
[08] OpenAI Codex Changelog (official), 2026-05-14
[09] npm @anthropic-ai/claude-code (official package), 2026-05-18
[10] Anthropic Claude Docs: Tool use overview, 2026-05-18
