What's an Agent Harness? Understanding Claude Code and Codex CLI in 8 Minutes Using a Car Analogy

A harness is the outer-shell software that wraps an LLM and lets it actually act in the world. With the same Claude or GPT engine, a different harness can swing SWE-bench scores by 36 points. Claude Code, Codex CLI, and Devin are different chassis. This is a general-reader walkthrough using a car analogy.

中澤 圭志

@keishi_nakazawa

Sales Claw maintainer

This English article is a concise version of the original. For the full Japanese deep-dive, see the Japanese original.

Key Facts

  • In one line: External-shell software that turns an LLM (the engine) into a working AI agent (a running car)
  • Performance impact: Same model, different harness: 42% → 78% on SWE-bench (documented case)
  • Examples: Claude Code (Terminus-2) / Codex CLI / Devin / Aider
  • Where to start: Pick one pre-built harness (Claude Code or Codex CLI) and play for 30 minutes

"Why are AI people suddenly talking about harnesses?" "Isn't the difference between Claude Code and Codex just the model?" "What does it actually mean for a 'harness change' to dramatically change performance?" — this article unpacks the term harness, which became an AI-industry buzzword in 2026, in language general readers can follow. We cite Anthropic Claude Code official documentation, the Anthropic System Card, the Terminal-Bench official article, and GitHub repositories as primary sources to explain why harness matters nowand how regular people can start touching one today.

Primary sources for this article: Anthropic Claude Code Docs / Anthropic System Card (Claude Opus 4.6 PDF) / Terminal-Bench official leaderboard / GitHub anthropics/claude-code / VentureBeat interview with Anthropic / OpenAI Codex CLI official changelog. We also reference a few third-party explainer articles for context, but every load-bearing number and spec ties back to official sources.

1. What a harness is — "the chassis the LLM rides on"

[Official] In the Anthropic System Card for Claude Opus 4.6 (February 2026), Anthropic explicitly names "Terminus-2" as the harness used when scoring the model on terminal benchmarks (System Card §4.2). The same Claude Opus 4.6 produces different numbers depending on the harness wrapping it, so Anthropic publishes "model-only score" and "harness-included score" separately. This is the strongest public signal that the harness matters as much as the model.

[Author's view] The car analogy is the most intuitive way in. Top-tier LLMs like Claude Opus 4.7 and GPT-5.5 are world-class engines, but an engine alone cannot drive on public roads. Steering, tires, brakes, navigation, and seatbelts have to be combined before you have a "car." In the AI world, that "chassis" is the harness, and Claude Code, Codex CLI, Devin, and Aider are different chassis (built by Anthropic, OpenAI, Cognition, and the OSS community respectively) sold to wrap the latest engines.

Anthropic's official engineering blog "Building Effective Agents" (December 2024) defines an agent as "a system where the LLM dynamically directs its own processes and tool usage, maintaining control over how it accomplishes the task." The harness is the outer-shell software that implements that "dynamically direct" and "maintain control" layer.

A whiteboard-style illustration showing LLM as engine and harness as chassis. The center shows a large engine; left side lists what the harness adds (tool calling / memory / loop / safety guards); right side lists what cannot be done without a harness (file edits / command execution / multi-step work). A yellow sticky note highlights the slogan 'Agent = Model + Harness'.
Figure: LLM (engine) + harness (chassis) = AI agent (running car) — the full picture

2. Why "harness" suddenly became a 2026 buzzword

[Official] In April 2026, Anthropic told VentureBeat that the period during which users reported Claude quality degradation overlapped with internal changes to the harness and operating instructions (VentureBeat April 2026 article, Anthropic comment). This is, without exaggeration, the first time a major AI lab has publicly acknowledged that "changes to the harness alone can shift user-visible quality without touching the model."

[Official] In the same window, SWE-bench Pro leaderboard analyses surfaced the following numbers (multiple official engineering blogs, April–May 2026):

  • Same model, harness swap only: 42% → 78% (a ~36-point lift) on SWE-bench
  • SWE-bench Pro: scaffolding differences accounted for 22+ points
  • By contrast, the gap between top-of-frontier models is roughly 1 point

[Author's view] What these numbers actually say is that we have entered a phase where "harness choice" affects outcomes more than "model choice." Through 2024 the debate was "is GPT-4 or Claude 3 smarter?" In 2026, the lived truth is that Claude Opus 4.7 inside Claude Code behaves like a different product than the same model inside a homemade script.

3. What's inside — the three pillars of a harness

If the harness is a chassis, the three pillars are the engine mount, the steering wheel, and the fuel tank. We'll go through each in order.

(1) The autonomy loop — the Gather-Act-Verify cycle

[Official] Claude Code's official overview describes the agent's behavior as a repeating "Gather context → Act → Verify results" cycle. Concretely:

  1. Gather: search files, read code, capture command output
  2. Act: edit files, run commands, call external APIs
  3. Verify: run tests, check output, look for errors
  4. If the result doesn't meet the goal, return to Gather and try again

That, in essence, is the autonomy loop. An LLM by itself does "answer once and stop." The harness runs the three-step loop until the goal condition is satisfied, which is what makes the agent look like it "is thinking and acting on its own."
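The cycle above can be sketched in a few lines of Python. This is a minimal illustration, not Claude Code's actual implementation: `LoopState`, `run_agent_loop`, and the three toy functions are hypothetical names, and the "model" is replaced by a stub that succeeds on its third attempt.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopState:
    """What the loop has seen so far (hypothetical structure for illustration)."""
    observations: list[str] = field(default_factory=list)
    turns: int = 0


def run_agent_loop(
    gather: Callable[[LoopState], str],
    act: Callable[[str], str],
    verify: Callable[[str], bool],
    max_turns: int = 10,
) -> tuple[bool, LoopState]:
    """Repeat Gather -> Act -> Verify until verify() passes or turns run out."""
    state = LoopState()
    while state.turns < max_turns:
        state.turns += 1
        context = gather(state)            # 1. Gather: read files, logs, prior results
        result = act(context)              # 2. Act: edit, run a command, call an API
        state.observations.append(result)
        if verify(result):                 # 3. Verify: did the goal condition hold?
            return True, state
        # 4. Goal not met: go around again with the new observation recorded
    return False, state


# Toy stand-ins for the model and its tools (assumptions, not real APIs).
def toy_gather(state: LoopState) -> str:
    return f"attempt {state.turns}"


def toy_act(context: str) -> str:
    return "tests passed" if context == "attempt 3" else "tests failed"


def toy_verify(result: str) -> bool:
    return result == "tests passed"


succeeded, final_state = run_agent_loop(toy_gather, toy_act, toy_verify)
print(succeeded, final_state.turns)
```

The `max_turns` cap is the important design choice here: without an explicit stop condition, a loop like this would keep going for as long as verification fails.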

(2) Tool calling — exposing functions to the AI

The harness gives the agent "tools" it can use. For Claude Code, the standard toolset includes:

  • Read / Write / Edit (file operations)
  • Bash (command execution)
  • Glob / Grep (search)
  • WebFetch / WebSearch (internet access)
  • Custom tools via MCP servers (Slack posting, GitHub PR creation, etc.)

These are exposed to the LLM as "JSON Schema function definitions." The LLM responds with a JSON payload saying "I want to call this function with these arguments," the harness parses that JSON, actually invokes the function, and returns the result to the LLM. That feedback loop is what "tool calling (function calling)" really is. See the MCP (Model Context Protocol) complete guide for the protocol underneath.
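As a rough sketch of that round trip, the Python below registers one tool with a JSON Schema description, parses a model-style JSON request, runs the matching function, and packages the result. The tool name `read_file`, the request shape (`name` plus `arguments`), and the stub handler are assumptions made for illustration; real harnesses and vendor APIs each define their own wire format.

```python
import json
from typing import Any


def read_file_stub(path: str) -> str:
    """Stand-in for a real Read tool; returns canned text instead of touching disk."""
    return f"<contents of {path}>"


# Hypothetical registry: the schema is what the model sees, the handler is what runs.
TOOLS: dict[str, dict[str, Any]] = {
    "read_file": {
        "schema": {
            "name": "read_file",
            "description": "Read a text file and return its contents.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
        "handler": read_file_stub,
    },
}


def dispatch_tool_call(raw_model_output: str) -> dict[str, Any]:
    """Parse the model's JSON request, invoke the tool, and return a result record."""
    request = json.loads(raw_model_output)
    name = request["name"]
    arguments = request.get("arguments", {})
    tool = TOOLS.get(name)
    if tool is None:
        return {"tool": name, "ok": False, "error": f"unknown tool: {name}"}
    required = tool["schema"]["parameters"]["required"]
    missing = [key for key in required if key not in arguments]
    if missing:
        return {"tool": name, "ok": False, "error": f"missing arguments: {missing}"}
    output = tool["handler"](**arguments)
    return {"tool": name, "ok": True, "output": output}


model_output = '{"name": "read_file", "arguments": {"path": "lib/blog.ts"}}'
tool_result = dispatch_tool_call(model_output)
print(json.dumps(tool_result))
```

Returning errors as data instead of raising lets the harness feed the failure back to the model, which can then correct its request on the next turn.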

(3) Memory / context management — "not forgetting long tasks"

Every LLM has a context window limit. As of May 2026, Claude Opus 4.7 is 1M tokens, GPT-5.5 is 272K tokens (or about 1M in long mode). In real coding work, repositories and logs routinely exceed even a 1M-token window.

The harness keeps you under the cap by silently doing things like (a) summarizing and compressing older turns, (b) paging important facts to side files and reading them back when needed, and (c) delegating sub-tasks to sub-agents and only ingesting their final result into the parent context. This is what "memory management" and "context management" refer to.
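Technique (a), summarizing older turns, might look roughly like the following. Everything here is a simplified assumption: the four-characters-per-token estimate is a crude heuristic, `compact_history` is a hypothetical helper, and `toy_summarize` stands in for what would normally be another LLM call.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def compact_history(
    turns: list[str],
    budget_tokens: int,
    summarize: Callable[[list[str]], str],
    keep_recent: int = 2,
) -> list[str]:
    """Replace older turns with one summary when the history exceeds the budget."""
    total = sum(estimate_tokens(turn) for turn in turns)
    if total <= budget_tokens or len(turns) <= keep_recent:
        return turns
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    return [summarize(older)] + recent


def toy_summarize(older: list[str]) -> str:
    """Placeholder for a model-generated summary."""
    return f"[summary of {len(older)} earlier turns]"


history = ["x" * 400, "y" * 400, "z" * 400, "latest question", "latest answer"]
compacted = compact_history(history, budget_tokens=200, summarize=toy_summarize)
print(compacted)
```

The most recent turns are kept verbatim because they usually carry the details the next step depends on; only the older material is traded for a shorter summary.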

A whiteboard illustration breaking down the harness internals. The center shows the Gather-Act-Verify autonomy loop; the left lists tools (Read/Write/Bash/Grep/MCP); the right covers context compression and sub-agent delegation; the top is labeled 'safety guards (approval, skip)' and the bottom is labeled 'audit log (who, when, what)'.
Figure: The three pillars (autonomy loop / tools / memory) with safety guards and audit logs bookending both sides

4. Inside Claude Code's harness

[Official] Claude Code is the CLI agent shipped as the npm package @anthropic-ai/claude-code (GitHub: anthropics/claude-code). The official Claude Code Docs describe it as an "agentic coding tool that lives in your terminal, understands your codebase, and executes routine tasks via natural language commands."

What's distinctive about the Claude Code harness is "always plan first" and "approval-heavy."

  • Plan mode: doesn't jump straight into editing — it writes out "what it plans to do" in natural language and waits for the user's OK before executing
  • permission-mode: default / acceptEdits / bypassPermissions / plan, four tiers of operational autonomy you can switch between
  • Sub-agent: delegate a large task to a child agent so the parent only ingests its final result, saving context
  • /goal command (2.1.140+): high-autonomy mode that keeps looping until a stated condition is satisfied

On the Terminal-Bench leaderboard, Claude Opus 4.6 + Terminus-2 scores 65.4% (max effort) (tbench.ai official). Claude Code doesn't use Terminus-2 byte-for-byte, but it's one of the few cases where an Anthropic-built harness baseline is public knowledge.

# Install Claude Code and start (Mac/Linux/Windows)
npm install -g @anthropic-ai/claude-code
claude

# Use Plan mode to just see "what it would do" first
claude --permission-mode plan "add a new blog entry to lib/blog.ts"

# Run the autonomy loop until the goal is met (Claude Code 2.1.140+)
claude /goal "keep editing until all tests pass"

For what changed in each Claude Code release, see the Claude Code 2.1.143 release notes.

5. Inside Codex CLI's harness

[Official] Codex CLI is OpenAI's Node.js CLI agent, backed by GPT-5.5 / GPT-5.3-Codex / GPT-5.4 (OpenAI Codex Changelog). Compared with Claude Code at the harness-personality level:

| Item | Claude Code (Anthropic harness) | Codex CLI (OpenAI harness) |
| --- | --- | --- |
| Model | Claude Opus 4.7 / Sonnet 4.6 / Haiku 4.5 | GPT-5.5 / 5.3-Codex / 5.4 |
| Context window | 1M tokens (Opus/Sonnet/Haiku) | 272K (~1M in long mode) |
| Planning | Plan mode writes a plan first | Model-discretion, execution-leaning |
| Remote control | Dedicated Remote Control UI (2026-02) | JSON-RPC 2.0 + mobile (2026-05) |
| Image generation | External tools only | Built-in image_generation (gpt-image-2) |
| Approval model | permission-mode (4 tiers) | per-site / per-command (Chrome ext.) |
| Best for | Long-context reading, sub-agent splits | Parallel headless batches, image generation |

[Author's view] It's not a question of which is better; the design philosophies differ. Claude Code shines at "reading a lot while in dialog with a human." Codex CLI shines at "quietly running parallel batches overnight." In practice, the two coexist well. For a deeper breakdown, see the Codex CLI vs Claude Code benchmark comparison.

Sales Claw applies Claude Code / Codex CLI harness thinking to B2B sales-form automation.

Free, MIT licensed. You can also try the live demo without installing anything.

6. Choosing a harness — build your own or use one? (how to start)

[Author's view] For most people wanting to put an AI agent to work, "build one from scratch" is unrealistic. The three-step progression below is what we recommend.

Step 1: pick one pre-built harness and try it (free to low cost)

  • Claude Code: with a Claude Pro subscription (USD 20/month), one npm install -g @anthropic-ai/claude-code gets you running. The 1M-token context window and the clarity of Plan mode make it a great first AI agent.
  • Codex CLI: requires a ChatGPT Plus / Pro / Business plan. Install via npm install -g @openai/codex. GPT-5.5 responses are fast, and web-search tool integration is standard.
  • Gemini CLI: a Google Cloud account gets you a free tier. Long context and search are strong out of the box.
  • Aider (OSS): free as long as you have an API key. For developers who want to read the harness internals.

Step 2: extend the harness via MCP

Once you're comfortable, layer on custom tools via MCP (Model Context Protocol). Tools that let the agent post to your internal Slack, read your Notion, or query your own database turn the harness into a business-specific instrument.

Step 3: adopt a domain-specific harness (e.g. Sales Claw)

In domains like outbound sales automation, where business-specific safety guards and audit logs are mandatory, general harnesses (Claude Code / Codex CLI) leave gaps. Sales Claw is a sales-specific harness with built-in policy controls, send-time auto-inspection, sales-NG detection, captcha-stop, rate limiting, audit logs, and AND-ed auto-stop conditions. To go further, the Sales Claw Quickstart Guide is the easiest entry point.

Three-step staircase illustration showing the path from generic harness to specialized one. Step 1: try Claude Code / Codex CLI (free to low cost), Step 2: extend with MCP-served internal tools, Step 3: adopt a domain-specific harness like Sales Claw. Each step is annotated with rough duration and difficulty.
Figure: The three-step approach for regular users — pre-built → MCP extension → domain-specific

Bar chart showing harness impact on SWE-bench Pro. The same model lifts from 42% to 78% just by changing harness; meanwhile model-to-model gap at the frontier is roughly 1 point. Sample sizes and assumptions are annotated, Sales Claw editorial illustration.
Figure: SWE-bench Pro impact of harness choice. Same model, vastly different scores depending on harness.

Bar chart comparing context windows: GPT-5.4 / GPT-5.5 standard / GPT-5.5 long mode / Gemini 2.5 Pro / Claude (Opus, Sonnet, Haiku all at 1M). Cited Anthropic, OpenAI, and Google official pricing & docs (May 2026 retrieval).
Figure: Context-window comparison across major LLMs. The harness has to know how to use that capacity.

Bar chart comparing monthly JPY cost for Aider (OSS) + Claude API / Gemini CLI (free tier) / Claude Code (Claude Pro) / Codex CLI (ChatGPT Plus) / Sales Claw (OSS) + Claude. Assumes 1 USD = 150 JPY, single developer using ~100 hours/month.
Figure: Monthly cost estimate for major harnesses (individual / SMB usage). Roughly ¥0 to ¥4,500/month gets you started.

7. Risks and safety design

(1) Runaway autonomy loops

[Unverified] As of May 2026, no major AI agent has publicly reported a serious incident caused by a runaway loop. That said, social media and developer blogs document multiple personal cases: unexpected large API spend, git push --force wiping out history, and similar. The harness executes whatever the model decides next — which means it often won't stop in situations where a human's instinct would.

(2) Over-broad tool permissions

It's convenient to hand the harness every tool, but the blast radius of any failure scales with what it can touch. In business contexts, build in permission separation from day one: read-only tools (Read / Grep / WebFetch) run without approval, while write tools (Write / Bash / Delete) always require approval. Claude Code's permission-mode and Codex CLI's per-command approval exist exactly for this purpose.
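One way to picture that separation is a small gate in front of every tool call. The sketch below is an illustration only, not how Claude Code or Codex CLI implement approvals: the tool groupings mirror the ones named in this section, while `is_allowed` and the `ask_human` callback are hypothetical.

```python
from typing import Callable

# Groupings taken from the paragraph above; treat the exact sets as assumptions.
READ_ONLY_TOOLS = {"Read", "Grep", "WebFetch"}
WRITE_TOOLS = {"Write", "Bash", "Delete"}


def is_allowed(tool_name: str, ask_human: Callable[[str], bool]) -> bool:
    """Read-only tools pass, write tools need approval, unknown tools are denied."""
    if tool_name in READ_ONLY_TOOLS:
        return True
    if tool_name in WRITE_TOOLS:
        return ask_human(tool_name)
    return False


def approve_all(tool_name: str) -> bool:
    """Toy approver that always says yes (a real one would prompt a person)."""
    return True


def deny_all(tool_name: str) -> bool:
    """Toy approver that always says no."""
    return False


print(is_allowed("Read", deny_all))     # read-only: no approval needed
print(is_allowed("Bash", deny_all))     # write tool: blocked without approval
print(is_allowed("Bash", approve_all))  # write tool: runs once approved
```

Denying unknown tools by default keeps a newly added tool from silently inheriting write access before someone has classified it.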

(3) Missing audit logs

The harness must persist a record of "who, when, what, via which tool". When something goes wrong, no logs means no root cause, no prevention. Claude Code and Codex CLI both save session logs by default, but for business use it's realistic to operate a parallel audit log in a structured format (JSON-Lines, ISO 8601 timestamps, user ID, the full command text).
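A parallel audit log in that format can be quite small. The sketch below assumes a record with four fields (`timestamp`, `user_id`, `tool`, `command`); the field names and the `write_audit_record` helper are illustrative choices, not a format defined by Claude Code, Codex CLI, or Sales Claw. It writes to an in-memory buffer so it runs anywhere; a real deployment would append to a file or a log service.

```python
import io
import json
from datetime import datetime, timezone
from typing import TextIO


def write_audit_record(stream: TextIO, user_id: str, tool: str, command: str) -> dict:
    """Append one JSON-Lines record answering who, when, what, via which tool."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601, UTC
        "user_id": user_id,
        "tool": tool,
        "command": command,  # full command text, not a truncated preview
    }
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


buffer = io.StringIO()
write_audit_record(buffer, "u-001", "Bash", "npm test")
write_audit_record(buffer, "u-001", "Write", "edit lib/blog.ts")

audit_lines = buffer.getvalue().splitlines()
for line in audit_lines:
    print(line)
```

One record per line means the log can be appended to safely and later filtered with ordinary line-oriented tools.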

8. Business use and the Sales Claw angle

Sales Claw is an OSS tool that uses policy controls, send-time auto-inspection, sales-NG detection, captcha-stop, rate limiting, audit logs, and AND-ed auto-stop conditions to structurally reduce the risk of accidental sends and policy violations. General harnesses like Claude Code and Codex CLI are excellent for coding and research, but "submit a sales pitch through a company's contact form" is a task-specific job that needs task-specific guards.

[Author's view] The relationship between general-purpose and domain-specific harness is similar to that between a passenger sedan and an ambulance. Both are vehicles. But the ambulance ships with sirens, an ECG monitor, oxygen tanks, a stretcher built in — things a sedan does not have. In the same way, a sales agent needs send-time inspection, sales-NG detection, captcha non-bypass, audit logs built in, things Claude Code does not ship by default. Sales Claw is the ambulance for outbound sales work, putting a dedicated body on top of the latest Claude / Codex engines.

Checklist before adopting a harness for business use

Before putting a harness in production

  • Set AND-conditioned limits on max turns, max items, and max execution time
  • Verify the auto-inspection pass rate on a 100-record sample
  • Approval mode is enabled for destructive commands (rm / git push --force / DELETE)
  • Audit log (action-log.json) persistence is enabled
  • Tool permissions are restricted to the minimum (read/write separated)
  • No production credentials remain in prompt history
  • Error notifications (Slack / email) are configured
  • A correction / rollback procedure is defined
  • Legal / compliance review is complete
  • Harness version and changelog are recorded internally
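The first checklist item can be sketched as a single guard function. This assumes one particular reading of "AND-conditioned": the run may continue only while every limit is still within bounds, so exceeding any one of them stops it. `RunLimits` and `may_continue` are hypothetical names, not part of any harness named in this article.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunLimits:
    """Upper bounds for one agent run (illustrative values only)."""
    max_turns: int
    max_items: int
    max_seconds: float


def may_continue(
    limits: RunLimits,
    turns: int,
    items: int,
    started_at: float,
    now: Optional[float] = None,
) -> bool:
    """Continue only while turns AND items AND elapsed time are all under their caps."""
    current = time.monotonic() if now is None else now
    return (
        turns < limits.max_turns
        and items < limits.max_items
        and (current - started_at) < limits.max_seconds
    )


limits = RunLimits(max_turns=50, max_items=100, max_seconds=600.0)

# Explicit clock values are passed in so the example is deterministic.
print(may_continue(limits, turns=10, items=20, started_at=0.0, now=30.0))
print(may_continue(limits, turns=10, items=100, started_at=0.0, now=30.0))
```

Checking this before each loop iteration gives the harness a stop condition that does not depend on the model deciding it is finished.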

Conclusion — chassis decides the race more than the engine

In 2026, AI agents have entered a phase where "which chassis (harness) you ride in" matters more for real-world performance than "which engine (LLM) you use." Claude Code, Codex CLI, Devin, and Aider are all ready-made chassis polished by Anthropic, OpenAI, Cognition, and the OSS community respectively.

Next action: pick Claude Code or Codex CLI and play with it for 30 minutes. Experiencing Plan mode or per-command approval is what makes "what is a harness" click in a single sitting. If you want a domain-specific harness for outbound sales, the Sales Claw Quickstart Guide or the free download page will get you started.

The engine doesn't decide the speed. The chassis does. Run Sales Claw on your own list.

Free, MIT licensed. You can also try the live demo without installing anything.

Japanese-language original: AI エージェントの『ハーネス』って結局なに?

Frequently Asked Questions

What is an AI agent harness?
A harness is the external software that wraps an LLM (Claude, GPT, etc.) and turns it into a working AI agent. Think of the LLM as the engine and the harness as the chassis, steering, tires, brakes, navigation, and seatbelts combined. It bundles tool calling (function calling), memory and context management, the autonomy loop (Gather-Act-Verify), safety guards (approval and guardrails), and audit logs. In its official System Card for Claude Opus 4.6, Anthropic publishes both "model-only" and "harness-included" scores, naming "Terminus-2" as the harness used for benchmarking.
Why did "harness" become a 2026 buzzword?
Through late 2025, SWE-bench Pro data piled up showing the same model swinging tens of points depending on harness. In April 2026, Anthropic told VentureBeat that the period of user-reported quality degradation in Claude overlapped with internal changes to the harness and operating instructions. That was the first time a major lab publicly acknowledged that a harness change alone can move user-visible quality. The "Agent = Model + Harness" formula spread industry-wide as a result.
Is a harness the same as a system prompt?
No. A system prompt is one piece of the scaffolding layer of a harness. The harness as a whole also includes loop control (how Gather-Act-Verify is run), tool dispatch (which functions get called and how results return), context compression, error retries, approval flow, and audit logging. That is why "just tweak the prompt" cannot reproduce Claude Code-class behavior.
How do Claude Code and Codex CLI harnesses differ?
Different design philosophies. Claude Code (Anthropic harness) emphasizes "write a plan first, get a yes, then act" with four-tier permission-mode and sub-agent delegation — built for long-context reading and human-in-the-loop collaboration. Codex CLI (OpenAI harness) is JSON-RPC 2.0 based with built-in image_generation (gpt-image-2) and per-site / per-command approval — built for parallel headless batches and image-generation integration. Context window: 1M tokens (Claude) vs 272K-to-~1M long mode (GPT).
Do I need to build my own harness?
No. Just pick one of Claude Code, Codex CLI, Gemini CLI, or open-source Aider and play with it for 30 minutes. With Claude Pro (USD 20/month) or ChatGPT Plus (USD 20/month), a single npm install -g command gets you running. Once comfortable, extend the harness via MCP (Model Context Protocol) with Slack, Notion, or your own database. Adopt a domain-specific harness (like Sales Claw for outbound sales-form submission) only when generic-harness safety guards become insufficient.
What are the risks of using a harness in production?
Three big ones. (1) Runaway autonomy loops: the harness will keep executing whatever the model decides, even when a human would stop. Always AND together limits on item count, wall time, and turn count. (2) Over-broad tool permissions: separate read-only tools (Read / Grep) from write tools (Write / Bash / Delete), and always require approval for destructive commands (rm / git push --force / DELETE). (3) Missing audit logs: persist "who, when, what, via which tool" with ISO 8601 timestamps. Sales Claw ships these as defaults.
How is Sales Claw different from Claude Code?
Claude Code is a general-purpose coding-and-research harness; Sales Claw is a harness specialized for outbound sales-form submission. Think passenger sedan vs. ambulance — both are vehicles, but the equipment differs. Sales Claw ships policy controls, send-time auto-inspection, sales-NG detection (auto-skip pages that declare "no sales"), captcha-stop, rate limiting, audit logs, and AND-ed auto-stop conditions. It puts a job-specific body on top of generic engines (Claude / GPT).

About the author

中澤 圭志

Sales Claw maintainer

Responsible for the design and development of Sales Claw. A practitioner of B2B sales automation and AI adoption who writes from a hands-on, in-the-field perspective.