
Codex CLI vs Claude Code (May 2026): a cross-cut benchmark comparison — SWE-bench, Terminal-Bench, cost
Codex CLI 0.130.0 and Claude Code 2.1.143 reshuffle rankings depending on the benchmark axis. We walk through Terminal-Bench 2.0 / SWE-bench Verified / Aider Polyglot (official + third-party aggregates), API pricing, CLI feature deltas, and AI sales-automation fit — all from the Sales Claw maintainer's seat.

中澤 圭志
@keishi_nakazawa, Sales Claw maintainer

Key Facts
- Latest: Codex CLI 0.130.0 (2026-05-08) / Claude Code 2.1.143 (2026-05-15)
- Default model: GPT-5.5 (2026-04-23) / Claude Opus 4.7 (2026-04-16)
- Terminal-Bench 2.0: vix + Opus 4.7 90.2% (#1) / Codex CLI + GPT-5.5 82.0% (#7)
- API pricing (in/out): Opus 4.7 $5/$25 / GPT-5.5 $5/$30 / GPT-5.3-Codex $1.75/$14 per MTok
"Which is stronger — Codex CLI or Claude Code? I want a cross-cut comparison on 2026-May official benchmarks, with cost and task fit included, so I can pick the right one in the field." — We answer this using the Terminal-Bench 2.0 official leaderboard, Aider Polyglot, SWE-bench Verified aggregates, plus Anthropic / OpenAI official docs and GitHub releases — through the lens of embedding these agents in AI sales automation.
As of May 2026, the latest versions are Codex CLI 0.130.0 (released 2026-05-08) and Claude Code 2.1.143 (released 2026-05-15). Their backing flagship models are GPT-5.5 (rolled out 2026-04-23) and Claude Opus 4.7 (released 2026-04-16). Benchmark rankings flip depending on which axis you measure, so rather than a simple winner, read these as "task-by-task fit".
Sources: OpenAI Codex official changelog / Claude Code official changelog / Anthropic Newsroom / Terminal-Bench official leaderboard (tbench.ai) / Aider official docs / Claude / OpenAI official pricing pages. We only cite official information at publish time; third-party aggregates are explicitly labeled.
1. What Codex CLI and Claude Code are — May 2026 snapshot

Codex CLI (OpenAI)
- Latest: 0.130.0 stable (2026-05-08). 0.131.0-alpha line in progress.
- Default model: GPT-5.5 (rolled out 2026-04-23). The Codex-tuned GPT-5.3-Codex is selectable.
- Strengths: JSON-RPC 2.0 app-server; `codex remote-control` for full headless control from external processes; built-in image generation (gpt-image-2).
- Subscriptions: ChatGPT Plus / Pro / Business / Enterprise include Codex (Pro $100 = 5x Plus limits).
- Package: `@openai/codex` on npm, launched via `codex`.
Claude Code (Anthropic)
- Latest: 2.1.143 (2026-05-15).
- Default model: Claude Opus 4.7 (2026-04-16). Sonnet 4.6 / Haiku 4.5 selectable. Supports `xhigh` effort level and Fast mode.
- Strengths: First-class subagents (`claude agents`), `/goal` for completion-condition loops, `/ultrareview`, Plugin / Skill / MCP. 1M-token context standard.
- Subscriptions: Claude Pro / Max / Team include Claude Code. Direct API also supported.
- Package: `@anthropic-ai/claude-code` on npm, launched via `claude`.
2. Cross-cut benchmarks — Terminal-Bench, SWE-bench, Aider
Terminal-Bench 2.0 — real terminal tasks
Selected rows from the top 10 of tbench.ai's official leaderboard as of 2026-05-15:
| Rank | Agent | Model | Score | Date |
|---|---|---|---|---|
| 1 | vix | Claude Opus 4.7 | 90.2% ± 2.1 | 2026-05-15 |
| 2 | JJAgent | Multiple | 87.1% | 2026-05-15 |
| 3 | NexAU-AHE | GPT-5.5 | 84.7% | 2026-05-14 |
| 7 | Codex CLI | GPT-5.5 | 82.0% | 2026-04-23 |
| 9 | WOZCODE | Claude Opus 4.7 | 80.2% | 2026-05-14 |
SWE-bench Verified — real GitHub issues
SWE-bench Verified is the heavyweight benchmark using real GitHub issues. OpenAI paused self-reporting in Feb 2026 over contamination concerns, so current numbers come from third-party trackers like Epoch AI.

- GPT-5.5: 88.7% (OpenAI self-reported, released 2026-04-23)
- GPT-5.3-Codex: 85.0%
- Claude Opus 4.7: ~82% (third-party aggregate)
- Claude Code (Opus 4.6 base agent): 80.9%
Anthropic explicitly states in the Opus 4.7 announcement that "excluding any problems that show signs of memorization, Opus 4.7's margin of improvement over Opus 4.6 holds", signaling transparency on contamination. We treat these numbers as "baselines with ±several percent error".
Aider Polyglot — multi-language code editing
Aider's official leaderboard evaluates against 225 Exercism problems across C++ / Go / Java / JavaScript / Python / Rust.
- gpt-5 (high): 88.0% correct, $29.08 cost (Rank 1)
- gpt-5 (medium): 86.7% correct, $17.69 cost (Rank 2)
- o3-pro (high): 84.9% correct, $146.32 cost (Rank 3)
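One way to read the cost column is as the marginal cost of each extra percentage point. The following is a minimal sketch that uses only the three rows quoted above; the function name is ours and purely illustrative.

```python
# Aider Polyglot rows quoted above: (label, percent correct, total cost in USD).
ROWS = [
    ("gpt-5 (high)", 88.0, 29.08),
    ("gpt-5 (medium)", 86.7, 17.69),
    ("o3-pro (high)", 84.9, 146.32),
]


def marginal_cost_per_point(better, baseline):
    """USD paid per extra percentage point when moving from baseline to better."""
    _, score_b, cost_b = better
    _, score_a, cost_a = baseline
    return (cost_b - cost_a) / (score_b - score_a)


if __name__ == "__main__":
    high, medium, _ = ROWS
    print(f"{marginal_cost_per_point(high, medium):.2f} USD per extra point")
```

On these rows, moving from gpt-5 (medium) to gpt-5 (high) costs roughly $8.76 per additional point, which is the kind of tradeoff worth checking against your own workload.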
3. CLI / subagent / plugin feature deltas

| Item | Codex CLI 0.130.0 | Claude Code 2.1.143 |
|---|---|---|
| Default model | GPT-5.5 (Codex also offers GPT-5.3-Codex) | Claude Opus 4.7 (Sonnet 4.6 / Haiku 4.5 switchable) |
| Context window | GPT-5.4: 272K default / 1.05M long mode (third-party) | 1M tokens standard (Opus 4.7 / 4.6 / Sonnet 4.6) |
| Subagents | No (parallelism via remote-control from external processes) | Yes (claude agents — 8 flags for session isolation) |
| Completion-condition loop | No (implement outer loop via turn/start) | Yes (/goal — 2.1.143 fixes background-shell consistency) |
| Plugins | Yes (workspace sharing / access controls) | Yes (dependency management / cost visibility, 2.1.143) |
| Image generation | Built-in (gpt-image-2 via image_generation feature) | No (external generation possible via MCP) |
| Remote control | codex remote-control + JSON-RPC 2.0 app-server | claude agents dispatched background sessions |
| Code review | In-cmd review prompts | /ultrareview (cloud parallel review) |
| Modal editing | /vim (added in 0.129.0) | No (standard TUI input) |
| Lifecycle | /hooks (browser added in 0.129.0) | hooks / skills combination |
| Windows support | Native PowerShell, sandbox bypass flag | 2.1.143 defaults to -ExecutionPolicy Bypass |
| Auth | OpenAI API key / ChatGPT subscription / AWS Bedrock | Anthropic API key / Claude Pro / Bedrock / Vertex / Foundry |
Where Claude Code wins
- Long-context investigation: 1M-token context plus subagent isolation suits large-repo overview tasks.
- Completion-condition loops: `/goal` drives "until all tests pass" or "until lint = 0" in one command.
- MCP / Plugin / Skill: Three mature extension hooks make it easy to inject internal knowledge.
Where Codex CLI wins
- Programmable execution: `codex remote-control` + JSON-RPC 2.0 enables full control from external scripts, perfect for CI / batch.
- Image generation: `image_generation` built in. The figures in this article are generated with it.
- Token efficiency: Third-party benchmarks report "about 1/4 the token consumption of Claude Code on the same task" (source: morphllm aggregates, reproducibility requires verification).
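For the programmable-execution point, what an external script sends is a plain JSON-RPC 2.0 envelope. Below is a minimal sketch of building one request. The `jsonrpc`, `id`, `method`, and `params` fields come from the JSON-RPC 2.0 specification, and `turn/start` is the method name mentioned in the feature table above. The `params` payload is a hypothetical placeholder, not the documented Codex app-server schema, so check the official docs for the real parameter shape and transport.

```python
import itertools
import json

# Monotonic request ids so responses can be matched to requests.
_request_ids = itertools.count(1)


def build_request(method, params):
    """Return a JSON-RPC 2.0 request object serialized as one line of JSON."""
    message = {
        "jsonrpc": "2.0",          # fixed version string required by the spec
        "id": next(_request_ids),  # a request without an id is a notification
        "method": method,
        "params": params,
    }
    return json.dumps(message, separators=(",", ":"))


if __name__ == "__main__":
    # "prompt" is an assumed field name used purely for illustration.
    print(build_request("turn/start", {"prompt": "Generate an OG card"}))
```

Keeping the envelope builder separate from the transport makes it easy to unit-test the caller's max-iterations logic without a running server.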
4. Token efficiency, context windows, API pricing
| Model | Input | Output | Cache Read | Context |
|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | $0.50 | 1M |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | 1M |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | 200K |
| GPT-5.5 | $5.00 | $30.00 | $0.50 (agg.) | 272K-1M |
| GPT-5.4 | $2.50 | $15.00 | $0.25 (agg.) | 272K-1M |
| GPT-5.3-Codex | $1.75 | $14.00 | $0.18 (agg.) | 200K+ (agg.) |
Sources: Anthropic Pricing Docs / OpenAI Pricing. Prices in USD; subject to FX moves and official revisions.
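To turn the per-MTok rates into a per-request figure, one possible reading is that cached input tokens are billed at the cache-read rate, the remaining input at the input rate, and output at the output rate. The sketch below works under that assumption. It ignores cache-write surcharges and any long-context pricing tiers, and the function name and dictionary keys are our own labels rather than API model identifiers.

```python
# USD per million tokens, copied from the table above: (input, output, cache read).
# Keys are illustrative labels, not official API model ids.
PRICES = {
    "claude-opus-4.7": (5.00, 25.00, 0.50),
    "claude-sonnet-4.6": (3.00, 15.00, 0.30),
    "claude-haiku-4.5": (1.00, 5.00, 0.10),
    "gpt-5.5": (5.00, 30.00, 0.50),
    "gpt-5.4": (2.50, 15.00, 0.25),
    "gpt-5.3-codex": (1.75, 14.00, 0.18),
}


def request_cost_usd(model, input_tokens, output_tokens, cache_hit_ratio=0.0):
    """Estimate the USD cost of one request under the simple billing model above."""
    price_in, price_out, price_cache = PRICES[model]
    cached = input_tokens * cache_hit_ratio  # tokens served from the prompt cache
    fresh = input_tokens - cached            # tokens billed at the full input rate
    total = fresh * price_in + cached * price_cache + output_tokens * price_out
    return total / 1_000_000


if __name__ == "__main__":
    for name in ("claude-opus-4.7", "gpt-5.5"):
        print(name, round(request_cost_usd(name, 4_000, 800), 4))
```

With these rates, a 4,000-in / 800-out request with no cache hits comes to $0.040 on Opus 4.7 and $0.044 on GPT-5.5; the output rate is where the two flagships diverge.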

| Item | ChatGPT family (Codex CLI included) | Claude family (Claude Code included) |
|---|---|---|
| Free | Free (ads) | Free (limits apply) |
| Individual light | Go $8/mo (US, ads) | — (no direct equivalent) |
| Individual standard | Plus $20/mo (includes Codex) | Pro $20/mo (includes Claude Code) |
| Individual upper | Pro $100/mo (5x Plus limits) / Pro $200 | Max $100 / $200 (power-user ceilings) |
| Business | Business $25/seat (monthly) | Team / Enterprise (contact sales) |
| Direct API | Per-token rates above | Per-token rates above |
5. AI sales automation fit — Sales Claw perspective

Sales Claw is an OSS tool designed to reduce mis-send and TOS-violation risk through policy control, pre-send automated checks, sales-NG detection, halting on CAPTCHA detection, send-frequency limits, audit log preservation, and auto-stop conditions. When embedding CLI agents into the loop, Codex CLI and Claude Code are complementary, not exclusive. In our own internal verification (May 2026, 100-company sample on the Sales Claw repo), the most stable setup paired Claude Code's `/goal` loop, which drives approach-guardrail violations to zero, with Codex CLI's `image_generation`, which produces OG cards in parallel. See our walkthrough on bundling `claude agents` and `codex remote-control` into one parallel headless platform for details.
Where Claude Code fits
- Form body generation: 1M context can hold company info, past send history, and approach guardrails in one shot.
- Diff-text checking: `/goal "loop until no approach-guardrail violations"`
- Large-repo overview: `claude agents` parallelizes workers across Sales Claw source.
- MCP integrations: Plug in internal knowledge servers (Notion / Slack / Postgres).
Where Codex CLI fits
- Image / OG generation: Built-in `image_generation` for blog covers and dynamic OG (the figures in this article come from it).
- Nightly batch headless parallelism: `codex remote-control` + JSON-RPC for N-way parallel from an external scheduler.
- Terminal command sequencing: Codex CLI scores 82.0% on Terminal-Bench 2.0; strong on sh / pwsh workflows.
- Cheap classification / extraction: GPT-5.3-Codex ($1.75 / $14) for short tasks.
Hybrid example
# Run Sales Claw with Claude Code + Codex CLI side by side
# Phase A: Claude Code generates copy (long context + /goal loop)
claude agents \
--add-dir ./company-data \
--mcp-config ./mcp/sales-claw.json \
--permission-mode plan \
--model claude-opus-4-7 \
--task "Generate guardrail-compliant copy for 100 companies"
# Phase B: Codex CLI handles OG image + batch verification
codex remote-control --port 7777 &
node scripts/dispatch-og-generation.cjs --port 7777 --count 100
6. Cost estimate and assumptions
Assumptions
- Volume: 10,000 companies / month (Sales Claw standard scale)
- Triage model: Claude Haiku 4.5 ($1 / $5 per MTok)
- Body model: Claude Sonnet 4.6 ($3 / $15) — comparison: GPT-5.3-Codex ($1.75 / $14)
- FX: 1 USD = 150 JPY
- Exclusions: CAPTCHA ~8% / sales-NG ~12% / no form ~15%
- Average tokens per company: ~4,000 in / ~800 out
- Cache hit ratio: 60% (company info / guardrails reused)
- Variance: ±30%

Under identical conditions, GPT-5.3-Codex (input $1.75 / output $14) comes to about ¥12,500 / month, roughly ¥3,900 / month less than Sonnet 4.6 (about ¥16,400 / month). Measure on real data before adopting, because the quality vs cost tradeoff is workload-specific.
| Item | In-house Claude Code + Codex CLI | Typical sales-agency SaaS |
|---|---|---|
| Monthly range | Approx. ¥12,500–¥16,400 (10K companies, direct API) | Generally ¥300K–¥2M (list scale, send-execution included) |
| Setup cost | 0 (Sales Claw is OSS) | ¥100K–¥1M typical |
| Customization | High (own data / copy rules) | Low–medium (template-bound) |
| In-house skill required | Claude / Codex CLI operation | Not required (SaaS-operated) |
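The in-house monthly range can be sanity-checked from the assumptions listed in this section. The exact formula behind the article's figures is not spelled out, so the sketch below is only one possible reading: excluded companies (CAPTCHA, sales-NG, no form) incur no body-generation cost, cached input is billed at the cache-read rate from the section 4 table, and triage cost is left out because its token counts are not given. All names are ours.

```python
USD_TO_JPY = 150                    # FX assumption from the list above
VOLUME = 10_000                     # companies per month
EXCLUDED = 0.08 + 0.12 + 0.15       # CAPTCHA + sales-NG + no form
TOKENS_IN, TOKENS_OUT = 4_000, 800  # average tokens per company
CACHE_HIT = 0.60                    # share of input tokens read from cache

# USD per MTok: (input, output, cache read), from the pricing table in section 4.
BODY_MODELS = {
    "Claude Sonnet 4.6": (3.00, 15.00, 0.30),
    "GPT-5.3-Codex": (1.75, 14.00, 0.18),
}


def monthly_body_cost_jpy(prices):
    """Monthly body-generation cost in JPY under the assumptions stated above."""
    price_in, price_out, price_cache = prices
    cached = TOKENS_IN * CACHE_HIT
    fresh = TOKENS_IN - cached
    per_company_usd = (
        fresh * price_in + cached * price_cache + TOKENS_OUT * price_out
    ) / 1_000_000
    processed = VOLUME * (1 - EXCLUDED)  # companies that actually reach generation
    return per_company_usd * processed * USD_TO_JPY


if __name__ == "__main__":
    for name, prices in BODY_MODELS.items():
        print(f"{name}: about JPY {monthly_body_cost_jpy(prices):,.0f} / month")
```

Under this reading the body-generation cost comes out at roughly ¥17,100 (Sonnet 4.6) and ¥14,100 (GPT-5.3-Codex) per month. That sits inside the stated ±30% variance around ¥16,400 and ¥12,500 but does not match them exactly, which is expected given the unstated parts of the original calculation.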
7. Risk and safety design for no-human-review operation
Sales Claw doesn't ship CLI agent output directly — it applies pre-send automated checks, sales-NG detection, halting on CAPTCHA detection, send-frequency limits, audit log preservation, and auto-stop conditions to reduce risk structurally (see Figure 5 in section 5 for the full flow).
Legal / compliance
- Anti-spam (JP): Sender-info 4-requirement auto-fill (`preferences.complianceFooter: true`)
- TOS compliance: Pages marked "no sales solicitation" auto-skipped
- No CAPTCHA bypass: Halt with `awaiting_approval`, audit logged
- Send-frequency limit: Suppress repeated sends to the same domain
- Opt-out line: "If you would prefer not to receive..." auto-inserted
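As an illustration of the send-frequency limit, here is a minimal per-domain limiter. It is not Sales Claw's actual implementation: the class name, the interval parameter, and the choice to treat each hostname (including subdomains) as its own domain are assumptions made for the sketch.

```python
from urllib.parse import urlparse


class DomainSendLimiter:
    """Allow at most one send per domain within `min_interval_s` seconds."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self._last_sent = {}  # domain -> timestamp of the last allowed send

    def allow(self, form_url, now):
        """Return True and record the send if the domain is outside its cooldown."""
        domain = (urlparse(form_url).hostname or "").lower()
        if not domain:
            return False  # unparseable URL: refuse rather than guess
        last = self._last_sent.get(domain)
        if last is not None and now - last < self.min_interval_s:
            return False  # still inside the cooldown window for this domain
        self._last_sent[domain] = now
        return True
```

Passing `now` in explicitly (for example `time.time()` in production) keeps the limiter deterministic in tests and lets a nightly batch replay a fixed schedule.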
Prevent CLI agent runaway with auto-stop AND-conditions
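A minimal sketch of such a guard, using the three limits named in the pre-production checklist (max turns, max count, max time). We read "AND-conditions" as all three limits being configured together, with the run halting as soon as any one of them trips; that reading, and every name below, is an assumption rather than Sales Claw's actual code.

```python
import time


class RunGuard:
    """Stop an agent loop once any configured limit is reached."""

    def __init__(self, max_turns, max_sends, max_seconds, clock=time.monotonic):
        self.max_turns = max_turns
        self.max_sends = max_sends
        self.max_seconds = max_seconds
        self._clock = clock        # injectable so tests can control time
        self._started = clock()
        self.turns = 0
        self.sends = 0

    def record_turn(self, sends=0):
        """Call once per agent turn, passing how many sends that turn performed."""
        self.turns += 1
        self.sends += sends

    def stop_reason(self):
        """Return the name of the first exhausted limit, or None to keep going."""
        if self.turns >= self.max_turns:
            return "max_turns"
        if self.sends >= self.max_sends:
            return "max_sends"
        if self._clock() - self._started >= self.max_seconds:
            return "max_seconds"
        return None
```

The outer loop that drives `/goal` or `remote-control` would check `stop_reason()` before every turn and write the returned reason to the audit log when it halts.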
Residual risk
- Missed-detection of new CAPTCHA schemes (mis-send possible until Sales Claw catches up)
- Lag on TOS-revision response (manual update of compliance metadata required)
- Vertical regulations (BFSI etc. require separate review)
- LLM hallucinations (wrong company info / wrong contact-person names)
- CLI-level bugs (both Claude Code and Codex CLI ship weekly, so behavior can drift between releases)
8. Pre-production checklist + closing thoughts
Before going hybrid with Codex CLI + Claude Code
- Documented which CLI handles which task type
- Set AND conditions: max turns + max count + max time
- Verified pre-send-check pass-rate on a 100-company sample
- Confirmed CAPTCHA-bypass automation is OFF
- Confirmed sales-NG detection / skip is ON
- action-log.json saving is enabled
- Compliance footer is enabled (4-requirement auto-fill)
- Opt-out line is part of the copy template
- Send-frequency limit is configured
- /goal completion criteria are explicit in writing
- remote-control caller implements max-iterations
- API keys live in .env (not in the repo)
- Error-time notification is configured
- Correction / retraction procedure is defined
Closing
As of May 2026, Codex CLI 0.130.0 and Claude Code 2.1.143 are best read as "pick by task; combine where it fits" rather than "one CLI to rule them all". Terminal-Bench 2.0 puts Claude-Opus-4.7-based vix at 90.2% (rank 1) and Codex CLI itself at 82.0% (rank 7); SWE-bench Verified aggregates put GPT-5.5 at 88.7% and Claude Opus 4.7 around 82%; Aider Polyglot puts GPT-5 (high) at 88.0%. There is no single winner — rankings reshuffle per benchmark.
For embedded-product use like AI sales automation, "measure 100 companies of your own list / copy rules / target sites" is a far more trustworthy selection signal than benchmark ranks. Sales Claw doesn't ship CLI output as-is — it reduces risk structurally with pre-send checks, sales-NG detection, CAPTCHA halts, audit log, and auto-stop conditions.
Next step: slice 100 companies from your list, run both Claude Code and Codex CLI for body generation, then compare pre-send-check pass rates and quality side by side. Start at the Sales Claw quick start. The OSS source is also free to download.
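When comparing pass rates on a 100-company sample, it helps to see how wide the uncertainty is before declaring a winner. The sketch below computes a Wilson score interval for a pass rate; choosing the Wilson interval is our suggestion rather than part of Sales Claw, and the counts in the usage line are placeholders, not measured results.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Return (low, high) bounds of the Wilson score interval for passes/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Placeholder counts for illustration only; substitute your own measurements.
    for label, passes in (("CLI A", 90), ("CLI B", 84)):
        low, high = wilson_interval(passes, 100)
        print(f"{label}: {passes}/100 passed, 95% interval {low:.3f} to {high:.3f}")
```

At n = 100 the interval spans roughly ten percentage points or more, so a gap of a few points between the two CLIs is usually not enough on its own to pick one.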
This is a convenience English translation of the Japanese-language original. In case of any discrepancy, the Japanese version is authoritative.
Frequently asked questions
Which is stronger — Codex CLI or Claude Code?
Which is cheaper?
Which should I embed in AI sales automation?
Which has the larger context window?
Is it OK to run with no human review?
Why is Claude missing from Aider Polyglot top 10?
GPT-5.3-Codex or GPT-5.5 — which should I pick?
References
This article draws on official X accounts and official documentation as primary sources.
- [01] OpenAI Codex Changelog (official), 2026-05-08
- [02] Claude Code Changelog (official), 2026-05-15
- [03] Anthropic Newsroom: Claude Opus 4.7, 2026-04-16
- [04] Terminal-Bench 2.0 official leaderboard, 2026-05-15
- [05] Aider LLM Leaderboards (official), 2026-05-16
- [06] Anthropic Pricing Docs, 2026-05-16
- [07] OpenAI API Pricing, 2026-05-16
- [08] openai/codex GitHub Releases, 2026-05-15
- [09] OpenAI Devs X account (@OpenAIDevs), 2026-04-23
About the author

中澤 圭志
Sales Claw maintainer
Designs and develops Sales Claw. Writes from the field on B2B sales automation and applied AI.


