Codex CLI vs Claude Code (May 2026): a cross-cut benchmark comparison — SWE-bench, Terminal-Bench, cost

Codex CLI 0.130.0 and Claude Code 2.1.143 reshuffle rankings depending on the benchmark axis. We walk through Terminal-Bench 2.0 / SWE-bench Verified / Aider Polyglot (official + third-party aggregates), API pricing, CLI feature deltas, and AI sales-automation fit — all from the Sales Claw maintainer's seat.

中澤 圭志

@keishi_nakazawa

Sales Claw maintainer

·16 min
This English article is a concise version of the original. For the full Japanese deep-dive, see the Japanese original.

Key Facts

  • Latest: Codex CLI 0.130.0 (2026-05-08) / Claude Code 2.1.143 (2026-05-15)
  • Default model: GPT-5.5 (2026-04-23) / Claude Opus 4.7 (2026-04-16)
  • Terminal-Bench 2.0: vix + Opus 4.7 90.2% (#1) / Codex CLI + GPT-5.5 82.0% (#7)
  • API pricing (in/out): Opus 4.7 $5/$25 / GPT-5.5 $5/$30 / GPT-5.3-Codex $1.75/$14 per MTok

"Which is stronger — Codex CLI or Claude Code? I want a cross-cut comparison on 2026-May official benchmarks, with cost and task fit included, so I can pick the right one in the field." — We answer this using the Terminal-Bench 2.0 official leaderboard, Aider Polyglot, SWE-bench Verified aggregates, plus Anthropic / OpenAI official docs and GitHub releases — through the lens of embedding these agents in AI sales automation.

As of May 2026, the latest versions are Codex CLI 0.130.0 (released 2026-05-08) and Claude Code 2.1.143 (released 2026-05-15). Their backing flagship models are GPT-5.5 (rolled out 2026-04-23) and Claude Opus 4.7 (released 2026-04-16). Benchmark rankings flip depending on which axis you measure, so rather than a simple winner, read these as "task-by-task fit".

Sources: OpenAI Codex official changelog / Claude Code official changelog / Anthropic Newsroom / Terminal-Bench official leaderboard (tbench.ai) / Aider official docs / Claude / OpenAI official pricing pages. We cite official information as of publication; where we rely on third-party aggregates, they are explicitly labeled.

1. What Codex CLI and Claude Code are — May 2026 snapshot

Figure 1: Codex CLI vs Claude Code role overview, covering official benchmarks and feature scope (as of 2026-05-16).
Image description: hand-drawn whiteboard comparison of Codex CLI (GPT-5.5, 0.130.0, Terminal-Bench 82.0%, image generation, remote-control) and Claude Code (Opus 4.7, 2.1.143, subagents, /goal, MCP, 1M context), with a bridge metaphor showing 'pick by task' highlighted in yellow.

Codex CLI (OpenAI)

  • Latest: 0.130.0 stable (2026-05-08). 0.131.0-alpha line in progress.
  • Default model: GPT-5.5 (rolled out 2026-04-23). The Codex-tuned GPT-5.3-Codex is selectable.
  • Strengths: JSON-RPC 2.0 app-server; codex remote-control for full headless control from external processes; built-in image generation (gpt-image-2).
  • Subscriptions: ChatGPT Plus / Pro / Business / Enterprise include Codex (Pro $100 = 5x Plus limits).
  • Package: @openai/codex on npm, launched via codex.

Claude Code (Anthropic)

  • Latest: 2.1.143 (2026-05-15).
  • Default model: Claude Opus 4.7 (2026-04-16). Sonnet 4.6 / Haiku 4.5 selectable. Supports xhigh effort level and Fast mode.
  • Strengths: First-class subagents (claude agents), /goal for completion-condition loops, /ultrareview, Plugin / Skill / MCP. 1M-token context standard.
  • Subscriptions: Claude Pro / Max / Team include Claude Code. Direct API also supported.
  • Package: @anthropic-ai/claude-code on npm, launched via claude.

2. Cross-cut benchmarks — Terminal-Bench, SWE-bench, Aider

Terminal-Bench 2.0 — real terminal tasks

Selected entries from the top 10 of tbench.ai's official leaderboard, as of 2026-05-15:

Rank | Agent | Model | Score | Date
1 | vix | Claude Opus 4.7 | 90.2% ± 2.1 | 2026-05-15
2 | JJAgent | Multiple | 87.1% | 2026-05-15
3 | NexAU-AHE | GPT-5.5 | 84.7% | 2026-05-14
7 | Codex CLI | GPT-5.5 | 82.0% | 2026-04-23
9 | WOZCODE | Claude Opus 4.7 | 80.2% | 2026-05-14

SWE-bench Verified — real GitHub issues

SWE-bench Verified is the heavyweight benchmark using real GitHub issues. OpenAI paused self-reporting in Feb 2026 over contamination concerns, so current numbers come from third-party trackers like Epoch AI.

Figure 2: SWE-bench Verified key scores (third-party aggregates, May 2026).
Image description: bar chart of SWE-bench Verified scores. GPT-5.5 88.7%, GPT-5.3-Codex 85.0%, Claude Opus 4.7 82.0%, Claude Code (Opus 4.6 base, agent) 80.9%.

  • GPT-5.5: 88.7% (third-party aggregate; model released 2026-04-23)
  • GPT-5.3-Codex: 85.0%
  • Claude Opus 4.7: ~82% (third-party aggregate)
  • Claude Code (Opus 4.6 base agent): 80.9%

Anthropic explicitly states in the Opus 4.7 announcement that "excluding any problems that show signs of memorization, Opus 4.7's margin of improvement over Opus 4.6 holds", signaling transparency on contamination. We treat these numbers as "baselines with ±several percent error".

Aider Polyglot — multi-language code editing

Aider's official leaderboard evaluates against 225 Exercism problems across C++ / Go / Java / JavaScript / Python / Rust.

  • gpt-5 (high): 88.0% correct, $29.08 cost (Rank 1)
  • gpt-5 (medium): 86.7% correct, $17.69 cost (Rank 2)
  • o3-pro (high): 84.9% correct, $146.32 cost (Rank 3)
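
One way to read the three rows above is cost per solved problem. The sketch below assumes the percentages are shares of the 225-problem suite and that the cost column is the total USD spent on the run; the function names and data layout are ours for illustration, not Aider's.

```python
# Figures copied from the list above; names and layout are illustrative only.
AIDER_ROWS = [
    ("gpt-5 (high)", 88.0, 29.08),
    ("gpt-5 (medium)", 86.7, 17.69),
    ("o3-pro (high)", 84.9, 146.32),
]
TOTAL_PROBLEMS = 225  # size of the Aider Polyglot suite


def cost_per_solved(percent_correct: float, total_cost_usd: float) -> float:
    """Approximate USD per solved problem (assumes cost covers the whole run)."""
    solved = TOTAL_PROBLEMS * percent_correct / 100
    return total_cost_usd / solved


def rank_by_cost_efficiency(rows):
    """Return (name, usd_per_solved) pairs, cheapest first."""
    scored = [(name, cost_per_solved(pct, cost)) for name, pct, cost in rows]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, usd in rank_by_cost_efficiency(AIDER_ROWS):
        print(f"{name}: ${usd:.3f} per solved problem")
```

On these three rows, gpt-5 (medium) is the cheapest per solved problem and o3-pro (high) the most expensive, even though their accuracy differs by only about three points.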

3. CLI / subagent / plugin feature deltas

Figure 3: CLI feature delta matrix. Same category, different design priorities.
Image description: high-density whiteboard explainer comparing CLI features. Left zone: Codex CLI 0.130.0 (codex remote-control / JSON-RPC 2.0 / image_generation / /vim / /hooks / AWS Bedrock auth). Right zone: Claude Code 2.1.143 (subagents / /goal / /ultrareview / MCP / Plugin / Skill / Worktree). Middle: 'pick by task' flow (coding → both / terminal ops → Codex / long-context research → Claude Code).

Item | Codex CLI 0.130.0 | Claude Code 2.1.143
Default model | GPT-5.5 (Codex also offers GPT-5.3-Codex) | Claude Opus 4.7 (Sonnet 4.6 / Haiku 4.5 switchable)
Context window | GPT-5.4: 272K default / 1.05M long mode (third-party) | 1M tokens standard (Opus 4.7 / 4.6 / Sonnet 4.6)
Subagents | No (parallelism via remote-control from external processes) | Yes (claude agents; 8 flags for session isolation)
Completion-condition loop | No (implement outer loop via turn/start) | Yes (/goal; 2.1.143 fixes background-shell consistency)
Plugins | Yes (workspace sharing / access controls) | Yes (dependency management / cost visibility, 2.1.143)
Image generation | Built-in (gpt-image-2 via image_generation feature) | No (external generation possible via MCP)
Remote control | codex remote-control + JSON-RPC 2.0 app-server | claude agents dispatched background sessions
Code review | In-cmd review prompts | /ultrareview (cloud parallel review)
Modal editing | /vim (added in 0.129.0) | No (standard TUI input)
Lifecycle | /hooks (browser added in 0.129.0) | hooks / skills combination
Windows support | Native PowerShell, sandbox bypass flag | 2.1.143 defaults to -ExecutionPolicy Bypass
Auth | OpenAI API key / ChatGPT subscription / AWS Bedrock | Anthropic API key / Claude Pro / Bedrock / Vertex / Foundry

Where Claude Code wins

  • Long-context investigation: 1M-token context plus subagent isolation suits large-repo overview tasks.
  • Completion-condition loops: /goal drives "until all tests pass" or "until lint = 0" in one command.
  • MCP / Plugin / Skill: Three mature extension hooks make it easy to inject internal knowledge.

Where Codex CLI wins

  • Programmable execution: codex remote-control + JSON-RPC 2.0 enables full control from external scripts — perfect for CI / batch.
  • Image generation: image_generation built in. The figures in this article are generated with it.
  • Token efficiency: Third-party benchmarks report "about 1/4 the token consumption of Claude Code on the same task" (source: morphllm aggregates, reproducibility requires verification).
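
If the roughly 1/4 token claim held for a given task, its effect on direct-API cost could be estimated as below. This is a sketch only: the task size (100K input / 10K output tokens) is a made-up example, the per-MTok prices are the ones quoted in this article, and the 0.25 ratio is the unverified third-party figure.

```python
def task_cost_usd(in_tokens, out_tokens, in_price, out_price, token_ratio=1.0):
    """Cost of one task in USD.

    Prices are USD per million tokens; token_ratio scales both token counts
    (1.0 = baseline consumption, 0.25 = a quarter of the baseline).
    """
    in_mtok = in_tokens * token_ratio / 1_000_000
    out_mtok = out_tokens * token_ratio / 1_000_000
    return in_mtok * in_price + out_mtok * out_price


def break_even_ratio(in_tokens, out_tokens, base_prices, other_prices):
    """Token ratio at which the 'other' model costs the same as the baseline."""
    base = task_cost_usd(in_tokens, out_tokens, *base_prices)
    other_full = task_cost_usd(in_tokens, out_tokens, *other_prices)
    return base / other_full


# Hypothetical task: 100K input / 10K output tokens when run on the baseline.
OPUS_4_7 = (5.0, 25.0)   # (input, output) USD per MTok
GPT_5_5 = (5.0, 30.0)

claude_cost = task_cost_usd(100_000, 10_000, *OPUS_4_7)
codex_cost = task_cost_usd(100_000, 10_000, *GPT_5_5, token_ratio=0.25)

if __name__ == "__main__":
    print(f"Opus 4.7 at full tokens: ${claude_cost:.2f}")
    print(f"GPT-5.5 at 1/4 tokens:   ${codex_cost:.2f}")
    print(f"Break-even ratio: {break_even_ratio(100_000, 10_000, OPUS_4_7, GPT_5_5):.4f}")
```

For this token mix, GPT-5.5's higher output price is offset as soon as the run uses fewer than about 94% of the baseline tokens, so the conclusion depends far more on the measured token ratio than on the price difference.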

4. Token efficiency, context windows, API pricing

Model | Input (USD/MTok) | Output (USD/MTok) | Cache Read (USD/MTok) | Context
Claude Opus 4.7 | $5.00 | $25.00 | $0.50 | 1M
Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | 1M
Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | 200K
GPT-5.5 | $5.00 | $30.00 | $0.50 (agg.) | 272K-1M
GPT-5.4 | $2.50 | $15.00 | $0.25 (agg.) | 272K-1M
GPT-5.3-Codex | $1.75 | $14.00 | $0.18 (agg.) | 200K+ (agg.)

Sources: Anthropic Pricing Docs / OpenAI Pricing. Prices in USD; subject to FX moves and official revisions.

Figure 4: Context window vs output price. Sonnet 4.6 and GPT-5.3-Codex occupy the 'long context × low price' zone.
Image description: scatter chart of main models, context window (X axis, thousand tokens) vs output price (Y axis, USD per MTok). Claude Opus 4.7 (1000K, $25), Sonnet 4.6 (1000K, $15), Haiku 4.5 (200K, $5), GPT-5.5 (272K-1M, $30), GPT-5.4 (272K-1M, $15), GPT-5.3-Codex (200K+, $14).

Item | ChatGPT family (Codex CLI included) | Claude family (Claude Code included)
Free | Free (ads) | Free (limits apply)
Individual light | Go $8/mo (US, ads) | None (no direct equivalent)
Individual standard | Plus $20/mo (includes Codex) | Pro $20/mo (includes Claude Code)
Individual upper | Pro $100/mo (5x Plus limits) / Pro $200 | Max $100 / $200 (power-user ceilings)
Business | Business $25/seat (monthly) | Team / Enterprise (contact sales)
Direct API | Per-token rates above | Per-token rates above

Try Sales Claw — the loop CLI agents sit behind.

Free and MIT-licensed. You can also try the live demo without installing.

5. AI sales automation fit — Sales Claw perspective

Figure 5: Sales Claw full loop, from CLI output (Claude Code / Codex CLI) through pre-send checks and audit log to send.
Image description: hand-drawn blackboard diagram showing the full Sales Claw loop. CLI agent output (Claude Code / Codex CLI) → pre-send checks (sales-NG detection / compliance footer / opt-out line) → CAPTCHA detected = awaiting_approval halt → audit log saved to action-log.json → only checks-passed items get sent. AND-bound termination conditions (count / elapsed / turn limits) prevent runaway.

Sales Claw is an OSS tool designed to reduce mis-send and TOS-violation risk through policy control, pre-send automated checks, sales-NG detection, halting on CAPTCHA detection, send-frequency limits, audit log preservation, and auto-stop conditions. When embedding CLI agents into the loop, Codex CLI and Claude Code are complementary, not mutually exclusive. In our own internal verification (May 2026, 100-company sample on the Sales Claw repo), the most stable setup ran two things in parallel: Claude Code's /goal loop, which iterated until approach-guardrail violations reached zero, and Codex CLI's image_generation, which produced OG cards. See our walkthrough on bundling claude agents and codex remote-control into one parallel headless platform for details.

Where Claude Code fits

  • Form body generation: 1M context can hold company info, past send history, and approach guardrails in one shot.
  • Diff-text checking: /goal "loop until no approach-guardrail violations"
  • Large-repo overview: claude agents parallelizes workers across Sales Claw source.
  • MCP integrations: Plug in internal knowledge servers (Notion / Slack / Postgres).

Where Codex CLI fits

  • Image / OG generation: Built-in image_generation for blog covers and dynamic OG (the figures in this article come from it).
  • Nightly batch headless parallelism: codex remote-control + JSON-RPC for N-way parallel from an external scheduler.
  • Terminal command sequencing: Codex CLI scores 82.0% on Terminal-Bench 2.0; strong on sh / pwsh workflows.
  • Cheap classification / extraction: GPT-5.3-Codex ($1.75 / $14) for short tasks.

Hybrid example

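# Run Sales Claw with Claude Code + Codex CLI side by side.
# Flags are shown as used in this article; confirm them against
# `claude --help` and `codex --help` for your installed versions.

# Phase A: Claude Code generates copy (long context + /goal loop)
claude agents \
  --add-dir ./company-data \
  --mcp-config ./mcp/sales-claw.json \
  --permission-mode plan \
  --model claude-opus-4-7 \
  --task "Generate guardrail-compliant copy for 100 companies"

# Phase B: Codex CLI handles OG image + batch verification
# Start the app-server in the background and stop it when this script exits.
codex remote-control --port 7777 &
CODEX_PID=$!
trap 'kill "$CODEX_PID" 2>/dev/null' EXIT
node scripts/dispatch-og-generation.cjs --port 7777 --count 100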
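
The dispatcher script in Phase B is not shown in this article. As a rough sketch of what such a caller has to build, the snippet below (written in Python for brevity) constructs JSON-RPC 2.0 request envelopes and caps the number of turns. Only the envelope fields (jsonrpc, id, method, params) come from the JSON-RPC 2.0 specification. The method name turn/start is the one mentioned in the feature table; the params shape, the prompt text, and MAX_TURNS are hypothetical, and the transport is deliberately left out. Check the Codex app-server documentation for the real schema.

```python
import json

MAX_TURNS = 5  # hypothetical cap so the outer loop cannot run away


def make_request(request_id: int, method: str, params: dict) -> str:
    """Serialize one JSON-RPC 2.0 request object."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    )


def plan_turns(prompts, max_turns: int = MAX_TURNS):
    """Build at most max_turns requests; prompts beyond the cap are not sent."""
    capped = list(prompts)[:max_turns]
    return [
        # "turn/start" is taken from this article; the params shape is assumed.
        make_request(i, "turn/start", {"prompt": prompt})
        for i, prompt in enumerate(capped, start=1)
    ]


if __name__ == "__main__":
    prompts = [f"Generate OG card for company {n}" for n in range(1, 11)]
    for line in plan_turns(prompts):
        print(line)
```

Capping the batch on the caller side matters because, per the feature table, Codex CLI has no built-in completion-condition loop; the external process is the only place a turn limit can live.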

6. Cost estimate and assumptions

Assumptions

  • Volume: 10,000 companies / month (Sales Claw standard scale)
  • Triage model: Claude Haiku 4.5 ($1 / $5 per MTok)
  • Body model: Claude Sonnet 4.6 ($3 / $15) — comparison: GPT-5.3-Codex ($1.75 / $14)
  • FX: 1 USD = 150 JPY
  • Exclusions: CAPTCHA ~8% / sales-NG ~12% / no form ~15%
  • Average tokens per company: ~4,000 in / ~800 out
  • Cache hit ratio: 60% (company info / guardrails reused)
  • Variance: ±30%

Figure 6: Cost comparison for 10K-company body generation (Sonnet 4.6 vs GPT-5.3-Codex).
Image description: bar chart for processing 10,000 companies/month with Claude Sonnet 4.6 vs GPT-5.3-Codex. Sonnet 4.6 = ~¥16,400/mo; GPT-5.3-Codex = ~¥12,500/mo. Difference ~¥3,900/mo. FX 150 JPY, 60% cache hit assumption.

Under the assumptions above, body generation with Claude Sonnet 4.6 comes to about ¥16,400 / month, and GPT-5.3-Codex (input $1.75 / output $14) to about ¥12,500 / month, a difference of roughly ¥3,900 / month. Measure on real data before adopting; the quality vs cost tradeoff is workload-specific.
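
The article does not spell out the formula behind these figures, so the following is a simplified model for running your own numbers rather than a reproduction. It assumes the three exclusion rates add up (8% + 12% + 15% = 35%) and that excluded companies incur no body-generation cost, that the 60% cache hit ratio applies to input tokens only, and that triage cost is ignored. All names are ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_usd: float   # USD per MTok, uncached input
    cached_usd: float  # USD per MTok, cache read
    output_usd: float  # USD per MTok, output


# Prices from the API pricing table in section 4.
SONNET_4_6 = Pricing(3.00, 0.30, 15.00)
GPT_5_3_CODEX = Pricing(1.75, 0.18, 14.00)


def monthly_cost_jpy(
    pricing: Pricing,
    companies: int = 10_000,
    excluded_share: float = 0.35,  # assumption: exclusions are additive
    in_tokens: int = 4_000,
    out_tokens: int = 800,
    cache_hit: float = 0.60,
    usd_jpy: float = 150.0,
) -> float:
    """Estimated monthly body-generation cost in JPY under the stated model."""
    processed = companies * (1 - excluded_share)
    in_mtok = processed * in_tokens / 1_000_000
    out_mtok = processed * out_tokens / 1_000_000
    # Blend cached and uncached input prices by the cache hit ratio.
    blended_input = (1 - cache_hit) * pricing.input_usd + cache_hit * pricing.cached_usd
    return (in_mtok * blended_input + out_mtok * pricing.output_usd) * usd_jpy


if __name__ == "__main__":
    print(f"Sonnet 4.6:    ¥{monthly_cost_jpy(SONNET_4_6):,.0f}")
    print(f"GPT-5.3-Codex: ¥{monthly_cost_jpy(GPT_5_3_CODEX):,.0f}")
```

Under these assumptions the model gives about ¥17,100 for Sonnet 4.6 and about ¥14,100 for GPT-5.3-Codex. Both land inside the stated ±30% band around the ¥16,400 and ¥12,500 figures above without matching them exactly, which is a reminder of how sensitive the estimate is to how exclusions and caching are modeled.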

Item | In-house Claude Code + Codex CLI | Typical sales-agency SaaS
Monthly range | Approx. ¥12,500–¥16,400 (10K companies, direct API) | Generally ¥300K–¥2M (list scale, send-execution included)
Setup cost | 0 (Sales Claw is OSS) | ¥100K–¥1M typical
Customization | High (own data / copy rules) | Low–medium (template-bound)
In-house skill required | Claude / Codex CLI operation | Not required (SaaS-operated)

7. Risk and safety design for no-human-review operation

Sales Claw doesn't ship CLI agent output directly — it applies pre-send automated checks, sales-NG detection, halting on CAPTCHA detection, send-frequency limits, audit log preservation, and auto-stop conditions to reduce risk structurally (see Figure 5 in section 5 for the full flow).

  • Anti-spam (JP): Auto-fills the four required sender-information items (preferences.complianceFooter: true)
  • TOS compliance: Pages marked "no sales solicitation" auto-skipped
  • No CAPTCHA bypass: Halt with awaiting_approval, audit logged
  • Send-frequency limit: Suppress repeated sends to the same domain
  • Opt-out line: "If you would prefer not to receive..." auto-inserted
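
The checks above can be pictured as a single gate that every generated message passes through before sending. The sketch below is not Sales Claw's actual code: the function name, the Target fields, the marker list, and the skip / send status strings are hypothetical. Only awaiting_approval is taken from this article, and the opt-out line and footer are passed in by the caller rather than hard-coded.

```python
from dataclasses import dataclass

# Hypothetical marker list; a real deployment would maintain its own.
NO_SOLICITATION_MARKERS = ("no sales solicitation",)


@dataclass
class Target:
    domain: str
    page_text: str
    has_captcha: bool


def pre_send_check(target, body, opt_out_line, footer, recently_sent_domains):
    """Return (status, final_body); status is 'skip', 'awaiting_approval', or 'send'."""
    text = target.page_text.lower()
    if any(marker in text for marker in NO_SOLICITATION_MARKERS):
        return "skip", None  # TOS compliance: page forbids solicitation
    if target.has_captcha:
        return "awaiting_approval", None  # never bypass; halt for a human
    if target.domain in recently_sent_domains:
        return "skip", None  # send-frequency limit for the same domain
    final_body = body
    if opt_out_line not in final_body:
        final_body += "\n\n" + opt_out_line  # opt-out line auto-inserted
    if footer not in final_body:
        final_body += "\n\n" + footer  # compliance footer auto-filled
    return "send", final_body
```

Returning a status instead of sending directly keeps the decision auditable: the caller can write every (domain, status) pair to the audit log before anything leaves the machine.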

Prevent CLI agent runaway with auto-stop AND-conditions
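
A minimal sketch of such a guard follows. It assumes that "AND-conditions" means the loop may continue only while the count, elapsed-time, and turn limits are all still satisfied, so exceeding any single limit stops the run. The class name, field names, and default limits are hypothetical, not Sales Claw settings.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StopGuard:
    max_count: int = 100         # hypothetical cap on sends
    max_seconds: float = 3600.0  # hypothetical cap on elapsed time
    max_turns: int = 20          # hypothetical cap on agent turns
    count: int = 0
    turns: int = 0
    started: float = field(default_factory=time.monotonic)

    def may_continue(self, now=None) -> bool:
        """True only while all three limits hold (the AND of the conditions)."""
        now = time.monotonic() if now is None else now
        return (
            self.count < self.max_count
            and (now - self.started) < self.max_seconds
            and self.turns < self.max_turns
        )


if __name__ == "__main__":
    guard = StopGuard(max_count=3, max_seconds=5.0, max_turns=10)
    while guard.may_continue():
        guard.turns += 1   # one agent turn
        guard.count += 1   # one item processed
    print(f"stopped after {guard.count} items and {guard.turns} turns")
```

Using a monotonic clock avoids the guard misfiring when the system clock is adjusted during a long nightly batch.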

Residual risk

  • Missed detection of new CAPTCHA schemes (mis-sends are possible until Sales Claw catches up)
  • Lag in responding to TOS revisions (compliance metadata must be updated manually)
  • Vertical regulations (BFSI etc. require separate review)
  • LLM hallucinations (wrong company info / wrong contact-person names)
  • CLI-level bugs (both Claude Code and Codex CLI ship weekly; behavior can drift)

8. Pre-production checklist + closing thoughts

Before going hybrid with Codex CLI + Claude Code

  • Documented which CLI handles which task type
  • Set AND conditions: max turns + max count + max time
  • Verified pre-send-check pass-rate on a 100-company sample
  • Confirmed CAPTCHA-bypass automation is OFF
  • Confirmed sales-NG detection / skip is ON
  • action-log.json saving is enabled
  • Compliance footer is enabled (4-requirement auto-fill)
  • Opt-out line is part of the copy template
  • Send-frequency limit is configured
  • /goal completion criteria are explicit in writing
  • remote-control caller implements max-iterations
  • API keys live in .env (not in the repo)
  • Error-time notification is configured
  • Correction / retraction procedure is defined

Closing

As of May 2026, Codex CLI 0.130.0 and Claude Code 2.1.143 are best read as "pick by task; combine where it fits" rather than "one CLI to rule them all". Terminal-Bench 2.0 puts Claude-Opus-4.7-based vix at 90.2% (rank 1) and Codex CLI itself at 82.0% (rank 7); SWE-bench Verified aggregates put GPT-5.5 at 88.7% and Claude Opus 4.7 around 82%; Aider Polyglot puts GPT-5 (high) at 88.0%. There is no single winner — rankings reshuffle per benchmark.

For embedded-product use like AI sales automation, measuring a 100-company sample drawn from your own list, copy rules, and target sites is a far more trustworthy selection signal than benchmark ranks. Sales Claw doesn't ship CLI output as-is; it reduces risk structurally with pre-send checks, sales-NG detection, CAPTCHA halts, audit logs, and auto-stop conditions.

Next step: slice 100 companies from your list, run both Claude Code and Codex CLI for body generation, then compare pre-send-check pass rates and quality side by side. Start at the Sales Claw quick start. The OSS source is also free to download.
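
The side-by-side comparison can be as simple as the sketch below. The result format (one status string per sampled company, with "send" meaning the pre-send checks passed) and the agent labels are assumptions for illustration; adapt them to whatever your audit log actually records.

```python
from collections import Counter


def pass_rate(statuses) -> float:
    """Share of sampled companies whose generated body passed pre-send checks."""
    statuses = list(statuses)
    if not statuses:
        raise ValueError("empty sample")
    return Counter(statuses)["send"] / len(statuses)


def compare(results_by_agent: dict) -> dict:
    """Map each agent label to its pass rate on the same sample."""
    return {agent: pass_rate(statuses) for agent, statuses in results_by_agent.items()}


if __name__ == "__main__":
    # Hypothetical outcomes for a tiny sample; replace with your 100-company run.
    sample = {
        "claude-code": ["send", "send", "skip", "send"],
        "codex-cli": ["send", "awaiting_approval", "skip", "send"],
    }
    for agent, rate in compare(sample).items():
        print(f"{agent}: {rate:.0%} passed pre-send checks")
```

Pass rate alone says nothing about copy quality, so pair it with a manual read of a few bodies from each agent.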

This is a convenience English translation of the Japanese-language original. In case of any discrepancy, the Japanese version is authoritative.


Frequently asked questions

Which is stronger — Codex CLI or Claude Code?
It depends on the benchmark axis. Terminal-Bench 2.0 (as of 2026-05-15) puts Claude-Opus-4.7-based vix at 90.2% (#1) and Codex CLI + GPT-5.5 at 82.0% (#7). SWE-bench Verified third-party aggregates put GPT-5.5 at 88.7% and Claude Opus 4.7 around 82%. Aider Polyglot puts GPT-5 (high) at 88.0% (#1). Rankings reshuffle per task type (terminal ops / GitHub issue solving / multi-language code editing), so the most trustworthy selection signal is 'measure 100 samples on your own task'.
Which is cheaper?
On direct API, GPT-5.3-Codex (input $1.75 / output $14 per MTok) is the cheapest of the body-generation candidates compared here; Claude Haiku 4.5 ($1 / $5) is cheaper per token but is used only for triage in our estimate. Claude Opus 4.7 = $5/$25; GPT-5.5 = $5/$30. For 10K companies/month body generation (Sonnet 4.6 with 60% cache hit) the estimate is ~¥16,400/mo; switching to GPT-5.3-Codex under identical conditions drops it to ~¥12,500/mo. FX, cache hit ratio, and exclusion rates move the figure, so a 100-company sample is recommended before production. On subscriptions, ChatGPT Plus $20/mo (includes Codex) and Claude Pro $20/mo (includes Claude Code) sit at parity.
Which should I embed in AI sales automation?
A complementary hybrid is realistic. Claude Code is strong on long context (1M tokens) and subagents / /goal — good for form body generation and approach-guardrail violation checks. Codex CLI is strong on the JSON-RPC 2.0 codex remote-control and built-in image_generation (gpt-image-2) — good for nightly batch parallel headless and OG image generation. In Sales Claw, a division of labor — Claude Code for body generation + pre-send checks, Codex CLI for image generation + parallel batches — works in practice.
Which has the larger context window?
Claude Opus 4.7 / 4.6 / Sonnet 4.6 all ship 1M tokens standard with no surcharge on the full 1M (Anthropic Pricing Docs). On Codex, GPT-5.4 has 272K default and ~1.05M in long mode (third-party aggregate; official OpenAI confirmation desirable). For tasks that pack a giant repo or extensive past logs into a single prompt, Claude Code currently has the edge.
Is it OK to run with no human review?
No design is '100% safe'. Sales Claw doesn't ship CLI output as-is — it reduces risk structurally with policy control, pre-send automated checks, sales-NG detection, halting on CAPTCHA detection, send-frequency limits, audit log preservation, and auto-stop conditions (count + elapsed + turn caps in AND). Residual risks include missed-detection of new CAPTCHA schemes, lag on TOS-revision response, and LLM hallucinations — review the audit log carefully during early operation.
Why is Claude missing from Aider Polyglot top 10?
Aider evaluates models via direct API calls; it is not an integrated evaluation that includes an agent harness such as Claude Code. On the harness-inclusive Terminal-Bench 2.0, Claude-Opus-4.7-based vix is #1 at 90.2%, so the gap reflects a difference in evaluation axis. Aider's top 10 (as fetched 2026-05-16) is dominated by GPT-5 variants, which may partly reflect how well GPT-5's response style fits Aider's edit loop.
GPT-5.3-Codex or GPT-5.5 — which should I pick?
GPT-5.5 is the latest flagship ($5 input / $30 output per MTok), scoring 88.7% on SWE-bench Verified. GPT-5.3-Codex is the Codex-tuned dedicated model ($1.75 / $14), scoring 85.0%. For 'top quality at any cost', GPT-5.5; for 'good enough at much better cost efficiency', GPT-5.3-Codex. For repetitive Sales Claw-style workloads, GPT-5.3-Codex often wins on cost.

About the author

中澤 圭志

Sales Claw maintainer

Designs and develops Sales Claw. Writes from the field on B2B sales automation and applied AI.