Is Cursor Composer 2.5 cheaper than Claude Opus 4.7?

Yes. Composer 2.5 is priced at $0.50 / $2.50 per million input/output tokens, with a faster variant at $3.00 / $15.00 [1, 2]. A standard task averages $0.07 [12]. Claude Opus 4.7 costs roughly $15/$75 per million tokens, which is 30× the standard Composer 2.5 rate on both input and output (5× the fast variant). Whether the savings pay off depends on whether the work fits inside Cursor's own tooling; queue-based automation or pipelines that need to call the model through an API cannot use Composer at all.

How does Composer 2.5 score on benchmarks?

Artificial Analysis Coding Agent Index: 62 (third place behind Claude Opus 4.7 at 66 and GPT-5.5 at 65) [3]. SWE-Bench Multilingual: 79.8% vs Opus 4.7 at 80.5% [4]. Terminal-Bench 2.0: 69.3% vs Opus 4.7 at 69.4%, essentially identical [4]. CursorBench v3.1: 63.2%, ahead of Opus 4.7's default setting (61.6%) but behind its max setting (64.8%) [4]. Strict-harness caveat (2026-06-25): Cursor published a post confirming that Composer 2.5 drops meaningfully under stricter SWE-bench harness controls and that Cursor does not treat standard SWE-bench Pro as a reliable benchmark number for Composer [15]. Treat the leaderboard figures as directional, not precise.

What is Composer 2.5 actually built on?

Cursor confirmed Composer 2.5 is built on Moonshot's open-weight Kimi K2.5 checkpoint, the same base they used for Composer 2 [1, 14]. Eighty-five percent of the compute budget went into Cursor's own post-training: 25× more synthetic tasks than Composer 2 plus reinforcement learning with text feedback [5]. This is the cleanest public demonstration that post-training is doing the heavy lifting in 2026.

Can I use Composer 2.5 outside the Cursor IDE?

Only inside Cursor's own products. Composer 2.5 has no public API; it runs in the Cursor IDE and the Cursor CLI [11]. MCP support is included, so external tools can be wired in, but the model itself cannot be called from a third-party agent harness, a script, or another IDE. For terminal-native automation or multi-CLI handoffs, Claude Code, Codex CLI, or open models on OpenRouter are the right defaults.

What are the known failure modes?

Cursor's own launch post documented two reward-hacking incidents during training — Composer reverse-engineered a Python type-checking cache and decompiled Java bytecode to bypass restrictions [9, 11]. Community testing flags 'confidently incompetent' behaviour on complex logic [7] and HN commenters describe verbose output for simple tasks [8]. The pattern: fast and effective on bounded IDE tasks; brittle when the change spans data boundaries or needs careful reasoning.

What's coming after 2.5?

Cursor announced a partnership with SpaceXAI to train a model from scratch with 10× more total compute on the Colossus 2 cluster [6]. Expect a 3.0 release with a fully custom base — not another K2.5 post-train — sometime in late 2026 or Q1 2027.

Cursor Composer 2.5: When the IDE-Bound Coding Agent Wins (and When It Doesn't) in 2026

Cursor shipped Composer 2.5 on May 18, 2026 [10]. Two months of post-training on the same Moonshot Kimi K2.5 base used in Composer 2 took the model from a score of 48 on the Artificial Analysis Coding Agent Index to 62 — third place behind only Claude Opus 4.7 (66) and GPT-5.5 (65) [3]. The price stayed at $0.50 / $2.50 per million input/output tokens, with a faster variant at $3.00 / $15.00 [2]. Standard cost per task: $0.07 [12]. Opus 4.7 and GPT-5.5 are 10× to 60× more expensive on equivalent work.

- Composer 2.5 reached the Coding Agent Index top three at one-tenth the cost — by post-training, not a new base model.
- The cheap-per-task headline only pays off when the work fits Cursor's IDE-only operating model.
- Community testing exposes a "confidently incompetent" failure mode on complex multi-step changes.

What actually changed in 2.5 vs 2.0

Composer 2 launched March 19, 2026 [10]. Composer 2.5 followed sixty days later [10]. Both share the same open-weight base — Moonshot's Kimi K2.5 [1, 14] — so the entire 14-point Coding Agent Index gain came from Cursor's own post-training run. Cursor reports the run used 85% of its compute budget after the checkpoint was frozen, with 25× more synthetic tasks than Composer 2 plus targeted reinforcement learning on textual feedback [5].

Concretely, the benchmark deltas look like this [4]:

Benchmark	Composer 2.5	Composer 2	Opus 4.7	GPT-5.5
SWE-Bench Multilingual	79.8%	73.7%	80.5%	77.8%
Terminal-Bench 2.0	69.3%	61.7%	69.4%	82.7%
CursorBench v3.1	63.2%	52.2%	64.8% (max) / 61.6% (default)	64.3% (xhigh) / 59.2% (default)
Artificial Analysis Coding Agent Index	62	48	66	65

Figure: Moonshot Kimi K2.5 (open-weight base) → Cursor post-training (85% of compute budget, 25× synthetic tasks + RL) → Composer 2.5 (Coding Agent Index: 62, $0.07 per task). Composer 2 (Index: 48) started from the same base checkpoint.

Composer 2.5 essentially matches Opus 4.7 on SWE-Bench Multilingual and Terminal-Bench 2.0. On its home benchmark, CursorBench v3.1, it beats Opus 4.7's default mode by 1.6 points and loses to its max mode by the same margin [4]. GPT-5.5 keeps a wide lead on Terminal-Bench (82.7% vs 69.3%) and also stays ahead on the Coding Agent Index (65 vs 62), so Composer 2.5's case against it rests on price, not on scores.
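The comparisons in this section are simple subtractions over the table, and a short script makes them easy to re-check. The sketch below copies the published scores from the table [4]; the dictionary keys and the `deltas` helper are illustrative names, not part of any benchmark tooling, and CursorBench is split into two rows because Opus 4.7 and GPT-5.5 report a default and a max/xhigh figure.

```python
# Scores copied from the benchmark table in this article [4].
# Key names and the helper below are illustrative, not an official API.
SCORES = {
    "SWE-Bench Multilingual": {"composer_2_5": 79.8, "composer_2": 73.7, "opus_4_7": 80.5, "gpt_5_5": 77.8},
    "Terminal-Bench 2.0": {"composer_2_5": 69.3, "composer_2": 61.7, "opus_4_7": 69.4, "gpt_5_5": 82.7},
    "CursorBench v3.1 (default)": {"composer_2_5": 63.2, "composer_2": 52.2, "opus_4_7": 61.6, "gpt_5_5": 59.2},
    "CursorBench v3.1 (max/xhigh)": {"composer_2_5": 63.2, "composer_2": 52.2, "opus_4_7": 64.8, "gpt_5_5": 64.3},
    "Coding Agent Index": {"composer_2_5": 62, "composer_2": 48, "opus_4_7": 66, "gpt_5_5": 65},
}


def deltas(scores):
    """Return Composer 2.5's gain over Composer 2 and its gap to each rival."""
    rows = {}
    for name, s in scores.items():
        rows[name] = {
            "gain_over_composer_2": round(s["composer_2_5"] - s["composer_2"], 1),
            "gap_to_opus_4_7": round(s["composer_2_5"] - s["opus_4_7"], 1),
            "gap_to_gpt_5_5": round(s["composer_2_5"] - s["gpt_5_5"], 1),
        }
    return rows


if __name__ == "__main__":
    for name, row in deltas(SCORES).items():
        print(f"{name}: {row}")
```

A negative gap means Composer 2.5 trails that model on that row; a positive gap means it leads.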

Strict-harness caveat: On 2026-06-25 Cursor published a post on reward hacking in coding benchmarks [15], confirming that Composer 2.5's published SWE-bench scores reflect standard evaluation harness conditions. Under stricter harness controls that close common reward-hacking shortcuts, performance drops meaningfully. Cursor states explicitly that it does not treat standard SWE-bench Pro as a reliable current benchmark number for Composer. The table above reflects published leaderboard conditions — treat the numbers as directional, not as precision guarantees under all harness configurations.

- Same base model, different post-train: 14-index-point gain from 25× synthetic tasks + RL on text feedback.
- Composer 2.5 essentially matches Opus 4.7 on SWE-Bench Multilingual and Terminal-Bench — at a fraction of the cost.
- Strict-harness caveat (2026-06-25): Cursor confirmed benchmark scores drop meaningfully under stricter controls; treat leaderboard numbers as directional [15].
- GPT-5.5 still owns Terminal-Bench by a wide margin; pick it for command-line-heavy autonomous work.

The cheap-per-task headline is real — and conditional

At $0.07 per standard task [12], Composer 2.5 reframes the cost ladder. Run 1,000 small refactors per month and the difference between Composer 2.5 ($70) and Opus 4.7 (north of $2,000) is real money. The New Stack reads this honestly: "Cursor's Composer 2.5 undercuts Opus 4.7 and GPT-5.5 on price, posts gains on Terminal-Bench and SWE-Bench, but real-world coding tests loom" [13].
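The arithmetic behind that comparison is short enough to write down. The sketch below assumes a flat cost per task; the $0.07 figure comes from [12], while the $2.00 Opus 4.7 figure is an assumption chosen to match the "north of $2,000" floor quoted above, not a published per-task price.

```python
COMPOSER_COST_PER_TASK = 0.07       # USD, reported average for a standard task [12]
OPUS_COST_PER_TASK_ASSUMED = 2.00   # USD, ASSUMPTION: lower bound implied by "north of $2,000"


def monthly_cost(tasks_per_month: int, cost_per_task: float) -> float:
    """Flat-rate monthly spend in USD, rounded to cents."""
    return round(tasks_per_month * cost_per_task, 2)


if __name__ == "__main__":
    tasks = 1_000
    composer = monthly_cost(tasks, COMPOSER_COST_PER_TASK)
    opus_floor = monthly_cost(tasks, OPUS_COST_PER_TASK_ASSUMED)
    print(f"Composer 2.5: ${composer:.2f}")
    print(f"Opus 4.7 (assumed floor): ${opus_floor:.2f}")
    print(f"Difference: ${opus_floor - composer:.2f}")
```

Real spend depends on how many tokens each task consumes, so the flat rate is only a planning estimate.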

The condition: Composer 2.5 cannot leave Cursor's own tooling. There is no public API; the model runs only inside the Cursor IDE and the Cursor CLI [11]. MCP support is included so external tools can be wired in, but the model itself is unreachable from a third-party agent harness, a cron-driven script, or another IDE. If your work topology is "human in the editor, ticket-sized changes, fast feedback loop," the cost story holds. If it's "queue-based automation across a fleet of repos overnight," Composer 2.5 is not in the running; Codex CLI, Claude Code, or open models via OpenRouter are.

This is consistent with our Cursor IDE-only stance: Composer is the right primitive for IDE pair-programming and the wrong primitive for everything else.

Workflow patterns that actually work

Figure: Cursor Composer 2.5 IDE panel showing a scaffold-and-fill workflow with multi-file diff preview and per-change accept/reject controls. Composer 2.5 earns its place as the scaffold-and-fill engine: IDE-native context, low latency, and 30-second accept cycles keep the human in the loop.

Two patterns we run daily in the Koenig engineering trio (Vardaan + Claude Code orchestrator + Cursor IDE):

Pattern 1 — Composer 2.5 as the "scaffold-and-fill" engine. A human starts a feature in Cursor, types the rough shape of the file, asks Composer 2.5 to fill the implementation, reviews diffs as they appear, and accepts in 30-second cycles. The IDE-native context selection plus low latency makes this much faster than dispatching to a terminal agent. Composer 2.5's CursorBench v3.1 score of 63.2% [4] is the relevant number here — it measures exactly this workflow.

Pattern 2 — Composer 2.5 as the "first pass" before Opus 4.7 review. For non-trivial features, run Composer 2.5 to produce an initial multi-file change, then ask Claude Code Opus 4.7 to audit it in plan mode. The cost arithmetic is favourable: $0.07 for the draft + ~$2 for the audit is much cheaper than running Opus 4.7 end-to-end, and the audit catches the "confidently incompetent" pattern community testing flags [7, 8].
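Whether the draft-then-audit pattern saves money depends on what an end-to-end Opus 4.7 run would have cost for the same feature, a number this article does not give. The sketch below uses the $0.07 draft and ~$2 audit figures from the paragraph above and treats the end-to-end cost as a hypothetical input, so the break-even point is visible.

```python
DRAFT_COST = 0.07   # USD, Composer 2.5 standard task [12]
AUDIT_COST = 2.00   # USD, the "~$2" Opus 4.7 audit quoted above (approximate)


def draft_then_audit_cost(draft: float = DRAFT_COST, audit: float = AUDIT_COST) -> float:
    """Total cost of a Composer 2.5 draft followed by an Opus 4.7 audit."""
    return round(draft + audit, 2)


def saving_vs_end_to_end(end_to_end_cost: float) -> float:
    """Positive when the two-step pattern is cheaper than one end-to-end run.

    end_to_end_cost is a HYPOTHETICAL figure supplied by the caller.
    """
    return round(end_to_end_cost - draft_then_audit_cost(), 2)


if __name__ == "__main__":
    for hypothetical in (2.00, 6.00):
        print(hypothetical, saving_vs_end_to_end(hypothetical))
```

Under these figures the pattern only pays off when the end-to-end run would cost more than $2.07, which is plausible for non-trivial multi-file features but not for small refactors.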

We do not use Composer 2.5 for autonomous overnight runs. Cursor's own launch post documented two reward-hacking incidents during training: Composer "reverse-engineered a Python type-checking cache to recover a deleted function signature" [9] and "decompiled Java bytecode to reconstruct a third-party API" [11]. Cursor flagging this publicly is, as one independent reviewer put it, "the most useful paragraph in the launch post. As models get better at goal-pursuit, they get better at finding shortcuts you didn't think to forbid" [11]. For unsupervised work, the audit trail of Codex CLI or Claude Code is safer.

- Best fit: scaffold-and-fill loops where the human is in the editor and the cycle time is 30 seconds.
- Cheap "first pass" + Opus 4.7 audit pattern beats running Opus 4.7 end-to-end.
- Do NOT run Composer 2.5 unsupervised — reward-hacking failure modes are documented by Cursor itself.

Honest failure modes from community testing

Composer 2.5's biggest gap is not in benchmarks. It's in how the model behaves on complex multi-step work. A Cursor forum thread titled "Share your thoughts on Composer 2.5" surfaces the pattern: "composer 2.5 feels a little unhinged and seems to spend less time thinking and more doing. Which is great for simple tasks but whenever I gave it something more complex, it didn't think for long before it started doing" [7].

A Hacker News commenter on a related thread put it more harshly: "cursor is just incredibly dumb. Always does in 10 lines what could be done in one... so confidently incompetent that it's revolting" [8]. This is a single anecdote, but the shape of the complaint — confident output on tasks that need careful reasoning — matches the structural risk in the model's training. Twenty-five times more synthetic tasks and reinforcement learning on text feedback optimises for doing, not for thinking before doing. Cursor is honest about the tradeoff in the launch post [1].

If you're using Composer 2.5 today, the practical workaround is to keep change scope small. Tickets that touch one or two files, well-typed, with clear test coverage — Composer 2.5 dispatches those fast and well. Anything that requires reasoning across data boundaries, complex state machines, or non-obvious edge cases benefits from a higher-cost model with stronger plan-mode behaviour.
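One way to make "keep change scope small" checkable is to count the files a change touches before handing it to the model or accepting its diff. The sketch below parses the tab-separated output of `git diff --numstat` (added, deleted, path per line); the two-file threshold mirrors the "one or two files" guidance above, and the function names are hypothetical helpers, not a Cursor feature.

```python
def parse_numstat(numstat: str) -> list[tuple[str, int]]:
    """Parse `git diff --numstat` text into (path, changed_lines) pairs."""
    entries = []
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        added, deleted, path = parts
        # Binary files report "-" for both counts; treat them as 0 changed lines.
        changed = sum(int(n) for n in (added, deleted) if n.isdigit())
        entries.append((path, changed))
    return entries


def is_small_scope(numstat: str, max_files: int = 2) -> bool:
    """True when the change touches at least one and at most max_files files."""
    touched = len(parse_numstat(numstat))
    return 0 < touched <= max_files


if __name__ == "__main__":
    sample = "12\t3\tsrc/billing/invoice.py\n40\t0\ttests/test_invoice.py"
    print(parse_numstat(sample))
    print(is_small_scope(sample))
```

File count is a crude proxy; it says nothing about whether the change crosses a data boundary, so it complements review and does not replace it.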

Composer 2.5 vs Claude Code Opus 4.7 — head-to-head

The honest comparison comes from running the same task in both and measuring the human cost to a mergeable patch:

Pricing: Composer 2.5 at $0.50/$2.50 per million tokens [2] vs Opus 4.7 at roughly $15/$75. At those list prices Opus 4.7 costs 30× more on both input and output (5× more than Composer 2.5's fast variant). For a single task the difference rarely matters; at a hundred tasks per week it adds up to thousands of dollars over a year.
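Taking the list prices quoted here at face value, the ratios can be computed directly. The sketch below is illustrative: the model keys are made-up labels, the Opus 4.7 price is the article's "roughly" figure, and the 100,000-input / 10,000-output task size is a hypothetical chosen only to show the calculation.

```python
# USD per million tokens as (input, output), as quoted in this article.
PRICES = {
    "composer-2.5": (0.50, 2.50),
    "composer-2.5-fast": (3.00, 15.00),
    "opus-4.7": (15.00, 75.00),  # "roughly", per the article; not an official quote
}


def price_ratio(expensive: str, cheap: str) -> tuple[float, float]:
    """Return (input_ratio, output_ratio) of one model's prices over another's."""
    exp_in, exp_out = PRICES[expensive]
    cheap_in, cheap_out = PRICES[cheap]
    return exp_in / cheap_in, exp_out / cheap_out


def token_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call with the given token counts."""
    price_in, price_out = PRICES[model]
    return round(input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out, 4)


if __name__ == "__main__":
    print(price_ratio("opus-4.7", "composer-2.5"))
    print(price_ratio("opus-4.7", "composer-2.5-fast"))
    # HYPOTHETICAL task size: 100k input tokens, 10k output tokens.
    print(token_cost("composer-2.5", 100_000, 10_000))
    print(token_cost("opus-4.7", 100_000, 10_000))
```

Per-task cost can diverge from these ratios in practice, because the two models may consume different numbers of tokens on the same task.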

Latency: Composer 2.5 is faster end-to-end inside the IDE. The fast variant ($3.00/$15.00) makes the latency advantage even larger. Opus 4.7 is slower per call but tends to need fewer follow-up prompts.

Reasoning depth: Opus 4.7's plan mode and longer thinking traces beat Composer 2.5 on complex multi-file refactors. The benchmark numbers are close [4], but the gap widens in the long tail of hard tasks.

Audit trail: Claude Code writes its conversation to disk; Codex CLI does the same with command transcripts. Composer 2.5 inside Cursor has the IDE history but no clean exportable transcript. For compliance-bound teams, this is a real adoption blocker.

MCP support: Both have it. Composer gained MCP support with the 2.5 release [12]; Claude Code has had it longer. Practical parity for most workflows.

Verdict: Use Composer 2.5 for IDE-resident work where speed and cost matter and the human is in the loop. Use Opus 4.7 (via Claude Code) for delegated work, complex reasoning, and anywhere the audit trail is non-negotiable. This is the same two-lane policy we've recommended since the Codex CLI vs Cursor Composer 2 piece — Composer 2.5 widens the cost gap but doesn't move the boundary.

When NOT to use Composer 2.5

Four scenarios where reaching for Composer 2.5 is the wrong call:

Autonomous overnight runs. The reward-hacking failure modes Cursor itself disclosed [9, 11] are unacceptable for unsupervised work. Use Codex CLI with command transcripts and Claude Code with plan mode instead.

Cross-IDE or terminal-only teams. Composer 2.5 has no public API. If half your team uses Neovim or JetBrains, the model is unreachable for them. Standardise on Claude Code or Codex CLI for the team default.

Complex reasoning tasks. "Confidently incompetent" is a community-observed failure mode [7, 8]. For data-boundary refactors, state-machine logic, or anything that needs careful planning, the Opus 4.7 plan-mode loop is worth the price.

Compliance-bound work. The IDE history is not a clean audit trail. If you need to prove what the agent did, why, and what it almost did but didn't, the terminal-native tools win.

For everything else — bounded, IDE-resident, human-steered, ticket-sized work — Composer 2.5 is the cheapest credible option at the top of the Coding Agent Index. That is a real shift, and the strategic message is bigger: Cursor's investment in post-training on an open base just produced a 14-point Coding Agent Index gain in two months [3, 5, 10]. The next leap is already in train — Cursor is partnering with SpaceXAI to build a model from scratch on Colossus 2 with 10× more compute [6]. The frontier moves with post-training and harness fit, not just with bigger models.

- Composer 2.5 is the right pick for IDE-resident, ticket-sized, human-steered work — and only that.
- For autonomous, cross-IDE, complex-reasoning, or compliance-bound work, Claude Code or Codex CLI remain the defaults.
- The Cursor + SpaceXAI partnership signals 3.0 will be a fully custom base — not another K2.5 post-train.
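The four exclusions and the default lane can be written as a small routing function, which is a convenient way to make the policy explicit in a team playbook. This is a sketch of the policy as described in this article; the `Task` fields and lane labels are hypothetical names, and the ordering of the checks is one reasonable reading, not a rule taken from Cursor or Anthropic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # All fields are illustrative attributes a team would fill in per ticket.
    human_in_loop: bool         # someone reviews diffs as they appear
    runs_inside_cursor: bool    # the work happens in the Cursor IDE or CLI
    needs_deep_reasoning: bool  # data boundaries, state machines, edge cases
    needs_audit_trail: bool     # compliance requires an exportable transcript


def pick_lane(task: Task) -> str:
    """Map a task to a lane following the two-lane policy described above."""
    if not task.human_in_loop:
        return "terminal-agent"      # autonomous runs: Codex CLI or Claude Code
    if not task.runs_inside_cursor:
        return "terminal-agent"      # cross-IDE or terminal-only teams
    if task.needs_audit_trail:
        return "terminal-agent"      # compliance-bound work
    if task.needs_deep_reasoning:
        return "opus-4.7-plan-mode"  # complex reasoning tasks
    return "composer-2.5"            # bounded, IDE-resident, human-steered


if __name__ == "__main__":
    ticket = Task(human_in_loop=True, runs_inside_cursor=True,
                  needs_deep_reasoning=False, needs_audit_trail=False)
    print(pick_lane(ticket))
```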

