Opus 4.7 resolves 3× more production coding tasks than 4.6 — here's the catch for 8-hour sessions
- Identify which Opus 4.7 improvements specifically target multi-step coding sessions and where gaps remain
- Estimate the real token cost of upgrading an existing long-running agent workflow from Opus 4.6 to 4.7
Claude Opus 4.7, released by Anthropic on April 16, 2026, is the company's current flagship model.[4] It delivers a 70% task completion rate on CursorBench (vs. 58% for 4.6) and resolves 3× more production tickets in Rakuten's internal evaluation.[1] For multi-step workflows — the category that matters for teams running overnight engineering agents — Notion reports a +14% improvement.[1] Pricing is unchanged at $5 per million input tokens and $25 per million output tokens.[2]
The headline numbers are the strongest Anthropic has published for a coding-focused release. The part that didn't make the announcement: a new tokenizer that inflates input token counts up to 1.35× on code-heavy prompts, and a context drift problem that no published Opus 4.7 benchmark directly measures.
Key facts
- Released April 16, 2026 — generally available, same pricing tier as Opus 4.6 ($5/M in, $25/M out)[1]
- 70% on CursorBench vs. 58% for Opus 4.6 — 12-percentage-point gain on single-session task completion[2]
- 3× production task resolution improvement over Opus 4.6 on Rakuten's internal evaluation[1]
- File-system-based memory for multi-session continuity — new in 4.7, allows the model to persist context across session boundaries[1]
- Loop resistance and graceful error recovery — reduces mid-task abandonment on tool failures[1]
- Tokenizer change: input token counts scale 1.0–1.35× vs. Opus 4.6 depending on content type; code-dense prompts hit the upper bound[1]
What the benchmarks actually measure
Every number in the Opus 4.7 announcement is customer-reported rather than independently reproduced on a public harness. Cursor's 70% is a task-completion rate within a single coding session; Rakuten's 3× is a production bug-resolution multiple with unknown session-length bounds.[2]
This matters because none of these benchmarks simulate what "long-running coding" means in practice: a continuous session with 150+ sequential tool calls, file states mutating underneath a plan, loop-detection logic firing between steps, and a 500K-token context window that the model must navigate coherently from step 1 to step 200.[2]
The closest published proxy is Notion's +14% multi-step workflow number — but Notion hasn't released evaluation methodology. For teams whose agents run multi-hour tickets, the honest answer is: the improvements are directionally real, but the magnitude depends on where your tasks break today.
The tokenizer tax at scale
The least-publicized change in Opus 4.7 is the tokenizer upgrade. Anthropic advises that depending on content type, input token counts "may increase by approximately 1.0–1.35×" compared to Opus 4.6.[1]
For a typical 8-hour engineering session, the practical impact (see the cost sketch after this list):
- A session consuming 400,000 input tokens on Opus 4.6 now consumes ~540,000 tokens on Opus 4.7
- At $5/M input, that's $2.00 vs. $2.70 per session on input alone
- Multiply by 10 agents running overnight: $20/night → $27/night, just on the tokenizer delta
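A quick calculator makes the same arithmetic reusable for your own session sizes; the 1.35× factor is the upper bound Anthropic cites for code-dense prompts, and the session and fleet sizes below are the example numbers above, not measurements:

```python
# Back-of-the-envelope tokenizer-delta estimate, input tokens only.
# 400_000 tokens/session and 10 agents mirror the example above; swap in your
# own telemetry. 1.35 is the upper bound Anthropic cites for code-dense prompts.
INPUT_PRICE_PER_MTOK = 5.00   # $ per million input tokens (unchanged in 4.7)
TOKENIZER_INFLATION = 1.35

def nightly_input_cost(tokens_per_session: int, agents: int, inflation: float = 1.0) -> float:
    """Dollar cost of input tokens for one night of agent sessions."""
    return tokens_per_session * inflation * agents * INPUT_PRICE_PER_MTOK / 1_000_000

baseline = nightly_input_cost(400_000, agents=10)                                  # Opus 4.6
upgraded = nightly_input_cost(400_000, agents=10, inflation=TOKENIZER_INFLATION)   # Opus 4.7
print(f"${baseline:.2f}/night -> ${upgraded:.2f}/night")  # $20.00/night -> $27.00/night
```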
The compounding problem is prompt caching. Anthropic offers up to 90% cost reduction via cache hits.[2] But because cache keys are token-sensitive, a tokenizer change invalidates previously cached prompt prefixes. Teams that built their cost model on Opus 4.6 cache hit rates will see effective costs spike until prompts are re-optimized for the new tokenizer.
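To see where the spike shows up, here is a minimal prompt-caching sketch against the Anthropic Messages API (the model ID is hypothetical and the system-prompt file is a placeholder). Because the cached prefix is keyed on its exact token sequence, prompts re-tokenized under 4.7 start cold and pay the cache-write premium until the new prefixes warm up:

```python
# Minimal prompt-caching sketch with the Anthropic Messages API.
# Assumptions: a hypothetical "claude-opus-4-7" model ID and a placeholder
# system-prompt file. The cached prefix is keyed on its exact token sequence,
# so re-tokenized prompts start cold until the new prefixes warm up.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = open("agent_system_prompt.md").read()  # large, stable prefix

response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical model ID
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # mark this prefix for caching
        }
    ],
    messages=[{"role": "user", "content": "Continue the refactor from step 42."}],
)

# cache_creation_input_tokens (miss: prefix re-written) vs. cache_read_input_tokens
# (hit: prefix reused) is the number to watch in the first nights after the upgrade.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```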
Safety and stability: Project Glasswing
Opus 4.7 is the first model to fully integrate the safeguards developed under Project Glasswing, Anthropic's cybersecurity initiative that automatically detects and blocks high-risk cybersecurity uses.[1] In the context of long-running coding tasks, this manifests as a more robust "refusal boundary" for dangerous operations. If an agent tries to modify a sensitive security configuration in a way that introduces a known vulnerability (e.g., hardcoding a credential during a refactor), Opus 4.7 is significantly more likely to catch the error and suggest a secure alternative.
While this adds friction to some edge-case workflows, it is a critical safety net for autonomous agents operating in production environments. Anthropic’s safety evaluation reports lower rates of misaligned behavior during multi-step tasks for Opus 4.7 compared to its predecessors, with the model scoring 98.5% on the XBOW autonomous penetration-testing benchmark.[2]
Where context drift still happens
Opus 4.7's file-system memory and loop resistance address two real failure modes: forgetting across session breaks, and spinning on unrecoverable tool errors. These are meaningful improvements for agentic workflows. What they don't address is the underlying degradation in long-context coherence — the well-documented tendency of transformer attention to drop off sharply for information positioned in the middle of a large window.[3]
File-system-based memory: Beyond context windows
Unlike the transient context window, the new file-system-based memory in Opus 4.7 allows the model to persist key insights across session boundaries. In a multi-day coding project, standard agents often lose the "mental model" of the architecture if the context is flushed. Opus 4.7 can now be instructed to maintain a .claude_memory file (or similar) that it treats as a high-priority "long-term anchor." This reduces the need for the agent to re-scan the entire codebase at the start of every session, though it requires explicit agentic orchestration to maintain accurately.
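A minimal orchestration sketch of that pattern follows; the .claude_memory path, the distillation prompt, and the model ID are conventions of this hypothetical harness rather than an Anthropic API, and the model only benefits if your loop actually re-injects the file at session start:

```python
# Harness-side persistence for a .claude_memory anchor file. The path, file
# format, distillation prompt, and model ID are conventions of this sketch,
# not an Anthropic API.
from pathlib import Path
import anthropic

MEMORY_FILE = Path(".claude_memory")
client = anthropic.Anthropic()

def start_session(task: str) -> list[dict]:
    """Seed a new session with the persisted architectural notes, if any."""
    notes = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else "(no prior notes)"
    return [{
        "role": "user",
        "content": f"Long-term notes from previous sessions:\n{notes}\n\nCurrent task:\n{task}",
    }]

def end_session(messages: list[dict]) -> None:
    """Ask the model to distill what the next session must remember, then persist it."""
    messages.append({
        "role": "user",
        "content": "Summarize the architectural decisions and open threads the next "
                   "session needs, in under 300 words.",
    })
    resp = client.messages.create(
        model="claude-opus-4-7",  # hypothetical model ID
        max_tokens=1024,
        messages=messages,
    )
    MEMORY_FILE.write_text(resp.content[0].text)
```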
Visual acuity: the 3.75MP upgrade for frontend QA
One of the most significant yet under-discussed upgrades in 4.7 is the jump to 3.75 megapixel visual resolution — a 3× improvement over prior models.[1] In long-running coding tasks, this acuity is transformative for frontend engineering and automated QA agents.
Previously, an agent trying to debug a CSS alignment issue or a "pixel-perfect" design discrepancy often hallucinated the cause because it couldn't see the fine-grained details of a dense screenshot. Opus 4.7 can resolve small text elements and complex layout issues in screenshots at significantly higher fidelity than its predecessors.[1] For teams building autonomous "UI fix-it" agents, this means the model can now effectively "look" at the terminal output and the browser window simultaneously to identify where a build failed visually.
Adaptive thinking: When to let the model reason deeper
Opus 4.7 introduces adaptive thinking, which lets the model dynamically adjust reasoning depth based on task complexity.[2] For most standard refactors, the model conserves tokens and responds quickly. For long-running tasks that involve complex architectural decisions — such as migrating a legacy monolith to a distributed microservices pattern — adaptive thinking allows the model to "pre-reason" through the dependency graph before emitting its first tool call.
The trade-off is latency and token spend: the model consumes more tokens in its internal thought trace on harder problems, but this often results in more correct-on-the-first-try tool calls. If your agent is running an 8-hour ticket, the extra reasoning time per step is a cheap insurance policy against a 20-minute loop where the model tries to fix a bug it introduced three steps prior.
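Anthropic has not published an Opus 4.7-specific control for this, so the sketch below assumes the existing extended-thinking parameter carries over (model ID and budgets are placeholders; verify against the current API reference) and lets the harness raise the thinking budget only on steps it classifies as architectural:

```python
# Sketch: raise the thinking budget only on steps the harness flags as
# architecturally heavy. Assumes the existing extended-thinking `thinking`
# parameter carries over to a hypothetical "claude-opus-4-7" model ID;
# verify both against the current API reference.
import anthropic

client = anthropic.Anthropic()

def agent_step(prompt: str, heavy: bool) -> str:
    resp = client.messages.create(
        model="claude-opus-4-7",  # hypothetical model ID
        max_tokens=24_000,        # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": 16_000 if heavy else 2_048},
        messages=[{"role": "user", "content": prompt}],
    )
    # Thinking blocks come back first; the final text block carries the answer.
    return next(block.text for block in resp.content if block.type == "text")

plan = agent_step("Map the dependency graph before proposing the first migration step.", heavy=True)
```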
Running the 200-step comparison
To quantify the 4.6 → 4.7 delta for your own task distribution, set up a fixed benchmark: the same 200-step coding task run on each model with identical prompts (adjusted for the tokenizer), measured on step-completion count, first-tool-failure step, and total token spend.
For a reproducible harness, the SWE-bench dataset provides real-world GitHub issues with deterministic pass/fail oracles (the target repository's test suite), making success rate objective.[5]
The setup — initializing the model client and dispatching a single benchmark task — looks like this:
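(This is a minimal sketch rather than a full harness: the model IDs are hypothetical, and the checkout/apply/test loop that turns a returned diff into a pass/fail verdict is omitted.)

```python
# Minimal sketch: initialize the client and dispatch one SWE-bench task to each
# model, logging token spend. Model IDs are hypothetical; a real harness also
# checks out the repo at base_commit, applies the returned diff, and runs the
# test suite, all omitted here.
import anthropic
from datasets import load_dataset  # pip install datasets

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODELS = ["claude-opus-4-6", "claude-opus-4-7"]  # hypothetical IDs; check your console

task = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")[0]

def dispatch_task(model: str, task: dict):
    """Send one benchmark issue to a model and return the raw response."""
    return client.messages.create(
        model=model,
        max_tokens=4096,
        system="You are a coding agent. Produce a unified diff that resolves the issue.",
        messages=[{
            "role": "user",
            "content": (
                f"Repository: {task['repo']} @ {task['base_commit']}\n\n"
                f"Issue:\n{task['problem_statement']}\n\n"
                "Return only a unified diff."
            ),
        }],
    )

for model in MODELS:
    resp = dispatch_task(model, task)
    print(model, resp.usage.input_tokens, resp.usage.output_tokens)
```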
Run this task set on both models and compare: step count to first passing test, backtrack rate (tool calls that undo a previous write), and total input/output token spend. Expect Opus 4.7 to show 10-15% fewer backtracks on tasks where the failure mode is mid-task abandonment, but roughly equivalent drift behavior after step 100 for tasks with long constraint chains.
Visual debugging example
To test the 3.75MP acuity upgrade, try a task that requires matching a UI implementation against a high-resolution design spec or screenshot.
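A minimal sketch of such a task, using the standard Anthropic vision content blocks; file names and the model ID are placeholders:

```python
# Sketch: ask the model to diff a rendered screenshot against a design spec.
# The image block follows the standard Anthropic vision format; file names and
# the model ID are placeholders.
import base64
import anthropic

client = anthropic.Anthropic()

def image_block(path: str) -> dict:
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode()
    return {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": data}}

response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical model ID
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            image_block("design_spec.png"),    # the intended layout
            image_block("rendered_page.png"),  # what the build actually produced
            {"type": "text", "text": "List every visual discrepancy (spacing, font size, "
                                     "alignment) and the CSS selector most likely at fault."},
        ],
    }],
)
print(response.content[0].text)
```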
Bottom line
For teams running sub-100-step coding agents, Opus 4.7 is a straightforward upgrade: 12pp better task completion, 3× production ticket resolution, same price. The tokenizer change is a minor line-item.
For teams running 150+ step overnight agents, the calculus is different. Budget for a 20-35% input token increase on code-heavy prompts, re-validate cache hit rates post-upgrade, and add explicit re-anchor checkpoints to your agent loop. The file-system memory and loop resistance will help — but context drift at long range is still your problem to engineer around, not Anthropic's.
For a structured framework on choosing between Opus 4.7, Sonnet 4.6, and competing frontier models based on task duration and cost tolerance, see Picking a Frontier Model: Opus 4.7 vs GPT-5.5 vs Gemini 3.1 Pro — A Builder's Benchmark Guide.
Further Reading
[1] Introducing Claude Opus 4.7 — https://www.anthropic.com/news/claude-opus-4-7 · retrieved 2026-04-30
[2] Claude Opus Model Card — https://www.anthropic.com/claude/opus · retrieved 2026-04-30
[3] Lost in the Middle: How Language Models Use Long Contexts — https://arxiv.org/abs/2307.03172 · retrieved 2026-04-30
[4] Anthropic News — https://www.anthropic.com/news · retrieved 2026-04-30
[5] SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — https://www.swebench.com/ · retrieved 2026-04-30