Ship fixes faster than Mythos-speed exploit chains can arrive

Q: What did AISI's Claude Mythos evaluation show?

[AISI](https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities) reported that Claude Mythos Preview completed The Last Ones, a 32-step simulated corporate network attack, end to end in 3 of 10 attempts and averaged 22 of 32 steps, while the tested Claude Opus 4.6 baseline did not complete the full range.

Q: Did GPT-5.5 also complete the AISI cyber range?

Yes. [AISI](https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5-5-cyber-capabilities) reported that GPT-5.5 completed the same 32-step simulated corporate attack in 2 of 10 attempts, while Claude Mythos Preview completed it in 3 of 10. Before Mythos, no AI model had completed that range end to end, so the threshold crossing is broader than one vendor.

AISI's Claude Mythos evaluation matters for developers because it crossed a binary threshold: Mythos completed an end-to-end 32-step simulated corporate network attack, while earlier models did not complete that range end to end. GPT-5.5 is close enough that the practical takeaway is not "which model won." It is that development teams now need remediation pipelines that can safely patch internet-facing code in hours, not weeks (AISI Mythos evaluation, retrieved 2026-05-14; AISI GPT-5.5 evaluation, retrieved 2026-05-27).

The non-obvious read: this is not mainly a security operations center (SOC) story. Detection teams can get faster and still lose if product engineering cannot validate, deploy, and roll back fixes quickly. The new defensive bottleneck is safe change velocity. Security programs built around periodic scanning, manual triage, ticket queues, and monthly patch windows are structurally mismatched with models that can discover, chain, and validate exploit paths at machine speed (SACR, retrieved 2026-05-14).

Read AISI as a methodology result, not a headline score

AISI measured two different things: isolated exploitation skill and chained attack execution. The second one is the developer-relevant threshold because production incidents are chains, not puzzles.

The first class was capture-the-flag testing. In these challenges, a model identifies a weakness in a contained target and retrieves a flag. Mythos succeeded on 73% of expert-level CTF tasks, a tier AISI says no model could complete before April 2025 (AISI, retrieved 2026-05-14). That score is important, but CTFs overstate operational usefulness when read alone: they test specific skills in isolation.

The second class was cyber ranges. AISI's "The Last Ones" (TLO) range is a 32-step corporate network attack simulation spanning reconnaissance through full network takeover; the associated AISI paper describes purpose-built ranges that require chaining heterogeneous capabilities across extended action sequences (arXiv, retrieved 2026-05-14). Mythos completed TLO from start to finish in 3 of 10 attempts and averaged 22 of 32 steps. Claude Opus 4.6 averaged 16 steps and did not complete the full range (AISI, retrieved 2026-05-14).

That is the threshold. In security terms, "completed full network takeover sometimes" and "never completed it" are different categories even when the success rate is not high enough for push-button reliability. AISI also states the caveat clearly: the ranges lacked active defenders, endpoint detection, and penalties for actions that would trigger alerts, so the result applies to small, weakly defended, vulnerable enterprise systems where network access has already been gained (AISI, retrieved 2026-05-14).

Developers should keep both truths in view. Mythos did not demonstrate reliable compromise of hardened enterprises. It did demonstrate that frontier models can now chain enough cyber steps to make weak patch discipline materially more dangerous.

Treat GPT-5.5 as confirmation that the threshold is not vendor-specific

The GPT-5.5 result matters because it weakens the comforting interpretation that Mythos is a one-off Anthropic anomaly. AISI reported that GPT-5.5 completed the same 32-step simulated corporate attack in 2 of 10 runs, with Mythos doing it in 3 of 10; before Mythos, no AI model had completed that test (AISI GPT-5.5 evaluation, retrieved 2026-05-27).

CyberScoop's read is the most important trend claim: both Claude Mythos Preview and GPT-5.5 substantially exceeded the doubling trend AISI had been tracking, and it remains unclear whether this is a one-time jump or a new faster baseline (CyberScoop, retrieved 2026-05-14). AISI's underlying range paper also found performance scaling with inference-time compute, with no observed plateau between 10M and 100M tokens in its test setup (arXiv, retrieved 2026-05-14).

For engineering leaders, the practical conclusion is conservative: do not bet your risk model on one provider keeping capability gated. The Cloud Security Alliance framed Mythos as a new baseline for weakly defended environments, not as a finished autonomous attacker against mature security programs (CSA, retrieved 2026-05-14). That distinction matters. The near-term risk is not a perfect AI red team. It is a larger population of models that can cheaply compress the time between bug discovery, exploit path construction, and proof that a vulnerable chain is real.

This also explains why access policy is now part of the security story. Anthropic's Project Glasswing restricts Mythos access to vetted partners. OpenAI says Trusted Access for Cyber gives legitimate defenders an identity-gated access path for advanced defensive workflows including vulnerability discovery, codebase reasoning, malware analysis, and other defensive work (OpenAI system card, retrieved 2026-05-27). For most enterprise security teams, that makes GPT-5.5 with Trusted Access the more accessible defensive path today; Mythos-level access remains concentrated among large partners.

Move the bottleneck from detection to remediation

Palo Alto Networks provides the clearest real-world signal because it tested frontier models against its own portfolio. In its May 2026 threat bulletin, Unit 42 said frontier AI model testing produced the majority of findings in Palo Alto's May Patch Wednesday advisories after a scan of more than 130 products; the advisory covered 26 CVEs representing 75 issues, compared with its usual volume of fewer than 5 CVEs in a month (Unit 42, retrieved 2026-05-27).

That is the developer wake-up call. A model that finds more bugs is only helpful if the organization can turn findings into safe fixes. Unit 42's May bulletin frames the first response plainly: find and fix vulnerabilities in applications, products, code, and open-source supply chains; coordinate accelerated patching with product and development teams; reduce reachable exposure; and improve real-time operations (Unit 42, retrieved 2026-05-27).

AISI's GPT-5.5 reverse-engineering result shows how short the window can get: a challenge that took a human expert roughly 12 hours was solved autonomously in 10 minutes and 22 seconds for $1.73 in API usage (AISI GPT-5.5 evaluation, retrieved 2026-05-27). Treat that as a pressure test, not a universal law. Some bugs will still be hard. Hardened environments still matter. But the correct engineering question becomes blunt: if a serious internet-facing flaw lands today, can your team ship a tested mitigation before an exploit chain spreads?

Many teams cannot, for boring reasons. The service owner is unclear. The deploy pipeline is fragile. The test suite cannot isolate the risk. The rollback path is undocumented. Security files a ticket, product triages it, engineering schedules it, and the fix waits behind roadmap work. In a Mythos/GPT-5.5 world, that queue is the vulnerability.

Execute a 0-30, 31-60, 61-90 day developer plan

The response is not to panic-buy security tools. The response is to make remediation an engineering capability with a service-level objective.

0-30 days: reduce the reachable blast radius. Build an inventory of internet-facing services, admin panels, forgotten subdomains, exposed management planes, legacy protocols, and unauthenticated endpoints. For each asset, assign a current owner and an emergency contact. Require compensating controls where fixes cannot ship quickly: WAF or WAAP rules, service-to-service auth hardening, egress restrictions, stricter identity policies, and segmentation around crown-jewel systems. AISI's own advice emphasizes regular updates, robust access controls, security configuration, and comprehensive logging (AISI, retrieved 2026-05-14).
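The ownership requirement is easy to check mechanically once the inventory exists. The following is a minimal sketch, assuming a hypothetical inventory record with `owner` and `emergency_contact` fields; the field names and sample assets are illustrative and do not come from AISI or any specific asset-management tool.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    # Hypothetical inventory record; adapt the fields to your own asset source.
    name: str
    internet_facing: bool
    owner: str = ""
    emergency_contact: str = ""


def unowned_exposed(assets):
    """Return names of internet-facing assets missing an owner or emergency contact."""
    return [
        asset.name
        for asset in assets
        if asset.internet_facing
        and not (asset.owner.strip() and asset.emergency_contact.strip())
    ]


# Illustrative sample data, not a real inventory.
INVENTORY = [
    Asset("auth-api", True, "identity", "identity-oncall"),
    Asset("old-vpn", True, "it", ""),
    Asset("billing-worker", False, "", ""),
    Asset("legacy-admin", True, "", ""),
]

gaps = unowned_exposed(INVENTORY)
print(gaps)
```

Internal-only assets are deliberately ignored here so that the first 30 days stay focused on reachable exposure; a fuller check would likely cover them later.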

31-60 days: make emergency patching boring. Pick the ten services whose compromise would hurt most and run an emergency patch drill. The drill is successful only if the team can identify the owner, create a fix branch, run the relevant test suite, deploy a canary, observe it, and roll back inside a defined window. Add fuzzing and modern SAST/DAST where they produce developer-ready findings rather than generic backlog tickets. SACR's CISO recommendations emphasize developer ownership, AI-assisted review in IDE/CI, clear SLAs for internet-exposed issues, and guardrails that block risky changes without approval (SACR, retrieved 2026-05-14).
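The drill's pass condition can be written down so that results are comparable across services. This is a rough sketch under stated assumptions: the step identifiers mirror the steps listed above but are invented names, and the two-hour window is a placeholder for whatever window your team defines.

```python
from datetime import timedelta

# Steps from the drill description; the identifiers themselves are illustrative.
DRILL_STEPS = (
    "identify_owner",
    "create_fix_branch",
    "run_test_suite",
    "deploy_canary",
    "observe_canary",
    "roll_back",
)


def evaluate_drill(step_durations, window):
    """Return (passed, missing_steps, total_time) for one drill run.

    A run passes only if every step was completed and the summed
    duration fits inside the agreed window.
    """
    missing = [step for step in DRILL_STEPS if step not in step_durations]
    total = sum(step_durations.values(), timedelta())
    passed = not missing and total <= window
    return passed, missing, total


# Hypothetical timings recorded during a drill.
run = {
    "identify_owner": timedelta(minutes=5),
    "create_fix_branch": timedelta(minutes=10),
    "run_test_suite": timedelta(minutes=25),
    "deploy_canary": timedelta(minutes=15),
    "observe_canary": timedelta(minutes=30),
    "roll_back": timedelta(minutes=10),
}

passed, missing, total = evaluate_drill(run, window=timedelta(hours=2))
print(passed, missing, total)
```

Recording a missing step as a failure, rather than as zero time, keeps a team from passing the drill by skipping the rollback.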

61-90 days: create a trusted remediation lane. This is not fully autonomous patching. It is a constrained path for high-confidence fixes: signed artifacts, human approval gates for sensitive services, canary deploys, synthetic checks, production observability, automatic rollback thresholds, and an audit trail. Evaluate AI remediation vendors against this question: can they produce patches your delivery system can operationalize safely? A finding without a safe ship path is just a faster ticket.
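An automatic rollback threshold can start as a very small decision function. The sketch below is one possible shape, not a recommended policy: the function name, the minimum sample size, and the error-rate tolerance are all assumptions chosen for illustration.

```python
def canary_verdict(canary_errors, canary_requests, baseline_error_rate,
                   min_requests=500, tolerance=0.005):
    """Decide what to do with a canary deploy of a security fix.

    Returns "wait" until enough traffic has been observed, "rollback" if
    the canary error rate exceeds the baseline by more than the tolerance,
    and "promote" otherwise. Default thresholds are placeholders.
    """
    if canary_requests < min_requests:
        return "wait"
    canary_rate = canary_errors / canary_requests
    if canary_rate > baseline_error_rate + tolerance:
        return "rollback"
    return "promote"


# Hypothetical observations against a 1% baseline error rate.
print(canary_verdict(3, 100, 0.01))     # too little traffic to judge
print(canary_verdict(20, 1000, 0.01))   # 2% errors, above baseline plus tolerance
print(canary_verdict(12, 1000, 0.01))   # 1.2% errors, within tolerance
```

In a real lane the verdict would feed the deploy system and the audit trail; sensitive services would still pass through a human approval gate before promotion.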

The sharper framing comes from SACR: SOC excellence without secure production capability is structurally insufficient when fixes cannot be shipped safely and quickly (SACR, retrieved 2026-05-14). Developers should take that literally. Your incident response plan now needs a release-engineering plan.

Use GPT-5.5-Cyber defensively

For teams that can access OpenAI's Trusted Access for Cyber program, GPT-5.5 with Trusted Access is the practical near-term tool to evaluate. OpenAI says Trusted Access reduces friction for verified defensive workflows, and GPT-5.5-Cyber is a more specialized access tier for authorized red teaming, penetration testing, and controlled validation (OpenAI system card, retrieved 2026-05-27). Unit 42 says Palo Alto tested OpenAI's latest models through Trusted Access for Cyber alongside Anthropic's Mythos model as part of Project Glasswing (Unit 42, retrieved 2026-05-27).

Keep the harness constrained

Do not wire a cyber-capable model straight into production systems. The safe developer pattern is narrow context, read-only first access, no secrets by default, explicit allowlists, logged tool calls, and separate approval for any action that changes code, infrastructure, credentials, or customer data. The aim is to accelerate triage and patch review, not to create a privileged autonomous operator with ambiguous instructions.
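The pattern above (explicit allowlists, logged tool calls, separate approval for changes) can be sketched as a gate that sits between the model and any tool. Everything named here is hypothetical: the tool names, the `authorize_tool_call` function, and the in-memory audit log are illustrations, not part of any vendor's SDK.

```python
from datetime import datetime, timezone

# Hypothetical tool names. Anything not listed is denied by default.
READ_ONLY_TOOLS = frozenset({"read_file", "search_code", "list_dependencies"})
APPROVAL_REQUIRED_TOOLS = frozenset({"open_pull_request", "apply_patch"})

# In-memory stand-in for an append-only audit store.
audit_log = []


def authorize_tool_call(tool, args, approved_by=None):
    """Return "allow", "needs_approval", or "deny" and log every request."""
    if tool in READ_ONLY_TOOLS:
        decision = "allow"
    elif tool in APPROVAL_REQUIRED_TOOLS:
        # Changes to code need a named human approver.
        decision = "allow" if approved_by else "needs_approval"
    else:
        decision = "deny"
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "args": args,
        "approved_by": approved_by,
        "decision": decision,
    })
    return decision


print(authorize_tool_call("read_file", {"path": "src/auth.py"}))
print(authorize_tool_call("apply_patch", {"diff": "..."}))
print(authorize_tool_call("apply_patch", {"diff": "..."}, approved_by="alice"))
print(authorize_tool_call("rotate_credentials", {}))
```

Denying by default means a newly added tool stays unusable until someone classifies it, which matches the read-only-first posture described above.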

Use the model where it compresses defensive toil: map potentially affected code paths, explain a vulnerable dependency's reachable surface, draft test cases that prove a fix, compare a proposed patch against the original bug class, or summarize why a virtual patch is acceptable until a full release ships. Keep exploit generation inside authorized, isolated test environments and under your organization's legal and security process.

Sample prompt cell input and expected output:

csv:
  service,owner,internet_facing,severity,fix_available,days_open
  auth-api,identity,true,critical,true,9
  admin-ui,platform,true,high,true,18
  billing-worker,payments,false,critical,true,41
  docs-site,web,true,medium,true,33
  old-vpn,it,true,high,false,77

expected_output: |
  {
    "identity": [
      {
        "service": "auth-api",
        "priority": 1,
        "sla": "patch_or_mitigate_within_24_hours",
        "reason": "internet-facing critical issue with fix available"
      }
    ],
    "platform": [
      {
        "service": "admin-ui",
        "priority": 2,
        "sla": "patch_or_mitigate_within_48_hours",
        "reason": "internet-facing high issue with fix available"
      }
    ],
    "it": [
      {
        "service": "old-vpn",
        "priority": 3,
        "sla": "apply compensating controls today; patch plan required",
        "reason": "internet-facing high issue without fix available"
      }
    ]
  }
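The same triage can be reproduced deterministically, which gives you a baseline to compare a model's answer against. The sketch below uses the sample CSV above; the three rules in `RULES` are inferred from the sample's expected output and are an assumption, not a published policy. Rows that match no rule (internal services, medium severity, or combinations the sample does not cover) are skipped.

```python
import csv
import io
import json

BACKLOG_CSV = """service,owner,internet_facing,severity,fix_available,days_open
auth-api,identity,true,critical,true,9
admin-ui,platform,true,high,true,18
billing-worker,payments,false,critical,true,41
docs-site,web,true,medium,true,33
old-vpn,it,true,high,false,77
"""

# (severity, fix_available) -> (priority, sla). Inferred from the sample only.
RULES = {
    ("critical", True): (1, "patch_or_mitigate_within_24_hours"),
    ("high", True): (2, "patch_or_mitigate_within_48_hours"),
    ("high", False): (3, "apply compensating controls today; patch plan required"),
}


def triage(csv_text):
    """Group internet-facing findings by owner, ordered by priority."""
    ranked = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["internet_facing"] != "true":
            continue
        fix = row["fix_available"] == "true"
        rule = RULES.get((row["severity"], fix))
        if rule is None:
            continue
        priority, sla = rule
        fix_phrase = "with" if fix else "without"
        ranked.append((priority, row["owner"], {
            "service": row["service"],
            "priority": priority,
            "sla": sla,
            "reason": f"internet-facing {row['severity']} issue {fix_phrase} fix available",
        }))
    ranked.sort(key=lambda item: item[0])
    by_owner = {}
    for _, owner, entry in ranked:
        by_owner.setdefault(owner, []).append(entry)
    return by_owner


result = triage(BACKLOG_CSV)
print(json.dumps(result, indent=2))
```

Note that `days_open` is parsed but not used by these rules; a real policy would probably escalate findings that have aged past their SLA.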

Decide what changes this week

The right takeaway from AISI is specific and testable: can your team safely reduce exposure and ship fixes at the pace your detection program can now generate credible findings?

If the answer is no, start with three concrete checks this week. First, pick one internet-facing service and prove that its owner can deploy and roll back a security fix on demand. Second, export your critical/high vulnerability backlog and sort by reachability, owner, fix availability, and days open. Third, run a tabletop where detection is instant but remediation is constrained by your real release process. The gap you find is the gap Mythos-style capability will pressure.

Frequently Asked Questions

What should developers change first after the Mythos and GPT-5.5 results?

The immediate developer response is to improve safe remediation speed: assign owners for internet-facing services, drill emergency patch and rollback paths, and create constrained approval lanes for high-confidence security fixes. SACR recommends developer ownership, AI-assisted review in CI, and clear SLAs for internet-exposed issues.

The next 90 days are not about replacing developers with cyber agents. They are about making developer-controlled remediation fast enough that AI-assisted discovery does not drown the organization in unshipped fixes. For a deeper path through secure tool permissions, agent boundaries, and production guardrails, continue with MCP from First Principles to Production: Why JSON-RPC over stdio beat WebSockets + OpenAPI, then pair it with secure-coding-with-claude and Production Agents with Claude Agent SDK + MCP Connector for secure review and deployment patterns.
