When should I escalate from Claude Sonnet to Opus in Hermes agent?

Escalate to Opus 4.7 when tasks require multi-step tool orchestration (8+ chained tool calls), long-context coherence above 100K tokens, GPQA-class math or logic reasoning, or high-resolution vision analysis. For standard coding, summarization, and content generation, Sonnet 4.6 delivers 79.6% SWE-bench performance at 40% less cost per token. The SWE-bench gap between Sonnet 4.6 (79.6%) and Opus 4.7 (80.8%) is only 1.2 percentage points — far smaller than most teams assume. (Source: Anthropic benchmarks via NxCode, April 2026)

How does Hermes agent's auxiliary task system affect model costs?

By default, Hermes routes every auxiliary task — screenshot analysis, web page summarization, context compression, and session-title generation — to your main chat model. If your main model is Opus 4.7 at $5/$25 per million tokens, every background page scrape and image analysis bills at that rate. Hermes documentation explicitly warns that 'on expensive reasoning models (Opus, MiniMax M2.7, etc.) auxiliary tasks add meaningful cost.' The fix: set explicit cheaper providers for each auxiliary task in ~/.hermes/config.yaml.

What is ClawRouter and how does it integrate with Hermes for model routing?

ClawRouter by BlockRunAI is an open-source local routing proxy that classifies requests across 14 dimensions and auto-routes to the cheapest capable model. It integrates with Hermes by pointing your model endpoint at the ClawRouter proxy (port 8402) and selecting model name `blockrun/auto`. ClawRouter's blended average cost is $2.05 per million tokens versus $25 per million for Opus alone — a reported 92% reduction. The `blockrun/agentic` profile is specifically tuned for Hermes-style tool-heavy workloads. (Source: BlockRunAI GitHub, 2026)

Hermes AI Gateway Model-Selection in 2026: When to Escalate Sonnet to Opus

Use Claude Sonnet 4.6 as your default model in Hermes AI gateway and escalate to Opus 4.7 only when one of five measurable signals appears: tool-call chains exceeding 8 steps, context windows above 100K tokens, GPQA-class math reasoning, high-resolution vision analysis, or multi-agent coordination decisions that cascade downstream. The raw token price gap ($3 vs $5 per million input) understates the real cost difference — Hermes's auxiliary task system bills every background page summary, screenshot analysis, and context compression against your main model, so Opus-as-default routinely costs 2–3× the headline rate implies.

Most teams escalate too eagerly. On SWE-bench Verified, Claude Sonnet 4.6 scores 79.6% against Opus 4.7's 80.8% — a 1.2 percentage point gap that justifies a 67% cost premium only when that last 1.2% is load-bearing. On CursorBench (multi-step agentic tasks), the gap widens to 12 percentage points (Opus 4.7: 70%, Opus 4.6: 58%) — which is exactly where escalation earns its keep. The model-selection question in Hermes is not "is this task hard?" but "does this task match the specific profile where Opus outperforms Sonnet by more than its cost difference?"

The Auxiliary Task Trap: Why Opus-as-Default Costs More Than You Think

Hermes Agent routes five categories of background work — screenshot analysis, web page summarization, session-title generation, browser extract, and context compression — to your main chat model by default. The Hermes configuration docs explicitly flag this: "on expensive reasoning models (Opus, MiniMax M2.7, etc.) auxiliary tasks add meaningful cost."

If you set model.provider: anthropic and model.default: claude-opus-4-7, every time Hermes summarizes a web page, analyzes a screenshot, or compresses long context, it bills at $5/$25 per million tokens. These tasks don't need Opus — they need speed and cheapness. The fix is four lines in ~/.hermes/config.yaml:

auxiliary:
  vision:
    provider: openrouter
    model: google/gemini-3-flash-preview   # ~$0.075/MTok vs $5.00 for Opus
  web_extraction:
    provider: openrouter
    model: google/gemini-3-flash-preview
  context_compression:
    provider: openrouter
    model: google/gemini-3-flash-preview
  session_title:
    provider: openrouter
    model: anthropic/claude-haiku-4-5     # $1/$5 MTok — sufficient for title gen

This alone reduces real-world Hermes costs by 30–50% for sessions with heavy web browsing or long-context compression, with zero impact on your main reasoning quality.

The 5 Escalation Signals That Actually Justify Opus

Based on benchmark data across Opus 4.7 and Sonnet 4.6 (retrieved June 2026), five specific task profiles produce quality gaps large enough to justify the cost premium:

1. Multi-step tool orchestration (>8 chained tool calls) Opus 4.7 scores 70% on CursorBench vs Sonnet 4.6's baseline — a gap that compounds when tool calls depend on each other's outputs. Short tool chains (≤5 steps) close this gap substantially.

2. Long-context coherence (>100K tokens in working context) Both models support 1M context at standard pricing, but Opus maintains cross-reference accuracy across very long contexts more reliably. For Hermes sessions with hundreds of files loaded, Opus reduces the rate of silent fact drift.

3. GPQA-class math and logic reasoning Opus 4.7 scores 94.5% on GPQA Diamond. Sonnet 4.6 GPQA scores are not publicly comparable. For Hermes agents doing financial modeling, proof verification, or scientific calculation, this gap is real and measurable.

4. High-resolution vision analysis Opus 4.7 achieves 98.5% vision accuracy versus Opus 4.6's 54.5% at high resolution. If your Hermes workflow processes screenshots from high-DPI displays or analyzes diagrams, only Opus 4.7 handles above-2576px images natively — though this applies to your main model vision calls, not auxiliary screenshot analysis (use Gemini Flash for those).

5. Cascading multi-agent coordination When Hermes acts as an orchestrator dispatching to other agents, a wrong decomposition decision in step 1 compounds through every downstream step. The Augmentcode routing analysis found that Opus as the coordinator model reduced downstream rework enough to cut total session cost by 51% versus Sonnet-for-everything, despite Opus's higher per-token rate.

The Real Cost Stack: A Worked Example

The numbers below are computed from Anthropic's current pricing (Opus 4.7: $5/$25 MTok; Sonnet 4.6: $3/$15 MTok; Haiku 4.5: $1/$5 MTok) and the Augmentcode three-tier routing benchmark, retrieved June 2026.

Task	Model	Input MTok	Output MTok	Per-task cost	Uniform Opus cost
Architecture planning (1×)	Opus 4.7	0.008	0.004	$0.140	$0.140
Code implementation (3×)	Sonnet 4.6	0.036	0.024	$0.468	$0.780
Quick file edits (8×)	Haiku 4.5	0.024	0.012	$0.084	$0.420
Code review (4×)	Haiku 4.5	0.020	0.008	$0.060	$0.300
Test generation (4×)	Sonnet 4.6	0.016	0.012	$0.228	$0.380
Auxiliary tasks (all)	Gemini Flash	—	—	~$0.012	~$0.15
Session total	Routed			$0.99	$2.17

Three-tier routing with Gemini Flash auxiliary tasks costs $0.99 versus $2.17 for uniform Opus — a 54% reduction with no measurable quality loss on the 90% of steps that don't require Opus reasoning depth. This matches the Augmentcode 51% benchmark closely.

Note: Opus 4.7's updated tokenizer inflates token counts by 1.0–1.35× depending on content type, per Simon Willison's token counter analysis. The $0.140 architecture step above uses list pricing; actual cost may run 10–15% higher on prose-heavy content.

Configuring the Routing Stack in Hermes

Two practical patterns for implementing Sonnet-first routing with Opus escalation inside Hermes:

Pattern A: LiteLLM Fallback Proxy

LiteLLM lets you set Sonnet as the primary and fall back to Opus on specific signals (rate limits, low-confidence responses):

# litellm_config.yaml
model_list:
  - model_name: main
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: ${ANTHROPIC_API_KEY}
  - model_name: main           # same alias, second entry = fallback
    litellm_params:
      model: anthropic/claude-opus-4-7
      api_key: ${ANTHROPIC_API_KEY}
router_settings:
  routing_strategy: usage-based-routing
  fallbacks: [{"main": ["main"]}]
  context_window_fallbacks: [{"main": ["main"]}]

Start LiteLLM: litellm --config litellm_config.yaml --port 4000

Then in Hermes: hermes model → Custom endpoint → http://localhost:4000 → model main.

Pattern B: ClawRouter Automatic Routing

ClawRouter by BlockRunAI classifies each request across 14 complexity dimensions before routing:

npx @blockrun/clawrouter   # starts on port 8402

In ~/.hermes/config.yaml:

model:
  provider: custom
  base_url: http://localhost:8402/v1
  default: blockrun/agentic    # Hermes-optimized profile

The blockrun/agentic profile is tuned for tool-heavy agent workloads. ClawRouter's reported blended cost is $2.05 per million tokens versus $25/M for Opus — though that average covers a mix of free and premium model routing, so production savings will vary by task mix.

> Note on ClawRouter payments: ClawRouter uses USDC cryptocurrency via x402 for per-request billing, not traditional API keys. Factor this into your payment infrastructure planning before adopting at scale.

What Benchmarks Won't Tell You (and Trace Evaluation Will)

The Opus vs Sonnet gap in SWE-bench is 1.2 percentage points. In CursorBench it's 12 points. The divergence exists because SWE-bench tests isolated file edits under contrived conditions, while CursorBench tests multi-step task completion — which is closer to what Hermes actually does in production.

Before committing to an escalation policy, run your own session traces: record 20 representative Hermes sessions with Sonnet only, then replay them with Opus only. Measure: task completion rate, downstream rework steps required, and total session cost. The sessions where Sonnet fails will cluster around the five signals above. Those that don't fail on Sonnet should stay on Sonnet permanently.

This is the stance behind our 2026-06-04-claude-code-opus-4-7-production-guide: benchmark scores are weak signals; trace evaluation against your actual workload topology is the only reliable escalation signal. See also 2026-06-02-ai-coding-agent-cost-ladder-2026 for the broader cost ladder analysis across 12 AI coding agents, and 2026-04-30-opus-4-7-long-running-coding-benchmark for Opus 4.7 long-running task benchmarks specifically.

For the Hermes model ranking by use case, the RemoteOpenClaw guide (retrieved June 2026) lands on the same hierarchy: "Claude Sonnet 4.6 for most tasks, Claude Opus 4.6 [or 4.7] for the most complex reasoning."

Knowledge Check: You're running a Hermes session that processes 50 web pages per hour, generates session titles automatically, and occasionally asks the agent to debug a multi-file TypeScript issue. Which part of this workflow should definitely NOT run on Opus 4.7?

<details> <summary>Answer</summary>

Web page processing (auxiliary web_extraction) and session-title generation (auxiliary session_title) should be explicitly routed to a cheap model like Gemini Flash or Haiku 4.5 — not Opus. Those tasks are volume-high, complexity-low, and hit at $5/MTok when Opus is your main model with default auxiliary settings. The multi-file TypeScript debugging is the only candidate for Opus escalation, and only if the debugging chain exceeds 8 tool calls or the codebase context exceeds 100K tokens.

</details>

Ready to build a model-routing policy for your full AI agent stack — not just Hermes? Picking a Frontier Model: Opus 4.7 vs GPT-5.5 vs Gemini 3.1 Pro — A Builder's Benchmark Guide covers Sonnet vs Opus vs Haiku selection across every major use case, with worked cost models for production pipelines.

The Auxiliary Task Trap: Why Opus-as-Default Costs More Than You Think

auxiliary:
  vision:
    provider: openrouter
    model: google/gemini-3-flash-preview   # ~$0.075/MTok vs $5.00 for Opus
  web_extraction:
    provider: openrouter
    model: google/gemini-3-flash-preview
  context_compression:
    provider: openrouter
    model: google/gemini-3-flash-preview
  session_title:
    provider: openrouter
    model: anthropic/claude-haiku-4-5     # $1/$5 MTok — sufficient for title gen

This alone reduces real-world Hermes costs by 30–50% for sessions with heavy web browsing or long-context compression, with zero impact on your main reasoning quality.

The 5 Escalation Signals That Actually Justify Opus

Based on benchmark data across Opus 4.7 and Sonnet 4.6 (retrieved June 2026), five specific task profiles produce quality gaps large enough to justify the cost premium:

The Real Cost Stack: A Worked Example

Task	Model	Input MTok	Output MTok	Per-task cost	Uniform Opus cost
Architecture planning (1×)	Opus 4.7	0.008	0.004	$0.140	$0.140
Code implementation (3×)	Sonnet 4.6	0.036	0.024	$0.468	$0.780
Quick file edits (8×)	Haiku 4.5	0.024	0.012	$0.084	$0.420
Code review (4×)	Haiku 4.5	0.020	0.008	$0.060	$0.300
Test generation (4×)	Sonnet 4.6	0.016	0.012	$0.228	$0.380
Auxiliary tasks (all)	Gemini Flash	—	—	~$0.012	~$0.15
Session total	Routed			$0.99	$2.17

Configuring the Routing Stack in Hermes

Two practical patterns for implementing Sonnet-first routing with Opus escalation inside Hermes:

Pattern A: LiteLLM Fallback Proxy

LiteLLM lets you set Sonnet as the primary and fall back to Opus on specific signals (rate limits, low-confidence responses):

# litellm_config.yaml
model_list:
  - model_name: main
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: ${ANTHROPIC_API_KEY}
  - model_name: main           # same alias, second entry = fallback
    litellm_params:
      model: anthropic/claude-opus-4-7
      api_key: ${ANTHROPIC_API_KEY}
router_settings:
  routing_strategy: usage-based-routing
  fallbacks: [{"main": ["main"]}]
  context_window_fallbacks: [{"main": ["main"]}]

Start LiteLLM: litellm --config litellm_config.yaml --port 4000

Then in Hermes: hermes model → Custom endpoint → http://localhost:4000 → model main.

Pattern B: ClawRouter Automatic Routing

ClawRouter by BlockRunAI classifies each request across 14 complexity dimensions before routing:

npx @blockrun/clawrouter   # starts on port 8402

In ~/.hermes/config.yaml:

model:
  provider: custom
  base_url: http://localhost:8402/v1
  default: blockrun/agentic    # Hermes-optimized profile

What Benchmarks Won't Tell You (and Trace Evaluation Will)

<details> <summary>Answer</summary>

</details>

Hermes AI Gateway Model-Selection in 2026: When to Escalate Sonnet to Opus

The Auxiliary Task Trap: Why Opus-as-Default Costs More Than You Think

The 5 Escalation Signals That Actually Justify Opus

The Real Cost Stack: A Worked Example

Configuring the Routing Stack in Hermes

Pattern A: LiteLLM Fallback Proxy

Pattern B: ClawRouter Automatic Routing

What Benchmarks Won't Tell You (and Trace Evaluation Will)

References

Add OAuth 2.1 to Your MCP Server: The GitHub Connector Pattern in 2026

Hermes AI Gateway Model-Selection in 2026: When to Escalate Sonnet to Opus

The Auxiliary Task Trap: Why Opus-as-Default Costs More Than You Think

The 5 Escalation Signals That Actually Justify Opus

The Real Cost Stack: A Worked Example

Configuring the Routing Stack in Hermes

Pattern A: LiteLLM Fallback Proxy

Pattern B: ClawRouter Automatic Routing

What Benchmarks Won't Tell You (and Trace Evaluation Will)

References

Add OAuth 2.1 to Your MCP Server: The GitHub Connector Pattern in 2026

Hermes AI Gateway Model-Selection in 2026: When to Escalate Sonnet to Opus

The Auxiliary Task Trap: Why Opus-as-Default Costs More Than You Think

The 5 Escalation Signals That Actually Justify Opus

The Real Cost Stack: A Worked Example

Configuring the Routing Stack in Hermes

Pattern A: LiteLLM Fallback Proxy

Pattern B: ClawRouter Automatic Routing

What Benchmarks Won't Tell You (and Trace Evaluation Will)

References

Related from the academy

Add OAuth 2.1 to Your MCP Server: The GitHub Connector Pattern in 2026

Hermes AI Gateway Model-Selection in 2026: When to Escalate Sonnet to Opus

The Auxiliary Task Trap: Why Opus-as-Default Costs More Than You Think

The 5 Escalation Signals That Actually Justify Opus

The Real Cost Stack: A Worked Example

Configuring the Routing Stack in Hermes

Pattern A: LiteLLM Fallback Proxy

Pattern B: ClawRouter Automatic Routing

What Benchmarks Won't Tell You (and Trace Evaluation Will)

References

Related from the academy

Add OAuth 2.1 to Your MCP Server: The GitHub Connector Pattern in 2026