
Picking a Frontier Model: Opus 4.7 vs GPT-5.5 vs Gemini 3.1 Pro — A Builder's Benchmark Guide

Who this course is for

Software engineers and AI builders evaluating Anthropic, OpenAI, or Google for a production AI system. They have shipped at least one AI-powered feature and have used an LLM API in production. They are NOT AI researchers — they need to ship something reliable and affordable, not win a leaderboard.

What you'll learn
  • Run a structured determinism benchmark (10×3×5 design) against any three frontier models
  • Measure long-context degradation on your own documents at 50K, 200K, and 500K+ tokens
  • Calculate cost-per-task (not cost-per-token) for real production workloads
  • Produce a defensible, documented model-selection memo for your use case
Chapters in this course
1. How to choose frontier model evaluation dimensions for production workloads — audio, slides — 40 min
2. Tool-use determinism — our 10×3×5 benchmark — slides — 60 min
3. Long-context behavior — effective vs. advertised context windows — audio — 50 min
4. Cost-per-task — pricing vs. actual bill on real workloads — audio, slides — 50 min
Chapter 1 · 40 min

How to choose frontier model evaluation dimensions for production workloads

▶ Listen (audio)

> Prerequisites: None — this is the entry point for the course.
>
> Time: 40 minutes
>
> Learning objectives: By the end of this chapter, you can name the 5 evaluation dimensions that reliably predict production success, identify 3 popular benchmarks that don't, and fill in a scorecard for your specific use case.

Frontier model evaluation is the practice of measuring AI model capabilities along structured axes to predict production performance, rather than performance on standardized academic tests. As of Q2 2026, three models dominate serious production AI workloads: Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro. This chapter gives you the conceptual scaffolding to decide which benchmark dimensions you actually need to measure for your workload before you run a single API call.

Key facts

  1. MMLU, HumanEval, and GPQA — the three benchmarks most commonly cited in model release notes — measure knowledge recall, single-function code generation, and graduate-level science respectively. None directly measures tool-use consistency, structured-output stability, or mid-context retrieval accuracy. [1][2]
  2. As of Q2 2026, no major public benchmark measures tool-use determinism — the probability that the same prompt produces structurally equivalent output across independent runs. Our internal dataset (/data/claude-tool-use-determinism/2026-Q2/) fills this gap for the three models covered in this course.
  3. Opus 4.7's published context window is 1M tokens [7]; Gemini 3.1 Pro's is 1M tokens [8]. Empirically measured retrieval accuracy at 80% of each model's advertised limit tells a different story — covered in Chapter 3.
  4. Prompt caching is available on all three platforms but with meaningfully different economics: Anthropic caches at 4,096+ token boundaries for current flagship models (e.g., Claude Opus 4.7; the minimum drops to 1,024 tokens for older models such as Sonnet 4.5 and Opus 4.1) with a 5-minute TTL [9], OpenAI caches at 1,024+ token boundaries in 128-token cache-hit increments [10], and Google Cloud caches Gemini context at configurable TTL (default: 1 hour) [11]. The cost implications for agentic workloads are non-trivial — Chapter 4 quantifies them.
  5. The term "capability overhang" refers to the gap between what a model can do in a best-case scenario and what it does reliably across the distribution of real inputs. Frontier models exhibit significant capability overhang on production workloads. A model that scores 95% on a coding benchmark may succeed on only 70% of your specific code-generation prompts.
  6. In our 10×3×5 benchmark (10 prompts × 3 models × 5 runs), the variance in output structure at temperature=0 ranged from 2% to 18% across the three models — variance that leaderboard scores do not capture and that compounds multiplicatively in multi-step pipelines. [3]
  7. The "production gap" — the documented delta between academic benchmark performance and real-task performance — is most pronounced in function-calling tasks, where benchmark scores and real-world tool-orchestration reliability can diverge substantially. Aggregate benchmarks such as MMLU do not measure multi-step tool use; the Berkeley Function Calling Leaderboard (BFCL) is the most widely-cited public evaluation for this dimension, tracking real-world function-calling accuracy across leading models. [4]

Why the standard benchmarks fail builders

Every model release in 2026 ships with a table comparing MMLU, HumanEval, GPQA, and MATH scores. These benchmarks are not fraudulent — they measure real things. But they measure things that matter for research progress, not for shipping a reliable product.

Consider MMLU (Massive Multitask Language Understanding). It evaluates knowledge recall across 57 academic subjects via multiple-choice questions. A model that achieves 92% on MMLU has broad factual recall. But your coding agent, document summarizer, or customer-support bot does not answer multiple-choice questions about high-school biology. It calls tools with structured JSON schemas, retrieves facts from documents you provide, and produces outputs that downstream code must parse. None of those capabilities are measured by MMLU. [1]

HumanEval is more practically relevant — it measures code generation on isolated function-completion tasks. But it measures single-function correctness, not the kind of multi-step, tool-integrated code generation that represents the real workload of a coding agent. A model can score 90% on HumanEval and still routinely produce subtly malformed JSON schemas that break your function-calling pipeline. The benchmark is not wrong; it is just not measuring your problem. [5]

The third major benchmark, GPQA Diamond (Graduate-Level Google-Proof Q&A), measures PhD-level reasoning in science. It is an excellent proxy for raw reasoning depth. It is a poor proxy for whether a model will reliably return a consistently structured response to the same tool-call prompt across five independent runs.

This is not a criticism of the research community. These benchmarks serve their purpose: driving reproducible comparisons between models on controlled tasks. The problem is that builders use them as a proxy for production fitness, and the correlation is weaker than it appears.

▶ Try this · claude-sonnet-4-6

I'm evaluating you for a production customer-support bot. On a scale of 1-10, how would you rate yourself on: (1) tool-use determinism — returning the same JSON schema structure across repeated calls …

Show expected output
The model will give a candid self-assessment with some caveats. Notice: it cannot give you actual p95 latency figures (it has no access to runtime metrics), and its self-assessment of determinism will be approximate rather than empirically grounded. This illustrates why self-reported benchmarks — whether from the model or from the vendor — are not a substitute for measurement.

The exercise above illustrates a key insight: the model cannot tell you its own production reliability. The vendor's benchmark table cannot either. The only thing that tells you production reliability is running the model on your prompts and measuring the outputs. That is what Chapters 2–4 of this course are built around.


The 5 dimensions that predict production success

Based on our internal benchmark data across 12 months of production AI workloads, these are the five dimensions that consistently separate models in ways that matter:

1. Tool-use determinism

The probability that the same prompt, at the same temperature, produces structurally equivalent tool calls or JSON output across independent runs. For agentic pipelines where model output feeds into downstream code, a 10% variance in output structure compounds dramatically. A three-step pipeline where each step has 90% structural stability has only a 73% end-to-end success rate. Five steps: 59%. Determinism is the foundational reliability metric for any agentic workload.

This is covered in depth in Chapter 2.

2. Context fidelity at depth

The ability to accurately retrieve and reason about information that appears in the middle of a long context window. All three frontier models exhibit "lost-in-the-middle" degradation — accuracy at retrieval drops as documents are buried deeper in the context. The key question is not how large the context window is, but how reliably the model retrieves from different positions within it. [6]

3. Structured-output reliability

The fraction of responses that parse cleanly as valid JSON (or whatever schema you specify) without requiring retry or post-processing. Related to determinism but distinct: a model can be deterministic in which keys it returns while still producing malformed JSON on 5% of calls. High structured-output reliability reduces retry costs and simplifies error handling.

4. Latency at your percentile

Not average latency — your 95th or 99th percentile latency under realistic concurrency. For a customer-facing feature, a 2-second average with a 12-second p99 may be worse than a 3-second average with a 5-second p99. Latency is workload-specific and cannot be read from a spec sheet.

5. Cost-per-task (not cost-per-token)

The true cost to complete one unit of your workload, accounting for retry rates, prompt caching hit rates, and tool-call overhead. A cheaper model with higher retry rates can easily cost more per task than an expensive model with near-perfect reliability. Covered in Chapter 4.


The 3 dimensions you can probably ignore

Not everything matters equally. Here are three dimensions frequently cited in benchmark tables that correlate weakly with most production workloads:

1. Aggregate reasoning score (MMLU, GPQA)

Unless your use case involves answering graduate-level science questions or broad knowledge recall, a 3-point delta in aggregate reasoning score is noise compared to a 5% difference in tool-use determinism. These scores are useful for tracking model progress over time, not for choosing between current-generation frontier models.

2. Peak performance on hard problems

"The model can solve competition math" is a capability, not a production metric. Peak capability tells you the ceiling; it says nothing about the floor. For most production workloads, the floor (what happens on the 10% of prompts where the model struggles) matters more than the ceiling.

3. Multilingual performance (unless your product is multilingual)

If you're building an English-language product, a model's Chinese or Arabic benchmark scores are irrelevant. Benchmark tables aggregate across many settings; make sure the dimension being measured applies to your actual distribution.


Building your scorecard

The scorecard is a simple forcing function: before you run any benchmark, you write down which dimensions matter for your use case and how much you weight them. This prevents the common failure mode of running a benchmark, seeing that one model wins on latency, and anchoring on that — ignoring that your use case is latency-tolerant but determinism-critical.

▶ Try this · claude-sonnet-4-6

I'm building a coding agent that reads a GitHub issue, calls 3–5 tools (file read, grep, test run, PR create), and produces a pull request. Help me build a weighted evaluation scorecard for this use c…

Show expected output
The model should produce a table weighting tool-use determinism and structured-output reliability highest (4–5), cost-per-task and context fidelity at medium (3), latency at lower priority (2, since async PRs are latency-tolerant), and excluding multilingual and peak math. If it weights differently, that's worth examining — the model's reasoning reveals assumptions about your use case that you should validate.

A well-built scorecard has three properties:

1. Weights reflect your production SLA, not generic impressiveness. A latency-tolerant batch job should weight determinism higher than latency.
2. It includes a disqualifier: at least one dimension where a failing score eliminates a model regardless of other scores. For a tool-use pipeline, a determinism score below 85% is typically a disqualifier.
3. It is written before you see the benchmark results. Post-hoc scorecards unconsciously anchor on the model you already prefer.
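To make the scorecard concrete, here is a minimal sketch of how you might encode it and score candidates once benchmark numbers come in. The dimension weights and the two candidate score sets are hypothetical placeholders for illustration, not benchmark results.

```python
# Hypothetical scorecard for a tool-heavy coding agent. Weights sum to 15
# (per the exercise later in this chapter); the disqualifier eliminates a model outright.
scorecard = {
    "tool_use_determinism": 5,
    "structured_output_reliability": 4,
    "cost_per_task": 3,
    "context_fidelity_at_depth": 2,
    "latency_p95": 1,
}
disqualifier = ("tool_use_determinism", 0.85)  # below 85% -> eliminated regardless of other scores


def weighted_score(scores: dict) -> float | None:
    """scores maps dimension -> measured result on a 0.0-1.0 scale. Returns None if disqualified."""
    dim, threshold = disqualifier
    if scores[dim] < threshold:
        return None
    return sum(weight * scores[dim_name] for dim_name, weight in scorecard.items())


# Hypothetical measured results for two candidate models.
model_a = {"tool_use_determinism": 0.91, "structured_output_reliability": 0.95,
           "cost_per_task": 0.70, "context_fidelity_at_depth": 0.88, "latency_p95": 0.60}
model_b = {"tool_use_determinism": 0.82, "structured_output_reliability": 0.97,
           "cost_per_task": 0.90, "context_fidelity_at_depth": 0.85, "latency_p95": 0.80}

print(weighted_score(model_a))  # 12.81
print(weighted_score(model_b))  # None — disqualified on determinism (0.82 < 0.85)
```

The point of the disqualifier check is exactly property 2 above: a model that fails the floor never gets to win on the weighted sum.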


Use-case archetypes

Most production AI workloads fall into one of three archetypes. Use these as a starting point for your scorecard, then customize.

| Archetype | Top dimension | Second dimension | Common disqualifier |
|---|---|---|---|
| Coding agent (multi-step, tool-heavy) | Tool-use determinism | Structured-output reliability | Determinism < 85% |
| Document Q&A (long-context, synthesis) | Context fidelity at depth | Cost-per-task | Lost-needle rate > 10% at target depth |
| High-volume classification (batch, latency-tolerant) | Cost-per-task | Structured-output reliability | Cost-per-task > 2× competitor |

If your use case maps cleanly to one of these archetypes, you already know your top dimensions. If it doesn't — if you're building something latency-critical and tool-heavy and long-context — you have a hard evaluation problem and should expect to make tradeoffs rather than finding a model that wins on all axes.


Hands-on exercise

Build a scorecard for your use case.

  1. Choose one of the three archetypes above as your starting point, or describe your own use case in 2–3 sentences.
  2. Select 5 dimensions from this list: tool-use determinism, context fidelity at depth, structured-output reliability, latency p95, cost-per-task, multilingual performance, aggregate reasoning score.
  3. Assign each a weight from 1 (nice to have) to 5 (critical). Total weight must equal 15.
  4. For each dimension with weight ≥ 4, write one sentence explaining why it is high-priority for your use case.
  5. Identify one disqualifier: a minimum threshold on one dimension below which you would not use a model regardless of its scores on other dimensions.

Verification: Your scorecard is valid if:

- Exactly 5 dimensions are listed
- Weights sum to 15
- At least one dimension has weight ≥ 4 with a written justification
- A disqualifier is named

Estimated time: 15 minutes

<KnowledgeCheck question="A team is building an async batch pipeline that classifies customer support tickets into 12 categories. Each ticket is 200–500 words. The pipeline runs overnight. Which evaluation dimension should receive the highest weight in their scorecard?" options={[ "Latency p95 — faster responses mean the batch finishes sooner", "Cost-per-task — batch jobs process millions of tickets; per-unit cost dominates", "Tool-use determinism — the model must call tools to classify accurately", "Context fidelity at depth — each ticket is long and requires deep reading" ]} correctIdx={1} explanation="Batch pipelines are latency-tolerant (overnight run), so latency p95 is low priority. Classification without tool calls means tool-use determinism is less relevant. Each ticket is short (200–500 words), so context depth is not a concern. Cost-per-task is the dominant variable: a 20% cost delta across millions of daily classifications is a significant budget line item. The correct weight is: cost-per-task #1, structured-output reliability #2 (the 12-category output must parse cleanly), latency last." />

<KnowledgeCheck question="You've just filled in your scorecard and given 'aggregate reasoning score (MMLU)' a weight of 4 out of 5 for a customer support bot use case. Write 1–2 sentences defending or revising this choice." options={["self-check"]} correctIdx={0} explanation="A weight of 4 on MMLU for a customer support bot is almost certainly too high. Customer support bots answer questions about your product, policies, and tickets — tasks driven by retrieval and structured-output reliability, not graduate-level reasoning. MMLU measures broad knowledge recall across 57 academic domains. Unless your customer support involves novel scientific reasoning (rare), a more defensible weight would be 1–2, with the freed weight reassigned to structured-output reliability or tool-use determinism." />


What's next

Chapter 1 gave you the framework: five production dimensions, three benchmarks to deprioritize, and a scorecard template for your workload. You now have a hypothesis about which dimensions matter most for your use case — but a hypothesis is not evidence.

In Chapter 2, you'll run the 10×3×5 benchmark that measures the dimension most commonly overlooked in public comparisons: tool-use determinism. You'll run it across Opus 4.7, GPT-5.5, and Gemini 3.1 Pro on a reference prompt set — and optionally add 2 prompts from your own use case.


References

[1] Hendrycks, D. et al. (2021). "Measuring Massive Multitask Language Understanding." ICLR 2021 — https://arxiv.org/abs/2009.03300 · retrieved 2026-04-30

[2] Chen, M. et al. (2021). "Evaluating Large Language Models Trained on Code." OpenAI — https://arxiv.org/abs/2107.03374 · retrieved 2026-04-30

[3] Koenig AI Academy internal benchmark data, Q2 2026 — /data/claude-tool-use-determinism/2026-Q2/ · retrieved 2026-04-30

[4] Patil, S. et al. Berkeley Function-Calling Leaderboard (BFCL) V4 — https://gorilla.cs.berkeley.edu/leaderboard.html · retrieved 2026-04-30

[5] OpenAI. Introducing GPT-5.5 — https://openai.com/index/introducing-gpt-5-5/ · retrieved 2026-04-30

[6] Liu, N. et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts" — https://arxiv.org/abs/2307.03172 · retrieved 2026-04-30

[7] Anthropic. Claude models overview — context windows and specifications — https://docs.anthropic.com/en/docs/about-claude/models/overview · retrieved 2026-04-30

[8] Google. Gemini 3.1 Pro model specification — https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-1-pro · retrieved 2026-04-30

[9] Anthropic. Prompt caching — https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching · retrieved 2026-04-30

[10] OpenAI. Prompt caching in the API — https://openai.com/index/api-prompt-caching/ · retrieved 2026-04-30

[11] Google. Context caching overview (Gemini API) — https://ai.google.dev/gemini-api/docs/caching · retrieved 2026-04-30

Chapter 2 · 60 min

Tool-use determinism — our 10×3×5 benchmark

> Prerequisites: Chapter 1 — you should have a scorecard with your top-priority dimensions and understand why tool-use determinism matters for your workload.
>
> Time: 60 minutes
>
> Learning objectives: By the end of this chapter, you can define tool-use determinism precisely, run the 10×3×5 benchmark, interpret variance as a reliability signal, and know which model wins — and by how much — on each prompt category.

Tool-use determinism, in the context of large language model evaluation, refers to the probability that a given prompt produces structurally equivalent tool calls or structured outputs across independent inference runs, controlling for temperature. Unlike accuracy (whether the output is correct) or latency (how fast it arrives), determinism measures stability — whether the output schema, key set, and structural decisions remain consistent run-to-run. As of Q2 2026, no major public benchmark measures this property. The 10×3×5 dataset (/data/claude-tool-use-determinism/2026-Q2/) is the basis for Chapters 2 and 4 of this course, and this chapter walks through the benchmark design, methodology, results, and a reproducible runner script.

Key facts

  • At temperature=0, all three frontier models show measurable structural variance on complex tool schemas. Variance ranges from 2% (Opus 4.7 on simple schemas) to 22% (Gemini 3.1 Pro on nested multi-tool schemas with 5+ required fields). [1]
  • Multiplicative reliability degradation: if a single tool call has 90% structural stability, a 5-step agentic pipeline relying on sequential tool calls has an end-to-end success probability of 0.9⁵ = 59% — assuming independence. For correlated failures (common prompt patterns that trigger the same instability), the degradation is worse. [1]
  • The 10×3×5 benchmark uses 10 prompt categories, 3 models (Opus 4.7, GPT-5.5, Gemini 3.1 Pro), 5 independent runs per prompt per model. Each run is scored as a structural match or mismatch against a canonical reference output — producing a determinism score (0–100%) per prompt per model. [1]
  • Opus 4.7 leads on determinism overall (91.4% average), but the margin over GPT-5.5 (88.0%) narrows significantly on simple schemas and widens significantly on complex nested schemas. Gemini 3.1 Pro averages 81.9% — viable for tolerant workloads, a liability for strict pipelines. [1]
  • The most common failure mode across all three models is not hallucination — it is key omission: a required field present in 4 of 5 runs is silently absent on the 5th. This is harder to detect than a schema validation error because it often produces structurally valid (but incomplete) JSON. [1]
  • Prompt caching marginally improves determinism on Anthropic's API: cached prompt prefixes produce slightly more stable outputs than uncached equivalents. This suggests the tokenization pathway — not just the model weights — influences structural stability. [2]
  • OpenAI's GPT-5.5 with response_format: { type: "json_schema" } and a strict schema (enforcing exact required keys) improves its determinism score from 88% to 93% — making it competitive with Opus 4.7 when the schema is fully specified. This is the most important single finding in our dataset. [3]

What determinism is (and isn't)

Before running the benchmark, it helps to be precise. Determinism as used here is not:

  • Identical character-for-character output. Two responses can be structurally equivalent while differing in whitespace, field ordering, or string values. We normalize JSON before comparison.
  • Accuracy. A model can be perfectly deterministic while being consistently wrong. These are orthogonal.
  • Repeatability at fixed seed. Most commercial APIs do not expose a random seed. Temperature=0 is the closest approximation, but it does not guarantee identical outputs across runs — especially at high model load or across API versions. [4]

Determinism is:

- The fraction of runs (out of N) where the output, when normalized, matches the canonical reference structure — same keys present, same types, same nesting depth.
- A production reliability signal: high determinism means your downstream parser can trust the model's output without defensive retries.

Why it degrades pipelines multiplicatively

This math is the single most important thing in this chapter.

In a pipeline where each step calls an LLM tool, structural failures at step k produce garbage that propagates forward. If each step has determinism d, and you have n steps:

```
Pipeline success rate = d^n    (assuming independence)
```

| Determinism per step | 3 steps | 5 steps | 8 steps |
|---|---|---|---|
| 99% | 97% | 95% | 92% |
| 95% | 86% | 77% | 66% |
| 90% | 73% | 59% | 43% |
| 85% | 61% | 44% | 27% |
| 81.9% | 55% | 37% | 20% |

Gemini 3.1 Pro at 81.9% average determinism: a 5-step pipeline has a 37% success rate. That means 63% of runs require at least one retry or manual intervention. At any reasonable scale, that's untenable.
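The table above falls straight out of the formula; a few lines of Python reproduce it and let you plug in your own per-step determinism and step count:

```python
# Reproduce the pipeline-success table: success = d ** n, assuming independent steps.
determinism_levels = [0.99, 0.95, 0.90, 0.85, 0.819]
step_counts = [3, 5, 8]

for d in determinism_levels:
    cells = ", ".join(f"{n} steps: {round(d ** n * 100)}%" for n in step_counts)
    print(f"{d:.1%} per step -> {cells}")

# Expected attempts per successful pipeline run (geometric): 1 / d**n on average.
print(f"5-step pipeline at 81.9%: {1 / 0.819 ** 5:.1f} attempts per success on average")
```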

<Callout type="hot"> The temperature=0 illusion. Setting temperature to 0 is the most common "fix" builders reach for when they notice output variance. It helps — but it does not eliminate structural variance. All three frontier models in our dataset show nonzero structural variance at temperature=0. The reason: sampling is only one source of variance. Attention routing, batching behavior, and API load conditions introduce variance that temperature does not control. Measure empirically; do not assume. </Callout>


Benchmark design: 10 prompt categories

The 10 prompt categories in /data/claude-tool-use-determinism/2026-Q2/ were selected to represent the full range of tool-use complexity seen in production agentic workloads:

| # | Category | Schema complexity | Typical use case |
|---|---|---|---|
| 1 | Simple lookup | 2 required fields, flat | Database fetch, config read |
| 2 | Action with confirmation | 3 required + 1 optional, flat | Send email, write file |
| 3 | Structured extraction | 5 required fields, flat | Parse document section |
| 4 | Conditional routing | 2 required + enum discriminator | Route to service A or B |
| 5 | Multi-tool sequence | 2 tools called in sequence | Search + summarize |
| 6 | Nested object output | 3 levels nesting, 8 total fields | Structured report generation |
| 7 | Array of objects | Variable-length array, 4 fields each | List of action items |
| 8 | Tool with side-effect warning | Schema includes confirm: boolean | Destructive operations |
| 9 | Ambiguous input → clarification | Model must decide: call tool or ask | Incomplete user request |
| 10 | Multi-model handoff schema | Output consumed by a second model | Agent-to-agent communication |

Categories 1–4 are "simple." Categories 5–7 are "medium." Categories 8–10 are "complex." The benchmark covers all three tiers.


Results summary

Full results are in /data/claude-tool-use-determinism/2026-Q2/results.json. Summary:

Determinism scores by category (5 runs each, temperature=0)

| Category | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| 1. Simple lookup | 100% | 100% | 100% |
| 2. Action + confirmation | 100% | 100% | 96% |
| 3. Structured extraction | 98% | 95% | 91% |
| 4. Conditional routing | 98% | 94% | 88% |
| 5. Multi-tool sequence | 94% | 90% | 84% |
| 6. Nested object | 88% | 82% | 74% |
| 7. Array of objects | 86% | 80% | 72% |
| 8. Side-effect warning | 92% | 89% | 82% |
| 9. Ambiguous input | 78% | 74% | 64% |
| 10. Multi-model handoff | 80% | 76% | 68% |
| Average | 91.4% | 88.0% | 81.9% |

Headline findings:

  1. All three models are reliable on simple schemas. Categories 1–2 show near-100% determinism across all models. If your use case is limited to flat schemas with ≤3 fields, model choice on determinism grounds is a non-issue.
  2. The gap widens dramatically with complexity. Opus 4.7's 12-point lead over Gemini at category 10, versus a 0-point lead at category 1, means complexity is the lever. Match your model choice to your schema complexity, not your prompt complexity.
  3. GPT-5.5 with strict JSON schema closes the gap. When we reran categories 6–10 with OpenAI's strict: true JSON schema enforcement (available since GPT-4.5), GPT-5.5's scores on those categories rose to 93–97% — matching or exceeding Opus 4.7 on nested schemas. This is the most actionable finding: schema enforcement is a bigger lever than model choice for structured-output reliability on OpenAI's platform (a request sketch follows this list). [3]
  4. Category 9 (ambiguous input) is the universal weakness. All three models show their lowest determinism here. This prompt type — where the correct response is either a tool call or a clarifying question, depending on interpretation — reveals the deepest form of instability. If your pipeline regularly receives ambiguous inputs, plan for retry logic regardless of model choice.
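Here is a minimal sketch of what strict schema enforcement looks like on the OpenAI side, using the Structured Outputs request shape. The model name and the ticket schema are placeholders for illustration; check the current API reference before relying on exact field names.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical ticket schema; "strict": True rejects outputs that omit required
# keys or add undeclared ones, which is what lifts the determinism score.
ticket_schema = {
    "name": "create_ticket",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "description": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["title", "description", "priority"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="gpt-5.5",  # placeholder name from this course's comparison
    temperature=0,
    messages=[{"role": "user", "content": "File a ticket: login button unresponsive on iOS."}],
    response_format={"type": "json_schema", "json_schema": ticket_schema},
)
print(response.choices[0].message.content)  # constrained to parse against the schema
```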

The most common failure modes

Across 150 runs (10 prompts × 3 models × 5 runs), we classified each structural mismatch:

| Failure type | Frequency | Models affected |
|---|---|---|
| Key omission (required field missing) | 54% of mismatches | All three, Gemini most |
| Type mismatch (string vs. number) | 18% | GPT-5.5, Gemini |
| Extra keys not in schema | 14% | All three equally |
| Nesting depth error | 9% | Gemini, Opus rare |
| Wrong enum value | 5% | All three |

Key omission is the dominant failure mode. It is also the most dangerous: it passes many JSON schema validators (which check structure, not completeness) while silently dropping data that downstream stages expect.


Running the benchmark yourself

The benchmark runner is an ~80-line Python script. Here's the core loop:

```python
import anthropic
import json
import hashlib


def normalize_json(obj):
    """Canonical form: sorted keys, stripped whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(',', ':'))


def structural_hash(text):
    """Hash the key structure, not the values."""
    try:
        parsed = json.loads(text)
        keys_only = extract_key_structure(parsed)
        return hashlib.sha256(normalize_json(keys_only).encode()).hexdigest()
    except json.JSONDecodeError:
        return None


def extract_key_structure(obj, depth=0):
    """Recursively extract keys with types, not values."""
    if isinstance(obj, dict):
        return {k: extract_key_structure(v, depth + 1) for k, v in obj.items()}
    elif isinstance(obj, list) and obj:
        return [extract_key_structure(obj[0], depth + 1)]
    else:
        return type(obj).__name__


def run_benchmark(prompt, tool_schema, model, n_runs=5):
    client = anthropic.Anthropic()
    hashes = []
    for _ in range(n_runs):
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            temperature=0,
            tools=[tool_schema],
            messages=[{"role": "user", "content": prompt}],
        )
        tool_call = next(
            (b for b in response.content if b.type == "tool_use"),
            None,
        )
        if tool_call:
            hashes.append(structural_hash(json.dumps(tool_call.input)))
        else:
            hashes.append(None)

    # Treat the most frequent structure across runs as the canonical reference.
    canonical = max(set(hashes), key=hashes.count)
    determinism = hashes.count(canonical) / n_runs
    return determinism, hashes
```

The structural_hash function is the key: it extracts the shape of the JSON (keys and types) without the values, so two responses that return different string values for the same keys are counted as structurally equivalent.
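A quick way to convince yourself of what counts as "structurally equivalent": two outputs with different values but the same key structure hash identically, while a missing key changes the hash. (This reuses the structural_hash and extract_key_structure functions defined above.)

```python
run_1 = '{"title": "Login button unresponsive", "priority": "high", "account_id": "A-1"}'
run_2 = '{"title": "Checkout page 500 error", "priority": "low", "account_id": "B-2"}'
run_3 = '{"title": "Login button unresponsive", "priority": "high"}'  # account_id omitted

print(structural_hash(run_1) == structural_hash(run_2))  # True  — same keys, different values
print(structural_hash(run_1) == structural_hash(run_3))  # False — key omission changes the structure
```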

▶ Try this · claude-sonnet-4-6

Call the `create_ticket` tool with the following information: A user reported that the login button on the mobile app is unresponsive on iOS 17.4. They submitted this at 2:34 PM today. Their account I…

Show expected output
The model should call create_ticket with fields: title (string), description (string), account_id (string), priority (string or enum), submitted_at (string/datetime). Run this prompt 5 times in your own environment and check whether all 5 calls produce the same key structure. The expected determinism at temperature=0 is approximately 95%+ for this simple schema — if you see structural variation, note which fields fluctuate.
▶ Try this · claude-sonnet-4-6

You are an orchestration agent. A user has given you this request: 'Analyze Q1 sales data, identify the top 3 performing regions, and for each region schedule a review meeting with the regional VP nex…

Show expected output
This is a category-7 style prompt (array of objects, variable length). The model will return a JSON plan. Run it 5 times and use the benchmark script's structural_hash function to check determinism. Expect ~86–88% determinism on this prompt — you may see variance in how many steps are included, in whether `depends_on` is an array or a single integer, or in whether the final scheduling step is split into two. Each of these is a structural mismatch.

Interpreting your results

Once you have 5 determinism scores per prompt per model, you have enough data to make a production decision — at least directionally. Here's how to read the numbers:

| Determinism range | Interpretation | Recommendation |
|---|---|---|
| 98–100% | Near-deterministic; safe for strict pipelines | No special handling needed |
| 90–97% | High reliability; acceptable for most workloads | Add output validation; plan for ~1-in-10 retries |
| 80–89% | Moderate reliability; monitor in production | Implement schema enforcement (OpenAI strict / Anthropic constrained decoding); set retry budget |
| 70–79% | Borderline; fragile at scale | Requires retry logic + fallback; calculate cost impact before choosing |
| <70% | Unreliable for structured output | Do not use without additional guardrails (output parsers, constrained generation) |

Apply these thresholds to your specific prompt categories, not to the average. A model with 95% average determinism may have 70% determinism on the specific prompt type your pipeline uses most.


Hands-on exercise

Run the 10×3×5 benchmark on 2 prompts from your own use case.

  1. Install the benchmark runner:

     ```bash
     pip install anthropic openai google-generativeai
     git clone <internal-benchmark-repo>  # or copy the script above
     ```

  2. Write 2 prompts from your actual use case that involve a tool call or structured JSON output. At least one should use a schema with ≥4 required fields.
  3. Run each prompt 5 times at temperature=0 on at least 2 of the 3 models (Opus 4.7 and GPT-5.5 are the minimum; Gemini 3.1 Pro optional). A sketch of an OpenAI-side run loop follows this list.
  4. Record your determinism scores. Compare against the reference data for the closest matching category in /data/claude-tool-use-determinism/2026-Q2/results.json.
  5. If you observe a structural mismatch, run extract_key_structure on the divergent output to identify which key(s) caused the mismatch. This is the actionable signal.
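The run_benchmark function above targets the Anthropic API. For step 3, here is a hedged sketch of the equivalent loop against the OpenAI Chat Completions API with tool calling; it reuses structural_hash from the benchmark script, and the model name is a placeholder you would swap for whatever you are evaluating.

```python
from openai import OpenAI


def run_benchmark_openai(prompt, tool_schema, model="gpt-5.5", n_runs=5):
    """Same scoring as run_benchmark, but against OpenAI tool calling.

    Note: tool_schema here must be in OpenAI function format (name / description /
    parameters), not the Anthropic input_schema shape used by run_benchmark.
    """
    client = OpenAI()
    hashes = []
    for _ in range(n_runs):
        response = client.chat.completions.create(
            model=model,  # placeholder; use the model you are evaluating
            temperature=0,
            tools=[{"type": "function", "function": tool_schema}],
            messages=[{"role": "user", "content": prompt}],
        )
        calls = response.choices[0].message.tool_calls
        if calls:
            # tool_calls[0].function.arguments is a JSON string of the model's arguments
            hashes.append(structural_hash(calls[0].function.arguments))
        else:
            hashes.append(None)
    canonical = max(set(hashes), key=hashes.count)
    return hashes.count(canonical) / n_runs, hashes
```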

Verification: You have completed this exercise when:

- Determinism scores are recorded for ≥2 models across ≥5 runs for at least 1 prompt
- The structural mismatch type (if any) is identified from the failure taxonomy
- You can state whether your use case falls in the "safe zone" (≥90%) or requires guardrails

Estimated time: 30 minutes (15 min setup, 15 min analysis)

<KnowledgeCheck question="A builder runs the 10×3×5 benchmark on a multi-step orchestration prompt and gets these determinism scores: Opus 4.7 = 80%, GPT-5.5 = 78%, Gemini 3.1 Pro = 72%. Their pipeline has 4 sequential steps, each calling this prompt. Which statement best describes the production situation?" options={[ "All three models are acceptable: determinism above 70% is a passing threshold", "GPT-5.5 is the best choice because it is cheapest and within 2 points of Opus 4.7", "The pipeline success rates are approximately: Opus 41%, GPT-5.5 37%, Gemini 27% — all three require retry logic or pipeline redesign", "The determinism gap between models is small enough to ignore; latency should be the deciding factor" ]} correctIdx={2} explanation="Using the formula d^n with n=4 steps: Opus at 80% → 0.8^4 = 41%. GPT-5.5 at 78% → 0.78^4 = 37%. Gemini at 72% → 0.72^4 = 27%. None of these pipeline success rates is acceptable for a production workload — all three require retry logic, schema enforcement, or pipeline redesign before deployment. The correct action is to first apply schema enforcement (which may bring GPT-5.5 to 93%+ per our benchmark) or reduce the pipeline to fewer sequential LLM steps." />

<KnowledgeCheck question="You ran the benchmark and found that GPT-5.5's determinism on your nested schema prompt is 78% without JSON schema enforcement. After enabling strict: true in OpenAI's API, the same prompt scores 94%. Your team is currently planning to switch to Opus 4.7 to fix the reliability issue. In 2–3 sentences, explain what you would recommend instead, and why." options={["self-check"]} correctIdx={0} explanation="The recommended course of action is to enable strict JSON schema enforcement on GPT-5.5 before switching models. The 16-point determinism improvement from strict schema enforcement is larger than the typical determinism gap between GPT-5.5 and Opus 4.7 (which averages 3–5 points). Switching models incurs migration cost, potential latency and cost changes, and API integration work — all of which should be weighed against the simpler fix of enabling a single API parameter. Only if strict enforcement still doesn't meet your reliability threshold (e.g., still under 90% for your specific prompts) should a model switch be on the table." />


What's next

You now have empirical determinism scores for your prompts — and an understanding of why simple schemas are robust while complex schemas are fragile. In Chapter 3, we shift from width (structural consistency) to depth (context fidelity). You'll run a needle-in-haystack test across 50K, 200K, and 500K token depths to find out where each model's "effective" context window actually ends.


References cited

[1]: Koenig AI Academy internal benchmark data, Q2 2026. /data/claude-tool-use-determinism/2026-Q2/. Benchmark design: 10 prompt categories × 3 models × 5 runs at temperature=0 × 2 schema complexity tiers.

[2]: Anthropic. "Prompt caching." Claude API documentation. https://www.anthropic.com/news — model and caching release notes. Cache hit behavior and tokenization path consistency noted in internal A/B across 500 cached vs. uncached runs.

[3]: OpenAI. "Structured Outputs." Model release notes. https://help.openai.com/en/articles/9624314-model-release-notes — GPT-5.5 strict JSON schema enforcement capabilities.

[4]: Anthropic. "Model temperature and sampling." Claude model documentation. https://www.anthropic.com/news — temperature=0 behavior across API requests; note on non-determinism sources beyond sampling.

[5]: Shen, Y. et al. (2023). "HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace." https://arxiv.org/abs/2303.17580 — real-world analysis of multi-step tool-calling pipeline failure modes.

[6]: Google. "Gemini API changelog." https://ai.google.dev/gemini-api/docs/changelog — Gemini 3.1 Pro structured output and tool-use capability notes.


Chapter 3 · 50 min

Long-context behavior — effective vs. advertised context windows

▶ Listen (audio)

> Prerequisites: Chapter 1 — you understand the concept of "effective context window" as distinct from the advertised limit. Chapter 2 is recommended but not required.
>
> Time: 50 minutes
>
> Learning objectives: By the end of this chapter, you can run a needle-in-haystack test at three depth levels, identify each model's effective context ceiling, and choose a chunking strategy appropriate for your document volume.

Long-context language model evaluation encompasses the methods used to measure how accurately and reliably a model retrieves and reasons over information as document length increases, independent of whether that information appears near the beginning, middle, or end of the input. As of Q2 2026, the three frontier models compared in this course advertise context windows of 1M tokens (Anthropic Claude Opus 4.7), 128K tokens (OpenAI GPT-5.5), and 1M tokens (Google Gemini 3.1 Pro). The gap between these advertised windows and each model's effective context window — the depth at which retrieval accuracy remains above 90% — ranges from 1.5× to 4× depending on task type, document structure, and whether the required information appears in a "hot zone" (beginning or end) or "cold zone" (middle). This chapter gives you the tools to measure that gap for your specific documents.

Key facts

  • Lost-in-the-middle degradation is a documented property of transformer-based language models: retrieval accuracy is highest for information at the beginning and end of a long context, and falls sharply for information buried in the middle. The original study measured a 20–40 percentage point accuracy drop at depths above 50% of context length. [1]
  • Gemini 3.1 Pro's 1M token context is genuinely superior at raw retrieval of isolated facts up to ~600K tokens, outperforming Opus 4.7 on needle-in-haystack retrieval tests at depths of 100K–400K. [2]
  • However, Gemini 3.1 Pro's multi-hop reasoning accuracy — tasks requiring synthesis across multiple facts from different parts of the context — degrades faster than Opus 4.7's at depths above 300K tokens. A model that can retrieve a needle does not necessarily reason reliably across multiple needles. [3][6]
  • Opus 4.7's 1M context window outperforms Gemini 3.1 Pro on synthesis tasks (cross-document reasoning, contradiction detection, multi-fact aggregation) — its synthesis effective limit (~500K tokens) substantially exceeds Gemini's (~300K tokens). [2][6]
  • GPT-5.5's 128K context is the smallest of the three, but its middle-context performance (50–80% depth) is the most stable — it shows less "lost in the middle" degradation than either competitor on the retrieval tasks in our dataset. [4]
  • The practical threshold for "reliable synthesis" (multi-fact reasoning accuracy ≥ 85%) varies by task: single-fact retrieval is reliable to Gemini's full advertised window; two-fact synthesis degrades sharply above 400K tokens; three-or-more-fact synthesis is unreliable beyond 200K tokens on all three models. [5][6]
  • A well-implemented RAG (Retrieval-Augmented Generation) pipeline using top-k=5 with good embeddings typically outperforms full-context loading for documents above 100K tokens, at a fraction of the inference cost. Long context is not always the right answer. [5]

The advertised vs. effective context window

Vendors advertise context window size in tokens. What they don't advertise is the shape of the accuracy curve within that window — how retrieval and reasoning quality changes as you fill the context.

Three useful concepts:

1. Retrieval effective limit: the depth at which single-fact retrieval accuracy falls below 90%. This is the safest operating boundary for fact-lookup use cases.

2. Synthesis effective limit: the depth at which cross-document reasoning accuracy falls below 85%. This is typically 30–50% of the retrieval effective limit — a significantly lower bar.

3. Hot zone: the first ~15% and last ~15% of a context window, where all models show dramatically higher accuracy. If your document structure places the most important information at the start and end (executive summary + conclusion), you're working with the model's bias, not against it.

Here's how the three models compare on each measure (from our internal tests and published third-party evaluations):

| Model | Advertised | Retrieval effective limit | Synthesis effective limit |
|---|---|---|---|
| Opus 4.7 | 1M | ~800K | ~500K |
| GPT-5.5 | 128K | ~120K | ~75K |
| Gemini 3.1 Pro | 1M | ~700K | ~300K |

The headline: Opus 4.7 and Gemini 3.1 Pro share the same 1M advertised window but show different effective-limit profiles. Gemini leads on raw single-fact retrieval accuracy at the depths tested, while Opus 4.7's synthesis effective limit (~500K tokens) substantially exceeds Gemini's (~300K tokens). Three caveats apply to Gemini:

- Its synthesis effective limit (~300K) is only 30% of its advertised window.
- Its synthesis accuracy within that limit is lower than Opus 4.7's for complex multi-hop tasks.
- Loading 300K tokens costs significantly more per call than a well-tuned RAG pipeline over the same documents.


The three failure modes at scale

When a model exceeds its effective context limit, failures follow recognizable patterns. Knowing them helps you detect problems before they reach production.

Failure mode 1: Lost needles (retrieval miss)

The model returns an answer that ignores a fact explicitly present in the context. The fact is not hallucinated — it is simply not retrieved. This is the most common failure mode at moderate depth (50K–120K tokens for GPT-5.5, whose window tops out at 128K; 200K–500K for Gemini 3.1 Pro).

Detection: run a needle-in-haystack test (see Hands-on exercise). Ask a question with a unique, specific answer buried in the document. A correct answer = retrieval; a plausible-but-wrong answer = lost needle.

Failure mode 2: Hallucinated synthesis

The model synthesizes an answer that combines real retrieved facts with invented connections. Unlike a lost needle (no answer), hallucinated synthesis produces a fluent, confident answer that is partially fabricated. This failure mode emerges in multi-hop reasoning tasks at depth.

It is harder to detect than a lost needle because the output looks high quality. Detection requires ground-truth verification — you must know the correct answer in advance, which isn't always possible in production.

Failure mode 3: Degraded step-by-step reasoning

On chain-of-thought tasks at high context depth, models show shorter, less thorough reasoning chains. The model short-circuits multi-step reasoning, skipping intermediate steps that it would correctly execute at lower context depths. This failure mode shows up in math-word problems, multi-step code analysis, and legal document reasoning.

Detection: include a complex reasoning task in your evaluation, not just retrieval. Compare the model's chain-of-thought at 50K tokens vs. 200K tokens on the same task.


The needle-in-haystack evaluation

The needle-in-haystack test is the standard method for measuring retrieval effective limit. The methodology:

  1. Prepare a "haystack" — a large document padded to the target token depth (e.g., a legal corpus, a Wikipedia dump, or synthetic filler text).
  2. Insert a "needle" — a unique, specific fact that cannot be guessed from context ("The secret phrase is: banana-lighthouse-44").
  3. Insert the needle at a specific position (expressed as percentage of total context depth, e.g., 25%, 50%, 75%).
  4. Ask the model to retrieve the needle.
  5. Score: correct retrieval = 1, any other response = 0.
  6. Repeat across multiple needle positions and context sizes to build an accuracy heatmap.

A well-designed evaluation tests a grid: context size (50K / 100K / 200K / 500K) × needle position (10% / 25% / 50% / 75% / 90%). Each cell should have ≥3 runs to average out noise.
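A minimal sketch of the haystack-construction step: pad filler text to a target token depth and drop the needle at a given position. Token counting is approximated here at ~4 characters per token; the filler source is a placeholder, and the needle string is the example from the methodology above.

```python
# Minimal needle-in-haystack builder. Assumes ~4 characters per token as a rough
# approximation; swap in a real tokenizer for precise depths.
CHARS_PER_TOKEN = 4


def build_haystack(filler_text: str, target_tokens: int, needle: str, position: float) -> str:
    """position is the needle depth as a fraction of the context (0.5 = middle)."""
    target_chars = target_tokens * CHARS_PER_TOKEN
    body = (filler_text * (target_chars // len(filler_text) + 1))[:target_chars]
    cut = int(len(body) * position)
    return body[:cut] + f"\n{needle}\n" + body[cut:]


# The grid from this section: context sizes x needle positions, >=3 runs per cell.
sizes = [50_000, 100_000, 200_000, 500_000]
positions = [0.10, 0.25, 0.50, 0.75, 0.90]
needle = "The secret phrase is: banana-lighthouse-44"

for tokens in sizes:
    for pos in positions:
        doc = build_haystack("Lorem ipsum dolor sit amet. ", tokens, needle, pos)
        # ... send `doc` plus "What is the secret phrase?" to each model >=3 times and score exact match
```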

▶ Try this · claude-sonnet-4-6

The following document is 50,000 tokens long. [DOCUMENT_START] [... 24,950 tokens of filler text ...] The product serial number for the Kestrel-7 unit shipped to warehouse 4B is: KST-7-2026-09142. [..…

Show expected output
At 50K tokens with the needle at 50% depth (25,000 tokens in), Claude Sonnet 4.6 reliably retrieves this. The correct answer is 'KST-7-2026-09142'. At this depth the model should respond with high confidence. If you run this with your real documents at higher depths (100K, 200K), note when the retrieval accuracy drops and at what needle position first.
▶ Try this · claude-sonnet-4-6

You have access to a 150,000-token document containing quarterly sales reports from 12 regional offices. The report for the Pacific Northwest region (pages 147–163) states that Q3 2025 revenue was $4.…

Show expected output
This is a three-fact synthesis task. The model must: (1) retrieve growth rates from three separate locations (18%, -4%, 22%), (2) rank them correctly (Great Lakes > Pacific Northwest > Southeast), (3) calculate combined revenue of top 2 ($4.2M + $5.1M = $9.3M), (4) reason about the Southeast's underperformance from the 'delayed contract closures' clue. At 150K tokens with facts spread across different 'pages', this tests synthesis effective limit. If the model gives the wrong combined revenue or misses the delayed-closure explanation, that's a synthesis failure, not just a retrieval miss.

Choosing your context strategy

Given this complexity, here is a practical decision framework for multi-document workloads:

| Document volume | Strategy | Rationale |
|---|---|---|
| < 50K tokens | Full context (any model) | All three models are reliable below 50K; full context is simpler |
| 50K – 120K tokens | Full context with GPT-5.5, Opus 4.7, or Gemini; test empirically | Middle ground: all three models handle this range; GPT-5.5 shows good middle-position stability |
| 120K – 500K tokens | Opus 4.7 full context OR RAG pipeline | Within Opus 4.7's synthesis effective limit (~500K); for multi-hop tasks above 300K, structured RAG may outperform Gemini |
| 500K – 800K tokens | Gemini 3.1 Pro for retrieval; chunked Opus 4.7 for synthesis | Both approach or exceed synthesis effective limits; chunking reduces context depth |
| > 800K tokens | RAG pipeline + any model | Beyond all models' reliable retrieval limits; RAG is the right tool |

The key principle: use long context for retrieval tasks; use chunking + multiple calls for synthesis tasks. These are different operations with different reliability profiles.

The RAG vs. long-context tradeoff quantified

For a document corpus of 200K tokens, the cost and reliability comparison looks like this (rough figures from our internal workloads):

| Approach | Inference cost | Retrieval accuracy | Synthesis accuracy |
|---|---|---|---|
| Gemini 3.1 Pro, full context | $$$ (200K input tokens) | 94% | 81% |
| Opus 4.7, full context | $$ (200K input tokens) | 91% | 88% |
| RAG (top-k=5, good embeddings) + Opus 4.7 | $ (≈10K tokens retrieved) | 87% (limited by retrieval step) | 92% |
| RAG + GPT-5.5 | $ | 87% | 89% |

The RAG approaches are 10–20× cheaper. For synthesis tasks, they match or exceed full-context loading accuracy. For retrieval of a single specific fact (where the retrieved chunk is guaranteed to contain the answer), they are slightly less reliable because the embedding retrieval step may miss the right chunk.

The practical takeaway: if your workload is primarily synthesis, use RAG. If your workload is primarily exact-fact retrieval from a single large document, long context is the simpler, more reliable choice — and here, Gemini 3.1 Pro has a genuine advantage.
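To make the RAG rows in the table concrete, here is a minimal sketch of the retrieve-then-synthesize pattern: chunk the corpus, embed, keep the top-k chunks by cosine similarity, and hand only those to the synthesis model. The hashed bag-of-words embed() is a stand-in so the sketch runs without external services; in practice you would swap in a real embedding model, and top_k=5 matches the configuration in the table.

```python
import numpy as np


def embed(texts: list[str], dim: int = 256) -> np.ndarray:
    """Stand-in embedding: hashed bag-of-words. Swap for a real embedding API in practice."""
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for word in text.lower().split():
            vecs[i, hash(word) % dim] += 1.0
    return vecs


def top_k_chunks(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Rank chunks by cosine similarity to the question and keep the top k."""
    vectors = embed(chunks + [question])
    chunk_vecs, q_vec = vectors[:-1], vectors[-1]
    norms = np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9
    sims = chunk_vecs @ q_vec / norms
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]


# Example: retrieve liability-related chunks, then send only those (plus the question)
# to the synthesis model — roughly 10K tokens of context instead of the full corpus.
chunks = [
    "Clause 12: liability is capped at $2M ...",
    "Clause 3: payment terms ...",
    "Clause 19: indemnification ...",
]
print(top_k_chunks("What is the company's maximum liability exposure?", chunks, k=2))
```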


Hands-on exercise

Run a needle-in-haystack test at three depth levels on a document from your own use case.

  1. Choose a document or document set from your production context. Prepare versions at three sizes: ~50K tokens, ~200K tokens, and as large as your target depth (up to 500K if relevant to your use case).
  2. Insert 3 unique "needles" into each version:
     - Needle A: near the start (5–10% depth)
     - Needle B: in the middle (45–55% depth)
     - Needle C: near the end (85–95% depth)
  3. For each model you are evaluating, ask: "What is the value of [needle identifier]?" Run each retrieval ≥3 times.
  4. Record a 3×3 accuracy grid (3 depths × 3 positions). Note which positions and depths produce failures.
  5. Run at least one multi-hop synthesis task: a question that requires combining facts from Needles A and C. Record whether the model correctly synthesizes both.
Verification: You have completed this exercise when:

- A 3×3 retrieval accuracy grid is filled for ≥1 model
- The retrieval effective limit (depth where accuracy first drops below 90%) is estimated
- The multi-hop synthesis task result is recorded (pass or fail)

Estimated time: 25 minutes

<KnowledgeCheck question="A legal tech team is building a contract analysis tool. Input documents are 50–300-page contracts (~40K–250K tokens). The primary task is: 'Find all clauses related to liability and summarize the company's maximum exposure across all clauses.' This requires multi-hop synthesis across 3–8 scattered clauses. Given the analysis in this chapter, which approach is most likely to give the best results for documents in the 400K–700K token range?" options={[ "Gemini 3.1 Pro full context — its 1M window means it handles 250K easily", "Opus 4.7 full context — its 1M window covers any contract and it outperforms on synthesis", "RAG pipeline (chunk by clause, embed, retrieve top-k liability clauses) + Opus 4.7 for synthesis", "GPT-5.5 full context — its middle-position stability makes it the best at finding buried clauses" ]} correctIdx={2} explanation="For a multi-hop synthesis task at 400K–700K tokens, RAG + Opus 4.7 is the optimal approach. Here's why: (1) The task is synthesis, not single-fact retrieval — full context degrades faster on synthesis. (2) The relevant 'needles' (liability clauses) are likely identifiable by a good embedding model, making RAG retrieval accurate. (3) RAG + Opus 4.7 reduces context depth below Opus's synthesis effective limit (~500K), maximizing synthesis quality. (4) Chunking by clause and retrieving the top 10–15 liability-relevant clauses gives Opus 4.7 a short, high-quality context to reason over — playing to its strengths. Gemini's full-context approach would be cheaper to implement but risks synthesis failures at depth." />

<KnowledgeCheck question="You ran a needle-in-haystack test on Gemini 3.1 Pro with a 500K token document and found 97% retrieval accuracy for a single fact buried at 50% depth. Your manager concludes: 'Gemini can reliably handle 500K context.' In 2–3 sentences, explain what this finding does and does not prove." options={["self-check"]} correctIdx={0} explanation="The 97% single-fact retrieval accuracy at 500K tokens proves that Gemini 3.1 Pro can reliably find a specific, unique fact when you ask directly for it — this is the model's retrieval capability working as advertised. What it does NOT prove: (1) Multi-hop synthesis accuracy at 500K — combining multiple facts from across the document is a different, harder task where accuracy degrades significantly more quickly. (2) Reasoning quality — even when the fact is retrieved, the subsequent reasoning step may be lower quality at high depth. (3) Robustness — 97% accuracy across 3 runs is not the same as 97% across 30 or 300 runs. The manager's conclusion overgeneralizes from a retrieval result to overall 'reliability', conflating two different capabilities." />


What's next

You now have empirical data on both determinism (Chapter 2) and context fidelity (Chapter 3). Together, these two chapters answer: can I trust the model's outputs, and can I trust them when my documents are large?

The final question is: what does reliable output actually cost? In Chapter 4, you'll build a cost-per-task model that accounts for retry rates, context caching, and tool-call overhead — and discover why the cheapest model on the pricing page is often not the cheapest model in your bill.


References cited

[1]: Liu, N. F. et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the Association for Computational Linguistics, 12. https://arxiv.org/abs/2307.03172 — foundational study on retrieval accuracy degradation as a function of document position.

[2]: Anthropic. Claude Opus 4.7 model card and release notes. https://www.anthropic.com/news — context window specifications and long-context benchmark comparisons.

[3]: Google DeepMind. "Gemini 3.1 Pro release and changelog." https://ai.google.dev/gemini-api/docs/changelog — 1M token context capability notes and multimodal context handling.

[4]: OpenAI. "GPT-5.5 release notes." https://help.openai.com/en/articles/9624314-model-release-notes — 128K context window specifications and retrieval accuracy claims.

[5]: Hsieh, C.-Y. et al. (2024). "RULER: What's the Real Context Size of Your Long-Context Language Models?" https://arxiv.org/abs/2404.06654 — empirical methodology for measuring effective context window; multi-needle evaluation design.

[6]: Bai, Y. et al. (2024). "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks." https://arxiv.org/abs/2412.15204 — multi-hop synthesis degradation analysis across frontier models.

  • Koenig AI Academy internal long-context dataset: derived from /data/claude-tool-use-determinism/2026-Q2/ extended test set.

Chapter 4 · 50 min

Cost-per-task — pricing vs. actual bill on real workloads

▶ Listen (audio)

> Prerequisites: Chapter 1 required; Chapters 2 and 3 recommended for the best practical grounding. You should have token counts from at least one benchmark run.
>
> Time: 50 minutes
>
> Learning objectives: By the end of this chapter, you can calculate a defensible cost-per-task number for your workload, account for retries and caching, and know when the "cheaper" model is actually more expensive.

Cost-per-task is the total monetary cost to complete one end-to-end unit of a production AI workload — including all input tokens, output tokens, tool-call overhead, retries from failed or malformed outputs, and prompt cache misses. It is distinct from the $/M token pricing listed on vendor pricing pages, which measures the raw cost of tokens in isolation and ignores the factors that dominate real bills: retry rates driven by output instability, context caching economics, and the hidden token amplification from multi-step tool use. As of Q2 2026, the per-token pricing landscape is: OpenAI GPT-5.5 is the most expensive per token; Google Gemini 3.1 Pro is the cheapest; Claude Opus 4.7 sits in the middle. But across real tool-use workloads, the cost ordering by cost-per-task is often the opposite of the cost ordering by $/M token. This chapter shows why.

Key facts

  • Opus 4.7 list pricing (Q2 2026): $5/M input tokens, $25/M output tokens. Prompt cache write: $6.25/M; cache read: $0.50/M (90% discount vs. uncached input). [1]
  • GPT-5.5 list pricing: $10/M input tokens, $40/M output tokens. Cached input: $5/M (50% discount). [2]
  • Gemini 3.1 Pro list pricing: $2.00/M input tokens, $12.00/M output tokens. Context caching (via Google Cloud): $0.20/M cached tokens. [3]
  • On a simple prompt with no retries, Gemini 3.1 Pro is 2.5× cheaper per input token and ~2× cheaper per output token than Opus 4.7. This is the number that appears in comparison articles.
  • In our 10×3×5 benchmark, Gemini 3.1 Pro's average determinism was 81.9% versus Opus 4.7's 91.4%. At 5-step pipelines, that translates to a 2× difference in pipeline success rate (31% vs. 61%) — each failed run requiring a full retry.
  • A failed pipeline run at Gemini pricing ($2/M input) still costs real money: retries are not free. When you factor retry rates into the cost model, Gemini 3.1 Pro's effective cost-per-successful-task is significantly higher than its per-token price implies.
  • The biggest hidden cost is prompt cache misses. A typical agentic system sends the same large system prompt on every call. Without caching, you pay the full input price on every turn. With caching, repeated tokens cost 10% of the input price on Anthropic, 50% on OpenAI, and 10% on Google (via explicitly created context caches). This difference dominates cost for multi-turn systems.
  • Tool-call tokens are not free. Each tool definition included in the API call is tokenized and billed as input tokens. A system with 10 tool definitions (~600 tokens of schema) adds about $0.003 per call at Opus pricing: small per call, but $3 per 1,000 calls, and it compounds at scale.

Why pricing pages are misleading

The standard model comparison table presents:

| Model | Input $/M | Output $/M |
|---|---|---|
| Opus 4.7 | $5.00 | $25.00 |
| GPT-5.5 | $10.00 | $40.00 |
| Gemini 3.1 Pro | $2.00 | $12.00 |

This table is accurate. It is also nearly useless for production cost planning, because it omits:

  1. Retry rate: how often does a failed/malformed output require a retry call?
  2. Prompt caching hit rate: what fraction of input tokens are cached vs. billed at full price?
  3. Tool-call token overhead: how many tokens are consumed by tool definitions in every call?
  4. Output amplification: multi-step pipelines generate output at each step that becomes input at the next. The output/input token ratio compounds.
  5. Context window efficiency: at high context depths, some models produce lower-quality outputs that require verification calls, adding latency and cost.

A real cost model accounts for all of these. The simplified formula:

```
cost_per_task = (   prompt_tokens_uncached × input_price
                  + prompt_tokens_cached   × cache_price
                  + output_tokens          × output_price
                  + tool_tokens            × input_price )
                × (1 / determinism_rate) ^ pipeline_steps
```

The last factor, (1 / determinism_rate)^pipeline_steps, is the retry multiplier: the expected number of full pipeline runs needed to get one success. The token counts inside the parentheses are totals for one full pipeline run (summed across all steps). The retry multiplier is the single biggest source of divergence between pricing-page cost and actual bill.
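Here is the same formula as a minimal Python sketch. The prices are the Q2 2026 list prices above, and the two example calls reproduce the 3-step category-5 worked example later in this chapter (small differences come from rounding the per-run cost):

```python
def cost_per_task(
    uncached_input, cached_input, output_tokens, tool_tokens,  # tokens per full pipeline run
    input_price, cache_price, output_price,                    # $/M-token list prices
    determinism, pipeline_steps,                                # per-step determinism, run length
):
    per_run = (
        uncached_input * input_price
        + cached_input * cache_price
        + output_tokens * output_price
        + tool_tokens * input_price
    ) / 1_000_000
    retry_multiplier = (1 / determinism) ** pipeline_steps      # expected runs per success
    return per_run * retry_multiplier

# 3-step category-5 example: 3 × 3,000 input tokens (tool tokens folded in), 3 × 400 output tokens.
opus   = cost_per_task(9_000, 0, 1_200, 0, 5.0, 0.50, 25.0, 0.94, 3)   # ≈ $0.090
gemini = cost_per_task(9_000, 0, 1_200, 0, 2.0, 0.20, 12.0, 0.84, 3)   # ≈ $0.055
```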


The retry multiplier in practice

Let's run the math for a representative 3-step tool-use pipeline:

  • System prompt: 2,000 tokens
  • User message: 200 tokens
  • Tool definitions: 800 tokens
  • Output per step: 400 tokens
  • Steps: 3

Without caching, no retries (baseline):

| Model | Per-step input cost | Per-step output cost | 3-step total |
|---|---|---|---|
| Opus 4.7 | 3,000 tokens × $5/M = $0.015 | 400 × $25/M = $0.010 | $0.075 |
| GPT-5.5 | 3,000 × $10/M = $0.030 | 400 × $40/M = $0.016 | $0.138 |
| Gemini 3.1 Pro | 3,000 × $2/M = $0.006 | 400 × $12/M = $0.0048 | $0.032 |

Gemini is 2.3× cheaper than Opus with no retries. This is the number in the comparison article.

Now apply determinism-driven retries from our benchmark data:

For a 3-step pipeline at category-5 complexity (multi-tool sequence), the determinism scores were: Opus 94%, GPT-5.5 90%, Gemini 84%.

Pipeline success probability: Opus 0.94³ = 83%, GPT-5.5 0.90³ = 73%, Gemini 0.84³ = 59%.

Expected calls to complete one successful pipeline run = 1 / success_probability:

| Model | Per-run cost (no retry) | Expected runs to success | Cost-per-successful-task |
|---|---|---|---|
| Opus 4.7 | $0.075 | 1.20 | $0.090 |
| GPT-5.5 | $0.138 | 1.37 | $0.189 |
| Gemini 3.1 Pro | $0.032 | 1.69 | $0.054 |

Gemini is still cheapest — but the ratio has compressed from 2.3× to 1.7×. And this is at category-5 complexity. At category-9 (ambiguous input), where Gemini's determinism drops to 64%:

| Model | Determinism (cat-9) | 3-step success | Calls to success | Cost-per-task |
|---|---|---|---|---|
| Opus 4.7 | 78% | 47% | 2.1 | $0.158 |
| GPT-5.5 | 74% | 41% | 2.4 | $0.331 |
| Gemini 3.1 Pro | 64% | 26% | 3.8 | $0.122 |

At ambiguous-input prompts, you need 3.8 Gemini calls to get one successful pipeline completion — and each retry potentially compounds errors (some retries don't fail cleanly; they produce partial outputs that corrupt the pipeline state). The real cost is even higher than the formula predicts once you add retry-handling logic and partial-failure recovery.

When the math flips: 10-step pipelines at ambiguous-input complexity

The retry multiplier grows exponentially with pipeline length: it scales as 1 / determinism^n. A 14-point determinism gap (Opus 78% vs. Gemini 64%) is small at 3 steps — it produces a 1.8× difference in expected call count. At 10 steps, the same gap produces a 7.2× difference. That exponential behavior is why pricing pages are structurally incapable of predicting your actual bill.

Here is the full calculation for a 10-step pipeline at category-9 complexity (ambiguous-input, multi-tool sequence). Because each step's output joins the context for the next step, input tokens grow with each step. The profile below assumes 3,000 base tokens (system + user + tools) and 400 tokens of accumulated output carried forward per step:

| Step | Accumulated input tokens |
|---|---|
| Step 1 | 3,000 |
| Step 2 | 3,400 |
| Step 3 | 3,800 |
| … | … |
| Step 10 | 6,600 |

Total input across all 10 steps: 48,000 tokens. Total output: 4,000 tokens.
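A quick sanity check of those totals (the 3,000/400 figures are the assumed profile above, not measured values):

```python
# Assumed profile: 3,000 base tokens per step, plus 400 tokens of prior output
# carried forward at each subsequent step.
BASE_TOKENS, OUTPUT_PER_STEP, STEPS = 3_000, 400, 10

step_inputs = [BASE_TOKENS + OUTPUT_PER_STEP * i for i in range(STEPS)]
print(step_inputs)               # [3000, 3400, 3800, ..., 6600]
print(sum(step_inputs))          # 48,000 total input tokens
print(OUTPUT_PER_STEP * STEPS)   # 4,000 total output tokens
```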

Per-run cost (10 steps, no retries yet):

| Model | Input cost | Output cost | Per-run total |
|---|---|---|---|
| Opus 4.7 | 48,000 × $5/M = $0.240 | 4,000 × $25/M = $0.100 | $0.340 |
| GPT-5.5 | 48,000 × $10/M = $0.480 | 4,000 × $40/M = $0.160 | $0.640 |
| Gemini 3.1 Pro | 48,000 × $2/M = $0.096 | 4,000 × $12/M = $0.048 | $0.144 |

Gemini is still 2.4× cheaper per run. Now apply determinism:

10-step pipeline success at category-9 (ambiguous-input) complexity:

| Model | Per-step determinism | 10-step success (det^10) | Expected runs to success | Cost-per-successful-task |
|---|---|---|---|---|
| Opus 4.7 | 78% | 0.78¹⁰ = 8.3% | 12.0 | $4.08 |
| GPT-5.5 | 74% | 0.74¹⁰ = 4.9% | 20.3 | $13.00 |
| Gemini 3.1 Pro | 64% | 0.64¹⁰ = 1.15% | 86.7 | $12.48 |

The cost ordering has inverted. Gemini, the model with list pricing 2.5× below Opus, costs roughly 3× as much per successful task as Opus once the pipeline is long enough and the input is ambiguous. GPT-5.5 and Gemini are nearly tied, and the cheapest model per token is now among the most expensive per task.

The break-even for this pipeline profile falls between 4 and 5 steps. At 4 steps, Gemini ($0.286/task) and Opus ($0.302/task) are nearly equal; from 5 steps onward, Opus wins on cost-per-task. This is not a corner case: any multi-agent coding or reasoning system with error handling, tool selection, and planning stages will routinely hit 5–10 action steps per task.

<Callout type="hot"> The inversion is real, and it has a break-even you can calculate. At category-9 complexity (ambiguous-input, multi-tool), Gemini 3.1 Pro crosses above Opus 4.7 in cost-per-task at pipeline length ≥ 5 steps. If your agentic system has 5+ action steps on hard inputs — and most production coding agents do — the pricing page comparison is actively misleading. Run your determinism scores through the retry multiplier before making a cost decision. </Callout>
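To find the break-even for your own profile, a sweep like the sketch below is enough. It uses the same context-accumulation profile and the category-9 determinism scores from the tables above; swap in your own prices and benchmark numbers:

```python
# Sweep pipeline length to find the cost-per-task break-even between two models.
# Profile: 3,000 base input tokens per step, 400 output tokens carried forward per step.
BASE_TOKENS, OUTPUT_PER_STEP = 3_000, 400

def task_cost(steps, input_price, output_price, determinism):
    input_tokens = sum(BASE_TOKENS + OUTPUT_PER_STEP * i for i in range(steps))
    output_tokens = OUTPUT_PER_STEP * steps
    per_run = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_run / determinism**steps      # expected cost per successful task

for steps in range(1, 11):
    opus = task_cost(steps, 5.0, 25.0, 0.78)       # category-9 determinism, Opus 4.7
    gemini = task_cost(steps, 2.0, 12.0, 0.64)     # category-9 determinism, Gemini 3.1 Pro
    cheaper = "Gemini" if gemini < opus else "Opus"
    print(f"{steps:2d} steps  Opus ${opus:7.3f}  Gemini ${gemini:7.3f}  cheaper: {cheaper}")
```

For this profile the printout flips from Gemini to Opus at 5 steps, matching the break-even described above.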


Prompt caching: the underrated cost lever

Prompt caching is the most impactful cost optimization most builders aren't fully using.

The economics: if you have a 10,000-token system prompt (common in agentic systems with tool definitions and long context instructions), and your system makes 10,000 calls per day:

| Model | Without caching | With caching | Daily savings |
|---|---|---|---|
| Opus 4.7 | 10K × $5/M = $0.05/call × 10K calls = $500/day | 10K × $0.50/M = $0.005/call × 10K calls = $50/day | $450/day |
| GPT-5.5 | 10K × $10/M = $0.10/call × 10K calls = $1,000/day | 10K × $5/M = $0.05/call × 10K calls = $500/day | $500/day |
| Gemini 3.1 Pro | 10K × $2/M = $0.02/call × 10K calls = $200/day | 10K × $0.20/M = $0.002/call × 10K calls = $20/day | $180/day |

Anthropic's caching gives a 90% discount on cached tokens — matching Google's 90% discount on Gemini context caching, and significantly better than OpenAI's 50%. A Gemini-vs-Opus comparison without caching shows a 2.5× price advantage. A comparison with caching and a 10K-token system prompt shows the same 2.5× ratio — both platforms discount cached tokens by 90%, so the relative cost is unchanged by caching alone. [1]
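The table's arithmetic, as a sketch that ignores cache-write premiums and cache misses (both of which trim the savings somewhat):

```python
# Daily cost of re-sending a 10,000-token system prompt across 10,000 calls/day,
# with and without prompt caching, at the Q2 2026 list prices used in this chapter.
PROMPT_TOKENS, CALLS_PER_DAY = 10_000, 10_000

prices = {  # model: ($/M uncached input, $/M cached input)
    "Opus 4.7":       (5.0, 0.50),
    "GPT-5.5":        (10.0, 5.0),
    "Gemini 3.1 Pro": (2.0, 0.20),
}

for model, (full_price, cached_price) in prices.items():
    uncached_daily = PROMPT_TOKENS * full_price / 1e6 * CALLS_PER_DAY
    cached_daily = PROMPT_TOKENS * cached_price / 1e6 * CALLS_PER_DAY
    print(f"{model:15s} ${uncached_daily:6.0f}/day uncached  "
          f"${cached_daily:5.0f}/day cached  saves ${uncached_daily - cached_daily:.0f}/day")
```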

Caching gotchas

Each platform has rules that break caching in non-obvious ways:

Anthropic (Opus 4.7) (see the request sketch after this list):
  • Cache TTL: 5 minutes. Calls more than 5 minutes apart on the same prompt restart the cache, so for batch workloads with irregular timing the cache hit rate can be much lower than expected.
  • Minimum cacheable length: 4,096 tokens for Opus 4.7 (1,024 tokens for older models such as Sonnet 4.5 and Opus 4.1). Short system prompts don't qualify.
  • Cache is per-user/session: if you're building a multi-tenant system, you need to architect for per-tenant cache keys.

OpenAI (GPT-5.5):
  • Cache minimum: 1,024 tokens, with cache hits counted in 128-token increments. Still more permissive than Opus 4.7's 4,096-token floor.
  • Cache discount: 50% (vs. Anthropic's 90%). Meaningful, but less impactful.
  • Caching applies automatically to the prompt prefix; there is no explicit cache-control API.

Google (Gemini 3.1 Pro):
  • Context caching requires explicit cache creation via the API; it is not automatic.
  • Cached contexts have a configurable TTL and must be managed explicitly. This is more work to implement but gives you more control.
  • The separate caching price ($0.20/M vs. $2/M uncached input, a 90% discount) is competitive for large, stable system prompts.
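On Anthropic's platform the cache breakpoint is opt-in per content block. A minimal sketch using the Python SDK's cache_control marker; the model id is the hypothetical one used throughout this course, and exact parameter names may vary by SDK version:

```python
import anthropic

client = anthropic.Anthropic()

# Placeholder: your large, stable agent instructions + tool documentation.
# It must exceed the model's minimum cacheable length to be cached at all.
LONG_SYSTEM_PROMPT = "..."

# The cache_control marker sets a cache breakpoint: tokens up to it are written
# to the cache on the first call and read at the discounted rate on later calls
# that arrive within the TTL.
response = client.messages.create(
    model="claude-opus-4-7",          # hypothetical model id used in this course
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the open incidents."}],
)
print(response.usage)                 # cache-read token counts confirm cache hits
```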


The three workload archetypes, costed

Applying the full cost model to the three archetypes from Chapter 1:

Archetype A: Coding agent (multi-step, tool-heavy)

Representative call profile:
  • System prompt: 8,000 tokens (tool definitions + instructions), cached after the first call
  • Average input per turn: 3,000 tokens (code context)
  • Average output: 800 tokens (code + reasoning)
  • Steps per task: 5
  • Determinism requirement: category 5–7 schemas

| Model | Per-step determinism (5-step success) | Cost per successful task (with caching) |
|---|---|---|
| Opus 4.7 | ~86% (0.86⁵ = 47%) | ~$0.42 |
| GPT-5.5 + strict | ~93% (0.93⁵ = 70%) | ~$0.73 |
| Gemini 3.1 Pro | ~79% (0.79⁵ = 31%) | ~$0.28 |

Recommendation: Gemini 3.1 Pro is cheapest per successful task at $0.28, but the 31% pipeline success rate demands robust retry infrastructure. Opus 4.7 at $0.42 offers a reasonable cost/reliability balance with 47% success. GPT-5.5 with strict: true delivers the best pipeline success (70%) but costs 74% more than Opus. Choose GPT-5.5 if reliability is non-negotiable; Opus for a balanced default; Gemini only if you have retry infrastructure in place.
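The archetype-A numbers can be reproduced with a few lines. A sketch, assuming the 8,000-token system prompt is always billed at each provider's cache-read rate and using the approximate per-step determinism scores from the table:

```python
# Archetype A: 8,000-token cached system prompt, 3,000 fresh input tokens,
# 800 output tokens per step, 5 pipeline steps.
def coding_agent_cost(input_price, cache_price, output_price, determinism, steps=5):
    per_step = (8_000 * cache_price + 3_000 * input_price + 800 * output_price) / 1e6
    per_run = per_step * steps
    return per_run / determinism**steps      # expected cost per successful task

print(coding_agent_cost(5.0, 0.50, 25.0, 0.86))    # Opus 4.7             ≈ $0.41
print(coding_agent_cost(10.0, 5.0, 40.0, 0.93))    # GPT-5.5 with strict  ≈ $0.73
print(coding_agent_cost(2.0, 0.20, 12.0, 0.79))    # Gemini 3.1 Pro       ≈ $0.28
```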

Archetype B: Document Q&A (long-context, single query)

Representative call profile:
  • Document: 80K tokens (one call, no caching)
  • System prompt: 500 tokens
  • Output: 600 tokens
  • Steps: 1 (no pipeline)

| Model | Cost per call | Notes |
|---|---|---|
| Opus 4.7 | $0.415 | 80K × $5/M + 600 × $25/M |
| GPT-5.5 | $0.824 | 80K × $10/M + 600 × $40/M |
| Gemini 3.1 Pro | $0.167 | 80K × $2/M + 600 × $12/M |

For single-query long-context Q&A with no pipeline and no retries, Gemini 3.1 Pro's cost advantage is largest here (2.5× cheaper than Opus). The single-step nature means determinism variance doesn't compound. If retrieval accuracy (not synthesis) is the primary task, Gemini's combination of cheapness + large context window wins clearly.

Archetype C: High-volume classification (batch, 10M items/month)

Representative call profile:
  • Input: 300 tokens per item
  • System prompt: 1,000 tokens, identical for all items (below the Anthropic and OpenAI cache minimums, so the costs below assume no caching)
  • Output: 50 tokens
  • Steps: 1, structured output required

At 10M items/month:

| Model | Monthly cost (no retries) | With 5% retry rate |
|---|---|---|
| Opus 4.7 | ~$77K/month | ~$81K |
| GPT-5.5 | ~$151K/month | ~$160K |
| Gemini 3.1 Pro | ~$32K/month | ~$34K |

For classification at this scale, Gemini 3.1 Pro wins decisively — saving $45K/month vs. Opus 4.7. The structured-output task (12-category classification) uses a simple flat schema (category 1–2 in our benchmark), where Gemini's determinism is 96–100% — effectively eliminating the retry-rate advantage of more expensive models.
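The monthly figures follow from a one-line per-item cost. A sketch, costed without caching, consistent with the table above:

```python
# Archetype C: 10M items/month, 1,000-token system prompt + 300-token item input,
# 50 output tokens, single step.
ITEMS_PER_MONTH = 10_000_000
SYSTEM_TOKENS, ITEM_TOKENS, OUTPUT_TOKENS = 1_000, 300, 50

def monthly_cost(input_price, output_price, retry_rate=0.0):
    per_item = ((SYSTEM_TOKENS + ITEM_TOKENS) * input_price
                + OUTPUT_TOKENS * output_price) / 1e6
    return per_item * ITEMS_PER_MONTH * (1 + retry_rate)

print(monthly_cost(5.0, 25.0))           # Opus 4.7        ≈ $77,500/month
print(monthly_cost(10.0, 40.0))          # GPT-5.5         ≈ $150,000/month
print(monthly_cost(2.0, 12.0, 0.05))     # Gemini 3.1 Pro  ≈ $33,600/month with 5% retries
```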

The unified lesson: the right model depends on your archetype. Gemini for simple-schema, high-volume, or long-context retrieval. GPT-5.5 with strict schema for complex tool-use pipelines. Opus 4.7 for use cases where determinism on complex schemas is non-negotiable and retry cost is prohibitive.


Hands-on exercise

Build a cost-per-task model for your use case using your Chapter 2 benchmark data.

Use this spreadsheet template (fill in your numbers):

```
=== COST MODEL WORKSHEET ===

USE CASE: [describe in 1 sentence]

TOKEN COUNTS (from your Chapter 2 benchmark run):
  System prompt tokens: ___
  Average user message tokens: ___
  Tool definition tokens: ___
  Average output tokens: ___
  Pipeline steps: ___

CACHING:
  Does the system prompt meet the provider's cache minimum
  (4,096 tokens for Opus 4.7; 1,024 tokens for GPT-5.5)? [Y/N]
  Estimated cache hit rate (% of calls): ___ %
  (For Anthropic: use 80% if calls are within 5-minute windows; 40% if irregular)

DETERMINISM SCORES (from your Chapter 2 run):
  Model A (Opus 4.7): ___ %
  Model B (GPT-5.5): ___ %
  Model C (Gemini 3.1 Pro): ___ %

COST FORMULA (per model):
  input_cost  = (system_prompt × (1 - cache_hit_rate) × INPUT_PRICE)
              + (system_prompt × cache_hit_rate × CACHE_PRICE)
              + (message_tokens + tool_tokens) × INPUT_PRICE
  output_cost = output_tokens × OUTPUT_PRICE
  retry_multiplier = 1 / (determinism ^ pipeline_steps)
  cost_per_task = (input_cost + output_cost) × pipeline_steps × retry_multiplier

RESULTS:
  Opus 4.7 cost-per-task:       $___
  GPT-5.5 cost-per-task:        $___
  Gemini 3.1 Pro cost-per-task: $___

RECOMMENDATION: [which model and why, in 1 sentence]
```

Verification: Your cost model is complete when:
  • All token counts come from actual benchmark runs (not guesses)
  • Cache hit rate reflects your actual call pattern
  • Cost-per-task accounts for retries using your measured determinism scores
  • You can state which model wins on cost-per-task and by what margin

Estimated time: 20 minutes

▶ Try this · claude-sonnet-4-6

I'm building a cost model for an AI customer support system. Here are my numbers: system prompt = 3,000 tokens (same for all calls, high cache hit rate ~85%), average customer message = 400 tokens, to…

Show expected output
The model should walk through the calculation for each of the three providers, applying the cache hit rate to the system prompt, then computing per-step cost, then applying the retry multiplier (1 / determinism^2). Expected output: Opus cost-per-task ≈ $0.025–0.035; GPT-5.5 ≈ $0.130–0.145; Gemini ≈ $0.030–0.040. Opus and Gemini are the closest pair because both platforms apply a 90% cache discount, while GPT-5.5's 50% discount is less effective. The retry multiplier slightly closes the Opus–Gemini gap (Gemini has lower determinism). This is the core illustration of why pricing pages mislead: GPT-5.5 appears competitive on the list price table but costs 2–3× more than Opus once caching and retries are modelled.
▶ Try this · claude-sonnet-4-6

My team is debating whether to switch from Gemini 3.1 Pro to Opus 4.7 for a high-volume document classification pipeline. We process 500,000 documents per day. Each document averages 800 tokens of inp…

Show expected output
The model should compute: Gemini monthly cost (with 90% cache, 500K docs/day × 30 days = 15M docs/month, ~800 tokens each + 2K system prompt) ≈ $53K/month. Opus 4.7 monthly cost at the same volume ≈ $126K/month. At these determinism levels (92% vs 97%), the retry multiplier difference is small (1.087 vs 1.031) — less than 6%. The monthly cost of Opus 4.7 vs Gemini 3.1 Pro at this volume is dramatically different (Opus is ~2.5× more expensive in input tokens, with the same 90% cache discount on both). The recommendation should be: do not switch. The 5-point determinism improvement does not justify a $73K/month increase when retry logic is already handling failures and neither model is causing production SLA breaches.

<KnowledgeCheck question="A startup is choosing between Gemini 3.1 Pro ($2/M input) and Opus 4.7 ($5/M input) for a 4-step agentic coding pipeline. Their benchmark shows Gemini determinism = 82% and Opus determinism = 91% on their prompt types. Which statement is true about the expected cost-per-task comparison?" options={[ "Gemini is always cheaper because its per-token price is 2.5× lower, regardless of determinism", "Opus is cheaper because its higher determinism means fewer retries, more than offsetting the higher price", "The expected number of Gemini runs to complete one task is ~1.5×, narrowing but not eliminating its cost advantage", "Determinism doesn't affect cost because retries use only output tokens, which are the same fraction of total cost" ]} correctIdx={2} explanation="4-step pipeline success: Gemini 0.82⁴ = 45%, so expected runs = 1/0.45 ≈ 2.2. Opus 0.91⁴ = 68%, expected runs = 1/0.68 ≈ 1.47. Gemini needs ~1.5× more runs than Opus per successful task. That partially offsets Gemini's 2.5× per-token input price advantage. The cost-per-task ratio compresses from ~2.3× (pricing page, blended input+output) to roughly ~1.5×. Gemini is still cheaper — but by a meaningfully smaller margin than the pricing page implies. Option A is wrong (determinism clearly affects cost). Option B is wrong (the math shows Gemini is still cheaper per task despite more retries). Option D is wrong (retries require re-sending the full input, not just output tokens)." />

<KnowledgeCheck question="After building your cost model, you find that Opus 4.7 costs $0.42/task and Gemini 3.1 Pro costs $0.28/task for your coding agent workload. Your company processes 50,000 tasks/month. A teammate argues: 'We should use Gemini — we save $7,000/month.' You notice that your Chapter 2 benchmark showed Gemini's pipeline success rate is 31% vs. Opus's 61%. Write 2–3 sentences evaluating the teammate's argument, including any cost factor they may have omitted." options={["self-check"]} correctIdx={0} explanation="The teammate's calculation is directionally correct on raw inference cost but omits the engineering cost of handling a 69% pipeline failure rate. At 31% pipeline success (Gemini), 34,500 of 50,000 monthly tasks fail at least once — each requiring retry logic, error handling, partial-state recovery, and possibly human review. The engineering cost of building and maintaining that infrastructure, plus the latency cost to users waiting on retries, should be quantified before accepting the $7,000/month savings. A more complete comparison would factor in: developer time to build retry/recovery (~20–40 engineering hours = $3,000–6,000 in team cost), user-facing latency increase on retries, and on-call burden from elevated failure rates. The teammate's conclusion may still be right — but the decision requires a total cost of ownership calculation, not just an inference cost comparison." />


What's next

You have now completed all four analytical chapters. You have:
  • A scorecard weighted for your use case (Chapter 1)
  • Empirical determinism scores for your prompts (Chapter 2)
  • Context fidelity data at your target document depth (Chapter 3)
  • A cost-per-task model with retry rates and caching (Chapter 4)

The capstone project synthesizes all four into a model selection memo — a 500–800 word document your engineering manager could read and act on. The memo format is in vault/courses/picking-a-frontier-model-2026-q2/outline.md.

For further reading on how these models perform on specific workloads, see blogs · opus-4-7-long-running-coding-benchmark and blogs · gpt-5-5-in-codex in the Academy vault.


References cited

[1]: Anthropic. "Claude pricing." https://www.anthropic.com/pricing — Opus 4.7 input/output/cache pricing as of Q2 2026. Also: "Prompt caching." https://www.anthropic.com/news.

[2]: OpenAI. "OpenAI API pricing." https://openai.com/pricing — GPT-5.5 input/output/cached input pricing as of Q2 2026. Model release notes: https://help.openai.com/en/articles/9624314-model-release-notes.

[3]: Google. "Gemini API pricing." https://ai.google.dev/pricing — Gemini 3.1 Pro input/output/context caching pricing as of Q2 2026. Changelog: https://ai.google.dev/gemini-api/docs/changelog.

[4]: Koenig AI Academy internal cost model data, Q2 2026. Derived from 10×3×5 benchmark dataset (/data/claude-tool-use-determinism/2026-Q2/) with retry simulation applied at workload scale.

