
Picking a Frontier Model: Opus 4.7 vs GPT-5.5 vs Gemini 3.1 Pro — A Builder's Benchmark Guide

Software engineers and AI builders evaluating Anthropic, OpenAI, or Google for a production AI system. They have shipped at least one AI-powered feature and have used an LLM API in production. They are NOT AI researchers — they need to ship something reliable and affordable, not win a leaderboard.

What you'll learn
  • Run a structured determinism benchmark (10×3×5 design) against any three frontier models
  • Measure long-context degradation on your own documents at 50K, 200K, and 500K+ tokens
  • Calculate cost-per-task (not cost-per-token) for real production workloads
  • Evaluate governance and specialized access programs (Trusted Access for Cyber) for secure production deployments
  • Produce a defensible, documented model-selection memo for your use case
Chapters in this course
  1. How to choose frontier model evaluation dimensions for production workloads (40 min)
  2. Tool-use determinism — our 10×3×5 benchmark (60 min)
  3. Long-context behavior — effective vs. advertised context windows (50 min)
  4. Cost-per-task — pricing vs. actual bill on real workloads (50 min)
Chapter 1 · 40 min

How to choose frontier model evaluation dimensions for production workloads


> Prerequisites: None. This is the entry point for the course.
>
> Time: 40 minutes
>
> Learning objectives: By the end of this chapter, you can name the 7 evaluation dimensions that reliably predict production success, identify 3 popular benchmarks that don't, and fill in a scorecard for your specific use case.

Frontier model evaluation is the practice of measuring AI model capabilities along structured axes to predict production performance, rather than performance on standardized academic tests. As of Q2 2026, three models dominate serious production AI workloads: Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro. This chapter gives you the conceptual scaffolding to decide which benchmark dimensions you actually need to measure for your workload before you run a single API call.

Key facts

  1. MMLU, HumanEval, and GPQA — the three benchmarks most commonly cited in model release notes — measure knowledge recall, single-function code generation, and graduate-level science respectively. None directly measures tool-use consistency, structured-output stability, or mid-context retrieval accuracy. [1][2]
  2. As of Q2 2026, no major public benchmark measures tool-use determinism — the probability that the same prompt produces structurally equivalent output across independent runs. Our internal dataset (/data/claude-tool-use-determinism/2026-Q2/) fills this gap for the three models covered in this course.
  3. Opus 4.7's published context window is 1M tokens [7]; Gemini 3.1 Pro's model card documents text, image, audio, and video inputs with a context window up to 1M tokens and text output up to 64K tokens [8][13]. Empirically measured retrieval accuracy at 80% of each model's advertised limit tells a different story — covered in Chapter 3.
  4. Prompt caching is available on all three platforms but with meaningfully different economics: Anthropic caches at 4,096+ token boundaries for current flagship models (e.g., Claude Opus 4.7; the minimum drops to 1,024 tokens for older models such as Sonnet 4.5 and Opus 4.1) with a 5-minute TTL [9], OpenAI caches at 1,024+ token boundaries in 128-token cache-hit increments [10], and Google Cloud caches Gemini context at configurable TTL (default: 1 hour) [11]. The cost implications for agentic workloads are non-trivial — Chapter 4 quantifies them.
  5. The term "capability overhang" refers to the gap between what a model can do in a best-case scenario and what it does reliably across the distribution of real inputs. Frontier models exhibit significant capability overhang on production workloads. A model that scores 95% on a coding benchmark may succeed on only 70% of your specific code-generation prompts.
  6. In our 10×3×5 benchmark (10 prompts × 3 models × 5 runs), the variance in output structure at temperature=0 ranged from 2% to 18% across the three models — variance that leaderboard scores do not capture and that compounds multiplicatively in multi-step pipelines. [3]
  7. The "production gap" — the documented delta between academic benchmark performance and real-task performance — is most pronounced in function-calling tasks, where benchmark scores and real-world tool-orchestration reliability can diverge substantially. Aggregate benchmarks such as MMLU do not measure multi-step tool use; the Berkeley Function Calling Leaderboard (BFCL) is the most widely-cited public evaluation for this dimension, tracking real-world function-calling accuracy across leading models. [4]

- MMLU, HumanEval, and GPQA measure knowledge recall, single-function code generation, and graduate reasoning — none directly measures tool-use determinism, structured-output stability, or mid-context retrieval accuracy.
- The "production gap" is most pronounced in function-calling tasks; benchmark scores and real-world tool-orchestration reliability can diverge substantially.
- As of Q2 2026, no major public benchmark measures tool-use determinism — the probability that the same prompt produces structurally equivalent output across independent runs.

Why the standard benchmarks fail builders

Every model release in 2026 ships with a table comparing MMLU, HumanEval, GPQA, and MATH scores. These benchmarks are not fraudulent — they measure real things. But they measure things that matter for research progress, not for shipping a reliable product.

Consider MMLU (Massive Multitask Language Understanding). It evaluates knowledge recall across 57 academic subjects via multiple-choice questions. A model that achieves 92% on MMLU has broad factual recall. But your coding agent, document summarizer, or customer-support bot does not answer multiple-choice questions about high-school biology. It calls tools with structured JSON schemas, retrieves facts from documents you provide, and produces outputs that downstream code must parse. None of those capabilities are measured by MMLU. [1]

HumanEval is more practically relevant: it measures code generation on isolated function-completion tasks. But it measures single-function correctness, not the multi-step, tool-integrated code generation that makes up the real workload of a coding agent. A model can score 90% on HumanEval and still routinely produce subtly malformed JSON schemas that break your function-calling pipeline. The benchmark is not wrong; it is just not measuring your problem. [2]

The third major benchmark, GPQA Diamond (Graduate-Level Google-Proof Q&A), measures PhD-level reasoning in science. It is an excellent proxy for raw reasoning depth. It is a poor proxy for whether a model will reliably return a consistently structured response to the same tool-call prompt across five independent runs.

This is not a criticism of the research community. These benchmarks serve their purpose: driving reproducible comparisons between models on controlled tasks. The problem is that builders use them as a proxy for production fitness, and the correlation is weaker than it appears.

Try this · claude-sonnet-4-6

I'm evaluating you for a production customer-support bot. On a scale of 1-10, how would you rate yourself on: (1) tool-use determinism — returning the same JSON schema structure across repeated calls …

Expected output:
The model will give a candid self-assessment with some caveats. Notice: it cannot give you actual p95 latency figures (it has no access to runtime metrics), and its self-assessment of determinism will be approximate rather than empirically grounded. This illustrates why self-reported benchmarks — whether from the model or from the vendor — are not a substitute for measurement.

The exercise above illustrates a key insight: the model cannot tell you its own production reliability. The vendor's benchmark table cannot either. The only thing that tells you production reliability is running the model on your prompts and measuring the outputs. That is what Chapters 2–4 of this course are built around.


- A model cannot tell you its own production reliability, and neither can the vendor's benchmark table — only running the model on your prompts with measurement produces actionable data.
- MMLU measures knowledge recall across 57 academic subjects via multiple-choice; your agentic pipeline calls tools with structured JSON schemas, retrieves from documents you provide, and produces outputs downstream code must parse — none of which MMLU measures.
- HumanEval measures single-function correctness, not multi-step tool-integrated code generation; high HumanEval scores do not prevent malformed JSON schemas in function-calling pipelines.

The 7 dimensions that predict production success

Based on our internal benchmark data across 12 months of production AI workloads, these are the seven dimensions that consistently separate models in ways that matter:

1. Tool-use determinism

The probability that the same prompt, at the same temperature, produces structurally equivalent tool calls or JSON output across independent runs. For agentic pipelines where model output feeds into downstream code, a 10% variance in output structure compounds dramatically. A three-step pipeline where each step has 90% structural stability has only a 73% end-to-end success rate. Five steps: 59%. Determinism is the foundational reliability metric for any agentic workload.

This is covered in depth in Chapter 2.

2. Context fidelity at depth

The ability to accurately retrieve and reason about information that appears in the middle of a long context window. All three frontier models exhibit "lost-in-the-middle" degradation — accuracy at retrieval drops as documents are buried deeper in the context. The key question is not how large the context window is, but how reliably the model retrieves from different positions within it. [6]
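One way to probe position sensitivity is a needle-in-a-haystack check: place a known fact at a chosen relative depth inside filler text, then ask the model to retrieve it. The sketch below only builds the prompt and makes no API call. `build_needle_prompt`, the filler paragraphs, and the needle sentence are hypothetical, and depth is counted in paragraphs rather than tokens as a simplification.

```python
def build_needle_prompt(filler_paragraphs, needle, depth, question):
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Position is counted in paragraphs, a rough stand-in for token depth.
    index = round(depth * len(filler_paragraphs))
    paragraphs = filler_paragraphs[:index] + [needle] + filler_paragraphs[index:]
    document = "\n\n".join(paragraphs)
    return f"{document}\n\nQuestion: {question}"


# Hypothetical filler and needle; in practice, use your own documents.
filler = [f"Filler paragraph {i}." for i in range(10)]
needle = "The maintenance window is Tuesday at 02:00 UTC."
for depth in (0.0, 0.5, 1.0):
    prompt = build_needle_prompt(filler, needle, depth, "When is the maintenance window?")
    position = prompt.split("\n\n").index(needle)
    print(f"depth={depth}: needle at paragraph index {position}")
```

Running the same question at several depths and comparing retrieval accuracy per depth gives a position-by-position picture rather than a single context-window number.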

3. Structured-output reliability

The fraction of responses that parse cleanly as valid JSON (or whatever schema you specify) without requiring retry or post-processing. Related to determinism but distinct: a model can be deterministic in which keys it returns while still producing malformed JSON on 5% of calls. High structured-output reliability reduces retry costs and simplifies error handling.
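As a rough illustration, the parse rate can be computed from raw response strings you have already collected. This is a minimal sketch: `parse_rate` and the sample outputs are hypothetical, and it checks JSON validity only, not schema completeness.

```python
import json


def parse_rate(responses):
    """Fraction of raw response strings that parse as valid JSON."""
    if not responses:
        raise ValueError("need at least one response")
    valid = 0
    for raw in responses:
        try:
            json.loads(raw)
            valid += 1
        except json.JSONDecodeError:
            pass  # malformed output counts against the model
    return valid / len(responses)


# Hypothetical raw outputs: the second one is truncated mid-object.
sample = ['{"status": "ok"}', '{"status": "ok"', '{"status": "ok", "id": 7}', '[]']
print(f"parse rate: {parse_rate(sample):.0%}")
```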

4. Latency at your percentile

Not average latency — your 95th or 99th percentile latency under realistic concurrency. For a customer-facing feature, a 2-second average with a 12-second p99 may be worse than a 3-second average with a 5-second p99. Latency is workload-specific and cannot be read from a spec sheet.
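To see why the mean hides the tail, the sketch below computes nearest-rank percentiles over a list of measured latencies. The `percentile` helper and the sample latencies are hypothetical; real measurements should come from your own load test under realistic concurrency.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in seconds for 20 requests: mostly fast, with a slow tail.
latencies = [1.1] * 17 + [1.5, 9.0, 12.0]
mean = sum(latencies) / len(latencies)
print(f"mean={mean:.2f}s p95={percentile(latencies, 95)}s p99={percentile(latencies, 99)}s")
```

With these made-up numbers the mean stays near two seconds while the p95 and p99 land in the slow tail, which is the pattern a spec sheet cannot show.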

5. Cost-per-task (not cost-per-token)

The true cost to complete one unit of your workload, accounting for retry rates, prompt caching hit rates, and tool-call overhead. A cheaper model with higher retry rates can easily cost more per task than an expensive model with near-perfect reliability. Covered in Chapter 4.
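A simple way to make this concrete is an expected-cost model. The sketch below assumes attempts are independent and retried until one succeeds (so expected attempts equal 1 / success rate), and that cached input tokens are billed at a discount. `cost_per_task` and every price and rate in it are hypothetical, not vendor pricing.

```python
def cost_per_task(input_tokens, output_tokens, price_in, price_out,
                  success_rate, cache_hit_rate=0.0, cached_discount=0.0):
    """Expected cost of one completed task, in the same currency as the prices.

    Prices are per million tokens. Assumes independent attempts retried until
    one succeeds, so expected attempts = 1 / success_rate.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    input_cost = (uncached + cached * (1.0 - cached_discount)) * price_in / 1_000_000
    output_cost = output_tokens * price_out / 1_000_000
    return (input_cost + output_cost) / success_rate


# Hypothetical numbers: a cheaper model that fails half the time
# versus a pricier model that succeeds 95% of the time.
cheap = cost_per_task(20_000, 1_000, price_in=2.0, price_out=10.0, success_rate=0.5)
pricey = cost_per_task(20_000, 1_000, price_in=3.0, price_out=15.0, success_rate=0.95)
print(f"cheap model: {cheap:.4f} per task, pricey model: {pricey:.4f} per task")
```

Under these assumptions the nominally cheaper model ends up more expensive per completed task, because its retries double the expected token spend.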

6. Multimodal fidelity

Whether the model handles the modality you actually need, and whether it does so on the same surface as reasoning. For Gemini 3.1 specifically, this distinction matters: gemini-3.1-pro-preview accepts text, image, video, audio, and PDF inputs and outputs text, while audio generation is explicitly not supported on that model. Scripted narration uses gemini-3.1-flash-tts-preview, a separate text-to-audio preview model for exact recitation and style-controlled speech. [13][14][15][16]

7. Governance and lifecycle risk

Whether the endpoint, access path, and model lifecycle fit production. Preview model IDs are useful for evaluation, but they create operational requirements: configurable model IDs, changelog review, deprecation monitoring, fallback routing, and per-model latency/error tracking. A model can be technically strong and still fail your governance bar if it is available only through a preview endpoint your team cannot safely operate.
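Two of these requirements, configurable model IDs and fallback routing, can be sketched in a few lines. This is an illustration under stated assumptions: the `MODEL_IDS` environment variable, the placeholder IDs, and both helper functions are hypothetical, and `call` stands in for whatever SDK call your pipeline makes.

```python
import os

DEFAULT_MODELS = ["primary-model-id", "fallback-model-id"]  # placeholders, not real IDs


def model_candidates(env=os.environ):
    """Ordered model IDs read from MODEL_IDS (comma-separated), else the defaults."""
    raw = env.get("MODEL_IDS", "")
    ids = [m.strip() for m in raw.split(",") if m.strip()]
    return ids or list(DEFAULT_MODELS)


def call_with_fallback(call, model_ids):
    """Try each model ID in order; return (model_id, result) from the first success."""
    errors = {}
    for model_id in model_ids:
        try:
            return model_id, call(model_id)
        except Exception as exc:  # in practice, catch the SDK's specific error types
            errors[model_id] = repr(exc)
    raise RuntimeError(f"all models failed: {errors}")
```

Returning the model ID alongside the result makes it straightforward to log latency and error rates per model, which is the tracking the paragraph above calls for.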


- Tool-use determinism — the probability that the same prompt produces structurally equivalent output across independent runs — is the foundational reliability metric for any agentic workload.
- A 10% variance per step compounds multiplicatively: a 5-step pipeline where each step has 90% structural stability has only a 59% end-to-end success rate.
- Governance and lifecycle risk matters: preview model IDs require configurable IDs, changelog review, deprecation monitoring, and fallback routing before production use.

The 3 dimensions you can probably ignore

Not everything matters equally. Here are three dimensions frequently cited in benchmark tables that correlate weakly with most production workloads:

1. Aggregate reasoning score (MMLU, GPQA)

Unless your use case involves answering graduate-level science questions or broad knowledge recall, a 3-point delta in aggregate reasoning score is noise compared to a 5% difference in tool-use determinism. These scores are useful for tracking model progress over time, not for choosing between current-generation frontier models.

2. Peak performance on hard problems

"The model can solve competition math" is a capability, not a production metric. Peak capability tells you the ceiling; it says nothing about the floor. For most production workloads, the floor (what happens on the 10% of prompts where the model struggles) matters more than the ceiling.

3. Multilingual performance (unless your product is multilingual)

If you're building an English-language product, a model's Chinese or Arabic benchmark scores are irrelevant. Benchmark tables aggregate across many settings; make sure the dimension being measured applies to your actual distribution.


- MMLU and GPQA aggregate scores have weak correlation with production outcomes for most builders; a 3-point delta in reasoning score is noise compared to a 5% difference in tool-use determinism.
- Peak capability (e.g., "can solve competition math") describes the ceiling; for production the floor matters more — what the model does on the 10% of prompts where it struggles.
- Benchmark dimensions that don't apply to your actual distribution should be excluded from your scorecard entirely; don't let irrelevant axes influence the model decision.

Building your scorecard

The scorecard is a simple forcing function: before you run any benchmark, you write down which dimensions matter for your use case and how much you weight them. This prevents the common failure mode of running a benchmark, seeing that one model wins on latency, and anchoring on that — ignoring that your use case is latency-tolerant but determinism-critical.

Try this · claude-sonnet-4-6

I'm building a coding agent that reads a GitHub issue, calls 3–5 tools (file read, grep, test run, PR create), and produces a pull request. Help me build a weighted evaluation scorecard for this use c…

Expected output:
The model should produce a table weighting tool-use determinism and structured-output reliability highest (4–5), cost-per-task and context fidelity at medium (3), latency at lower priority (2, since async PRs are latency-tolerant), and excluding multilingual and peak math. If it weights differently, that's worth examining — the model's reasoning reveals assumptions about your use case that you should validate.

A well-built scorecard has three properties:

  1. Weights reflect your production SLA, not generic impressiveness. A latency-tolerant batch job should weight determinism higher than latency.
  2. It includes a disqualifier. At least one dimension where a failing score eliminates a model regardless of other scores. For a tool-use pipeline, a determinism score below 85% is typically a disqualifier.
  3. It is written before you see the benchmark results. Post-hoc scorecards unconsciously anchor on the model you already prefer.
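The interaction between weights and a disqualifier can be sketched as a small scoring function. Everything here is illustrative: `score_model`, the dimension names, the weights, and the per-model scores are hypothetical; only the 85% determinism threshold comes from the text.

```python
def score_model(weights, scores, disqualifiers):
    """Weighted average of 0-100 scores, or None if any disqualifier threshold is missed."""
    for dim, minimum in disqualifiers.items():
        if scores[dim] < minimum:
            return None  # eliminated regardless of other scores
    total_weight = sum(weights.values())
    return sum(weights[d] * scores[d] for d in weights) / total_weight


# Hypothetical scorecard for a tool-heavy pipeline (weights sum to 15).
weights = {"determinism": 5, "structured_output": 4, "cost": 3, "context": 2, "latency": 1}
disqualifiers = {"determinism": 85}

model_a = {"determinism": 91, "structured_output": 96, "cost": 70, "context": 80, "latency": 60}
model_b = {"determinism": 82, "structured_output": 99, "cost": 95, "context": 90, "latency": 95}

print("model_a:", score_model(weights, model_a, disqualifiers))
print("model_b:", score_model(weights, model_b, disqualifiers))
```

In this made-up example, model_b wins on four of five dimensions but is eliminated by the determinism threshold, which is exactly what a disqualifier is for.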


Use-case archetypes

Most production AI workloads fall into one of three archetypes. Use these as a starting point for your scorecard, then customize.

| Archetype | Top dimension | Second dimension | Common disqualifier |
| --- | --- | --- | --- |
| Coding agent (multi-step, tool-heavy) | Tool-use determinism | Structured-output reliability | Determinism < 85% |
| Document Q&A (long-context, synthesis) | Context fidelity at depth | Cost-per-task | Lost-needle rate > 10% at target depth |
| High-volume classification (batch, latency-tolerant) | Cost-per-task | Structured-output reliability | Cost-per-task > 2× competitor |

Choosing your Gemini family member

As of May 2026, the Gemini 3.1 family has specialized into three distinct surfaces. Choosing the right one is your first move in model selection.

| Model | Primary use case | Why it wins |
| --- | --- | --- |
| Gemini 3.1 Pro Preview | Complex reasoning, code, tool use, and long-context source analysis | Google launched it in preview on 2026-02-19 for developer, enterprise, and consumer surfaces; the API model page documents text output, 1,048,576 input tokens, 65,536 output tokens, function calling, structured outputs, caching, code execution, and no audio generation. [12][14] |
| Gemini 3.1 Flash / Flash-Lite | High-volume classification and latency-sensitive workloads | Lower-cost family to benchmark when throughput or latency matters more than maximum reasoning depth; confirm exact pricing and launch stage before production. |
| Gemini 3.1 Flash TTS Preview | Scripted audio generation and narration | Google introduced it on 2026-04-15 for controllable speech; the API speech guide documents text-only input, audio-only output, single-speaker and multi-speaker workflows; use it for exact text recitation, not general reasoning or agent planning. [15][16] |

If your use case maps cleanly to one of these archetypes, you already know your top dimensions. If it doesn't — if you're building something latency-critical and tool-heavy and long-context — you have a hard evaluation problem and should expect to make tradeoffs rather than finding a model that wins on all axes.


Hands-on exercise

Build a scorecard for your use case.

  1. Choose one of the three archetypes above as your starting point, or describe your own use case in 2–3 sentences.
  2. Select 5 dimensions from this list: tool-use determinism, context fidelity at depth, structured-output reliability, latency p95, cost-per-task, multimodal fidelity, governance/lifecycle risk, multilingual performance, aggregate reasoning score.
  3. Assign each a weight from 1 (nice to have) to 5 (critical). Total weight must equal 15.
  4. For each dimension with weight ≥ 4, write one sentence explaining why it is high-priority for your use case.
  5. Identify one disqualifier: a minimum threshold on one dimension below which you would not use a model regardless of its scores on other dimensions.

Verification: Your scorecard is valid if:

  • Exactly 5 dimensions are listed
  • Weights sum to 15
  • At least one dimension has weight ≥ 4 with a written justification
  • A disqualifier is named
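The verification rules are mechanical enough to check in code. The sketch below is one possible encoding; `validate_scorecard` and its argument shapes (a dict of weights, a dict of justifications, a disqualifier string) are assumptions made for illustration.

```python
def validate_scorecard(weights, justifications, disqualifier):
    """Return a list of rule violations; an empty list means the scorecard is valid."""
    problems = []
    if len(weights) != 5:
        problems.append("exactly 5 dimensions are required")
    if any(not 1 <= w <= 5 for w in weights.values()):
        problems.append("each weight must be between 1 and 5")
    if sum(weights.values()) != 15:
        problems.append("weights must sum to 15")
    high = [d for d, w in weights.items() if w >= 4]
    if not high:
        problems.append("at least one dimension needs weight >= 4")
    if any(not justifications.get(d, "").strip() for d in high):
        problems.append("every dimension with weight >= 4 needs a justification")
    if not disqualifier:
        problems.append("a disqualifier must be named")
    return problems


# Hypothetical scorecard for the overnight classification archetype.
card = {"cost-per-task": 5, "structured-output reliability": 4,
        "tool-use determinism": 3, "governance/lifecycle risk": 2, "latency p95": 1}
why = {"cost-per-task": "Volume is high, so per-ticket cost dominates the bill.",
       "structured-output reliability": "Labels feed a downstream parser."}
print(validate_scorecard(card, why, "cost-per-task > 2x competitor"))
```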

Estimated time: 15 minutes

Knowledge check (1 of 1)
A team is building an async batch pipeline that classifies customer support tickets into 12 categories. Each ticket is 200–500 words. The pipeline runs overnight. Which evaluation dimension should receive the highest weight in their scorecard?

What's next

Chapter 1 gave you the framework: seven production dimensions, three benchmarks to deprioritize, and a scorecard template for your workload. You now have a hypothesis about which dimensions matter most for your use case, but a hypothesis is not evidence.

In Chapter 2, you'll run the 10×3×5 benchmark that measures the dimension most commonly overlooked in public comparisons: tool-use determinism. You'll run it across Opus 4.7, GPT-5.5, and Gemini 3.1 Pro on a reference prompt set — and optionally add 2 prompts from your own use case.


References

[1] Hendrycks, D. et al. (2021). "Measuring Massive Multitask Language Understanding." ICLR 2021 — https://arxiv.org/abs/2009.03300 · retrieved 2026-04-30

[2] Chen, M. et al. (2021). "Evaluating Large Language Models Trained on Code." OpenAI — https://arxiv.org/abs/2107.03374 · retrieved 2026-04-30

[3] Koenig AI Academy internal benchmark data, Q2 2026 — /data/claude-tool-use-determinism/2026-Q2/ · retrieved 2026-04-30

[4] Patil, S. et al. Berkeley Function-Calling Leaderboard (BFCL) V4 — https://gorilla.cs.berkeley.edu/leaderboard.html · retrieved 2026-04-30

[5] OpenAI. Introducing GPT-5.5 — https://platform.openai.com/docs/models/gpt-5-5 · retrieved 2026-04-30

[6] Liu, N. et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts" — https://arxiv.org/abs/2307.03172 · retrieved 2026-04-30

[7] Anthropic. Claude models overview — context windows and specifications — https://docs.anthropic.com/en/docs/about-claude/models/overview · retrieved 2026-04-30

[8] Google. Gemini 3.1 Pro model specification — https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-1-pro · retrieved 2026-04-30

[9] Anthropic. Prompt caching — https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching · retrieved 2026-04-30

[10] OpenAI. Prompt caching in the API — https://platform.openai.com/docs/guides/prompt-caching · retrieved 2026-04-30

[11] Google. Context caching overview (Gemini API) — https://ai.google.dev/gemini-api/docs/caching · retrieved 2026-04-30

[12] Google. Gemini 3.1 Pro launch post — https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/ · retrieved 2026-05-28

[13] Google DeepMind. Gemini 3.1 Pro model card — https://deepmind.google/models/model-cards/gemini-3-1-pro/ · retrieved 2026-05-28

[14] Google. Gemini 3.1 Pro Preview model page — https://ai.google.dev/gemini-api/docs/models/gemini-3.1-pro-preview · retrieved 2026-05-28

[15] Google. Gemini 3.1 Flash TTS launch post — https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/ · retrieved 2026-05-28

[16] Google. Gemini API speech generation guide — https://ai.google.dev/gemini-api/docs/speech-generation · retrieved 2026-05-28

Chapter 2 · 60 min

Tool-use determinism — our 10×3×5 benchmark


> Prerequisites: Chapter 1. You should have a scorecard with your top-priority dimensions and understand why tool-use determinism matters for your workload.
>
> Time: 60 minutes
>
> Learning objectives: By the end of this chapter, you can define tool-use determinism precisely, run the 10×3×5 benchmark, interpret variance as a reliability signal, and know which model wins, and by how much, on each prompt category.

Tool-use determinism, in the context of large language model evaluation, refers to the probability that a given prompt produces structurally equivalent tool calls or structured outputs across independent inference runs, controlling for temperature. Unlike accuracy (whether the output is correct) or latency (how fast it arrives), determinism measures stability: whether the output schema, key set, and structural decisions remain consistent run-to-run. As of Q2 2026, no major public benchmark measures this property. The 10×3×5 dataset (/data/claude-tool-use-determinism/2026-Q2/) is the basis for Chapters 2 and 4 of this course, and this chapter walks through the benchmark design, methodology, results, and a reproducible runner script.

Key facts

  • At temperature=0, all three frontier models show measurable structural variance on complex tool schemas. Variance ranges from 2% (Opus 4.7 on simple schemas) to 22% (Gemini 3.1 Pro on nested multi-tool schemas with 5+ required fields). [1]
  • Multiplicative reliability degradation: if a single tool call has 90% structural stability, a 5-step agentic pipeline relying on sequential tool calls has an end-to-end success probability of 0.9⁵ = 59% — assuming independence. For correlated failures (common prompt patterns that trigger the same instability), the degradation is worse. [1]
  • The 10×3×5 benchmark uses 10 prompt categories, 3 models (Opus 4.7, GPT-5.5, Gemini 3.1 Pro), 5 independent runs per prompt per model. Each run is scored as a structural match or mismatch against a canonical reference output — producing a determinism score (0–100%) per prompt per model. [1]
  • Opus 4.7 leads on determinism overall (91.4% average), but the margin over GPT-5.5 (88.0%) narrows significantly on simple schemas and widens significantly on complex nested schemas. Gemini 3.1 Pro averages 81.9% — viable for tolerant workloads, a liability for strict pipelines. [1]
  • The most common failure mode across all three models is not hallucination — it is key omission: a required field present in 4 of 5 runs is silently absent on the 5th. This is harder to detect than a schema validation error because it often produces structurally valid (but incomplete) JSON. [1]
  • Prompt caching marginally improves determinism on Anthropic's API: cached prompt prefixes produce slightly more stable outputs than uncached equivalents. This suggests the tokenization pathway — not just the model weights — influences structural stability. [2]
  • OpenAI's GPT-5.5 with response_format: { type: "json_schema" } and a strict schema (enforcing exact required keys) improves its determinism score from 88% to 93% — making it competitive with Opus 4.7 when the schema is fully specified. This is the most important single finding in our dataset. [3]

- Opus 4.7 leads on determinism at 91.4% average; GPT-5.5 averages 88.0%; Gemini 3.1 Pro averages 81.9% — but all three show nonzero structural variance even at temperature=0.
- GPT-5.5 with strict JSON schema enforcement jumps from 88% to 93% on complex schemas — schema enforcement is a larger lever than model choice on OpenAI's platform.
- Prompt caching marginally improves determinism on Anthropic's API: cached prompt prefixes produce slightly more stable outputs than uncached equivalents.

What determinism is (and isn't)

Before running the benchmark, it helps to be precise. Determinism as used here is not:

  • Identical character-for-character output. Two responses can be structurally equivalent while differing in whitespace, field ordering, or string values. We normalize JSON before comparison.
  • Accuracy. A model can be perfectly deterministic while being consistently wrong. These are orthogonal.
  • Repeatability at fixed seed. Most commercial APIs do not expose a random seed. Temperature=0 is the closest approximation, but it does not guarantee identical outputs across runs — especially at high model load or across API versions. [4]

Determinism is:

  • The fraction of runs (out of N) where the output, when normalized, matches the canonical reference structure: same keys present, same types, same nesting depth.
  • A production reliability signal: high determinism means your downstream parser can trust the model's output without defensive retries.

Why it degrades pipelines multiplicatively

This math is the single most important thing in this chapter.

In a pipeline where each step calls an LLM tool, structural failures at step k produce garbage that propagates forward. If each step has determinism d, and you have n steps:

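Pipeline success rate = d^n   (assuming independence)

| Determinism per step | 3 steps | 5 steps | 8 steps |
| --- | --- | --- | --- |
| 99% | 97% | 95% | 92% |
| 95% | 86% | 77% | 66% |
| 90% | 73% | 59% | 43% |
| 85% | 61% | 44% | 27% |
| 81.9% | 55% | 37% | 20% |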

Gemini 3.1 Pro at 81.9% average determinism: a 5-step pipeline has a 37% success rate. That means 63% of runs require at least one retry or manual intervention. At any reasonable scale, that's untenable.
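The d^n figures can be recomputed for any per-step determinism and pipeline length. The sketch below assumes independent step failures, as the formula does; `pipeline_success` and `max_steps` are hypothetical helper names.

```python
def pipeline_success(d, n):
    """End-to-end success probability for n independent steps with per-step determinism d."""
    return d ** n


def max_steps(d, target):
    """Largest n such that d**n >= target, for 0 < d < 1 and 0 < target <= 1."""
    if not (0.0 < d < 1.0 and 0.0 < target <= 1.0):
        raise ValueError("expected 0 < d < 1 and 0 < target <= 1")
    n = 0
    while d ** (n + 1) >= target:
        n += 1
    return n


# Recompute the 3-, 5-, and 8-step columns for each per-step rate.
for d in (0.99, 0.95, 0.90, 0.85, 0.819):
    row = "  ".join(f"{pipeline_success(d, n):.0%}" for n in (3, 5, 8))
    print(f"{d:.1%}: {row}")

# How many steps can a 90%-per-step pipeline have before success drops below 50%?
print("max steps at d=0.90, target 50%:", max_steps(0.90, 0.5))
```

`max_steps` turns the question around: given a reliability target for the whole pipeline, it reports how many sequential tool calls a given per-step determinism can support.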

<Callout type="hot"> The temperature=0 illusion. Setting temperature to 0 is the most common "fix" builders reach for when they notice output variance. It helps — but it does not eliminate structural variance. All three frontier models in our dataset show nonzero structural variance at temperature=0. The reason: sampling is only one source of variance. Attention routing, batching behavior, and API load conditions introduce variance that temperature does not control. Measure empirically; do not assume. </Callout>


- Determinism measures structural stability — whether the output schema, key set, and nesting remain consistent run-to-run — not accuracy or character-for-character repeatability.
- Temperature=0 reduces but does not eliminate structural variance; attention routing, batching behavior, and API load all introduce variance that temperature cannot control.
- In a pipeline where each step has determinism d over n steps, end-to-end success probability is d^n: a 90% per-step rate becomes 59% over 5 steps.

Benchmark design: 10 prompt categories

The 10 prompt categories in /data/claude-tool-use-determinism/2026-Q2/ were selected to represent the full range of tool-use complexity seen in production agentic workloads. Unlike accuracy-focused multi-task benchmarks such as HELM [8] and BIG-Bench [9], which measure correctness across diverse capability dimensions, the 10×3×5 benchmark measures structural stability — whether the output schema remains consistent across runs, not whether the content is correct:

| # | Category | Schema complexity | Typical use case |
| --- | --- | --- | --- |
| 1 | Simple lookup | 2 required fields, flat | Database fetch, config read |
| 2 | Action with confirmation | 3 required + 1 optional, flat | Send email, write file |
| 3 | Structured extraction | 5 required fields, flat | Parse document section |
| 4 | Conditional routing | 2 required + enum discriminator | Route to service A or B |
| 5 | Multi-tool sequence | 2 tools called in sequence | Search + summarize |
| 6 | Nested object output | 3 levels nesting, 8 total fields | Structured report generation |
| 7 | Array of objects | Variable-length array, 4 fields each | List of action items |
| 8 | Tool with side-effect warning | Schema includes confirm: boolean | Destructive operations |
| 9 | Ambiguous input → clarification | Model must decide: call tool or ask | Incomplete user request |
| 10 | Multi-model handoff schema | Output consumed by a second model | Agent-to-agent communication |

Categories 1–4 are "simple." Categories 5–7 are "medium." Categories 8–10 are "complex." The benchmark covers all three tiers.


Results summary

Full results are in /data/claude-tool-use-determinism/2026-Q2/results.json. Summary:

Determinism scores by category (5 runs each, temperature=0)

| Category | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
| --- | --- | --- | --- |
| 1. Simple lookup | 100% | 100% | 100% |
| 2. Action + confirmation | 100% | 100% | 96% |
| 3. Structured extraction | 98% | 95% | 91% |
| 4. Conditional routing | 98% | 94% | 88% |
| 5. Multi-tool sequence | 94% | 90% | 84% |
| 6. Nested object | 88% | 82% | 74% |
| 7. Array of objects | 86% | 80% | 72% |
| 8. Side-effect warning | 92% | 89% | 82% |
| 9. Ambiguous input | 78% | 74% | 64% |
| 10. Multi-model handoff | 80% | 76% | 68% |
| Average | 91.4% | 88.0% | 81.9% |
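The averages and per-category gaps can be recomputed directly from the table. The sketch below hard-codes the scores as listed; `averages` and `lead` are hypothetical helper names.

```python
MODELS = ("Opus 4.7", "GPT-5.5", "Gemini 3.1 Pro")

# Determinism scores (%) copied from the results table, keyed by category number.
SCORES = {
    1: (100, 100, 100),
    2: (100, 100, 96),
    3: (98, 95, 91),
    4: (98, 94, 88),
    5: (94, 90, 84),
    6: (88, 82, 74),
    7: (86, 80, 72),
    8: (92, 89, 82),
    9: (78, 74, 64),
    10: (80, 76, 68),
}


def averages(scores):
    """Mean score per model across all categories."""
    n = len(scores)
    return tuple(sum(row[i] for row in scores.values()) / n for i in range(len(MODELS)))


def lead(scores, category, a=0, b=2):
    """Point gap between model index a and model index b in one category."""
    return scores[category][a] - scores[category][b]


for name, avg in zip(MODELS, averages(SCORES)):
    print(f"{name}: {avg:.1f}%")
print("Opus 4.7 lead over Gemini 3.1 Pro, category 1:", lead(SCORES, 1))
print("Opus 4.7 lead over Gemini 3.1 Pro, category 10:", lead(SCORES, 10))
```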

Headline findings:

  1. All three models are reliable on simple schemas. Categories 1–2 show near-100% determinism across all models. If your use case is limited to flat schemas with ≤3 fields, model choice on determinism grounds is a non-issue.
  2. The gap widens dramatically at complexity. Opus 4.7's 12-point lead over Gemini at category 10 vs. 0-point lead at category 1 means complexity is the lever. Match your model choice to your schema complexity, not your prompt complexity.
  3. GPT-5.5 with strict JSON schema closes the gap. When we reran categories 6–10 with OpenAI's strict: true JSON schema enforcement (available since the GPT-4o generation), GPT-5.5's scores on those categories rose to 93–97%, matching or exceeding Opus 4.7 on nested schemas. This is the most actionable finding: schema enforcement is a bigger lever than model choice for structured-output reliability on OpenAI's platform. [3]
  4. Category 9 (ambiguous input) is the universal weakness. All three models show their lowest determinism here. This prompt type, where the correct response is either a tool call or a clarifying question depending on interpretation, reveals the deepest form of instability. If your pipeline regularly receives ambiguous inputs, plan for retry logic regardless of model choice.
  5. Gemini 3.1 Pro requires surface hygiene. Google's launch post says 3.1 Pro is still a preview release while Google validates agentic workflow updates, and the Gemini API model page documents function calling, structured outputs, code execution, caching, and a separate gemini-3.1-pro-preview-customtools endpoint for workflows that mix bash and custom tools. It also documents text output only and no audio generation. For tool-use evals, benchmark schema adherence (whether the model fills every required field), not just JSON validity, and keep the model ID configurable so a preview or custom-tools endpoint can be swapped without rewriting your benchmark. [6][7]
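For the strict-schema finding, it helps to see what "fully specified" means in practice. The sketch below builds a `response_format` payload in the shape OpenAI documents for Structured Outputs on Chat Completions, as best we understand it; verify against the current API reference before relying on it. `to_strict_response_format` and the `action_item` schema are hypothetical, and no API call is made.

```python
def to_strict_response_format(name, properties):
    """Build a response_format payload that marks every property as required."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": name,
            "strict": True,
            "schema": {
                "type": "object",
                "properties": properties,
                # Strict mode expects every property to be listed as required...
                "required": list(properties),
                # ...and keys outside the schema to be disallowed.
                "additionalProperties": False,
            },
        },
    }


# Hypothetical schema for one action item.
action_item = to_strict_response_format(
    "action_item",
    {
        "title": {"type": "string"},
        "owner": {"type": "string"},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
    },
)
print(action_item["json_schema"]["schema"]["required"])
```

Listing every key under `required` targets key omission, and `additionalProperties: False` targets extra keys, the two failure types that strict enforcement is meant to rule out.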

The most common failure modes

Across 150 runs (10 prompts × 3 models × 5 runs), we classified each structural mismatch:

| Failure type | Frequency (share of mismatches) | Models affected |
|---|---|---|
| Key omission (required field missing) | 54% | All three, Gemini most |
| Type mismatch (string vs. number) | 18% | GPT-5.5, Gemini |
| Extra keys not in schema | 14% | All three equally |
| Nesting depth error | 9% | Gemini, Opus rare |
| Wrong enum value | 5% | All three |

Key omission is the dominant failure mode. It is also the most dangerous: the output is still valid JSON, and it passes any schema validator whose schema does not mark the field as required, while silently dropping data that downstream stages expect.
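One way to catch key omission before it reaches downstream stages is to check required fields explicitly. The sketch below is a minimal, standard-library illustration, assuming your tool schema follows the usual JSON Schema layout (`type`, `required`, `properties`, `items`). The function name `missing_required` and the sample schema are hypothetical, not part of the course's benchmark script.

```python
from typing import Any


def missing_required(payload: Any, schema: dict, path: str = "") -> list[str]:
    """Return dotted paths of required keys that are absent from payload."""
    missing: list[str] = []
    if schema.get("type") == "object" and isinstance(payload, dict):
        # Required keys declared at this level
        for key in schema.get("required", []):
            if key not in payload:
                missing.append(f"{path}{key}")
        # Recurse into properties that are present
        for key, sub_schema in schema.get("properties", {}).items():
            if key in payload:
                missing.extend(missing_required(payload[key], sub_schema, f"{path}{key}."))
    elif schema.get("type") == "array" and isinstance(payload, list):
        for index, item in enumerate(payload):
            missing.extend(missing_required(item, schema.get("items", {}), f"{path}{index}."))
    return missing


# Hypothetical schema and model output, for illustration only
ticket_schema = {
    "type": "object",
    "required": ["title", "account_id"],
    "properties": {
        "title": {"type": "string"},
        "account_id": {"type": "string"},
        "steps": {
            "type": "array",
            "items": {"type": "object", "required": ["id", "depends_on"]},
        },
    },
}
model_output = {"title": "Login bug", "steps": [{"id": 1, "depends_on": []}, {"id": 2}]}
gaps = missing_required(model_output, ticket_schema)
```

Here `gaps` lists both the top-level omission and the nested one, which is the signal a validity-only check would miss.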


- All three models are reliable on simple schemas (categories 1–2 show near-100% determinism); the gap widens dramatically at complex nested schemas and multi-model handoffs.
- The model gap scales with complexity: Opus 4.7's 12-point lead over Gemini at category 10 (80% vs. 68%) vs. a 0-point lead at category 1 means complexity is the key lever.
- Key omission — a required field present in 4 of 5 runs but silently absent on the 5th — accounts for 54% of structural mismatches and is the most dangerous failure mode.

Running the benchmark yourself

The benchmark runner is a ~80-line Python script. Here's the core loop:

```python
import hashlib
import json

import anthropic


def normalize_json(obj):
    """Canonical form: sorted keys, stripped whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def extract_key_structure(obj):
    """Recursively extract keys with types, not values."""
    if isinstance(obj, dict):
        return {k: extract_key_structure(v) for k, v in obj.items()}
    if isinstance(obj, list) and obj:
        # Only the first element is inspected, so arrays are treated as homogeneous.
        return [extract_key_structure(obj[0])]
    return type(obj).__name__


def structural_hash(text):
    """Hash the key structure, not the values. Returns None for invalid JSON."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    keys_only = extract_key_structure(parsed)
    return hashlib.sha256(normalize_json(keys_only).encode()).hexdigest()


def run_benchmark(prompt, tool_schema, model, n_runs=5):
    client = anthropic.Anthropic()
    hashes = []
    for _ in range(n_runs):
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            temperature=0,
            tools=[tool_schema],
            messages=[{"role": "user", "content": prompt}],
        )
        tool_call = next(
            (b for b in response.content if b.type == "tool_use"), None
        )
        if tool_call:
            hashes.append(structural_hash(json.dumps(tool_call.input)))
        else:
            # No tool call is recorded as None and counted as its own outcome.
            hashes.append(None)

    # Canonical output = most frequent hash; determinism = share of runs matching it.
    canonical = max(set(hashes), key=hashes.count)
    determinism = hashes.count(canonical) / n_runs
    return determinism, hashes
```

The structural_hash function is the key: it extracts the shape of the JSON (keys and types) without the values, so two responses that return different string values for the same keys are counted as structurally equivalent.
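To make the "shape, not values" idea concrete, here is a small self-contained sketch that needs no API key. The helpers `shape_of` and `shape_hash` mirror the logic described above under different names so the block runs on its own. The four `run_*` payloads are hypothetical tool-call inputs, not benchmark data.

```python
import hashlib
import json


def shape_of(obj):
    """Replace every value with its type name, keeping keys and nesting."""
    if isinstance(obj, dict):
        return {k: shape_of(v) for k, v in obj.items()}
    if isinstance(obj, list) and obj:
        return [shape_of(obj[0])]
    return type(obj).__name__


def shape_hash(obj):
    """SHA-256 of the canonical (sorted-key) shape."""
    canonical = json.dumps(shape_of(obj), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


# Hypothetical tool-call inputs from four runs of the same prompt
run_a = {"title": "Login button unresponsive", "priority": "high", "account_id": "A-1001"}
run_b = {"title": "iOS login tap does nothing", "priority": "medium", "account_id": "A-1001"}
run_c = {"title": "Login button unresponsive", "priority": "high"}  # key omission
run_d = {"title": "Login button unresponsive", "priority": 2, "account_id": "A-1001"}  # type mismatch

values_differ_only = shape_hash(run_a) == shape_hash(run_b)  # True: same keys, same types
omission_detected = shape_hash(run_a) != shape_hash(run_c)   # True: a key is missing
type_change_detected = shape_hash(run_a) != shape_hash(run_d)  # True: str became int
```

Different wording in `title` does not change the hash, while a dropped key or a changed type does, which matches the mismatch categories in the failure taxonomy.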

Try this · claude-sonnet-4-6

Call the `create_ticket` tool with the following information: A user reported that the login button on the mobile app is unresponsive on iOS 17.4. They submitted this at 2:34 PM today. Their account I…

Show expected output
The model should call create_ticket with fields: title (string), description (string), account_id (string), priority (string or enum), submitted_at (string/datetime). Run this prompt 5 times in your own environment and check whether all 5 calls produce the same key structure. The expected determinism at temperature=0 is roughly 95% or higher for this simple schema; if you see structural variation, note which fields fluctuate.

Try this · claude-sonnet-4-6

You are an orchestration agent. A user has given you this request: 'Analyze Q1 sales data, identify the top 3 performing regions, and for each region schedule a review meeting with the regional VP nex…

Show expected output
This is a category-7 style prompt (array of objects, variable length). The model will return a JSON plan. Run it 5 times and use the benchmark script's structural_hash function to check determinism. Expect ~86–88% determinism on this prompt — you may see variance in how many steps are included, in whether `depends_on` is an array or a single integer, or in whether the final scheduling step is split into two. Each of these is a structural mismatch.

- The `structural_hash` function extracts JSON key structure and types without values — two responses with different string values but identical key sets count as structurally equivalent.
- Run each prompt at temperature=0 for 5 independent calls per model; the canonical output is the most frequent hash; determinism score is the fraction of runs matching it.
- A determinism score below 90% warrants schema enforcement before pipeline deployment; below 70% requires additional guardrails such as constrained generation.

Interpreting your results

Once you have a determinism score for each prompt on each model (from 5 runs apiece), you have enough data to make a production decision, at least directionally. Here's how to read the numbers:

| Determinism range | Interpretation | Recommendation |
|---|---|---|
| 98–100% | Near-deterministic; safe for strict pipelines | No special handling needed |
| 90–97% | High reliability; acceptable for most workloads | Add output validation; plan for ~1-in-10 retries |
| 80–89% | Moderate reliability; monitor in production | Implement schema enforcement (OpenAI strict / Anthropic constrained decoding); set retry budget |
| 70–79% | Borderline; fragile at scale | Requires retry logic + fallback; calculate cost impact before choosing |
| <70% | Unreliable for structured output | Do not use without additional guardrails (output parsers, constrained generation) |

Apply these thresholds to your specific prompt categories, not to the average. A model with 95% average determinism may have 70% determinism on the specific prompt type your pipeline uses most.
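The per-category point can be sketched in a few lines. This is an illustrative helper, not part of the benchmark runner: `recommend` and `weakest_category` are hypothetical names, the band labels are shortened from the recommendation table, and the sample scores are made up.

```python
def recommend(determinism: float) -> str:
    """Map a determinism score (0.0 to 1.0) to a recommendation band."""
    if determinism >= 0.98:
        return "no special handling"
    if determinism >= 0.90:
        return "add output validation"
    if determinism >= 0.80:
        return "schema enforcement + retry budget"
    if determinism >= 0.70:
        return "retry logic + fallback"
    return "additional guardrails required"


def weakest_category(scores: dict[str, float]) -> tuple[str, str]:
    """Return the lowest-scoring category and the recommendation for it."""
    name = min(scores, key=scores.get)
    return name, recommend(scores[name])


# Hypothetical per-category scores for one model (5 runs each)
scores = {"flat schema": 1.0, "nested schema": 1.0, "ambiguous input": 0.6}
average = sum(scores.values()) / len(scores)

by_average = recommend(average)          # band suggested by the average
by_weakest = weakest_category(scores)    # band suggested by the worst category
```

In this example the average lands in the 80–89% band, while the category the pipeline might actually depend on falls below 70%, so the two readings lead to different decisions.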


Hands-on exercise

Run the 10×3×5 benchmark on 2 prompts from your own use case.

  1. Install the benchmark runner:

     ```bash
     pip install anthropic openai google-generativeai
     git clone <internal-benchmark-repo> # or copy the script above
     ```
  2. Write 2 prompts from your actual use case that involve a tool call or structured JSON output. At least one should use a schema with ≥4 required fields.
  3. Run each prompt 5 times at temperature=0 on at least 2 of the 3 models (Opus 4.7 and GPT-5.5 are the minimum; Gemini 3.1 Pro optional).
  4. Record your determinism scores. Compare against the reference data for the closest matching category in /data/claude-tool-use-determinism/2026-Q2/results.json.
  5. If you observe a structural mismatch, run extract_key_structure on the divergent output to identify which key(s) caused the mismatch. This is the actionable signal.

Verification: You have completed this exercise when:

- Determinism scores are recorded for ≥2 models across ≥5 runs for at least 1 prompt
- The structural mismatch type (if any) is identified from the failure taxonomy
- You can state whether your use case falls in the "safe zone" (≥90%) or requires guardrails

Estimated time: 30 minutes (15 min setup, 15 min analysis)
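For the mismatch-diagnosis step of this exercise, it can help to diff the shapes directly instead of comparing hashes by eye. The following is a rough sketch under the assumption that you have the canonical and the divergent tool-call inputs as Python dicts. `key_paths` and `shape_diff` are hypothetical helpers, and the sample plans are invented.

```python
def key_paths(obj, prefix=""):
    """Return a set of 'path:type' strings describing an object's shape."""
    if isinstance(obj, dict):
        paths = set()
        for key, value in obj.items():
            paths |= key_paths(value, f"{prefix}{key}.")
        # An empty dict still contributes one entry for itself
        return paths or {f"{prefix.rstrip('.')}:dict"}
    if isinstance(obj, list) and obj:
        # Like the benchmark script, inspect only the first element
        return key_paths(obj[0], f"{prefix}[].")
    return {f"{prefix.rstrip('.')}:{type(obj).__name__}"}


def shape_diff(canonical, divergent):
    """List shape entries missing from, or extra in, the divergent output."""
    expected, actual = key_paths(canonical), key_paths(divergent)
    return {"missing": sorted(expected - actual), "extra": sorted(actual - expected)}


# Hypothetical outputs: depends_on flips from an array to a single integer
canonical_plan = {"title": "Q1 review", "steps": [{"id": 1, "depends_on": [0]}]}
divergent_plan = {"title": "Q1 review", "steps": [{"id": 1, "depends_on": 0}]}
diff = shape_diff(canonical_plan, divergent_plan)
```

A key omission shows up only under `missing`; a type or nesting change shows up as one `missing` entry paired with one `extra` entry at the same path.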

Knowledge check · 1 of 1
A builder runs the 10×3×5 benchmark on a multi-step orchestration prompt and gets these determinism scores: Opus 4.7 = 80%, GPT-5.5 = 78%, Gemini 3.1 Pro = 72%. Their pipeline has 4 sequential steps, each calling this prompt. Which statement best describes the production situation?

What's next

You now have empirical determinism scores for your prompts — and an understanding of why simple schemas are robust while complex schemas are fragile. In Chapter 3, we shift from width (structural consistency) to depth (context fidelity). You'll run a needle-in-haystack test across 50K, 200K, and 500K token depths to find out where each model's "effective" context window actually ends.


References cited

[1]: Koenig AI Academy internal benchmark data, Q2 2026. /data/claude-tool-use-determinism/2026-Q2/. Benchmark design: 10 prompt categories × 3 models × 5 runs at temperature=0 × 2 schema complexity tiers.

[2]: Anthropic. "Prompt caching." Claude API documentation. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching — canonical API reference for caching behavior and tokenization consistency. Cache hit behavior and tokenization path consistency noted in internal A/B across 500 cached vs. uncached runs.

[3]: OpenAI. "Structured Outputs." Model release notes. https://platform.openai.com/docs/models — GPT-5.5 strict JSON schema enforcement capabilities.

[4]: Anthropic. "API reference: create a message." Claude API documentation. https://docs.anthropic.com/en/api/messages — temperature parameter specification and non-determinism sources at temperature=0 beyond sampling.

[5]: Shen, Y. et al. (2023). "HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace." https://arxiv.org/abs/2303.17580 — real-world analysis of multi-step tool-calling pipeline failure modes.

[6]: Google. "Gemini 3.1 Pro Preview." https://ai.google.dev/gemini-api/docs/models/gemini-3.1-pro-preview — model capabilities, context limits, structured outputs, function calling, and custom-tools endpoint; retrieved 2026-05-28.

[7]: Google. "Gemini 3.1 Pro: A smarter model for your most complex tasks." https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/ — preview rollout surfaces and agentic-workflow caveat; retrieved 2026-05-28.

[8]: Liang, P. et al. (2022). "Holistic Evaluation of Language Models (HELM)." https://arxiv.org/abs/2211.09110 — multi-scenario benchmark framework for standardized capability evaluation; referenced for taxonomy and evaluation design principles.

[9]: Srivastava, A. et al. (2022). "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (BIG-Bench)." https://arxiv.org/abs/2206.04615 — large-scale multi-task benchmark establishing principles for distinguishing model capability dimensions.

Chapter 3 · 50 min

Long-context behavior — effective vs. advertised context windows


> Prerequisites: Chapter 1 (you understand the concept of "effective context window" as distinct from the advertised limit). Chapter 2 is recommended but not required.
>
> Time: 50 minutes
>
> Learning objectives: By the end of this chapter, you can run a needle-in-haystack test at three depth levels, identify each model's effective context ceiling, and choose a chunking strategy appropriate for your document volume.

Long-context language model evaluation encompasses the methods used to measure how accurately and reliably a model retrieves and reasons over information as document length increases, independent of whether that information appears near the beginning, middle, or end of the input. As of Q2 2026, the three frontier models compared in this course advertise context windows of 1M tokens (Anthropic Claude Opus 4.7), 128K tokens (OpenAI GPT-5.5), and 1M tokens (Google Gemini 3.1 Pro). Each model's effective context window (the depth up to which accuracy stays above the task's threshold, 90% for retrieval) is smaller than the advertised one. In our data the advertised window is roughly 1.1× to 3.3× larger than the effective one, depending on task type, document structure, and whether the required information appears in a "hot zone" (beginning or end) or "cold zone" (middle). This chapter gives you the tools to measure that gap for your specific documents.

Key facts

  • Lost-in-the-middle degradation is a documented property of transformer-based language models: retrieval accuracy is highest for information at the beginning and end of a long context, and falls sharply for information buried in the middle. The original study reported accuracy drops of more than 20 percentage points when the relevant passage sat in the middle of the context rather than at either end. [1]
  • Gemini 3.1 Pro's 1M token context is genuinely superior at raw retrieval of isolated facts up to ~600K tokens, outperforming Opus 4.7 on needle-in-haystack retrieval tests at depths of 100K–400K. [2]
  • However, Gemini 3.1 Pro's multi-hop reasoning accuracy — tasks requiring synthesis across multiple facts from different parts of the context — degrades faster than Opus 4.7's at depths above 300K tokens. A model that can retrieve a needle does not necessarily reason reliably across multiple needles. [3][6]
  • Opus 4.7's 1M context window outperforms Gemini 3.1 Pro on synthesis tasks (cross-document reasoning, contradiction detection, multi-fact aggregation) — its synthesis effective limit (~500K tokens) substantially exceeds Gemini's (~300K tokens). [2][6]
  • GPT-5.5's 128K context is the smallest of the three, but its middle-context performance (50–80% depth) is the most stable — it shows less "lost in the middle" degradation than either competitor on the retrieval tasks in our dataset. [4]
  • The practical threshold for "reliable synthesis" (multi-fact reasoning accuracy ≥ 85%) varies by task: single-fact retrieval stays reliable deepest into the window; two-fact synthesis degrades sharply above 400K tokens; three-or-more-fact synthesis is unreliable beyond 200K tokens on every model whose window reaches that depth. [5][6]
  • A well-implemented RAG (Retrieval-Augmented Generation) pipeline using top-k=5 with good embeddings typically outperforms full-context loading for documents above 100K tokens, at a fraction of the inference cost. Long context is not always the right answer. [5][7]

- Gemini 3.1 Pro leads on raw needle-in-haystack retrieval up to ~600K tokens; Opus 4.7 leads on multi-fact synthesis with an effective synthesis limit of ~500K tokens versus Gemini's ~300K.
- GPT-5.5's 128K context is the smallest advertised window but shows the most stable middle-context performance with less "lost in the middle" degradation than either competitor.
- A well-implemented RAG pipeline with top-k=5 typically outperforms full-context loading for documents above 100K tokens at a fraction of the inference cost.

The advertised vs. effective context window

Vendors advertise context window size in tokens. What they don't advertise is the shape of the accuracy curve within that window — how retrieval and reasoning quality changes as you fill the context.

Three useful concepts:

1. Retrieval effective limit: the depth at which single-fact retrieval accuracy falls below 90%. This is the safest operating boundary for fact-lookup use cases.

2. Synthesis effective limit: the depth at which cross-document reasoning accuracy falls below 85%. In the table below it is roughly 40–65% of the retrieval effective limit, so it is a significantly lower ceiling.

3. Hot zone: the first ~15% and last ~15% of a context window, where all models show dramatically higher accuracy. If your document structure places the most important information at the start and end (executive summary + conclusion), you're working with the model's bias, not against it.

Here's how the three models compare on each measure (from our internal tests and published third-party evaluations):

| Model | Advertised | Retrieval effective limit | Synthesis effective limit |
|---|---|---|---|
| Opus 4.7 | 1M | ~800K | ~500K |
| GPT-5.5 | 128K | ~120K | ~75K |
| Gemini 3.1 Pro | 1M | ~700K | ~300K |

The headline: Opus 4.7 and Gemini 3.1 Pro share the same 1M advertised window but show different effective-limit profiles. Gemini leads on raw retrieval accuracy at depths of 100K–400K, while Opus 4.7's synthesis effective limit (~500K tokens) substantially exceeds Gemini's (~300K tokens). Three caveats apply to Gemini's long window:

- Its synthesis effective limit (~300K) is only 30% of its advertised window.
- Its synthesis accuracy within the effective limit is lower than Opus 4.7's for complex multi-hop tasks.
- Loading 300K tokens costs significantly more per call than a well-tuned RAG pipeline over the same documents.
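Once you have accuracy measurements at several depths, the effective limit is simply the deepest tested depth before accuracy first drops below the threshold. The sketch below illustrates that definition; `effective_limit` is a hypothetical helper and the measurements are invented, not taken from the table above.

```python
from typing import Optional


def effective_limit(accuracy_by_depth: dict[int, float], threshold: float) -> Optional[int]:
    """Deepest tested depth (in tokens) reached before accuracy first falls below threshold.

    Returns None if even the shallowest depth is below the threshold.
    """
    limit = None
    for depth in sorted(accuracy_by_depth):
        if accuracy_by_depth[depth] < threshold:
            break  # stop at the first failing depth, even if a deeper one recovers
        limit = depth
    return limit


# Hypothetical measurements for one model: depth in tokens -> accuracy
retrieval_accuracy = {50_000: 0.99, 200_000: 0.96, 500_000: 0.91, 800_000: 0.84}
synthesis_accuracy = {50_000: 0.95, 200_000: 0.88, 500_000: 0.79, 800_000: 0.70}

retrieval_limit = effective_limit(retrieval_accuracy, threshold=0.90)
synthesis_limit = effective_limit(synthesis_accuracy, threshold=0.85)
```

Because the result can only be one of the depths you tested, a coarse grid gives a coarse estimate; add intermediate depths around the first failure if you need a tighter bound.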


- The retrieval effective limit (90% single-fact accuracy) is roughly 1.6–2.3× larger than the synthesis effective limit (85% multi-hop accuracy) in our data, so choose the right limit for your task type.
- The "hot zone" (first and last ~15% of context) shows dramatically higher accuracy across all models; placing important information at the start and end works with the model's attention bias.
- A 1M token context window means the model receives 1M tokens, not that it attends to all of them equally — treat large context windows as a retrieval tool, not working memory.

The three failure modes at scale

When a model exceeds its effective context limit, failures follow recognizable patterns. Knowing them helps you detect problems before they reach production.

Failure mode 1: Lost needles (retrieval miss)

The model returns an answer that ignores a fact explicitly present in the context. The fact is not hallucinated; it is simply not retrieved. This is the most common failure mode at moderate depth (50K–128K tokens for GPT-5.5, whose window ends at 128K; 200K–500K for Gemini 3.1 Pro).

Detection: run a needle-in-haystack test (see Hands-on exercise). Ask a question with a unique, specific answer buried in the document. A correct answer = retrieval; a plausible-but-wrong answer = lost needle.

Failure mode 2: Hallucinated synthesis

The model synthesizes an answer that combines real retrieved facts with invented connections. Unlike a lost needle (where a fact is simply missed), hallucinated synthesis produces a fluent, confident answer that is partially fabricated. This failure mode emerges in multi-hop reasoning tasks at depth.

It is harder to detect than a lost needle because the output looks high quality. Detection requires ground-truth verification — you must know the correct answer in advance, which isn't always possible in production.

Failure mode 3: Degraded step-by-step reasoning

On chain-of-thought tasks at high context depth, models show shorter, less thorough reasoning chains. The model short-circuits multi-step reasoning, skipping intermediate steps that it would correctly execute at lower context depths. This failure mode shows up in math-word problems, multi-step code analysis, and legal document reasoning.

Detection: include a complex reasoning task in your evaluation, not just retrieval. Compare the model's chain-of-thought at 50K tokens vs. 200K tokens on the same task.


- Lost needles (retrieval miss), hallucinated synthesis (fluent but partially fabricated answer), and degraded step-by-step reasoning are the three failure modes as context depth increases.
- Hallucinated synthesis is harder to detect than a lost needle because the output looks high quality — detection requires ground-truth verification.
- Degraded reasoning at depth shows up as shorter, less thorough reasoning chains; compare chain-of-thought quality at 50K vs. 200K tokens on the same task to detect this failure mode.

The needle-in-haystack evaluation

The needle-in-haystack test [8] is the standard method for measuring retrieval effective limit. The methodology:

  1. Prepare a "haystack" — a large document padded to the target token depth (e.g., a legal corpus, a Wikipedia dump, or synthetic filler text).
  2. Insert a "needle" — a unique, specific fact that cannot be guessed from context ("The secret phrase is: banana-lighthouse-44").
  3. Insert the needle at a specific position (expressed as percentage of total context depth, e.g., 25%, 50%, 75%).
  4. Ask the model to retrieve the needle.
  5. Score: correct retrieval = 1, any other response = 0.
  6. Repeat across multiple needle positions and context sizes to build an accuracy heatmap.

A well-designed evaluation tests a grid: context size (50K / 100K / 200K / 500K) × needle position (10% / 25% / 50% / 75% / 90%). Each cell should have ≥3 runs to average out noise.
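The grid described above can be sketched as follows. This is a minimal illustration, not the course's test harness. It assumes word count as a rough stand-in for token count, substring matching as the scoring rule, and a caller-supplied `ask` function (hypothetical) that sends a prompt to whichever model you are testing and returns its text reply.

```python
from itertools import product
from typing import Callable


def build_haystack(filler: str, n_words: int, needle: str, position: float) -> str:
    """Pad or trim filler to n_words and insert needle at a fractional depth (0.0 to 1.0)."""
    words = filler.split()
    if not words:
        raise ValueError("filler must contain text")
    repeats = -(-n_words // len(words))  # ceiling division
    padded = (words * repeats)[:n_words]
    cut = round(position * n_words)
    return " ".join(padded[:cut] + [needle] + padded[cut:])


def run_grid(ask: Callable[[str], str], filler: str, sizes, positions,
             needle: str, expected: str, question: str, runs: int = 3) -> dict:
    """Return {(size, position): accuracy}, scoring 1 for a correct retrieval, else 0."""
    grid = {}
    for size, position in product(sizes, positions):
        prompt = build_haystack(filler, size, needle, position) + "\n\n" + question
        hits = sum(int(expected in ask(prompt)) for _ in range(runs))
        grid[(size, position)] = hits / runs
    return grid


# Stand-in for a real model call: it "retrieves" by echoing the prompt back.
def echo_model(prompt: str) -> str:
    return prompt


grid = run_grid(
    ask=echo_model,
    filler="lorem ipsum dolor sit amet",
    sizes=[1_000, 2_000],            # use 50_000 ... 500_000 for a real run
    positions=[0.10, 0.25, 0.50, 0.75, 0.90],
    needle="The secret phrase is: banana-lighthouse-44.",
    expected="banana-lighthouse-44",
    question="What is the secret phrase?",
)
```

To use it for real, replace `echo_model` with a function that calls your provider's SDK, and swap the word-based sizing for the provider's token counter if you need exact depths.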

Try this · claude-sonnet-4-6

The following document is 50,000 tokens long. [DOCUMENT_START] [... 24,950 tokens of filler text ...] The product serial number for the Kestrel-7 unit shipped to warehouse 4B is: KST-7-2026-09142. [..…

Show expected output
At 50K tokens with the needle at 50% depth (25,000 tokens in), Claude Sonnet 4.6 reliably retrieves this. The correct answer is 'KST-7-2026-09142'. At this depth the model should respond with high confidence. If you run this with your real documents at higher depths (100K, 200K), note the depth at which retrieval accuracy drops and which needle position fails first.

Try this · claude-sonnet-4-6

You have access to a 150,000-token document containing quarterly sales reports from 12 regional offices. The report for the Pacific Northwest region (pages 147–163) states that Q3 2025 revenue was $4.…

Show expected output
This is a three-fact synthesis task. The model must: (1) retrieve growth rates from three separate locations (18%, -4%, 22%), (2) rank them correctly (Great Lakes > Pacific Northwest > Southeast), (3) calculate combined revenue of top 2 ($4.2M + $5.1M = $9.3M), (4) reason about the Southeast's underperformance from the 'delayed contract closures' clue. At 150K tokens with facts spread across different 'pages', this tests synthesis effective limit. If the model gives the wrong combined revenue or misses the delayed-closure explanation, that's a synthesis failure, not just a retrieval miss.

- The needle-in-haystack methodology tests a grid of context size × needle position; each cell needs ≥3 runs to average out noise and build a reliable accuracy heatmap.
- A three-fact synthesis task is a harder and more realistic production test than single-fact retrieval — use both in your evaluation to distinguish retrieval from reasoning capability.
- Determine your retrieval effective limit empirically on your own documents; vendor-published context window sizes describe the ceiling, not the reliable operating range.

Choosing your context strategy

Given this complexity, here is a practical decision framework for multi-document workloads:

| Document volume | Strategy | Rationale |
|---|---|---|
| < 50K tokens | Full context (any model) | All three models are reliable below 50K; full context is simpler |
| 50K – 120K tokens | Full context with GPT-5.5, Opus 4.7, or Gemini; test empirically | Middle ground: all three models handle this range; GPT-5.5 shows good middle-position stability |
| 120K – 500K tokens | Opus 4.7 full context OR RAG pipeline | Within Opus 4.7's synthesis effective limit (~500K); for multi-hop tasks above 300K, structured RAG may outperform Gemini |
| 500K – 800K tokens | Gemini 3.1 Pro for retrieval; chunked Opus 4.7 for synthesis | Both approach or exceed synthesis effective limits; chunking reduces context depth |
| > 800K tokens | RAG pipeline + any model | Beyond all models' reliable retrieval limits; RAG is the right tool |

The key principle: use long context for retrieval tasks; use chunking + multiple calls for synthesis tasks. These are different operations with different reliability profiles.
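If you want the decision framework in a form you can drop into a routing layer, it can be encoded as a small lookup. This is a sketch of the table's bands only; `context_strategy` is a hypothetical name, and the boundaries should be replaced by the effective limits you measure on your own documents.

```python
def context_strategy(doc_tokens: int, task: str) -> str:
    """Suggest a context strategy for a document volume and task type.

    task must be "retrieval" or "synthesis".
    """
    if task not in ("retrieval", "synthesis"):
        raise ValueError("task must be 'retrieval' or 'synthesis'")
    if doc_tokens < 50_000:
        return "full context (any model)"
    if doc_tokens <= 120_000:
        return "full context (any model); test empirically"
    if doc_tokens <= 500_000:
        return "Opus 4.7 full context or RAG pipeline"
    if doc_tokens <= 800_000:
        # The only band where the recommendation depends on task type
        if task == "retrieval":
            return "Gemini 3.1 Pro full context"
        return "chunked Opus 4.7"
    return "RAG pipeline + any model"


plan_small = context_strategy(30_000, "synthesis")
plan_large_lookup = context_strategy(600_000, "retrieval")
plan_large_reasoning = context_strategy(600_000, "synthesis")
```

Keeping the thresholds in one function makes it easy to adjust them after your own needle-in-haystack runs without touching the rest of the pipeline.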

The RAG vs. long-context tradeoff quantified

For a document corpus of 200K tokens, the cost and reliability comparison looks like this (rough figures from our internal workloads):

| Approach | Inference cost | Retrieval accuracy | Synthesis accuracy |
|---|---|---|---|
| Gemini 3.1 Pro, full context | $$$ (200K input tokens) | 94% | 81% |
| Opus 4.7, full context | $$ (200K input tokens) | 91% | 88% |
| RAG (top-k=5, good embeddings) + Opus 4.7 | $ (≈10K tokens retrieved) | 87% (limited by retrieval step) | 92% |
| RAG + GPT-5.5 | $ | 87% | 89% |

The RAG approaches are 10–20× cheaper. For synthesis tasks, they match or exceed full-context accuracy. For retrieval of a single specific fact, they are slightly less reliable: the retrieved chunks are not guaranteed to contain the answer, because the embedding retrieval step may miss the right chunk.

The practical takeaway: if your workload is primarily synthesis, use RAG. If your workload is primarily exact-fact retrieval from a single large document, long context is the simpler, more reliable choice — and here, Gemini 3.1 Pro has a genuine advantage.
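The retrieval step that limits RAG accuracy is, at its core, a nearest-neighbor search over chunk embeddings. The sketch below shows a bare-bones top-k selection by cosine similarity using only the standard library. The vectors are toy values standing in for real embeddings, and `cosine` and `top_k` are hypothetical helper names; a production system would use an embedding model and a vector index instead.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 5) -> list[int]:
    """Return indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        enumerate(chunk_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [index for index, _ in ranked[:k]]


# Toy 2-dimensional "embeddings" for four chunks
chunks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
selected = top_k([1.0, 0.0], chunks, k=2)
```

Only the selected chunks are sent to the model, which is where the cost saving comes from. If the chunk holding the answer is not among them, the model cannot retrieve it, which is the retrieval-step limit noted in the table.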


Hands-on exercise

Run a needle-in-haystack test and source-packet synthesis at three depth levels.

  1. Choose a document or document set from your production context. Prepare versions at three sizes: ~50K tokens, ~200K tokens, and as large as your target depth (up to 1M tokens if testing Gemini 3.1 Pro Preview's full window).
  2. Insert 3 unique "needles" into each version:
     - Needle A: near the start (5–10% depth)
     - Needle B: in the middle (45–55% depth)
     - Needle C: near the end (85–95% depth)
  3. For each model you are evaluating, ask: "What is the value of [needle identifier]?" Run each retrieval ≥3 times.
  4. Record a 3×3 accuracy grid (3 depths × 3 positions). Note which positions and depths produce failures.
  5. Run a source-packet synthesis task: a question that requires combining facts from Needles A and C to produce a summary or plan. If testing Gemini 3.1 Pro Preview, note its documented 1M-token context and 64K output token ceiling, and make sure your synthesis prompt doesn't hit the output limit. Record whether the model correctly synthesizes both facts while maintaining reasoning quality.

Verification: You have completed this exercise when:

- A 3×3 retrieval accuracy grid is filled for ≥1 model
- The retrieval effective limit (depth where accuracy first drops below 90%) is estimated
- The source-packet synthesis result is recorded, noting any reasoning degradation at 1M-token depth
- You have explicitly checked Gemini's 64K output ceiling if using it for large-scale summarization

Estimated time: 25 minutes

Knowledge check · 1 of 1
A legal tech team is building a contract analysis tool. Input documents are 50–300-page contracts (~40K–250K tokens). The primary task is: 'Find all clauses related to liability and summarize the company's maximum exposure across all clauses.' This requires multi-hop synthesis across 3–8 scattered clauses. Given the analysis in this chapter, which approach is most likely to give the best results for documents in the 400K–700K token range?

What's next

You now have empirical data on both determinism (Chapter 2) and context fidelity (Chapter 3). Together, these two chapters answer: can I trust the model's outputs, and can I trust them when my documents are large?

The final question is: what does reliable output actually cost? In Chapter 4, you'll build a cost-per-task model that accounts for retry rates, context caching, and tool-call overhead — and discover why the cheapest model on the pricing page is often not the cheapest model in your bill.


References cited

[1]: Liu, N. F. et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the Association for Computational Linguistics, 12. https://arxiv.org/abs/2307.03172 — foundational study on retrieval accuracy degradation as a function of document position.

[2]: Anthropic. Claude Opus 4.7 model card and release notes. https://www.anthropic.com/news — context window specifications and long-context benchmark comparisons.

[3]: Google DeepMind. "Gemini 3.1 Pro release and changelog." https://ai.google.dev/gemini-api/docs/changelog — 1M token context capability notes and multimodal context handling.

[4]: OpenAI. "GPT-5.5 release notes." https://platform.openai.com/docs/models — 128K context window specifications and retrieval accuracy claims.

[5]: Hsieh, C.-Y. et al. (2024). "RULER: What's the Real Context Size of Your Long-Context Language Models?" https://arxiv.org/abs/2404.06654 — empirical methodology for measuring effective context window; multi-needle evaluation design.

[6]: Bai, Y. et al. (2024). "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks." https://arxiv.org/abs/2412.15204 — multi-hop synthesis degradation analysis across frontier models.

[7]: Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020. https://arxiv.org/abs/2005.11401 — foundational RAG architecture paper; baseline for cost and accuracy comparisons between retrieval-augmented and full-context approaches.

[8]: Kamradt, G. (2023). "Needle In A Haystack — Pressure Testing LLMs." https://github.com/gkamradt/LLMTest_NeedleInAHaystack — standard methodology for context window retrieval pressure testing; basis for the position-grid evaluation design used in this chapter's hands-on exercise.

  • Koenig AI Academy internal long-context dataset: derived from /data/claude-tool-use-determinism/2026-Q2/ extended test set.
Chapter 4 · 50 min

Cost-per-task — pricing vs. actual bill on real workloads


> Prerequisites: Chapter 1 required; Chapters 2 and 3 recommended for the best practical grounding. You should have token counts from at least one benchmark run.
>
> Time: 50 minutes
>
> Learning objectives: By the end of this chapter, you can calculate a defensible cost-per-task number for your workload, account for retries and caching, and know when the "cheaper" model is actually more expensive.

Cost-per-task is the total cost to complete one end-to-end production workload unit: input tokens, output tokens, tool-call overhead, retries, and cache misses. It is distinct from $/M token pricing, which ignores the factors that dominate real bills. As of Q2 2026, Gemini 3.1 Pro is cheapest per token and GPT-5.5 the most expensive, with Opus 4.7 in between, but the cost-per-task gap can narrow sharply and, on long pipelines with hard inputs, the ordering can invert. This chapter shows why.

Key facts

  • List pricing (Q2 2026): Opus 4.7 $5/$25/M in/out, cache read $0.50/M (90% discount) [1]; GPT-5.5 $5/$30/M, cached input $0.50/M (90% discount) [2]; Gemini 3.1 Pro $2/$12/M, context caching $0.20/M (90% discount) [3].
  • On a simple prompt with no retries, Gemini 3.1 Pro is 2.5× cheaper than Opus 4.7 — the number that appears in comparison articles.
  • Gemini 3.1 Pro's average determinism is 81.9% versus Opus 4.7's 91.4%. On a 5-step pipeline that gap means a 1.7× difference in pipeline success rate (0.819⁵ ≈ 37% vs. 0.914⁵ ≈ 64%), with each failed run requiring a full retry.
  • The biggest hidden cost is prompt caching misses: at Opus pricing, a 10K-token system prompt billed at full price on 10,000 calls/day costs $500/day; with caching, $50/day.
  • Tool-definition tokens are billed as input on every call: 10 tool definitions (~600 tokens) add $3 per 1,000 calls at Opus pricing (600 × 1,000 × $5/M).
- List pricing: Opus 4.7 $5/$25/M in/out; GPT-5.5 $5/$30/M; Gemini 3.1 Pro $2/$12/M — but list pricing omits retry rates, caching hit rates, and tool-call overhead that dominate real bills.
- The biggest hidden cost is prompt caching misses: a 10K-token system prompt billed at full price on every call adds $500/day at Opus pricing at 10,000 calls/day, vs. $50/day with caching.
- Tool definition tokens are billed as input tokens on every call; a system with 10 tool definitions (~600 tokens) adds $3 per 1,000 calls at Opus pricing.
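The caching and tool-overhead figures are straightforward token arithmetic, and it is worth rerunning them with your own volumes. The sketch below uses the list prices quoted in this chapter ($5/M input and $0.50/M cache read for Opus 4.7). `token_cost` is a hypothetical helper, and the calculation deliberately ignores any cache-write surcharge or cache expiry, so treat the cached figure as a best case.

```python
def token_cost(tokens_per_call: int, calls: int, price_per_million: float) -> float:
    """Dollar cost of sending tokens_per_call tokens on each of `calls` calls."""
    return tokens_per_call * calls * price_per_million / 1_000_000


OPUS_INPUT = 5.00       # $/M input tokens (list price quoted in this chapter)
OPUS_CACHE_READ = 0.50  # $/M cached input tokens

# 10K-token system prompt, 10,000 calls/day
uncached_per_day = token_cost(10_000, 10_000, OPUS_INPUT)
cached_per_day = token_cost(10_000, 10_000, OPUS_CACHE_READ)  # assumes every call is a cache hit

# ~600 tokens of tool definitions, per 1,000 calls
tool_overhead_per_1k_calls = token_cost(600, 1_000, OPUS_INPUT)
```

Swapping in your real system-prompt size and call volume shows quickly whether caching or tool-definition trimming is the bigger lever for your workload.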

Why pricing pages are misleading

The standard comparison table omits retry rate, caching hit rate, tool-call overhead, output amplification, and context efficiency losses. A real cost model:

cost_per_task = (
    prompt_tokens_uncached × input_price
  + prompt_tokens_cached × cache_price
  + output_tokens × output_price
  + tool_tokens × input_price
) × (1 / determinism_rate)^pipeline_steps

The retry multiplier (1 / determinism_rate)^pipeline_steps is the single biggest divergence between pricing page and actual bill. [5]
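The formula translates directly into a function. The sketch below is one reading of it, assuming the token counts are totals for a full pipeline run and prices are in dollars per million tokens; the source does not state whether counts are per step or per run. The sample inputs (three steps of 2,200 prompt, 800 tool-definition, and 400 output tokens at $5/$25 per million, with 94% per-step determinism) are illustrative.

```python
def cost_per_task(
    prompt_tokens_uncached: int,
    prompt_tokens_cached: int,
    output_tokens: int,
    tool_tokens: int,
    input_price: float,
    cache_price: float,
    output_price: float,
    determinism_rate: float,
    pipeline_steps: int,
) -> float:
    """Expected dollar cost per successful task, including full-pipeline retries."""
    per_run = (
        prompt_tokens_uncached * input_price
        + prompt_tokens_cached * cache_price
        + output_tokens * output_price
        + tool_tokens * input_price
    ) / 1_000_000
    # Every step must succeed, so expected runs = 1 / determinism^steps
    retry_multiplier = (1 / determinism_rate) ** pipeline_steps
    return per_run * retry_multiplier


# Illustrative 3-step pipeline, no caching (token counts summed over the 3 steps)
example = cost_per_task(
    prompt_tokens_uncached=3 * 2_200,
    prompt_tokens_cached=0,
    output_tokens=3 * 400,
    tool_tokens=3 * 800,
    input_price=5.0,
    cache_price=0.5,
    output_price=25.0,
    determinism_rate=0.94,
    pipeline_steps=3,
)
```

With these inputs the per-run cost is $0.075 and the retry multiplier is about 1.20, giving roughly $0.090 per successful task. Note the model assumes a failed run is retried from the start; step-level retries would cost less.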

- A real cost model accounts for retry rate, caching hit rate, tool-call token overhead, output amplification across pipeline steps, and context window efficiency — none of which appear on pricing pages.
- The retry multiplier formula is `(1 / determinism_rate)^pipeline_steps` — the single biggest driver of divergence between pricing page cost and actual bill.
- Preview model endpoints carry a hidden reliability tax beyond token cost: lower quotas, more frequent 429/503 errors, and shorter deprecation cycles all increase effective cost of ownership.

The retry multiplier in practice

Representative 3-step pipeline: 2,000-token system prompt, 200-token message, 800-token tool definitions, 400-token output per step.

Without caching, no retries:

| Model | Per-step input cost | Per-step output cost | 3-step total |
|---|---|---|---|
| Opus 4.7 | 3,000 × $5/M = $0.015 | 400 × $25/M = $0.010 | $0.075 |
| GPT-5.5 | 3,000 × $5/M = $0.015 | 400 × $30/M = $0.012 | $0.081 |
| Gemini 3.1 Pro | 3,000 × $2/M = $0.006 | 400 × $12/M = $0.0048 | $0.032 |

Gemini is 2.3× cheaper than Opus with no retries; this is the number that appears in comparison articles. GPT-5.5 and Opus are within 8% of each other at list price, so the critical differentiator between them is reliability under retries.

Now apply determinism-driven retries (category-5 complexity, multi-tool sequence — Opus 94%, GPT-5.5 90%, Gemini 84%):

Pipeline success probability: Opus 0.94³ = 83%; GPT-5.5 0.90³ = 73%; Gemini 0.84³ = 59%.

| Model | Per-run cost | Expected runs to success | Cost-per-successful-task |
|---|---|---|---|
| Opus 4.7 | $0.075 | 1.20 | $0.090 |
| GPT-5.5 | $0.081 | 1.37 | $0.111 |
| Gemini 3.1 Pro | $0.032 | 1.69 | $0.054 |

Gemini is still cheapest, but its advantage over Opus has compressed from 2.3× to 1.7×. GPT-5.5's retry overhead pushes it to $0.111, roughly 23% above Opus after retries, despite matching Opus on input price. At higher complexity the gap widens further: a 14-point per-step determinism gap produces a 1.8× difference in expected runs at 3 steps but 7.2× at 10 steps. Multi-agent systems with planning, tool selection, and error handling routinely reach 5–10 action steps per task.
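The retry table can be reproduced with a few lines of Python. This sketch assumes each full pipeline attempt is independent and that a failure at any step reruns the whole pipeline, so expected runs to success is the reciprocal of the pipeline success probability. Prices and determinism values are the ones quoted in this section; the function and variable names are illustrative.

```python
# (input $/M, output $/M, per-step determinism), as quoted in this section.
MODELS = {
    "Opus 4.7": (5.0, 25.0, 0.94),
    "GPT-5.5": (5.0, 30.0, 0.90),
    "Gemini 3.1 Pro": (2.0, 12.0, 0.84),
}
INPUT_TOKENS = 2_000 + 200 + 800  # system prompt + message + tool definitions
OUTPUT_TOKENS = 400
STEPS = 3


def retry_table(models=MODELS, steps=STEPS):
    """Return {model: (per_run_cost, pipeline_success, expected_runs, cost_per_success)}."""
    rows = {}
    for name, (in_price, out_price, determinism) in models.items():
        per_step = (INPUT_TOKENS * in_price + OUTPUT_TOKENS * out_price) / 1_000_000
        per_run = per_step * steps
        success = determinism ** steps   # every step must succeed
        expected_runs = 1.0 / success    # mean of a geometric distribution
        rows[name] = (per_run, success, expected_runs, per_run * expected_runs)
    return rows


rows = retry_table()
for name, (per_run, success, runs, cost) in rows.items():
    print(f"{name:15s} per-run ${per_run:.3f}  success {success:.0%}  "
          f"runs {runs:.2f}  per-success ${cost:.3f}")
```

Computed without intermediate rounding, Gemini comes out at roughly $0.055 rather than $0.054; the difference comes from multiplying rounded table values.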

<Callout type="hot"> The inversion is real. At category-9 complexity (ambiguous-input, multi-tool), Gemini 3.1 Pro crosses above Opus 4.7 in cost-per-task at pipeline length ≥ 5 steps. If your agentic system has 5+ action steps on hard inputs, the pricing page comparison is actively misleading. Run your determinism scores through the retry multiplier before making a cost decision. </Callout>

- At ambiguous-input complexity on a long pipeline, the cost ordering can invert: Gemini's higher retry rate more than offsets its lower per-token price.
- The cost break-even between Gemini 3.1 Pro and Opus 4.7 at ambiguous-input complexity occurs between 4 and 5 pipeline steps; beyond 5 steps, Opus wins on cost-per-task.
- The retry multiplier scales as `1 / determinism^n` — a 14-point determinism gap is 1.8× at 3 steps but grows to a 7.2× difference at 10 steps.
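A short sketch of how the gap compounds and where the cost ordering flips. The per-step determinism values of 78% and 64% are hypothetical placeholders, chosen only because they are 14 points apart; they are not benchmark results from this course, so substitute your own Chapter 2 scores. Per-step costs are the list-price figures from the 3-step example ($0.025 for Opus, $0.0108 for Gemini).

```python
def cost_per_success(per_step_cost: float, determinism: float, steps: int) -> float:
    """Expected cost of one successful run when any failed step reruns the pipeline."""
    return per_step_cost * steps / determinism ** steps


def multiplier_gap(det_high: float, det_low: float, steps: int) -> float:
    """Ratio of the two retry multipliers, (1/det_low)^n over (1/det_high)^n."""
    return (det_high / det_low) ** steps


def break_even_steps(cheap_cost, cheap_det, premium_cost, premium_det, max_steps=50):
    """Smallest step count at which the cheaper model costs more per success, or None."""
    for steps in range(1, max_steps + 1):
        cheap = cost_per_success(cheap_cost, cheap_det, steps)
        premium = cost_per_success(premium_cost, premium_det, steps)
        if cheap > premium:
            return steps
    return None


# ASSUMPTION: hypothetical hard-input determinism values, 14 points apart.
OPUS_DET, GEMINI_DET = 0.78, 0.64
OPUS_STEP_COST, GEMINI_STEP_COST = 0.025, 0.0108

gap_3 = multiplier_gap(OPUS_DET, GEMINI_DET, 3)
gap_10 = multiplier_gap(OPUS_DET, GEMINI_DET, 10)
flip = break_even_steps(GEMINI_STEP_COST, GEMINI_DET, OPUS_STEP_COST, OPUS_DET)
print(f"gap at 3 steps: {gap_3:.1f}x, at 10 steps: {gap_10:.1f}x, ordering flips at {flip} steps")
```

With these placeholder values the gap is about 1.8× at 3 steps and 7.2× at 10, and the cheaper model becomes the more expensive one at 5 steps. The flip point moves quickly as the determinism inputs change, which is why measured scores matter more than list prices here.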

Prompt caching: the underrated cost lever

At 10,000 calls/day with a 10K-token system prompt:

| Model | Without caching | With caching | Daily savings |
|---|---|---|---|
| Opus 4.7 | $500/day | $50/day | $450/day |
| GPT-5.5 | $500/day | $50/day | $450/day |
| Gemini 3.1 Pro | $200/day | $20/day | $180/day |

All three providers discount cached tokens by 90%, so caching preserves the 2.5× input-price ratio between Gemini and both Opus and GPT-5.5.

Caching gotchas: Anthropic's cache TTL is 5 minutes — calls more than 5 minutes apart restart the cache; minimum cacheable prefix is 4,096 tokens. OpenAI caches automatically at a 90% discount with a 128-token minimum. Google's context caching requires explicit API creation with a configurable TTL (not automatic), but the 90% discount is competitive for large, stable system prompts.

- Anthropic caches at 4,096+ token boundaries for current flagship models with a 5-minute TTL and 90% discount on cached tokens; calls more than 5 minutes apart restart the cache.
- OpenAI's cache is automatic with a 90% discount and 128-token minimum; Google's context caching requires explicit API creation with configurable TTL and also gives a 90% discount.
- Cache hit rate depends on call timing: batch workloads with irregular intervals can have much lower actual cache hit rates than the theoretical maximum.
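Because the realized hit rate is rarely 100%, it can help to treat the daily system-prompt bill as a blend of full-price and discounted calls. The sketch below is a simplification under stated assumptions: a flat 90% discount on cache hits, no extra charge for cache writes or cache storage (some providers bill these separately), and a hit rate that you supply from your own logs. The function name is illustrative.

```python
def daily_prompt_cost(
    calls_per_day: int,
    prompt_tokens: int,
    input_price_per_m: float,
    cache_hit_rate: float,
    cache_discount: float = 0.90,
) -> float:
    """Daily USD cost of the system prompt alone, blending cache hits and misses."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    full_price = calls_per_day * prompt_tokens * input_price_per_m / 1_000_000
    discounted = full_price * (1.0 - cache_discount)
    return full_price * (1.0 - cache_hit_rate) + discounted * cache_hit_rate


# 10,000 calls/day with a 10K-token system prompt at $5/M input.
for hit_rate in (0.0, 0.4, 0.8, 1.0):
    cost = daily_prompt_cost(10_000, 10_000, 5.0, hit_rate)
    print(f"hit rate {hit_rate:.0%}: ${cost:,.0f}/day")
```

At a 40% hit rate the same workload costs about $320/day instead of the $50/day best case, which is the gap between theoretical and realized savings for irregular batch traffic.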

The three workload archetypes, costed

Archetype A: Coding agent (multi-step, tool-heavy)

Representative profile: 8,000-token system prompt cached after first call; 3,000 token average input; 800 token output; 5 steps; category 5–7 schemas.

| Model | Per-step determinism (5-step pipeline success) | Cost per successful task (with caching) |
|---|---|---|
| Opus 4.7 | ~86% (0.86⁵ = 47%) | ~$0.42 |
| GPT-5.5 + strict | ~93% (0.93⁵ = 70%) | ~$0.31 |
| Gemini 3.1 Pro | ~79% (0.79⁵ = 31%) | ~$0.28 |

GPT-5.5 with `strict: true` delivers the best pipeline success rate (70%) and costs less per successful task than Opus (~$0.31 vs. ~$0.42). That is better value than pricing pages suggest, because its determinism advantage cuts expected retries by more than its output-price premium adds. Gemini (~$0.28) is marginally cheaper, but at 31% pipeline success it requires robust retry infrastructure. [4]
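The Archetype A figures can be approximated from the stated profile. This sketch rests on a few assumptions that the profile does not spell out: the 8,000-token system prompt is a cache hit on every step at 10% of the input price, the 3,000 input tokens are billed per step on top of the system prompt, and a failed step reruns the whole 5-step pipeline. Names are illustrative.

```python
PROFILE = {"system_prompt": 8_000, "input_tokens": 3_000, "output_tokens": 800, "steps": 5}
CACHE_DISCOUNT = 0.90  # assumed flat discount on cached input tokens

# Prices in USD per million tokens and per-step determinism, as quoted in this section.
MODELS = {
    "Opus 4.7": {"input": 5.0, "output": 25.0, "determinism": 0.86},
    "GPT-5.5 + strict": {"input": 5.0, "output": 30.0, "determinism": 0.93},
    "Gemini 3.1 Pro": {"input": 2.0, "output": 12.0, "determinism": 0.79},
}


def archetype_a_cost(model: dict, profile: dict = PROFILE) -> float:
    """Expected USD cost per successful coding-agent task."""
    cached_price = model["input"] * (1.0 - CACHE_DISCOUNT)
    per_step = (
        profile["system_prompt"] * cached_price
        + profile["input_tokens"] * model["input"]
        + profile["output_tokens"] * model["output"]
    ) / 1_000_000
    per_run = per_step * profile["steps"]
    pipeline_success = model["determinism"] ** profile["steps"]
    return per_run / pipeline_success


costs = {name: archetype_a_cost(model) for name, model in MODELS.items()}
for name, cost in costs.items():
    print(f"{name:17s} ${cost:.3f} per successful task")
```

Under these assumptions the results land within about a cent of the table, which suggests the table was built the same way; treat that as an inference rather than a documented method.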

Archetype B: Document Q&A (long-context, single query)

Representative profile: 80K-token document; 500-token system prompt; 600-token output; 1 step.

| Model | Cost per call | Notes |
|---|---|---|
| Opus 4.7 | $0.415 | 80K × $5/M + 600 × $25/M |
| GPT-5.5 | $0.418 | 80K × $5/M + 600 × $30/M |
| Gemini 3.1 Pro | $0.167 | 80K × $2/M + 600 × $12/M |

With no pipeline and no retries, Gemini 3.1 Pro wins: it is 2.5× cheaper than either Opus or GPT-5.5, which are nearly cost-equivalent here. A single-step task has no retry compounding, so the per-token price gap carries straight through to cost-per-task for retrieval-focused workloads.

Archetype C: High-volume classification (batch, 10M items/month)

Representative profile: 300 tokens per item; 1,000-token system prompt; 50 tokens output; 1 step. The figures below bill the system prompt at the full input price (no cache discount applied).

| Model | Monthly cost (no retries) | With 5% retry rate |
|---|---|---|
| Opus 4.7 | ~$77K/month | ~$81K |
| GPT-5.5 | ~$80K/month | ~$84K |
| Gemini 3.1 Pro | ~$32K/month | ~$34K |

Gemini 3.1 Pro wins — saving $45K/month vs. Opus. The simple flat schema (category 1–2) keeps Gemini's determinism at 96–100%, eliminating the reliability advantage of more expensive models. Multi-model routing strategies — cheap model for easy tasks, premium model for complex — can reduce cost-per-task by 30–60%. [4][6]
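A minimal sketch of the Archetype C arithmetic, plus a blended cost under routing. The monthly figures in the table correspond to billing the 1,000-token system prompt at the full input price on every item, so the sketch does the same. The 5% retry rate is applied as a flat 1.05 multiplier. The 70% routing share is a hypothetical illustration, not a measured split, and it ignores the cost of the router itself.

```python
ITEMS_PER_MONTH = 10_000_000
ITEM_TOKENS, SYSTEM_TOKENS, OUTPUT_TOKENS = 300, 1_000, 50

# (input $/M, output $/M), as quoted in this section.
PRICES = {
    "Opus 4.7": (5.0, 25.0),
    "GPT-5.5": (5.0, 30.0),
    "Gemini 3.1 Pro": (2.0, 12.0),
}


def monthly_cost(model: str, retry_rate: float = 0.0) -> float:
    """Monthly USD cost with the system prompt billed at full input price."""
    in_price, out_price = PRICES[model]
    per_item = (
        (ITEM_TOKENS + SYSTEM_TOKENS) * in_price + OUTPUT_TOKENS * out_price
    ) / 1_000_000
    return per_item * ITEMS_PER_MONTH * (1.0 + retry_rate)


def routed_cost(cheap: str, premium: str, cheap_share: float) -> float:
    """Blended monthly cost when cheap_share of items go to the cheaper model."""
    return cheap_share * monthly_cost(cheap) + (1.0 - cheap_share) * monthly_cost(premium)


for model in PRICES:
    print(f"{model:15s} ${monthly_cost(model):,.0f}  with retries ${monthly_cost(model, 0.05):,.0f}")

# ASSUMPTION: 70% of items are easy enough for the cheaper model.
blended = routed_cost("Gemini 3.1 Pro", "Opus 4.7", cheap_share=0.70)
saving = 1.0 - blended / monthly_cost("Opus 4.7")
print(f"routed: ${blended:,.0f}/month ({saving:.0%} below all-Opus)")
```

With that hypothetical 70/30 split the blended bill is about 41% below an all-Opus deployment, inside the 30–60% range cited above.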

Hands-on exercise

Build a cost-per-task model for your use case using your Chapter 2 benchmark data.

Fill in these numbers from actual benchmark runs (not guesses):

```
USE CASE: [describe in 1 sentence]

TOKEN COUNTS:
  System prompt tokens: ___
  Average user message tokens: ___
  Tool definition tokens: ___
  Average output tokens: ___
  Pipeline steps: ___

CACHING:
  Is system prompt at or above your provider's minimum cacheable size? [Y/N]
  Estimated cache hit rate: ___ %
    (Anthropic: use 80% if calls within 5-min windows; 40% if irregular)

DETERMINISM SCORES (from Chapter 2):
  Opus 4.7: ___ %
  GPT-5.5: ___ %
  Gemini 3.1 Pro: ___ %

COST FORMULA (per model):
  input_cost = (system_prompt × (1 - cache_hit_rate) × INPUT_PRICE)
             + (system_prompt × cache_hit_rate × CACHE_PRICE)
             + (message_tokens + tool_tokens) × INPUT_PRICE
  output_cost = output_tokens × OUTPUT_PRICE
  retry_multiplier = 1 / (determinism ^ pipeline_steps)
  cost_per_task = (input_cost + output_cost) × retry_multiplier × pipeline_steps

RESULTS:
  Opus 4.7 cost-per-task: $___
  GPT-5.5 cost-per-task: $___
  Gemini 3.1 Pro cost-per-task: $___

RECOMMENDATION: [which model and why, in 1 sentence]
```

Your cost model is complete when all token counts are from actual benchmark runs, cache hit rate reflects your actual call pattern, and cost-per-task accounts for retries using your measured determinism scores. Estimated time: 20 minutes.
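If you would rather compute the worksheet than fill it in by hand, the cost formula from the template translates directly into Python. This is a sketch of that formula only: prices are in USD per million tokens, token counts are per step, and every number in the example call is a placeholder to be replaced with your own measurements.

```python
def worksheet_cost_per_task(
    system_prompt: int,
    message_tokens: int,
    tool_tokens: int,
    output_tokens: int,
    pipeline_steps: int,
    cache_hit_rate: float,
    determinism: float,
    input_price: float,
    cache_price: float,
    output_price: float,
) -> float:
    """Cost per successful task in USD, following the worksheet's cost formula."""
    input_cost = (
        system_prompt * (1.0 - cache_hit_rate) * input_price
        + system_prompt * cache_hit_rate * cache_price
        + (message_tokens + tool_tokens) * input_price
    ) / 1_000_000
    output_cost = output_tokens * output_price / 1_000_000
    retry_multiplier = 1.0 / (determinism ** pipeline_steps)
    return (input_cost + output_cost) * retry_multiplier * pipeline_steps


# Placeholder inputs only; replace with your own benchmark measurements.
no_cache = worksheet_cost_per_task(2_000, 200, 800, 400, 3, 0.0, 0.94, 5.0, 0.5, 25.0)
with_cache = worksheet_cost_per_task(2_000, 200, 800, 400, 3, 0.8, 0.94, 5.0, 0.5, 25.0)
print(f"no caching: ${no_cache:.4f}  80% cache hits: ${with_cache:.4f}")
```

Running the function once per model with that model's prices and measured determinism fills in the RESULTS section of the worksheet.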


What's next

You have a scorecard (ch01), determinism scores (ch02), context fidelity data (ch03), and a cost-per-task model (ch04). The capstone project synthesizes all four into a model selection memo; the memo format is in vault/courses/picking-a-frontier-model-2026-q2/outline.md.


References cited

[1]: Anthropic. "Claude pricing." https://www.anthropic.com/pricing — Opus 4.7 input/output/cache pricing as of Q2 2026. Also: "Prompt caching." https://www.anthropic.com/news.

[2]: OpenAI. "OpenAI API pricing." https://developers.openai.com/api/docs/pricing — GPT-5.5 $5/$30/M input/output; cached input $0.50/M (90% discount). Verified 2026-06-14. Model release notes: https://developers.openai.com/api/docs/models.

[3]: Google. "Gemini API pricing." https://ai.google.dev/pricing — Gemini 3.1 Pro input/output/context caching pricing as of Q2 2026. Changelog: https://ai.google.dev/gemini-api/docs/changelog.

[4]: Koenig AI Academy internal cost model data, Q2 2026. Derived from the Q2 2026 tool-use determinism benchmark dataset (reference tables embedded in Chapter 2) with retry simulation applied at workload scale.

[5]: Patil, S. et al. Berkeley Function-Calling Leaderboard (BFCL) V4. https://gorilla.cs.berkeley.edu/leaderboard.html — analysis of tool-call reliability impact on pipeline cost.

[6]: Chen, L. et al. (2023). "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance." https://arxiv.org/abs/2305.05176 — analysis of model routing, cascading, and selection strategies that reduce cost-per-task by matching task complexity to model capability.