What does a 90% cache hit rate actually save on the Claude API?

At 90% hit rate on Sonnet 4.6, the blended input cost is $0.645/MTok — a 78.5% reduction from the $3.00 standard price. The calculation: 90% of tokens at $0.30/MTok (cache read) + 10% at $3.75/MTok (cache write, 5-min TTL) = $0.27 + $0.375 = $0.645. Reaching the full 90% per-token savings requires every token to be a cache read, which is impossible — all caches require at least one write per TTL window. [Source: Anthropic pricing docs, retrieved 2026-05-31]

What is the break-even cache hit rate for Claude prompt caching?

On 5-minute TTL (the March 2026 default): break-even is approximately 22% cache hit rate. On 1-hour TTL: break-even is approximately 53%. Below these thresholds, prompt caching is more expensive than standard input tokens. Formula: break-even = (cache_write_price − standard_price) / (cache_write_price − cache_read_price). For 5-min TTL on Sonnet 4.6: ($3.75 − $3.00) / ($3.75 − $0.30) = 21.7%. [Source: Anthropic pricing docs, Koenig Academy derived calculation]

Why did my Claude caching bill increase after March 2026?

Anthropic silently changed the default prompt cache TTL from 1 hour to 5 minutes around March 6, 2026 — no changelog, no announcement. A developer's analysis of 119,866 API calls documented $949.08 in excess cost (17.1% waste) directly attributable to this change. The fix is one line: explicitly set 'ttl': '1h' in your cache_control object rather than relying on the default. [Source: github.com/anthropics/claude-code/issues/46829, retrieved 2026-05-31]

What cache hit rate does Anthropic target internally?

Anthropic's engineering team targets 90%+ cache hit rate and treats anything below this threshold as a SEV-class incident. Most developers see 7–15% on first implementation. The gap is structural: dynamic content (timestamps, user IDs, session tokens) inside the cached prefix invalidates every request. Moving that content below the cache_control breakpoint routinely raises hit rates from 7% to 74%+ in a single deploy. [Source: TowardsAI / Anthropic engineering blog, retrieved 2026-05-31]

Claude Prompt Caching: What a 90% Cache Hit Rate Actually Saves in 2026

A 90% cache hit rate on Claude Sonnet 4.6 produces a net effective input cost of $0.645/MTok, a 78.5% reduction from the $3.00 standard price, not 90%. The gap comes from the 1.25× write premium on cache misses: the 10% of your tokens that miss are billed at $3.75/MTok. Below a 22% hit rate on 5-minute TTL (or 53% on 1-hour TTL), prompt caching costs you money rather than saving it. Here is the complete savings formula and the cache hit rate thresholds that determine which side of the break-even line your workload sits on.

Here's what most teams miss when they read "prompt caching saves 90%": that number describes the per-token savings on a single cache read ($0.30 vs $3.00), not the savings on your overall bill. Anthropic's own engineering team treats anything below 90% hit rate as a SEV-class incident — yet the average developer sees 7–15% on first implementation. The gap is not a model limitation; it's two confounded problems: a math misconception about what 90% hit rate actually saves, and a structural problem that prevents hit rates from clearing 60% in the first place. This post solves the first problem with exact formulas. For the structural fixes — validated across four production teams — see our Claude prompt caching ROI benchmarks.

The Exact Savings Formula

Four inputs determine your effective caching cost per million input tokens:

H = cache hit rate (the fraction of input tokens that are served as cache reads)
W = cache write price per MTok ($3.75 for 5-min TTL; $6.00 for 1-hour TTL on Sonnet 4.6)
R = cache read price per MTok ($0.30 on Sonnet 4.6)
S = standard input price per MTok ($3.00 on Sonnet 4.6)

Effective cost = H × R + (1 − H) × W
Net savings %  = (S − effective_cost) / S × 100

At 90% hit rate on 5-minute TTL:

= 0.90 × $0.30 + 0.10 × $3.75
= $0.27 + $0.375
= $0.645/MTok          → 78.5% savings

At 90% hit rate on 1-hour TTL:

= 0.90 × $0.30 + 0.10 × $6.00
= $0.27 + $0.60
= $0.87/MTok           → 71% savings

The insight from the 1-hour TTL math: because of its higher write price ($6.00 vs $3.75), 1-hour TTL is never the right choice for a low-hit-rate workload. At a 90% hit rate, though, the extra write cost adds only 0.10 × ($6.00 − $3.75) = $0.225/MTok to the blended price ($0.87 vs $0.645), so 1-hour TTL costs only marginally more than 5-minute TTL at high hit rates. The real question is whether your traffic pattern can sustain 90%, which depends on whether the gaps between requests are long enough for the cache to expire.
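The formula above can be sketched in a few lines of Python. This is a minimal illustration that hardcodes the Sonnet 4.6 prices quoted in this post; the function names `effective_cost` and `net_savings` are illustrative, not part of any SDK.

```python
# Sonnet 4.6 prices quoted in this post, in USD per million input tokens.
STANDARD = 3.00
CACHE_READ = 0.30
CACHE_WRITE = {"5m": 3.75, "1h": 6.00}


def effective_cost(hit_rate: float, ttl: str = "5m") -> float:
    """Blended input cost per MTok: H * R + (1 - H) * W."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * CACHE_READ + (1.0 - hit_rate) * CACHE_WRITE[ttl]


def net_savings(hit_rate: float, ttl: str = "5m") -> float:
    """Fractional savings versus the standard price (negative means it costs more)."""
    return (STANDARD - effective_cost(hit_rate, ttl)) / STANDARD


for ttl in ("5m", "1h"):
    cost = effective_cost(0.90, ttl)
    print(f"{ttl}: ${cost:.3f}/MTok, {net_savings(0.90, ttl):.1%} savings")
```

Swapping in another model's prices only requires changing the three constants at the top.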

Break-Even by Hit Rate and TTL Tier (Original Data)

This table models Sonnet 4.6 pricing with cache hit rates from 0% to 95%. The standard non-cached input price is $3.00/MTok — rows above the break-even lines cost more than standard; rows below save money.

Cache Hit Rate	Effective Cost (5-min TTL)	Net vs Standard	Effective Cost (1-hr TTL)	Net vs Standard
0%	$3.75	−25% (worse)	$6.00	−100% (worse)
10%	$3.405	−13.5% (worse)	$5.43	−81% (worse)
22%	≈$3.00	≈ break-even	$4.746	−58% (worse)
30%	$2.715	+9.5%	$4.29	−43% (worse)
50%	$2.025	+32.5%	$3.15	−5% (worse)
53%	$1.9215	+36%	≈$3.00	≈ break-even
70%	$1.335	+55.5%	$2.01	+33%
90%	$0.645	+78.5%	$0.87	+71%
95%	$0.4725	+84.25%	$0.585	+80.5%

Key observations:

On 5-minute TTL, break-even is at ~22% hit rate — a low bar. Even mediocre implementations save money.
On 1-hour TTL, break-even is at ~53% hit rate — a meaningful engineering constraint.
At 90% hit rate, the absolute savings are 78.5% (5-min) or 71% (1-hour). Both are far below the "90% off" figure that vendors advertise.
The only way to reach 90% net savings is a 100% hit rate — never achievable, because every cache requires at least one write per TTL window.
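The break-even thresholds follow from setting the blended cost equal to the standard price and solving for H. The sketch below reproduces them and prints the table rows from the formula, assuming the Sonnet 4.6 prices quoted in this post; `break_even_hit_rate` and `blended_cost` are illustrative helper names.

```python
STANDARD = 3.00      # USD per MTok, standard input
CACHE_READ = 0.30    # USD per MTok, cache read
CACHE_WRITE = {"5-min": 3.75, "1-hour": 6.00}  # USD per MTok, cache write


def break_even_hit_rate(write_price, read_price=CACHE_READ, standard_price=STANDARD):
    """Solve H * R + (1 - H) * W = S for H."""
    return (write_price - standard_price) / (write_price - read_price)


def blended_cost(hit_rate, write_price, read_price=CACHE_READ):
    """Effective input cost per MTok at a given hit rate."""
    return hit_rate * read_price + (1 - hit_rate) * write_price


for name, write_price in CACHE_WRITE.items():
    print(f"{name} TTL break-even: {break_even_hit_rate(write_price):.1%}")

# One tab-separated row per hit rate: cost and net savings for each TTL tier.
for pct in (0, 10, 22, 30, 50, 53, 70, 90, 95):
    row = [f"{pct}%"]
    for write_price in CACHE_WRITE.values():
        cost = blended_cost(pct / 100, write_price)
        row.append(f"${cost:.4f}")
        row.append(f"{(STANDARD - cost) / STANDARD:+.1%}")
    print("\t".join(row))
```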

At scale, the 78.5% figure has enormous dollar impact. AI Magicx benchmarked a team on Sonnet with a standard input bill of $15,000/day, which at $3.00/MTok corresponds to 5,000 MTok (5 billion input tokens) per day. Their claim: at 80% hit rate, shifting to cache reads takes the bill from $15,000/day to ~$3,500/day, which is $11,500/day in savings, or roughly $4M annualized. The math says otherwise: $3.00 × 5,000 MTok = $15,000 standard; at 80% hit rate (5-min TTL): (0.80 × $0.30 + 0.20 × $3.75) × 5,000 = ($0.24 + $0.75) × 5,000 = $0.99 × 5,000 = $4,950/day. That's $10,050/day in savings: not $11,500, but still about $3.7M annualized.
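The same dollar arithmetic can be checked in code. This is a rough sketch, assuming the 5,000 MTok/day volume implied by the $15,000/day baseline and the Sonnet 4.6 prices used in this post; `daily_input_cost` is an illustrative helper, not an SDK function.

```python
STANDARD = 3.00          # USD per MTok, standard input
CACHE_READ = 0.30        # USD per MTok, cache read
CACHE_WRITE_5M = 3.75    # USD per MTok, cache write on 5-minute TTL


def daily_input_cost(mtok_per_day: float, hit_rate: float,
                     write_price: float = CACHE_WRITE_5M,
                     read_price: float = CACHE_READ) -> float:
    """Daily input spend when every input token is either a cache read or a cache write."""
    blended = hit_rate * read_price + (1 - hit_rate) * write_price
    return blended * mtok_per_day


volume = 5_000  # MTok/day (assumed volume, see lead-in)
baseline = STANDARD * volume
cached = daily_input_cost(volume, hit_rate=0.80)
saved = baseline - cached

print(f"Standard: ${baseline:,.0f}/day")
print(f"Cached:   ${cached:,.0f}/day")
print(f"Saved:    ${saved:,.0f}/day (${saved * 365:,.0f}/year)")
```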

What Collapses Hit Rates Below Break-Even

ProjectDiscovery's security audit platform started at 7% cache hit rate after enabling cache_control. The five structural failures behind near-zero hit rates:

1. Timestamps or request IDs in the cached prefix. "Current time: 2026-06-05T14:32:15Z" inside your system prompt invalidates the cache on every single request — 0% hit rate is the mathematical result. Truncate to the day ("Date: 2026-06-05") or remove it entirely.

2. Per-user content above the breakpoint. User IDs, session tokens, personalization metadata — anything that differs between users must live below the cache_control marker. Iron Mind's agentic platform caches the system prompt + tools at the deepest layer; the user-specific conversation tail is left uncached.

3. Wrong breakpoint placement. The cache_control marker caches everything up to and including that block. Placing the marker after a dynamic block means you write a unique cache entry for every request — functionally the same as no caching at all, at 1.25× the write cost.

4. Sub-minimum prefix length. Anthropic's API documentation sets the minimum cacheable prefix at 1,024 tokens (Sonnet/Opus) and 2,048 tokens (Haiku). Below these thresholds the API silently ignores cache_control: no error, nothing counted under cache_creation_input_tokens in the response, and no savings. To verify, check that cache_read_input_tokens in the usage object is non-zero after the second identical call.

5. Traffic gaps exceeding the TTL. The most expensive pattern after the March 2026 regression: workloads with 10–15 minute gaps between requests. Each gap allows the cache to expire; the next request pays the write surcharge again. On 5-minute TTL, a workload whose requests arrive one at a time every 8 minutes has an effective hit rate of ~0%.

ProjectDiscovery's fix for the first two failures, the "relocation trick" of moving dynamic content below the breakpoint, raised their hit rate from 7% to 74% without any traffic pattern change. Result: 59% overall cost reduction.
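A sketch of what "moving dynamic content below the breakpoint" can look like, following the request shape used in the runnable example later in this post. The prompt text, the `build_request` helper, and the field layout of the user message are hypothetical; the point is only that the cached system block stays byte-identical across users and timestamps.

```python
import datetime as dt

# Hypothetical stable instructions. In practice this must clear the minimum
# cacheable prefix length and contain nothing that varies per request.
STABLE_SYSTEM_PROMPT = "You are a security audit assistant. Follow the review checklist."


def build_request(user_id: str, question: str, now: dt.datetime) -> dict:
    """Build the system/messages portion of a request.

    The cached prefix (system block) holds only stable text. Per-request data
    (date, user ID, question) goes in the user message, after the breakpoint.
    """
    return {
        "system": [
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # breakpoint ends here
            }
        ],
        "messages": [
            {
                "role": "user",
                # Date truncated to the day; the full timestamp never enters the prefix.
                "content": f"Date: {now:%Y-%m-%d}\nUser: {user_id}\n\n{question}",
            }
        ],
    }


a = build_request("user-1", "Audit this config.", dt.datetime(2026, 6, 5, 14, 32, 15))
b = build_request("user-2", "Audit this script.", dt.datetime(2026, 6, 5, 18, 1, 0))
print(a["system"] == b["system"])  # identical prefix means both requests can share one cache entry
```

The dict can be unpacked into a messages call alongside the model and max_tokens arguments.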

The March 2026 TTL Regression: $949 in Documented Excess Cost

The most consequential change to Claude prompt caching economics in 2026 was silent. According to a developer report filed against the claude-code repository (issue 46829), around March 6, 2026 the default prompt cache TTL changed from 1 hour to 5 minutes with no announcement, no release note, and no changelog entry.

A developer analyzed 119,866 API calls spanning January–April 2026:

Month	API Calls	Actual Cost	Cost at 1h TTL	Excess Cost	% Waste
Jan 2026	2,639	$78.99	$37.54	$41.45	52.5%
Feb 2026	27,220	$1,120.43	$1,108.11	$12.32	1.1%
Mar 2026	68,264	$2,776.11	$2,057.01	$719.09	25.9%
Apr 2026	21,743	$1,193.01	$1,016.78	$176.23	14.8%
Total	119,866	$5,561.17	$4,612.09	$949.08	17.1%

March 6 is the day 5-minute cache tokens reappeared after 33 days of clean 1-hour-only behavior. By March 8, 5-minute tokens outnumbered 1-hour tokens 5:1. The Hacker News thread that surfaced the issue drew 200+ comments, with many teams self-identifying as having missed the TTL type in their usage breakdowns.

Why the regression hurt bursty workloads specifically: a team with requests arriving every 8 minutes had close to 100% cache hit rates on 1-hour TTL (after the first write, the cache stays warm the entire workday). After the regression to 5-minute TTL, every request became a write. Using the break-even table: at 0% hit rate on 5-min TTL, effective cost is $3.75/MTok, 25% more expensive than standard. For illustration, a team that went from saving 30% to paying 25% more sees a swing of 55 percentage points on its input costs; the every-8-minutes team, which had been close to the 90% ceiling, fares worse still.
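The effect of request spacing on hit rate can be simulated with a small model. This is a simplified sketch, assuming a single shared prefix and that every request, read or write, restarts the TTL window; `simulate_hit_rate` is an illustrative helper, and the 8-hour workday is an assumed schedule.

```python
def simulate_hit_rate(arrival_minutes, ttl_minutes):
    """Fraction of requests that find a warm cache.

    Simplifying assumptions: one shared prefix, and each request (hit or miss)
    restarts the TTL window from its own arrival time.
    """
    hits = 0
    expires_at = None  # no cache entry exists before the first request
    for t in arrival_minutes:
        if expires_at is not None and t < expires_at:
            hits += 1
        expires_at = t + ttl_minutes  # a write creates the entry, a read refreshes it
    return hits / len(arrival_minutes) if arrival_minutes else 0.0


# One request every 8 minutes across an 8-hour workday (60 requests).
arrivals = list(range(0, 480, 8))

print(f"5-minute TTL: {simulate_hit_rate(arrivals, 5):.1%}")
print(f"1-hour TTL:   {simulate_hit_rate(arrivals, 60):.1%}")
```

Under these assumptions the 5-minute tier never hits, while the 1-hour tier misses only on the first request of the day.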

The one-line fix: explicitly set "ttl": "1h" in your cache_control object. This was always a valid parameter; few teams used it because the default appeared to be 1 hour. See the working example in our Claude Code production guide for the full multi-block caching pattern.

Runnable Example: Verify Your Cache Hit Rate

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """
You are a senior code reviewer specializing in Python. Your guidelines:
[... at least 1,024 tokens of stable instructions ...]
"""  # must clear 1,024 tokens for Sonnet — verify length before deploy

def review_code(user_code: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral", "ttl": "1h"}  # explicit 1h TTL
            }
        ],
        messages=[{"role": "user", "content": user_code}]
    )

    usage = response.usage
    # Cache fields may be None on some responses, so coerce to 0 before summing.
    cache_read = usage.cache_read_input_tokens or 0
    cache_write = usage.cache_creation_input_tokens or 0
    total_input = usage.input_tokens + cache_read + cache_write
    if total_input == 0:
        return response.content[0].text  # nothing to report, avoid dividing by zero
    hit_rate = cache_read / total_input

    # Blended effective cost for this request (Sonnet 4.6 prices, 1-hour write price)
    effective_cost_per_mtok = (
        cache_read * 0.30 +
        cache_write * 6.00 +
        usage.input_tokens * 3.00
    ) / total_input

    print(f"Cache hit rate:      {hit_rate:.1%}")
    print(f"Effective cost:      ${effective_cost_per_mtok:.2f}/MTok")
    print(f"Standard cost:       $3.00/MTok")
    print(f"Net savings:         {(3.00 - effective_cost_per_mtok) / 3.00:.1%}")

    return response.content[0].text

review_code("def add(a, b): return a + b")  # first call: cache write
review_code("def add(a, b): return a + b")  # second call: cache read

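Illustrative output (exact figures depend on your prompt sizes; the numbers below are what the formulas give when 90% of input tokens are cache reads and 10% are writes at the 1-hour price):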

Cache hit rate:      90.0%
Effective cost:      $0.87/MTok
Standard cost:       $3.00/MTok
Net savings:         71.0%

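Note the 71.0%, not 90%. That is the accurate number for a 90% hit rate on 1-hour TTL. Because the script reports each request on its own, a lone second call contains no cache writes and will show a cheaper figure; the 71% blend appears once reads and writes are summed over the whole workload. Running the same workload with "ttl" omitted (defaulting to 5-min post-regression) and the write price in the script changed to 3.75 would show $0.645/MTok and 78.5% savings: higher net savings on paper, but only sustainable for continuous high-frequency traffic that keeps the 5-minute cache warm. For workloads with any gap pattern, 1-hour TTL produces higher real-world savings.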
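To measure a workload-level hit rate rather than a per-request one, the usage counters can be summed across calls. The sketch below is one way to do that; `CacheStats` is a hypothetical helper, the prices are the Sonnet 4.6 figures from this post, and the demo feeds it stand-in usage objects instead of live API responses.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class CacheStats:
    """Running token totals across many responses (hypothetical helper)."""
    uncached: int = 0
    cache_read: int = 0
    cache_write: int = 0

    def add(self, usage) -> None:
        # Missing or None counters are treated as zero.
        self.uncached += getattr(usage, "input_tokens", 0) or 0
        self.cache_read += getattr(usage, "cache_read_input_tokens", 0) or 0
        self.cache_write += getattr(usage, "cache_creation_input_tokens", 0) or 0

    @property
    def total(self) -> int:
        return self.uncached + self.cache_read + self.cache_write

    def hit_rate(self) -> float:
        return self.cache_read / self.total if self.total else 0.0

    def cost_per_mtok(self, read=0.30, write=6.00, standard=3.00) -> float:
        """Blended USD per MTok; defaults assume Sonnet 4.6 with 1-hour TTL writes."""
        if not self.total:
            return 0.0
        spend = self.cache_read * read + self.cache_write * write + self.uncached * standard
        return spend / self.total


# Stand-in usage objects: one cache write followed by nine cache reads.
stats = CacheStats()
stats.add(SimpleNamespace(input_tokens=0, cache_read_input_tokens=0,
                          cache_creation_input_tokens=1000))
for _ in range(9):
    stats.add(SimpleNamespace(input_tokens=0, cache_read_input_tokens=1000,
                              cache_creation_input_tokens=0))

print(f"Hit rate: {stats.hit_rate():.1%}")
print(f"Blended:  ${stats.cost_per_mtok():.2f}/MTok")
```

In a real integration, `stats.add(response.usage)` would be called after each request.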

> KnowledgeCheck: A team enables prompt caching on Sonnet 4.6 with a 1-hour TTL. Their workload runs 100 requests per hour with a 40% cache hit rate. Are they saving money compared to standard input, or paying more?
>
> Answer: Paying more. The 1-hour TTL break-even is ~53% hit rate. At 40%, their blended cost is 0.40 × $0.30 + 0.60 × $6.00 = $0.12 + $3.60 = $3.72/MTok, which is 24% more expensive than the $3.00 standard rate. The fix: either restructure the prompt to clear 53% hit rate (move dynamic content below the breakpoint) or switch to 5-minute TTL, where 40% hit rate gives 0.40 × $0.30 + 0.60 × $3.75 = $0.12 + $2.25 = $2.37/MTok, a 21% savings.

The 78.5% net savings at 90% hit rate is achievable, but only if you treat hit rate and TTL tier as coupled: the TTL tier sets the write price, and the write price determines how much every miss costs you. For a production-level implementation covering multi-step agentic pipelines (where caching ROI compounds across every turn in a conversation), and a full breakdown of how four production teams pushed their hit rates past 60%, see Production Agents with Claude Agent SDK and MCP. Also see our 2026 AI coding agents production buyer's guide for how prompt caching fits into the broader cost optimization stack alongside model routing and output budgeting.

About the author: The Koenig AI Academy engineering team produces benchmark-grounded analysis of production AI systems for developers building with the Claude API, OpenAI, and open-source LLMs. All cost models in this post are derived from published Anthropic pricing and independently verified production datasets.
