Claude Prompt Caching: What a 90% Cache Hit Rate Actually Saves in 2026
- Calculate your exact effective cache savings using the blended cost formula.
- Determine which TTL tier your traffic pattern warrants and the hit rate required to break even.
- Identify the structural anti-patterns that collapse hit rates and apply the one-line March 2026 TTL regression fix.
A 90% cache hit rate on Claude Sonnet 4.6 produces a net effective input cost of $0.645/MTok: a 78.5% reduction from the $3.00 standard price, not 90%. The gap comes from the 1.25× write premium on cache misses, because the remaining 10% of your input tokens are billed at $3.75/MTok. Below a 22% hit rate on 5-minute TTL (or 53% on 1-hour TTL), prompt caching costs you money rather than saving it. Here is the complete savings formula and the cache hit rate thresholds that determine which side of the break-even line your workload sits on.
Here's what most teams miss when they read "prompt caching saves 90%": that number describes the per-token savings on a single cache read ($0.30 vs $3.00), not the savings on your overall bill. Anthropic's own engineering team treats anything below 90% hit rate as a SEV-class incident, yet the average developer sees 7–15% on first implementation. The gap is not a model limitation; it comes from two problems that are easy to conflate: a math misconception about what a 90% hit rate actually saves, and a structural problem that prevents hit rates from clearing 60% in the first place. This post solves the first problem with exact formulas. For the structural fixes, validated across four production teams, see our Claude prompt caching ROI benchmarks.
The Exact Savings Formula
Four inputs determine your effective caching cost per million input tokens:
- H = cache hit rate (the fraction of input tokens served as cache reads)
- W = cache write price per MTok ($3.75 for 5-min TTL; $6.00 for 1-hour TTL on Sonnet 4.6)
- R = cache read price per MTok ($0.30 on Sonnet 4.6)
- S = standard input price per MTok ($3.00 on Sonnet 4.6)
```
Effective cost = H × R + (1 − H) × W
Net savings % = (S − Effective cost) / S × 100
```
At 90% hit rate on 5-minute TTL:
```
Effective cost = 0.90 × $0.30 + 0.10 × $3.75
               = $0.27 + $0.375
               = $0.645/MTok → 78.5% savings
```
At 90% hit rate on 1-hour TTL:
```
Effective cost = 0.90 × $0.30 + 0.10 × $6.00
               = $0.27 + $0.60
               = $0.87/MTok → 71% savings
```
The insight from the 1-hour TTL math: the higher write price of 1-hour TTL ($6.00 vs $3.75) makes it a poor choice for low-hit-rate workloads, but at a 90% hit rate only 10% of tokens are writes, so the blended cost difference is 0.10 × ($6.00 − $3.75) = $0.225/MTok ($0.87 vs $0.645). The real question is whether your traffic pattern can sustain 90%, which depends on whether the gaps between requests are long enough for the cache to expire.
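The formula translates directly into a few lines of Python. The sketch below is a minimal illustration that assumes the Sonnet 4.6 prices quoted in this post; `effective_cost` and `net_savings_pct` are hypothetical helper names, not part of any SDK.

```python
# Prices in USD per million input tokens (MTok), as quoted in this post
# for Sonnet 4.6. Treat them as assumptions and check current pricing.
STANDARD = 3.00      # S: standard (uncached) input
CACHE_READ = 0.30    # R: cache read
WRITE_5M = 3.75      # W for 5-minute TTL
WRITE_1H = 6.00      # W for 1-hour TTL


def effective_cost(hit_rate: float, write_price: float,
                   read_price: float = CACHE_READ) -> float:
    """Blended cost per MTok: H * R + (1 - H) * W."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * read_price + (1.0 - hit_rate) * write_price


def net_savings_pct(hit_rate: float, write_price: float,
                    standard: float = STANDARD) -> float:
    """Net savings versus the standard price, in percent (negative = costs more)."""
    return (standard - effective_cost(hit_rate, write_price)) / standard * 100.0


if __name__ == "__main__":
    for label, w in (("5-min", WRITE_5M), ("1-hour", WRITE_1H)):
        print(f"{label}: ${effective_cost(0.90, w):.3f}/MTok, "
              f"{net_savings_pct(0.90, w):.1f}% net savings")
```

Plugging in your own measured hit rate and the write price for your TTL tier gives the blended figure to compare against the $3.00 standard rate.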
Break-Even by Hit Rate and TTL Tier (Original Data)
This table models Sonnet 4.6 pricing with cache hit rates from 0% to 95%. The standard non-cached input price is $3.00/MTok — rows above the break-even lines cost more than standard; rows below save money.
| Cache Hit Rate | Effective Cost (5-min TTL) | Net vs Standard | Effective Cost (1-hr TTL) | Net vs Standard |
|---|---|---|---|---|
| 0% | $3.75 | −25% (worse) | $6.00 | −100% (worse) |
| 10% | $3.405 | −13.5% (worse) | $5.43 | −81% (worse) |
| 22% | ≈$3.00 | ≈ break-even | $4.746 | −58.2% (worse) |
| 30% | $2.715 | +9.5% | $4.29 | −43% (worse) |
| 50% | $2.025 | +32.5% | $3.15 | −5% (worse) |
| 53% | ≈$1.92 | ≈ +36% | ≈$3.00 | ≈ break-even |
| 70% | $1.335 | +55.5% | $2.01 | +33% |
| 90% | $0.645 | +78.5% | $0.87 | +71% |
| 95% | $0.4725 | +84.25% | $0.585 | +80.5% |
Key observations:
- On 5-minute TTL, break-even is at ~22% hit rate — a low bar. Even mediocre implementations save money.
- On 1-hour TTL, break-even is at ~53% hit rate — a meaningful engineering constraint.
- At 90% hit rate, the absolute savings are 78.5% (5-min) or 71% (1-hour). Both are far below the "90% off" figure that vendors advertise.
- The only way to reach 90% net savings is a 100% hit rate — never achievable, because every cache requires at least one write per TTL window.
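The break-even thresholds follow from setting the blended cost equal to the standard price: H × R + (1 − H) × W = S, which rearranges to H = (W − S) / (W − R). A minimal sketch, assuming the Sonnet 4.6 prices used in this post (`break_even_hit_rate` is a hypothetical helper name):

```python
def break_even_hit_rate(write_price: float, read_price: float = 0.30,
                        standard: float = 3.00) -> float:
    """Hit rate at which H*R + (1-H)*W equals the standard price S.

    Solving for H gives H = (W - S) / (W - R).
    """
    if write_price <= read_price:
        raise ValueError("write price must exceed read price")
    return (write_price - standard) / (write_price - read_price)


if __name__ == "__main__":
    print(f"5-min TTL:  {break_even_hit_rate(3.75):.1%}")
    print(f"1-hour TTL: {break_even_hit_rate(6.00):.1%}")
```

With these prices the thresholds come out at about 21.7% and 52.6%, which round to the ~22% and ~53% figures used in this post.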
At scale, these percentages translate into large dollar amounts. AI Magicx benchmarked a team on Sonnet with a standard input bill of $15,000/day, which at $3.00/MTok corresponds to 5,000 MTok (5 billion input tokens) per day. Their reported result at an 80% hit rate: the bill drops from $15,000/day to ~$3,500/day, which is $11,500/day in savings, or roughly $4M annualized. The blended formula gives a smaller number. Standard: $3.00 × 5,000 MTok = $15,000/day. At 80% hit rate on 5-minute TTL: (0.80 × $0.30 + 0.20 × $3.75) × 5,000 = ($0.24 + $0.75) × 5,000 = $0.99 × 5,000 = $4,950/day. That's $10,050/day in savings: not $11,500, but still about $3.7M annualized.
What Collapses Hit Rates Below Break-Even
ProjectDiscovery's security audit platform started at 7% cache hit rate after enabling cache_control. The five structural failures behind near-zero hit rates:
1. Timestamps or request IDs in the cached prefix.
"Current time: 2026-06-05T14:32:15Z" inside your system prompt invalidates the cache on every single request — 0% hit rate is the mathematical result. Truncate to the day ("Date: 2026-06-05") or remove it entirely.
2. Per-user content above the breakpoint.
User IDs, session tokens, personalization metadata — anything that differs between users must live below the cache_control marker. Iron Mind's agentic platform caches the system prompt + tools at the deepest layer; the user-specific conversation tail is left uncached.
3. Wrong breakpoint placement.
The cache_control marker caches everything up to and including that block. Placing the marker after a dynamic block means you write a unique cache entry for every request — functionally the same as no caching at all, at 1.25× the write cost.
4. Sub-minimum prefix length.
Anthropic's API documentation sets the minimum cacheable prefix at 1,024 tokens (Sonnet/Opus) and 2,048 tokens (Haiku). Below these thresholds the API silently ignores cache_control: no error, no cache write recorded in cache_creation_input_tokens, and no savings. Verify by checking for a nonzero cache_read_input_tokens in the usage object after the second identical call.
5. Traffic gaps exceeding the TTL.
The most expensive pattern since the March 2026 regression: bursty workloads with 10–15 minute gaps between requests. Each gap lets the cache expire, and the next request pays the write surcharge again. On 5-minute TTL, a workload that sends one request every 8 minutes has an effective hit rate of ~0%.
ProjectDiscovery's fix for its own 7% hit rate was structural rather than traffic-related: the "relocation trick" of moving dynamic content below the breakpoint raised their hit rate from 7% to 74% without any traffic pattern change. Result: 59% overall cost reduction.
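The first three failures share one root cause: something volatile sits at or above the cache breakpoint. One way to guard against that is to build the stable prefix and the dynamic tail separately and fingerprint the prefix. The sketch below only constructs a request payload and does not call the API; `build_request` and `prefix_fingerprint` are hypothetical helper names, and the prompt text is a placeholder.

```python
import hashlib
import json
from datetime import datetime, timezone

STABLE_INSTRUCTIONS = "You are a senior code reviewer. [... stable guidelines ...]"


def build_request(user_id: str, user_message: str, now: datetime) -> dict:
    """Keep everything volatile out of the cached system block."""
    system = [{
        "type": "text",
        "text": STABLE_INSTRUCTIONS,              # identical on every request
        "cache_control": {"type": "ephemeral"},   # breakpoint ends the stable prefix
    }]
    # Per-user and time-dependent content goes after the breakpoint.
    dynamic = (f"Date: {now:%Y-%m-%d}\n"          # day precision, not seconds
               f"User: {user_id}\n\n{user_message}")
    return {"system": system,
            "messages": [{"role": "user", "content": dynamic}]}


def prefix_fingerprint(request: dict) -> str:
    """Hash of the cached prefix; it should not change between requests."""
    blob = json.dumps(request["system"], sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


if __name__ == "__main__":
    a = build_request("u-1", "Review: def add(a, b): return a + b",
                      datetime(2026, 6, 5, 14, 32, 15, tzinfo=timezone.utc))
    b = build_request("u-2", "Review: print('hi')",
                      datetime(2026, 6, 5, 18, 1, 2, tzinfo=timezone.utc))
    # Different users and times, same cached prefix.
    print(prefix_fingerprint(a) == prefix_fingerprint(b))
```

Logging the fingerprint alongside each request makes a drifting prefix easy to spot: if the hash changes between calls, the cache is being rewritten. If your requests also send tool definitions, they would likely need to be included in the fingerprint as well, since they form part of the cached prefix.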
The March 2026 TTL Regression: $949 in Documented Excess Cost
The most consequential change to Claude prompt caching economics in 2026 was silent. Around March 6, 2026, Anthropic changed the default prompt cache TTL from 1 hour to 5 minutes with no announcement, no release note, and no changelog entry.
A developer analyzed 119,866 API calls spanning January–April 2026:
| Month | API Calls | Actual Cost | Cost at 1h TTL | Excess Cost | % Waste |
|---|---|---|---|---|---|
| Jan 2026 | 2,639 | $78.99 | $37.54 | $41.45 | 52.5% |
| Feb 2026 | 27,220 | $1,120.43 | $1,108.11 | $12.32 | 1.1% |
| Mar 2026 | 68,264 | $2,776.11 | $2,057.01 | $719.09 | 25.9% |
| Apr 2026 | 21,743 | $1,193.01 | $1,016.78 | $176.23 | 14.8% |
| Total | 119,866 | $5,561.17 | $4,612.09 | $949.08 | 17.1% |
March 6 is the day 5-minute cache tokens reappeared after 33 days of clean 1-hour-only behavior. By March 8, 5-minute tokens outnumbered 1-hour tokens 5:1. The Hacker News thread that surfaced the issue drew 200+ comments, with many teams self-identifying as having missed the TTL type in their usage breakdowns.
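A shift like this is only visible if you log cache writes by TTL type. The sketch below tallies logged usage records per day. It assumes each record carries a `cache_creation` breakdown with `ephemeral_5m_input_tokens` and `ephemeral_1h_input_tokens` fields; confirm those names against the usage objects in your own logs. The `date` key and the `tally_cache_writes` helper are hypothetical.

```python
from collections import defaultdict


def tally_cache_writes(usage_records: list[dict]) -> dict[str, dict[str, int]]:
    """Sum cache-write tokens per day, split by TTL type.

    Each record is a logged usage payload plus a 'date' key (YYYY-MM-DD).
    The 'cache_creation' field names below are an assumption; check them
    against the usage objects in your own logs.
    """
    totals: dict[str, dict[str, int]] = defaultdict(lambda: {"5m": 0, "1h": 0})
    for rec in usage_records:
        breakdown = rec.get("cache_creation") or {}
        totals[rec["date"]]["5m"] += breakdown.get("ephemeral_5m_input_tokens", 0)
        totals[rec["date"]]["1h"] += breakdown.get("ephemeral_1h_input_tokens", 0)
    return dict(totals)


if __name__ == "__main__":
    # Made-up records for illustration only, not data from the analysis above.
    logs = [
        {"date": "2026-03-05", "cache_creation": {"ephemeral_1h_input_tokens": 4000}},
        {"date": "2026-03-06", "cache_creation": {"ephemeral_5m_input_tokens": 4000}},
        {"date": "2026-03-06", "cache_creation": {"ephemeral_1h_input_tokens": 1000}},
    ]
    for day, split in sorted(tally_cache_writes(logs).items()):
        print(day, split)
```

A day on which the "5m" total jumps while the "1h" total falls is the kind of signal the analysis above describes.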
Why the regression hurt bursty workloads specifically: a team with requests arriving every 8 minutes had a near-100% cache hit rate on 1-hour TTL (the cache stays warm the entire workday, so only the first request is a write). After the regression to 5-minute TTL, every request became a write. Using the break-even table: at 0% hit rate on 5-min TTL, effective cost is $3.75/MTok, 25% more expensive than standard. Even a team that had been saving only 30% and is now paying 25% more has seen a 55-percentage-point swing in its input costs.
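The effect of request spacing on hit rate can be checked with a small simulation. This is a simplified model, not a description of the caching service's internals: it assumes a single shared prefix, that the first request always writes, and that every request (hit or write) restarts the TTL clock. `simulate_hit_rate` is a hypothetical helper name.

```python
def simulate_hit_rate(arrival_minutes: list[float], ttl_minutes: float) -> float:
    """Fraction of requests served from cache for one shared prefix.

    Simplifying assumptions: the first request always writes, every request
    (hit or write) restarts the TTL clock, and a request arriving exactly at
    expiry counts as a miss.
    """
    hits = 0
    expires_at = None
    for t in sorted(arrival_minutes):
        if expires_at is not None and t < expires_at:
            hits += 1
        expires_at = t + ttl_minutes  # a write or a read both refresh the entry
    return hits / len(arrival_minutes) if arrival_minutes else 0.0


if __name__ == "__main__":
    workday = [i * 8.0 for i in range(60)]  # one request every 8 minutes, 8 hours
    print(f"5-min TTL:  {simulate_hit_rate(workday, 5):.1%}")
    print(f"1-hour TTL: {simulate_hit_rate(workday, 60):.1%}")
```

Under these assumptions the 8-minute cadence never hits a 5-minute cache, while on 1-hour TTL 59 of the 60 requests are cache reads. Feeding in real request timestamps from your logs gives a quick estimate of which tier your traffic can keep warm.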
The one-line fix: explicitly set "ttl": "1h" in your cache_control object. This was always a valid parameter; few teams used it because the default appeared to be 1 hour. See the working example in our Claude Code production guide for the full multi-block caching pattern.
Runnable Example: Verify Your Cache Hit Rate
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Must clear 1,024 tokens for Sonnet; verify length before deploy.
SYSTEM_PROMPT = """
You are a senior code reviewer specializing in Python.
Your guidelines: [... at least 1,024 tokens of stable instructions ...]
"""


def review_code(user_code: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral", "ttl": "1h"},  # explicit 1h TTL
            }
        ],
        messages=[{"role": "user", "content": user_code}],
    )

    usage = response.usage
    # Cache fields can be None when caching is not in effect; treat them as 0.
    cache_read = usage.cache_read_input_tokens or 0
    cache_write = usage.cache_creation_input_tokens or 0
    total_input = usage.input_tokens + cache_read + cache_write
    hit_rate = cache_read / total_input if total_input > 0 else 0.0

    # Blended effective cost (Sonnet 4.6 prices, 1-hour write price)
    effective_cost_per_mtok = (
        cache_read * 0.30 + cache_write * 6.00 + usage.input_tokens * 3.00
    ) / total_input if total_input > 0 else 0.0

    print(f"Cache hit rate: {hit_rate:.1%}")
    print(f"Effective cost: ${effective_cost_per_mtok:.2f}/MTok")
    print("Standard cost: $3.00/MTok")
    print(f"Net savings: {(3.00 - effective_cost_per_mtok) / 3.00:.1%}")

    return response.content[0].text


review_code("def add(a, b): return a + b")  # first call: cache write
review_code("def add(a, b): return a + b")  # second call: cache read
```
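Illustrative output when 90% of input tokens are cache reads and the other 10% are 1-hour cache writes. A single warm call reports only its own tokens, so the second call above will usually show a higher hit rate, with no write tokens and only the short user message uncached:
```
Cache hit rate: 90.0%
Effective cost: $0.87/MTok
Standard cost: $3.00/MTok
Net savings: 71.0%
```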
Note the 71.0%, not 90%. That is the accurate number for a 90% hit rate on 1-hour TTL. The same 90/10 split with "ttl" omitted (defaulting to 5-min post-regression), and with the write price in the script changed from 6.00 to 3.75, would show $0.645/MTok and 78.5% savings: higher net savings on paper, but only sustainable for continuous high-frequency traffic that keeps the 5-minute cache warm. For workloads with gaps longer than 5 minutes, 1-hour TTL produces higher real-world savings.
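Because each call reports only its own token counts, the blended hit rate is best computed over many calls. The sketch below accumulates the three usage counters across calls; `CacheStats` is a hypothetical helper, the default prices are the Sonnet 4.6 figures quoted in this post, and the token counts in the demo are made up.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Running token totals across many API calls."""
    read: int = 0      # cache_read_input_tokens
    written: int = 0   # cache_creation_input_tokens
    uncached: int = 0  # input_tokens (billed at the standard rate)

    def add(self, read: int, written: int, uncached: int) -> None:
        self.read += read
        self.written += written
        self.uncached += uncached

    @property
    def total(self) -> int:
        return self.read + self.written + self.uncached

    def hit_rate(self) -> float:
        return self.read / self.total if self.total else 0.0

    def cost_per_mtok(self, write_price: float = 6.00,
                      read_price: float = 0.30, standard: float = 3.00) -> float:
        if not self.total:
            return 0.0
        dollars = (self.read * read_price + self.written * write_price
                   + self.uncached * standard)
        return dollars / self.total


if __name__ == "__main__":
    stats = CacheStats()
    stats.add(read=0, written=2000, uncached=20)      # cold call: cache write
    for _ in range(9):
        stats.add(read=2000, written=0, uncached=20)  # warm calls: cache reads
    print(f"Hit rate: {stats.hit_rate():.1%}")
    print(f"Effective cost: ${stats.cost_per_mtok():.2f}/MTok")
```

Calling `stats.add(...)` with the values from each response's usage object, then reading the totals once per hour or per day, gives a hit rate that is comparable to the break-even thresholds.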
> KnowledgeCheck: A team enables prompt caching on Sonnet 4.6 with a 1-hour TTL. Their workload runs 100 requests per hour with a 40% cache hit rate. Are they saving money compared to standard input, or paying more?
>
> Answer: Paying more. The 1-hour TTL break-even is ~53% hit rate. At 40%, their blended cost is 0.40 × $0.30 + 0.60 × $6.00 = $0.12 + $3.60 = $3.72/MTok, which is 24% more expensive than the $3.00 standard rate. The fix: either restructure the prompt to clear 53% hit rate (move dynamic content below the breakpoint) or switch to 5-minute TTL, where 40% hit rate gives $0.30 × 0.40 + $3.75 × 0.60 = $0.12 + $2.25 = $2.37/MTok, a 21% savings.
The 78.5% net savings at 90% hit rate is achievable, but it requires understanding that hit rate and TTL tier interact: the TTL tier sets the price of every miss, so the same hit rate yields different savings on each tier. For a production-level implementation covering multi-step agentic pipelines (where caching ROI compounds across every turn in a conversation), and a full breakdown of how four production teams crossed the 60% hit rate threshold, see Production Agents with Claude Agent SDK and MCP. Also see our 2026 AI coding agents production buyer's guide for how prompt caching fits into the broader cost optimization stack alongside model routing and output budgeting.
<script type="application/ld+json"> { "@context": "https://schema.org", "@type": "Article", "headline": "Claude Prompt Caching: What a 90% Cache Hit Rate Actually Saves in 2026", "description": "At 90% cache hit rate, Claude Sonnet 4.6 caching saves 78.5% on input costs — not 90%. The exact savings formula, break-even thresholds by TTL, and the March 2026 regression fix.", "author": { "@type": "Organization", "name": "Koenig AI Academy", "url": "https://academy.kspl.tech" }, "publisher": { "@type": "Organization", "name": "Koenig AI Academy", "url": "https://academy.kspl.tech", "logo": { "@type": "ImageObject", "url": "https://academy.kspl.tech/img/logo.png" } }, "datePublished": "2026-06-05", "dateModified": "2026-06-05", "image": "https://academy.kspl.tech/img/blogs/2026-06-05-claude-prompt-caching-90-percent-hit-math/hero.png", "mainEntityOfPage": { "@type": "WebPage", "@id": "https://academy.kspl.tech/blog/claude-prompt-caching-90-percent-hit-math" } } </script>
About the author: The Koenig AI Academy engineering team produces benchmark-grounded analysis of production AI systems for developers building with the Claude API, OpenAI, and open-source LLMs. All cost models in this post are derived from published Anthropic pricing and independently verified production datasets.