
Claude Prompt Caching Saves 60–90% on Input Tokens — If You Avoid These Five Mistakes in 2026

What you'll learn
  • Calculate whether your workload's cache hit rate clears the 60% break-even threshold before enabling caching.
  • Apply the 'relocation trick' to move dynamic content below the cache breakpoint and lift hit rates from single digits to 74%+.
  • Explicitly set ttl: 1h in cache_control to avoid the March 2026 silent default regression.

Claude prompt caching cuts input token costs by 60–90% for teams that structure their prompts correctly. Cache reads on Sonnet 4.6 cost $0.30/MTok versus $3.00/MTok standard, a 90% reduction per read. In 2026 the key variable is whether your hit rate clears 60%; below that threshold, the cache write premium means caching is costing you money, not saving it. Production data from four independent teams confirms the savings are real, but so is the failure mode.

Here's what most coverage misses: Anthropic's own engineering team treats a cache hit rate below 90% as a SEV-class incident. The average developer enabling caching for the first time sees a 7–15% hit rate and concludes "caching doesn't really work." The gap between those two numbers isn't a model limitation — it's a prompt structure problem. And in March 2026, a silent TTL change from Anthropic quietly reversed gains for anyone relying on the 1-hour default.

What prompt caching actually costs

Before the savings math, the pricing structure:

| Token type | Sonnet 4.6 ($/MTok) | vs. standard |
|---|---|---|
| Standard input | $3.00 | |
| Cache write (5-min TTL) | $3.75 | 1.25× |
| Cache write (1-hour TTL) | $6.00 | 2.0× |
| Cache read | $0.30 | 0.1× (90% off) |

The break-even rule: a cache prefix is profitable if it's read at least twice within the TTL window. A 10,000-token system prompt costs $0.0375 to write (Sonnet, 5-min TTL), and each cache read costs $0.003, so one write plus two reads totals $0.0435. The same three requests at the standard rate cost $0.03 each, or $0.09, for a net saving of $0.0465; every further read saves another $0.027. On 1-hour TTL the write costs $0.06, a $0.03 premium that the second read recovers, and it amortizes across ~20+ reads during a business-hours session.
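The arithmetic is easy to get wrong by a factor of ten, so here is a minimal sketch that recomputes it from the Sonnet 4.6 prices in the table. The function names are illustrative, not part of any SDK, and the model only covers the cached prefix, not the rest of the request.

```python
# Prices in $/MTok, taken from the Sonnet 4.6 pricing table.
STANDARD = 3.00
WRITE_5M = 3.75
WRITE_1H = 6.00
READ = 0.30


def cached_cost(prefix_tokens: int, reads: int, write_price: float) -> float:
    """Dollar cost of one cache write followed by `reads` cache reads."""
    mtok = prefix_tokens / 1_000_000
    return mtok * (write_price + reads * READ)


def uncached_cost(prefix_tokens: int, reads: int) -> float:
    """Dollar cost of sending the same prefix as standard input on every request."""
    mtok = prefix_tokens / 1_000_000
    return mtok * STANDARD * (reads + 1)


def break_even_reads(write_price: float) -> int:
    """Smallest number of reads at which caching costs no more than not caching."""
    reads = 0
    while write_price + reads * READ > STANDARD * (reads + 1):
        reads += 1
    return reads


if __name__ == "__main__":
    for label, price in (("5m", WRITE_5M), ("1h", WRITE_1H)):
        saved = uncached_cost(10_000, 2) - cached_cost(10_000, 2, price)
        print(f"{label}: break-even after {break_even_reads(price)} read(s), "
              f"${saved:.4f} saved after two reads")
```

Under these prices the 5-minute write is recovered by the first read and the 1-hour write by the second, which is why "read at least twice" works as a safe rule for both TTLs.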

One hard constraint: minimum cacheable prefix is 1,024 tokens (Sonnet/Opus) or 2,048 tokens (Haiku). Below those thresholds the API silently ignores the cache_control directive — no error, no savings, no indication anything went wrong. Always verify with cache_read_input_tokens in the response object.
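To turn that verification into a number you can track, here is a rough sketch of a hit-rate calculation over the usage counters. Two assumptions to flag: `Usage` is a hypothetical stand-in for the usage object on a response, and the token-weighted definition of "hit rate" below is one plausible choice, since none of the cited teams spell out exactly how they compute theirs.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical stand-in for the usage counters on an API response."""
    input_tokens: int = 0
    cache_read_input_tokens: int = 0
    cache_creation_input_tokens: int = 0


def hit_rate(usages: list[Usage]) -> float:
    """Share of all input tokens served from cache (assumed definition)."""
    read = sum(u.cache_read_input_tokens for u in usages)
    other = sum(u.cache_creation_input_tokens + u.input_tokens for u in usages)
    total = read + other
    return read / total if total else 0.0


def prefix_was_ignored(usage: Usage) -> bool:
    """True when a request produced neither a cache write nor a cache read."""
    return (
        usage.cache_read_input_tokens == 0
        and usage.cache_creation_input_tokens == 0
    )


if __name__ == "__main__":
    calls = [
        Usage(input_tokens=200, cache_creation_input_tokens=1800),  # first call writes
        Usage(input_tokens=200, cache_read_input_tokens=1800),      # later calls read
        Usage(input_tokens=200, cache_read_input_tokens=1800),
    ]
    print(f"hit rate: {hit_rate(calls):.0%}")
```

A request where `prefix_was_ignored` returns true on a prompt you meant to cache is the silent failure described above: the directive was accepted but nothing was written or read.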

Four production benchmarks

The savings spread is wide — here's the raw data from four independent teams, each using a different workload type:

[Iron Mind](https://iron-mind.ai/blog/prompt-caching-claude-production) — agentic systems platform: ~90% cost reduction, ~80% latency reduction on cached prefixes. Their approach: cache at the boundary between stable and dynamic content. System prompt + tools at the deepest cache layer; conversation tail left uncached. Key callout: "Letting the 5-minute TTL expire between requests is the most expensive failure mode. If your traffic is bursty — a few requests then 10 minutes of silence — your cache evaporates and you pay the write surcharge again."

[AI Magicx](https://www.aimagicx.com/blog/prompt-caching-claude-api-cost-optimization-2026) — three workloads benchmarked:

| Workload | Before | After | Savings |
|---|---|---|---|
| Simple system prompt (5K tokens, high freq) | $0.015/req | $0.0015/req | 90% |
| RAG context caching (10–20K tokens) | $0.045/req | $0.009/req | 80% |
| Research assistant (10K sessions/mo, 4.5 avg turns) | $4,140/mo | $1,650/mo | 60% |

Average across workloads: 65% savings. Implementation time: 2–4 hours.

[ProjectDiscovery/Neo](https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching) — security audit platform: Started at 7% cache hit rate after naively enabling caching. After the "relocation trick" (moving dynamic content like timestamps and user IDs below the cache breakpoint): 74% hit rate in a single deployment. After full optimization: 84% hit rate, 59% overall cost reduction.

[BoringBot](https://boringbot.substack.com/p/how-to-save-millions-in-claude-tokens) — four-technique stacking:

| Technique | Standalone savings |
|---|---|
| Model routing | 77.1% |
| Prompt caching | 71.5% |
| Multi-turn caching | 63.2% |
| Output budgeting | 56.8% |
| All four combined | 89.3% |

Prompt caching alone at 71.5% is competitive but leaves ~18 points on the table versus the full stack.

The March 2026 TTL regression: $949 in documented excess cost

The most significant recent development in prompt caching economics is a silent regression Anthropic introduced around March 6, 2026: the default prompt cache TTL shifted from 1 hour to 5 minutes. No announcement, no changelog entry.

A developer analyzed 119,866 API calls across two machines spanning January–April 2026:

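| Month | Actual cost | Cost at 1h TTL | Excess cost | % waste |
|---|---|---|---|---|
| Jan 2026 | $78.99 | $37.54 | $41.45 | 52.5% |
| Feb 2026 | $1,120.43 | $1,108.11 | $12.32 | 1.1% |
| Mar 2026 | $2,776.11 | $2,057.01 | $719.09 | 25.9% |
| Apr 2026 | $1,193.01 | $1,016.78 | $176.23 | 14.8% |
| Total | $5,561.17 | $4,612.09 | $949.08 | 17.1% |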

The transition was visible to the day — March 6 is when 5-minute tokens first reappeared after 33 days of clean 1h-only behavior. By March 8, 5-minute tokens outnumbered 1-hour by 5:1. The HN discussion that followed drew 200+ comments.

The fix is one line: explicitly set "ttl": "1h" in your cache_control object instead of relying on the default.
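To see why the TTL matters so much for spaced-out traffic, here is a deliberately simplified cost model. It assumes evenly spaced requests and that each cache hit keeps the entry warm for another full TTL window; real traffic is burstier, so treat the output as directional. Prices come from the Sonnet 4.6 table, and `prefix_cost` is an illustrative helper, not an SDK function.

```python
# $/MTok, from the Sonnet 4.6 pricing table.
PRICES = {"standard": 3.00, "write_5m": 3.75, "write_1h": 6.00, "read": 0.30}
TTL_MINUTES = {"5m": 5, "1h": 60}


def prefix_cost(prefix_tokens: int, requests: int, gap_minutes: float, ttl: str) -> float:
    """Dollar cost of the cached prefix over `requests` evenly spaced calls."""
    if requests <= 0:
        return 0.0
    mtok = prefix_tokens / 1_000_000
    write = PRICES[f"write_{ttl}"]
    if gap_minutes < TTL_MINUTES[ttl]:
        # Cache stays warm: one write, then every later call is a read.
        return mtok * (write + (requests - 1) * PRICES["read"])
    # Cache expires between calls: every request pays the write price again.
    return mtok * write * requests


if __name__ == "__main__":
    # 10,000-token prefix, six requests spaced 10 minutes apart.
    uncached = 10_000 / 1_000_000 * PRICES["standard"] * 6
    print(f"uncached: ${uncached:.3f}")
    for ttl in ("5m", "1h"):
        print(f"{ttl} TTL:  ${prefix_cost(10_000, 6, 10, ttl):.3f}")
```

In this toy scenario the 5-minute TTL ends up more expensive than not caching at all, because every call is a fresh write at 1.25×, while the 1-hour TTL pays for one write and five reads.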

Five anti-patterns that kill cache hit rate

Based on the production case studies above, these are the structural failures that separate a 7% hit rate from a 90% one:

  1. Timestamps in cached content. "Current time: 2026-05-31T14:32:15Z" in your system prompt invalidates the cache on every request. Truncate to the day or remove it.
  2. User-specific content above the breakpoint. User IDs, session tokens, personalization metadata: anything that varies per user must move below the cache_control marker.
  3. Wrong breakpoint placement. The cache_control marker caches everything up to and including that block. Place it at the boundary between your stable prefix and the first dynamic element.
  4. Sub-minimum token prefix. Below 1,024 tokens (Sonnet/Opus) or 2,048 (Haiku), the API silently ignores the directive. Verify with cache_read_input_tokens in the response.
  5. Bursty traffic on 5-minute TTL. If requests arrive every 10–15 minutes, each one triggers a fresh write at 1.25×. Either concentrate traffic density or explicitly request 1-hour TTL.
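The first three anti-patterns all come down to the same layout rule, which a short sketch makes concrete. This is one way to apply the relocation trick, not ProjectDiscovery's actual code: `build_system_blocks` and `STABLE_INSTRUCTIONS` are hypothetical names, and the placeholder text is far shorter than a real cacheable prefix would need to be.

```python
from datetime import datetime, timezone

# Placeholder only; a real prefix must clear the minimum cacheable length.
STABLE_INSTRUCTIONS = "You are a senior code reviewer specializing in Python."


def build_system_blocks(user_id: str, now: datetime) -> list[dict]:
    """Stable text first, with the breakpoint on it; per-request text after."""
    return [
        {
            "type": "text",
            "text": STABLE_INSTRUCTIONS,
            # Breakpoint: everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
        {
            "type": "text",
            # Dynamic content lives below the breakpoint; the date is
            # truncated to the day rather than carrying a full timestamp.
            "text": f"Current date: {now:%Y-%m-%d}\nUser ID: {user_id}",
        },
    ]


if __name__ == "__main__":
    blocks = build_system_blocks("user-123", datetime.now(timezone.utc))
    print(blocks[1]["text"])
```

Because the first block is byte-identical for every user and every request, it can be served from cache, while the second block changes freely without invalidating anything above it.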

Runnable example: enabling 1-hour TTL with explicit cache_control

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """
You are a senior code reviewer specializing in Python.
Your guidelines:
[... at least 1,024 tokens of stable instructions ...]
"""


def review_code(user_code: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral", "ttl": "1h"},  # explicit 1h TTL
            }
        ],
        messages=[{"role": "user", "content": user_code}],
    )

    # Verify cache is working: the first call reports cache_creation_input_tokens
    usage = response.usage
    print(f"Cache read: {usage.cache_read_input_tokens} tokens")
    print(f"Cache write: {usage.cache_creation_input_tokens} tokens")
    # Second call onward: cache_read_input_tokens > 0, cache_creation_input_tokens == 0

    return response.content[0].text
```

Expected output on second call:

```
Cache read: 1847 tokens
Cache write: 0 tokens
```

The cache_read_input_tokens count on the second call confirms the 90% cost reduction is active. If it's 0 after the second identical call, check for dynamic content above the breakpoint or a sub-minimum prefix length.
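When the read count stays at 0, one quick way to hunt for dynamic content is to fingerprint the cached prefix on consecutive requests and compare. The helper below is a hypothetical debugging aid, not part of the Anthropic SDK, and it only inspects system blocks; tools and earlier messages can also sit inside the cached prefix and would need the same treatment.

```python
import hashlib
import json


def prefix_fingerprint(system_blocks: list[dict]) -> str:
    """Hash every block up to and including the last one carrying cache_control."""
    last = max(
        (i for i, block in enumerate(system_blocks) if "cache_control" in block),
        default=-1,
    )
    prefix = system_blocks[: last + 1]
    payload = json.dumps(prefix, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    marker = {"type": "ephemeral", "ttl": "1h"}
    good = [
        {"type": "text", "text": "stable rules", "cache_control": marker},
        {"type": "text", "text": "User ID: 1"},
    ]
    bad = [
        {"type": "text", "text": "stable rules, time 14:32:15", "cache_control": marker},
    ]
    print(prefix_fingerprint(good)[:12], prefix_fingerprint(bad)[:12])
```

If two requests that should share a cache entry log different fingerprints, something above the breakpoint is changing between calls.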


> KnowledgeCheck: A team caches a 900-token system prompt on Sonnet 4.6 and sees zero savings despite enabling cache_control. What is the most likely cause?
>
> Answer: The prefix is below the 1,024-token minimum for Sonnet 4.6. The API silently ignores cache_control directives on prefixes shorter than this threshold: no error is raised and no cache write occurs. The fix is to expand the cached prefix past 1,024 tokens, then confirm the change worked by checking cache_read_input_tokens in the response.


Prompt caching is the highest-ROI single change you can make to a Claude API integration — but only if you get the structure right. The production data across four teams confirms the 60–90% savings range is achievable, and the March 2026 TTL regression is fixable in one line. For a complete treatment of caching in multi-step agentic pipelines — where the ROI compounds across every turn — see Production Agents with Claude Agent SDK + MCP Connector.

References

  1. iron-mind.ai
  2. www.aimagicx.com
  3. projectdiscovery.io
  4. boringbot.substack.com
  5. github.com
  6. dev.to
  7. pub.towardsai.net
  8. dev.to
  9. news.ycombinator.com