Claude’s Prompt Caching: A Practical Guide for Everyday Users
You're using Claude wrong: how switching to "cheaper" models mid-conversation can actually cost you more.
What follows is a synopsis of the full article:
The Big Idea (In Plain English)
When you chat with Claude, it has to “read” your entire conversation every single time you send a message. That’s expensive and slow. Prompt caching lets Claude bookmark parts of your conversation so it doesn’t have to re-read everything from scratch.
Think of it like a TV show recap. Instead of rewatching all previous episodes, you get a 2-minute “previously on...” summary. Claude does something similar — but automatically, and it dramatically cuts costs and response time.
The payoff:
Up to 90% cheaper to re-use cached context
Up to 85% faster responses on cached content
Why This Matters to You (Even If You’re Not a Developer)
You might think “I’m not paying API bills, why do I care?” Here’s why it matters:
Faster responses: Cache hits mean Claude answers quicker
Longer, more capable sessions: Efficient caching means Claude can maintain more context without hitting limits
Better product experiences: Apps built on Claude that use caching well are cheaper to run, which means more sustainable products for you to use
The #1 Rule: Stop Changing Things Mid-Conversation
The cache works by recognizing identical patterns. The moment anything changes in the “stable” part of your conversation, Claude has to re-read everything.
Common ways people accidentally break this:
| What You Do | What Actually Happens |
| --- | --- |
| Switch from Opus to Haiku mid-session to "save money" | Claude re-reads everything at full price, often costing more than staying on Opus |
| Start a new topic in an existing long session | Adds dynamic content that can disrupt stable patterns |
| Keep editing your initial instructions | Breaks the cached version every time |
The counterintuitive truth about model switching
Say you’re deep in a long conversation with Claude Opus (the powerful, expensive model). You think: “I’ll switch to Haiku (the cheap model) to save money.”
Here’s what actually happens:
Your cached Opus session was costing you ~$0.15 per message (cache read price)
Switching to Haiku forces a full re-read of everything: costs ~$0.18 just for that one switch
You paid MORE to use the cheaper model
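The arithmetic above can be sketched in a few lines. The per-million-token prices here are hypothetical placeholders chosen to reproduce the article's figures, not current Anthropic list prices:

```python
# Back-of-envelope cost of one message over a long context, comparing a
# cached read on the expensive model vs. a cold re-read on the cheap one.
# Prices are illustrative placeholders, not real Anthropic pricing.

CONTEXT_TOKENS = 100_000

OPUS_CACHE_READ_PER_MTOK = 1.50   # cached input, roughly 90% off full price
HAIKU_FULL_READ_PER_MTOK = 1.80   # uncached input right after the switch

def cost(tokens: int, price_per_mtok: float) -> float:
    """Dollar cost of reading `tokens` of input at the given rate."""
    return tokens / 1_000_000 * price_per_mtok

stay = cost(CONTEXT_TOKENS, OPUS_CACHE_READ_PER_MTOK)    # ~$0.15: cache hit
switch = cost(CONTEXT_TOKENS, HAIKU_FULL_READ_PER_MTOK)  # ~$0.18: cold re-read

assert switch > stay  # the "cheaper" model costs more for this one message
print(f"stay on Opus: ${stay:.2f}, switch to Haiku: ${switch:.2f}")
```

The one-time cold read is the killer: whatever the per-token rate, re-reading the entire history at full price swamps the savings from a cheaper model on a single message.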
✅ What to do instead: If you need a lighter model for a subtask, start a fresh conversation with a short summary of what it needs to know. Keep your main conversation going on Opus.
How Claude Organizes Information (And Why You Should Too)
Claude Code (Anthropic’s AI coding tool) discovered that the order you put information in matters enormously. Think of it like a filing cabinet:
📁 DRAWER 1 (Never changes)
└── Core instructions, base rules
📁 DRAWER 2 (Changes per project)
└── Project-specific context, your CLAUDE.md file
📁 DRAWER 3 (Changes per session)
└── What you've discussed so far
📁 DRAWER 4 (Always new)
└── Your current message

The golden rule: Stable stuff goes first. New stuff goes last.
If you put something that changes (like today’s date, or your current mood about the project) at the TOP of your context, it breaks the cache for everything below it — which could be thousands of tokens of useful context.
Practical examples of what breaks caching:
❌ Starting every message with “Today is [date]...”
❌ Randomly reordering your list of instructions
❌ Injecting your username into what should be a fixed system prompt
✅ Keeping your core instructions identical every session
✅ Putting project context in a fixed file (like CLAUDE.md) that doesn’t change
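A toy model makes the rule concrete. Prefix caching reuses the longest run of context segments that are byte-identical to the previous turn, so one changed segment at the top invalidates everything below it. The segment contents here are made up for illustration:

```python
# Toy model of prefix caching: only the leading segments that exactly
# match the previous request can be served from cache.

def cached_prefix_len(prev: list[str], curr: list[str]) -> int:
    """How many leading segments are reusable from the cache."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n

core = "You are a helpful coding assistant."      # drawer 1: never changes
project = "Project context from CLAUDE.md ..."    # drawer 2: per project
history = "Everything discussed so far ..."       # drawer 3: per session

# Good: stable stuff first, the changing part (the new message) last.
good_prev = [core, project, history, "Fix the login bug"]
good_curr = [core, project, history, "Now add tests"]
print(cached_prefix_len(good_prev, good_curr))  # 3 -> core, project, history reused

# Bad: a timestamp at the top changes every day, so nothing matches.
bad_prev = ["Today is 2024-06-01.", core, project, history]
bad_curr = ["Today is 2024-06-02.", core, project, history]
print(cached_prefix_len(bad_prev, bad_curr))  # 0 -> full re-read
```

Same information in both cases; only the ordering decides whether thousands of tokens are a cache hit or a cold read.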
The “Don’t Touch the Tools” Rule
If you’re using Claude through an app or API that gives it tools (like web search, code execution, file access), those tools are part of the cached “fingerprint.”
Changing tools mid-session is like changing the lock mid-conversation — Claude has to start over.
Adding a tool mid-session → full re-read
Removing a tool mid-session → full re-read
Changing the order of tools → full re-read
What good apps do instead:
Load all possible tools upfront, even ones that won’t be used every turn. The tiny cost of carrying unused tools is nothing compared to the cost of re-reading a 100,000-token conversation.
For apps with huge numbers of tools (50+), good developers send lightweight placeholders ("stubs") for each tool upfront, then load the full details only when Claude actually needs that specific tool.
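One way to picture why tools are part of the fingerprint: think of the serialized tool list as hashed into the cache key. Any difference, including order, produces a different key. The tool names below are made up for illustration:

```python
# Sketch: the serialized tool definitions behave like part of a cache
# fingerprint. Reordering or adding a tool changes the fingerprint,
# which is equivalent to a cold cache.
import hashlib
import json

def fingerprint(tools: list[dict]) -> str:
    """Hash the serialized tool list, order included."""
    return hashlib.sha256(json.dumps(tools).encode()).hexdigest()[:12]

tools = [
    {"name": "web_search", "description": "Search the web"},
    {"name": "run_code", "description": "Execute code"},
]

same = fingerprint(tools) == fingerprint(list(tools))       # identical list
reordered = fingerprint(tools) == fingerprint(tools[::-1])  # order changed
added = fingerprint(tools) == fingerprint(tools + [{"name": "read_file"}])

print(same, reordered, added)  # True False False
```

This is why loading every tool once, in a fixed order, at the start of the session is the cache-friendly move.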
Plan Mode: Why It Exists and Why You Should Use It
Claude Code has a “Plan Mode” where Claude thinks through a problem before doing anything. You might wonder: why not just give Claude fewer tools in planning mode to keep things simple?
Turns out, that would break the cache every time you switch modes.
Instead, Claude Code’s clever solution:
Keeps all tools loaded at all times
Adds special “EnterPlanMode” and “ExitPlanMode” tools
Uses instructions to tell Claude what it can/can’t do in each mode
The result: Claude can smoothly switch between planning and doing without expensive cache resets. It can even decide on its own when planning is needed.
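The pattern described above can be sketched as a tiny state machine: the tool list is a constant, and the mode switch is itself a tool call whose effect is enforced by instructions, not by removing tools. Everything except the EnterPlanMode/ExitPlanMode names is illustrative:

```python
# Sketch of mode switching via tool calls: TOOLS never changes, so the
# cached prefix stays valid; a flag plus instructions do the gating.

TOOLS = ["read_file", "edit_file", "run_code", "EnterPlanMode", "ExitPlanMode"]

class Session:
    def __init__(self) -> None:
        self.plan_mode = False

    def call(self, tool: str) -> str:
        if tool == "EnterPlanMode":
            self.plan_mode = True
            return "planning: read-only tools allowed"
        if tool == "ExitPlanMode":
            self.plan_mode = False
            return "executing: all tools allowed"
        if self.plan_mode and tool in ("edit_file", "run_code"):
            return f"refused: {tool} not allowed while planning"
        return f"ran {tool}"

s = Session()
print(s.call("EnterPlanMode"))
print(s.call("edit_file"))     # refused by instruction, not by tool removal
print(s.call("ExitPlanMode"))
print(s.call("edit_file"))     # ran edit_file
# TOOLS itself never changed, so no cache reset across the mode switch.
```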
For you as a user: Don’t fight Plan Mode when Claude enters it. It’s doing you a favor — thinking before acting, and doing so efficiently.
What Happens When You Hit the Conversation Limit
Long sessions eventually fill up Claude’s “memory” (context window). When this happens, Claude needs to compact — summarize what’s happened so far and continue fresh.
The wrong way (what naive apps do):
Start a completely new request with a stripped-down prompt to generate the summary → Cache miss, full re-read cost at the worst possible moment
The right way (what Claude Code does):
Create a “fork” that looks nearly identical to the original conversation, just with a summarization instruction added at the end → Almost everything is still cached, nearly free to run
What this means for you:
Don’t panic when Claude Code auto-compacts your session
It’s not losing your context — it’s summarizing it efficiently
Good apps trigger this at ~80-85% capacity, not when completely full, so there’s always room to do it properly
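Both halves of this section, triggering early and forking instead of starting fresh, can be sketched together. The window size, threshold, and message strings are illustrative:

```python
# Sketch of cache-friendly compaction: trigger before the window is
# full, and build the summary request as a fork of the existing
# messages with one instruction appended, so the prefix stays cached.

CONTEXT_WINDOW = 200_000
COMPACT_AT = 0.80  # trigger early so there is room to summarize properly

def should_compact(tokens_used: int) -> bool:
    return tokens_used / CONTEXT_WINDOW >= COMPACT_AT

conversation = ["<system prompt>", "<CLAUDE.md>", "msg 1", "msg 2", "msg 3"]

# Right way: fork the conversation, append one instruction at the end.
fork = conversation + ["Summarize our conversation so far."]
assert fork[: len(conversation)] == conversation  # prefix intact -> cache hit

# Wrong way: a fresh stripped-down request shares no prefix at all.
naive = ["Summarize this transcript: " + " ".join(conversation)]
assert naive[0] != conversation[0]  # cold cache at the worst moment

print(should_compact(160_000), should_compact(120_000))  # True False
```

The fork costs almost nothing because every message before the appended instruction is a cache read; the naive version pays full price for the entire transcript.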
Quick Reference: Do’s and Don’ts
✅ DO:
Keep core instructions consistent across sessions — don’t rewrite them each time
Start a new session when switching to a completely different task
Use CLAUDE.md (or equivalent project files) to store stable project context
Let Claude enter Plan Mode on its own — it knows when it’s useful
Trust compaction — it’s working hard behind the scenes to keep your session going
❌ DON’T:
Switch models mid-conversation to save money — it usually costs more
Add timestamps or dynamic info to your stable instructions
Pivot dramatically within a long session — start fresh instead
Manually restart when Claude compacts — let it handle the transition
The One-Sentence Mental Model
Put stable things first, changing things last — and never switch models mid-conversation.
Everything else in Claude’s caching strategy flows from this. The engineers at Anthropic didn’t figure this out from documentation — they figured it out from expensive production mistakes. Now you don’t have to make the same ones.

