Design the Context Stack First, Then Write the Prompt
- Distinguish the five layers of a production context stack and where prompt engineering fits inside them
- Identify and avoid the six most common context engineering failure modes
- Apply context topology decisions (retrieval shape, memory boundaries, output contracts) to a real agent setup
Context engineering and prompt engineering are not competing disciplines — context engineering is the architecture that contains prompt engineering. In 2026, the prompt itself is one instruction file inside a larger system of retrieval shape, memory boundaries, trust levels, and output contracts. Operators who understand this distinction build reliable agents; those who don't keep chasing phrasing instead of fixing the context stack.
The common framing — "prompt engineering is dead, context engineering is the new thing" — is wrong in a useful way. Prompt engineering isn't dead. It moved. A practitioner in r/PromptEngineering described the shift precisely: "the prompt itself takes an afternoon to dial in" — the hard problems are context window management, retry logic, output validation, and the gap between dev and prod behavior. The phrasing is easy. The surrounding system is the actual source of unreliability.
What Changes When You Build Agents
Single-turn prompts hide the problem. When the model answers once and the conversation ends, bad context just means a bad answer — you iterate the phrasing.
Agents break that feedback loop. As OpenAI's Agents SDK documentation explains, agent applications "plan, call tools, collaborate across specialists" across multiple turns, with the SDK owning orchestration, tool execution, approvals, and state. At that point, the question is no longer "what sentence do I type?" It is "what contract do I enforce between the model, the tools, and the rest of the system?"
Another practitioner in r/PromptEngineering put it plainly: context engineering means "managing the entire context state" — instructions, tools, frameworks, MCP connections, external data, message history, and behavior rules. That is an architecture problem, not a phrasing problem.
The community has mostly internalized this. Hacker News discussion on the topic converged on the same framing: the useful distinction is not "prompting versus magic" but "instructions versus the context you deliberately assemble around them."
The Five Layers of a Context Stack
A production context stack has five distinct layers. Prompt engineering lives in layer one. The other four are where most reliability work happens.
1. Instruction files (the prompt layer): This is the CLAUDE.md, the system prompt, the agents.json. It defines scope, persona, exclusions, and rules. The key insight from r/ClaudeAI's parallel-agent thread is that parallel agents only work reliably when there are "detailed specs that tell each agent exactly where to look and what not to touch." The instruction file is not a one-off prompt; it is a persistent control surface that encodes conventions across sessions and across the entire team. HN's Prompt Contracts thread formalizes this as a structured framework: instructions are contractual, not conversational.
2. Retrieval shape: Instead of stuffing everything into the prompt, context engineering decides what comes from which retrieval path. OpenAI's file search lets models "search your files for relevant information" before responding. Web search adds live sourcing. The architecture question is: what belongs in the stable prefix, what should be retrieved on demand, and what should never enter the window at all? This is a topology decision, not a wording decision.
3. Prefix structure for caching: OpenAI's prompt caching documentation is blunt: cache hits require "exact prefix matches." This means instructions, examples, and policy should be front-loaded, while variable user-specific state belongs at the tail. Getting this wrong means paying full token cost on every call. Getting it right cuts latency and cost without touching model quality. That tradeoff is invisible if you think in prompts rather than context windows.
4. Output contracts: OpenAI's structured outputs ensure responses "adhere to a JSON schema." This is the cleanest teaching bridge between prompt engineering and context engineering: prompts tell the model what to do, but schemas tell the system what counts as a valid result and who owns the next step. Without an output contract, a semantically correct answer can be structurally unusable, and you won't know until the next agent in the pipeline fails.
5. Memory and trust boundaries: This is the layer that separates context engineering from advanced prompting. Anthropic's London 2026 recap describes Claude Managed Agents running in "self-hosted sandboxes" with private MCP tunnels, bringing the execution boundary and data boundary closer to the enterprise. Google's Gemini CLI Trusted Folders documentation shows the inverse: when a folder is untrusted, automatic memory loading is disabled, workspace settings are ignored, and tool auto-acceptance is off. Memory is a trust decision, not a convenience feature. Context engineering decides what to load automatically, what to require explicit approval for, and what to exclude entirely.
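To make the prefix-versus-tail split concrete, here is a minimal Python sketch of context assembly. It is not any vendor's API: the `ContextStack` class, its field names, and the sample strings are all hypothetical, and the fingerprint is only a local check that the stable part stays byte-identical across calls, which is the precondition for exact-prefix cache hits.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ContextStack:
    """Hypothetical container for the pieces of one model call."""
    instructions: str                                    # layer 1: stable across calls
    examples: list[str] = field(default_factory=list)    # stable few-shot examples
    retrieved: list[str] = field(default_factory=list)   # layer 2: fetched on demand
    user_state: str = ""                                 # variable, per request

    def stable_prefix(self) -> str:
        # Everything that is identical across requests goes first.
        return "\n\n".join([self.instructions, *self.examples])

    def variable_tail(self) -> str:
        # Retrieved snippets and per-user state change per call, so they go last.
        return "\n\n".join([*self.retrieved, self.user_state])

    def render(self) -> str:
        return self.stable_prefix() + "\n\n" + self.variable_tail()

    def prefix_fingerprint(self) -> str:
        # If this hash changes between calls, the prefix is no longer byte-identical.
        digest = hashlib.sha256(self.stable_prefix().encode("utf-8"))
        return digest.hexdigest()[:12]


# Two requests that share instructions and examples but differ in the tail.
base = dict(
    instructions="You are a billing support agent. Never modify invoices.",
    examples=["Q: refund status? A: look up the order first."],
)
call_a = ContextStack(**base, retrieved=["Invoice 1001: paid"], user_state="User: alice")
call_b = ContextStack(**base, retrieved=["Invoice 2002: overdue"], user_state="User: bob")

print(call_a.prefix_fingerprint() == call_b.prefix_fingerprint())
```

Logging the fingerprint next to each request is one cheap way to notice when a timestamp or user name has crept into the part of the context that was supposed to stay fixed.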
The Six Failure Modes That Prompt Wording Can't Fix
These failures are architectural. No amount of phrasing improvement resolves them.
- Stale context — old history not compacted; the model acts on outdated state
- Wrong retrieval shape — too much noise, too little signal; the model hallucinates to fill the gap
- Instruction conflict — repo rules, system prompt, and tool instructions disagree; the model resolves it unpredictably
- Trust leakage — untrusted files or environment variables load into a session that should have been restricted
- Output drift — semantically correct but structurally unusable; the downstream system rejects the response
- Tool overload — too many available tools erode selection quality; routing and scoped tool exposure matter
Each of these has a context-level fix. OpenAI's evals documentation describes evals as a way to "test model outputs" against specified criteria, which makes the eval harness the place to catch these failures before users do. (The next post in this series covers eval and regression harnesses for context stacks specifically.)
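As one illustration of a context-level fix, the sketch below catches output drift at the pipeline boundary, before the next step consumes the response. It uses only the Python standard library and a hand-rolled check rather than a real JSON Schema validator; the contract, field names, and sample payloads are assumptions made up for the example.

```python
import json

# Hypothetical contract: field name -> required Python type.
TICKET_CONTRACT = {"ticket_id": str, "priority": str, "needs_human": bool}
ALLOWED_PRIORITIES = {"low", "medium", "high"}


def check_output(raw: str) -> list[str]:
    """Return contract violations; an empty list means the output is usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    problems = []
    for name, expected in TICKET_CONTRACT.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")

    # Fields the downstream consumer does not know about also count as drift.
    extra = sorted(set(data) - set(TICKET_CONTRACT))
    if extra:
        problems.append(f"unexpected fields: {', '.join(extra)}")

    priority = data.get("priority")
    if isinstance(priority, str) and priority not in ALLOWED_PRIORITIES:
        problems.append(f"priority not in {sorted(ALLOWED_PRIORITIES)}")
    return problems


good = '{"ticket_id": "T-1", "priority": "high", "needs_human": false}'
drifted = '{"ticket": "T-1", "priority": "urgent", "needs_human": "no"}'

print(check_output(good))
for problem in check_output(drifted):
    print(problem)
```

The drifted payload is a plausible answer to a human reader, which is the point: the failure is structural, so it has to be caught by a check on structure rather than by rewording the prompt.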
A Practical Operator's Checklist
Before touching prompt phrasing on a failing agent, walk the context stack first:
```
[ ] Instruction file: Is the scope explicit? Does it name what the agent must NOT touch?
[ ] Retrieval shape: Does the task require file search, web search, or neither?
[ ] Prefix structure: Are instructions front-loaded for cache hits?
[ ] Output contract: Is there a JSON schema enforcing the response shape?
[ ] Memory policy: What loads automatically? What is explicitly excluded?
[ ] Tool scope: Is the tool list as small as the task requires?
```
If any of these is missing or wrong, fixing the prompt won't help. The checklist is the context engineering work. The prompt is what you tune after the stack is stable.
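The same walk-through can be run as a preflight check before an agent ships. This is a rough sketch under stated assumptions: `AgentConfig`, its fields, and the tool budget of 8 are hypothetical stand-ins for whatever configuration a given setup actually has, not part of any SDK.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentConfig:
    """Hypothetical agent configuration; field names are illustrative only."""
    instructions: str = ""
    forbidden_paths: list = field(default_factory=list)  # what the agent must NOT touch
    retrieval: Optional[str] = None            # "file_search", "web_search", or "none"
    instructions_first: bool = False           # stable prefix precedes variable state
    output_schema: Optional[dict] = None       # JSON schema for the response
    auto_loaded_memory: Optional[list] = None  # None means the policy was never decided
    tools: list = field(default_factory=list)
    max_tools: int = 8                         # arbitrary illustrative budget


def preflight(cfg: AgentConfig) -> list:
    """Return one message per checklist item that is missing or wrong."""
    failures = []
    if not cfg.instructions or not cfg.forbidden_paths:
        failures.append("instruction file: scope or exclusions missing")
    if cfg.retrieval not in {"file_search", "web_search", "none"}:
        failures.append("retrieval shape: not decided")
    if not cfg.instructions_first:
        failures.append("prefix structure: instructions not front-loaded")
    if cfg.output_schema is None:
        failures.append("output contract: no schema")
    if cfg.auto_loaded_memory is None:
        failures.append("memory policy: not decided")
    if len(cfg.tools) > cfg.max_tools:
        failures.append("tool scope: too many tools exposed")
    return failures


ready = AgentConfig(
    instructions="Triage support tickets in ./tickets only.",
    forbidden_paths=["./billing", "./secrets"],
    retrieval="file_search",
    instructions_first=True,
    output_schema={"type": "object"},
    auto_loaded_memory=[],  # an explicit decision to load nothing automatically
    tools=["search_tickets", "add_label"],
)

print(preflight(AgentConfig()))
print(preflight(ready))
```

Note that an empty memory list passes while `None` fails: the check is about whether the decision was made, not about which way it went.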
Knowledge Check
Which of the following is a context engineering decision rather than a prompt engineering decision?
A) Rewriting the system prompt to be more specific about tone
B) Moving instructions to the front of the prompt to improve cache hit rate
C) Adding "be concise" to the user message
D) Replacing "list" with "enumerate" in the task description
Correct answer: B. Front-loading instructions for cache hit rate is an architectural decision about context structure — it changes cost and latency, not just model behavior. Options A, C, and D are prompt engineering decisions.
Context engineering is not a rebrand. It is the acknowledgment that modern agent systems need deliberate control surfaces: instruction files that survive session churn, retrieval layers that load signal not noise, output schemas that define what valid means, and trust boundaries that decide what the model should never see. Prompt engineering lives inside that stack — but it is not the stack.
This is Part 1 of the Production Agent Engineering in 2026 series. Part 2 covers building and running the eval harness that catches context failures before users do — covering regression suites, context-stack evals, and the tools that make them worth running. For hands-on implementation of the production agent context patterns covered here, continue with Production Agents with Claude Agent SDK + MCP Connector or How to build a production Claude Agent SDK app in 6 chapters.