
Build production RAG by putting MCP connectors in front of retrieval, not inside every app

By Koenig AI Academy Editorial Team · May 12, 2026 · Reviewed by Koenig AI Academy editors
What you'll learn
  • Decide when retrieval should be exposed as an MCP Resource versus an MCP Tool
  • Design a production RAG path that budgets connector latency, scopes access, and keeps vector storage behind one protocol boundary

To build production RAG with MCP connectors in 2026, put a single MCP boundary in front of your knowledge systems, expose read-heavy retrieval through MCP primitives, and keep connector overhead inside your latency budget. MCP already gives you standardized read access through Resources and standardized action access through Tools (server concepts, Resources spec). OpenAI's Responses API now treats both maintained connectors and remote MCP servers as first-class mcp tools, while Cloudflare's Agents stack shows the complementary production pattern: durable application state plus vector retrieval behind one runtime boundary (OpenAI guide, Cloudflare Agents).

The part most teams miss is that MCP is not the retrieval algorithm. It is the control plane around retrieval. Your embeddings model, chunking policy, ACLs, metadata joins, and relevance evaluation still decide whether answers are good. What MCP changes is where that logic lives: instead of every chat app, workflow runner, and agent framework wiring its own auth, discovery, and retrieval plumbing, one MCP server or connector layer can present the same knowledge surface everywhere. That is why the real production win is operational consistency, not protocol novelty.

Use MCP Resources for read-heavy retrieval, and Tools only when the model must take action

For production RAG, the default read path should look like data access, not like tool execution. MCP's own server model draws that line clearly: Tools are active functions the model calls to perform actions, while Resources are passive, read-only data sources identified by URIs (server concepts). The Resources spec then gives you the exact mechanics: resources/list for discovery, resources/read for content fetches, URI templates for parameterized lookups, and optional subscriptions when the underlying data changes (Resources spec).
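To make that concrete, here is a minimal sketch of a retrieval-facing server using the MCP Python SDK's FastMCP helpers. The kb:// URI scheme and the in-memory document store are illustrative stand-ins for a real corpus; the mechanics are the ones the Resources spec describes above.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("knowledge")

# Illustrative stand-in for a real document store.
DOCUMENT_STORE = {"refund-policy": "Refunds are processed within 14 days..."}

# A static resource: clients discover it via resources/list, fetch via resources/read.
@mcp.resource("kb://policies")
def list_policies() -> str:
    """Plain-text index of available policy documents."""
    return "\n".join(sorted(DOCUMENT_STORE))

# A URI template: resources/read on kb://policies/{doc_id} binds the parameter.
@mcp.resource("kb://policies/{doc_id}")
def read_policy(doc_id: str) -> str:
    """Full text of one policy document."""
    return DOCUMENT_STORE[doc_id]

if __name__ == "__main__":
    mcp.run()  # stdio transport by default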

That maps unusually well to RAG. A knowledge collection, schema file, policy document, or queryable retrieval endpoint is usually read-only from the model's point of view. If you model those as Resources, you make the retrieval contract explicit: the application or client fetches context, and the model reasons over it. Save Tools for the cases where the model genuinely needs to do something active, such as re-indexing a corpus, kicking off a sync job, or fetching from a third-party system that is exposed only as callable actions. Teams that wrap every retrieval step as a generic tool call usually end up with vague permissions and hard-to-debug prompt behavior.
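Continuing the same sketch, an active operation like re-indexing belongs on the Tools side of the line. The job runner below is a stand-in; the point is the contract, not the implementation:

# An active operation that mutates state, so it is a Tool, not a Resource.
@mcp.tool()
def reindex_collection(collection: str) -> str:
    """Kick off a re-embedding job for one collection and return a job id."""
    job_id = f"reindex:{collection}:queued"  # stand-in for a real job runner
    return job_id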

Keep your vector store and document ACLs behind one MCP boundary

The strongest production pattern is to hide your retrieval internals behind one connector layer. Cloudflare's RAG guidance is a good example: use the agent's own SQL database as the source of truth, store embeddings in Vectorize or another vector database, query the vector index, then re-associate results with durable application data before returning context (Cloudflare RAG docs). That is the right architectural shape even if you are not using Cloudflare. The vector index should not be your whole application contract. It is one subsystem behind a stable interface.
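Stripped of any particular vendor, the shape looks like this. The sketch below assumes a generic vector client with a query method and an sqlite3 table of chunk metadata; both are illustrative, not a real Vectorize API:

import sqlite3

def retrieve_context(query_embedding, vector_index, db: sqlite3.Connection, top_k: int = 5):
    # 1. Nearest-neighbor search against the vector index. The .query() call and
    #    the match fields are assumptions about a generic ANN client.
    matches = vector_index.query(query_embedding, top_k=top_k)
    # 2. Re-associate each hit with durable application data: the SQL database,
    #    not the vector index, stays the source of truth.
    results = []
    for match in matches:
        row = db.execute(
            "SELECT doc_id, title, body FROM chunks WHERE chunk_id = ?",
            (match.id,),
        ).fetchone()
        if row is not None:
            results.append({"score": match.score, "doc_id": row[0], "title": row[1], "body": row[2]})
    return results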

OpenAI's MCP guide points in the same direction from the client side. The Responses API can talk to OpenAI-maintained connectors for services like Google Drive, SharePoint, Dropbox, Gmail, and Outlook, or to any remote MCP server that exposes a compatible surface (OpenAI guide). In practice, that gives you a clean split: use maintained connectors for systems of record you do not want to wrap yourself, and use your own MCP server for proprietary retrieval logic, private indexes, document ACL enforcement, and metadata joins. One agent can then ground answers across both without each runtime needing direct credentials to every backing store.
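A hedged sketch of that split with the OpenAI Python SDK: one maintained connector and one self-hosted remote MCP server in the same request. The server URL is hypothetical, and the connector_id value should be checked against the current OpenAI guide:

import os
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-4.1",
    tools=[
        {
            # OpenAI-maintained connector; verify connector_id against the guide.
            "type": "mcp",
            "server_label": "drive",
            "connector_id": "connector_googledrive",
            "authorization": os.environ["DRIVE_OAUTH_TOKEN"],  # user-scoped OAuth token
            "require_approval": "never",
        },
        {
            # Your own remote MCP server fronting the private index (hypothetical URL).
            "type": "mcp",
            "server_label": "knowledge",
            "server_url": "https://mcp.example.internal/sse",
            "require_approval": "never",
        },
    ],
    input="Summarize our current refund policy and cite the source document.",
)
print(resp.output_text)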

This is also the security argument for MCP in RAG. When you keep retrieval behind one server boundary, you can enforce access policy there instead of duplicating it across SDKs and frontends. The Resources spec explicitly calls out URI validation and access controls for sensitive resources (Resources spec). The protocol repository itself is now the shared spec and documentation hub, which matters because the same contract can be implemented consistently across stacks instead of remaining a one-vendor feature (MCP repository).
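What that enforcement can look like inside the server is a single authorization check that validates the URI scheme and intersects document ACLs with the caller's groups. This is a generic sketch: the ACL store and how caller identity reaches the handler are deployment-specific assumptions.

ALLOWED_SCHEMES = {"kb"}
# Illustrative ACL store; in production this would be your real policy source.
DOCUMENT_ACLS = {"refund-policy": {"support", "legal"}}

def authorize_read(scheme: str, doc_id: str, caller_groups: set[str]) -> bool:
    """Enforce URI validation and document ACLs at the single MCP boundary."""
    if scheme not in ALLOWED_SCHEMES:
        return False  # reject unexpected URI schemes outright
    acl = DOCUMENT_ACLS.get(doc_id)
    return acl is not None and bool(acl & caller_groups)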

Cache discovery and budget transport latency before you tune embeddings

Most production RAG bottlenecks are not where teams first look. The 2026 MCP roadmap puts transport evolution and scalability at the top of the protocol agenda, including stateless scaling and .well-known metadata for discovery, because remote MCP usage only works well if the control plane can scale like internet infrastructure rather than like a long-lived desktop session (2026 MCP roadmap). If your connector layer adds more latency than the retrieval it fronts, the protocol is not your feature anymore; it is your bottleneck.

OpenAI's guide documents a very practical latency lever here: the API first produces an mcp_list_tools item to discover the available surface, and as long as that item stays in the model's context, it does not need to fetch the tool list again at every turn (OpenAI guide). The API also supports allowed_tools filtering and deferred tool loading, which means you can reduce context bloat and connector chatter before you touch chunk sizes or rerankers (OpenAI guide).
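Both levers are visible in the request shape. The sketch below chains a second turn off the first response so the cached mcp_list_tools item carries forward, and narrows the surface with allowed_tools; the server URL and tool name are hypothetical:

from openai import OpenAI

client = OpenAI()

knowledge_server = {
    "type": "mcp",
    "server_label": "knowledge",
    "server_url": "https://mcp.example.internal/sse",  # hypothetical
    "allowed_tools": ["search_corpus"],  # expose only what this flow needs
    "require_approval": "never",
}

first = client.responses.create(
    model="gpt-4.1",
    tools=[knowledge_server],
    input="Where is our data-retention policy defined?",
)

# The mcp_list_tools item from the first turn stays in context, so the tool
# list is not re-fetched on the follow-up.
second = client.responses.create(
    model="gpt-4.1",
    tools=[knowledge_server],
    previous_response_id=first.id,
    input="And which team owns updating it?",
)
print(second.output_text)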

The production takeaway is simple. Measure p95 time from user question to retrieved context, not just vector query speed. A 40 ms ANN lookup does not help if you spend another 200 ms rediscovering capabilities or waiting on a badly placed remote server. The roadmap's emphasis on transport scale exists because protocol overhead becomes visible very quickly once retrieval moves off localhost (2026 MCP roadmap).
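Measuring that is a few lines, not a platform. A sketch, assuming a hypothetical retrieve_context_via_mcp() that runs the full path from question to returned context:

import statistics
import time

def question_to_context_p95(questions, retrieve_context_via_mcp):
    """p95 latency across the whole path: discovery, transport, ANN, joins."""
    latencies = []
    for q in questions:
        start = time.perf_counter()
        retrieve_context_via_mcp(q)  # hypothetical end-to-end retrieval call
        latencies.append(time.perf_counter() - start)
    # quantiles(n=20) yields 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(latencies, n=20)[-1]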

Judge production RAG on grounded answers and connector reliability at the same time

A production RAG system succeeds only when the answers are grounded and the retrieval path is dependable. MCP helps with the second half by making the retrieval surface inspectable and repeatable, but it does not remove the need to evaluate answer quality. In practice, you should score both layers: whether the returned context was relevant enough to support the final answer, and whether the connector path stayed reliable under real traffic.

That means tracking ordinary RAG questions alongside connector questions. Did the retrieved passages actually support the answer? Did the agent read the right resource or call the right connector? How often did auth fail, discovery retry, or context arrive too slowly to be useful? The nice thing about MCP here is not that it invents a new eval science. It is that standardized discovery, URIs, and server boundaries make those failures easier to isolate. When retrieval is buried inside five different application adapters, every incident looks unique. When it is fronted by one MCP layer, the failure modes become comparable.
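One way to keep both layers comparable is to log them in one record per question, so grounding failures and connector failures land in the same place. The field names here are illustrative:

from dataclasses import dataclass
from typing import Optional

@dataclass
class RagEvalRecord:
    question: str
    resource_uri: str               # which resource or connector served the context
    context_relevant: bool          # judge label: did the passages support the answer?
    answer_grounded: bool           # judge label: was the final answer supported?
    connector_error: Optional[str]  # auth failure, discovery retry, timeout, or None
    latency_ms: float               # question to context, end to end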


References

  1. MCP server concepts (modelcontextprotocol.io)
  2. MCP Resources specification (modelcontextprotocol.io)
  3. 2026 MCP roadmap (blog.modelcontextprotocol.io)
  4. OpenAI MCP guide for the Responses API (platform.openai.com)
  5. Cloudflare Agents documentation (developers.cloudflare.com)
  6. Cloudflare RAG guidance (developers.cloudflare.com)
  7. MCP specification repository (github.com)