How to Ship OpenAI Realtime Voice Agents to Production When Interruptions and Tools Matter (2026)
- Choose Realtime or a Whisper plus TTS pipeline from latency, cost, interruption, and tooling requirements.
- Design the first production pass around audio transport, VAD, interruption truncation, session rollover, and rate-limit observability.
- Estimate why longer Realtime calls become disproportionately expensive as conversation history grows.
To ship an OpenAI Realtime voice agent: pick Realtime over a Whisper+TTS pipeline only for live, interrupt-heavy, tool-using calls; choose 24kHz PCM16 (browser) or G.711 (telephony) transport; build interruption truncation before polishing prompts; instrument the full turn end-to-end; and plan session rollover before context cost balloons. Realtime is not a faster TTS; it is a speech-native stateful loop. The OpenAI Realtime API is the production choice for voice agents when callers need to interrupt naturally, the agent must call tools during the spoken turn, and the product needs a sub-second conversational feel. A Whisper plus LLM plus TTS pipeline is still cheaper and makes it easier to swap vendors, but it usually pushes teams toward 2-3 second turns and forces them to own endpointing, playback sync, and conversation repair.<CitationFootnote source="https://www.eesel.ai/blog/realtime-api-vs-whisper-vs-tts-api">Eesel comparison of Realtime versus Whisper/TTS pipelines</CitationFootnote>
The missed point is that Realtime is not a faster text-to-speech endpoint. Cartesia Sonic can be the faster pure TTS choice, with the research synthesis citing 40-90ms time to first audio against higher OpenAI TTS numbers.<CitationFootnote source="https://cartesia.ai/vs/cartesia-vs-openai-tts">Cartesia Sonic versus OpenAI TTS benchmark</CitationFootnote> Realtime wins a different contest: it makes speech input, model reasoning, tool calls, speech output, turn detection, and interruption repair part of one stateful loop.
How to Choose Realtime Over Whisper+TTS for Conversational Workloads
```mermaid
flowchart TD
A[Voice Agent Use Case] --> B{Live interruptions needed?}
B -->|Yes| C{Tool calls mid-turn?}
B -->|No| H["Whisper + LLM + TTS Pipeline\nCheaper · 2–3 s turns\nOwn endpointing + playback sync"]
C -->|Yes| D{Sub-second feel required?}
C -->|No| H
D -->|Yes| E["✅ OpenAI Realtime API\nWebSocket stateful loop\n24 kHz PCM16 or G.711"]
D -->|No| H
E --> F["Design checklist:\n• Interruption truncation first\n• Session rollover plan\n• Rate-limit observability"]
H --> G["Design checklist:\n• Whisper batch pipeline\n• Push-to-talk or pause-OK UX"]
```
Alt: Decision flowchart for choosing between OpenAI Realtime API and a Whisper+TTS pipeline based on live interruptions, mid-turn tool calls, and latency requirements.
Realtime is worth the premium when the product promise is "talk to the agent." OpenAI's current Realtime model page describes gpt-realtime-2 as a speech-to-speech model for complex voice-agent workflows with configurable reasoning effort, tool use, and Realtime endpoints; the developer portal also foregrounds new Realtime voice, translation, and transcription models.<CitationFootnote source="https://developers.openai.com/api/docs/models/gpt-realtime-2">OpenAI gpt-realtime-2 model reference</CitationFootnote><CitationFootnote source="https://developers.openai.com/">OpenAI developer portal Realtime model overview</CitationFootnote> That is the right mental model: this is a runtime for live speech applications, not a media conversion API.
Use Realtime for inbound support, tutoring, live translation, scheduling, IVR, sales intake, and any workflow where the user will talk over the assistant. The pipeline version of the same system has three serial boxes: STT, reasoning, and TTS. That is useful when the work is batch transcription, a narrated report, voicemail summarization, or a push-to-talk tool where a pause is acceptable. It is the wrong default when the user expects a human-like turn.
The Realtime surface is no longer just a launch demo; it has a model family, production transports, and SDK-level support that changed the production calculus in 2026. OpenAI's official Python SDK describes the Realtime API as a low-latency multimodal WebSocket interface for text, audio, and function calling, while the May 26 Agents Python SDK v0.17.4 release added support for Realtime custom voice objects and fixed adjacent tool/MCP reliability issues.<CitationFootnote source="https://github.com/openai/openai-python/blob/main/api.md">OpenAI Python SDK Realtime API surface</CitationFootnote><CitationFootnote source="https://github.com/openai/openai-agents-python/releases/tag/v0.17.4">OpenAI Agents Python v0.17.4 release notes</CitationFootnote> That matters because most useful voice agents are tool-using agents, not isolated speakers.
Measure the full turn, not only TTS latency
The benchmark that matters is the time from the end of user speech to a useful assistant response, not only TTS time to first audio. The research synthesis cites roughly 500ms Realtime TTFB in US conditions and an 800ms target for conversational quality, while traditional STT to LLM to TTS systems are commonly described as taking around 2-3 seconds end to end.<CitationFootnote source="https://www.latent.space/p/realtime-api">Latent Space Realtime API production notes</CitationFootnote>
That does not make TTS benchmarks irrelevant. It means they answer a narrower question. Cartesia can be the better component when the job is pure speech playback, custom voice output, or ultra-low-latency narration. Realtime should be judged on the full loop: endpointing, reasoning, tool latency, audio generation, playback, interruption response, and state repair.
Instrument those as separate timestamps. Log user speech end, VAD decision, first model response event, first audio delta, playback start, interruption event, tool call start, tool call finish, truncate event, rate-limit headers, and session close. Without those events, every complaint becomes "the model is slow," even when the real problem is Bluetooth latency, client buffering, overly cautious VAD, or a tool call hidden behind a spoken filler phrase.
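One way to keep those timestamps separable is a small per-turn recorder. The sketch below is illustrative only: `TurnTimeline`, the mark names, and the derived metric names are hypothetical and not part of any SDK. It assumes your client calls `mark()` whenever it observes the corresponding event.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TurnTimeline:
    """Named timestamps for one conversational turn (hypothetical helper)."""
    marks: Dict[str, float] = field(default_factory=dict)

    def mark(self, name: str, at: Optional[float] = None) -> None:
        # Keep only the first occurrence, e.g. the first audio delta of a response.
        self.marks.setdefault(name, time.monotonic() if at is None else at)

    def gap_ms(self, start: str, end: str) -> Optional[float]:
        # None means one side was never observed, which is itself worth logging.
        if start not in self.marks or end not in self.marks:
            return None
        return (self.marks[end] - self.marks[start]) * 1000.0

    def report(self) -> Dict[str, Optional[float]]:
        # Each gap isolates one suspect: VAD, model, client buffering, or a tool.
        return {
            "endpointing_ms": self.gap_ms("user_speech_end", "vad_decision"),
            "first_audio_ms": self.gap_ms("user_speech_end", "first_audio_delta"),
            "client_buffer_ms": self.gap_ms("first_audio_delta", "playback_start"),
            "tool_ms": self.gap_ms("tool_call_start", "tool_call_finish"),
        }


# Usage with explicit timestamps (seconds) so the example is deterministic.
turn = TurnTimeline()
turn.mark("user_speech_end", at=10.0)
turn.mark("vad_decision", at=10.25)
turn.mark("first_audio_delta", at=10.5)
turn.mark("playback_start", at=10.75)
print(turn.report())
```

Splitting the turn this way lets a slow `endpointing_ms` point at VAD settings and a slow `client_buffer_ms` point at the audio player, rather than blaming the model for both.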
```mermaid title="OpenAI Realtime API WebSocket conversation turn with tool call and interruption"
sequenceDiagram
participant User as User (mic)
participant Client as Client
participant Model as OpenAI Realtime API
participant Tool as CRM Tool
User->>Client: speaks (audio in)
Client->>Model: audio chunks (PCM16 / G.711)
Model-->>Client: speech_started → speech_stopped (VAD)
Model-->>Client: response.audio.delta (audio out begins)
Client->>User: plays audio
User->>Client: interrupts mid-playback
Client->>Model: conversation.item.truncate
Model-->>Client: conversation.item.truncated
Note over Client,Model: Server state synced to what user heard
Model-->>Client: tool_call (CRM lookup)
Client->>Tool: execute()
Tool-->>Client: result
Client->>Model: tool_result
Model-->>Client: response.audio.delta (resumed)
Client->>User: plays continued response
```
*Alt: Sequence diagram of an OpenAI Realtime API WebSocket turn showing user audio in, VAD decision, audio playback, mid-playback interruption with `conversation.item.truncate`, tool dispatch and result, and resumed audio, illustrating the full stateful loop that requires synchronised client and server state.*
Build around interruptions before you polish the prompt
Interruption handling is where many demos become fragile. When the user talks over the model, the client must stop playback and the application must repair the model's conversation state. The Azure OpenAI Realtime audio reference describes conversation.item.truncate as the client event used to truncate a previous assistant message's audio, and OpenAI's official SDK exposes the matching ConversationItemTruncateEvent type.<CitationFootnote source="https://learn.microsoft.com/en-us/azure/ai-services/openai/realtime-audio-reference">Azure OpenAI Realtime audio event reference for conversation.item.truncate</CitationFootnote><CitationFootnote source="https://github.com/openai/openai-python/blob/main/api.md">OpenAI Python SDK Realtime event types</CitationFootnote> The production requirement is the same: server-side state should match what the user actually heard.
That is not cosmetic. If the assistant says, "Your appointment is at four, and the confirmation code is..." but the user interrupts after "Your appointment," the next turn must not assume the user heard the time or code. A production agent needs audio-player cancellation, conversation truncation, and tool-side idempotency in the same path.
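The cancellation-then-truncation path can be sketched as follows. This is a minimal illustration, assuming 24 kHz mono PCM16 playback and a hypothetical `player` object whose `stop()` returns the number of bytes actually played; `send` stands in for whatever function writes a message to your Realtime connection. The `item_id`, `content_index`, and `audio_end_ms` fields follow the `conversation.item.truncate` event as documented in the references cited above, and should be checked against the current reference before use.

```python
import json

SAMPLE_RATE_HZ = 24_000  # 24 kHz, the standard Realtime audio rate named in this article
BYTES_PER_SAMPLE = 2     # PCM16; mono is an assumption of this sketch


def played_ms(bytes_played: int) -> int:
    """Convert bytes that reached the speaker into milliseconds of audio."""
    samples = bytes_played // BYTES_PER_SAMPLE
    return samples * 1000 // SAMPLE_RATE_HZ


def build_truncate_event(item_id: str, bytes_played: int, content_index: int = 0) -> str:
    """Serialize a truncate event that reflects what the user actually heard."""
    return json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": content_index,
        "audio_end_ms": played_ms(bytes_played),
    })


def on_user_interrupt(player, send, item_id: str) -> None:
    # 1) Silence local playback first so the user stops hearing the assistant.
    bytes_played = player.stop()
    # 2) Then tell the server how much of the assistant item was heard.
    send(build_truncate_event(item_id, bytes_played))
```

Counting bytes at the player, not bytes received from the server, is the design choice that matters here: buffered audio the user never heard should not remain in the model's view of the conversation.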
Audio transport is the other early decision. The synthesis names 24kHz PCM16 base64 chunks for standard Realtime audio and G.711 for telephony. Browser clients should usually use a WebRTC media path to your server, while your backend owns the OpenAI connection and policy decisions. SIP and phone audio deserve their own path rather than being treated like a browser microphone with worse quality.
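For the WebSocket path, the 24kHz PCM16 base64 framing might look like the sketch below. It assumes mono audio, an arbitrary 100 ms chunk size, and the `input_audio_buffer.append` client event with a base64 `audio` field; confirm the event shape against the current Realtime reference. The sine generator exists only to produce test audio.

```python
import base64
import json
import math
import struct
from typing import Iterator

SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2   # PCM16, mono (assumption)
CHUNK_MS = 100         # assumed chunk size; tune for your latency budget


def sine_pcm16(duration_ms: int, freq_hz: float = 440.0) -> bytes:
    """Generate little-endian PCM16 test audio."""
    n = SAMPLE_RATE_HZ * duration_ms // 1000
    samples = [int(12000 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ))
               for i in range(n)]
    return struct.pack(f"<{n}h", *samples)


def append_events(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[str]:
    """Yield one JSON event per chunk of raw PCM16 audio."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for offset in range(0, len(pcm), chunk_bytes):
        chunk = pcm[offset:offset + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


events = list(append_events(sine_pcm16(250)))
print(len(events), "events")
```

Telephony audio in G.711 would need its own encoder and sample rate, which is one reason to keep it on a separate path instead of reusing this framing.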
Prompt: Recommend OpenAI Realtime API or a Whisper plus TTS pipeline. Include the first five production controls the team should implement.

Expected output: Recommend OpenAI Realtime for the live, interrupt-heavy, tool-using call path. The answer should warn that cost rises with session length and name controls such as VAD tuning, interruption truncation, session summaries/rollover, rate-limit logging, and a cheaper async pipeline fallback.

Prompt: Recommend where to use OpenAI Realtime and where to use a cheaper pipeline. Include the cost-control mechanisms you would require before launch.

Expected output: Use OpenAI Realtime for the live call path because interruption, tool use, and sub-second turn-taking matter. Route asynchronous follow-up through STT/text/TTS or text-only processing. Require session summarization, session rollover, production/development project isolation, rate-limit logging, context pruning, and a fallback path for non-live work.
Budget for session growth instead of per-minute media
Realtime costs are easy to underestimate because a live call is not just one audio input and one audio output. The research synthesis cites Realtime input audio at $32 per 1M input audio tokens, cached input at a steep discount, and output audio at $64 per 1M output audio tokens; it also summarizes estimates of about $0.11 for a 1-minute session, $0.92 for 5 minutes, and $5.28 for 15 minutes as history accumulates.<CitationFootnote source="https://www.latent.space/p/realtime-api">Latent Space Realtime API cost breakdown</CitationFootnote>
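A toy model shows why the growth is faster than linear. It assumes every turn re-bills the accumulated history as input audio, ignores the cached-input discount, and uses made-up per-turn token counts; it is not an attempt to reproduce the cited estimates, only the shape of the curve.

```python
INPUT_USD_PER_TOKEN = 32 / 1_000_000   # cited input audio price
OUTPUT_USD_PER_TOKEN = 64 / 1_000_000  # cited output audio price


def session_cost(turns: int, user_tokens: int = 300, assistant_tokens: int = 600) -> float:
    """Estimated USD cost when each turn resends the whole prior conversation.

    user_tokens and assistant_tokens are hypothetical per-turn audio token counts.
    """
    total = 0.0
    history = 0
    for _ in range(turns):
        input_tokens = history + user_tokens           # prior turns plus new user audio
        total += input_tokens * INPUT_USD_PER_TOKEN
        total += assistant_tokens * OUTPUT_USD_PER_TOKEN
        history += user_tokens + assistant_tokens      # both sides join the history
    return total


for n in (2, 10, 30):
    print(n, "turns ->", round(session_cost(n), 4), "USD")
```

Under these assumptions the input term grows with the square of the turn count, so a 10-turn session costs roughly 14 times a 2-turn session instead of 5 times.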
```mermaid title="OpenAI Realtime session cost growth by call duration — $0.11 for 1 min, $0.92 for 5 min, $5.28 for 15 min"
xychart-beta
title "Realtime Session Estimated Cost by Duration (conversation history accumulates)"
x-axis ["1 min", "5 min", "15 min"]
y-axis "Estimated cost (USD)" 0 --> 6
bar [0.11, 0.92, 5.28]
```
Alt: Bar chart showing OpenAI Realtime API estimated session cost at $0.11 for a 1-minute call, $0.92 for 5 minutes, and $5.28 for 15 minutes — illustrating the disproportionate cost growth as conversation history accumulates across a call.
That curve is the reason the first production version needs summarization and rollover logic. Long calls should periodically compress prior turns, store tool state outside the model session, prune irrelevant items, and route non-live follow-up work through a cheaper asynchronous path. Rate limits are part of cost control too: OpenAI's gpt-realtime-2 model reference lists tiered request and token limits, so teams should log remaining/reset headers where exposed and isolate production projects from development traffic.<CitationFootnote source="https://developers.openai.com/api/docs/models/gpt-realtime-2">OpenAI gpt-realtime-2 pricing and rate-limit reference</CitationFootnote>
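A rollover policy along those lines could be sketched as below. Everything here is an assumption for illustration: the `SessionState` shape, the four-characters-per-token estimate, the age and token thresholds, and the `summarize` callable, which in production would likely be a cheaper text-model call. Tool state is deliberately absent because it is meant to live outside the model session.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    # Crude assumption: about four characters per token.
    return max(1, len(text) // 4)


@dataclass
class SessionState:
    started_at_s: float
    turns: List[str] = field(default_factory=list)

    @property
    def history_tokens(self) -> int:
        return sum(estimate_tokens(t) for t in self.turns)


def should_roll_over(state: SessionState, now_s: float,
                     max_age_s: float = 12 * 60,
                     max_history_tokens: int = 20_000) -> bool:
    """Roll over on age or on history size, whichever trips first."""
    too_old = now_s - state.started_at_s >= max_age_s
    too_big = state.history_tokens >= max_history_tokens
    return too_old or too_big


def roll_over(state: SessionState, now_s: float,
              summarize: Callable[[List[str]], str],
              keep_last: int = 2) -> SessionState:
    """Seed a fresh session with a summary of older turns plus the latest turns."""
    split = max(0, len(state.turns) - keep_last)
    older, recent = state.turns[:split], state.turns[split:]
    seed = ([summarize(older)] if older else []) + recent
    return SessionState(started_at_s=now_s, turns=seed)
```

Keeping the last couple of turns verbatim is a judgment call: it preserves the immediate conversational thread while the summary carries the rest.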
The instruction and session limits are also operational design constraints. The research synthesis cites a 15-minute session cap and a community-reported 16,384-token instruction/tool-schema pressure point for production voice agents.<CitationFootnote source="https://community.openai.com/t/realtime-api-instruction-limit-16-384-tokens-is-too-low-for-production-voice-agents-with-tool-calling/1378932">OpenAI Community discussion of Realtime instruction limits</CitationFootnote> Whether or not that exact ceiling changes, the product lesson is stable: do not load every policy, tool, and workflow into every call. Keep fast tools always available, load specialized tools on demand, and summarize before reconnecting.
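Loading tools on demand can be as simple as a budgeted selection step before each session is configured. The sketch below is hypothetical throughout: the tool dictionaries are illustrative shapes, `estimate_tokens` is a rough character-based guess, and the budget constant simply reuses the community-reported figure as a placeholder.

```python
import json
from typing import Dict, List

INSTRUCTION_BUDGET_TOKENS = 16_384  # community-reported figure; treat as an assumption


def estimate_tokens(schema: dict) -> int:
    # Rough guess: serialized length divided by four characters per token.
    return len(json.dumps(schema)) // 4 + 1


def select_tools(core: List[dict],
                 specialized: Dict[str, List[dict]],
                 intents: List[str],
                 budget: int = INSTRUCTION_BUDGET_TOKENS) -> List[dict]:
    """Always include core tools, then add intent-specific tools while budget allows."""
    chosen = list(core)
    used = sum(estimate_tokens(t) for t in chosen)
    for intent in intents:
        for tool in specialized.get(intent, []):
            cost = estimate_tokens(tool)
            if used + cost > budget:
                continue  # skip tools that would overflow the budget
            chosen.append(tool)
            used += cost
    return chosen


core = [{"type": "function", "name": "lookup_customer"}]
specialized = {
    "billing": [{"type": "function", "name": "refund_order"}],
    "shipping": [{"type": "function", "name": "track_package"}],
}
print([t["name"] for t in select_tools(core, specialized, ["billing"])])
```

The intent list would come from whatever routing signal you already have, such as the dialed number, an IVR selection, or the summary carried over from a previous session.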
Keep a pipeline fallback even when Realtime is the main path
The clean production architecture is not Realtime everywhere. It is Realtime for live conversation, a cheaper STT/text/TTS pipeline for asynchronous work, and a specialist TTS provider when the job is just speech rendering. That keeps the premium path focused on its actual advantage: speech-native agency.
The decision tree is short. Choose Realtime when the user can interrupt, the agent calls tools, telephony or live browser audio matters, and the perceived turn needs to land under about a second. Choose a pipeline when the job is batch, retryable, cost-sensitive, or vendor-specific for STT/TTS. Test Cartesia when pure time-to-first-audio matters more than reasoning or tools.<CitationFootnote source="https://cartesia.ai/vs/cartesia-vs-openai-tts">Cartesia benchmark for pure TTS latency</CitationFootnote>
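Expressed as code, that routing rule is only a few lines. The `Workload` fields and `Path` names below are hypothetical labels for the criteria in this article, and the strict all-three-conditions test for Realtime mirrors the flowchart earlier in the piece; a real router would probably weigh cost and vendor constraints as well.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    REALTIME = "realtime"
    PIPELINE = "stt_llm_tts_pipeline"
    SPECIALIST_TTS = "specialist_tts"


@dataclass(frozen=True)
class Workload:
    live_interruptions: bool
    tools_mid_turn: bool
    needs_sub_second_turns: bool
    speech_output_only: bool = False  # pure rendering, no reasoning or tools


def choose_path(w: Workload) -> Path:
    if w.speech_output_only:
        # Pure time-to-first-audio jobs go to a dedicated TTS provider.
        return Path.SPECIALIST_TTS
    if w.live_interruptions and w.tools_mid_turn and w.needs_sub_second_turns:
        return Path.REALTIME
    # Batch, retryable, or pause-tolerant work takes the cheaper pipeline.
    return Path.PIPELINE


print(choose_path(Workload(True, True, True)).value)
```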
The practical takeaway: ship OpenAI Realtime when speech is the interface, not merely the file format. Keep Whisper plus TTS for cheaper asynchronous work, benchmark Cartesia for pure speech output, and make interruption repair, session rollover, rate limits, and cost logging part of version one. For hands-on agent orchestration patterns, continue with OpenAI Agents SDK Mastery: Build Production-Ready Autonomous Systems; for voice-agent production practice, pair it with voice-agents-production and Picking a Frontier Model: Opus 4.7 vs GPT-5.5 vs Gemini 3.1 Pro — A Builder's Benchmark Guide.