Use OpenAI Realtime API when voice agents need interruptions, tools, and sub-second turns (2026)
- Choose between OpenAI Realtime API and a self-assembled Whisper plus TTS pipeline using latency, cost, and product requirements.
- Identify the production gotchas that break voice agents: PCM16 and G.711 audio chunks, interruption truncation, rate limits, and 15-minute sessions.
- Estimate Realtime API session cost and compare it with Cartesia Sonic 3.5 and lower-cost STT plus TTS pipelines.
OpenAI Realtime API is the better production choice for voice agents when the user experience depends on natural interruptions, speech-to-speech latency around 500-800ms, live tool calls, and phone or browser audio transport. A self-assembled Whisper plus LLM plus TTS pipeline is still cheaper and more modular, but it usually lands closer to 2-3 seconds end to end and makes your team own turn detection, playback sync, and conversation repair.[1][4]
The contrarian point is that Realtime is not primarily a faster TTS product. Cartesia Sonic 3.5 can beat OpenAI on pure time-to-first-audio, with the synthesis citing 40ms Turbo and 90ms Sonic latency against OpenAI TTS around 199ms.[5][7] Realtime wins a different contest: it turns voice into one stateful session where audio input, model reasoning, speech output, tool calls, interruption handling, and telephony formats can move together.
Choose Realtime for live conversation; choose pipelines for cheap modular speech
The old voice-agent stack has three obvious boxes: speech-to-text, reasoning, and text-to-speech. A user speaks, Whisper or another STT engine transcribes the utterance, an LLM decides what to do, and a TTS engine speaks the response. That architecture is understandable, debuggable, and vendor-flexible. It is also why many early voice agents felt like voice wrappers around chatbots rather than conversations.
The synthesis summarizes the practical gap: traditional STT to LLM to TTS chains tend to produce 2-3 seconds of end-to-end latency, lose prosody and emotion between transcription and response, and handle interruptions poorly.[4] Once speech has been flattened into text, the model no longer knows whether the user sounded confused, amused, urgent, or halfway through a sentence. You can reconstruct some of that with metadata, confidence scores, or custom prompts, but then you are rebuilding a speech-native interaction loop out of separate services.
Realtime API changes the integration boundary. The synthesis describes a single Realtime endpoint for gpt-realtime-2, native audio I/O, server VAD and endpointing, and tool calls mid-stream.[2][4] Instead of treating audio as a preprocessing step, Realtime treats speech as part of the model session. That matters for live support, tutoring, translation, scheduling, IVR, sales intake, and any agent that must respond before the user feels the turn has died.
This does not make Whisper plus TTS obsolete. It narrows where it belongs. Use a pipeline when the job is batch transcription, voicemail summarization, narrated report generation, asynchronous coaching feedback, or a push-to-talk workflow where users expect a pause. Use Realtime when the product promise is "talk to this agent" rather than "submit audio and receive a spoken answer."
Measure round-trip TTFB, not just TTS TTFA
The most common benchmark mistake is comparing only TTS time-to-first-audio. TTFA is useful when you are buying a TTS engine, but it is incomplete for a voice agent. A real user waits through capture, endpointing, model reasoning, audio generation, playback buffering, and sometimes network jitter. The metric you need is round-trip time to first meaningful response.
The synthesis gives the headline numbers: Realtime around 500ms TTFB in US conditions, with an 800ms target for end-to-end conversational quality.[4] It contrasts that with a traditional pipeline around 2-3 seconds, typically composed of STT latency, LLM latency, and TTS latency.[4] The exact numbers will move with region, model, device, transport, and utterance length, but the architectural difference is stable: a chained pipeline has serial stages, while Realtime can behave like a speech-native session.
Cartesia Sonic 3.5 is still the baseline worth respecting. The Cartesia comparison in the synthesis cites 40ms Turbo and 90ms Sonic latency, while Cartesia's Sonic page positions it for streaming TTS with emotes, laughter, 40+ languages, and a roughly $0.03/min TTS-only price.[5][7] Our earlier latency article, voice-agents-2026-tts-latency-benchmark, makes the same product-level point: Cartesia can be the fastest paid TTS choice, but the voice-agent bottleneck is often the full turn, not the final audio renderer.
That means you should instrument four intervals in production, each derived from a pair of logged timestamps:
| Measurement | What it tells you | Typical fix |
|---|---|---|
| User speech end to model response start | Endpointing and reasoning latency | Tune VAD, lower reasoning.effort, shorten prompts |
| First audio delta to playback start | Client buffering and audio output latency | Reduce buffer size, avoid TCP stalls on client audio |
| User interruption to assistant stop | Barge-in quality | Wire interruption events to playback cancellation |
| Full turn round trip | What the user actually feels | Optimize transport, prompts, and session state together |
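The four measurements in the table can be derived from a handful of per-turn timestamps. The sketch below is illustrative only: the `TurnTimestamps` field names are hypothetical, and "full turn round trip" is assumed here to mean user speech end to playback start, which may differ from how your team defines it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnTimestamps:
    """Monotonic timestamps in milliseconds for one conversational turn."""
    user_speech_end: float
    first_response_event: float
    first_audio_delta: float
    playback_start: float
    # Only set when the user barged in during this turn.
    interruption_detected: Optional[float] = None
    assistant_stopped: Optional[float] = None


def turn_metrics(t: TurnTimestamps) -> dict:
    """Derive the four intervals from raw timestamps."""
    barge_in = None
    if t.interruption_detected is not None and t.assistant_stopped is not None:
        barge_in = t.assistant_stopped - t.interruption_detected
    return {
        "speech_end_to_response_start_ms": t.first_response_event - t.user_speech_end,
        "first_delta_to_playback_ms": t.playback_start - t.first_audio_delta,
        "barge_in_stop_ms": barge_in,
        # Assumption: round trip ends when the user first hears audio.
        "full_turn_round_trip_ms": t.playback_start - t.user_speech_end,
    }
```

Keeping the raw timestamps and deriving intervals afterwards makes it easier to add a new measurement later without changing client instrumentation.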
Budget for session cost, not only per-minute audio
Realtime's cost model is the part teams underestimate. The synthesis cites Realtime input audio at $32 per 1M input audio tokens, cached input at $0.40 per 1M, and output audio at $64 per 1M.[4] It also summarizes estimated conversational costs around $0.11 for 1 minute, $0.92 for 5 minutes, and $5.28 for 15 minutes because history growth compounds over longer sessions.[4]
Those estimates are not the same as pricing a single TTS response. In a live agent, each new response sees conversation state that came before it. If you let a call run for 15 minutes while keeping every turn, every tool result, and every verbose instruction in context, your marginal response cost grows. The cost curve is why production Realtime agents need summarization, truncation, and routing logic, not just a billing alert.
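To see why cost compounds, here is a rough model using the per-token prices quoted above. Everything else is an assumption made for illustration: the tokens-per-turn figures are hypothetical, the model assumes the full prior history (user and assistant audio) is billed again as input on every turn, and it is not meant to reproduce the per-minute estimates cited earlier.

```python
# Prices quoted in the text, converted to USD per token.
INPUT_USD_PER_TOKEN = 32.0 / 1_000_000
CACHED_USD_PER_TOKEN = 0.40 / 1_000_000
OUTPUT_USD_PER_TOKEN = 64.0 / 1_000_000


def session_cost(
    turns: int,
    user_tokens_per_turn: int,
    assistant_tokens_per_turn: int,
    cached_fraction: float = 0.0,
) -> float:
    """Estimate session cost when each turn re-reads all prior history."""
    history = 0
    total = 0.0
    for _ in range(turns):
        # Assumption: a fixed fraction of the history hits the cache.
        cached = history * cached_fraction
        uncached = (history - cached) + user_tokens_per_turn
        total += uncached * INPUT_USD_PER_TOKEN
        total += cached * CACHED_USD_PER_TOKEN
        total += assistant_tokens_per_turn * OUTPUT_USD_PER_TOKEN
        # Both sides of the turn become context for the next one.
        history += user_tokens_per_turn + assistant_tokens_per_turn
    return total
```

Under these assumptions, tripling the number of turns more than triples the cost, because the input side grows roughly with the square of the turn count. Summarizing or truncating the history resets that growth.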
A self-assembled pipeline can be far cheaper. The synthesis estimates pipeline cost around $0.01-0.05/min, combining STT, cheaper text-model inference, and TTS.[4] Cartesia Sonic 3.5 is cited at $0.03/min for TTS-only use.[7] If you are generating spoken summaries, audio lessons, or scripted outbound reminders, those economics are hard to ignore.
The practical calculation is not "Realtime is expensive" versus "pipeline is cheap." It is this:
| Scenario | Better default | Why |
|---|---|---|
| 30-second support triage with tool lookup | Realtime | Latency and interruption quality matter more than raw media cost |
| 8-minute customer-service call | Realtime with summarization | Live turn-taking matters, but context cost must be controlled |
| Batch transcription and spoken summary | Whisper plus TTS | No live turn-taking requirement |
| Pure ultra-low-latency speech playback | Cartesia Sonic 3.5 | TTS speed matters; no reasoning loop needed |
| Custom voice, language, or on-device speech | Pipeline | Vendor control and deployment flexibility matter |
At scale, factor rate limits into the cost plan as well. OpenAI rate limits are measured in requests per minute, tokens per minute, requests per day, and tokens per day, with tiers based on spend.[1] The synthesis specifically calls out headers such as x-ratelimit-remaining-requests and x-ratelimit-reset-tokens.[1] For a normal text API, you can often retry a failed request. For a live call, a rate-limit failure is a product failure unless you route around it.
Your production cost plan should therefore include four controls: short default sessions, explicit summaries before context gets large, per-project rate-limit isolation between development and production, and a non-Realtime path for work that does not need live speech.
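One way to act on the rate-limit headers mentioned above is to check remaining headroom before opening a new live session. This is a minimal sketch: the `should_use_fallback` helper and its threshold are hypothetical, and it assumes you already have the response headers as a mapping with lower-cased keys.

```python
from typing import Mapping


def should_use_fallback(
    headers: Mapping[str, str],
    min_remaining_requests: int = 5,
) -> bool:
    """Return True when remaining request headroom looks too thin for a live call."""
    raw = headers.get("x-ratelimit-remaining-requests")
    if raw is None:
        # Header absent: no signal, so do not force the fallback.
        return False
    try:
        remaining = int(raw)
    except ValueError:
        # Unparseable value: be conservative and avoid starting a live call.
        return True
    return remaining < min_remaining_requests
```

The threshold is a product decision. A call center with bursty traffic may want a larger buffer than an internal tool with a handful of users.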
Build for PCM16, G.711, VAD, and interruption repair from day one
The production gotchas are not glamorous. They are audio format, turn detection, what happens when someone interrupts the model, and session and prompt limits.
The synthesis names the audio formats directly: 24kHz PCM16 base64 chunks for standard Realtime audio, and G.711 for telephony.[4] If you are coming from web development, the base64 event stream can look like an implementation detail. It is not. Chunk size, buffering, resampling, and transport choice shape perceived latency. In the synthesis, input arrives through events like input_audio_buffer.append; output arrives as response.audio.delta chunks.[4]
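As a concrete illustration of the chunking step, the sketch below splits raw 24kHz PCM16 audio into fixed-duration chunks and wraps each one as an input_audio_buffer.append event. It assumes mono audio and a JSON event with a base64 `audio` field; the `append_events` helper and the 20ms default are illustrative choices, not values taken from the sources.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2  # PCM16, assumed mono


def append_events(pcm: bytes, chunk_ms: int = 20) -> Iterator[str]:
    """Yield JSON events carrying base64-encoded PCM16 chunks."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Smaller chunks reduce the delay before the server sees speech but increase per-message overhead, so the chunk duration is worth measuring rather than guessing.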
For browser and mobile clients, the architecture pattern is a WebRTC path for user media and a controlled backend path to OpenAI. The synthesis notes the WebRTC proxy architecture and the reason: WebRTC over UDP handles interactive audio better than raw client WebSocket paths that can suffer TCP head-of-line blocking.[4] For phone systems, use G.711 and SIP-oriented designs rather than forcing PSTN audio through a web-media mental model.[3][4]
VAD is the second failure point. If silence detection is too aggressive, the model starts speaking while the user is still thinking. If it is too slow, every turn feels delayed. The synthesis flags silence_duration_ms=800-1000ms as a practical tuning range and calls out reasoning.effort=minimal/low as a latency optimization.[2][4] The right setting depends on the product. A medical intake agent should wait longer than a drive-through ordering assistant.
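A per-product VAD setting can be kept as data and turned into a session update when the call starts. The sketch below is hedged in two ways: the preset names and values are hypothetical picks inside the 800-1000ms range mentioned above, and the exact nesting of `turn_detection` inside the session payload is an assumption that may differ between API versions, so check it against the current reference before use.

```python
# Hypothetical presets inside the 800-1000 ms range discussed in the text.
SILENCE_PRESETS_MS = {
    "drive_through": 800,
    "general_support": 900,
    "medical_intake": 1000,
}


def vad_session_update(product: str) -> dict:
    """Build a session.update payload with a product-specific silence window."""
    silence_ms = SILENCE_PRESETS_MS[product]  # raises KeyError for unknown products
    return {
        "type": "session.update",
        "session": {
            # Assumed field layout; verify against the API version you target.
            "turn_detection": {
                "type": "server_vad",
                "silence_duration_ms": silence_ms,
            }
        },
    }
```

Failing loudly on an unknown product name is deliberate here: silently falling back to a default silence window is the kind of bug that only shows up as "the agent keeps cutting people off."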
Interruption handling is where demo code usually breaks. The WorkAdventure implementation note demonstrates interruption handling through a cancelResponse() helper that clears audio playback and syncs conversation state so the model does not retain context for audio the user never heard.[9] At the Realtime API layer, this corresponds to sending conversation.item.truncate to align the server-side session with what was actually played before the barge-in. If the assistant says, "Your appointment is at four..." and the user interrupts after "Your appointment," your conversation state must not pretend the user heard the rest.
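The truncation step reduces to one piece of arithmetic: convert the number of audio bytes the client actually played into milliseconds, then tell the server to cut the item there. The sketch below assumes 24kHz mono PCM16 playback and the conversation.item.truncate fields `item_id`, `content_index`, and `audio_end_ms`; the `truncate_event` helper itself is hypothetical and is not the WorkAdventure implementation.

```python
import json

# 24,000 samples/s * 2 bytes per sample / 1000 ms = 48 bytes per millisecond.
BYTES_PER_MS = 24_000 * 2 // 1000


def truncate_event(item_id: str, played_bytes: int, content_index: int = 0) -> str:
    """Build a truncate event aligned with what the user actually heard."""
    audio_end_ms = played_bytes // BYTES_PER_MS
    return json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": content_index,
        "audio_end_ms": audio_end_ms,
    })
```

The important input is `played_bytes`, not received bytes. Audio that arrived but was still sitting in the playback buffer when the user interrupted should not count as heard.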
Session and prompt limits are the final thing to design around. The synthesis cites a 15-minute session limit and a production complaint about a 16,384-token instruction limit for Realtime agents with tool calling.[4][10] Whether or not your calls usually last 15 minutes, you need a reconnection strategy before launch: summarize the call, store tool state outside the session, open a fresh session, and replay only what the next turn needs.
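The reconnection strategy can be sketched as two small pieces: a check that fires before the 15-minute limit, and a handoff record for the new session. The function names, the one-minute safety margin, and the handoff fields are all hypothetical; producing the summary itself is left out.

```python
SESSION_LIMIT_S = 15 * 60  # session limit cited in the text


def should_roll_over(elapsed_s: float, safety_margin_s: float = 60.0) -> bool:
    """Return True once the session is within the safety margin of the limit."""
    return elapsed_s >= SESSION_LIMIT_S - safety_margin_s


def build_handoff(summary: str, tool_state: dict, pending_user_turn: str) -> dict:
    """Collect only what a fresh session needs to continue the call."""
    return {
        "instructions_suffix": f"Summary of the call so far: {summary}",
        # Copy so later mutations in the old session do not leak across.
        "tool_state": dict(tool_state),
        "replay": [pending_user_turn] if pending_user_turn else [],
    }
```

Rolling over at a turn boundary, rather than at the hard limit, keeps the reconnect out of the middle of an utterance.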
Use a decision tree before you commit to Realtime
The simplest decision tree is also the most useful:
- Does the user need to interrupt the agent naturally?
- Does the agent need to call tools inside the spoken turn?
- Does the product need sub-second conversational feel?
- Will the agent run on phone calls, live translation, support, coaching, or tutoring?
- Is engineering time more expensive than the extra Realtime media cost?
If most answers are yes, start with Realtime. If most answers are no, start with a pipeline.
Now run the reverse test:
- Is the workload batch, asynchronous, or push-to-talk?
- Is per-minute cost the primary constraint?
- Do you need a custom TTS vendor such as Cartesia Sonic 3.5?
- Do you need specialized STT for accents, noise, or domain vocabulary?
- Would losing a live session be acceptable because each step can be retried?
If most answers are yes, a Whisper plus TTS pipeline is probably the better architecture.
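The two checklists can be encoded as a small helper so the decision is recorded rather than argued from memory. This is a sketch: "most" is interpreted as a strict majority, and the "mixed" result for the case where both checklists come back mostly yes is an assumption the text does not cover.

```python
from typing import Sequence


def recommend(realtime_answers: Sequence[bool], pipeline_answers: Sequence[bool]) -> str:
    """Map yes/no answers from both checklists to a starting architecture."""
    realtime_yes = sum(realtime_answers) > len(realtime_answers) / 2
    pipeline_yes = sum(pipeline_answers) > len(pipeline_answers) / 2
    if not realtime_yes:
        # Mostly "no" on the first checklist: start with a pipeline.
        return "pipeline"
    if pipeline_yes:
        # Assumption: both checklists mostly yes means prototype both paths.
        return "mixed"
    return "realtime"
```

For example, `recommend([True, True, True, False, True], [False, False, True, False, False])` returns `"realtime"`.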
This comparison also keeps vendor claims in the right lane. Cartesia Sonic 3.5 is an excellent TTS benchmark and should be in your test set, especially if pure speech playback latency is the job.[5][7] Realtime should be judged on speech-native agency: tool calls, prosody, endpointing, interruptions, session control, and total turn latency. Whisper plus TTS should be judged on modularity, cost, and operational transparency.
Ship the first production version with observability and fallbacks
The minimum viable production voice agent is not just a Realtime session that works once. It is a system that can tell you why a call felt slow, why a user interrupted twice, why rate limits spiked, and when a session needs to be rolled over.
Log at least these events: session start, transport type, audio format, VAD settings, first user audio chunk, detected end of user turn, first model response event, first audio delta, playback start, interruption event, truncate event, tool call start and end, rate-limit headers, summary creation, and session close. These are the timestamps that let you distinguish model latency from media latency. For a practical observability setup using Langfuse and cost tracking, see ai-agent-observability-langfuse.
The synthesis also points to prompting patterns that are easy to miss: use lower reasoning.effort for latency-sensitive turns, use preambles to mask tool latency, read entities such as account numbers digit by digit, and use a no-op wait_for_user tool when the model needs to stop speaking and wait.[2] Voice prompting is not chat prompting read aloud. The prompt has to manage timing, turn boundaries, and what the user actually hears.
Finally, keep a pipeline fallback. If rate limits are tight, if a user uploads a long recording, or if a call becomes asynchronous, route that work through STT plus text plus TTS. Realtime is the premium live path. It should not become the only audio path in your system.
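The fallback rule above can be expressed as a small router in front of both audio paths. The `AudioJob` fields and the `choose_path` helper are hypothetical names used for illustration, and how you decide that rate limits are "tight" is left to your own telemetry.

```python
from dataclasses import dataclass


@dataclass
class AudioJob:
    is_live_call: bool
    is_uploaded_recording: bool
    rate_limit_tight: bool


def choose_path(job: AudioJob) -> str:
    """Send only live, unconstrained conversations to the Realtime path."""
    if job.is_uploaded_recording or not job.is_live_call:
        # Recordings and asynchronous work do not need live turn-taking.
        return "pipeline"
    if job.rate_limit_tight:
        # Degrade to STT plus text plus TTS rather than fail the call.
        return "pipeline"
    return "realtime"
```

Keeping the routing decision in one function also gives you a single place to log why a given call did or did not get the live path.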
The practical takeaway: use OpenAI Realtime when speech is the interface. Use Whisper plus TTS when speech is just an input or output format. Use Cartesia Sonic 3.5 when TTS latency is the whole problem. For implementation practice that connects these choices to tool calling and agent orchestration, continue with OpenAI Agents SDK Mastery: Build Production-Ready Autonomous Systems and keep voice-agents-2026-tts-latency-benchmark nearby as the TTS comparison baseline.
Further reading
[1] OpenAI rate limits guide: https://developers.openai.com/docs/guides/rate-limits
[2] OpenAI Realtime models prompting guide: https://developers.openai.com/docs/guides/realtime-models-prompting
[3] OpenAI platform changelog: https://platform.openai.com/docs/changelog
[4] Latent Space Realtime API production notes: https://www.latent.space/p/realtime-api
[5] Cartesia Sonic 3.5 vs OpenAI TTS benchmark: https://cartesia.ai/vs/cartesia-vs-openai-tts
[6] Eesel Realtime API vs Whisper vs TTS API: https://www.eesel.ai/blog/realtime-api-vs-whisper-vs-tts-api
[7] Cartesia Sonic product page: https://cartesia.ai/sonic
[8] Inworld voice AI and TTS APIs 2026 benchmark: https://inworld.ai/resources/best-voice-ai-tts-apis-for-real-time-voice-agents-2026-benchmarks
[9] WorkAdventure interrupt handling in Realtime API: https://docs.workadventu.re/blog/realtime-api-interrupting-the-model/
[10] OpenAI Community Realtime instruction limit discussion: https://community.openai.com/t/realtime-api-instruction-limit-16-384-tokens-is-too-low-for-production-voice-agents-with-tool-calling/1378932
References
- OpenAI rate limits guide · retrieved 2026-05-13
- OpenAI Realtime models prompting guide · retrieved 2026-05-13
- OpenAI platform changelog · retrieved 2026-05-13
- Latent Space: Realtime API latency and production notes · retrieved 2026-05-13
- Cartesia Sonic 3.5 vs OpenAI TTS benchmark · retrieved 2026-05-13
- Eesel: Realtime API vs Whisper vs TTS API · retrieved 2026-05-13
- Cartesia Sonic product page · retrieved 2026-05-13
- Inworld voice AI and TTS APIs 2026 benchmark · retrieved 2026-05-13
- WorkAdventure: Interrupting the OpenAI Realtime model · retrieved 2026-05-13
- OpenAI Community: Realtime instruction limit discussion · retrieved 2026-05-13