Infrastructure

Latency

The elapsed time between a user or system request and the response becoming available.

Latency shapes how an AI product feels. A five-second answer may be acceptable for a research report but painful for autocomplete, support chat, or interactive coding.

AI latency comes from several layers: network travel, retrieval, tool calls, queueing, model prefill, token generation, and post-processing. Improving it often requires changing the whole workflow, not just choosing a faster model.
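
To make the layered view concrete, here is a minimal sketch of a per-request latency budget. The stage names mirror the list above. The millisecond values are invented for illustration, not measurements, and the helper names (`total_latency_ms`, `with_speedup`) are hypothetical. The sketch also assumes the stages run strictly one after another, which real systems often relax through streaming or parallel calls.

```python
# Illustrative per-stage latencies for one request, in milliseconds.
# These numbers are assumptions chosen for the example, not measurements.
STAGES_MS = {
    "network": 80,
    "retrieval": 200,
    "tool_calls": 900,
    "queueing": 50,
    "prefill": 250,
    "generation": 1400,
    "post_processing": 20,
}


def total_latency_ms(stages):
    """End-to-end latency, assuming the stages run sequentially."""
    return sum(stages.values())


def with_speedup(stages, stage_names, factor):
    """Return a copy of the budget with the named stages made `factor` times faster."""
    return {
        name: (ms / factor if name in stage_names else ms)
        for name, ms in stages.items()
    }


baseline_ms = total_latency_ms(STAGES_MS)

# Swapping in a model that is twice as fast only touches the model stages.
faster_model = with_speedup(STAGES_MS, {"prefill", "generation"}, 2.0)
faster_model_ms = total_latency_ms(faster_model)

# Stage that contributes the most to the baseline total.
slowest_stage = max(STAGES_MS, key=STAGES_MS.get)

print(f"baseline: {baseline_ms} ms, with 2x faster model: {faster_model_ms} ms")
print(f"largest single contributor: {slowest_stage}")
```

With these made-up numbers, a model that is twice as fast cuts the total from 2900 ms to 2075 ms, roughly a 1.4x improvement end to end, because retrieval, tool calls, and the other layers are unchanged. That is the sense in which improving latency often means changing the workflow rather than only the model.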