Your AI Token Streams Are Fragile: A CTO Guide to Resumable, Cancellable, Multi‑Device SSE

By Diogo Hudson Dias
Engineer in a São Paulo office reviewing a dashboard of live AI token stream metrics showing reconnects and cancellations on dual monitors.

Your LLM UI probably demos beautifully on office Wi‑Fi. Then it hits the real world: mobile networks that flap, users who refresh mid‑stream, app foreground/background churn, and people switching from laptop to phone. Suddenly you’re double‑paying for output tokens and support tickets say “it keeps spinning.”

Server‑Sent Events (SSE) is still the most practical way to deliver token streams from most providers. It plays nicely with proxies, works over HTTP/2, and doesn’t require full duplex like WebSockets. But out‑of‑the‑box SSE is fragile. The good news: you can harden it. A recent wave of posts about making SSE streams resumable, cancellable, and multi‑device is right — and it’s time to turn that advice into a production‑grade blueprint you can hand to your team.

The failure modes you’re paying for

  • Abandoned reruns: A user refreshes mid‑generation. Your backend starts a new provider call. You just doubled spend for the same request.
  • Phantom compute after cancel: The user hits stop; your UI closes the TCP connection. Your server doesn’t propagate abort upstream, so the provider keeps generating — and billing.
  • Multi‑device drift: A user opens the same conversation on web and mobile. Both trigger a generation. You pay twice, content diverges, and reconciling deltas is messy.
  • Middlebox buffering: CDNs and proxies buffer your stream “to be helpful.” Time‑to‑first‑token (TTFT) jumps from 300–500 ms to multiple seconds. Users bail.
  • Lost progress on flaky networks: 3–7% of mobile SSE connections will drop in a 60‑second window. Without resumption, you rerun the whole generation.

If your product moves 300M output tokens/month at $15 per million (typical for GPT‑4o‑class output), 8–12% duplication/waste is $360–$540/month — small on paper, but it correlates with churn and support load. At scale (billions of tokens, or pricier models), the dollars grow fast. More importantly, broken streams kill trust.

Decision framework: when SSE, when WebSockets, when WebRTC

  • SSE if you only need server→client token streaming, simple reconnect semantics, and broad proxy/CDN compatibility. Most chat‑style LLM UIs fit this.
  • WebSockets if you require true duplex during generation (e.g., real‑time tool progress, user interruptions that must be delivered instantly across NATs) and you control the edge path. Harder to operate at Internet scale.
  • WebRTC if you’re doing audio/video streams with low latency and NAT traversal. Useful for streaming voice features just launched by major providers, but overkill for text tokens.

This post assumes SSE. If you pick WebSockets or WebRTC, most of the control‑plane ideas still apply: stream IDs, sequence numbers, fan‑out, and cancellation.

The architecture that stops paying twice

Don’t let clients connect directly to your model provider. Put a lightweight Stream Broker in your stack:

  • Client opens SSE to your Edge at /v1/streams/:stream_id.
  • Edge/Ingress terminates TLS, disables buffering, and forwards to the Stream Broker (sticky by stream_id).
  • Stream Broker is a stateless worker with a short‑lived in‑memory cache (or Redis/NATS) keyed by stream_id. It does three jobs:
    • Multiplex/fan‑out: If a generation is already in progress for this stream_id, attach the new subscriber; don’t start another provider call.
    • Resumption: Buffer a rolling window of recent deltas with monotonically increasing event.id. On reconnect with Last-Event-ID, replay missing parts.
    • Cancellation: On user stop, cancel the upstream provider request via an abortable HTTP client. If the last subscriber disconnects, auto‑cancel.
  • Model Worker holds provider credentials, requests streaming completions, normalizes provider‑specific deltas to a common event format.

Keep stream state in memory for 60–120 seconds after completion/cancel to accommodate fast device switches or tab refreshes without a full rerun.

Stream identity and sequencing

  • Derive stream_id deterministically from conversation_id + turn_id + a 32‑bit epoch. If the user edits the prompt or system state, increment the epoch to prevent stale resumes.
  • Emit SSE id fields as strictly increasing integers per stream, starting at 1. Avoid provider chunk IDs; renumber on your side.
  • Send events with types: delta, tool, heartbeat, complete, error. Never overload data with control signals.

Client reconnect logic (the only one you should have)

  1. On connection drop, reconnect with exponential backoff capped at 2 seconds.
  2. Include Last-Event-ID with the most recent id you applied. If broker has the buffer, it replays; otherwise it returns a 410 Gone to force a full rerun.
  3. Render idempotently. If you see an old id, discard; if you miss one, request a targeted replay via query (?from_id=123) after reconnect.

Cancellability that actually stops the bill

SSE is one‑way. Closing the browser tab doesn’t guarantee the provider stops generating. The cancel must flow server‑side:

  • Stop button → POST /v1/streams/:stream_id/cancel with a short‑lived stream‑scoped token (60s TTL). Do not rely on connection close detection alone.
  • Broker cancel behavior:
    • Mark stream as cancelling; notify all subscribers with an event: error carrying a typed reason (user_cancelled, superseded).
    • Abort the provider HTTP request (AbortController or client‑specific cancel token). If the provider doesn’t support abort, fall back to closing the TCP socket.
    • Keep the last N deltas in memory so a fast resume can show partial content without re‑invoking the model.
  • Single‑flight by default: Enforce one active generation per conversation. A new start automatically cancels the previous one with reason superseded.

In practice, robust cancellation reduces average output tokens consumed per user conversation turn by 15–35% because people habitually “trim” responses. That’s real money on premium models and — more importantly — a UX that feels in control.

Multi‑device without double billing

A user switching from laptop to phone should not trigger a fresh model call. You want compute once, deliver many:

  • Attach many subscribers to one upstream call. The broker treats each client connection as a subscriber. New subscribers receive a fast replay of buffered deltas (based on Last-Event-ID) and then live tokens.
  • Fan‑out atop a shared event log. Persist only structured deltas and metadata (token index, tool calls). Store full assembled text only at complete, or reassemble on the fly.
  • Enforce device concurrency policy. For example: allow multiple readers, single initiator. New initiators on a different device auto‑cancel the in‑flight stream with superseded.
  • Prevent accidental duplicates. Debounce start actions for 300–500 ms per conversation to avoid double‑click/race starts.

Across audits we’ve run for US consumer apps, 5–12% of LLM calls were duplicated by tab refreshes or quick device switches. A brokered fan‑out reduces that to <2%, and p95 time‑to‑first‑token typically drops 30–50% once buffering is fixed.

Edge and proxy configuration (where most teams stumble)

Middleboxes love to buffer. Your job is to make streaming unambiguous from HTTP headers through to origin. Baseline config:

  • Response headers from the broker: Content-Type: text/event-stream, Cache-Control: no-cache, no-transform, Connection: keep-alive, X-Accel-Buffering: no (honored by Nginx), and send a heartbeat comment : ping\n\n every 10–15 seconds.
  • Nginx/Envoy: disable proxy buffering for text/event-stream, flush small chunks, increase proxy_read_timeout to cover long generations (e.g., 120s+). Ensure HTTP/2 is enabled client‑side; keep chunking semantics intact.
  • CDN/Edge: many CDNs buffer by default. Use features labeled “origin streaming,” “no buffering,” or bypass caching for text/event-stream. Set no-transform to prevent content modification. If your edge vendor doesn’t support true streaming on free tiers, bypass it for SSE paths.
  • Provider connections: use HTTP/2 where supported; set small write buffers to flush tokens promptly. Some providers batch tokens unless Accept: text/event-stream and stream=true are set precisely.

Provider quirks you should normalize away

  • Token delta format: Providers disagree on partial vs. full text frames. Normalize to event: delta with { id, text_delta, tool_calls_delta, usage? }. Assemble text client‑side.
  • Stop reasons: Expose a consistent set: length, user_cancelled, content_filter, error. Map provider‑specific reasons to yours.
  • Abort semantics: Some APIs only stop on TCP close; others accept request‑scoped cancellation. Standardize via your HTTP client and keep an integration test that verifies <1 second stop after cancel across providers.
  • Usage accounting: Don’t trust just the provider’s async usage webhook. Compute your own token counts from deltas for real‑time metering and fraud detection.

Security: lock scopes to the stream, not the user

  • Stream‑scoped tokens: Issue a JWT usable only for /v1/streams/:stream_id and /cancel, TTL ≤ 5 minutes. Embed conversation_id, epoch, user ID, and permissible actions.
  • No PII in logs: Never log event.data. Log only id, event, byte sizes, and timing.
  • Provider keys stay server‑side. Clients never hold model API keys. The broker mediates. This also enables central cancellation and fan‑out.
  • Rate limits per conversation: Cap concurrent in‑flight generations to 1 by default. Burst policies go at org or project level.

Resilience and fallbacks

  • Resume window TTL: Keep a 60–120 second in‑memory ring buffer per stream (~64–256 KB is enough). If the buffer is gone, return 410 Gone so the client knows to rerun.
  • Edge bypass switch: Feature‑flag a path that routes SSE directly to origin if the CDN starts buffering during an incident.
  • Provider failover: Mid‑stream failover is rarely clean. Prefer: fast fail detection (1–2 seconds), cancel, then transparent restart on a secondary provider with a clear UI indicator “switched to backup model.”
  • Chaos test networking: Inject 3–5% packet loss, 250–600 ms variable RTT, and random 1–3 second disconnects. Your SSE should remain readable and resumable.

Data model: event‑sourced streams beat ad‑hoc strings

Persist each stream as an append‑only log of structured events:

  • delta: { id, ts, text_delta, tool_delta?, usage_partial? }
  • heartbeat: { id, ts }
  • complete: { id, ts, stop_reason, usage_final }
  • error: { id, ts, code, retriable }

On read, you can rebuild the final text, audit usage precisely, and run analytics on interruption patterns. And if you introduce multi‑provider A/Bs, you’ll want this lineage.

Observability: KPIs that prove it’s working

  • TTFT (time‑to‑first‑token): target <600 ms p50, <1.5 s p95 on broadband; mobile may be higher.
  • Stream completion rate: percentage of streams ending in complete vs. error/cancel. Track by device and network type.
  • Resume success rate: percentage of reconnects that replay missing deltas without a full rerun. Aim >90% within the resume window.
  • Duplicate suppression: share of attempted duplicate starts that were multiplexed into an existing stream. You want this high.
  • Cancel propagation time: time from user click to upstream abort acknowledgment. Aim <1 second.
  • Edge buffering incidents: count spikes in TTFT and sudden drops in heartbeat delivery; alert when detected.

One marketplace product we supported cut duplicate token burn by 9.2% and improved p95 TTFT by 41% after moving to a brokered SSE with proper edge config and resumption.

Rollout plan you can ship this quarter

Week 0–2: Prove the broker

  • Introduce a minimal Stream Broker in your API layer. Normalize provider deltas, emit SSE with id, and add heartbeats.
  • Disable buffering end‑to‑end on a canary route. Measure TTFT and drop rates.
  • Add /cancel and abort propagation to one provider integration.

Week 3–5: Make it durable

  • Add a ring buffer per stream (in‑memory first; Redis Streams if you need cross‑instance resumption).
  • Implement Last-Event-ID resume and fast replay. Debounce duplicate starts.
  • Fan‑out to multiple subscribers. Verify multi‑device switch without double billing.

Week 6–8: Put guardrails and metrics in place

  • Stream‑scoped JWTs, rate limits, and single‑flight per conversation.
  • Dashboards for TTFT, resume success, cancel latency, duplicate suppression.
  • Chaos tests with artificial packet loss and intermittent disconnects. Fix until metrics hold.

Trade‑offs and when not to do this

  • Complexity: A broker adds moving parts. If your app is internal‑only on stable networks and call volumes are low, you might live with naïve SSE.
  • Latency vs. consistency: Replay windows and normalization add a few milliseconds. Worth it for durability; measure to be sure.
  • State management: Persisting deltas adds storage and PII risk. Keep short TTLs, encrypt at rest, and avoid logging payloads.
  • Provider constraints: Some vendors’ streaming semantics are inconsistent. Normalize aggressively and build contract tests per model API.

Why now

Two things changed in 2026. First, LLM features are moving to mobile and low‑bandwidth contexts, not just desktop web. Second, model “voice intelligence” and richer tool events increase the surface area for partial delivery and cancellation. You don’t need more agents; you need control flow at the stream layer. Get this right and you cut waste, gain speed, and — most critically — make your AI features feel reliable.

Key Takeaways

  • Put a Stream Broker between clients and providers to multiplex, resume, and cancel centrally.
  • Use deterministic stream_id + monotonically increasing event.id; resume with Last-Event-ID.
  • Implement explicit /cancel and propagate abort server‑side; don’t rely on socket closes.
  • Disable buffering from CDN to origin; send heartbeats and correct SSE headers.
  • Fan‑out one upstream generation to multiple devices; debounce duplicate starts.
  • Track TTFT, resume success, cancel latency, and duplicate suppression to prove impact.
  • Expect 15–35% fewer output tokens per turn and 30–50% faster perceived latency when done right.

Ready to scale your engineering team?

Tell us about your project and we'll get back to you within 24 hours.

Start a conversation