Async Or Die: A CTO’s Playbook for Durable Agent Orchestration in 2026

By Diogo Hudson Dias

Your web stack was built for 200 ms CRUD. Your agents need 2–20 minutes of retries, callbacks, and human approvals. That’s why production agent systems stall, double-charge customers, or leak tokens. HN is right: all your agents are going async. The question is whether you will do it deliberately.

This is a practical playbook for CTOs: how to design durable async orchestration for AI agents—idempotent by default, observable, and with predictable cost. It draws on the lessons behind two themes in the week’s feeds: “All your agents are going async” and the postmortems on what async promised vs. what it delivered. If you run a product team today, you don’t have quarters to experiment. You need a blueprint you can implement in 30–90 days.

Start with a workload taxonomy (not a tool)

Before you buy a workflow engine, label your work. AI agents magnify four distinct execution classes. Each class pushes you to a different architectural choice.

  • Class A — Short I/O (≤5s): Single API calls, vector lookups, lightweight tool invocations. Keep synchronous over HTTP. Guard with timeouts and retries-in-client.
  • Class B — Medium jobs (5–60s): Multi-step I/O, light compute, fan-out to a few tools. Use a queue + workers with at-least-once semantics and idempotency keys.
  • Class C — Long-running tasks (1–30 min): Agent loops, multi-API backoffs, human approvals. Use a durable workflow engine (Temporal/Step Functions/Durable Functions) or a homegrown saga coordinator with state checkpoints and heartbeats.
  • Class D — Very long / stateful (30 min–days): Research crawls, multi-agent negotiations, batch enrichment. Use a workflow engine + durable storage for checkpoints, human-in-the-loop steps, and resumability after deploys.

Your first decision is mapping features to classes. If >30% of your roadmap is Class C/D, you’ll outgrow ad hoc cron + queues fast. If your work is 90% Class B, a lean SQS/Redis + workers stack with an outbox/inbox pattern will beat heavyweight orchestrators on simplicity and cost.

Delivery semantics: pick your poison up front

Agents are stochastic. Networks fail. Exactly-once delivery is a fairy tale at scale. Choose effectively-once and implement it ruthlessly.

  • At-least-once queueing: SQS, Pub/Sub, Kafka will redeliver. Design every step to be idempotent.
  • Idempotency keys: Generate a stable key per user-intent (e.g., hash of input + tool + version). Persist a tombstone record with status and response checksum. Reject duplicates within a bounded dedupe window (e.g., 24–72 hours).
  • Outbox/inbox pattern: For database mutations, write the intent to an outbox table in the same transaction as the state change. Ship with a relay to the queue. On the consumer, write into an inbox table before processing to dedupe redeliveries.
  • Compensations over rollbacks: Use the saga pattern. When a later step fails, schedule a compensating action (refund, revoke token, delete file), not a database rollback you can’t apply to external systems.

Teams that adopt idempotency at the boundary cut duplicate side-effects by >95% in practice. Without it, your support queue becomes your dedupe service.
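The idempotency-key flow above can be sketched in a few lines. This is a minimal in-memory sketch: `TombstoneStore` stands in for your real tombstone table, and the 72-hour window matches the bounded dedupe window described above.

```python
import hashlib
import json
import time

DEDUPE_WINDOW_S = 72 * 3600  # bounded dedupe window (72 hours)

def idempotency_key(user_input: dict, tool: str, version: str) -> str:
    """Stable key per user-intent: hash of input + tool + version."""
    payload = json.dumps(user_input, sort_keys=True)
    return hashlib.sha256(f"{payload}|{tool}|{version}".encode()).hexdigest()

class TombstoneStore:
    """In-memory stand-in for a tombstone table keyed by idempotency key."""
    def __init__(self):
        self._records = {}

    def execute_once(self, key: str, handler):
        record = self._records.get(key)
        if record and time.time() - record["ts"] < DEDUPE_WINDOW_S:
            return record["response"]  # duplicate: replay the stored response
        response = handler()           # first delivery: run the side-effect
        self._records[key] = {"ts": time.time(), "status": "done",
                              "response": response}
        return response
```

On redelivery the same key hits the tombstone and the stored response is replayed, so the side-effect runs exactly once within the window.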

The orchestration options that actually ship

There are a thousand frameworks. Only a few survive real production constraints (multi-minute tasks, backpressure, observability, mixed human/automated steps).

Option 1: DIY queue + workers (SQS/Redis/Kafka)

  • When to use: ≤20 task types, flows ≤5 steps, mostly Class B. You need speed and cost efficiency.
  • How: SQS or Pub/Sub for queueing, a Postgres outbox/inbox for dedupe, workers on ECS/Kubernetes, dead-letter queues (DLQs) with alarms.
  • Pros: Cheap and simple. SQS is ~$0.40 per million requests; 10M jobs/month ≈ $4 in queue ops.
  • Cons: No native human steps, no graphical visibility into flows, you build your own retries, backoff policies, and correlation IDs.
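The worker loop at the heart of this option fits in one function. A minimal sketch with in-process queues standing in for SQS or Redis; `MAX_ATTEMPTS` and the handler are illustrative, and a real worker would also honor visibility timeouts.

```python
import queue

MAX_ATTEMPTS = 3  # illustrative redrive policy before dead-lettering

def run_worker(jobs: "queue.Queue", dlq: "queue.Queue", handler) -> None:
    """Drain the queue; redeliver failed jobs, dead-letter after MAX_ATTEMPTS."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        try:
            handler(job)
        except Exception as exc:
            job["attempts"] = job.get("attempts", 0) + 1
            if job["attempts"] >= MAX_ATTEMPTS:
                job["error"] = str(exc)
                dlq.put(job)   # dead-letter: paged and triaged, not dropped
            else:
                jobs.put(job)  # at-least-once: redeliver for another attempt
```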

Option 2: Managed workflow services (AWS Step Functions, Azure Durable, GCP Workflows)

  • When to use: 10–100 step flows, mix of short and long tasks, need visual state, service integrations, and SLAs.
  • How: Model flows as states. Use task tokens for callbacks and heartbeats for long work. Embrace sub-workflows for parallel tool calls.
  • Pros: Durable, visual debugging, built-in backoff. Step Functions Standard is priced at $0.025 per 1,000 state transitions, so 10M transitions ≈ $250/month. Integrates natively with AWS services.
  • Cons: Vendor lock-in, JSON DSL ergonomics, express variants charge by GB-second which can spike on heavy payloads, limited local dev story.

Option 3: Temporal (or Cadence) for durable execution

  • When to use: >100 distinct task types, human-in-the-loop, versioned workflows you can replay deterministically, frequent deploys without losing progress.
  • How: Write workflows in code (Go/Java/TypeScript/Python). Temporal handles retries, timers, heartbeats, replays, and state persistence.
  • Pros: “Write blocking code, get durable async.” Deterministic replays enable time travel debugging. Strong cancellation and compensation patterns.
  • Cons: Operational surface area if self-hosted (Cassandra/Postgres + matching/history services). You need engineering discipline on workflow versioning.

Rule of thumb: if you have two or more of 1) human approvals, 2) steps spanning >15 minutes, 3) must resume mid-flight after deploys, 4) 50+ distinct activities, choose a workflow engine. Otherwise, keep it lean with queues.
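If you stay lean and go the homegrown saga route for Class C, the core shape looks like this. A minimal sketch: each step pairs an action with a compensating action, completed steps are checkpointed, and a failure runs compensations in reverse order. Step names and handlers are illustrative.

```python
class Saga:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    def __init__(self):
        self.steps = []       # list of (name, action, compensation)
        self.completed = []   # checkpoint log of finished step names

    def add_step(self, name, action, compensation):
        self.steps.append((name, action, compensation))
        return self

    def run(self) -> bool:
        done = []
        for name, action, compensation in self.steps:
            try:
                action()
                done.append((name, compensation))
                self.completed.append(name)  # checkpoint after each step
            except Exception:
                for _comp_name, comp in reversed(done):
                    comp()                   # compensate, don't roll back
                return False
        return True
```

This is exactly what a workflow engine gives you for free, plus durability across process restarts, which is why the rule of thumb above pushes you toward one as flows get longer.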

Secure egress: treat agent HTTP like production payments

Agents call the internet. That’s a blast radius problem. This month’s headlines about OAuth and environment variable leaks should be a wake-up call. Treat outbound HTTP as dangerous until proven safe.

  • Zero-trust egress: Route all agent HTTP through a proxy that enforces an allowlist by domain and scheme. Deny by default. For dynamic tools, use signed, short-lived egress tokens with scope and rate limits.
  • LLM-as-judge guardrails: A judge proxy can score or redact prompts/responses to strip secrets and PII before requests leave your VPC. It’s not your only line of defense, but it reduces leakage from prompt injection.
  • Secret hygiene: No long-lived tokens in environment variables. Use per-workflow, time-scoped credentials from a vault (AWS STS + Secrets Manager, HashiCorp Vault) with automatic rotation. Pass credentials by reference, not value.
  • RBAC for tools: Every tool has a service role. Map agent capabilities to roles, not raw keys. Principle of least privilege is non-optional when agents compose tools.
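The deny-by-default check at the heart of the egress proxy is small enough to sketch directly. The domains here are illustrative, and a real deployment enforces this in the proxy, not inside the agent process.

```python
from urllib.parse import urlsplit

# Illustrative allowlist: exact (scheme, host) pairs only, deny everything else.
ALLOWED = {("https", "api.openai.com"), ("https", "api.stripe.com")}

def egress_allowed(url: str) -> bool:
    """Deny by default: only exact scheme + hostname pairs on the allowlist pass."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname) in ALLOWED
```

Note that matching on the full hostname (not a substring) is what defeats lookalike domains such as `api.stripe.com.evil.example.com`.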

Backpressure, timeouts, and cost-aware retries

Retries are where bills go to die. Your retry math should be budget-based, not hope-based.

  • Exponential backoff with jitter: Start at 250–500 ms, multiply by 2, cap at 30–60 seconds, add ±20% jitter to avoid thundering herds.
  • Retry budgets: Define a max compute budget per user-intent (e.g., 30 seconds CPU, 3 minutes wall time, $0.02 LLM spend). Abandon or escalate to human once exhausted.
  • Queue depth signals capacity: Workers scale up when depth > N messages or age > T seconds. Scale down slowly to avoid oscillation.
  • Per-endpoint rate limits: Maintain client-side token buckets per external API to honor vendor SLAs and avoid account bans.
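The backoff and budget rules above translate directly into code. A minimal sketch using the numbers from the list; the budget limits are illustrative defaults you would tune per product.

```python
import random

def backoff_schedule(attempts: int, base: float = 0.25, cap: float = 30.0,
                     jitter: float = 0.2) -> list:
    """Exponential backoff with +/-20% jitter, capped: ~0.25s, 0.5s, 1s, ... up to 30s."""
    delays = []
    for n in range(attempts):
        delay = min(base * (2 ** n), cap)
        delays.append(delay * random.uniform(1 - jitter, 1 + jitter))
    return delays

def within_budget(spent_wall_s: float, spent_llm_usd: float,
                  max_wall_s: float = 180.0, max_llm_usd: float = 0.02) -> bool:
    """Retry budget per user-intent: abandon or escalate to a human once exhausted."""
    return spent_wall_s < max_wall_s and spent_llm_usd < max_llm_usd
```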

Observability for humans, not just machines

If your team can’t answer “where is this job and why is it stuck?” in 60 seconds, you don’t have observability—you have logs. Build first-class tooling.

  • Correlation IDs end-to-end: One ID per intent, propagated across HTTP, queue messages, and workflow steps. Index by ID in logs and traces.
  • Heartbeats and liveness: Long tasks must send heartbeats every 10–60 seconds. Kill and reschedule on missed heartbeats after a grace window.
  • DLQ hygiene: Dead-letter queues are not a graveyard. They’re a paged, triaged queue. Auto-aggregate by root cause and expose remediation buttons (retry, compensate, escalate) in your internal console.
  • SLAs and SLOs by class: Publish 50/95/99th percentiles by workload class. Typical targets we see work: Class B 95th ≤ 10s, Class C 95th ≤ 5m, Class D time-bound by human step, with timers so nothing rots.
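The heartbeat rule above can be sketched as a small monitor keyed by correlation ID. Timestamps are passed in explicitly here to keep the sketch testable; a real monitor would read a clock and fire the kill-and-reschedule action itself.

```python
class HeartbeatMonitor:
    """Track last heartbeat per correlation ID; flag tasks past the grace window."""
    def __init__(self, interval_s: float = 30.0, grace_s: float = 90.0):
        self.interval_s = interval_s
        self.grace_s = grace_s
        self._last = {}  # correlation_id -> last heartbeat timestamp

    def beat(self, correlation_id: str, now: float) -> None:
        self._last[correlation_id] = now

    def stale(self, now: float) -> list:
        """Correlation IDs to kill and reschedule: no beat within interval + grace."""
        deadline = self.interval_s + self.grace_s
        return sorted(cid for cid, ts in self._last.items()
                      if now - ts > deadline)
```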

Concrete cost models you can defend to Finance

Cost is where async wins or loses credibility. Here are reference points using public pricing as of 2026. Adjust for your region and free tiers.

Queue + workers (AWS example)

  • SQS: ~$0.40 per 1M requests. 10M messages/month ≈ $4.
  • Compute (Lambda example): 10M jobs at 256MB for 5 seconds each = 10,000,000 × 5s × 0.256 GB = 12,800,000 GB-seconds. At ~$0.00001667/GB-s, ≈ $213. Add request fees and some overage; call it $230.
  • Total ballpark: ~$234/month plus egress bandwidth and storage. If your jobs run longer or need containers, budget accordingly (ECS/K8s nodes, autoscaling).
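The math above is easy to encode so Finance can re-run it with your own volumes. The rates are the public figures cited in this section; substitute your region's pricing.

```python
SQS_PER_MILLION = 0.40          # $ per 1M SQS requests (public rate cited above)
LAMBDA_PER_GB_S = 0.0000166667  # $ per GB-second of Lambda compute

def queue_worker_cost(jobs: int, mem_gb: float, secs_per_job: float) -> float:
    """Monthly ballpark: SQS request fees + Lambda GB-seconds.
    Excludes Lambda request fees, egress bandwidth, and storage."""
    sqs = jobs / 1_000_000 * SQS_PER_MILLION
    gb_seconds = jobs * secs_per_job * mem_gb
    return sqs + gb_seconds * LAMBDA_PER_GB_S
```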

Workflow engine (Step Functions Standard)

  • State transitions: $0.025 per 1,000. For a 10-step workflow executed 1M times/month: 10M transitions ≈ $250.
  • Task compute: Same math as above for Lambda or container tasks.
  • Total ballpark: $250 for orchestration + compute. For human-in-loop, you’ll add storage for state and a modest DB footprint (~$50–$200/month).

Temporal

  • Self-hosted: Infra is heavier (database + services). Expect low five figures per year in infrastructure, plus meaningful ops time, once you’re at scale (>100M activity executions/year).
  • Cloud: Vendor pricing varies by execution volume and storage; it can be cheaper than human toil if you need deterministic replay and complex flows. The ROI is in defect reduction and developer velocity, not raw compute cost.

Sanity test: If you’re spending more on reprocessing failures and manual cleanups than on orchestration, you’re under-invested in async.

Human-in-the-loop that doesn’t block the world

Agents fail in subtle ways—hallucinated API fields, partial updates, policy violations. Human approvals should be asynchronous, high-signal, and capped in time.

  • Timed approvals: Every manual step has a timer (e.g., 15 minutes). On timeout, either auto-approve using a risk score threshold or auto-reject with a compensating action.
  • Thin review UIs: Build a single panel showing the prompt, tool calls, diffs on proposed changes, and a one-click approve/deny with reason capture.
  • Sampling: Start with 100% review for risky actions. Ratchet down to 5–10% sampling as your metrics stabilize and false negative rate drops.
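The timed-approval rule can be sketched as a small state machine. Timestamps are injected to keep the sketch testable; the deadline, risk score, and 0.3 threshold are illustrative.

```python
class TimedApproval:
    """One manual step with a timer; on timeout, decide from the risk score."""
    def __init__(self, deadline_s: float, risk_score: float,
                 risk_threshold: float = 0.3):
        self.deadline_s = deadline_s
        self.risk_score = risk_score
        self.risk_threshold = risk_threshold
        self.decision = None

    def record(self, decision: str, now: float) -> None:
        if now <= self.deadline_s and self.decision is None:
            self.decision = decision      # reviewer answered within the timer

    def resolve(self, now: float):
        if self.decision is not None:
            return self.decision
        if now <= self.deadline_s:
            return None                   # still waiting on the reviewer
        if self.risk_score <= self.risk_threshold:
            return "auto-approved"        # timeout + low risk score
        return "auto-rejected"            # timeout + risky: schedule compensation
```

The key property: the workflow never blocks forever on a human, and every timeout path ends in either a safe default or a compensating action.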

Testing and deploys: determinism or bust

Most agent bugs are “works on my laptop” but fail after a deploy or a retry. You need determinism where it counts.

  • Replayable workflows: Prefer engines that can replay decisions. If you’re DIY, snapshot state at step boundaries and support re-run from step N with the same inputs.
  • Property-based tests for tools: For each external tool, generate perturbed inputs and assert side-effects. Catch IDORs, rate limit gaps, and missing auth early—a recurring theme as AI-generated APIs ship faster than AppSec can review.
  • Canary on orchestrator changes: Version your workflows. Run canaries (1–5%) before full cutover. Old executions should finish on the old version; new ones start on the new.

A pragmatic 30-60-90 plan

Days 0–30: Classify, contain, and stop the bleeding

  • Inventory all agent flows. Label them Class A–D. Draw a swimlane diagram with tools and egress per step.
  • Implement idempotency keys at API boundaries. Add an outbox table. Turn on DLQs with paging.
  • Introduce a simple egress proxy with an allowlist. Move secrets to a vault and rotate any long-lived keys.

Days 31–60: Stand up durable execution for Class C/D

  • Pick your orchestrator: Step Functions/Durable if you’re all-in on a cloud; Temporal if you need language-native flows and replay.
  • Migrate the noisiest two flows. Add heartbeats, correlation IDs, and human approvals with timers. Instrument 50/95/99th percentiles.
  • Set retry budgets and per-endpoint rate limits. Tune backoff and worker autoscaling on queue age, not CPU.

Days 61–90: Make it boring

  • Build an internal “Jobs” console: searchable by correlation ID, with retry/compensate/escalate controls.
  • Write a playbook for DLQ triage and weekly reviews. Track top 3 root causes and burn them down.
  • Harden security: expand the egress allowlist, enforce per-tool roles, enable prompt/response redaction at the proxy.

What about nearshore teams?

Distributed teams actually benefit from durable async: a São Paulo team hands off to a New York team with the same correlation IDs, dashboards, and retry semantics. You get 6–8 hours of overlap, plus 16–18 hours/day of progress as long jobs continue safely while teams sleep. The trade-off: you must invest in observability and one-click remediation so no one is spelunking logs at 3 a.m.

Common anti-patterns to kill now

  • Long polling over HTTP for minutes: This ties up infra and invites timeouts. Use callbacks or webhooks with task tokens.
  • Saving model state in memory between steps: A deploy will kill it. Persist checkpoints in a store (S3, Postgres) and pass references.
  • Global retries in client SDKs: Retries belong server-side with budgets and visibility, not hidden in SDK defaults.
  • Environment variables as the “secret store”: They leak via logs and platform env inspectors. Use a vault and short-lived credentials.

Final word

Async isn’t a style preference. It’s the only way to ship agentic features that survive reality—network flakiness, flaky APIs, human approvals, deploys, and the occasional model meltdown—without lighting money on fire. The right answer is rarely a brand-new platform wholesale. It’s a sharp taxonomy, idempotency, a boring queue where you can, and a real workflow engine where you must. Done right, your agents become dependable teammates, not expensive chaos machines.

Key Takeaways

  • Classify workloads A–D before picking tools; queues for Class B, workflow engines for C/D.
  • Adopt effectively-once delivery with idempotency keys and outbox/inbox patterns.
  • Secure egress with an allowlist proxy, scoped short-lived credentials, and optional LLM judge.
  • Use retry budgets, heartbeats, DLQs, and correlation IDs to make failures cheap and visible.
  • Model costs: queues are pennies per million ops; workflow engines add ~$25 per 1M transitions.
  • Version workflows and enable replay/canaries; approvals should be timed and minimal.
  • Nearshore teams gain continuity if observability and one-click remediation are in place.
