Most agent demos look great until they leave the developer’s laptop. The UX tanks, costs spike, and no one can explain why an agent pushed a broken change on Friday at 5 p.m. With Cloudflare’s new AI platform for agents making headlines and desktop-level control re-emerging in the latest Codex-like features, you’re being asked a fair question: what is our production architecture for agents?
Here’s the answer we ship: treat agents as distributed, event-driven systems at the edge—not as monolithic chatbots. Below is a pragmatic reference architecture you can put in front of your team today. It’s vendor-agnostic and biased toward measurable latency, enforceable policy, and cost predictability. It also scales from a single product squad to org-wide automation without a platform rewrite.
From laptop bots to edge-native agents
Two things changed in 2026:
- Edge runtimes got agent-aware. Cloudflare announced an inference layer designed for agents (think Workers + Queues + Durable Objects + Vectorize + AI inference), making it realistic to keep state and tools near users. Vercel, Fly.io, and the hyperscalers followed with their flavors.
- Desktop agents are back. OpenAI’s beefed-up Codex-style features can control your computer in the background. That’s powerful—and dangerous—without policy, audit, and proper scoping. You need server-side guardrails even if your UI runs on a laptop.
Your job as CTO isn’t to pick the flashiest demo—it’s to design a system where an agent’s decisions are observable, reversible, and cheap enough to run at scale.
Production requirements (the bar you should set)
- Latency: Sub-1s perceived latency for common turns in North America; edge compute should stay <100 ms per function, retrieval <150 ms in-region, small-model reasoning 300–800 ms, large-model reasoning 1.5–4 s. Keep the user waiting only when it matters.
- Safety: Explicit policy on what tools the agent can call, with resource scoping and human approval on high-risk actions.
- Observability: OpenTelemetry traces and structured events for every action, with replay. If you can’t replay an incident, you didn’t build a platform—you built a demo.
- Cost control: Hard token budgets per turn/session, admission control for long prompts, and backpressure for bursty workloads. Assume AI compute scarcity; design for it.
- Portability: Swap models and tool hosts without refactoring your agent logic. Avoid glue-code that bakes in one LLM vendor’s quirks.
An edge-native reference architecture
Map these concepts to your stack of choice. I’ll use Cloudflare terms in parentheses and note equivalent components on AWS/GCP to keep it grounded.
1) Front door and identity
- Edge API: A lightweight gateway at the edge (Workers, API Gateway + CloudFront, or Fastly Compute@Edge) terminates TLS, authenticates requests (OIDC/JWT), and normalizes payloads.
- Session metadata: Attach user/org IDs, risk tier, and a per-session token budget. Store short-lived session state in an edge key-value or session object (Durable Objects; AWS: DynamoDB + Lambda@Edge).
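The session metadata above can be sketched as a small record the front door writes into edge state. This is an illustrative shape, not a real API; the names (`SessionMeta`, `newSession`, `chargeTokens`) and the 40K default budget are assumptions drawn from the budgets discussed later in this article.

```typescript
type RiskTier = "low" | "medium" | "high";

// Hypothetical per-session state kept in a Durable Object or KV entry.
interface SessionMeta {
  userId: string;
  orgId: string;
  riskTier: RiskTier;
  tokenBudget: number; // hard cap for the whole session
  tokensUsed: number;
  expiresAt: number;   // epoch ms; short-lived edge state
}

function newSession(
  userId: string,
  orgId: string,
  riskTier: RiskTier,
  budget = 40_000,
  ttlMs = 30 * 60_000,
): SessionMeta {
  return { userId, orgId, riskTier, tokenBudget: budget, tokensUsed: 0, expiresAt: Date.now() + ttlMs };
}

// Charge a turn against the budget; returns false when the cap would be exceeded.
function chargeTokens(s: SessionMeta, tokens: number): boolean {
  if (s.tokensUsed + tokens > s.tokenBudget) return false;
  s.tokensUsed += tokens;
  return true;
}
```

Keeping the budget check next to the session record means every downstream component can refuse work before spending tokens.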
2) Orchestrator and scratchpad
- Event bus: Each user turn emits an event to a queue (Cloudflare Queues; AWS SQS; GCP Pub/Sub). This decouples UI latency from downstream work.
- Orchestrator: A stateless worker consumes events, runs the control loop (function calling, planning), and persists an Action Graph—a DAG of tool calls with inputs/outputs—into a session object (Durable Objects; or Redis/DynamoDB + Step Functions).
- Scratchpad: Store chain-of-thought-like data and plans as private agent state not exposed to users, and redact it from logs. The model gets the plan; logs get the result and rationale summary.
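The Action Graph can be modeled as a list of nodes with explicit dependencies; the orchestrator repeatedly asks which nodes are ready to dispatch. A minimal sketch, with illustrative field names:

```typescript
// Illustrative Action Graph node: one tool call persisted per session.
interface ActionNode {
  id: string;
  tool: string;
  input: unknown;
  output?: unknown;
  dependsOn: string[]; // ids of upstream nodes
  status: "pending" | "running" | "done" | "failed";
}

// Return nodes whose dependencies are all done — the next batch to dispatch.
function runnable(graph: ActionNode[]): ActionNode[] {
  const done = new Set(graph.filter(n => n.status === "done").map(n => n.id));
  return graph.filter(n => n.status === "pending" && n.dependsOn.every(d => done.has(d)));
}
```

Because the graph is plain data, persisting it to a Durable Object or DynamoDB item also gives you the replay source the observability section relies on.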
3) Reasoning engine (model abstraction)
- Model router: A thin abstraction that can target closed models (OpenAI/Anthropic), edge-hosted inference (Cloudflare AI), or open weights (e.g., Qwen3-30B-A3B, Llama 3.x) on serverless GPUs (Fireworks, Modal) based on policy: data sensitivity, cost, speed, and accuracy requirements.
- JSON-only interfaces: Use structured outputs (JSON Schema) for tool selection, arguments, and summaries. Reject malformed responses at the router and re-prompt with a smaller budget rather than letting garbage flow downstream.
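Rejecting malformed responses at the router can be as simple as a strict parse-and-check before anything flows downstream. This is a hand-rolled sketch, not a JSON Schema library; in production you would validate against the full schema (e.g., with Ajv). Field names are illustrative.

```typescript
// Expected structured output from the model for tool selection.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

// Parse a raw model response; return null (so the router can re-prompt with a
// smaller budget) if it is not valid JSON or does not name an allowed tool.
function parseToolCall(raw: string, allowedTools: string[]): ToolCall | null {
  try {
    const v = JSON.parse(raw);
    if (typeof v?.tool !== "string" || !allowedTools.includes(v.tool)) return null;
    if (typeof v?.args !== "object" || v.args === null || Array.isArray(v.args)) return null;
    return v as ToolCall;
  } catch {
    return null;
  }
}
```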
4) Tool execution and sandboxing
- WASM sandbox: Run quick, untrusted tools in WASI/WASM with time and memory caps. For longer actions, dispatch to serverless functions or short-lived jobs (Workers/Pages Functions; AWS Lambda/Fargate Jobs; GCP Cloud Run Jobs).
- Credential policy: Issue ephemeral, scoped credentials per tool invocation using OIDC and a secret manager (Vault, AWS STS + Secrets Manager). No long-lived tokens embedded in prompts.
- Desktop actions: If you let an agent “use a computer,” proxy through a broker that exposes a restricted tool surface (e.g., open URL in sandboxed browser, read selected window text, simulate click within bounding box) and logs every action with screenshots for audit. Never give it the full OS.
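The credential policy above can be sketched as a tiny broker that mints and checks ephemeral, tool-scoped tokens. In production the token would come from STS or Vault via OIDC, not be generated locally; the names and the 5-minute default TTL here are illustrative (the TTL mirrors the PR-prep example later in the article).

```typescript
// Hypothetical ephemeral credential minted per tool invocation.
interface ScopedToken {
  tool: string;
  scopes: string[];
  expiresAt: number; // epoch ms
}

function mintToken(tool: string, scopes: string[], ttlMs = 5 * 60_000, now = Date.now()): ScopedToken {
  return { tool, scopes, expiresAt: now + ttlMs };
}

// Check tool, scope, and expiry before every invocation; never reuse across tools.
function authorize(t: ScopedToken, tool: string, scope: string, now = Date.now()): boolean {
  return t.tool === tool && t.scopes.includes(scope) && now < t.expiresAt;
}
```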
5) Memory layers
- Short-term: Per-session state (Durable Objects; DynamoDB/Redis). TTL minutes to hours.
- Long-term: RAG corpora in a vector store near the agent (Cloudflare Vectorize; pgvector on Neon; Pinecone; Weaviate). Keep embeddings close to where inference runs to avoid egress latency and cost.
- Transactional: A relational database (Postgres/PlanetScale) for entitlements, approvals, and durable audit.
6) Policy and approvals
- Policy engine: Evaluate each planned tool call against policy (OPA/Rego or Amazon Cedar). Inputs: user role, environment (prod/stage), dataset sensitivity, tool risk tier, token budget remaining.
- Human-in-the-loop: For risky actions (merging to main, touching PII, spending >$X), block and emit a notification to Slack/Email with a one-click approval flow. Record the approver and rationale.
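The policy inputs listed above map naturally to a three-way decision. In practice this logic lives in OPA/Rego or Cedar; the TypeScript below is a behavioral sketch with illustrative thresholds, not a policy engine.

```typescript
type Decision = "allow" | "require_approval" | "deny";

// Illustrative subset of the policy inputs named above.
interface PolicyInput {
  env: "prod" | "stage";
  toolRisk: "low" | "medium" | "high";
  touchesPII: boolean;
  estimatedCostUsd: number;
  budgetRemainingUsd: number;
}

function evaluate(p: PolicyInput, approvalThresholdUsd = 50): Decision {
  // Hard deny when the remaining budget cannot cover the action.
  if (p.estimatedCostUsd > p.budgetRemainingUsd) return "deny";
  // High-risk or PII-touching actions in prod go to a human.
  if (p.env === "prod" && (p.toolRisk === "high" || p.touchesPII)) return "require_approval";
  // Spend above the threshold also requires sign-off.
  if (p.estimatedCostUsd > approvalThresholdUsd) return "require_approval";
  return "allow";
}
```

The "require_approval" branch is where the Slack/Email card with approver and rationale gets emitted and recorded.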
7) Observability and replay
- Traces: Emit OpenTelemetry spans for each step: prompt build, model call, retrieval, tool call, policy check. Correlate with a session and user.
- Event log: Persist a compact, redacted event stream (Queue → object storage). Store inputs/outputs, model version, cost, and latency. This is your replay source of truth.
- Redaction: PII scrubbing before anything leaves the edge runtime. Keep a non-redacted, access-controlled vault only if legally required.
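A minimal redaction pass, run before any event leaves the edge runtime, might look like the sketch below. The patterns are illustrative; production scrubbing needs a proper PII classifier, and the tag names are assumptions.

```typescript
// Ordered redaction rules: email, US SSN, then card-like digit runs.
const REDACTIONS: Array<[RegExp, string]> = [
  [/[\w.+-]+@[\w-]+\.[\w.]+/g, "<email>"],
  [/\b\d{3}-\d{2}-\d{4}\b/g, "<ssn>"],
  [/\b(?:\d[ -]?){13,16}\b/g, "<card>"],
];

// Apply every rule in order and return the scrubbed text for logging.
function redact(text: string): string {
  return REDACTIONS.reduce((t, [re, tag]) => t.replace(re, tag), text);
}
```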
8) Cost controls
- Budgets: Hard caps per turn and per session. Example: 8K input + 2K output tokens per turn, 40K session cap, with a backoff strategy when you hit 80% of cap.
- Admission control: Reject prompts >N KB, suggest chunking, or switch to a cheaper/smaller model with explicit user consent.
- Backpressure: When queues back up, degrade gracefully: smaller models, fewer tools, or batch retrievals.
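The three controls above compose into a single admission decision per turn. A sketch with illustrative thresholds (the 80% backoff point comes from the budget example above; the queue-depth and prompt-size limits are assumptions):

```typescript
type Tier = "frontier" | "small" | "reject";

// Decide which model tier serves this turn, or reject it outright.
function pickTier(queueDepth: number, budgetUsedFrac: number, promptKb: number): Tier {
  if (promptKb > 64) return "reject"; // admission control: suggest chunking instead
  if (budgetUsedFrac >= 0.8 || queueDepth > 100) return "small"; // backoff/backpressure
  return "frontier";
}
```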
Concrete example: a PR-prep agent for your monorepo
Goal: The agent drafts a pull request that closes a Jira ticket by scaffolding code changes, tests, and a migration plan—without ever pushing to main without approval.
- User posts a ticket link. Edge API authenticates, allocates a 40K token session budget, and writes a session object.
- Orchestrator fetches the ticket, runs RAG over relevant code paths using a vector index (nearest edge region). Target latency: 80–150 ms for retrieval.
- Reasoning engine (open 14B or closed premium) plans: read files A/B, generate patch, run tests, propose DB migration, request approval to open a PR.
- Tool layer executes in order: read files (WASM), draft patch (model), run tests (serverless job), and prepare a PR. Each tool call carries a scoped token valid for 5 minutes.
- Policy engine blocks git push and emits an approval card to Slack with a diff summary and cost/latency so far. Approver clicks “Allow” or “Revise.”
- On approval, the agent opens the PR and posts a trace link. All actions are captured in a replayable event log.
Latency budget (median):
- Edge request + auth: 30–60 ms
- RAG: 80–150 ms
- Planning LLM (small model): 300–600 ms
- Patch generation (model): 700–2,000 ms depending on size/model
- Tests (serverless job): parallelized; user sees streaming updates
- Total time to useful output: <1 s for plan preview; 2–5 s for first patch draft
Cost sketch (update with your vendor pricing):
- Define cost per 1M tokens for each model tier. Compute per-turn: (input + output tokens) / 1,000,000 × price.
- Small open-weight model turns can be an order of magnitude cheaper than frontier models. Use them for planning and tool selection; reserve frontier models for code synthesis or hard reasoning.
- Egress still matters. Public cloud egress is often around $0.05–$0.09/GB. If your agent shuttles large artifacts (logs, binaries), co-locate compute with data or move artifacts to edge storage.
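The per-turn formula above is one line of code; the only work is keeping the price table current. Prices below are placeholders, not quotes.

```typescript
// Per-turn cost: (input + output tokens) / 1,000,000 × price per 1M tokens.
function turnCostUsd(inputTokens: number, outputTokens: number, pricePerMTokUsd: number): number {
  return ((inputTokens + outputTokens) / 1_000_000) * pricePerMTokUsd;
}

// Example: an 8K-in/2K-out turn at a placeholder $10/1M-token frontier rate.
const frontierTurn = turnCostUsd(8_000, 2_000, 10);
// The same turn at a placeholder $1/1M-token small-model rate is 10x cheaper.
const smallTurn = turnCostUsd(8_000, 2_000, 1);
```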
Build vs. buy: what to outsource to the platform
You don’t need to build everything. Here’s a pragmatic split.
- Buy/Leverage: Edge runtime, queues, object storage; model hosting for at least two tiers (fast/cheap and accurate/expensive); vector DB with regional placement; OTel-compatible logging; secret management.
- Build: Orchestrator (your control loop), policy integration (your risk), tool catalog and sandbox policy, redaction, replay UI, and cost controls. These encode your business rules and are your moat.
Cloudflare’s integrated agent stack reduces glue-work; AWS/GCP give you granular control with more assembly. Either way, keep the model router and policy layer yours to avoid vendor lock-in.
Team shape and timeline (nearshore-friendly)
- Core team: 1 platform engineer (edge/runtime/queues), 1 AI engineer (prompting, router, evals), 1 application engineer (tools, domain logic), 0.5 SRE/observability. Add a security engineer for desktop-agent scope.
- Timeline: A governed pilot in 4 weeks is realistic with a focused scope:
  - Week 1: Front door, identity, tracing, and a stubbed orchestrator. Choose two models and wire a router.
  - Week 2: Implement three tools in WASM (read-only), add vector search, and the policy engine in “report-only” mode.
  - Week 3: Turn on blocking policy for one risky action, add Slack approval, and redaction. Start cost dashboards.
  - Week 4: Ship the PR-prep agent to one squad. Define SLOs: p95 agent response <2.5 s, cost per turn <$X, zero policy bypasses.
Brazilian nearshore teams work with 6–8 hours of overlap with US time zones, which keeps the integration build moving while your core team focuses on policy and model selection.
Risk management and trade-offs
- Accuracy vs. cost: Use small models for planning and tool routing; escalate selectively. You’ll cut spend 30–70% without hurting outcomes.
- Latency vs. portability: Edge-hosted inference accelerates UX but can reduce model choice. Keep a fallback path to a second provider for failover and regression checks.
- Safety vs. autonomy: More approvals reduce incidents but hurt throughput. Start strict, then relax with confidence metrics and post-hoc audits.
- Desktop agent scope: It’s tempting to let agents “just do it.” Don’t. Mediate every desktop action through a narrow tool surface and record everything.
How to decide where each component runs
Don’t blindly push everything to the edge. Use this simple rubric:
- Edge: Authentication, request shaping, lightweight planning, retrieval when the corpus is already cached or replicated. Rule of thumb: work <100 ms and <256 KB I/O.
- Regional: Heavy RAG against transactional data, long context windows, and anything that would incur cross-region egress if executed at the edge.
- Background: Tools that run >2 s or need isolation (tests, compiles, indexing). Always checkpoint to the event log; stream partials to the UI.
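The rubric above reduces to a small placement function. Thresholds mirror the rules of thumb stated in the list (<100 ms and <256 KB I/O for edge, >2 s for background); the function and its signature are illustrative.

```typescript
type Placement = "edge" | "regional" | "background";

// Decide where a step runs from its estimated duration, I/O size, and egress.
function place(estMs: number, ioKb: number, crossRegionEgress: boolean): Placement {
  if (estMs > 2_000) return "background"; // long or isolated work: jobs + event-log checkpoints
  if (estMs < 100 && ioKb < 256 && !crossRegionEgress) return "edge";
  return "regional"; // heavy RAG, long contexts, or anything that would incur egress
}
```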
Evaluation: prove it works before scaling
- Golden tasks: 20–50 real tasks representative of production. Track success rate, average turns, cost per success, and time-to-first-useful-output.
- Guardrail tests: Simulate risky prompts and ensure the policy engine blocks them. Include prompt-injection scenarios and data exfiltration attempts.
- Load: Soak test with realistic concurrency. Watch queue depth and backpressure behavior; verify graceful degradation policies kick in.
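The golden-task metrics above fit in a small scorecard you can recompute on every run of the suite. Record shape and field names are illustrative assumptions.

```typescript
// One recorded run of a golden task.
interface TaskRun {
  success: boolean;
  turns: number;
  costUsd: number;
  msToFirstOutput: number;
}

// Aggregate the metrics named above: success rate, average turns, cost per success.
function scorecard(runs: TaskRun[]) {
  const ok = runs.filter(r => r.success);
  return {
    successRate: ok.length / runs.length,
    avgTurns: runs.reduce((s, r) => s + r.turns, 0) / runs.length,
    // Total spend divided by successes: failures still cost money.
    costPerSuccess: ok.length ? runs.reduce((s, r) => s + r.costUsd, 0) / ok.length : Infinity,
  };
}
```

Tracking cost per *success* rather than per run is the lever that exposes expensive-but-flaky configurations early.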
What’s next: prepare for scarcity
Agent demand is outpacing supply. Expect periodic compute scarcity and model throttling. Design now for multi-model routing, budget-aware fallbacks, and coarse-grained SLAs (“fast-and-good-enough” vs. “slow-but-accurate”). When you get squeezed, the teams with cost levers and observability will keep shipping while everyone else pauses rollouts.
Key Takeaways
- Stop treating agents like chatbots; treat them as edge-first, event-driven systems.
- Build your own orchestrator, policy layer, model router, and replay; buy the plumbing.
- Hit sub-second UX by keeping planning and retrieval at the edge; push long work to jobs.
- Use small models for planning and escalate selectively to control spend.
- Enforce scoped, ephemeral credentials and human approvals for risky actions.
- Instrument everything with OpenTelemetry and keep a redacted, replayable event log.
- Design for multi-model, multi-provider from day one to survive AI compute scarcity.