Your AI agents are drowning you in telemetry. Long-lived token streams, tool calls, retries, prompts, streaming SSE — it all adds up. If you keep piping every span and log line into a SaaS APM, you will pay through the nose and you will leak prompts you never intended to store off-VPC. This week, Hacker News surfaced Traceway — an MIT-licensed, self-hosted observability stack you can deploy in minutes. That’s not a curiosity. It’s a signal. In 2026, AI-heavy stacks need an observability strategy that is cost-aware, PII-aware, and portable.
What changed: AI agent telemetry isn’t like web microservices
Observability vendors optimized for microservices: short requests, uniform spans, low-entropy logs. Agents flip the assumptions.
- Sessions are long-lived. A user kicks off an agentic workflow that streams 30–180 seconds of tokens, issues a dozen tool calls, and writes back to the UI the whole time.
- Payloads are sensitive. Prompts, attachments, and tool results often carry PII, PHI, code, or contracts. If you cannot prove where that data lives, legal will eventually prove you wrong.
- Signal is tail-heavy. The interesting failures are rare: timeouts on the 99.5th percentile, one mis-specified tool call out of a hundred. Head-based sampling discards what you need. Tail decisions matter.
- Clients are first-class. Mobile and desktop apps now run inference and tools locally or on the LAN. You need cross-device session traces, not just server spans.
Put bluntly: agent observability is closer to analytics and eDiscovery than to server logs. Treat it that way.
The cost model you should actually run
Skip vendor sticker prices; they change monthly and are intentionally opaque. Model your own data gravity. Here is a defensible baseline for an AI-enhanced product:
- Daily agent sessions: 250,000
- Per session: 10 tool-call spans (~1 KB metadata each), 1 long-lived model span with prompt + response (~10–50 KB compressed), 200–500 token/telemetry events (~5–15 KB compressed total), 5–10 log entries (~2–5 KB compressed)
Conservative raw-per-session footprint: 30–70 KB. At 250k sessions, that’s 7.5–17.5 GB/day before enrichment. With additional attributes, HTTP headers, and context baggage, 2× is common. Call it 15–35 GB/day of observability data.
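Run the arithmetic yourself rather than trusting anyone's round numbers. A minimal sketch in Python using the session count and per-session footprint above; the 2× enrichment factor is the same assumption stated in the text:

```python
# Back-of-envelope volume model using the per-session footprint above.
SESSIONS_PER_DAY = 250_000
PER_SESSION_KB = (30, 70)   # conservative compressed footprint per session
ENRICHMENT = 2              # extra attributes, headers, context baggage

for kb in PER_SESSION_KB:
    raw_gb = SESSIONS_PER_DAY * kb / 1_000_000   # KB -> GB (decimal)
    print(f"{kb} KB/session: {raw_gb:.1f} GB/day raw, "
          f"~{raw_gb * ENRICHMENT:.0f} GB/day enriched")
# -> 7.5 and 17.5 GB/day raw; roughly 15-35 GB/day once enriched
```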
Now the money:
- Hot-tier storage (ClickHouse, or Tempo paired with ClickHouse, for traces; Loki for logs; VictoriaMetrics for metrics) on modest instances compresses trace/log payloads 8–12×. The 15–35 GB/day raw typically lands at 2–4 GB/day on disk for traces/logs, plus 0.5–1 GB/day for metrics after rollups.
- Object storage (S3/GCS) for cold retention runs ~$0.021–0.026/GB-month in 2026. 1 TB of monthly cold traces/logs is <$30/month, plus retrieval when you need to dig.
- Compute: a 3-node ClickHouse cluster on general-purpose instances often handles 10–50k spans/sec aggregate with sub-second query latency. Expect $600–$2,000/month depending on region and instance class. Loki + VictoriaMetrics add hundreds, not thousands.
Compare that to shipping 15–35 GB/day to a SaaS APM and keeping 7–14 days hot. Ingest and retention charges commonly scale into the low five figures monthly at this volume, and you still have privacy externalities. Self-hosting is not “free,” but the crossover point is lower than most teams think.
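The useful question is not "what does the vendor charge" but "at what blended rate per GB does SaaS cross the self-hosted bill." A rough sketch, with the compute and compression figures treated as assumptions you should replace with your own quotes:

```python
# At what blended $/GB does SaaS exceed the self-hosted bill? All inputs are
# assumptions for illustration; substitute your own quotes and instance pricing.
DAILY_GB = 25                 # midpoint of the 15-35 GB/day estimate
SELF_HOST_COMPUTE = 1_800     # assumed ClickHouse + Loki + VictoriaMetrics, $/month
S3_PER_GB_MONTH = 0.023       # object storage, cold tier
COLD_COMPRESSION = 10         # assumed compression ratio into cold storage

monthly_gb = DAILY_GB * 30
self_host = SELF_HOST_COMPUTE + (monthly_gb / COLD_COMPRESSION) * S3_PER_GB_MONTH
breakeven = self_host / monthly_gb
print(f"Self-host ~${self_host:,.0f}/mo; SaaS costs more above ~${breakeven:.2f}/GB blended")
```

At that volume, self-hosting costs more only if your blended SaaS rate stays under a couple of dollars per GB; the low-five-figure invoices described above work out to well over $10/GB.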
Architecture: one stack that won’t fight you
You do not need a science project. You need four things that run in your VPC and speak OpenTelemetry end-to-end.
1) Instrumentation and data contracts
- Standardize on OpenTelemetry for traces, logs, and metrics. Adopt the emerging OpenTelemetry semantic conventions for AI/LLM spans: model name, tokens in/out, tool calls, retry counts.
- Define a data contract for prompts and tool payloads. By default, store hashes, not raw text. Store full prompts/responses only when a per-session “debug” flag is set. Attach a `pii=true` attribute when any detector fires (see the sketch after this list).
- Instrument client apps (Android, iOS, desktop) to emit spans and logs with the same trace and user/session IDs. For Android, prefer OTLP over gRPC to reduce head-of-line blocking on poor networks.
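A minimal sketch of that contract at the instrumentation layer, assuming the Python OTel SDK. The gen_ai.* attribute names follow the emerging GenAI semantic conventions (check your semconv version for the exact keys); session.id, prompt.sha256, and the debug flag are our own contract, not a standard:

```python
import hashlib

from opentelemetry import trace

tracer = trace.get_tracer("agent-orchestrator")

def record_model_call(model: str, session_id: str, prompt: str, response: str,
                      tokens_in: int, tokens_out: int,
                      pii_detected: bool, debug: bool) -> None:
    # One span per model call; in practice this wraps the actual API call so
    # duration and time-to-first-token fall out of the span itself.
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.usage.input_tokens", tokens_in)
        span.set_attribute("gen_ai.usage.output_tokens", tokens_out)
        span.set_attribute("session.id", session_id)
        # Contract default: hashes only, never raw text.
        span.set_attribute("prompt.sha256", hashlib.sha256(prompt.encode()).hexdigest())
        span.set_attribute("response.sha256", hashlib.sha256(response.encode()).hexdigest())
        if pii_detected:
            span.set_attribute("pii", True)   # any detector fired
        if debug:
            # Per-session debug flag: the only path that stores full payloads.
            span.set_attribute("prompt.text", prompt)
            span.set_attribute("response.text", response)
```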
2) Ingest and control plane
- OpenTelemetry Collector everywhere. Use processors for attribute redaction, tail-based sampling, and routing. Tail sampling is table stakes for agents; you need to keep outliers even when overall QPS is high.
- Event budgets: enforce max spans/logs per session at the collector, with hard caps and backpressure metrics. A misbehaving agent should not DDoS your observability tier.
- PII guardrails: use processors to drop attributes matching detectors and to fork unredacted payloads to a sealed S3 bucket with a 7-day TTL, no UI access, IAM-only retrieval for incidents.
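In production these guardrails live in the Collector's processor pipeline, but the logic is simple enough to sketch. A minimal version of the per-session budget and attribute redaction described above; the caps and redaction keys are illustrative:

```python
from collections import defaultdict

# Per-session event budget plus attribute redaction. In production this lives in
# the Collector pipeline; caps and key names here are illustrative.
MAX_SPANS_PER_SESSION = 500
REDACT_KEYS = {"prompt.text", "response.text", "http.request.header.authorization"}

_span_counts: defaultdict[str, int] = defaultdict(int)
_dropped: defaultdict[str, int] = defaultdict(int)

def admit_span(session_id: str, attributes: dict) -> dict | None:
    """Return redacted attributes if the span fits the budget, else None (drop it)."""
    _span_counts[session_id] += 1
    if _span_counts[session_id] > MAX_SPANS_PER_SESSION:
        _dropped[session_id] += 1   # export this counter as a backpressure metric
        return None
    return {k: ("[REDACTED]" if k in REDACT_KEYS else v) for k, v in attributes.items()}
```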
3) Storage and query
- Traces: ClickHouse-backed tracing (e.g., via SigNoz, HyperDX, or a lightweight router) or Grafana Tempo paired with ClickHouse for search indices. ClickHouse gives you SQL over spans and can double as your prompt-analytics warehouse.
- Logs: Loki for cost-effective, index-lite logs; object storage-backed chunks keep infra simple. If you must query structured logs at scale, stream to ClickHouse tables with partitioning by day and service.
- Metrics: VictoriaMetrics for a single-binary, high-cardinality-friendly time series store. It compresses numeric series aggressively; expect 2–5× better storage vs raw Prometheus TSDB. If you emit high-resolution floating-point streams, consider a lossless compressor purpose-built for float telemetry; research like “fc, a lossless compressor for floating-point streams” shows additional 2–4× gains on some signals.
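To make the "SQL over spans" point from the first bullet concrete: once spans land in ClickHouse, model-level latency and token analytics are one query away. A sketch using clickhouse-connect; the table name, duration column, and attribute keys depend on how your pipeline writes spans, so treat them as placeholders:

```python
import clickhouse_connect

# p95 model latency and output-token volume per model over the last 24 hours.
client = clickhouse_connect.get_client(host="clickhouse.internal", username="readonly")

result = client.query("""
    SELECT
        attributes['gen_ai.request.model']                            AS model,
        quantile(0.95)(duration_ns) / 1e6                             AS p95_ms,
        sum(toUInt64OrZero(attributes['gen_ai.usage.output_tokens'])) AS tokens_out
    FROM otel_traces
    WHERE span_name = 'llm.completion'
      AND timestamp > now() - INTERVAL 1 DAY
    GROUP BY model
    ORDER BY p95_ms DESC
""")
for model, p95_ms, tokens_out in result.result_rows:
    print(f"{model}: p95 {p95_ms:.0f} ms, {tokens_out:,} output tokens")
```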
4) UI and workflows
- Grafana dashboards over metrics and logs, trace explorer on top of ClickHouse/Tempo. Keep UIs simple and focused on your SLOs, not a vendor catalog of 200 widgets.
- Session replay if you need it: self-host rrweb, link replays to trace IDs. Keep replays on a short TTL and never store PII in the DOM without masking.
Privacy and compliance: stop pretending prompts are “just logs”
If your agents touch PII, PHI, or source code, your observability layer is a regulated data store. Act accordingly.
- Redact by default. Hash prompts and responses. Keep only token counts, durations, and model metadata unless a debug flag is present. For flagged sessions, fork full payloads into a sealed bucket with 7-day TTL.
- Regional residency. Keep data in-region. If you operate in Brazil and the US, satisfy LGPD and state privacy laws by pinning hot and cold tiers to the correct regions and accounts.
- Access control. No shared logins. Use SSO + SCIM + per-project RBAC. Mask attributes in the UI. Every access to unredacted data should leave an audit span of its own.
- Data retention. 7 days hot for traces/logs, 30–90 days cold in object storage, 15 months for rollup metrics. You can justify this to legal and still run effective incidents.
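Retention should be enforced by policy, not by a cron job someone remembers to run. A minimal sketch with boto3: a 7-day expiration on the sealed unredacted bucket and a 90-day expiration on the cold trace/log bucket. Bucket names are placeholders for whatever your IaC provisions:

```python
import boto3

s3 = boto3.client("s3")

# Sealed bucket for unredacted debug payloads: hard 7-day TTL.
s3.put_bucket_lifecycle_configuration(
    Bucket="obs-sealed-unredacted",
    LifecycleConfiguration={"Rules": [{
        "ID": "sealed-7d-ttl",
        "Filter": {"Prefix": ""},
        "Status": "Enabled",
        "Expiration": {"Days": 7},
    }]},
)

# Cold tier for traces/logs: expire after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="obs-cold-traces-logs",
    LifecycleConfiguration={"Rules": [{
        "ID": "cold-90d-ttl",
        "Filter": {"Prefix": ""},
        "Status": "Enabled",
        "Expiration": {"Days": 90},
    }]},
)
```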
Performance: latency budgets for observability itself
Collectors and exporters run in-process or as sidecars; they must not jeopardize your p99s.
- Non-blocking OTLP exporters with backpressure and bounded queues. Drop on the floor beyond event budgets; never let observability stalls impact user requests.
- Tail-based sampling windows of 1–3 seconds capture slow traces deterministically without adding user-facing latency. This matters most for QUIC/WebRTC and SSE flows: the recent QUIC bug introduced by a Linux networking optimization is exactly the kind of tail regression that head sampling would never have surfaced.
- Shard by session ID in your tracing store to improve cache locality and keep queries snappy for live debugging.
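For the first bullet, the bounded, non-blocking setup is mostly configuration. A sketch with the Python OTel SDK; the queue and batch sizes are illustrative starting points, not tuned recommendations:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector.internal:4317", insecure=True),
    max_queue_size=4096,          # spans buffered in memory before dropping
    max_export_batch_size=512,
    schedule_delay_millis=1000,   # flush roughly once per second
    export_timeout_millis=5000,   # never stall a flush longer than this
))
```

When the queue fills, the SDK drops spans rather than blocking the request thread, which is exactly the failure mode you want.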
Decision framework: when to self-host vs. buy
Use triggers, not vibes.
You should self-host if any of these are true:
- Telemetry volume exceeds ~15 GB/day across traces/logs/metrics, or you expect to double within two quarters.
- PII/Sensitive prompts cannot leave your VPC or region, or legal requires audit trails on who viewed raw prompts.
- Agent debugging needs session-level correlation across mobile/desktop + backend + tools, and you cannot express that easily in your SaaS vendor's tooling.
- Cost volatility from SaaS has already forced you to turn down retention or sampling to the point engineers avoid the tool.
Stick with SaaS (for now) if all of these are true:
- Sub-5 GB/day telemetry with no sensitive payloads, and
- Low-cardinality metrics/logs, and
- No dedicated SRE/DevEx capacity for the next two quarters.
There is a middle path: dual-write to SaaS and self-hosted while you mature your pipelines, then flip the default route.
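If it helps to keep the decision honest in a planning doc, the triggers above collapse into a few lines. The thresholds mirror the bullets; the function itself is just a sketch, not a substitute for judgment:

```python
def observability_route(daily_gb: float, sensitive_payloads: bool,
                        cross_device_sessions: bool, has_sre_capacity: bool) -> str:
    # Self-host triggers: volume, data residency/PII, or cross-device session debugging.
    if daily_gb >= 15 or sensitive_payloads or cross_device_sessions:
        return "self-host" if has_sre_capacity else "dual-write while you build SRE capacity"
    # SaaS is fine for small, non-sensitive workloads.
    if daily_gb < 5 and not sensitive_payloads:
        return "saas"
    return "dual-write"

print(observability_route(daily_gb=25, sensitive_payloads=True,
                          cross_device_sessions=True, has_sre_capacity=True))
```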
Implementation: a 0–90 day execution plan
Days 0–30: Baseline and dual-write
- Adopt OpenTelemetry SDKs in your API gateway, agent orchestrator, and top two tools. Emit model spans with standardized attributes.
- Deploy a reference stack in your staging VPC: OpenTelemetry Collector → Loki + VictoriaMetrics + ClickHouse. Tools like Traceway show you can stand this up in minutes; production-hardening takes longer, but the basics are fast.
- Dual-export to your current SaaS and to your self-host stack. Validate parity on three golden signals and one known incident.
- Write the data contract for redaction and debug forks. Turn it into a CI check: new spans without a classification attribute fail builds.
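Dual-export is usually a Collector-level fan-out, but the SDK-level equivalent shows the shape of it: two exporters, one tracer provider. Endpoints and the API-key header below are placeholders for your vendor and your internal Collector:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Fan out every span to the incumbent SaaS and to the self-hosted Collector.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(
    endpoint="https://otlp.saas-vendor.example:4317",
    headers=(("x-api-key", "REDACTED"),),
)))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(
    endpoint="http://otel-collector.internal:4317",
    insecure=True,
)))
trace.set_tracer_provider(provider)
```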
Days 31–60: Production cutover for hot paths
- Enable tail sampling by service and error rate. Keep 100% of traces for sessions with errors or p99 > SLO, sample 1–5% otherwise.
- Enforce event budgets per session in the collector. Alerts fire when drops exceed thresholds.
- PII guardrails live: masking in SDKs, redaction in collectors, sealed-bucket fork with 7-day TTL. Audit spans on every raw-data access.
- Dashboards and runbooks for your top 3 SLOs: prompt latency, tool-call success, end-to-end session time. Train on two mock incidents and one live game day.
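The sampling policy itself belongs in the Collector's tail-sampling processor, but the rule from the first bullet is worth writing down explicitly. A sketch, with the SLO and keep rate as assumptions:

```python
import hashlib

SLO_P99_MS = 8_000          # assumed end-to-end session SLO
KEEP_RATE_HEALTHY = 0.02    # sample 2% of healthy sessions

def keep_session(session_id: str, had_error: bool, slowest_span_ms: float) -> bool:
    if had_error or slowest_span_ms > SLO_P99_MS:
        return True   # keep the whole failing/slow session intact
    # Deterministic hash so every span of a session gets the same decision.
    bucket = int(hashlib.sha1(session_id.encode()).hexdigest(), 16) % 10_000
    return bucket < KEEP_RATE_HEALTHY * 10_000
```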
Days 61–90: Optimize and de-risk
- Tune storage partitions and retention tiers. Push everything older than 7 days to object storage. Roll up metrics to 1–5 minute resolution past 7 days.
- Cost guardrails: monthly budget alerts on object storage growth and ClickHouse CPU. Keep merges and compactions from starving queries.
- Disaster recovery: snapshot catalogs daily, test restore to a scratch VPC. Document your RTO/RPO and prove it.
- Turn down SaaS ingestion for logs/traces in a controlled window. Keep metrics mirrored for one more sprint before the final cut.
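Tiering can also be pushed down into ClickHouse itself. A sketch of TTL-based tiering via clickhouse-connect; it assumes your storage policy defines an S3-backed 'cold_s3' volume, and the table and timestamp column are placeholders:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse.internal", username="admin")

# Move trace parts to the S3-backed volume after 7 days, delete after 90.
client.command("""
    ALTER TABLE otel_traces
    MODIFY TTL
        toDateTime(timestamp) + INTERVAL 7 DAY TO VOLUME 'cold_s3',
        toDateTime(timestamp) + INTERVAL 90 DAY DELETE
""")
```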
Tactics that pay for themselves fast
- Redact at source: do not rely on downstream processors to scrub secrets in prompts. Hash + token counts answer 90% of debugging questions.
- Session-scope sampling: keep entire failing sessions intact; partial traces are wasted disk.
- Structured logs only: JSON with a `trace_id` and `session_id` (see the sketch after this list). Unstructured logs get dropped in 90 days; structured logs fuel analytics in ClickHouse.
- Compress the right things: object storage with zstd or Parquet; metrics with delta-of-delta; floating-point-heavy streams benefit from specialized lossless compression researched for scientific data.
- On-call ergonomics: a single session view that links prompts → tool calls → UI events resolves incidents faster than three vendor tabs ever will.
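The structured-logs rule is cheap to enforce at the source. A minimal sketch of a JSON formatter that stamps every line with the active trace_id and a session_id passed via `extra`; the field names are our convention, not a standard:

```python
import json
import logging

from opentelemetry import trace

class JsonFormatter(logging.Formatter):
    # Emit one JSON object per line so logs join cleanly to traces and to
    # ClickHouse analytics without regex parsing.
    def format(self, record: logging.LogRecord) -> str:
        ctx = trace.get_current_span().get_span_context()
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": format(ctx.trace_id, "032x") if ctx.is_valid else None,
            "session_id": getattr(record, "session_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("agent")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("tool call finished", extra={"session_id": "s-123"})
```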
Trade-offs and real risks
- You are on the hook for SRE. Someone has to own patching, scaling, and backups. Kernel and dnsmasq CVEs keep coming; bake your collectors and stores into standard images with auto-updates in maintenance windows.
- You will over-collect without ruthless budgets. AI teams like visibility. Give them sampling dials and clear costs; otherwise, your “cheap” stack becomes expensive.
- ClickHouse is powerful and sharp. Bad schema and merges will eat CPU. Start with known-good layouts for spans and logs; do not improvise in prod.
Nearshore execution: how we see teams get this done
The heaviest lift is not software; it is policy and plumbing. Teams that succeed usually pair a small platform core with nearshore horsepower. In practice, a two-person platform team in the US plus a nearshore pod (Brazil is strong here) can stand up the stack, IaC, and data contracts in two sprints with 6–8 hours of overlap. After that, it is incremental tuning and a quarterly cost/retention review.
What “great” looks like by the end of Q1
- Costs: Hot tier at $1–2k/month, cold tier under $100/month per TB. No surprise SaaS invoices tied to traffic spikes.
- Privacy: Prompts hashed by default, sealed-bucket access audited, regional residency enforced by policy.
- Reliability: Trace queries under 1 second for the last 24 hours. Tail sampling keeps all slow/error sessions. Game days prove RTO < 2 hours, RPO < 24 hours.
- Dev velocity: Engineers use a single session view to resolve incidents. Product and data teams query ClickHouse for model performance without asking SREs for CSVs.
Bottom line
AI agents make observability both more valuable and more dangerous. If you are pushing tens of gigabytes a day and any of it is sensitive, the default of “ship it to a vendor and hope” no longer holds. Self-hosting with OpenTelemetry, ClickHouse/Loki/Tempo, and strict data contracts is not contrarian — it is responsible engineering. Start dual-writing this month. You will sleep better by next quarter and so will your legal team.
Key Takeaways
- Agent telemetry is tail-heavy, long-lived, and often contains PII; treat it like analytics, not just logs.
- The cost crossover for self-hosting arrives around 15+ GB/day of traces/logs/metrics or any PII constraints.
- Adopt OpenTelemetry with tail sampling, event budgets, and redaction-by-default data contracts.
- Use ClickHouse/Tempo for traces, Loki for logs, and VictoriaMetrics for metrics; cold-store everything else in object storage.
- Execute a 0–90 day plan: dual-write, cut over hot paths, enforce guardrails, test DR, then turn down SaaS.
- Expect trade-offs: you own SRE and cost discipline; the payoff is lower spend, lower risk, and better incident response.