You don’t need Kubernetes. Not yet. If your product is sub–Series B, your traffic is measured in tens of millions of requests per month (not billions), and your team is fewer than a dozen engineers, Docker Compose in production can be the right call—if you treat it like production, not an extended dev environment.
The debate flared again this month after yet another round of posts asking “Should I run plain Docker Compose in production in 2026?” and a week of security headlines: a cPanel authentication bypass actively exploited, Ubuntu properties getting pounded by DDoS, and kernel disclosures with no advance heads-up to distributions. Translation: your operations surface will be tested whether you run k8s or Compose. The question is which complexity you want to pay for today.
The decision you actually need to make
You’re not choosing between “toy Compose” and “real Kubernetes.” You’re choosing between:
- One-to-few hosts, simple networking, predictable workloads, 99.9% SLO — Compose + a hardened Linux host + a reverse proxy + disciplined deploys.
- Many hosts, spiky or specialized workloads (GPU/ML/batch), multi-tenant isolation, 99.99%+ — a scheduler (Kubernetes, ECS, or Nomad) with the tooling and team maturity to match.
Everything else is detail. Here’s the framework we use with US startups and scale-ups (and that our Brazil-based SRE teams operate daily).
A CTO’s 8-axis rubric for Compose viability
1) Team size and on-call maturity
- Compose-friendly: 2–6 engineers, 1–2 SREs or strong generalists, a single on-call rotation, and a clear “one person can understand the whole stack” bar.
- Scheduler territory: 3+ squads, separate platform team, >30 services, or a mandate for self-service infra by feature teams.
2) Workload shape
- Compose-friendly: Mostly long-running web APIs, a couple of workers, cron-like jobs, one database, one cache.
- Scheduler territory: Spiky batch jobs, GPU/accelerators, per-tenant sandboxes, or hundreds of ephemeral tasks/day.
3) SLOs and downtime budget
- Compose-friendly: 99.9% availability (≈43 minutes/month). Blue/green or staggered restarts keep within budget.
- Scheduler territory: 99.99% (≈4.3 minutes/month) or higher, multi-AZ HA, automated rescheduling across nodes.
4) Release cadence and rollback
- Compose-friendly: Daily releases with blue/green or canary-by-header in the edge proxy, easy rollback by swapping a compose project name.
- Scheduler territory: Multiple deploys/hour across dozens of services with auto canarying and traffic shaping.
5) Networking and service discovery
- Compose-friendly: A small number of internal networks, Traefik/Caddy in front, DNS-based discovery across 1–3 hosts.
- Scheduler territory: Complex mesh, network policies, mTLS between services, per-tenant virtual networks.
6) Stateful workloads
- Compose-friendly: One or two stateful services (Postgres, Redis) on dedicated VMs with managed backups and PITR.
- Scheduler territory: Dozens of stateful sets, dynamic volumes across nodes, or strict multi-AZ HA for databases.
7) Compliance and isolation
- Compose-friendly: SOC2 with single-tenant infra per environment, straightforward secrets handling, audit from the host.
- Scheduler territory: HIPAA/PCI with network policies, pod security levels, and fine-grained runtime constraints.
8) Cost and focus
- Compose-friendly: You want to ship features, not operate a platform. Expect 2–4 engineer-hours/month of infra maintenance.
- Scheduler territory: You already pay for a platform team or vendor; infra is a competitive moat.
A production-grade Compose blueprint (2026)
If your answers skew “Compose-friendly,” run it like you mean it. Here’s a blueprint we harden for teams running 10–150 rps average (peaks 5–10x) with 99.9% SLO.
Hosts and OS
- Start with two production VMs (compute-optimized, e.g., c7a.large or equivalent) behind a managed load balancer. Put Postgres on its own VM: it isolates the database, keeps recovery simple, and avoids noisy neighbors.
- Enable unattended security updates and schedule a weekly patch window. With kernel vulnerabilities shipping without advance notice, assume you’ll need to reboot at least monthly. Design rolling restarts (see below).
- Harden Docker: user namespaces, seccomp default, no-new-privileges, drop NET_RAW, set memory/cpu limits for every container. Consider rootless Docker or Podman if your team is comfortable with the trade-offs.
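As a rough illustration, those flags map onto a Compose service like this (service name, image, and limits are placeholders; user-namespace remapping is a daemon-level setting, not per service):

```yaml
# Placeholder service: adjust image, paths, and limits to your stack.
# User-namespace remapping lives in /etc/docker/daemon.json ("userns-remap"),
# and Docker's default seccomp profile applies unless you override it.
services:
  api:
    image: registry.example.com/api@sha256:<digest>   # pin by digest
    read_only: true
    tmpfs:
      - /tmp                      # the only writable path
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - NET_RAW
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 512M
    restart: unless-stopped
```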
Networking and ingress
- Run Traefik (or Caddy) as the edge proxy. Auto TLS via Let’s Encrypt, sticky sessions if you need them, and per-service canarying by header or cookie.
- Use compose networks to isolate groups of services. Only the proxy is published to the internet. Everything else is internal-only.
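A minimal sketch of that topology; hostnames, the ACME email, and the image digest are placeholders, and only Traefik publishes ports:

```yaml
# Only the proxy joins the public side; app services are reachable
# solely through it.
services:
  traefik:
    image: traefik:v3.1
    ports:
      - "443:443"
    command:
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false
      - --entrypoints.websecure.address=:443
      - --certificatesresolvers.le.acme.tlschallenge=true
      - --certificatesresolvers.le.acme.email=ops@example.com
      - --certificatesresolvers.le.acme.storage=/letsencrypt/acme.json
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro   # consider a socket proxy to shrink this surface
      - letsencrypt:/letsencrypt
    networks: [edge, internal]

  api:
    image: registry.example.com/api@sha256:<digest>
    labels:
      - traefik.enable=true
      - traefik.http.routers.api.rule=Host(`api.example.com`)
      - traefik.http.routers.api.entrypoints=websecure
      - traefik.http.routers.api.tls.certresolver=le
    networks: [internal]   # never published to the host

networks:
  edge:
  internal:
    internal: true   # no routed egress; add a second network for services that need outbound

volumes:
  letsencrypt:
```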
Deploys and zero-downtime
- Blue/green by project name: run two stacks side by side on the same host, attached to the proxy’s shared network (e.g., projects “blue” and “green”). On deploy, start the new stack, pass health checks, then switch Traefik to the new labels and stop the old one. Rollback is a label flip.
- Health checks matter: use start_period and interval/timeout/retries to encode readiness (see the sketch after this list). Don’t pretend liveness is readiness.
- Automate deploys with a CI job that does: build → push → ssh → docker compose pull → bring up the new color → run smoke tests → flip labels.
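For the health-check bullet, a hedged example of encoding readiness in Compose; the /ready endpoint, port, and presence of wget in the image are assumptions:

```yaml
# Started per color, e.g.: docker compose -p green -f compose.prod.yaml up -d
# Flip Traefik only after the new color reports healthy.
services:
  api:
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:8080/ready"]  # assumes wget and a /ready endpoint in your image
      start_period: 30s   # grace period before failed probes count against retries
      interval: 10s
      timeout: 3s
      retries: 3
```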
Secrets and config
- Prefer a managed secret store (AWS SSM, GCP Secret Manager, 1Password Connect). At deploy time, render env files and mount them as read-only into containers.
- If you must keep .env files on the host, store them encrypted-at-rest with sops and decrypt only in memory during deploy.
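With sops, the decrypt-in-memory step can be a one-liner; this sketch assumes sops is already wired to your KMS or age keys and that prod.enc.env holds the encrypted variables:

```bash
# Decrypts only in memory for the lifetime of the command; no plaintext
# lands on disk. File name and project are illustrative.
sops exec-env prod.enc.env \
  'docker compose -p green -f compose.prod.yaml up -d'
```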
Observability
- Metrics: node_exporter + cAdvisor → Prometheus → Grafana dashboards (starter fragment after this list). Alert on saturation (CPU steal, memory pressure), error rates, and 95th-percentile latency.
- Logs: Docker → Loki via Promtail or to your cloud provider’s log service. Keep 14–30 days hot; archive older to object storage.
- Tracing: OpenTelemetry to a managed backend (Tempo, Honeycomb, or Datadog). Sample at 5–10% to keep costs sane.
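A starting-point fragment for the metrics side; image tags are examples, and a prometheus.yml with scrape jobs for both exporters is assumed to live next to the file:

```yaml
# Host metrics (node-exporter), container metrics (cAdvisor), scraped
# by a local Prometheus. Trim mounts to what your distro actually needs.
services:
  node-exporter:
    image: prom/node-exporter:v1.8.2
    pid: host
    volumes:
      - /:/host:ro,rslave
    command: ["--path.rootfs=/host"]

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro

  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prom-data:/prometheus

volumes:
  prom-data:
```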
Backups and disaster recovery
- Postgres: pgBackRest or a managed Postgres with PITR. Aim for RPO ≤ 5 minutes (WAL shipping) and RTO ≤ 60 minutes.
- Assets/state: Prefer object storage (S3/compatible). For small volumes, back up with restic nightly (example after this list). Test restores monthly.
- Snapshot the VMs nightly. Keep 14 days of snapshots; script a rebuild to new hosts.
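The restic job itself can stay boring; repository, paths, and retention below are illustrative:

```bash
# Run nightly via cron or a systemd timer.
export RESTIC_REPOSITORY=s3:s3.amazonaws.com/acme-backups
export RESTIC_PASSWORD_FILE=/root/.restic-pass
restic backup /srv/app/volumes --tag nightly
restic forget --keep-daily 14 --keep-weekly 8 --prune   # enforce retention
```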
Security posture
- Firewall: default deny; only the load balancer and SSH open (sketch after this list). Consider Cloudflare or Fastly in front for DDoS absorption. The Ubuntu DDoS this month is yet another reminder that you don’t want your origin exposed.
- SBOM and image scanning: build minimal images (distroless or slim), scan in CI, pin digests in compose, and rebuild on refreshed base images biweekly.
- Runtime: enable read-only root filesystems where possible; mount volumes with the minimum needed permissions; drop Linux capabilities.
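A default-deny sketch with ufw, assuming your load balancer and bastion live in private subnets. One genuine caveat: Docker writes iptables rules for published ports itself and bypasses ufw, which is another reason to publish ports only on the proxy:

```bash
# Subnets are illustrative; match them to your VPC.
ufw default deny incoming
ufw default allow outgoing
ufw allow from 10.0.0.0/16 to any port 443 proto tcp   # managed LB subnet only
ufw allow from 10.0.1.0/24 to any port 22 proto tcp    # bastion/VPN subnet, never 0.0.0.0/0
ufw --force enable
```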
Where Compose breaks down (and what to do instead)
Be explicit about the edge of the envelope. If any of these show up on your roadmap, start planning a move:
- Multi-host scheduling and auto-recovery: If you want containers to reschedule automatically on node failure without orchestrating blue/green yourself, look at ECS on Fargate (least ops) or Kubernetes (most control).
- Horizontal autoscaling: You can script scale-up/down in Compose (see the sketch after this list), but if traffic is highly spiky or cost-sensitive, native HPA/ECS Service auto-scaling wins.
- Network policies and mTLS-by-default: You can fake it with iptables and sidecars, but it’s brittle. If you need fine-grained network isolation, move.
- Dozens of teams and services: Self-service infra needs guardrails, quotas, and templated resources. That’s a platform problem, not a compose.yaml problem.
- Heavy GPU/ML scheduling or batch job orchestration: Use a scheduler designed for accelerators and ephemeral jobs.
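On the autoscaling point above: scaling by hand in Compose is one real flag, which is exactly the limitation; someone, or a cron job, has to run it:

```bash
# Manual "autoscaling" (project and service names are illustrative).
docker compose -p blue up -d --scale worker=6 --no-recreate
```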
Costs: the part most teams misprice
At small scale, Kubernetes’ tax is primarily people time. Even with managed control planes, expect:
- Managed k8s baseline: control plane fees ($70–$150/mo), NAT gateways ($30–$70/mo each), plus the overhead of node groups. In practice, most teams we see spend $800–$2,000/month before app load.
- Engineer time: 8–20 hours/month on cluster hygiene, upgrades, and the “it’s always DNS” class of issues. If your fully loaded engineer rate is $150/hour, that’s $1,200–$3,000/month.
- Compose baseline: two mid-size VMs at $80–$120 each, a managed database $150–$400, a load balancer $20–$30. Call it $350–$700/month infra and 2–4 hours/month of care if you keep the stack small and disciplined.
These aren’t theoretical. We run production Compose for startups processing 10–30M requests/month at 99.9% SLO for 20–40% lower infra cost and 50–70% lower ops time than their previous “starter k8s.” The trade-off: capacity planning is on you, and the failure handling a scheduler automates (rescheduling, failover) must be scripted and tested yourself.
Concrete deployment pattern (that actually works)
- Repo layout: app code + infra/compose with per-env overrides (compose.prod.yaml, compose.staging.yaml). Keep images versioned and pinned by digest.
- Build: multi-stage Dockerfiles, distroless/slim bases, SBOM generation. Push to a private registry.
- Release candidate: CI job tags image with git SHA and writes a release manifest with image digests.
- Deploy: CI connects via SSH to the target host(s), pulls the release manifest, starts the new color with docker compose up -d, waits for health, flips Traefik labels (sketched after this list).
- Smoke test: run a short e2e suite against the canary route (header-based); if clean, promote to 100%.
- Post-deploy: verify dashboards, error budgets, and roll back by flipping labels if needed.
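Stitched together, the pipeline above fits in a short script. This is a sketch under this post’s assumptions (two colors, Traefik flip, one host); smoke-test.sh and flip-traefik.sh are hypothetical helpers you would own:

```bash
#!/usr/bin/env bash
# Minimal blue/green deploy driver; host, paths, and helpers are illustrative.
set -euo pipefail

HOST="deploy@prod-1.example.com"
NEW_COLOR="${1:?usage: deploy.sh <blue|green>}"
OLD_COLOR=$([ "$NEW_COLOR" = blue ] && echo green || echo blue)

# Pull and start the new color; --wait blocks until healthchecks pass.
ssh "$HOST" bash -s <<EOF
set -euo pipefail
cd /srv/app
docker compose -p "$NEW_COLOR" -f compose.prod.yaml pull
docker compose -p "$NEW_COLOR" -f compose.prod.yaml up --wait
EOF

# Header-routed canary checks against the new color (hypothetical helper).
./smoke-test.sh --target "$NEW_COLOR"

# Flip traffic, then stop the old color. Rollback = flip back.
ssh "$HOST" bash -s <<EOF
set -euo pipefail
cd /srv/app
./flip-traefik.sh "$NEW_COLOR"   # swaps router labels to the new color
docker compose -p "$OLD_COLOR" -f compose.prod.yaml stop
EOF
```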
Skip “watchtower auto-updates.” That’s trading repeatability for surprise. You want intentional deploys, even on Compose.
Operating practice: patching and incident response
The last few weeks were a reminder: attackers love the long tail. The cPanel auth bypass and the Ubuntu DDoS did not care whether teams ran k8s or Compose; they punished slow patching and exposed origins. Treat patching as a first-class workflow:
- Weekly patch window: schedule and announce it. In blue/green mode, patch the idle color, flip, then patch the other.
- Kernel realities: with “no heads-up” kernel CVEs, assume reboots. Design for them. Your blue/green process should make a reboot a non-event.
- Front a CDN: absorb volumetric DDoS and rate-limit bad actors. Terminate TLS at the edge and keep origins private.
When to plan the migration (before it hurts)
Don’t wait for pain to force you into a 90-day platform rewrite. Start a migration radar when any of these become true:
- Two or more teams ask for self-serve environments.
- You need autoscaling faster than your CI/CD can roll a new color.
- Security asks for network policies you can’t comfortably enforce at the host.
- Your service count crosses ~30 and your compose files feel like a Rube Goldberg machine.
At that point, pick a scheduler based on your cloud and talent pool: ECS on Fargate for the least ops (especially on AWS-centric stacks), Kubernetes if you need portability or deep ecosystem features, or Nomad if you value simplicity and can live without k8s ubiquity. Budget the migration like a feature: 6–10 weeks for an experienced team, longer if you’re also redesigning networking and CI.
A realistic example
A Series A fintech-ish SaaS: 6 microservices (Go + Node), a worker, a scheduled job, Postgres, Redis, 10M requests/month (avg 4 rps, peak 50–100 rps), 2–3 engineers on-call. Targets: 99.9% SLO, SOC2, US/EU customers.
- Infra: two c7a.large app VMs + managed Postgres + Redis on a small VM. Traefik on each app VM; cloud load balancer distributes to both.
- Deploys: blue/green per host, label flip via Traefik, 1–2 deploys/day.
- Security: SBOM in CI, image digest pinning, secrets from SSM, read-only roots, no-new-privileges, NET_RAW dropped. CDN in front, origins private.
- Ops: about 2 hours/week on patching, dashboard checks, and a monthly restore test. Observability to Grafana + Loki + Tempo.
- Results: 99.93% availability over a quarter; one incident (15 minutes) traced to an unbounded worker queue, fixed by setting Compose memory limits and adding queue backpressure.
Where nearshore fits
If you choose Compose, you’re choosing a smaller ops surface—good. But patch windows and midnight incidents still happen. A Brazil-based SRE crew with 6–8 hours of US overlap can run the weekly patch cycle, manage blue/green flips, and handle after-hours incidents while your core team sleeps. In our experience, a 0.5–1.0 FTE nearshore SRE engagement is enough to keep a production Compose stack compliant, current, and boring—exactly what you want.
Bottom line
Docker Compose in production is not a guilty secret; it’s a deliberate trade. If you’re under 30 services, at 99.9% SLO, and don’t need per-tenant isolation or multi-AZ rescheduling, you can move faster and spend less with Compose—provided you harden the host, automate blue/green, front a CDN, and treat patching as product work. When your roadmap demands a scheduler, you’ll know. Until then, keep your stack simple and your error budgets green.
Key Takeaways
- Compose is production-viable in 2026 for small teams, predictable workloads, and 99.9% SLOs—if you run it like production.
- Use a hardened Linux host, Traefik/Caddy, blue/green by project name, digest-pinned images, and real health checks.
- Expect 20–40% lower infra cost and 50–70% less ops time than “starter k8s” at small scale; spend regained time on features.
- Security is table stakes: default-deny firewall, CDN in front, read-only roots, dropped capabilities, weekly patch windows, and tested backups.
- Plan a migration when you need autoscaling, fine-grained network policies, multi-tenant isolation, or you cross ~30 services.
- Nearshore SRE (Brazil) covers patching and incidents with 6–8 hours overlap, keeping your Compose stack boring and compliant.