2026-07-04 · 11 min read

Stop Betting the Company on CUDA: A 2026 CTO Playbook for AI Portability

By Diogo Hudson Dias

[Image: Engineers in a São Paulo data center reviewing AI inference benchmarks in front of mixed GPU servers]

If your AI roadmap assumes NVIDIA forever, you’re one supplier meeting away from a board-level incident. Performance per dollar is improving almost monthly, availability is spiky, and even model vendors are exploring their own chips: Anthropic is reportedly discussing a custom accelerator with Samsung. The point isn’t who “wins.” The point is that you can’t afford to care. Your stack must run wherever the cheap, available silicon is next quarter.

The hardware market just de-commoditized — again

Two years ago, “GPU” meant CUDA, and CUDA meant NVIDIA. In 2026, you have choices — and that’s exactly why you need a portability plan.

NVIDIA H100: Gold standard for throughput and ecosystem maturity. 80–94 GB HBM, mature kernels and tooling (FlashAttention, Triton), wide cloud availability. On-demand 8x boxes typically price in the high two digits to low three digits USD/hour depending on region and commitment.
AMD MI300X: 192 GB HBM per GPU is the headline. That memory headroom means bigger batch sizes, longer contexts, and fewer tensor sharding games. ROCm has matured enough that mainstream inference stacks are viable, not science projects.
Intel Gaudi (Gaudi2/3): Surprisingly competitive price/perf in some clouds, strong BF16 throughput, an alternative interconnect story, and a vendor motivated to discount.
Google TPU (v5e/v5p): Still a good story for model training at scale and for inference if you can live inside GCP and target XLA well.
Edge and on-device: Apple Silicon (Metal), Vulkan-class GPUs, and CPUs with AVX-512 or AMX. Not your training workhorse — but critical for private, low-latency inference where data can’t leave the device.

Memory, not FLOPS, is usually the gating factor in inference. MI300X’s 192 GB can keep multi-tenant serving honest by avoiding Frankenstein sharding. H100’s kernel maturity wins on raw tokens/sec in many cases. TPUs can look great until you hit an unsupported op. Your costs swing with batch size, context length, and the presence (or absence) of a good fused attention kernel for your exact model. This is why portability beats vendor partisanship.

What “portable” actually means for a CTO

Most teams think portability is “we can export ONNX.” That’s step one, not the destination. Real portability spans six layers. Audit each.

Weights format: safetensors for safety and speed; ONNX or OpenXLA compilation paths for broad targets; GGUF for CPU/edge via llama.cpp.
Graph IR: ONNX or StableHLO/XLA let you retarget without living in PyTorch eager forever. TVM/IREE can compile to Vulkan/Metal/ROCm/CUDA for non-NVIDIA paths.
Runtime: vLLM, Text Generation Inference (TGI), llama.cpp, or MLC LLM. Choose at least two with different backend dependencies.
Kernels: FlashAttention, Triton, fused MLPs. These are often CUDA-first. You need ROCm or HIP equivalents, or an IR compiler that fuses ops well enough without bespoke CUDA.
Quantization and memory layout: FP16/BF16 as the common denominator; INT8/INT4 (AWQ, GPTQ, SmoothQuant) boost throughput but narrow portability unless you pick methods supported across backends.
Serving layer and scheduler: Continuous batching and paged attention (e.g., in vLLM) drive cost down but can be backend-sensitive. Your serving layer must expose the same API across hardware.

A decision framework for 2026: pick your portability lane

Lane 1: Dual-source inference (CUDA + ROCm) for 80% of use cases

If you run 7B–70B class models for chat, retrieval-augmented generation, and code, this is the pragmatic default. Keep NVIDIA first for kernel maturity, and keep AMD as a fully validated, production-capable second source.

Runtime: vLLM as the primary serving stack. It supports CUDA and increasingly ROCm, delivers strong continuous batching, and integrates cleanly with tokenizers and OpenAI-style APIs.
Fallback: TGI or llama.cpp for specific models or CPU/Metal fallbacks.
Quantization: Maintain two blessed variants per model: BF16/FP16 and INT8. Treat INT4 as opportunistic until your non-CUDA backend proves stable.
Infra: Keep two golden container images: one with CUDA and vendor libs, one with ROCm. Lock driver-toolkit versions and publish SBOMs.

When to choose: You want to buy what’s on the spot market next quarter, not what your roadmap assumed last year. You need to shave 20–40% inference cost without retraining models or rewriting business logic.

Lane 2: IR-first compilation (ONNX/OpenXLA + IREE/TVM) for edge and exotic accelerators

If you ship mobile or embedded inference, or you want optionality across TPU and future bespoke silicon, invest in IR-first. You’ll trade some absolute throughput for the freedom to retarget quickly.

Compiler stack: Export to ONNX or StableHLO, then compile with IREE or TVM to Vulkan, Metal, CUDA, or ROCm.
Runtime: MLC LLM for a batteries-included path to phones and desktops; it rides TVM to multiple backends.
APIs: Freeze your public API at the serving layer. Keep the compiler choices invisible to product teams.

When to choose: You must run on-device for privacy or latency, or you foresee non-GPU accelerators in your procurement funnel. You want a single code path for Vulkan/Metal/ROCm/CUDA.

Lane 3: CPU-first safety net for SLOs and DR

There will be days you cannot get a GPU. A robust CPU path keeps your P99s and SLAs honest during GPU supply shocks, upgrades, or security incidents.

Runtime: llama.cpp or MLC LLM targeting CPU, with GGUF weights. Target AVX-512 or AMX where available.
Scope: Smaller models (3B–7B) with task-specific fine-tunes for “good enough” quality when GPUs are unavailable.
Use: Canary and emergency capacity, plus private on-device inference for regulated segments.

The only two numbers that matter: cost per 1K tokens and first-token latency

Ignore generic “TFLOPS” comparisons. Your buyers will feel two things: how long until the first token appears, and what each thousand tokens costs you to deliver at your SLOs.

Cost per 1K tokens: Approximate as (instance $/hour) divided by (tokens/second × 3600), multiplied by 1,000, for a fixed context/batch depth. Continuous batching skews this in your favor; tiny, interactive prompts do not.
First-token latency (FTL): Driven by scheduler, model size, fused kernels, and memory movement. A CUDA stack with mature FlashAttention may beat a larger-memory ROCm stack at FTL, even if total throughput is comparable.
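To make the cost arithmetic concrete, here is a minimal sketch of the calculation: hourly instance price divided by tokens produced per hour, scaled to 1,000 tokens. The function name and the example inputs ($40/hour, 20,000 tokens/second in aggregate) are hypothetical placeholders, not benchmark results; plug in your own measured throughput and actual pricing.

```python
def cost_per_1k_tokens(instance_usd_per_hour: float, tokens_per_second: float) -> float:
    """Approximate serving cost in USD per 1,000 tokens at steady state."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    # Dollars per token, scaled up to a 1,000-token unit.
    return instance_usd_per_hour / tokens_per_hour * 1000


# Hypothetical inputs for illustration only: a $40/hour box that sustains
# 20,000 tokens/second across all concurrent requests.
example = cost_per_1k_tokens(40.0, 20_000.0)
print(f"${example:.5f} per 1K tokens")
```

The number only means something at a fixed context length and batch depth, so record those alongside it whenever you compare backends.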

Here’s the punchline: with the right batching and kernels, internal cost per 1K tokens for a 7B model can drop under a cent on either H100 or MI300X. Cloud APIs still charge a premium per 1K tokens for simplicity and availability. Your portability plan is the lever that lets you capture that delta whenever another vendor is cheaper this quarter.

Audit your stack for CUDA assumptions in one afternoon

Before you talk about multi-vendor, prove you’re not CUDA-hardcoded. Assign one senior to run this checklist and report back by EOD.

Code grep: Search for explicit CUDA-only imports (torch.cuda, cupy, triton kernels), hardcoded device strings, and nvidia/cuda base images in Dockerfiles.
Wheel lock: List pinned wheels. If you’ve pinned torch and xformers to CUDA builds, note the ROCm equivalents and whether your Python resolver can swap them cleanly.
Custom ops: Inventory Triton or CUDA extensions. For each, identify ROCm/HIP parity or an IR-compiler alternative.
Kernels: Confirm your attention kernel story on both sides. FlashAttention-like parity exists on ROCm now, but not every model variant is covered.
Serving: Can your serving binary run on ROCm today? If not, what’s missing — kernel, driver, or packaging?
CI: Do you have even one ROCm job? If the answer is “no,” you’ve answered your portability question.
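The code-grep step of the checklist above can be roughed out as a small script. This is a sketch under assumptions: the pattern list is illustrative rather than exhaustive, the function names are made up for this example, and a hit is a prompt for human review, not proof of lock-in (a Triton import, for instance, may or may not port cleanly).

```python
import re
from pathlib import Path

# Illustrative patterns only; extend them for your own codebase.
CUDA_PATTERNS = {
    "torch.cuda call": re.compile(r"\btorch\.cuda\b"),
    "cupy import": re.compile(r"^[ \t]*(import|from)[ \t]+cupy\b", re.MULTILINE),
    "triton import": re.compile(r"^[ \t]*(import|from)[ \t]+triton\b", re.MULTILINE),
    "hardcoded device string": re.compile(r"""["']cuda(:\d+)?["']"""),
    "nvidia/cuda base image": re.compile(
        r"^[ \t]*FROM[ \t]+nvidia/cuda\b", re.MULTILINE | re.IGNORECASE
    ),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return sorted (line_number, finding_label) pairs for one file's contents."""
    findings = []
    for label, pattern in CUDA_PATTERNS.items():
        for match in pattern.finditer(text):
            line_no = text.count("\n", 0, match.start()) + 1
            findings.append((line_no, label))
    return sorted(findings)


def scan_repo(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan Python sources and Dockerfiles under root; keep only files with findings."""
    report = {}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if path.suffix != ".py" and not path.name.startswith("Dockerfile"):
            continue
        findings = scan_text(path.read_text(encoding="utf-8", errors="ignore"))
        if findings:
            report[str(path)] = findings
    return report
```

Calling `scan_repo(".")` from the repository root gives a per-file list that can go straight into the gap report.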

Build a portability harness — once

Portability is not a slide; it’s a test suite. Treat it like release engineering.

Standard model bundle: Zip weights (safetensors), tokenizer, prompt templates, and a manifest recording quantization and expected outputs for golden prompts.
Golden prompts: 100–200 prompts that exercise your real workloads: short chats, 8–32K contexts, RAG with long contexts, code completion, streaming.
Metrics: For each backend, record tokens/sec steady-state, first-token latency, VRAM, and vCPU utilization. Derive $/1K tokens using your actual instance pricing.
Containers: Publish artifacted images per backend: CUDA, ROCm, CPU/Metal. Pin driver-toolkit combos. Include SBOMs and known-CVE scans.
Gate on numbers: A PR fails if any backend regresses >10% on cost or FTL without a waiver. Bake these gates into your CD.
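The gate itself is only a few lines. A minimal sketch, assuming each backend's harness run produces one record with cost and first-token latency; the names `BackendMetrics` and `find_regressions` and the `"backend:metric"` waiver format are hypothetical choices for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendMetrics:
    usd_per_1k_tokens: float
    first_token_latency_ms: float


def find_regressions(
    baseline: dict[str, BackendMetrics],
    candidate: dict[str, BackendMetrics],
    threshold: float = 0.10,
    waived: frozenset[str] = frozenset(),
) -> list[str]:
    """List every backend metric that got worse by more than `threshold`.

    `waived` holds "backend:metric" keys that a reviewer has explicitly excused.
    """
    failures = []
    for backend, base in sorted(baseline.items()):
        cand = candidate.get(backend)
        if cand is None:
            failures.append(f"{backend}: no candidate metrics recorded")
            continue
        for metric in ("usd_per_1k_tokens", "first_token_latency_ms"):
            old, new = getattr(base, metric), getattr(cand, metric)
            if old <= 0:
                raise ValueError(f"{backend}: baseline {metric} must be positive")
            change = (new - old) / old  # higher is worse for both metrics
            if change > threshold and f"{backend}:{metric}" not in waived:
                failures.append(f"{backend}: {metric} regressed {change:.0%}")
    return failures
```

A CI job would fail the PR whenever the returned list is non-empty and print it as the failure message.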

Serving stacks that survive across hardware

Pick two, not one.

vLLM: High-throughput batching, multi-backend trajectory (CUDA first, ROCm increasingly viable), and an OpenAI-style API. Excellent default for server inference.
TGI: Solid CUDA path, reasonable ROCm story for many models, strong Hugging Face integration. Good second source.
llama.cpp: The Swiss Army knife. CPU, Metal, Vulkan backends with GGUF. Lower throughput, but invaluable as a fallback and for edge.
MLC LLM: Best way to ship to mobile and diverse desktop GPUs via TVM. Useful when your product roadmap requires on-device.

Whatever you pick, stabilize your public API and auth in front of them. Your app shouldn’t know or care what backend is live this week.
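One way to picture that stable front door is a thin gateway that tries backends in priority order and hides which one answered. This is a sketch under assumptions: `Backend`, `InferenceGateway`, and `EchoBackend` are hypothetical names, and a real adapter would call your vLLM, TGI, or llama.cpp deployment instead of echoing the prompt.

```python
from typing import Protocol


class Backend(Protocol):
    name: str

    def is_healthy(self) -> bool: ...

    def generate(self, prompt: str, max_tokens: int) -> str: ...


class BackendUnavailable(RuntimeError):
    """Raised when no configured backend can serve the request."""


class InferenceGateway:
    """Single entry point for the app; backends are tried in priority order."""

    def __init__(self, backends: list[Backend]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = backends

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        for backend in self._backends:
            if backend.is_healthy():
                return backend.generate(prompt, max_tokens)
        raise BackendUnavailable("no healthy backend")


class EchoBackend:
    """Stand-in backend for illustration; tags its output with its own name."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy

    def is_healthy(self) -> bool:
        return self.healthy

    def generate(self, prompt: str, max_tokens: int) -> str:
        return f"[{self.name}] {prompt}"


gateway = InferenceGateway(
    [EchoBackend("vllm-cuda", healthy=False), EchoBackend("vllm-rocm"), EchoBackend("llamacpp-cpu")]
)
print(gateway.generate("hello"))  # served by the first healthy backend
```

Because callers only see `InferenceGateway.generate`, swapping the priority order or adding a CPU fallback is a config change, not an application change.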

Quantization without surprises

Quantization gives you perf, and it’s also a portability minefield.

Standardize on BF16/FP16 for baseline and INT8 for throughput. Treat INT4 as an optimization tier with explicit acceptance tests for accuracy and hallucination rates.
Pick methods that travel: SmoothQuant and AWQ are broadly supported; GPTQ variants can be backend-fragile.
Test long contexts: Quantization interacts badly with long-context attention in some kernels. Include 32K+ context tests in your harness if your product supports it.
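The tiering policy above can be written down as a small selection rule per backend. A rough sketch, assuming your harness emits a pass/fail acceptance result for each quantization tier; the tier labels and the `pick_tier` function are hypothetical, and INT4 stays off unless someone opts in.

```python
# Tiers ordered from most to least aggressive. "bf16" stands for the
# BF16/FP16 baseline that every backend is expected to pass.
TIER_ORDER = ("int4", "int8", "bf16")


def pick_tier(acceptance: dict[str, bool], allow_int4: bool = False) -> str:
    """Pick the most aggressive tier that passed acceptance on one backend.

    `acceptance` maps tier name to whether that tier passed the harness
    (accuracy, long-context, and any other checks you require).
    """
    for tier in TIER_ORDER:
        if tier == "int4" and not allow_int4:
            continue  # INT4 is an opt-in optimization tier
        if acceptance.get(tier, False):
            return tier
    raise RuntimeError("no tier passed acceptance, including the baseline")


# Hypothetical results for one backend: INT8 failed the long-context tests.
print(pick_tier({"bf16": True, "int8": False, "int4": True}))
```

Raising when even the baseline fails is deliberate: that situation points at a broken backend or bundle, not at a quantization choice.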

Training and fine-tuning: be honest about scope

Full pretraining portability is a research budget. For startups and scale-ups, scope to inference and parameter-efficient tuning.

LoRA/QLoRA fine-tunes on both CUDA and ROCm are viable in 2026. Plan to run them where capacity is available, but keep your serving portable first.
Gaudi and TPU can be excellent training plays. Just don’t let their training win force you to serve exclusively on them if your unit economics want H100 or MI300X later.

Security and operability don’t get a pass

Multi-backend means multi-driver and multi-runtime exposure. Tighten hygiene.

Drivers: Use vendor-maintained driver containers or kernel modules with known-CVE tracking. Don’t install drivers ad hoc on hosts.
Repro: Hermetic builds with pinned compilers (Triton, TVM) and verified wheels. Generate SBOMs and attestations.
Isolation: For multi-tenant nodes, use MIG on NVIDIA and cgroup isolation everywhere. Don’t let kernel panics in one runtime take down another.
Observability: Export hardware counters (HBM bandwidth, SM occupancy), not just tokens/sec. Alert on first-token latency regressions, not just P95 throughput.

Who runs this? Build a two-pizza “compiler team”

If you leave all AI work to your feature teams, portability dies from a thousand TODOs. Staff 2–3 engineers whose job is to make models run anywhere.

Skills: PyTorch 2.x graph capture, Triton, FlashAttention, ROCm/HIP basics, ONNX/XLA, and one IR compiler (TVM or IREE).
Mandate: Own the portability harness, golden model bundles, container images, and perf gates. Own vendor benchmarks and procurement inputs.
Cadence: Quarterly “futures day” to test new SKUs and clouds, with a written buy/no-buy recommendation based on $/1K tokens and FTL.

Nearshore angle: use geography to your advantage

If you’re US-based, a Brazilian nearshore pod gives you 6–8 hours of overlap and access to alternative capacity. In 2026, MI300X and Gaudi often show up first in non-US regions and regional clouds. Use that arbitrage.

Topology: Keep interactive traffic in-region (US-East/West) to protect P95 latency. Route batch inference and fine-tunes to the cheapest region — South America, Europe, wherever the spot market smiles.
Ops: Nearshore team runs the portability harness and certifies new regions/hardware while your core team ships features.
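The topology rule reduces to a simple routing policy: interactive requests stay in the home market, everything else goes wherever it is cheapest. A minimal sketch, with hypothetical region names and made-up prices purely for illustration; `Region` and `route` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    usd_per_1k_tokens: float
    in_home_market: bool  # True for regions close to your users (e.g., US-East/West)


def route(workload: str, regions: list[Region]) -> Region:
    """Choose a region: interactive stays in-market, batch work chases price."""
    if workload == "interactive":
        candidates = [r for r in regions if r.in_home_market]
    elif workload in ("batch", "fine_tune"):
        candidates = list(regions)
    else:
        raise ValueError(f"unknown workload type: {workload}")
    if not candidates:
        raise RuntimeError(f"no eligible region for {workload}")
    return min(candidates, key=lambda r: r.usd_per_1k_tokens)


# Placeholder prices, not real quotes.
regions = [
    Region("us-east", 0.0008, True),
    Region("us-west", 0.0007, True),
    Region("sa-east", 0.0005, False),
]
print(route("interactive", regions).name, route("batch", regions).name)
```

In practice the price column would come from the harness's $/1K tokens report for each certified region, refreshed as spot prices move.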

The provocation

Portability is not an academic ideal. It’s the mechanism that lets you buy whatever is cheap and available next quarter without rewriting your product or renegotiating SLAs. NVIDIA might still be your best option most of the time. Good — a portability plan makes it easier to prove that to procurement with numbers. And when it isn’t, you won’t be the company paying 30% more because your codebase can’t spell ROCm.

A 30-day action plan

Week 1: Run the CUDA audit. Produce a one-page gap list. Stand up a second backend (ROCm or CPU) in CI.
Week 2: Build the portability harness: golden model bundles, prompts, and perf gates. Containerize CUDA and ROCm images with SBOMs.
Week 3: Certify two runtimes (e.g., vLLM on CUDA and ROCm, llama.cpp on CPU/Metal). Generate the first $/1K tokens and FTL report across hardware you can rent this week.
Week 4: Present a procurement brief with three options: “Stay,” “Shift 25% to AMD,” and “Add CPU fallback.” Include cost, risk, and timeline. Make a decision and execute.

Key Takeaways

Performance-per-dollar is shifting quarterly; don’t tie your cost curve to one vendor’s roadmap.
Real portability spans weights, IR, runtime, kernels, quantization, and serving — not just “we can export ONNX.”
Pick two serving stacks and two backends. vLLM + CUDA/ROCm, with llama.cpp as a CPU/Metal safety net, covers most needs.
Measure what matters: cost per 1K tokens and first-token latency under your real workloads.
Quantize deliberately: keep FP16/BF16 and INT8 as blessed tiers; treat INT4 as optional until non-CUDA parity is proven.
Staff a small “compiler team” to own the portability harness, CI gates, and vendor benchmarks.
Use nearshore capacity to qualify new hardware and exploit regional price arbitrage without sacrificing US user latency.