2026-05-29 · 11 min read

Stop Letting the Scheduler Tax Your Database: Cache‑Aware CPU Pinning for Postgres and Valkey

By Diogo Hudson Dias


Your database isn’t slow. Your CPU topology is billing you a tax you never approved. On modern servers, especially AMD EPYC parts with multiple core complexes, letting Linux schedule Postgres or Valkey across the whole machine often wrecks L3 cache locality and inflates p95 latency. The fix is not a bigger instance. It’s aligning your processes with the silicon you already pay for.

Recent public benchmarks and community write-ups have shown double-digit gains by keeping database workers inside a single cache domain (CCD/CCX on AMD, NUMA node boundaries in general). We’ve seen similar wins in the field: 15–35% more throughput and tighter p95 just by getting deliberate about CPU and memory placement. No code changes. No new hardware. Just treating cache as a first-class resource.

Why this matters now

Two 2026 realities collide here:

Core counts outran application awareness. A 64–128 vCPU box looks efficient on a cloud invoice but hides multiple last-level caches and memory controllers. Blindly spreading hot threads across them is self-sabotage.
Databases are latency-sensitive and cache-loving. Postgres and Valkey reuse hot pages and metadata like crazy. Starve their locality and you multiply misses across every call path.

If you operate in pricier regions (for example, AWS sa-east-1 in Brazil often carries a double-digit premium over us-east-1), squeezing 20% more per node is not a micro-optimization—it’s budget control.

Modern CPU topology in 90 seconds

AMD EPYC (Zen 3/4/5): Cores are grouped into Core Complex Dies (CCDs). Each CCD has its own L3 cache. Crossing CCD boundaries means higher latency and more L3 misses. Many high-vCPU instances are multiple CCDs stitched together.
Intel Xeon (Skylake and later): A mesh interconnect with a shared but partitioned LLC. Locality still matters; data movement across the mesh costs cycles.
Arm (AWS Graviton): Typically a single NUMA domain with generous L2/L3, which reduces but doesn’t eliminate locality concerns, especially under high thread counts.

Key idea: keep cooperating threads (the ones that touch the same data) near the same last-level cache and the memory controller that backs it. That means fewer cache misses, fewer cross-die trips, better Instructions Per Cycle (IPC), and happier tail latency.

Why Postgres and Valkey are especially sensitive

Postgres

Process-per-connection: Each backend is its own process with a private stack and heap, yet all of them hit the same shared buffers holding the same tables and indexes.
Hot paths: btree lookups, visibility map checks, catalog access—these love cache. Spread across CCDs and those hot pages keep getting evicted.
Auxiliary processes: walwriter, checkpointer, background writer, and autovacuum contend for CPU and memory bandwidth. Colocating backends and shared memory on the same NUMA node cuts cross-node coherence traffic.

Valkey (Redis)

High QPS, small objects: The workload is “tiny-hot,” exactly what L3 exists to accelerate.
Threaded I/O: Modern builds can hand socket reads and writes to I/O threads (when io-threads is set above 1); misplacement causes cross-core bouncing and socket thrash.
Shards/instances: Running multiple shards on one big host is common. If those shards don’t respect the cache domains, they evict each other’s hot data and you lose much of the benefit of sharding.
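One way to make shards respect cache domains is to hand each shard the CPUs of exactly one L3. The sketch below assumes you already have a map of L3 id to CPU numbers; the function name `assign_shards` and the "one shard per domain" policy are illustrative choices, not something Valkey enforces.

```python
from typing import Dict, List


def assign_shards(domains: Dict[int, List[int]], shards: int) -> Dict[int, List[int]]:
    """Give each shard the CPUs of exactly one cache domain.

    `domains` maps an L3 id to the CPUs that share that L3.
    Raises if there are more shards than domains, because two shards
    on one L3 would compete for the same cache (a policy assumption).
    """
    if shards > len(domains):
        raise ValueError("more shards than cache domains")
    ordered = sorted(domains)  # deterministic: lowest L3 id first
    return {shard: sorted(domains[ordered[shard]]) for shard in range(shards)}
```

The returned CPU lists can then feed whatever pinning mechanism you use (cpusets, systemd, or a launcher script).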

Should you do this? A quick decision framework

Say yes to a cache-aware project if at least two are true:

Your DB CPU is frequently >50% while p95 spikes under load.
perf stat shows cache-miss rates staying high (>10–15%) and IPC sagging (<1) during peak.
The working set fits in RAM and storage isn’t your bottleneck (disk latency p95 is fine).
You’re on instances with 32+ vCPUs or known multi-CCD/NUMA layouts (common in c6a/c7a, c3d, Dasv5/Dv5, etc.).

If storage is your bottleneck, go fix that first. If you’re mostly idle, you won’t notice the gains. This pays off when you actually push the box.
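The checklist above can be encoded as a small gate for a capacity review. This is a sketch: the `Signals` fields and the exact cutoffs (10% miss rate as the low end of the 10–15% band, IPC below 1) are assumptions taken from the list, and you should tune them to your own baselines.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    cpu_util_pct: float           # DB CPU utilization at peak
    p95_spikes_under_load: bool
    cache_miss_rate_pct: float    # from perf stat
    ipc: float                    # instructions per cycle at peak
    working_set_fits_ram: bool
    storage_bound: bool           # True if disk latency p95 is the bottleneck
    vcpus: int
    multi_domain: bool            # known multi-CCD/NUMA layout


def worth_pinning(s: Signals) -> bool:
    """True when at least two of the four criteria hold and storage is healthy."""
    if s.storage_bound:
        return False  # fix storage first
    criteria = [
        s.cpu_util_pct > 50 and s.p95_spikes_under_load,
        s.cache_miss_rate_pct > 10 and s.ipc < 1.0,
        s.working_set_fits_ram,
        s.vcpus >= 32 or s.multi_domain,
    ]
    return sum(criteria) >= 2
```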

Fast path: a 10-day rollout plan

Day 1–2: Topology audit

Map the CPU and cache: run lscpu -e and numactl -H. On AMD, look for CCD/CCX hints; hwloc’s lstopo draws this nicely.
Capture baseline perf: perf stat -e cycles,instructions,cache-references,cache-misses during peak (the miss rate is cache-misses divided by cache-references); log Postgres p95/p99 and Valkey latency/QPS.
Confirm storage isn’t the villain: use iostat and eBPF tools (e.g., cachestat and biolatency from bcc) to check.
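To turn raw perf counters into the two numbers you will track (IPC and cache-miss rate), a small parser helps. The sketch below assumes `perf stat -x,` CSV output where the first field is the count and the third is the event name, and it assumes you collected cache-references alongside cache-misses, since a miss rate needs a denominator. The function names are illustrative.

```python
def parse_perf_csv(text: str) -> dict:
    """Map event name -> count from `perf stat -x,` output lines."""
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) < 3 or not fields[0].strip().isdigit():
            continue  # skips blanks, comments, and "<not counted>" rows
        counts[fields[2]] = int(fields[0])
    return counts


def summarize(counts: dict) -> dict:
    """Compute IPC and cache-miss percentage from raw counters."""
    ipc = counts["instructions"] / counts["cycles"]
    miss_pct = 100.0 * counts["cache-misses"] / counts["cache-references"]
    return {"ipc": ipc, "cache_miss_pct": miss_pct}
```

Capture the same events at the same offered load before and after pinning so the two summaries are comparable.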

Day 3–4: Baseline benchmarks

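Postgres: pgbench with realistic scaling (think a scale factor at least as large as the number of client connections, not toy defaults). Measure TPS and p95; pgbench prints only average latency, so enable per-transaction logging (--log) to get percentiles.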
Valkey: redis-benchmark or valkey-benchmark with your key sizes and pipeline depth.
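Since p95 is the number you will compare later, it is worth computing it the same way every time. Here is a minimal sketch using the nearest-rank method; it assumes a pgbench per-transaction log in which the third whitespace-separated field is the transaction latency in microseconds, so check that against the log format of your pgbench version.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latencies_ms(log_lines):
    """Extract latencies in milliseconds (third field assumed to be microseconds)."""
    return [int(line.split()[2]) / 1000.0 for line in log_lines if line.strip()]
```

For example, `percentile(latencies_ms(open(path)), 95)` gives the p95 in milliseconds for one log file, where `path` is whatever file pgbench wrote.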

Day 5–6: Prototype pinning on staging

Create a cpuset for the database that maps to one cache domain (for example, cores 0–15 that share L3). Keep a separate “housekeeping” set for the kernel, NIC interrupts, and background tasks.
Bind memory local: launch the service with numactl --cpunodebind=X --membind=X so shared memory favors the same node.
Disable irqbalance for the test and manually set IRQ affinities for the NIC to housekeeping cores; set rps_cpus similarly.
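For a quick staging experiment before you build proper cpusets, CPU affinity can be set directly from Python on Linux. This is a sketch using the standard library's `os.sched_setaffinity`; the helper name `pin_process` is illustrative. Affinity applies to the target task and is inherited by processes it forks afterwards, so pinning has to happen before the database starts its workers.

```python
import os


def pin_process(pid: int, cpus: set) -> set:
    """Restrict `pid` (0 means the calling process) to `cpus`.

    Linux only. Returns the affinity the kernel reports afterwards.
    """
    os.sched_setaffinity(pid, cpus)
    return os.sched_getaffinity(pid)
```

This controls CPU placement only; memory placement still needs numactl or an equivalent policy.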

Day 7: Canary in production

Move a subset of traffic to the pinned instance(s). Compare throughput and p95/p99. Aim for 15–35% TPS gain or meaningfully tighter p95.
Watch for regressions: autovacuum lag, checkpoint spikes, or packet drops from mispinned interrupts.
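A canary is easier to judge when the pass criteria are written down first. The sketch below compares a baseline with the pinned canary; the 15% TPS target comes from the text, while the 10% p95 improvement used for "meaningfully tighter" is an assumed threshold that you should replace with your own.

```python
def pct_change(baseline: float, canary: float) -> float:
    """Percentage change from baseline to canary (positive means higher)."""
    return 100.0 * (canary - baseline) / baseline


def canary_verdict(base_tps, canary_tps, base_p95_ms, canary_p95_ms,
                   min_tps_gain_pct=15.0, min_p95_drop_pct=10.0):
    """Pass if TPS rose enough or p95 dropped enough (thresholds are assumptions)."""
    tps_gain = pct_change(base_tps, canary_tps)
    p95_delta = pct_change(base_p95_ms, canary_p95_ms)
    passed = tps_gain >= min_tps_gain_pct or p95_delta <= -min_p95_drop_pct
    return {"tps_gain_pct": tps_gain, "p95_delta_pct": p95_delta, "passed": passed}
```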

Day 8–9: Operationalize

Systemd: add a drop-in for Postgres with AllowedCPUs= and NUMAPolicy=local (the NUMA memory policy directive from systemd.exec). Persist IRQ masks in boot scripts.
Kubernetes: enable CPU Manager static policy, run the DB pod as Guaranteed (requests=limits), and request hugepages if applicable. Use Topology Manager to align CPU and memory. Keep OS “reserved CPUs” separate.
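If you template the systemd side, the drop-in is only a few lines. The sketch below renders one as text; `AllowedCPUs=` and `NUMAPolicy=` are real systemd directives, while the function name and any file path you write it to (for example a `postgresql.service.d` directory, whose exact unit name varies by distro) are assumptions.

```python
def postgres_dropin(cpus: str, numa_policy: str = "local") -> str:
    """Render a systemd drop-in that confines the service to `cpus`.

    `cpus` uses systemd's CPU list syntax, e.g. "0-15".
    """
    return (
        "[Service]\n"
        f"AllowedCPUs={cpus}\n"
        f"NUMAPolicy={numa_policy}\n"
    )
```

After writing the file, `systemctl daemon-reload` and a service restart are needed for the new placement to take effect.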

Day 10: Document and dashboard

Pin the topology map, cpuset masks, and IRQ assignments in your runbooks.
Add Grafana panels for cache misses, IPC, run queue latency, and DB p95. If it isn’t on a graph, it didn’t happen.

Concrete how-to: bare metal or VM

1) Identify the cache domain you’ll use

Use lscpu -e to list CPU, socket, node, and L3 ID columns. Pick the set of CPUs that share the same L3 ID, SMT siblings included; the numbering is often, but not always, contiguous.
Confirm with hwloc or by inspecting /sys/devices/system/cpu/cpuN/cache/index3/id.
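The same information can be pulled from sysfs programmatically. This sketch groups CPUs by the `shared_cpu_list` file under each CPU's `cache/index3` directory; it assumes index3 is the L3 on your machine (usually true, but the `level` file next to it is the authority). The `sysfs_root` parameter exists so the function can be pointed at a test directory.

```python
import glob
import os


def parse_cpu_list(text):
    """Expand kernel list syntax such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def l3_domains(sysfs_root="/sys/devices/system/cpu"):
    """Return one sorted CPU list per distinct L3 sharing group."""
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cache", "index3", "shared_cpu_list")
    domains = {}
    for path in glob.glob(pattern):
        with open(path) as fh:
            key = fh.read().strip()
        domains[key] = parse_cpu_list(key)  # identical groups collapse to one key
    return sorted(domains.values())
```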

2) Build cpusets

Create two cgroups: db and housekeeping. Assign the db service to the L3-local CPUs and everything else (irq, systemd-oomd, journald) to the housekeeping set.
In systemd, use AllowedCPUs= for the service and CPUAffinity= for OS services you want off the hot cores.
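Splitting the machine into a db set and a housekeeping set is simple set arithmetic, and doing it in code avoids off-by-one masks. The sketch below returns both sets in the range syntax that `AllowedCPUs=` and `CPUAffinity=` accept; the function names are illustrative.

```python
def format_cpu_list(cpus):
    """Compress CPU numbers into list syntax, e.g. [0, 1, 2, 8] -> '0-2,8'."""
    ordered = sorted(set(cpus))
    parts, i = [], 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and ordered[j + 1] == ordered[j] + 1:
            j += 1  # extend the current consecutive run
        parts.append(str(ordered[i]) if i == j else f"{ordered[i]}-{ordered[j]}")
        i = j + 1
    return ",".join(parts)


def split_cpus(all_cpus, db_cpus):
    """Return (db_list, housekeeping_list) as strings; housekeeping is the remainder."""
    everything, db = set(all_cpus), set(db_cpus)
    if not db <= everything:
        raise ValueError("db CPUs not present on this machine")
    housekeeping = everything - db
    if not housekeeping:
        raise ValueError("no CPUs left for housekeeping")
    return format_cpu_list(db), format_cpu_list(housekeeping)
```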

3) Bind memory and enable huge pages

Postgres: set huge_pages = on (ensure vm.nr_hugepages is provisioned). Launch under numactl --membind to keep the buffer pool local.
Valkey: keep shard processes aligned to their cpuset and use SO_REUSEPORT with one listener per shard if you front with a single VIP.
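Provisioning vm.nr_hugepages is mostly arithmetic. The sketch below estimates the page count for a given shared_buffers size with 2 MiB pages; the 5% headroom for Postgres's other shared memory is an assumption, not a documented figure. On PostgreSQL 15 and later, the server can report the exact requirement through the shared_memory_size_in_huge_pages parameter, which is the better source when available.

```python
import math


def hugepages_needed(shared_buffers_mib: int, page_mib: int = 2,
                     headroom_pct: int = 5) -> int:
    """Estimate huge pages for shared_buffers plus headroom (headroom is a guess)."""
    total_mib = shared_buffers_mib * (100 + headroom_pct) / 100
    return math.ceil(total_mib / page_mib)
```

For an 8 GiB buffer pool this suggests `hugepages_needed(8192)` pages as a starting value for vm.nr_hugepages.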

4) IRQ and network

Disable or restrict irqbalance. Manually set NIC IRQs to housekeeping cores (check /proc/interrupts, set /proc/irq/*/smp_affinity_list).
For high packet rates, set rps_cpus and xps_cpus for the NIC queues to match the housekeeping set to avoid polluters on DB cores.
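Unlike smp_affinity_list, the rps_cpus and xps_cpus files take a hexadecimal bitmask, with comma-separated 32-bit groups once you go past CPU 31. A small helper avoids hand-computing masks; the name `cpu_mask` is illustrative.

```python
def cpu_mask(cpus) -> str:
    """Hex CPU bitmask with comma-separated 32-bit groups, most significant first."""
    value = 0
    for cpu in cpus:
        value |= 1 << cpu
    groups = []
    while True:
        groups.append(value & 0xFFFFFFFF)
        value >>= 32
        if value == 0:
            break
    groups.reverse()
    # The leading group is unpadded; later groups are zero-padded to 8 digits.
    return ",".join([f"{groups[0]:x}"] + [f"{g:08x}" for g in groups[1:]])
```

The resulting string is what you would write into a queue's rps_cpus or xps_cpus file for the housekeeping set.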

5) Postgres knobs that benefit

shared_buffers: 25–40% of RAM is a sane range for many OLTP workloads. Bigger isn’t always better; locality matters more than raw size.
max_parallel_workers and max_worker_processes: keep total active workers within your pinned core count. Blowing past the L3-local core budget destroys the win.
checkpoint_timeout, max_wal_size, autovacuum_work_mem: spread heavy work to avoid synchronized stalls that punish cache.
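A pre-deploy lint can catch settings that silently blow the core budget. The sketch below checks the two rules of thumb from this section: shared_buffers within 25–40% of RAM and worker counts within the pinned core count. The function name and the choice to return warnings as strings are illustrative.

```python
def check_pg_settings(pinned_cores, ram_gib, shared_buffers_gib,
                      max_worker_processes, max_parallel_workers):
    """Return a list of warnings; an empty list means the settings fit the budget."""
    warnings = []
    share = shared_buffers_gib / ram_gib
    if not 0.25 <= share <= 0.40:
        warnings.append(f"shared_buffers is {share:.0%} of RAM, outside 25-40%")
    # Parallel workers are drawn from the max_worker_processes pool.
    if max_parallel_workers > max_worker_processes:
        warnings.append("max_parallel_workers exceeds max_worker_processes")
    if max_worker_processes > pinned_cores:
        warnings.append("max_worker_processes exceeds the pinned core count")
    return warnings
```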

Kubernetes variant (what most of you actually run)

Use Guaranteed QoS: requests equal limits for CPU and memory, with CPU as a whole number. Otherwise, the kubelet can time-slice you across CPUs and defeat locality.
Enable CPU Manager static policy: this gives exclusive CPUs to containers in Guaranteed pods that request integer CPU counts.
Topology Manager: set to single-numa-node or restricted so CPU and memory allocations land on the same NUMA node; best-effort only prefers alignment and still admits misaligned pods.
Reserve CPUs for the OS and system daemons: set --reserved-cpus on the kubelet so node plumbing doesn’t camp on your DB’s cache domain.
Huge pages: request hugepages-2Mi in the Pod spec; configure at the node level first.
Anti-affinity and node selectors: keep DB pods away from noisy neighbors and ensure they schedule on the intended instance type/topology.
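A common way to miss out on exclusive CPUs is a resources block that looks Guaranteed but requests a fractional CPU. The sketch below checks a single container's resources dict (the shape you would get from parsing a Pod spec); it compares memory quantities as plain strings and ignores other containers in the pod, both of which are simplifications.

```python
def cpu_millis(quantity: str) -> int:
    """Parse a Kubernetes CPU quantity such as '4', '0.5', or '500m' into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def eligible_for_exclusive_cpus(resources: dict) -> bool:
    """True if requests equal limits and the CPU request is a whole number of cores."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    if "cpu" not in requests or "cpu" not in limits:
        return False
    if "memory" not in requests or requests["memory"] != limits.get("memory"):
        return False
    req, lim = cpu_millis(requests["cpu"]), cpu_millis(limits["cpu"])
    return req == lim and req > 0 and req % 1000 == 0
```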

If you run a Postgres operator (Crunchy, Zalando), you can still use these primitives: set resource requests=limits, add node selectors/taints, and ensure the operator doesn’t reschedule you onto mismatched hardware.

Cloud instance notes you should care about

AWS AMD (c6a/c7a/m7a/r7a): Many sizes stitch together multiple CCDs. Favor fewer large boxes only if you commit to pinning; otherwise, mid-size instances can win due to simpler topology.
AWS Graviton (c7g/m7g): Locality benefits are smaller but still real under heavy multithreading. Gains tend to be in p95 stability more than peak TPS.
GCP C3D (AMD) / C3 (Intel): Same guidance: map the cache domains before you scale out a fleet that fights itself.
Azure Dasv5/Dv5: Like AWS AMD/Intel families; the NUMA map is your friend. Don’t schedule blind.

What kind of gains are realistic?

Public reports from practitioners show:

Postgres: 15–25% TPS improvement on pgbench-style OLTP by pinning to a single CCD’s worth of cores with local memory, versus letting backends roam across two or more CCDs. p95 often tightens by a similar factor due to fewer cross-die excursions.
Valkey: 20–35% QPS gains when running multiple shards, each pinned to its own cache domain, plus I/O threads aligned. Pipelines with small payloads benefit the most.

Your mileage will vary, but if you’re not seeing double-digit deltas, re-check your CPU mapping and IRQ placement; a single stray interrupt on a hot core can erase the win.

Trade-offs and gotchas (read this twice)

Ceiling vs. efficiency: Confining to a single cache domain may cap absolute max throughput if your workload truly scales linearly with more cores. Many OLTP workloads stop scaling past a certain point and flatten out under cache-coherency traffic; that is where pinning helps.
Operational complexity: You’re now in the business of pinning and IRQ management. Template it in Ansible/Terraform or your cluster layer so it survives reboots and node rotations.
Container orchestration fights back: Without CPU Manager static and Guaranteed QoS, K8s will time-slice you into mediocrity. Don’t half-implement.
Hypervisor noise: On shared hosts, you can’t fully control placement. Prefer dedicated host flavors if you chase every microsecond.
Thermals and boost: Pinned hot cores can sustain higher steady-state temps; watch for throttling. Don’t assume boost clocks will bail you out.

Verification: how to know you actually won

Micro-metrics: IPC should go up; cache-miss rate should go down. Compare perf stat at equal offered load.
App SLOs: p95/p99 latencies should tighten and drift less with traffic surges.
System noise: Run queue latency on hot cores should flatten; fewer involuntary context switches; NIC IRQs should no longer appear on DB cores in /proc/interrupts.
Cost per TPS: Track TPS per dollar-hour. If you can downsize one instance class or consolidate nodes post-tuning, you’ve got cash in hand.
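The /proc/interrupts check is easy to automate by diffing two snapshots taken a few seconds apart. The sketch below reports IRQs whose counters rose on any DB CPU; the `name_filter` argument (for example your NIC's interface name) and the function names are assumptions for illustration.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts into {irq: ({cpu: count}, description)}."""
    lines = text.strip().splitlines()
    cpus = [int(name[3:]) for name in lines[0].split()]  # header: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        tokens = rest.split()
        head = tokens[:len(cpus)]
        if len(head) < len(cpus) or not all(t.isdigit() for t in head):
            continue  # rows such as ERR/MIS carry a single total, skip them
        counts = dict(zip(cpus, map(int, head)))
        table[label.strip()] = (counts, " ".join(tokens[len(cpus):]))
    return table


def irqs_on_db_cores(before, after, db_cpus, name_filter):
    """IRQs matching name_filter whose counts increased on a DB CPU between snapshots."""
    old, new = parse_interrupts(before), parse_interrupts(after)
    hits = []
    for irq, (counts, desc) in new.items():
        if name_filter not in desc or irq not in old:
            continue
        if any(counts.get(c, 0) > old[irq][0].get(c, 0) for c in db_cpus):
            hits.append(irq)
    return hits
```

An empty result for your NIC during a load test is the signal that interrupts are staying on the housekeeping cores.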

Make it repeatable

Topology as code: Store the cpuset masks and node selectors next to your infra code. For K8s, gate deployments on node labels that advertise the right topology (Node Feature Discovery can help).
Golden images: Bake irq configs, hugepage settings, and systemd drop-ins into AMIs/images. Don’t rely on ad-hoc shell scripts.
Observability baked in: Dashboards for cache metrics are part of the feature. Alert on cache-miss explosions and rising runqlat on pinned cores.

Where nearshore fits

This is perfect nearshore work: time-bound, infrastructure-heavy, and measurable. A pod with deep Linux and database chops can ship this in two weeks with 6–8 hours of daily overlap to run canaries and iterate. For teams operating in Brazil’s AWS sa-east-1, a 20% throughput lift can offset regional cost premiums immediately; for US teams, it’s a hedge against scaling pressure while you defer a costly sharding or rearchitecture project.

When not to bother

Your workload is clearly I/O-bound (storage p95 is ugly) or network-bound (tiny packets saturate the NIC before CPU).
You already run on small, single-CCD/NUMA VMs—there’s no topology to fight.
You’re scheduled on a noisy, shared hypervisor and can’t control IRQs or CPU exclusivity—fix tenancy first.

Bottom line

Linux’s default scheduler is a miracle of general-purpose engineering. Your database is not general purpose. Treat cache and NUMA locality as a budget line item and you’ll stop paying a hidden tax every time your traffic spikes. This isn’t wizardry; it’s discipline: measure, pin, bind, verify. Do that, and you’ll buy 15–35% headroom for pennies.

Key Takeaways

Your database may be paying a hidden cache/NUMA tax—expect 15–35% wins from locality.
Target one cache domain (CCD/NUMA node) for Postgres/Valkey workers; bind CPU and memory together.
Use cpusets/systemd on VMs; CPU Manager static and Guaranteed QoS on Kubernetes.
Fix IRQ and housekeeping placement; a stray interrupt on a hot core can erase your gains.
Verify with IPC and cache-miss metrics plus p95/p99 latency; bake settings into infra code.