You’re paying to store air. A 1,536‑dimensional float32 embedding is ~6 KB. At 50 million items, that’s 300 GB of raw vectors—before index overhead and replicas. Most teams then add an in‑RAM HNSW graph for speed, ballooning memory into the high hundreds of gigabytes. Monthly cloud bills follow. Meanwhile, recent results on asymmetric quantization show you can shrink that footprint by 90–97% with near‑lossless retrieval quality. Translation: fewer, cheaper boxes; faster cold starts; and the option to run on commodity CPUs or even at the edge.
This is not a research curiosity. It’s production‑ready in 2026 with mature libraries like FAISS and ScaNN, and supported in popular vector systems. If you own a RAG system, similarity search, recommendations, or dedup at scale, asymmetric quantization (AQ) should be on your immediate roadmap.
What asymmetric quantization actually is (and why it works)
Quantization compresses high‑dimensional float vectors into compact codes. With asymmetric quantization, you keep the query in float but store the database vectors as compressed codes, then compute distances using asymmetric distance computation (ADC). Because you don’t quantize the query, you avoid the largest accuracy cliff while still reaping almost all of the storage and bandwidth savings.
The workhorses are:
- Product Quantization (PQ): Split a D‑dim vector into m sub‑vectors; learn a small codebook per subspace (e.g., 256 centroids = 8 bits). A vector becomes m bytes of codes. Typical configs: D=1,536, m=96, 8 bits → 96 bytes per vector.
- Optimized PQ (OPQ): Learn a rotation of the space to spread signal more evenly across subspaces, improving recall for the same code size.
- IVF‑PQ: A two‑stage index: coarse quantizer (IVF) narrows search to a few clusters; within each, PQ compresses the residuals. This yields CPU‑friendly latency at massive scale.
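To make the PQ bullet concrete, here is a minimal pure‑Python sketch of encoding one vector into m one‑byte codes. The sizes are toy values, and the "training" step simply samples sub‑vectors as centroids; that shortcut is an assumption for brevity, and a real pipeline would run k‑means per subspace (and usually OPQ first). The function names are illustrative, not a library API.

```python
import random

def sq_dist(a, b):
    """Squared L2 distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_codebooks(vectors, m, k, seed=0):
    """Toy stand-in for training: sample k sub-vectors per subspace as centroids.
    Real systems learn these centroids with k-means."""
    rng = random.Random(seed)
    d_sub = len(vectors[0]) // m
    codebooks = []
    for j in range(m):
        subs = [v[j * d_sub:(j + 1) * d_sub] for v in vectors]
        codebooks.append(rng.sample(subs, k))
    return codebooks

def pq_encode(vector, codebooks):
    """Return one nearest-centroid index per subspace, packed as bytes (k <= 256)."""
    m = len(codebooks)
    d_sub = len(vector) // m
    code = []
    for j, centroids in enumerate(codebooks):
        sub = vector[j * d_sub:(j + 1) * d_sub]
        code.append(min(range(len(centroids)), key=lambda c: sq_dist(sub, centroids[c])))
    return bytes(code)

rng = random.Random(42)
D, M, K = 32, 8, 16   # toy sizes; the article's example is D=1536, m=96, 256 centroids
data = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(200)]
codebooks = train_codebooks(data, M, K)
code = pq_encode(data[0], codebooks)
print(len(code), "bytes of codes instead of", D * 4, "bytes of float32")
```

The stored artifact is just `code`; the codebooks are shared across the whole corpus, which is where the compression comes from.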
Recent public results show up to ~97% storage reduction at minimal recall loss on standard retrieval tasks. In practice, we regularly see 90–95% reduction at 0.95–0.99 recall@10 relative to the float baseline, if you:
- Train OPQ with enough data (≥1–5M representative samples).
- Tune IVF list count and probes to your latency budget.
- Use reranking (e.g., dot‑product or a cross‑encoder) on the top‑K candidates.
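The reranking bullet is simple enough to sketch. The snippet below re‑scores a candidate list with exact dot products against stored floats. `float_store` is a hypothetical id‑to‑vector lookup (a cache or key‑value store), and the candidate order stands in for whatever the quantized index returned.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rerank(query, candidate_ids, float_store, top_k=10):
    """Re-score approximate candidates with exact dot products; keep the best top_k.
    Candidates missing from float_store are skipped rather than guessed at."""
    scored = [(dot(query, float_store[i]), i) for i in candidate_ids if i in float_store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:top_k]]

# Hypothetical data: three items and an approximate ordering from the index.
float_store = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [0.0, 1.0]}
approx_order = ["c", "a", "b"]
print(rerank([0.0, 1.0], approx_order, float_store, top_k=2))  # ['c', 'b']
```

A cross‑encoder reranker has the same shape: swap `dot` for a model score over (query text, document text) pairs.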
Why this matters for your TCO
Use a boring, common setup to illustrate the math:
- Embedding: 1,536‑d float32 (≈6 KB/vector).
- Corpus: 50M items → 300 GB raw vectors.
- HNSW in‑RAM footprint (vectors plus graph): frequently 1–2× the raw vector size, depending on M/ef settings. Assume 1.5× → 450 GB in RAM.
- HA/replication factor 2–3× across nodes → 900–1,350 GB effective footprint.
The result: you’re renting multiple 256–512 GB RAM instances (CPU or GPU) year‑round to keep latency sub‑100 ms under load. Depending on region and generation, that is easily $6k–$12k/month in compute alone, often more than the vector DB storage bill.
Now flip the design:
- PQ with m=96, 8 bits → ~96 bytes/vector (plus small IVF metadata). That’s ~4.8 GB of raw codes for 50M vectors, roughly 60× smaller than the 300 GB of raw floats. Once you include index overheads, the total is typically 10–20× smaller than the float baseline.
- Use IVF‑PQ on CPU. Keep codes on NVMe; cache hot lists and LUTs in memory. With sensible tuning, expect p95 latency of 20–40 ms or better on modern CPUs at 95% recall@10 for 50M–200M scale.
- Compute footprint drops to one or two 64–128 GB instances per replica, often $1.5k–$3k/month per node. Cold starts are faster; rebuilds don’t require terabyte‑class ephemeral volumes.
Not every workload will see this exact curve, but the direction is consistent: PQ moves you from memory‑bound to cache‑and‑storage‑balanced designs, which are cheaper, more elastic, and easier to run across regions (or on‑prem) without GPU scarcity.
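The arithmetic in this section is easy to reproduce. The sketch below uses exact byte counts (6,144 bytes per float32 vector rather than the rounded 6 KB), so its figures land slightly above the rounded ones in the text. The 1.5× index factor and the 2–3× replica counts are the same assumptions as above, and the helper name is illustrative.

```python
def footprint_gb(n_vectors, bytes_per_vector, index_factor=1.0, replicas=1):
    """Decimal gigabytes for n_vectors, scaled by an index factor and replica count."""
    return n_vectors * bytes_per_vector * index_factor * replicas / 1e9

N = 50_000_000
FLOAT_BYTES = 1536 * 4   # float32 embedding: 6,144 bytes
PQ_BYTES = 96            # m=96 sub-quantizers at 8 bits each

raw_floats = footprint_gb(N, FLOAT_BYTES)
hnsw_ram = footprint_gb(N, FLOAT_BYTES, index_factor=1.5)
hnsw_replicated = [footprint_gb(N, FLOAT_BYTES, 1.5, r) for r in (2, 3)]
pq_codes = footprint_gb(N, PQ_BYTES)
raw_reduction = 1 - PQ_BYTES / FLOAT_BYTES

print(f"raw floats:        {raw_floats:.1f} GB")
print(f"HNSW in RAM:       {hnsw_ram:.1f} GB")
print(f"with 2-3 replicas: {hnsw_replicated[0]:.1f}-{hnsw_replicated[1]:.1f} GB")
print(f"PQ codes:          {pq_codes:.1f} GB")
print(f"raw-code reduction: {raw_reduction:.1%}")
```

The raw‑code reduction comes out near 98%; IVF metadata, codebooks, and any float cache kept for reranking pull the end‑to‑end number back toward the 90–97% range quoted earlier.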
When you should (and shouldn’t) use asymmetric quantization
Adopt AQ if you check most of these boxes:
- Corpus ≥ 10M items, or multi‑tenant search with many active tenants.
- Latency SLOs ≥ 20 ms p95 for the vector stage (you can shave more with caching and narrower probes).
- Stable embedding model for at least 3–6 months. Frequent model churn increases codebook maintenance cost.
- Dot‑product or L2 distance. If you need a specialty metric, check library support.
- Reranking step available to recover long‑tail quality (e.g., top‑200 candidates → cross‑encoder@K=50).
Think twice if:
- You run small (≤1M) corpora that already fit in RAM with HNSW at trivial cost.
- Your SLO is ultra‑low latency (e.g., 5 ms p95) and you can afford GPU/hand‑tuned RAM indexes.
- Your embeddings change weekly and you can’t afford dual‑index rebuilds.
The architecture: IVF‑PQ with ADC, plus reranking
For most CTOs, a safe default in 2026 looks like this:
- Coarse partition (IVF): Train k‑means with k in the 16k–262k range depending on corpus size. Rule of thumb: treat k ≈ sqrt(N) as a floor (about 7k at N = 50M), expect to land several times higher at this scale, and refine based on probes/latency.
- Residual coding with OPQ+PQ: Learn an OPQ rotation; train PQ on residuals with m chosen so that m bytes/vector hits your storage target. Commonly m=64–128 with 8‑bit codes.
- Asymmetric distance computation: Keep the query in float; for each probed list, build a lookup table (LUT) of distances between each query sub‑vector and the centroids of the matching sub‑quantizer codebook. The ADC distance to a stored code is then a sum of m LUT entries, which is cache‑friendly on CPUs.
- Candidate set + rerank: Return top 500–2,000 candidates from ADC; rerank with the original floats if you store a small float cache for hot items, or use a cross‑encoder for precision@K.
This pattern is implemented in FAISS (IndexIVFPQ, or index_factory strings such as “OPQ96,IVF65536_HNSW32,PQ96”) and ScaNN’s partitioning + asymmetric hashing flow. Many vector DBs expose it under the hood. You don’t have to rewrite your stack.
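For intuition on the LUT step, here is a minimal pure‑Python sketch of asymmetric distance computation over plain PQ codes. It deliberately leaves out the IVF stage and residual coding, so treat it as an illustration of the table‑lookup idea rather than a full IVF‑PQ implementation. The codebooks and codes are hand‑written toy values.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_lut(query, codebooks):
    """lut[j][c] = squared L2 distance from the query's j-th sub-vector to centroid c."""
    m = len(codebooks)
    d_sub = len(query) // m
    return [
        [sq_dist(query[j * d_sub:(j + 1) * d_sub], centroid) for centroid in centroids]
        for j, centroids in enumerate(codebooks)
    ]

def adc_distance(code, lut):
    """Approximate squared distance to a stored code: one table lookup per subspace."""
    return sum(lut[j][c] for j, c in enumerate(code))

# Two subspaces of two dimensions each, two centroids per subspace (toy values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # subspace 0
    [[0.0, 1.0], [1.0, 0.0]],   # subspace 1
]
database = {
    "x": bytes([0, 1]),   # decodes to [0, 0, 1, 0]
    "y": bytes([1, 0]),   # decodes to [1, 1, 0, 1]
}

query = [0.9, 1.1, 0.1, 0.8]   # stays in float: this is the asymmetric part
lut = build_lut(query, codebooks)
ranked = sorted(database, key=lambda item: adc_distance(database[item], lut))
print(ranked)  # ['y', 'x']
```

The LUT is built once per query (once per probed list when residuals are used) and reused for every code scanned, so the per‑vector cost is m additions.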
A concrete 60–90 day rollout plan
Days 0–15: Baseline and target
- Fix your yardstick: Lock a representative benchmark of 10–20 real queries per head topic and 10k–100k annotated pairs, evaluated with recall@10 and nDCG@10.
- Record current costs: RAM footprint of HNSW (or current ANN), instance SKUs, QPS, p95/p99 latency, and monthly $/QPS.
- Pick a risk budget: e.g., ≥0.97 recall@10 of baseline and ≤10 ms extra p95 for the vector stage. Put it in writing.
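The two yardstick metrics can be sketched in a few lines. Here recall@k is measured relative to the float baseline's top‑k (matching how the risk budget is phrased), and nDCG@k uses graded relevance labels. The function names and sample ids are illustrative.

```python
import math

def recall_at_k(candidate_ids, baseline_ids, k=10):
    """Share of the baseline's top-k that the candidate path also returns in its top-k."""
    base = set(baseline_ids[:k])
    if not base:
        return 1.0
    return len(base & set(candidate_ids[:k])) / len(base)

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k with graded gains; ids missing from `relevance` count as 0."""
    dcg = sum(relevance.get(doc, 0) / math.log2(rank + 2)
              for rank, doc in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

baseline_top10 = list(range(10))
pq_top10 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 99]   # one miss out of ten
relative_recall = recall_at_k(pq_top10, baseline_top10)
print(relative_recall, relative_recall >= 0.97)  # 0.9 False

labels = {"a": 3, "b": 1}
print(round(ndcg_at_k(["b", "a"], labels), 3))
```

Average both metrics over the full benchmark before comparing against the risk budget; a single query tells you little.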
Days 15–45: Train, build, tune
- Sample training data: 1–5M vectors spanning tenants/locales; keep a held‑out set.
- Train OPQ and IVF: Start with OPQ m=96 (8‑bit). For IVF, start at k ≈ sqrt(N); test nprobe 10–200.
- Build IVFPQ: Store codes on NVMe; precompute per‑list statistics; enable mmap where supported.
- Measure trade‑offs: Sweep m ∈ {64, 96, 128}, code bits ∈ {6, 8}, nprobe ∈ {8, 16, 32, 64, 128}. Plot recall vs p95 vs CPU%.
- Rerank strategy: Try a two‑stage: IVFPQ@1k → cross‑encoder@50 (or float dot‑product if you can cache floats for head items). Validate lift on long‑tail queries.
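A sweep like the one above is easier to manage as an explicit grid. The sketch below enumerates the configurations and computes the code size for each. The `nlist` value is a hypothetical pick from the 16k–262k range, and the factory strings are assumed to follow FAISS's index_factory syntax, so verify them against your installed version before building.

```python
import itertools
import math

N = 50_000_000
D = 1536
NLIST = 65_536   # hypothetical choice inside the 16k-262k range

def code_bytes(m, bits):
    """Bytes per vector for m sub-quantizers at `bits` bits each, rounded up."""
    return math.ceil(m * bits / 8)

grid = []
for m, bits, nprobe in itertools.product([64, 96, 128], [6, 8], [8, 16, 32, 64, 128]):
    if D % m != 0:
        continue   # PQ needs the dimension to split evenly into m sub-vectors
    size = code_bytes(m, bits)
    grid.append({
        "m": m,
        "bits": bits,
        "nprobe": nprobe,
        "bytes_per_vector": size,
        "codes_gb": N * size / 1e9,
        "factory": f"OPQ{m},IVF{NLIST},PQ{m}x{bits}",   # assumed FAISS factory syntax
    })

for row in grid[:3]:
    print(row)
print(len(grid), "configurations to benchmark")
```

Record recall, p95 latency, and CPU utilization against each row so the plot falls out of one table.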
Days 45–75: Shadow and safety
- Dual‑run: Serve users from the baseline path; in parallel, shadow the PQ path for a random 5–10% of traffic. Log interleaved results to estimate online recall and click‑through deltas.
- Drift monitors: Alert on cosine similarity distribution shift between query and top‑K; it’s an early indicator your codebooks are aging.
- Hot rebuild plan: Build codebooks off to the side weekly; flip traffic with a feature flag. Keep the last two codebooks live for rollback.
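One simple way to implement the drift monitor is a mean‑shift check on top‑K similarity scores. The sketch below compares a current window against a reference window and flags a shift beyond a threshold measured in standard errors. The threshold of 3 and the sample windows are assumptions; a production monitor might compare full distributions instead.

```python
import statistics

def similarity_drift(reference, current, threshold=3.0):
    """Return (drifted, z): z is the shift of the current mean top-K similarity
    from the reference mean, in standard errors of the current window."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    cur_mean = statistics.fmean(current)
    if ref_std == 0:
        z = 0.0 if cur_mean == ref_mean else float("inf")
    else:
        z = (cur_mean - ref_mean) / (ref_std / len(current) ** 0.5)
    return abs(z) > threshold, z

# Hypothetical windows of mean query-to-top-K cosine similarity.
reference = [0.80, 0.82, 0.78, 0.81, 0.79, 0.80, 0.83, 0.77]
stable = [0.80, 0.81, 0.79, 0.80]
shifted = [0.70, 0.71, 0.69, 0.70]

print(similarity_drift(reference, stable))
print(similarity_drift(reference, shifted))
```

A sustained alert is the cue to retrain codebooks on a fresh sample and run the hot rebuild.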
Days 75–90: Cutover and cost capture
- Gradual cutover: 10% → 50% → 100% of tenants. Keep baseline ANN as fallback for a defined window.
- Right‑size instances: Move from 512 GB RAM to 64–128 GB; scale horizontally for QPS, not memory pressure.
- Bank the savings: Track $/QPS reduction. In our experience, well‑tuned PQ can cut vector search compute by 2–5× and storage by 10–30× with no measurable user harm.
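For the gradual cutover, a stable hash bucket per tenant keeps the rollout monotone: a tenant moved to the PQ path at 10% stays there at 50% and 100%. This is a minimal sketch; the tenant ids and the flag function are hypothetical, and a real deployment would likely read the percentage from a feature‑flag service.

```python
import hashlib

STAGES = [10, 50, 100]   # rollout percentages from the cutover plan

def tenant_bucket(tenant_id):
    """Map a tenant to a stable bucket in [0, 100) using a hash of its id."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100

def use_pq_path(tenant_id, rollout_percent):
    """True if this tenant should be served from the PQ index at this stage."""
    return tenant_bucket(tenant_id) < rollout_percent

tenants = [f"tenant-{i}" for i in range(1000)]   # hypothetical tenant ids
for stage in STAGES:
    enabled = sum(use_pq_path(t, stage) for t in tenants)
    print(f"{stage:>3}% stage: {enabled} of {len(tenants)} tenants on PQ")
```

Rolling back is the same call with a lower percentage, which is why the baseline ANN stays up during the fallback window.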
Common failure modes (and how to dodge them)
- Training on the wrong distribution: Your codebooks learn whatever you feed them. If you exclude certain languages, tenants, or modalities, expect recall cliffs in production. Fix: stratified sampling and periodic retraining.
- Under‑provisioned IVF: Too few lists or too few probes starve recall. Fix: increase k or nprobe; cache LUTs and hot lists; check NUMA pinning.
- Skipping rerank at high compression: Below ~64–96 bytes/vector, expect more approximation error. Fix: always rerank a small candidate set.
- Model churn without dual‑indexing: Changing embedding models invalidates codebooks. Fix: maintain two complete indices; rebuild in the background; cut over with flags.
- MIPS vs L2 mismatch: If you use dot‑product similarity, ensure your index and library optimize for MIPS (maximum inner product search). Some stacks convert MIPS→L2 with tricks (e.g., norm augmentation). Verify.
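The norm‑augmentation trick mentioned in the last bullet can be checked on a toy example. Append sqrt(M² − ‖x‖²) to each database vector, where M is the largest norm in the corpus, and append 0 to the query. The squared L2 distance then equals ‖q‖² + M² − 2·(q·x), so L2 order matches inner‑product order. The vectors below are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def augment_database(vectors):
    """Append sqrt(M^2 - ||x||^2) to each vector, with M the maximum norm."""
    max_sq = max(dot(v, v) for v in vectors)
    return [v + [math.sqrt(max(max_sq - dot(v, v), 0.0))] for v in vectors]

def augment_query(query):
    """Queries get a zero in the extra dimension."""
    return query + [0.0]

db = [[3.0, 0.0], [1.0, 1.0], [0.0, 0.5]]
q = [2.0, 1.0]

by_inner_product = sorted(range(len(db)), key=lambda i: -dot(q, db[i]))
by_plain_l2 = sorted(range(len(db)), key=lambda i: sq_dist(q, db[i]))
aug_db, aug_q = augment_database(db), augment_query(q)
by_augmented_l2 = sorted(range(len(db)), key=lambda i: sq_dist(aug_q, aug_db[i]))

print(by_inner_product)   # [0, 1, 2]
print(by_plain_l2)        # [1, 0, 2]  (plain L2 disagrees with inner product)
print(by_augmented_l2)    # [0, 1, 2]
```

If your stack applies a transform like this, M has to be recomputed (or bounded) when new vectors arrive, which is one more reason to verify what the library does.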
Tooling choices that won’t age badly
- FAISS: Battle‑tested, GPU and CPU, deep PQ/IVF support, HNSW hybrids. Use for maximum control and portability.
- ScaNN: Strong on CPU performance with partitioning + asymmetric hashing; good defaults.
- Vector DBs: Milvus, Qdrant, Weaviate, and commercial services expose IVF‑PQ/HNSW‑PQ. Beware hidden configuration caps (e.g., max codebook sizes) and opaque recalls.
- Rerankers: For multilingual or domain‑heavy queries, a small cross‑encoder (e.g., MiniLM‑class) at K=50–200 often recovers the last 1–2 points of precision for a few milliseconds of added latency.
A simple cost model you can explain to finance
Frame the business case clearly:
- Baseline: 3× r5b.16xlarge (512 GB) for HNSW + 2× m6i.8xlarge for API + storage replication → $8k–$14k/month all‑in, plus managed vector DB fees.
- With PQ: 2× m7i.4xlarge (64 GB) for IVFPQ + 2× m6i.8xlarge for API → $3k–$6k/month, storage 10–30× smaller. Same traffic, same user metrics.
- Payback: 30–60 days including one‑off engineering time, assuming a 2–5× compute reduction and modest infra rework.
Even if your exact SKUs and prices differ, finance will understand a move from memory‑bound replicas to storage‑balanced nodes with the same or better SLOs.
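The payback claim reduces to one division. The sketch below plugs in the midpoints of the monthly ranges above; the one‑off engineering cost is a hypothetical placeholder, not a figure from this article, so substitute your own estimate.

```python
def payback_days(baseline_monthly, new_monthly, one_off_cost, days_per_month=30):
    """Days until cumulative savings cover the one-off cost; None if there are no savings."""
    monthly_savings = baseline_monthly - new_monthly
    if monthly_savings <= 0:
        return None
    return one_off_cost / monthly_savings * days_per_month

baseline = 11_000   # midpoint of the $8k-$14k/month baseline range
with_pq = 4_500     # midpoint of the $3k-$6k/month PQ range
one_off = 10_000    # hypothetical engineering cost; replace with your estimate

days = payback_days(baseline, with_pq, one_off)
print(f"monthly savings: ${baseline - with_pq:,}")
print(f"cost reduction:  {baseline / with_pq:.1f}x")
print(f"payback:         {days:.0f} days")
```

Running it with the pessimistic ends of both ranges as well gives finance a band rather than a single number.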
What about GPUs?
GPUs shine when you need ultra‑low latency at very high QPS or you’re already on‑GPU for upstream tasks. But with IVFPQ+ADC, modern CPUs can meet sub‑40 ms p95 for tens to hundreds of millions of vectors. If you do keep GPUs, you can still quantize to fit larger corpora in device memory or to reduce PCIe bandwidth.
Edge and on‑device: the optionality dividend
Compression is an architecture choice, not just a cost hack. With 96 bytes/vector, a 5M‑item index is ≈480 MB of codes. Suddenly:
- Multi‑region replication becomes reasonable without cross‑region data transfer drama.
- Private on‑prem deployments in regulated accounts fit on 1–2 commodity servers.
- On‑device or in‑browser search for small corpora (≤500k) is viable with WASM or mobile NN libraries, opening new privacy‑preserving product surfaces.
Brazilian nearshore angle: who will actually do the boring tuning?
None of this is rocket science, but it is systems work: sampling, training, building indices, wiring observability, and iterating until the recall/latency frontier looks right. It is also the kind of engineering that gets deprioritized until the bill arrives. Nearshore teams with FAISS/ScaNN mileage can hand you a working PQ rollout in 6–10 weeks, then stick around quarterly to retrain codebooks and right‑size clusters as your corpus grows. You get the savings without building a permanent “vector infra” squad.
Compliance and privacy: smaller can be safer
Compressed codes leak less raw signal than floats. That’s not a compliance silver bullet, but it reduces the risk surface for vector exfiltration. You should still:
- Encrypt at rest and in transit; manage keys per tenant.
- Scrub PII before embedding where feasible.
- Build deletion pipelines that propagate to both float caches and PQ stores.
A note on the hype vs. reality
Hacker News is excited about “97% storage reduction with near‑lossless retrieval.” That’s achievable, but only if you treat it like a production feature, not a benchmark party trick. The difference is in your sampling, drift monitors, and a reranker that covers the last mile. Do that, and you’ll wonder why you ever paid to keep 300 GB of floats hot in RAM on every replica.
Key Takeaways
- Asymmetric quantization (IVF‑PQ + ADC) can shrink vector storage by 90–97% with 0.95–0.99 recall@10 of float baselines.
- Expect 2–5× compute savings and a move from RAM‑bound clusters to CPU‑friendly, NVMe‑backed nodes.
- Adopt with a 60–90 day plan: baseline, train OPQ/PQ, shadow dual‑run, and cut over with reranking.
- Watch for distribution drift, undersized IVF, and model churn; fix with stratified sampling, probe tuning, and dual‑index rebuilds.
- Compression unlocks edge/on‑prem options and can reduce data leakage risk relative to raw floats.