Expiry by Design: Build Global TTLs for Your SaaS Data Before Regulators Do

By Diogo Hudson Dias

OneDrive just put an expiry date on files. That’s not a feature—it’s an admission: the data you keep is the data you’ll lose. After 1,000+ breaches and growing disclosure lags, hoarding customer data is no longer a neutral decision; it’s a liability. If your SaaS still relies on “soft delete” and unbounded backups, you are one subpoena, one ransomware crew, or one misconfigured bucket away from an avoidable incident.

This post is a CTO playbook to implement expiry by design: global time-to-live (TTL) policies with deletion SLOs, legal holds, and auditable receipts across every store you touch—OLTP, blob, search, analytics, logs, backups, and vendors. You’ll get a 30-60-90 day plan and concrete implementation guidance for Postgres, S3, Elasticsearch, Kafka, BigQuery/Snowflake, and backup systems—plus the trade-offs you need to confront as an engineering leader.

Why expiry by design now

  • Breaches keep coming, disclosure lags worsen. Public tallies show slower, not faster, disclosures. That means more stale data is exposed for longer. The only way to shrink blast radius is to store less, for less time.
  • Vendors are moving first. Microsoft’s OneDrive adding file expiry is a signal: customers now expect retention controls. If your app can’t match, you’ll lose deals to security-savvy competitors.
  • Regulators already wrote the rulebook. GDPR Art. 5(1)(e) (storage limitation) and Art. 17 (erasure), Brazil’s LGPD Arts. 15–16, and state privacy laws all align: keep only what you need, only as long as needed, and prove it.
  • AI increases the penalty for hoarding. Context windows and vector stores spread sensitive text/images everywhere. Even with “Lockdown Modes,” the surest defense is to not have the data in the first place.

The decision framework: five hard choices

Before you write a line of code, you need crisp answers to these:

1) Object classes and default retention

Every byte maps to an object class with a default TTL. Start with a simple taxonomy and adjust later:

  • Operational events/metrics: 7–30 days
  • Product logs/traces: 14–60 days
  • End-user messages/comments: 90–365 days
  • File attachments/uploads: 90–365 days
  • Transactional records (orders, invoices): per contract/tax law (often 3–7 years)
  • Audit/security logs: 365–730 days (or per your SOC 2/ISO 27001 audit window)

Make these per-tenant, plan-gated defaults. SMBs usually accept 90–180 days; enterprise will ask for 7 years and legal hold. Price accordingly.
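
One way to encode the taxonomy is a small policy table that the rest of the system reads. The sketch below is illustrative only: the class keys, the plan names ("smb", "enterprise"), and the specific day counts are assumptions picked from the ranges above, not recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class RetentionRule:
    default_days: int        # TTL applied when the tenant has not chosen one
    max_days_by_plan: dict   # plan name -> longest TTL that plan may select


# Hypothetical values drawn from the ranges listed above.
RULES = {
    "operational_event": RetentionRule(30, {"smb": 30, "enterprise": 30}),
    "product_log": RetentionRule(60, {"smb": 60, "enterprise": 60}),
    "message": RetentionRule(90, {"smb": 180, "enterprise": 365}),
    "attachment": RetentionRule(90, {"smb": 180, "enterprise": 365}),
    "audit_log": RetentionRule(365, {"smb": 365, "enterprise": 730}),
}


def effective_ttl(object_class: str, plan: str,
                  requested_days: Optional[int] = None) -> timedelta:
    """Resolve the TTL for one object class under a tenant's plan."""
    rule = RULES[object_class]
    if requested_days is None:
        return timedelta(days=rule.default_days)
    if requested_days <= 0:
        raise ValueError("requested_days must be positive")
    # Opt-down is always allowed; opt-up is capped by the plan.
    return timedelta(days=min(requested_days, rule.max_days_by_plan[plan]))
```

Transactional records are left out on purpose: their retention comes from contract and tax law, so they likely deserve a separate, reviewed table rather than a tenant-adjustable default.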

2) Deletion SLOs and your “expiry budget”

Commit to an internal SLO for how fast data disappears after it becomes eligible:

  • p95 time-to-hard-delete: under 6 hours
  • p99 time-to-hard-delete: under 24 hours
  • Backups cryptoshredded: under 7 days (or your backup SLA)

Publish these in your DPA. If you can’t meet them yet, set an initial target and instrument the gaps. This becomes your “expiry budget.”
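
To instrument the gap, you need per-object deletion latencies (time from eligibility to confirmed hard delete) and a percentile check against the targets. A minimal sketch, assuming latencies arrive as timedelta samples and using the nearest-rank percentile definition (your metrics stack may interpolate differently):

```python
from datetime import timedelta

# Targets from the SLO list above.
SLO_TARGETS = {95: timedelta(hours=6), 99: timedelta(hours=24)}


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def expiry_budget_report(latencies):
    """Compare observed p95/p99 time-to-hard-delete with the SLO targets."""
    report = {}
    for pct, target in SLO_TARGETS.items():
        observed = percentile(latencies, pct)
        report[f"p{pct}"] = {
            "observed": observed,
            "target": target,
            "met": observed < target,
            # How far over target we are; zero when the SLO is met.
            "overrun": max(observed - target, timedelta(0)),
        }
    return report
```

The "overrun" field is the number to track week over week while you work toward the published targets.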

3) Source of truth: the Tombstone Ledger

Policy without a system-of-record becomes drift. Create an append-only Tombstone Ledger that contains object_id, object_type, tenant_id, created_at, expiry_at, legal_hold_flag, reason_code. Deletion workers across all stores subscribe to this ledger and act idempotently. The ledger is your audit trail and your receipt generator.
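
A minimal in-memory sketch of the ledger record and its append-only behavior follows. The field names come from the list above; the SHA-256 hash chain is an added assumption to make tampering detectable, and it is not a substitute for real signatures or a durable store. Because the ledger is append-only, a state change such as setting a legal hold is a new record for the same object_id, and the latest record wins.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Tombstone:
    object_id: str
    object_type: str
    tenant_id: str
    created_at: str       # ISO 8601 UTC timestamps kept as strings for hashing
    expiry_at: str
    legal_hold_flag: bool
    reason_code: str


class TombstoneLedger:
    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []  # list of (tombstone, chain_hash); never updated in place

    @staticmethod
    def _digest(prev_hash, tombstone):
        payload = json.dumps(asdict(tombstone), sort_keys=True)
        return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

    def append(self, tombstone):
        prev = self._entries[-1][1] if self._entries else self.GENESIS
        digest = self._digest(prev, tombstone)
        self._entries.append((tombstone, digest))
        return digest

    def latest(self, object_id):
        """Most recent record for an object, or None if it was never recorded."""
        for tombstone, _ in reversed(self._entries):
            if tombstone.object_id == object_id:
                return tombstone
        return None

    def verify(self):
        """Recompute the chain; any edited or reordered record breaks it."""
        prev = self.GENESIS
        for tombstone, digest in self._entries:
            if self._digest(prev, tombstone) != digest:
                return False
            prev = digest
        return True
```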

4) Legal holds and exceptions

Engineers hate exceptions; regulators love them. You need a legal_hold_flag that stops expiry everywhere. Tie it to your ticketing/GC workflow so every hold has an owner and a reason. Holds must be time-bounded and visible in product UI. Every exception is a cost; track them.
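
The eligibility check every deletion worker runs can be sketched as below. The LegalHold shape, including the ticket_ref field, is hypothetical; the point is that a hold carries its own expiry, so a forgotten hold stops blocking deletion once it lapses.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LegalHold:
    tenant_id: str
    object_type: str
    reason: str
    ticket_ref: str        # hypothetical pointer into the ticketing/GC workflow
    expires_at: datetime   # holds are time-bounded by construction


def eligible_for_deletion(tenant_id, object_type, expiry_at, holds, now):
    """True only when the TTL has passed and no unexpired hold covers the object."""
    if now < expiry_at:
        return False
    for hold in holds:
        covers = hold.tenant_id == tenant_id and hold.object_type == object_type
        if covers and now < hold.expires_at:
            return False
    return True
```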

5) Customer guarantees

Expose retention settings per object class in-app. Show upcoming expiry counts, allow opt-down (shorter TTL) and plan-gated opt-up (longer TTL). Send deletion receipts with object counts, p95/p99 times, and coverage across stores. This is now a competitive feature in RFPs.

The architecture: turn policy into code

Implement a simple, durable pattern:

  • Retention Policy Service: Evaluates policies and writes (object_id, expiry_at) to the Tombstone Ledger when data is created or updated. Re-evaluates on access if you use last_accessed TTLs.
  • Tombstone Ledger: Append-only table (e.g., Postgres partitioned by month) or event stream (Kafka). Immutable, signed records.
  • Deletion Orchestrator: Consumes tombstones, fans out to store-specific workers, tracks per-store SLOs, and emits receipts.
  • Store Workers: Postgres, S3, Elasticsearch, Kafka, BigQuery/Snowflake, CDN, and Backup workers. Idempotent, retryable, with dead-letter queues.
  • Metrics + Audits: Deletion backlog length, p95/p99 time-to-delete by store, % coverage by object class, legal hold counts, and random sampling audits (prove absence).
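
The orchestrator's fan-out can be sketched in a few lines. This is a simplified, synchronous illustration with in-memory stand-ins for the store workers; a real deployment would likely consume from a queue and retry with backoff. The class and function names are made up for this example.

```python
class InMemoryStoreWorker:
    """Stand-in for a store-specific worker; fail_times simulates transient errors."""

    def __init__(self, name, fail_times=0):
        self.name = name
        self.objects = set()
        self._fail_times = fail_times

    def delete(self, object_id):
        if self._fail_times > 0:
            self._fail_times -= 1
            raise RuntimeError(f"{self.name}: transient failure")
        # discard() makes the delete idempotent: deleting twice is a no-op.
        self.objects.discard(object_id)


def fan_out(object_id, workers, max_attempts=3):
    """Delete one object from every store; return a per-store receipt and dead letters."""
    receipt, dead_letters = {}, []
    for worker in workers:
        for attempt in range(1, max_attempts + 1):
            try:
                worker.delete(object_id)
                receipt[worker.name] = "deleted"
                break
            except RuntimeError:
                if attempt == max_attempts:
                    receipt[worker.name] = "dead_letter"
                    dead_letters.append((worker.name, object_id))
    return receipt, dead_letters
```

Because each worker's delete is idempotent, replaying a tombstone after a partial failure is safe, which is what lets the dead-letter queue be drained by simply re-running fan_out.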

Implementation, store by store

Postgres (OLTP)

  • Schema: Add expiry_at and tombstoned_at to expirable tables. Avoid cascades doing surprise deletes; explicitly delete children first.
  • Partitioning: Partition by expiry_at or created_at. Expiry by partition drop is O(1) and avoids vacuum storms. If partitioning by expiry is hard, maintain a separate partitioned “shadow” table of IDs to delete.
  • Worker: Batch deletes in small chunks (e.g., 1–5k rows) to keep locks short. Use DELETE ... USING join on a temp table of eligible IDs. Monitor bloat; tune autovacuum to run more aggressively on hot tables.
  • Foreign keys: If you must keep parents longer than children, make sure the reference points from child to parent. Children can then be removed on their own schedule without tripping ON DELETE RESTRICT errors on the parent.
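
The chunked, children-first delete loop looks roughly like this. To keep the sketch runnable without a database server it uses Python's built-in sqlite3 as a stand-in, so it relies on IN (...) lists rather than the Postgres-specific DELETE ... USING form; the table and column names (messages, attachments, legal_hold) are hypothetical. The default batch size is smaller than the 1–5k suggested above only because older SQLite builds cap bound parameters at 999.

```python
import sqlite3


def delete_expired_in_batches(conn, now_iso, batch_size=500):
    """Hard-delete expired, un-held messages and their attachments in small chunks."""
    total = 0
    while True:
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM messages "
            "WHERE expiry_at <= ? AND legal_hold = 0 LIMIT ?",
            (now_iso, batch_size),
        )]
        if not ids:
            return total
        marks = ",".join("?" * len(ids))
        # Children first, explicitly, so no cascade does a surprise delete.
        conn.execute(f"DELETE FROM attachments WHERE message_id IN ({marks})", ids)
        conn.execute(f"DELETE FROM messages WHERE id IN ({marks})", ids)
        conn.commit()  # one short transaction per chunk keeps locks brief
        total += len(ids)
```

Timestamps are compared as ISO 8601 strings here, which works only if every value uses the same format and timezone; in Postgres you would use timestamptz columns instead.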

Object storage (S3/GCS)

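  • Lifecycle: Use S3 Lifecycle rules to expire objects. Rules match by prefix or tag and expire by object age (or a fixed date), not by a per-object expiry_at, so lay out prefixes or tags per object class and TTL, and let the S3 worker delete anything with a custom expiry. If you use versioning, set NoncurrentVersionExpiration too, and clean up expired delete markers periodically.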
  • Multipart uploads: Expire incomplete multipart uploads (common leak). Enable AbortIncompleteMultipartUpload at 7 days.
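  • Encryption: Per-tenant KMS keys let you cryptoshred a tenant's backups by deleting the key. AWS KMS enforces a 7–30 day waiting period before a scheduled key deletion completes (the key is unusable while pending), so count that window against your backup SLO. Rotate keys annually at minimum.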
  • CDN: Purge on delete. Set Cache-Control max-age to be below your shortest TTL unless content is immutable.
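
As a sketch of the lifecycle bullets, the function below builds a configuration dictionary in the shape boto3's put_bucket_lifecycle_configuration expects. The prefix layout and rule IDs are assumptions. The multipart cleanup lives in its own prefix-scoped rule because S3 does not allow AbortIncompleteMultipartUpload in a rule that filters by tag, and keeping it separate avoids surprises if you later switch the expiry rules to tags.

```python
def lifecycle_configuration(prefix_ttl_days, noncurrent_days=7, abort_multipart_days=7):
    """Build an S3 lifecycle configuration: one expiry rule per prefix plus multipart cleanup."""
    rules = []
    for prefix, days in sorted(prefix_ttl_days.items()):
        rules.append({
            "ID": "expire-" + prefix.strip("/").replace("/", "-"),
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Expiration": {"Days": days},
            # Only relevant on versioned buckets: old versions must expire too.
            "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
        })
    rules.append({
        "ID": "abort-incomplete-multipart",
        "Filter": {"Prefix": ""},  # empty prefix applies to the whole bucket
        "Status": "Enabled",
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": abort_multipart_days},
    })
    return {"Rules": rules}


# Applying it (not executed here; requires boto3, credentials, and a real bucket):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=lifecycle_configuration({...}))
```

Lifecycle expiration is asynchronous, so measure the actual lag in your own buckets before counting it against a p95 or p99 target.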

Search (Elasticsearch/OpenSearch)

  • ILM (Index Lifecycle Management): Roll over by size/time, then delete at TTL. For mixed-tenant indices, partition by time to make delete cheap.
  • Deletes: Hard-delete docs by ID from the Tombstone feed to avoid ghost hits during the warm phase.
  • Pitfall: Snapshots retain deleted segments; align snapshot retention to your backup SLO.
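
A sketch of an Elasticsearch ILM policy body that rolls over daily and deletes at TTL. One detail worth encoding: after rollover, the delete phase's min_age is measured from the rollover time, so the sketch subtracts the rollover window to keep the oldest document in an index from outliving the TTL. The side effect is that the newest documents in each index are deleted up to one rollover window early. The function name and defaults are assumptions, and OpenSearch uses a different policy format (ISM).

```python
def ilm_policy(ttl_days, rollover_days=1, max_primary_shard_size="50gb"):
    """Build an ILM policy body: roll over by age or size, delete before TTL is exceeded."""
    if ttl_days <= rollover_days:
        raise ValueError("ttl_days must exceed rollover_days")
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": f"{rollover_days}d",
                            "max_primary_shard_size": max_primary_shard_size,
                        }
                    }
                },
                "delete": {
                    # Counted from rollover, so subtract the rollover window.
                    "min_age": f"{ttl_days - rollover_days}d",
                    "actions": {"delete": {}},
                },
            }
        }
    }
```

The returned dictionary is intended as the JSON body of a PUT to _ilm/policy/<policy-name>.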

Event/logs (Kafka/Pulsar)

  • Retention: Set retention.ms per topic. For user-generated payload topics, prefer log compaction (so a null-value tombstone record can erase a key) plus short retention, or encrypt at the producer with per-tenant keys.
  • DLQs: Dead-letter queues must inherit the shortest applicable TTL; they’re a common dark-data sink.
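
The DLQ rule is easy to enforce mechanically if topic configs are generated rather than hand-edited. A sketch, with hypothetical topic names, that derives retention.ms per topic and gives each dead-letter queue the shortest TTL of the topics that feed it:

```python
def retention_ms(days):
    """Convert a TTL in days to the millisecond value Kafka's retention.ms expects."""
    return days * 24 * 60 * 60 * 1000


def topic_configs(topic_ttl_days, dlq_sources):
    """Return per-topic config overrides; DLQs inherit the shortest source TTL."""
    configs = {
        topic: {"retention.ms": str(retention_ms(days)), "cleanup.policy": "delete"}
        for topic, days in topic_ttl_days.items()
    }
    for dlq, sources in dlq_sources.items():
        shortest = min(topic_ttl_days[source] for source in sources)
        configs[dlq] = {
            "retention.ms": str(retention_ms(shortest)),
            "cleanup.policy": "delete",
        }
    return configs
```

The output maps directly onto topic-level config overrides, which you can apply with whatever admin tooling you already use.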

Analytics (BigQuery/Snowflake)

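  • Partitioning: Partition tables by event_date and cluster by tenant_id. Use partition expiration (BigQuery) or scheduled DELETE tasks against a date-clustered table (Snowflake, which has no user-managed partitions to drop) to enforce expiry. In Snowflake, deleted rows stay recoverable through Time Travel and Fail-safe, so count those windows in your deletion SLO.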
  • Materialized views: Recompute over current partitions only. Backfill jobs should not resurrect expired data.
  • Model training sets: Track provenance. If your TTL invalidates a training cohort, log and retrain. Better than explaining to counsel why you used data you promised to delete.
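
Storage-level expiry does not stop a backfill from re-reading a raw source and rewriting old dates. A small guard in the job scheduler can help; this sketch assumes daily partitions keyed by event_date and treats a partition that is exactly ttl_days old as already expired.

```python
from datetime import date, timedelta


def live_partitions(requested_dates, ttl_days, today):
    """Drop any requested partition that is at or past the TTL cutoff."""
    cutoff = today - timedelta(days=ttl_days)
    return [d for d in requested_dates if d > cutoff]
```

A backfill or materialized-view refresh would pass its requested date range through this filter and refuse to write anything outside the result.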

Backups

  • Tiered retention: Hot backups (7–14 days), warm (30 days), no cold unless contractually required. Shorten by default.
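  • Cryptoshredding: Encrypt per-tenant or per-bucket, and remember that shredding happens at key granularity: revoking a key destroys everything encrypted under it. When the last object under a key expires (or the tenant offboards), mark that key for revocation at the backup SLO boundary.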
  • Restore discipline: Restores should not re-introduce expired data. On restore, run a post-restore expiry job before the system goes live.
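
The post-restore job is a replay of the ledger against whatever the backup brought back. A minimal sketch with in-memory stand-ins (the data shapes are assumptions); it only works if the Tombstone Ledger is kept outside the backup being restored, or is itself restored to its latest state first.

```python
def post_restore_expiry(restored_rows, tombstones, now):
    """Remove restored objects whose TTL has already passed and that are not on hold.

    restored_rows: dict of object_id -> row, mutated in place
    tombstones:    dict of object_id -> (expiry_at, legal_hold_flag)
    Returns the sorted list of purged object IDs for the deletion receipt.
    """
    purged = []
    for object_id, (expiry_at, legal_hold_flag) in tombstones.items():
        if object_id in restored_rows and expiry_at <= now and not legal_hold_flag:
            del restored_rows[object_id]
            purged.append(object_id)
    return sorted(purged)
```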

Verification: don’t just delete—prove absence

  • Receipts: For each tenant, emit monthly receipts: objects expired, p95/p99 deletion times, stores covered, exceptions (holds). This wins RFPs.
  • Random audits: Sample 100 expired IDs monthly and attempt to fetch from each store and analytics system. Zero should return. Alert on any hit.
  • Vendor attestations: For critical vendors (e.g., observability, support SaaS, LLM providers with file storage), obtain SOC/ISO reports that cover deletion controls; add DPA language requiring TTLs and proof within 10 business days.
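
The random audit can be sketched as a sampler plus one existence probe per store. The probe callables here are placeholders you would back with a real lookup (a primary-key SELECT, an S3 HEAD request, a search by ID, and so on).

```python
import random


def audit_absence(expired_ids, probes, sample_size=100, seed=None):
    """Sample expired IDs and check that no store can still return them.

    probes: dict of store name -> callable(object_id) returning True if found
    """
    ids = list(expired_ids)
    rng = random.Random(seed)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    hits = [
        (store, object_id)
        for object_id in sample
        for store, exists in probes.items()
        if exists(object_id)
    ]
    return {"sampled": len(sample), "hits": hits, "passed": not hits}
```

Any entry in "hits" is a deletion failure and should alert; the "sampled" count belongs in the monthly receipt so customers can see the audit actually ran.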

What it saves: cost and blast radius

Don’t frame expiry as only compliance. It’s also dollars and uptime.

  • S3 cost example: A mid-stage SaaS with 400 TB on S3 Standard pays roughly 400,000 GB × $0.023 = $9,200/month. If 60% of that footprint is attachments older than 180 days that are never re-accessed (typical), a 180-day TTL plus deletes removes ~240 TB, saving ~$5,520/month before request and replication savings.
  • Search and analytics: Smaller indices and partitions mean fewer nodes, faster queries, and cheaper clustering. It’s not unusual to see 20–40% infra reduction after aggressive TTLs.
  • Incident blast radius: If an attacker lands at T0, everything older than your TTL doesn’t exist to be exfiltrated. That’s the only guaranteed “mitigation.”
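
To rerun the S3 arithmetic with your own numbers, a small calculator is enough. It assumes a flat $0.023 per GB-month and decimal units (1 TB = 1,000 GB), as in the example above; real bills use tiered pricing and vary by region.

```python
from decimal import Decimal

USD_PER_GB_MONTH = Decimal("0.023")  # assumed flat rate from the example above


def monthly_cost_usd(terabytes):
    """Monthly storage cost for a footprint given in decimal terabytes."""
    return Decimal(terabytes) * 1000 * USD_PER_GB_MONTH


def ttl_savings_usd(total_tb, expirable_fraction):
    """Monthly savings if the given fraction of the footprint is expired by TTL.

    expirable_fraction is passed as a string such as "0.60" to avoid float rounding.
    """
    removed_tb = Decimal(total_tb) * Decimal(expirable_fraction)
    return monthly_cost_usd(removed_tb)
```

With the example's inputs, monthly_cost_usd(400) gives 9,200 and ttl_savings_usd(400, "0.60") gives 5,520, matching the figures in the bullet.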

Trade-offs you must own

  • Product expectations: Customers who love “reopen a ticket from 2018” will complain. Offer exports and plan-gated extended retention, but don’t backslide to immortality.
  • ML utility vs. privacy: Short TTLs reduce historical training data. Counter with sampling, synthetic augmentation, or rolling windows.
  • Complexity: Cross-store deletion orchestration is real engineering work. But it’s bounded and testable—unlike the liability of keeping everything forever.

30-60-90 days: ship it without boiling the ocean

Days 0–30: Inventory and primitives

  • Inventory: Enumerate object classes and stores (OLTP, blob, search, analytics, logs, backups, CDN, LLM/vector stores, support tools). Expect 12–20 material sinks in a typical SaaS.
  • Defaults: Pick tenant-facing TTL defaults (e.g., 90 days messages/files, 365 days audit, 30 days logs). Write them into your DPA.
  • Tombstone Ledger: Build it in Postgres, partitioned monthly. Add writing hooks in code paths that create/modify data.
  • Legal hold: Create API and UI to set/clear holds per tenant/object class, with expiration and reasons.
  • Implement S3 + Postgres: Ship lifecycle rules, Postgres batch deletion worker, and initial receipts. Set internal SLOs (p95: 6h; p99: 24h).

Days 31–60: Extend and instrument

  • Search + Analytics: Apply ILM to indices. Partition analytics tables by date, add TTL scripts. Block queries that span expired partitions unless explicitly allowed.
  • Logs/Event buses: Align Kafka/Pulsar retention. Ensure DLQs inherit TTLs. Stop shipping PII into logs without a defined TTL.
  • Metrics + Alerts: Dashboards for deletion backlog, time-to-delete per store, legal hold counts, and audit failures. Page on p99 breaches.
  • Customer UI: Expose retention controls per object class and monthly receipts. Train Support to answer “where did my 2-year-old file go?”

Days 61–90: Close the loop

  • Backups: Shorten retention. Implement cryptoshred by per-tenant keys. Validate restore pipelines do not resurrect expired data.
  • Vendors: Update DPAs to require TTLs and attestations. Swap vendors that won’t play ball.
  • Chaos deletion: Quarterly exercise: pick a tenant, simulate a mass expiry event, and verify deletion end-to-end within SLOs.
  • Roadmap: Add per-object-class last_accessed tracking so you can move from fixed TTLs to activity-based TTLs where appropriate.

Common pitfalls (and how to dodge them)

  • Soft delete is not delete. If the row lives and indices reference it, users can still infer it (counts, search hits). Soft delete only as a transition state; schedule hard delete soon after.
  • Ghost copies in caches. CDN and application caches often outlive data. Tie cache TTLs to your shortest data TTL and purge on tombstones.
  • Search reindex resurrects data. Reindex jobs that read old snapshots will bring back expired docs. Scope reindex to live partitions only.
  • Unbounded exports. CSV exports sitting in S3 “exports/” buckets become a new dark-data lake. Make exports expirable objects with short TTLs (7–30 days).
  • Observability sprawl. Your APM and log vendors often store PII by accident. Scrub at source, mask aggressively, and enforce vendor TTLs in contracts.
  • LLM/vector stores. If you embed user content for RAG, the vector DB is a new copy. Make vectors first-class objects with TTLs and deletion tied to the original’s tombstone.

Tie-in with today’s AI risk posture

Even top model vendors are rolling out “Lockdown Modes” to fence off data and reduce prompt injection blast radius. That’s necessary, but not sufficient. A prompt can’t leak data you already deleted. Expiry-by-design is the cheapest, most reliable guardrail for AI features because it shrinks the context that could ever be exposed.

What good looks like in production

  • Coverage: 100% of object classes mapped to TTLs, with legal hold support.
  • Latency: p95 time-to-hard-delete under 6 hours across every store, auditable.
  • Receipts: Tenants can pull deletion receipts and export a policy report for auditors in one click.
  • Backups: Cryptoshred within 7 days; restores do not reintroduce expired data.
  • Vendors: DPAs with TTL clauses and annual attestations; red flags escalated to procurement.

How to staff it

You don’t need a new platform. You need a small, senior pod—1 staff backend, 1 infra/SRE, 1 data engineer, 0.5 product/GC—to own the ledger, workers, and vendor alignment. For US teams, a nearshore pod in Brazil gives you 6–8 hours overlap, and, in our experience, 20–30% lower TCO for this kind of plumbing-heavy, cross-system work. The key is giving them the authority to say “no” to immortal features.

Bottom line

OneDrive’s expiry toggle and the never-ending breach ticker are telling you the same thing: deletion is a product requirement now. You can either codify it with a ledger, SLOs, and receipts—or keep paying compounding interest on data you don’t need. Ship expiry by design in a quarter and make the next incident meaningfully smaller before it starts.

Key Takeaways

  • Define default TTLs per object class, with plan-gated extensions and legal holds.
  • Implement a Tombstone Ledger and deletion orchestrator; measure p95/p99 time-to-delete.
  • Enforce expiry across Postgres, S3, search, analytics, logs, backups, CDN, and vendors.
  • Prove absence: send deletion receipts, run random audits, and require vendor attestations.
  • Expect trade-offs: some features and ML utility will shrink; your risk and costs will too.
