
Backups Don’t Fail During Writes: They Fail During Restores

The uncomfortable truth about disaster recovery, testing gaps, and why “we have backups” isn’t a recovery plan.

Jan 3, 2026 · 8 min read

Backups are one of the most comforting graphs in infrastructure: green checkmarks, steady schedules, storage growing right on plan. It’s easy to feel safe because the backup job is running.

The problem is that confidence is earned in the wrong place. Most backup systems are very good at writing data somewhere. The failure happens later — when you need to turn that data back into a working system, under pressure, in less time than you have.

The scenario: the comforting lie

You open your dashboard and see last night’s backup completed. The storage bucket has objects. The vendor says “99.999999999% durability.” The team says “we have backups.”

Then an incident hits: database corruption, region outage, accidental deletion, compromised credentials, ransomware. Suddenly the real question appears:

Can we restore, fast enough, into a clean environment, with the right permissions, and actually resume business?

That’s where most backup strategies quietly fall apart.

Why writes “almost never fail”

Backup writes are engineered to be easy:

  • They are repetitive and automatable.
  • They happen in stable, known environments.
  • They can retry without anyone noticing.
  • They can be “successful” even if the content is incomplete or inconsistent (because success is often defined as “uploaded a file,” not “captured a recoverable point-in-time state”).

Even when a write fails, it’s usually noisy in a benign way: the job errors, the alert fires, someone reruns it, and life moves on.

Restores aren’t like that.

The real restore failure modes

1) The backup exists, but the data is unusable

Common reasons:

  • You backed up the wrong thing (missing WAL/binlogs, missing encryption keys, missing config).
  • The snapshot is inconsistent (crash-consistent when you needed application-consistent).
  • The archive is corrupted, partially uploaded, or truncated.
  • You can’t decrypt it (lost KMS access, rotated keys, wrong key IDs, missing key material).
  • The restore process depends on proprietary metadata or tooling that no longer exists.

“Backup completed” doesn’t mean “restore works.”
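A minimal sketch of what "verify, don't just write" can look like: record a content hash in the backup manifest at write time, then re-verify the artifact and confirm it actually opens before trusting it. Everything here (file names, manifest shape) is illustrative, not any particular tool's format.

```python
import hashlib
import io
import tarfile

def sha256_of(data: bytes) -> str:
    """Content hash recorded in the backup manifest at write time."""
    return hashlib.sha256(data).hexdigest()

def make_backup(payload: dict[str, bytes]) -> tuple[bytes, str]:
    """Create a tar.gz archive in memory and return (archive, checksum)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in payload.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    archive = buf.getvalue()
    return archive, sha256_of(archive)

def verify_backup(archive: bytes, expected_sha256: str) -> bool:
    """Stronger than 'uploaded a file': hash matches AND the archive opens."""
    if sha256_of(archive) != expected_sha256:
        return False
    try:
        with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
            return len(tar.getnames()) > 0  # readable and non-empty
    except tarfile.TarError:
        return False

archive, checksum = make_backup({"db.sql": b"-- dump --", "config.yml": b"key: v"})
ok = verify_backup(archive, checksum)
# A partial upload fails verification even though "an object exists":
truncated_ok = verify_backup(archive[: len(archive) // 2], checksum)
```

Real pipelines would verify in the restore environment, against the stored object, not in the process that just wrote it.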

2) Restore speed is too slow for reality

Your RTO is “hours.” Your restore throughput is “days.”

This happens when:

  • You never measured real end-to-end restore time.
  • Your storage tier is cold (retrieval delays, staged restore jobs).
  • Your egress path is bottlenecked (network, quotas, throttling).
  • Your restore requires rebuilding infrastructure first (clusters, IAM, VPC, DNS, secrets).

Most teams test “can we restore a file,” not “can we restore the business.”
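The arithmetic behind "RTO is hours, throughput is days" is worth writing down explicitly. The numbers below are illustrative placeholders; the point is that transfer time plus infrastructure rebuild time is the real restore clock.

```python
def restore_hours(dataset_gib: float, throughput_mib_s: float,
                  setup_hours: float = 0.0) -> float:
    """End-to-end restore time: infrastructure setup plus data transfer."""
    transfer_s = (dataset_gib * 1024) / throughput_mib_s  # GiB -> MiB, at MiB/s
    return setup_hours + transfer_s / 3600

# Illustrative: an 8 TiB database, 120 MiB/s effective restore throughput,
# and 2 hours of infrastructure rebuild before data even starts flowing.
hours = restore_hours(dataset_gib=8 * 1024, throughput_mib_s=120, setup_hours=2.0)
rto_hours = 4.0
meets_rto = hours <= rto_hours  # roughly 21 hours vs a 4-hour promise
```

The only honest value for `throughput_mib_s` is one you measured in a drill, not the number on the storage spec sheet.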

And when this fails, who pays? Customers pay in downtime, the business pays in revenue and reputation, and operators pay with sleepless nights and postmortems that last longer than the outage.

3) Identity and access failures

In emergencies, access control becomes the primary system.

Failure modes include:

  • The backup account is locked, suspended, or requires MFA you can’t satisfy.
  • The IAM role that used to restore no longer exists.
  • The person who has access is on vacation or left the company.
  • Your KMS policy changed, your org SCP changed, or a conditional policy blocks the restore.
  • Cross-account restore paths are broken (trust policies, external IDs, missing grants).

Backups are data. Restores are permissions.
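One cheap habit that follows from this: run a permissions preflight on the break-glass role before an incident forces the question. The sketch below is deliberately provider-neutral; the action names are hypothetical stand-ins for whatever your restore runbook actually needs.

```python
# Hypothetical preflight: compare the actions the restore runbook needs
# against the actions the break-glass role actually has.
REQUIRED_FOR_RESTORE = {
    "backup:GetObject",
    "kms:Decrypt",
    "compute:CreateInstance",
    "dns:ChangeRecord",
}

def preflight(granted_actions: set[str]) -> set[str]:
    """Return the actions the restore would fail on (empty set = ready)."""
    return REQUIRED_FOR_RESTORE - granted_actions

# The role can fetch and decrypt backups and build compute,
# but nobody granted it the DNS cutover it needs at the end.
missing = preflight({"backup:GetObject", "kms:Decrypt", "compute:CreateInstance"})
```

In practice you'd derive the granted set from a real policy simulation, but even this static diff catches the "the role that used to restore no longer exists" class of failure.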

4) The environment changed

Tooling works until it meets time.

Restores often fail because the restore target isn’t the same world you backed up from:

  • Different region or account (DR account, new org structure).
  • Different instance families, AMIs, base images, or Kubernetes versions.
  • Different network topology (VPC CIDRs, peering, private endpoints).
  • Different storage classes/tiering defaults.
  • Deprecated APIs or behaviors (SDK updates, backup agent changes, provider feature changes).
  • New compliance controls that block “old” restore practices.

If your restore plan depends on “the environment staying the same,” it’s already broken.
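A concrete antidote: record the environment assumptions the restore plan depends on at backup time, then diff them against the restore target before starting. Keys and values below are illustrative.

```python
def environment_drift(assumed: dict[str, str],
                      actual: dict[str, str]) -> dict[str, tuple]:
    """Map of key -> (assumed, actual) for every assumption that no longer holds."""
    drift = {}
    for key in assumed.keys() | actual.keys():
        a, b = assumed.get(key), actual.get(key)
        if a != b:
            drift[key] = (a, b)
    return drift

assumed = {"region": "eu-west-1", "k8s": "1.27", "vpc_cidr": "10.0.0.0/16"}
actual  = {"region": "eu-west-1", "k8s": "1.31", "vpc_cidr": "10.8.0.0/16"}
drift = environment_drift(assumed, actual)  # Kubernetes and CIDR have drifted
```

Anything that shows up in `drift` is a restore step you have not actually rehearsed.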

5) Ransomware changed the restore game

Ransomware isn’t just “we need data back.” It’s:

  • We must assume the environment is compromised.
  • We must restore into a clean room.
  • We must prevent re-infection during restore.
  • We must prove integrity (what’s clean, what’s not).
  • We must rotate credentials, keys, tokens, sometimes while restoring.

The restore process becomes incident response + forensics + infrastructure rebuild + business continuity. A “click restore” fantasy doesn’t survive this.

Tooling creates a gravity well

Backup tooling is rarely neutral. Over time, your backups become tightly coupled to:

  • a specific vendor’s snapshot format
  • a specific agent version and configuration
  • a specific cloud’s IAM/KMS model
  • a specific control plane (regions, accounts, APIs, quotas)

That coupling creates a tooling gravity well: you can keep “writing backups” for years with low friction, but the moment you try to restore somewhere else (different account, different cloud, clean-room, audit sandbox), you discover you didn’t just back up data. You backed up dependencies.

This is why “we can migrate later” often turns into “we can migrate storage, but not recovery,” and why restore drills should include at least one non-default restore path.

The human failure: restores are organizational problems

Restores aren’t a single command. They are a cross-team negotiation under stress:

  • Who decides what to restore first?
  • Who approves the cost (mass retrieval, egress, emergency capacity)?
  • Who owns the runbook and who executes it at 3 a.m.?
  • Who is allowed to change DNS, rotate secrets, or disable security controls temporarily?
  • Who communicates status to leadership and customers?

When restores fail, it’s often because the organization can’t coordinate quickly, not because the storage lost bits.

Why restore testing is expensive (and politically risky)

Restore testing has costs that are hard to justify until after an incident:

  • It consumes compute, bandwidth, and engineering time.
  • It can trigger real bills (retrieval fees, egress, request costs).
  • It can create downtime risk if done carelessly.
  • It can reveal uncomfortable truths:
    • “Our RTO is a lie.”
    • “We don’t know how to rebuild.”
    • “Only one person can do this.”
    • “This will take a week.”

That last part is the political problem. Testing restores creates accountability.

Governance anchor: in regulated environments, restore testing becomes evidence

In regulated environments (or any org living under audits), restore testing isn’t just “a good practice.” It becomes governance evidence that continuity controls exist and actually work.

That changes the tone of the conversation: you’re no longer arguing about “engineering time,” you’re demonstrating operational capability. The painful part is that evidence-based recovery tends to expose gaps because auditors don’t accept “the backup job is green” as proof you can recover the service.

Restores create accountability

A backup job can be “owned” by one team. A restore event exposes every hidden dependency:

  • IAM design
  • network design
  • infra-as-code maturity
  • documentation quality
  • vendor lock-in assumptions
  • data lifecycle policies
  • on-call readiness
  • leadership expectations

A successful restore is a full-stack audit you didn’t schedule. That’s why many organizations unconsciously avoid practicing it.

What mature teams do differently

Mature teams don’t treat backups as a storage problem. They treat recovery as a product with SLAs.

They build systems and habits like:

  • Multiple restore paths (same-region, cross-region, cross-account).
  • “Break glass” access that is audited, tested, and time-bound.
  • Immutable backups (object lock/WORM) and isolated backup credentials.
  • Infrastructure-as-Code rebuilds (so restore doesn’t require “remembering how”).
  • Measured recovery:
    • RPO and RTO are stated, tested, and updated based on reality.
  • Regular restore drills that include the boring parts:
    • DNS cutovers
    • secret rotation
    • app smoke tests
    • data validation

They optimize for confidence under change, not just “data exists somewhere.”

Cold storage makes restores harder (even if it saves money)

Cold storage is great economics for backups you hope to never touch. But it adds friction when you do:

  • retrieval delay (minutes to hours to days)
  • retrieval fees (big restores can be expensive)
  • restore staging workflows (restore job, then download)
  • minimum storage durations and lifecycle complexity
  • more places for “we never tested that” to hide

Cold storage is fine if your restore plan explicitly accounts for it and your drills include it.

This is also why “cheap backup storage” decisions can backfire: providers optimize for durability and write success, not for large-scale restore throughput, egress cost, or cross-account recovery. The cheapest storage tier is often the most expensive choice during an incident.

If you’re thinking about cold storage as a cost optimization, the economics change again once you model restore fees, egress, and retrieval delays.
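A back-of-the-envelope model makes this concrete. All prices and delays below are illustrative placeholders, not any provider's real rates, but the shape of the formula is the point: cold restores pay in dollars and in hours.

```python
def cold_restore(size_tib: float,
                 retrieval_usd_per_gib: float,
                 egress_usd_per_gib: float,
                 retrieval_delay_h: float,
                 throughput_mib_s: float) -> tuple[float, float]:
    """Return (total_cost_usd, total_hours) for a full cold-tier restore."""
    gib = size_tib * 1024
    cost = gib * (retrieval_usd_per_gib + egress_usd_per_gib)
    transfer_h = (gib * 1024 / throughput_mib_s) / 3600  # GiB -> MiB at MiB/s
    return round(cost, 2), round(retrieval_delay_h + transfer_h, 1)

# Illustrative: 10 TiB at $0.01/GiB retrieval + $0.09/GiB egress,
# a 12-hour staging delay, and 200 MiB/s effective download throughput.
cost_usd, total_h = cold_restore(10, 0.01, 0.09, 12, 200)
```

Run your own numbers through this once and "storage is cheap" stops being the whole sentence.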

The checklists nobody wants (but the ones that work)

If you want to move from “we have backups” to “we can recover,” this is the uncomfortable list.

Recovery design checklist

  • Define RPO/RTO per system (not per company).
  • Classify systems by restore priority (revenue-first, compliance-first, internal-first).
  • Decide restore destinations:
    • Same region?
    • Cross-region?
    • Cross-account?
    • Clean-room?
  • Ensure backups include everything needed to restore:
    • data + logs + config + secrets strategy + encryption keys access plan
  • Make restore steps executable from documentation, not memory.
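One way to make "RPO/RTO per system, not per company" real is to write the targets down as data, with a restore priority that decides ordering during an incident. System names and numbers below are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecoveryTarget:
    system: str
    rpo_minutes: int  # max acceptable data loss
    rto_minutes: int  # max acceptable downtime
    priority: int     # 1 = restore first

targets = [
    RecoveryTarget("payments-db", rpo_minutes=5, rto_minutes=60, priority=1),
    RecoveryTarget("audit-logs", rpo_minutes=60, rto_minutes=1440, priority=2),
    RecoveryTarget("internal-wiki", rpo_minutes=1440, rto_minutes=4320, priority=3),
]

# The incident restore order falls out of the data, not out of a 3 a.m. argument.
restore_order = [t.system for t in sorted(targets, key=lambda t: t.priority)]
```

Keeping this table in version control also gives drills a concrete pass/fail line: measured restore time versus the `rto_minutes` you promised.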

Access & security checklist

  • Separate backup credentials from production credentials.
  • Implement and test “break glass” roles:
    • time-limited
    • audited
    • stored securely
  • Verify key access paths (KMS, HSM, passphrases) during drills.
  • Protect backups from deletion/modification (immutability/object lock where appropriate).
  • Plan credential rotation as part of recovery (especially post-ransomware).

Operational testing checklist

  • Run restore drills on a schedule (quarterly is a common starting point).
  • Measure end-to-end time:
    • retrieval + infrastructure rebuild + app readiness + validation
  • Test the “day 2” pieces:
    • DNS cutover
    • background jobs
    • cron/schedulers
    • integrations (email, payments, webhooks)
  • Validate data correctness (not just “service starts”).
  • Capture learnings as changes:
    • update runbooks
    • fix IAM
    • fix automation
    • adjust RTO promises
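Measuring end-to-end time works best when each drill stage is timed separately, so the total maps back to specific checklist items. The stage names and minutes below are illustrative.

```python
def drill_total(stage_minutes: dict[str, float]) -> tuple[float, str]:
    """Return (total_minutes, slowest_stage) for a restore drill."""
    total = sum(stage_minutes.values())
    slowest = max(stage_minutes, key=stage_minutes.get)
    return total, slowest

stages = {
    "retrieval": 95.0,
    "infra_rebuild": 140.0,
    "app_readiness": 35.0,
    "data_validation": 50.0,
}
total_min, bottleneck = drill_total(stages)  # the bottleneck is what you fix first
```

Per-stage numbers are what turn a drill into changes: if `infra_rebuild` dominates, the fix is infrastructure-as-code, not faster storage.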

Tooling & drift checklist

  • Pin and periodically update restore tooling (backup agents, SDKs, images).
  • Verify restores still work after:
    • major platform upgrades
    • region/account changes
    • network refactors
    • lifecycle/tiering changes
  • Maintain a “restore from scratch” path that assumes zero existing infrastructure.
  • Test one “escape the gravity well” restore path:
    • restore into a different account/region, or using an alternate toolchain, so you can see your hidden dependencies before an incident does.

Closing

Backups are necessary, but they’re not the finish line. Recovery is the product — the thing that has to be tested, owned, and measurable. Having a backup job that runs is not the same as having the ability to recover the business.

Backups are optimism. Restores are truth.

Written by the Infra Atlas author

I work on infrastructure and software systems across layers: writing code, shipping products, and dealing with the practical trade-offs of hosting, memory, and network behavior in production. When this site says it covers “layer 3 to layer 9,” it’s half a joke and half a truth: from routing and packets, up through operating systems, applications, and the human decisions that actually cause outages.

Infra Atlas is a collection of field notes from that work. Some pages may include affiliate or referral links as a low-key way to support the site. Think of it as buying me a coffee while I write about why systems behave the way they do.