Vivian Voss

The Backup That Wasn't

postgresql devops architecture unix

Tales from the Bare Metal ■ Episode 01

« Thou shalt not trust a backup thou hast not restored! »

At half past eleven on the night of Tuesday, 31 January 2017, an engineer at GitLab.com typed rm -rf on what they believed was the secondary PostgreSQL database. The terminals on their screen were visually identical, save for the hostname in the prompt. Two seconds later, when they realised the prompt did not say what they thought it said, they killed the command. By that point, three hundred gigabytes of production data had been removed.

That was the easy part. The hard part came over the next eighteen hours, as the team discovered, in a sequence that has since become teaching material, that none of their five backup mechanisms had been working.

This is an exhaustively documented incident. GitLab's response, by industry standards, was extraordinary. They live-streamed the recovery on YouTube. They published their internal chat logs. They wrote a postmortem so detailed and so honest that it remains, nearly a decade later, one of the most widely cited operational documents in software engineering. The point of revisiting it now is not the story; the story is well-known. The point is the pattern.

What Happened, in Sequence

The day had not been routine. From around 19:00 UTC, GitLab's primary database had been under unusual load, suspected at the time to be coordinated spam-account creation. The on-call engineers had been working through the load issue for hours. By 23:00 UTC, replication between the primary and the secondary had stalled: the secondary's WAL receiver could not keep up, and the secondary fell sufficiently far behind that recovery would require re-seeding it.

The engineer in question had earlier that day created an LVM snapshot of the production database, intending to use it to set up a staging instance for testing pgpool-II. That snapshot, taken around 17:20 UTC, was a side effect of unrelated work; it had nothing to do with backups, and was scoped for staging.

At around 23:30 UTC, while attempting to clean up the secondary's data directory in preparation for re-seeding, the engineer ran the cleanup command on the wrong host. The two terminals were visually identical; the only difference was the hostname in the prompt, and they had been working across both hosts for hours. They killed the command within seconds, but approximately 300 GB had already been removed from the primary. Affected: roughly 5,000 projects, 5,000 comments, and 700 new user accounts, all created or changed between 17:20 and 23:30 UTC.

Service was taken offline at 23:30. Recovery began immediately. The team turned to their backups in sequence. Each one, in turn, did not work.

Eighteen Hours, in Order

  • 17:20 UTC: staging LVM snapshot taken, a side effect of unrelated pgpool work
  • 19:00 UTC: unusual load on the primary, suspected spam-account stress
  • 23:00 UTC: replication stalls; the secondary needs re-seeding
  • 23:30 UTC: rm -rf on the primary; 300 GB gone in two seconds
  • 17:00 UTC, one day later: service restored, live on YouTube

The staging snapshot was copied back at roughly 60 Mbps for eighteen hours. The result: a six-hour window of unrecoverable user data, and a public restoration broadcast.

The Five Mechanisms, in Order

pg_dump backups to S3. GitLab's primary off-site backup mechanism was a wrapper script that took daily pg_dump exports and uploaded them to Amazon S3. The script had been working for a long time. It stopped working silently after a PostgreSQL upgrade: the wrapper script invoked an older pg_dump binary which produced an error against the upgraded server, but the script swallowed the error and produced empty output files. The S3 bucket was full of zero-byte files going back several months.
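A minimal guard against exactly this failure mode is to treat every dump as broken until it passes a structural check, rather than trusting a clean exit code. A sketch, with illustrative names and a check aimed at plain-format dumps:

```shell
#!/usr/bin/env bash
# Sketch (names and paths are illustrative): refuse to upload a dump
# that has not passed a structural check.
set -euo pipefail

verify_dump() {
    local dump="$1"
    # A zero-byte file is exactly what GitLab's wrapper uploaded for months.
    [ -s "$dump" ] || { echo "FAIL: $dump is empty"; return 1; }
    # A plain-format pg_dump ends with this trailer; a truncated run does not.
    tail -n 5 "$dump" | grep -q "PostgreSQL database dump complete" ||
        { echo "FAIL: $dump is missing its completion trailer"; return 1; }
    echo "OK: $dump ($(wc -c < "$dump") bytes)"
}
```

Only a dump that prints OK gets uploaded. For custom-format dumps, running pg_restore --list against the file is the equivalent structural check.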

Email alerts for backup failures. The pg_dump script sent failure-notification emails when something went wrong. Those emails would have caught the upgrade-mismatch problem on day one. They did not arrive. A change to the email infrastructure (DMARC) elsewhere in the organisation, made for unrelated reasons, caused the alert emails to be silently rejected at the receiving end. They had been rejected for months. No one noticed because the absence of failure emails was the same signal as success.

LVM snapshots. GitLab had no scheduled LVM snapshot strategy for production databases. The single snapshot that existed on 31 January was the one the engineer had taken six hours earlier for the unrelated staging-pgpool work. By coincidence, this snapshot was the most recent operational backup of the database that existed anywhere in the organisation.

Azure disk snapshots. The cloud platform on which GitLab.com was hosted at the time offered automated disk snapshots. They had not been enabled for the database servers. The decision was deliberate: cost considerations, plus a stated intention to rely on PostgreSQL replication and pg_dump. Restoring from an Azure snapshot, when investigated during recovery, was estimated to take days rather than hours.

WAL archiving. PostgreSQL supports continuous archiving of write-ahead log segments, which would have allowed point-in-time recovery to any moment within the archive window. WAL archiving had never been configured.
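The configuration that was never made is small. A sketch of continuous archiving in postgresql.conf, with an illustrative archive path:

```ini
# postgresql.conf -- continuous WAL archiving (archive path is illustrative)
wal_level = replica
archive_mode = on               # requires a server restart to take effect
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
```

PostgreSQL substitutes %p with the path of the completed WAL segment and %f with its file name. Combined with a base backup, the archive permits recovery to any point inside the retained window.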

Recovery used the LVM snapshot. Copying the data from the staging host back to production took roughly eighteen hours, bottlenecked by the network storage's 60 Mbps throughput. By 17:00 UTC on 1 February, GitLab.com was back online, missing the data that had changed between 17:20 and 23:30 UTC the previous day.

Five Mechanisms, Five Different Silent Failures

  • pg_dump to S3: the wrapper invoked the old pg_dump after the upgrade; months of zero-byte files, no error caught.
  • Email failure alerts: a DMARC change rejected the alerts; silence misread as success for months.
  • LVM snapshots: no schedule for production; a single one existed by coincidence.
  • Azure disk snapshots: never enabled on database hosts; a deliberate cost-trade decision.
  • WAL archiving: never configured; point-in-time recovery unavailable.

Five mechanisms felt like resilience. None had been exercised end-to-end against the same restore scenario.

The Context, in Fairness to the Build

Each of these five mechanisms had been a reasonable design at the time it was built. pg_dump-to-S3 worked correctly the day it was deployed. LVM snapshots had a clear scope (staging) that was honoured. Azure snapshots were a deliberate cost-trade decision with documented reasoning. WAL archiving was on the long roadmap. The DMARC change, which silently severed the alerting chain, was made by another team, in another part of the organisation, for a reason that had nothing to do with PostgreSQL.

Three systemic conditions made the outcome likely.

First: redundancy of mechanism is not redundancy of recovery. Five backup mechanisms feel like resilience. In practice, none of them had ever been exercised end-to-end against the specific scenario "the primary is gone; restore from the most recent good source within an hour". The drills that would have surfaced each mechanism's silent failure were not part of routine operations.

Second: the absence of a failure signal is not the same as success. The team's confidence in pg_dump rested on the steady absence of failure emails. That signal had been broken for months. Monitoring that depends on negative signals (silence as good news) is a class of monitoring that fails open. The fix is positive monitoring: a periodic heartbeat that says "this thing ran today, here is its output, here is its size, here is its checksum".
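One sketch of such a heartbeat, assuming the channel is a log file that the monitoring side watches (in production it would be a chat webhook or a dead-man's-switch service; names are hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical positive heartbeat: every successful run MUST leave a
# dated, sized, checksummed record -- so silence unambiguously means failure.
set -euo pipefail

heartbeat() {
    local artifact="$1" channel="$2"
    # Record when it ran, how big the output was, and its hash.
    printf '%s ran size=%s sha256=%s artifact=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "$(wc -c < "$artifact")" \
        "$(sha256sum "$artifact" | cut -d' ' -f1)" \
        "$artifact" >> "$channel"
}
```

The monitor then pages when the newest entry is older than the schedule allows, inverting the failure mode: a broken chain produces noise, not silence.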

Third: the path between backup and restore had no single owner. Backup configuration sat with one team; restore procedures sat with another; the email infrastructure sat with a third. No one team owned the integration test that would have walked through the entire path on a regular cadence. The handoffs were the gap.

These are not excuses. The wrong directory was deleted, by an engineer who knew which directory they intended to delete, on a host whose hostname they could read. The error happened. But the consequences of the error (six hours of unrecoverable data, eighteen hours of downtime, public restoration on YouTube) were architectural, not behavioural. A different architecture would have absorbed the same error in minutes.

The Principle

Backups are not backups until they have been restored.

This is older than every database. The unixoid expression of it is small enough to fit in a few lines of shell:

#!/usr/bin/env bash
# /etc/cron.weekly/restore-test
set -euo pipefail

backup=$(latest_backup)                       # find the most recent backup
restore_to_sandbox "$backup"                  # restore to a throwaway env
verify_checksum "$backup" sandbox_env         # compare row counts or checksums
report_success "restore-test passed: $backup" # broadcast success on slack/email/wiki

# anything failing aborts via set -e and triggers an on-call page.

The shape is what matters. The signal is positive: the cron job is required to broadcast success, every week. If a week passes with no success message, on-call is paged. Silence is treated as failure, not as the absence of failure.
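The monitoring side of that contract is equally small. A sketch, assuming the weekly drill touches a stamp file on success (GNU stat; names are hypothetical):

```shell
#!/usr/bin/env bash
# Sketch of the consumer side: page when the heartbeat stamp goes stale.
set -euo pipefail

check_heartbeat() {
    local stamp="$1" max_age="$2"   # max_age in seconds; a week is 604800
    local now last
    now=$(date +%s)
    # mtime of the stamp file; treat a missing file as "never ran".
    last=$(stat -c %Y "$stamp" 2>/dev/null || echo 0)
    if [ $((now - last)) -gt "$max_age" ]; then
        echo "PAGE: last restore-test heartbeat older than ${max_age}s"
        return 1
    fi
    echo "OK: heartbeat is fresh"
}
```

The crucial property: a stamp that was never written pages exactly as loudly as one that went stale.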

The further structural change is to put backup configuration and restore verification under one team's ownership. The path from "backup runs nightly" to "we just restored last week's backup and it worked" has to belong to one group of humans who own the whole sequence.

Negative Silence vs Positive Heartbeat

  • Failure mode (GitLab, 2017): the backup script runs; no error email arrives; silence is read as success; the chain stays broken for months.
  • Heartbeat (the unixoid fix): the restore test runs weekly; a success heartbeat is broadcast; a missing heartbeat triggers a page; silence is failure, not success.

Where the Pattern Travels

The principle does not depend on PostgreSQL. It does not depend on Unix. It applies anywhere data is being protected against loss.

Cloud-managed databases. AWS RDS, Google Cloud SQL, Azure Database for PostgreSQL all offer automated snapshots. The snapshot is taken; the question is whether one has ever restored one to a sandbox database and confirmed that the application can connect to it and read every table. If one has not, one is guessing.

Kubernetes StatefulSets with PVC snapshots. PVC snapshots are convenient. They are also untested by default. The drill is identical: weekly, restore the snapshot to a sandbox cluster, run the application's startup health checks against it.

Object storage. S3 versioning, Backblaze B2, Cloudflare R2. These are good systems. They protect against accidental deletion if and only if one has, at some point, recovered an object from them by version, with the production-side application code, and verified it was correct. Otherwise it is faith.

Git-LFS, large media stores, vendored binary archives. The same logic. The cost of weekly verification is small. The cost of discovering, mid-incident, that the chain has been broken for six months is everything.

Filesystem and host backups. The classic case. Backup tapes that have been rotated for years, never read. The drill is to read one back, mount it, diff against a recent snapshot, and confirm the bytes are present.
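For the filesystem case the drill can be rehearsed with nothing more than tar and diff. A sketch, with tar standing in for the tape drive and illustrative helper names:

```shell
#!/usr/bin/env bash
# Sketch: prove an archive by restoring it to a sandbox directory and
# diffing byte-for-byte against the source tree.
set -euo pipefail

restore_and_diff() {
    local archive="$1" original="$2"
    local sandbox
    sandbox=$(mktemp -d)
    tar -xf "$archive" -C "$sandbox"
    # The backup is proven only if every byte comes back identical.
    diff -r "$original" "$sandbox/$(basename "$original")" &&
        echo "RESTORE OK: $archive matches $original"
}
```

For real tapes, the shape is the same; only the extraction step changes.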

In every case, the same shape: the backup is provisional until proven by restore. The cost of the proof is hours. The cost of relying on faith is the whole company.

What to Take Home

If your operational situation reminds you of any of the following, treat it as a thing to investigate this week:

  • You have multiple backup mechanisms and trust their combination.
  • Your backup-status monitoring relies on the absence of failure messages.
  • Backup configuration and restore responsibility live with different teams.
  • You have not actually restored a production-grade backup to a clean environment in the last quarter.

Each of these was true at GitLab on 30 January 2017. Each of them is true at many organisations now.

The fix is not a new tool. The fix is a one-shell-script weekly drill, a positive heartbeat, and a single person whose job description includes "the backup chain works end-to-end". The cost is small. The alternative is broadcasting your recovery on YouTube to a peak audience of five thousand strangers, which GitLab handled with grace, and which most organisations would not.

Do not push this into a maintenance ticket; the ticket will be deferred each sprint until the next outage promotes it for you. Listen to the critics in your own ranks before you listen to the velocity-celebrants.

31 January 2017, 23:30 UTC: rm -rf on the wrong host, 300 GB removed in two seconds. Five backups, each broken differently: pg_dump silent after a PostgreSQL upgrade, alert emails rejected by DMARC, LVM snapshots scoped for staging, Azure snapshots never enabled, WAL archiving never configured. Recovery from a six-hour-old staging snapshot, eighteen hours at 60 Mbps, broadcast on YouTube. The principle is older than every database: backups are not backups until they have been restored.