Tales from the Bare Metal ■ Episode 01
« Thou shalt not trust a backup thou hast not restored! »
At half past eleven on the night of Tuesday, 31 January 2017, an engineer at GitLab.com typed rm -rf on what they believed was the secondary PostgreSQL database. The terminals on their screen were visually identical, save for the hostname in the prompt. Two seconds later, when they realised the prompt did not say what they thought it said, they killed the command. By that point, three hundred gigabytes of production data had been removed.
That was the easy part. The hard part came over the next eighteen hours, as the team discovered, in a sequence that has since become teaching material, that none of their five backup mechanisms had been working.
This is a thoroughly documented incident. GitLab's response, by industry standards, was extraordinary. They live-streamed the recovery on YouTube. They published their internal chat logs. They wrote a postmortem so detailed and so honest that it remains, nearly a decade later, one of the most widely cited operational documents in software engineering. The point of revisiting it now is not the story; the story is well-known. The point is the pattern.
What Happened, in Sequence
The day had not been routine. From around 19:00 UTC, GitLab's primary database had been under unusual load, suspected at the time to be coordinated spam-account creation. The on-call engineers had been working through the load issue for hours. By 23:00 UTC, replication between the primary and the secondary had stalled: the secondary's WAL receiver could not keep up, and the secondary fell sufficiently far behind that recovery would require re-seeding it.
The engineer in question had earlier that day created an LVM snapshot of the production database, intending to use it to set up a staging instance for testing pgpool-II. That snapshot, taken around 17:20 UTC, was a side effect of unrelated work; it had nothing to do with backups, and was scoped for staging.
At around 23:30 UTC, while attempting to clean up the secondary's data directory in preparation for re-seeding, the engineer ran the cleanup command on the wrong host. The terminals were visually identical; the only difference was a hostname in the prompt, on machines they had been working across for several hours. They killed the command within seconds, but approximately 300 GB had already been removed from the primary. Affected: roughly 5,000 projects, 5,000 comments, and 700 new user accounts, all created between 17:20 and 23:30 UTC.
Service was taken offline at 23:30. Recovery began immediately. The team turned to their backups in sequence. Each one, in turn, did not work.
The Five Mechanisms, in Order
pg_dump backups to S3. GitLab's primary off-site backup mechanism was a wrapper script that took daily pg_dump exports and uploaded them to Amazon S3. The script had been working for a long time. It stopped working silently after a PostgreSQL upgrade: the wrapper invoked an older pg_dump binary, which errored out against the upgraded server, but the script swallowed the error and produced empty output. No usable backup had reached the S3 bucket for several months.
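The failure hinged on the wrapper trusting pg_dump's output without checking it. A version that fails loudly is short; the sketch below is illustrative rather than GitLab's actual script, and the database name, bucket, size threshold, and report_success hook are all placeholders.
#!/usr/bin/env bash
# /etc/cron.daily/pg-backup -- illustrative sketch, not GitLab's script
set -euo pipefail

db="example_production"                  # placeholder database name
bucket="s3://example-db-backups"         # placeholder bucket
out="/var/backups/${db}-$(date +%F).dump"

# set -e aborts here if the pg_dump binary errors out (e.g. a version mismatch).
pg_dump --format=custom --file="$out" "$db"

# Refuse an implausibly small dump; an empty file passes a bare exit-code check.
size=$(stat -c %s "$out")
test "$size" -gt 1000000

aws s3 cp "$out" "$bucket/"

# Positive signal: success is announced every run, so silence means failure.
report_success "pg_dump backup ok: $out ($size bytes)"    # placeholder alerting hook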
Email alerts for backup failures. The pg_dump script sent failure-notification emails when something went wrong. Those emails would have caught the upgrade-mismatch problem on day one. They did not arrive. A change to the email infrastructure (DMARC) elsewhere in the organisation, made for unrelated reasons, caused the alert emails to be silently rejected at the receiving end. They had been rejected for months. No one noticed, because the absence of failure emails was the same signal as success.
LVM snapshots. GitLab had no LVM snapshot strategy intended as a backup of the production database; the snapshots that existed were taken to refresh staging. The most recent snapshot on 31 January was the one the engineer had taken six hours earlier for the unrelated staging-pgpool work. By coincidence, this snapshot was the most recent operational backup of the database that existed anywhere in the organisation.
Azure disk snapshots. The cloud platform on which GitLab.com was hosted at the time offered automated disk snapshots. They had not been enabled for the database servers. The decision was deliberate: cost considerations, plus a stated intention to rely on PostgreSQL replication and pg_dump. Restoring from an Azure snapshot, when investigated during recovery, was estimated to take days rather than hours.
WAL archiving. PostgreSQL supports continuous archiving of write-ahead log segments, which would have allowed point-in-time recovery to any moment within the archive window. WAL archiving had never been configured.
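For orientation, turning WAL archiving on is a configuration change rather than a new system. A minimal sketch of the relevant postgresql.conf lines, with the archive destination as a placeholder (point-in-time recovery also needs a periodic base backup, e.g. via pg_basebackup):
# postgresql.conf (excerpt) -- illustrative settings, destination is a placeholder
wal_level = replica
archive_mode = on
archive_command = 'cp %p /mnt/wal-archive/%f'   # or a tool such as wal-g / pgbackrest
archive_timeout = 300                           # force a segment switch at least every 5 minutes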
Recovery used the LVM snapshot. Copying the data from the staging host back to production took roughly eighteen hours, bottlenecked by the network storage's 60 Mbps throughput. By 17:00 UTC on 1 February, GitLab.com was back online, missing the data that had changed between 17:20 and 23:30 UTC the previous day.
The Context, in Fairness to the Build
Each of these five mechanisms had been a reasonable design at the time it was built. pg_dump-to-S3 worked correctly the day it was deployed. LVM snapshots had a clear scope (staging) that was honoured. Azure snapshots were a deliberate cost trade-off with documented reasoning. WAL archiving was on the long roadmap. The DMARC change, which silently severed the alerting chain, was made by another team, in another part of the organisation, for a reason that had nothing to do with PostgreSQL.
Three systemic conditions made the outcome likely.
First: redundancy of mechanism is not redundancy of recovery. Five backup mechanisms feel like resilience. In practice, none of them had ever been exercised end-to-end against the specific scenario "the primary is gone; restore from the most recent good source within an hour". The drills that would have surfaced each mechanism's silent failure were not part of routine operations.
Second: the absence of a failure signal is not the same as success. The team's confidence in pg_dump rested on the steady absence of failure emails. That signal had been broken for months. Monitoring that depends on negative signals (silence as good news) is a class of monitoring that fails open. The fix is positive monitoring: a periodic heartbeat that says "this thing ran today, here is its output, here is its size, here is its checksum".
Third: the path between backup and restore had no single owner. Backup configuration sat with one team; restore procedures sat with another; the email infrastructure sat with a third. No one team owned the integration test that would have walked through the entire path on a regular cadence. The handoffs were the gap.
These are not excuses. The wrong directory was deleted, by an engineer who knew which directory they intended to delete, on a host whose hostname they could read. The error happened. But the consequences of the error (six hours of unrecoverable data, eighteen hours of downtime, public restoration on YouTube) were architectural, not behavioural. A different architecture would have absorbed the same error in minutes.
The Principle
Backups are not backups until they have been restored.
This is older than every database. The unixoid expression of it is small enough to fit in a few lines of shell:
#!/usr/bin/env bash
# /etc/cron.weekly/restore-test
set -euo pipefail

backup=$(latest_backup)                        # site-specific helper: find the most recent backup
restore_to_sandbox "$backup"                   # site-specific helper: restore into a throwaway env
verify_checksum "$backup" sandbox_env          # site-specific helper: compare row counts or checksums
report_success "restore-test passed: $backup"  # broadcast success on slack/email/wiki

# Anything failing aborts via set -e and triggers an on-call page.
The shape is what matters. The signal is positive: the cron job is required to broadcast success, every week. If a week passes with no success message, on-call is paged. Silence is treated as failure, not as the absence of failure.
The further structural change is to put backup configuration and restore verification under one team's ownership. The path from "backup runs nightly" to "we just restored last week's backup and it worked" has to belong to one group of humans who own the whole sequence.
Where the Pattern Travels
The principle does not depend on PostgreSQL. It does not depend on Unix. It applies anywhere data is being protected against loss.
Cloud-managed databases. AWS RDS, Google Cloud SQL, Azure Database for PostgreSQL all offer automated snapshots. The snapshot is taken; the question is whether one has ever restored one to a sandbox database and confirmed that the application can connect to it and read every table. If one has not, one is guessing.
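What that drill can look like with the AWS CLI, as a sketch (instance and snapshot identifiers are placeholders; the verification step is whatever health check the application already has):
# Newest snapshot of the production instance (identifiers are placeholders).
snap=$(aws rds describe-db-snapshots \
        --db-instance-identifier prod-db \
        --query 'reverse(sort_by(DBSnapshots,&SnapshotCreateTime))[0].DBSnapshotIdentifier' \
        --output text)

# Restore into a throwaway instance and wait until it is reachable.
aws rds restore-db-instance-from-db-snapshot \
        --db-instance-identifier restore-drill \
        --db-snapshot-identifier "$snap"
aws rds wait db-instance-available --db-instance-identifier restore-drill

# Point the application's health checks or a row-count script at restore-drill,
# then tear it down:
aws rds delete-db-instance --db-instance-identifier restore-drill --skip-final-snapshot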
Kubernetes StatefulSets with PVC snapshots. PVC snapshots are convenient. They are also untested by default. The drill is identical: weekly, restore the snapshot to a sandbox cluster, run the application's startup health checks against it.
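A sketch of the restore half, assuming the CSI snapshot API is installed and a weekly VolumeSnapshot exists; every name below is a placeholder:
kubectl apply -n sandbox -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-drill-data
spec:
  storageClassName: standard            # must match a class the snapshot can restore into
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: db-snap-weekly                # the snapshot under test
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
EOF
# Then start the database pod in the sandbox namespace against this PVC and
# run the application's ordinary startup health checks.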
Object storage. S3 versioning, Backblaze B2, Cloudflare R2. These are good systems. They protect against accidental deletion if and only if one has, at some point, recovered an object from them by version, with the production-side application code, and verified it was correct. Otherwise it is faith.
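The object-store version of the drill, sketched with the S3 API (bucket, key, and the checked file are placeholders):
# List the versions of one object; pick the version that predates the deletion.
aws s3api list-object-versions --bucket example-bucket --prefix uploads/avatar.png

# Pull that specific version back out (version ID copied from the listing above).
aws s3api get-object --bucket example-bucket --key uploads/avatar.png \
        --version-id "$version_id" restored-avatar.png

# Verify with the same code path production uses, not only a byte count.
sha256sum restored-avatar.png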
Git-LFS, large media stores, vendored binary archives. The same logic. The cost of weekly verification is small. The cost of discovering, mid-incident, that the chain has been broken for six months is everything.
Filesystem and host backups. The classic case. Backup tapes that have been rotated for years, never read. The drill is to read one back, mount it, diff against a recent snapshot, and confirm the bytes are present.
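A sketch of that drill for a plain tar-based host backup (the tape case adds a read step in front, the rest is identical; paths are placeholders):
archive=/backup/weekly/host-latest.tar.gz     # placeholder path
scratch=$(mktemp -d)

# Restore into a scratch directory; nothing touches the live system.
tar -xzf "$archive" -C "$scratch"

# Compare against a recent snapshot of the live tree; unexplained differences
# or an unreadable archive mean the chain is broken.
diff -r --brief "$scratch/etc" /etc | head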
In every case, the same shape: the backup is provisional until proven by restore. The cost of the proof is hours. The cost of relying on faith is the whole company.
What to Take Home
If your operational situation reminds you of any of the following, treat it as a thing to investigate this week:
- You have multiple backup mechanisms and trust their combination.
- Your backup-status monitoring relies on the absence of failure messages.
- Backup configuration and restore responsibility live with different teams.
- You have not actually restored a production-grade backup to a clean environment in the last quarter.
Each of these was true at GitLab on 30 January 2017. Each of them is true at many organisations now.
The fix is not a new tool. The fix is a one-shell-script weekly drill, a positive heartbeat, and a single person whose job description includes "the backup chain works end-to-end". The cost is small. The alternative is broadcasting your recovery on YouTube to a peak audience of five thousand strangers, which GitLab handled with grace, and which most organisations would not.
Do not push this into a maintenance ticket; the ticket will be deferred each sprint until the next outage promotes it for you. Listen to the critics in your own ranks before you listen to the velocity-celebrants.
31 January 2017, 23:30 UTC:
rm -rf on the wrong host, 300 GB removed in two seconds. Five backups, each broken differently: pg_dump silent after a PostgreSQL upgrade, alert emails rejected by DMARC, LVM snapshots scoped for staging, Azure snapshots never enabled, WAL archiving never configured. Recovery from a six-hour-old staging snapshot, eighteen hours at 60 Mbps, broadcast on YouTube. The principle is older than every database: backups are not backups until they have been restored.