Vivian Voss

The Snapshot That Travels

freebsd zfs unix devops

The Unix Way ■ Episode 13

At three in the morning, on a production cluster whose state is measured in terabytes, the question is not which backup tool you happen to have installed. It is whether any of them will finish before breakfast.

FreeBSD's answer has been sitting in the base system since FreeBSD 7.0 in 2008, experimental at first, and first-class since FreeBSD 8.0 in 2009. ZFS treats snapshots, replication, and copy-on-write not as features bolted onto the filesystem, but as properties of the filesystem itself. Three separate workflows, in consequence, collapse into three uses of a single primitive. One does, on first encounter, find this a little disorientating. On second encounter, one finds everything else rather ornate.

One Primitive, Three Workflows

  Backup        zfs snapshot tank/data@before-deploy    atomic, block-level; milliseconds, no locks, no copy
  Replication   zfs send | ssh | zfs receive            incremental, block-level; 40 TB delta in seconds
  Branching     zfs clone tank/data@A tank/exp          copy-on-write; zero cost until divergence

Backup, replication, and branching are not three tools. They are three uses of the same primitive.

Local Snapshots

The command, complete with what one might fairly call a proper commit message:

zfs snapshot tank/data@before-i-try-something-clever

Atomic. Takes milliseconds, even on an active dataset. No locks. No files copied. The snapshot is an immutable view of every block at that instant, held open for as long as one pleases. Reverting to it is a single command and, again, takes the time one needs to type it.

The distinction worth being rather precise about: a snapshot is not a copy, it is a reference to the blocks as they were. Writing to the live dataset allocates new blocks; the snapshot quietly continues to point to the old ones. This is copy-on-write at the filesystem layer, not a housekeeping task bolted on top of it. The consequence is that snapshots are cheap enough to create generously and keep forever, which is the unspoken precondition for everything that follows.
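The generosity is easy to mechanise. A minimal sketch of a timestamped-snapshot helper follows; the dataset name, the auto- prefix, and the DRY_RUN switch are all illustrative assumptions, and with DRY_RUN=1 (the default here) the helper prints the zfs invocation rather than executing it, so the logic can be read without a pool at hand.

```shell
#!/bin/sh
# Sketch of a timestamped-snapshot helper. DATASET, the "auto-"
# prefix, and DRY_RUN are illustrative assumptions; with DRY_RUN=1
# the zfs command is printed rather than executed.

DATASET="${DATASET:-tank/data}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command under DRY_RUN, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

snap_name() {
    # Build e.g. tank/data@auto-2026-02-14T03:00. An explicit
    # timestamp argument overrides the clock, for predictability.
    ts="${1:-$(date -u +%Y-%m-%dT%H:%M)}"
    printf '%s@auto-%s\n' "$DATASET" "$ts"
}

take_snapshot() {
    run zfs snapshot "$(snap_name "$1")"
}

take_snapshot "$1"
```

Because the snapshot itself is atomic and lock-free, this can run from cron every hour on a busy dataset without anyone noticing.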

Linux has btrfs, which offers broadly similar subvolume snapshots, and it is, in fairness, the only serious counterpart on that side of the aisle. Btrfs has genuine strengths: it ships with the mainline Linux kernel, is well integrated into Ubuntu and Fedora, and its single-disk and mirrored-RAID configurations are perfectly serviceable for a great many workloads. Alas, btrfs RAID 5/6 is still formally discouraged for production use in 2026, which one does find a trifle awkward for a system intended to hold long-term state. It is not, in short, what one chooses for a 40 TB cluster with parity RAID.

Across a Cluster

The same primitive, unchanged, becomes replication:

zfs send tank/data@A | ssh standby zfs receive tank/data
zfs send -i tank/data@A tank/data@B | ssh standby zfs receive tank/data

Block-level, not file-level. The first command ships the entire dataset as a stream. The second ships only the delta between snapshots A and B. Published benchmarks put full-send rates at 100 to 300 MB/s and incremental sends at roughly 60 MB/s on commodity hardware. The real surprise, however, is not the rate but the volume: a 1 TB dataset with a 1 per cent daily change rate ships roughly 10 GB per replication cycle, and a 512 MB benchmark dataset in which only metadata changed ships 33 KB. Once the initial full send is done, ongoing replication is close to free for stable workloads.
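The full-versus-incremental decision is mechanical, and can be sketched as such. The helper below only builds the pipeline as a string: the standby host name comes from the example above, and passing the last shipped snapshot as a plain argument is a simplification, since real tools persist it as a bookmark or query the remote side.

```shell
#!/bin/sh
# Sketch: choose a full or incremental send depending on whether a
# snapshot has been shipped before. The pipeline is printed, not run,
# so the logic is readable without a pool or a standby host.

build_send_cmd() {
    dataset="$1"    # e.g. tank/data
    new="$2"        # snapshot to ship, e.g. @B
    last="$3"       # snapshot the standby already holds, or empty

    if [ -n "$last" ]; then
        echo "zfs send -i $dataset$last $dataset$new | ssh standby zfs receive $dataset"
    else
        echo "zfs send $dataset$new | ssh standby zfs receive $dataset"
    fi
}

build_send_cmd tank/data @A ""    # the initial full send
build_send_cmd tank/data @B @A    # every send thereafter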

This is how disaster-recovery standbys are kept in sync. This is how read replicas are seeded across a cluster. This is how staging is rebuilt from production in the time it takes to boil a kettle.

The automation ecosystem around the primitive is mature. One does not, in practice, type zfs send by hand at 3 AM. Sanoid handles policy-driven snapshot scheduling: hourlies, dailies, monthlies, pruned on a declared retention. Syncoid orchestrates the incremental sends over SSH, typically from cron. zrepl runs as a long-lived daemon with TLS and bandwidth limiting, for links where one rather cares. Each of these sits on top of the same five or six zfs subcommands, and one could write a passable version over a weekend if one were so inclined.
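For illustration, a representative sanoid.conf fragment in the format Sanoid uses (dataset sections referencing a retention template). The dataset name, the retention counts, and the FreeBSD path are assumptions for the sketch, not a recommendation:

```ini
# /usr/local/etc/sanoid/sanoid.conf (illustrative path and values)
[tank/data]
        use_template = production
        recursive = yes

[template_production]
        hourly = 36
        daily = 30
        monthly = 3
        autosnap = yes
        autoprune = yes
```

Sanoid then takes and prunes the snapshots on schedule, and Syncoid ships whatever exists to the standby.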

Btrfs send and receive do exist. They (alas) require the source subvolume to be made read-only before the send, which one does find a trifle inconvenient for continuously active state, and the surrounding automation ecosystem has not reached the same maturity. The workarounds, inevitably, involve a scheduled read-only window, which is itself an accounting exercise one had rather hoped to avoid.

Something Rather Like Git

If one insists on the metaphor, and one usually does: every snapshot is an immutable commit, and every clone a branch.

zfs clone tank/data@A tank/branch-experiment
[Figure: The Filesystem Is the Version Control — snapshots @A, @B, @C form the shared baseline of the parent dataset; clones feature-A, experiment-B, and staging branch from it, each costing nothing until it diverges.]

The clone costs no disk space until it diverges. Reads still come from the shared baseline. Git's copy-on-write semantics, effectively, applied to the entire filesystem, except the unit of comparison is the block rather than the line, and the system has no particular opinion about whether the content is source code, a PostgreSQL data directory, or 40 TB of scientific observations. One can spin up fifty clones of a multi-terabyte database for fifty engineers, each with their own writable branch, for the price of the blocks they actually change. One does appreciate the economics of it.

The pattern has been quietly productised. AWS FSx for OpenZFS ships Oracle databases at near-petabyte scale, cloned in seconds, with zero additional capacity consumed until divergence. CI pipelines increasingly use the same mechanism to give every test run its own writable snapshot of the production dataset, torn down automatically when the run completes. One does find it charming that the pattern is old enough to be boring, and new enough to still surprise people.
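A hedged sketch of that CI pattern: the dataset names, the ci/ namespace, and the run identifier are all invented for illustration, and the helpers print the zfs commands rather than execute them.

```shell
#!/bin/sh
# Sketch: each CI run gets a writable clone of a production snapshot
# and tears it down on exit. tank/prod@nightly and the tank/ci/
# namespace are hypothetical; commands are printed, not executed.

clone_for_run() {
    echo "zfs clone tank/prod@nightly tank/ci/run-$1"
}

destroy_after_run() {
    echo "zfs destroy tank/ci/run-$1"
}

clone_for_run 1042
# ... run the tests against tank/ci/run-1042 ...
destroy_after_run 1042
```

The teardown is a single destroy of the clone; the nightly snapshot underneath is untouched and shared by every run.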

zfs diff reports changes at file level between two snapshots. zfs rollback returns a dataset to a prior snapshot, undoing every write since. zfs promote swaps a clone with its origin, turning the branch into the main line. Replication via send is, in practical effect, git push. The filesystem is the version control. One is not, at this point, extending an analogy; one is describing the semantics.
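As a crib, the correspondence reads roughly as follows. It is approximate rather than one-to-one, and the dataset and snapshot names simply follow the examples above.

```shell
#!/bin/sh
# Rough git-to-zfs crib; the mapping is approximate.
#
#   git commit        ~  zfs snapshot tank/data@A
#   git branch        ~  zfs clone tank/data@A tank/branch-experiment
#   git diff          ~  zfs diff tank/data@A tank/data@B
#   git reset --hard  ~  zfs rollback tank/data@A
#   git merge         ~  zfs promote tank/branch-experiment
#   git push          ~  zfs send ... | ssh standby zfs receive ...

# "Merging" the experiment: promote swaps the clone with its origin,
# making the branch the main line. Printed here, not executed.
merge_branch() {
    echo "zfs promote $1"
}

merge_branch tank/branch-experiment
```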

Btrfs, to be fair, offers subvolume snapshots and clones too. Its RAID 5/6 is still officially marked as "should not be used for production", its deduplication is offline-only (batch, not inline), and its quota accounting has historically slowed noticeably with many snapshots. One would, all things considered, prefer rather more predictable ground under the version-control layer of one's infrastructure.

Feature Comparison

ZFS vs btrfs — The Inventory

                                      ZFS / OpenZFS       btrfs
  Snapshots                           inline, atomic      inline, atomic
  send from a writable source         yes                 no — source must be read-only
  Deduplication                       inline              offline (batch)
  RAID 5/6 in production              production-grade    discouraged (2026)
  Quota accounting, many snapshots    stable              historically slow

btrfs is the only serious counterpart. That is the compliment it deserves. The rest is the inventory.

The Point

On FreeBSD, ZFS is the installer default. Ticking a box during setup gets one the filesystem, the snapshots, the replication primitive, and the version-control semantics. No third-party backup agent. No scheduled read-only window. No separate replication daemon. No weekend dedicated to ensuring that one's snapshots are, in fact, snapshots rather than best-effort approximations.

There is no separate "backup layer", "replication layer", or "branching layer" waiting to be integrated. There is the pool, the dataset, the snapshot, and the stream. snapshot, send, receive, clone, diff, rollback, and promote are variations on the same model. Everything else one does with ZFS is a composition of those few operations. This is the Unix philosophy applied to storage: small, orthogonal operations that compose. The value is not in any one of the commands. It is in the fact that they share a single mental model.

The snapshot-as-stream philosophy is not a backup strategy bolted on top of the filesystem. It is the filesystem.

Backup, replication, and branching are not three tools. They are three uses of the same primitive.

zfs snapshot: atomic, milliseconds, no locks.
zfs send | ssh | zfs receive: block-level, 40 TB delta in seconds.
zfs clone: copy-on-write branching, zero cost until divergence.
btrfs is the only serious counterpart; its RAID 5/6 remains discouraged for production in 2026.

One primitive. Three workflows. The filesystem is the version control.