Vivian Voss

Technical Beauty: ZFS

zfs freebsd security

Technical Beauty ■ Episode 06

In 2005, the storage industry had a dirty secret. Filesystems trusted hardware. Hard drives reported writes as complete when they were still in a volatile cache. Controllers silently flipped bits. RAID arrays rebuilt from corrupted parity without telling anyone. The entire tower of abstraction, from application to spinning platter, was built on a gentleman’s agreement that the gentleman had stopped honouring somewhere around 1998.

Jeff Bonwick and Matthew Ahrens at Sun Microsystems looked at this arrangement and decided the correct response was not to trust harder, but to stop trusting entirely. The filesystem they built does not believe your hardware. It does not believe your controller. It does not believe your cable. It verifies. Every block, every read, every time. The name is ZFS, and it has been proving hardware wrong for twenty years.

The Engineers

Bonwick was not new to solving problems that others preferred to ignore. Before ZFS, he designed the Slab Allocator for the Solaris kernel, a memory allocation strategy so effective that it was adopted by Linux, FreeBSD, and virtually every operating system that takes performance seriously. The man understood what happens at the boundary between software and silicon. Ahrens, his co-architect, had the same instinct for reduction. Together, they spent four years building a filesystem that merged volume management, RAID, and data integrity into a single coherent layer. The industry had kept these concerns separate for decades. Bonwick and Ahrens argued, correctly, that keeping them separate was the reason they kept failing together.

The Problem

Silent data corruption. The term sounds like a minor inconvenience. It is not. A bit flips in a file, and no one notices. The backup runs, dutifully copying the corrupted file. The previous backup is rotated out. Three weeks later, someone opens the file and discovers it is damaged. The backup is damaged too. Every copy, on every server, in every data centre, is silently, irreversibly wrong.

Traditional filesystems had no defence against this. Ext4 does not checksum data blocks. XFS does not checksum data blocks. NTFS does not checksum data blocks. They trust the hardware to deliver what was written. When the hardware lies, they have nothing to cross-reference. The corruption propagates, undetected, through every layer of redundancy that was supposed to prevent exactly this outcome.

[Diagram: The Integrity Problem. Left panel, the traditional filesystem (ext4 / XFS / NTFS) trusts the hardware: application writes pass unverified through disk (bit rot) into backup (corrupted); no checksums, corruption propagates silently, detected weeks later, if ever. Right panel, ZFS trusts mathematics: application writes flow through Merkle checksums, Copy-on-Write, and a self-healing mirror; every block verified, corruption detected on read, the correct copy restored instantly. Caption: the filesystem that assumes nothing and verifies everything.]

The Architecture

ZFS solves this with end-to-end checksumming. Every block's checksum is stored in its parent's block pointer, not alongside the data itself, so the pool as a whole forms a Merkle tree. This distinction is crucial. A traditional filesystem that checksums a block and stores the checksum next to it has achieved nothing: if the disc corrupts the data, it can just as easily corrupt the checksum. ZFS stores the checksum one level up in the tree. To fool ZFS, you would need to corrupt both the data block and its parent's pointer simultaneously, in a way that still produces a valid checksum. The probability is, for practical purposes, zero.
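The idea can be sketched in a few lines. This is a toy model, not ZFS's actual on-disk format (which uses fletcher4 or SHA-256 checksums inside 128-byte block pointers); the point is only that the verifier lives one level above the data it verifies.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class BlockPointer:
    """Points at a child block and carries the child's checksum.
    The checksum lives one level UP the tree, never next to the data."""
    def __init__(self, data: bytes):
        self.data = data
        self.cksum = checksum(data)  # stored in the parent, at write time

    def read(self) -> bytes:
        # Every read re-verifies against the parent's copy of the checksum.
        if checksum(self.data) != self.cksum:
            raise IOError("checksum mismatch: corruption detected")
        return self.data

ptr = BlockPointer(b"important payload")
assert ptr.read() == b"important payload"

# A single flipped bit in the stored data is caught on the next read,
# because the parent's checksum no longer matches.
ptr.data = b"important pavload"
try:
    ptr.read()
except IOError as err:
    print(err)
```

Corrupting the data alone fails verification; corrupting the checksum alone fails verification; only a coordinated corruption of both, landing on a valid pair, would slip through.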

On top of this sits Copy-on-Write. ZFS never overwrites data in place. A write operation allocates a new block, writes the data there, updates the parent pointer, and only then frees the old block. If power fails mid-write, the old data is still intact at the old location. The tree is always consistent. There is no fsck. There is no journal replay. There is no anxious wait after an unexpected reboot while the filesystem attempts to reconstruct what it was doing when the lights went out. ZFS simply loads the last consistent root pointer, and the tree is exactly as it was before the interruption.
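The same property falls out of any immutable tree, which a short sketch can show. The structures here are hypothetical; ZFS's real transaction machinery (the DMU and its transaction groups) is far more elaborate, but the invariant is the same: a write builds new blocks up to a new root, and the old root stays valid throughout.

```python
# Copy-on-Write sketch: "writing" never mutates an existing block.
# A new leaf and a new root are allocated; the old root is untouched,
# so a crash between the two versions loses nothing.
class Block:
    def __init__(self, children=(), data=None):
        self.children = tuple(children)  # immutable pointers to child blocks
        self.data = data

def write(root: Block, new_data: str) -> Block:
    """Allocate a new leaf and a new root pointing at it; mutate nothing."""
    new_leaf = Block(data=new_data)
    return Block(children=root.children + (new_leaf,))

root_v1 = Block(children=(Block(data="original"),))
root_v2 = write(root_v1, "update")

# Both roots describe complete, consistent trees at the same moment.
assert [b.data for b in root_v1.children] == ["original"]
assert [b.data for b in root_v2.children] == ["original", "update"]
```

Recovery after a power failure reduces to choosing which root pointer to load; there is never a half-written tree to repair.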

Snapshots

Copy-on-Write has a side effect so useful it would justify the entire architecture on its own: snapshots are free. A snapshot is not a copy. It is a preserved root pointer. The blocks do not move. The data is not duplicated. ZFS simply marks the current set of block pointers as immutable and continues writing new data to new blocks. The snapshot is instant: it takes milliseconds regardless of dataset size, and it costs zero additional storage until the live data diverges from it.
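A toy model makes the constant-time claim concrete. The `Pool` class and its methods are illustrative inventions, not ZFS's API; what they share with the real thing is that both snapshot and rollback are pointer assignments, independent of how much data the pool holds.

```python
# Snapshot-as-pointer sketch: a snapshot saves the current root
# reference; rollback restores it. No blocks are ever copied.
class Pool:
    def __init__(self):
        self.root = {"notes.txt": "original"}
        self.snapshots = {}

    def snapshot(self, name):
        self.snapshots[name] = self.root          # O(1): preserve the pointer

    def write(self, fname, data):
        # Copy-on-Write: build a new root, never mutate the old one,
        # so every preserved snapshot pointer stays valid.
        self.root = {**self.root, fname: data}

    def rollback(self, name):
        self.root = self.snapshots[name]          # O(1): restore the pointer

pool = Pool()
pool.snapshot("daily-1")
pool.write("notes.txt", "ENCRYPTED BY RANSOMWARE")
pool.rollback("daily-1")
assert pool.root["notes.txt"] == "original"
```

The live tree can be rewritten entirely, encrypted included, and the snapshot never notices: its pointer still leads to the old, immutable blocks.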

In an era of ransomware, this property is worth more than it appears. A ransomware attack encrypts your files. With a traditional backup, recovery means restoring from an off-site copy, a process that takes hours and assumes the backup was not compromised. With ZFS snapshots, recovery means zfs rollback. The encrypted files vanish. The original data reappears. The operation takes seconds. The snapshots were immutable. The ransomware could not touch them.

[Diagram: Snapshots as preserved pointers, not copies. Snapshots @daily-1 and @daily-2 each hold a preserved root pointer (A and B); the live filesystem holds a third (C). zfs rollback @daily-2 swaps the live pointer back to B: no data copied, pointers preserved, rollback in seconds. Caption: ransomware encrypted your files? The snapshot never noticed.]

The Numbers

128-bit block addressing. The theoretical maximum storage capacity is 256 quadrillion zettabytes. This is not a practical number. It is a statement of intent. ZFS was not designed for the storage capacities of 2005. It was designed so that storage capacity would never again be a constraint on filesystem design. Whether this ambition was prophetic or merely extravagant depends on your time horizon, but twenty years in, no one has yet complained that 128 bits was too many.

Transparent compression is built in. LZ4 compression, the default on modern deployments, typically achieves 2:1 compression on text-heavy workloads while adding negligible CPU overhead. On many storage systems, enabling compression makes I/O faster, not slower. The arithmetic is straightforward: compressed data means fewer blocks to read from disc, and disc I/O is orders of magnitude slower than the CPU time needed to decompress. The bottleneck moves from the spindle to the silicon, and the silicon is faster. It is one of those rare optimisations where you get more space and more speed, and the only cost is the courage to enable it.
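The trade is easy to demonstrate. The sketch below uses Python's standard-library zlib as a stand-in for LZ4 (which is not in the standard library); the absolute ratio differs, but the shape of the argument, repetitive text collapsing to a fraction of its size, is the same.

```python
import zlib

# A text-heavy payload: repeated structure, like logs or HTTP traffic.
# Fewer bytes stored means fewer blocks to read back from disc.
text = b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n" * 200
compressed = zlib.compress(text)

ratio = len(text) / len(compressed)
print(f"{len(text)} -> {len(compressed)} bytes ({ratio:.1f}:1)")
assert ratio > 2  # comfortably beats 2:1 on repetitive text
```

On ZFS the property is a one-line toggle per dataset (`zfs set compression=lz4 pool/dataset`); blocks that do not compress usefully are stored uncompressed automatically.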

The Adoption

FreeBSD adopted ZFS as a first-class citizen in 2008, with native kernel support. Root-on-ZFS became the default installer option. No external modules. No DKMS rebuilds after kernel updates. No licence gymnastics. ZFS is to FreeBSD what the foundation is to a house: it is not a feature you enable; it is the surface everything else stands on.

Linux adopted ZFS more cautiously. The CDDL licence under which ZFS was released is not compatible with the GPL, which means ZFS cannot be distributed as part of the Linux kernel. It runs as a DKMS module, rebuilt against each kernel update, functional but never quite native. Ubuntu offered it as an installer option in 2019. Proxmox supports it well. The integration works, but it carries an asterisk that FreeBSD users do not have to read.

Today, TrueNAS, the most widely deployed open-source storage platform, is built around ZFS, with its CORE edition running on FreeBSD. Millions of storage systems worldwide trust their data to the filesystem that trusts nothing.

The Fork

After Oracle acquired Sun Microsystems in 2010, the ZFS development community watched with the resignation of people who have seen this film before. Oracle closed the source. Two-thirds of the core development team left. In 2013, they founded OpenZFS, an independent project that unified the FreeBSD and Linux implementations under a single codebase. The architecture survived the corporate acquisition. The code survived the licence dispute. The community survived the diaspora. The checksums, as always, remained correct.

The Honesty

ZFS is software, and software has bugs. In 2023, a block cloning bug in OpenZFS 2.2.0 caused data corruption under specific conditions. In 2024, a native encryption bug affected send/receive operations. Both were identified, reported, and fixed. Neither was silent. Neither propagated undetected through backup chains. The checksumming architecture that protects against hardware failures also protects against software failures: the system that verifies everything includes verifying itself.

The Reduction

What makes ZFS technically beautiful is not any single feature. It is the architectural decision to unify filesystem, volume manager, and integrity verification into a single coherent system, and then to follow that decision through to every consequence. Copy-on-Write is not a feature; it is a consequence of never trusting in-place updates. Snapshots are not a feature; they are a consequence of Copy-on-Write. Self-healing is not a feature; it is a consequence of checksumming combined with redundancy. Each capability emerges from the one before it, like a proof where each line follows from the previous.
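The self-healing consequence can be sketched by combining the two earlier ingredients: a checksum held in the parent and a second copy held in a mirror. The `Mirror` class is a toy invention, nothing like ZFS's actual vdev code, but the mechanism is the one the prose describes: detect the bad copy, serve the good one, rewrite the bad one.

```python
import hashlib

def cksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Self-healing sketch: two mirror copies plus a parent-held checksum.
# A read that finds one copy corrupt serves the verified copy and
# silently repairs the damaged side.
class Mirror:
    def __init__(self, data: bytes):
        self.copies = [data, data]       # the two sides of the mirror
        self.expected = cksum(data)      # stored in the parent pointer

    def read(self) -> bytes:
        for copy in self.copies:
            if cksum(copy) == self.expected:
                # Heal: rewrite any sibling that fails verification.
                self.copies = [copy if cksum(c) != self.expected else c
                               for c in self.copies]
                return copy
        raise IOError("all copies corrupt")

m = Mirror(b"payload")
m.copies[0] = b"pAyload"                     # silent bit rot on one side
assert m.read() == b"payload"                # the good copy is served...
assert m.copies == [b"payload", b"payload"]  # ...and the bad one healed
```

Neither checksums nor mirrors alone achieve this: the mirror without the checksum cannot tell which copy is right, and the checksum without the mirror can only report the damage, not undo it.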

The storage industry spent decades building layers of redundancy on top of filesystems that could not verify their own data. RAID protected against drive failure but not against silent corruption. Backups protected against data loss but faithfully copied corrupted files. Checksums existed in transport protocols but stopped at the filesystem boundary. Bonwick and Ahrens extended the chain of verification from application to platter and refused to leave a gap for corruption to hide in.

128-bit addressing. End-to-end checksumming. Copy-on-Write. Instant snapshots. Self-healing mirrors. The filesystem that trusts mathematics, not hardware. Twenty years on, the mathematics has yet to disappoint.