Vivian Voss

Technical Beauty: tar

unix tooling

Technical Beauty ■ Episode 20

In 1979, at Bell Labs, someone wrote a utility to flatten a directory tree into a sequential stream of bytes. It was designed for magnetic tape, where rewinding was expensive and random access was a fantasy. The tool needed to do exactly one thing: take a tree of files, preserve their permissions, ownership, group, and timestamps, and write the lot out as a single linear stream. It did not compress. It did not encrypt. It did not checksum the data it carried. It just streamed.

They called it tar. Tape Archive. Forty-seven years later, it remains the standard mechanism for packaging files on every Unix system on earth, and quite a few that would rather not admit it.

The Principle

tar does one thing. It serialises a directory tree into a flat byte stream and deserialises it back. That is the entire scope. Compression is someone else's problem. Encryption is someone else's problem. Integrity verification is someone else's problem. This is not a limitation. This is the design.

The Unix philosophy, distilled to its purest form, says: write programs that do one thing well, and write programs that work together. tar is the canonical example. It creates the archive stream. gzip compresses it. xz compresses it better. zstd compresses it faster. age encrypts it. Each tool does precisely one job. They compose through pipes.

tar cf - ./project | xz | age -r key > project.tar.xz.age

That single line archives a directory, compresses it with LZMA2, and encrypts it with a public key. Three tools. Three responsibilities. One pipeline. The dash after cf means stdout. It is the hinge on which the entire composition turns: tar writes to the pipe, the next tool reads from it, and so on until the stream lands in a file. No temporary files. No intermediate state. No coordination protocol between the tools. Just bytes flowing downhill.

[Figure: The Pipeline. ./project/ (tree of files: perms, owner, group, mtime) → tar (archive) → xz (compress) → age (encrypt) → project.tar.xz.age. Illustrative sizes: 1.2 MB tree, 1.2 MB flat stream, 240 KB compressed, 240 KB encrypted. Each tool does one thing. The pipe does the rest.]
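The same shape can be tried end to end with nothing but tar and gzip. This is a minimal sketch, not the article's exact pipeline: gzip stands in for xz and age so that it runs without a key, and the project tree and file names are hypothetical.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

# A small hypothetical project tree to archive.
mkdir -p project/src
printf 'hello\n' > project/src/main.txt
printf 'notes\n' > project/README

# Archive to stdout, compress in the pipe, land the stream in a file.
tar cf - ./project | gzip > project.tar.gz

# Reverse the pipeline: decompress to stdout, let tar read stdin ("f -").
mkdir restored
gzip -dc project.tar.gz | tar xf - -C restored

# The restored tree should match the original byte for byte.
diff -r project restored/project && echo "round trip ok"
```

Swapping gzip for xz, or adding an encryption stage, changes one word in the pipe and nothing in tar.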

The Tape

The format was designed for tape drives. Magnetic tape is a sequential medium. You read forward. You write forward. Rewinding is slow and wears the mechanism. Random access does not exist in any meaningful sense. tar's format reflects this constraint: headers and data blocks alternate in a flat sequence, each file preceded by a 512-byte header containing its metadata. The entire archive can be written and read in a single pass.

This constraint, imposed by hardware that most developers alive today have never touched, produced a format with remarkable properties. You can append to a tar archive without rewriting the existing contents. You can extract a single file by scanning forward and stopping. You can update a file by appending a newer copy, which replaces the older one on extraction. These are not features that were added later. They are consequences of designing for a medium where going backwards was expensive.
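Appending and single-file extraction can be seen in a few lines. This is a sketch under assumptions: the file names are hypothetical, the archive is uncompressed (appending does not work through a compression filter), and the r (append) operation behaves as it does in GNU tar and bsdtar.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

printf 'version 1\n' > config.txt
printf 'other\n' > other.txt
tar cf demo.tar config.txt other.txt

# Append a newer copy of config.txt; the earlier entry stays in the archive.
printf 'version 2\n' > config.txt
tar rf demo.tar config.txt

# The listing shows config.txt twice, in the order the entries were written.
tar tf demo.tar

# Extract only config.txt into a fresh directory. Entries are processed in
# archive order, so the appended copy is the one left on disk.
mkdir out
tar xf demo.tar -C out config.txt
cat out/config.txt
```

Nothing was rewritten in the middle of the archive. The old entry is still there; it is simply superseded by whatever comes after it in the stream.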

The tape drives are gone. The design remains. It turns out that designing for constraints produces software that outlives the constraints. A format optimised for sequential access works beautifully on spinning disks, on solid-state storage, on network streams, on anything that reads bytes in order. The constraint was local. The benefit was universal.

The Silence

tar does not compress. This is the fact that causes the most confusion and the most arguments on the internet, which is saying something. People write tar czf and believe that tar is compressing their files. It is not. The z flag tells tar to invoke gzip as a filter on its output stream. tar still does exactly one thing. It has simply learned, over the decades, a polite convention for calling its neighbours.
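One way to convince yourself is to build the same archive twice, once with the z flag and once with an explicit pipe, and let gzip inspect both. This is a sketch with hypothetical file names. Whether a given tar spawns the gzip program or links a compression library varies by implementation, but the output is an ordinary gzip stream either way.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

mkdir -p project
printf 'data\n' > project/file.txt

# Let tar arrange the gzip step itself...
tar czf with-flag.tar.gz project
# ...and do the same composition by hand through a pipe.
tar cf - project | gzip > with-pipe.tar.gz

# Both outputs are plain gzip streams: gzip alone can test them.
gzip -t with-flag.tar.gz
gzip -t with-pipe.tar.gz

# Once decompressed, both are ordinary tar streams with the same members.
flag_listing=$(gzip -dc with-flag.tar.gz | tar tf -)
pipe_listing=$(gzip -dc with-pipe.tar.gz | tar tf -)
[ "$flag_listing" = "$pipe_listing" ] && echo "same members"
```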

tar does not encrypt. tar does not checksum file contents; the only checksum it writes is a simple sum over each header. tar does not deduplicate. tar does not do delta encoding. tar does not sign. tar does not do any of the things that a modern archive tool is expected to do, because tar is not a modern archive tool. tar is a stream serialiser. Everything else is composition.

This silence, this refusal to absorb adjacent responsibilities, is the source of tar's longevity. Every feature you add is a feature you must maintain. Every capability you absorb is a dependency you cannot remove. tar chose to do nothing beyond its core purpose, and in doing so, ensured that its core purpose would never be compromised by the failure of an unrelated feature. gzip had a security vulnerability in 2010. tar was unaffected. The tar format does not know what gzip is. That is the point.

Responsibility Matrix: one tool per job, no exceptions.

CONCERN                                    TAR?   TOOL
Flatten directory tree                     yes    tar
Preserve permissions, owner, timestamps    yes    tar
Sequential byte stream                     yes    tar
Compression                                no     gzip, xz, zstd, bzip2
Encryption                                 no     age, gpg, openssl
Integrity verification                     no     sha256, md5
Signing                                    no     signify, gpg, minisign

The rows tar does not own are the rows that cannot break it.
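The integrity row is a good example of how little tar needs to be involved. The sketch below records a SHA-256 digest beside an archive and checks it later. The digest helper and the file names are hypothetical, and the choice between sha256sum and shasum is an assumption about what the host has installed.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

# Pick whichever SHA-256 tool is installed (an assumption about the host).
if command -v sha256sum >/dev/null 2>&1; then
    digest() { sha256sum "$1" | cut -d' ' -f1; }
else
    digest() { shasum -a 256 "$1" | cut -d' ' -f1; }
fi

mkdir -p project
printf 'data\n' > project/file.txt
tar cf - project | gzip > project.tar.gz

# Record the digest next to the archive at creation time.
digest project.tar.gz > project.tar.gz.sha256

# Later: recompute and compare before trusting the archive.
if [ "$(digest project.tar.gz)" = "$(cat project.tar.gz.sha256)" ]; then
    echo "archive intact"
fi

# tar plays no part in the check; a corrupted copy simply stops matching.
cp project.tar.gz damaged.tar.gz
printf 'x' >> damaged.tar.gz
```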

The Survivors

Forty-seven years after its creation, tar is everywhere. Every Docker image layer is a tarball. Most open-source source releases ship as a .tar.gz or .tar.xz. Most Unix backup tooling can read or write tar. Package managers that predate the app-store era lean on it heavily. FreeBSD's pkg packages are tar archives. Debian's .deb is an ar archive containing tarballs. Red Hat's .rpm carries a cpio payload, but the source RPM wraps the upstream tarball. The OCI container image specification layers files as tarballs. The entire containerisation revolution, the one that was supposed to change everything, ships its filesystem layers in a format from 1979.

This is not nostalgia. This is not inertia. This is the natural consequence of getting the abstraction right the first time. tar solved "flatten a file tree into a stream" so completely that nobody has needed to solve it again. The attempts exist. They are called zip, rar, 7z, and a dozen others. They all bundle archiving with compression, and zip, for one, keeps its index at the end of the file, which makes them awkward participants in a Unix pipeline. They are closed systems. tar is an open stream.

The Dash

The character - in tar cf - means standard output. It is perhaps the most powerful single character in all of Unix. It transforms tar from a file-writing tool into a stream-emitting component. Without it, tar would write to a file and the pipeline would not exist. With it, tar's output becomes the input of whatever tool comes next. The entire composition model, the entire reason tar has survived for nearly half a century, rests on one convention: the dash means stdout.

This convention was not tar's invention. It is a Unix tradition that predates tar itself. But tar is the tool that demonstrates its power most clearly, because the alternative is so obviously worse. Without the dash, you would need to create the archive as a temporary file, compress that into a second temporary file, encrypt that into the final file, then clean up. Four steps, two temporary files, and the hope that you have enough disk space for all three files at once. With the dash, you have one line, zero temporary files, and memory usage that stays bounded regardless of archive size.
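The dash works on the reading side as well: with x, f - means standard input. A classic use is copying a tree with its metadata intact by piping one tar straight into another. This is a minimal sketch with hypothetical paths. The -C option (change directory before acting) and p (keep permissions) are used as they exist in GNU tar and bsdtar.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

mkdir -p "$work/src/bin" "$work/dst"
printf '#!/bin/sh\necho hi\n' > "$work/src/bin/tool"
chmod 750 "$work/src/bin/tool"

# First tar writes the stream to stdout; second tar reads it from stdin.
# No archive file ever exists on disk.
tar cf - -C "$work/src" . | tar xpf - -C "$work/dst"

ls -l "$work/dst/bin/tool"
```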

The Format

The original tar format used 512-byte header blocks. Each file in the archive is preceded by a header containing the filename (limited to 100 characters), permissions, owner, group, size, modification time, and a checksum of the header itself. The numeric fields are stored as octal ASCII strings. The file data follows, padded to the next 512-byte boundary. The archive ends with two blocks of zero bytes.
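The layout is simple enough to read with dd. The sketch below pulls the name and size out of the first header block, using the conventional field offsets (name at byte 0, size at byte 124). It assumes the first block of the archive is the file's own header, which holds for a short file name with common tar implementations; the field helper and the file names are hypothetical.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

printf 'hello' > greeting.txt      # exactly 5 bytes of data
tar cf one.tar greeting.txt

# Read a fixed-width field from the first 512-byte header block.
# Arguments: byte offset, field length. NUL padding is stripped.
field() {
    dd if=one.tar bs=1 skip="$1" count="$2" 2>/dev/null | tr -d '\000'
}

name=$(field 0 100)          # file name, NUL-padded text
size_octal=$(field 124 11)   # size: octal digits in a 12-byte field

# printf treats a leading 0 as octal, which converts the field to decimal.
size=$(printf '%d' "0$size_octal")
echo "name=$name size=$size"

# The archive length is a whole number of 512-byte blocks.
echo "remainder: $(( $(wc -c < one.tar) % 512 ))"
```

The archive on disk may be larger than header, data, and trailer alone would suggest, because implementations commonly pad the output to a larger record size.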

This format has been extended several times. POSIX.1-1988 introduced the ustar format with longer filenames (up to 256 characters via prefix splitting) and device number support. GNU tar added its own extensions for arbitrary-length filenames, sparse files, and incremental backups. The pax format (POSIX.1-2001) introduced extensible headers as key-value pairs, allowing essentially unlimited metadata without breaking backwards compatibility.
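The effect of the pax extension is easiest to see with a name the older header cannot hold. The sketch below stores a 120-character file name by asking for the pax format explicitly. It assumes a tar that accepts --format=pax (GNU tar and bsdtar both do); the file name is generated and hypothetical.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

# A 120-character file name: too long for the 100-byte name field, and
# with no "/" in it there is nowhere for ustar's prefix split to cut.
long_name=$(printf '%0120d' 0)
printf 'payload\n' > "$long_name"

# Ask for the pax format explicitly; the name travels in an extended header.
tar --format=pax -cf long.tar "$long_name"

# The listing returns the full name.
listed=$(tar tf long.tar)
echo "${#listed} characters"
```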

Every extension preserved the 512-byte block structure. Older implementations can generally still walk an extended archive block by block; at worst they surface an unfamiliar header as an extra file rather than failing outright. This is how you evolve a format over forty-seven years without breaking anything: you make the extensions additive and the core immutable.

The Verdict

tar is not clever. It does not compress. It does not encrypt. It does not verify. It does not sign. It does not deduplicate. It flattens a tree into a stream, preserves the metadata, and gets out of the way. It has been doing this since 1979, when the magnetic tapes it was designed for were state of the art, and it will continue doing this long after the last person who remembers what a magnetic tape looked like has retired.

The tools that tried to do everything are the tools that had to be replaced. The tool that did one thing is the tool that survived. tar is not a relic of the Unix era. tar is the proof that the Unix era was right.

One job. One stream. One dash. Forty-seven years. The format designed for tape drives now ships every container layer on the internet. That is not survival. That is vindication.