Technical Beauty · Episode 11
In 1996, Andrew Tridgell had a problem that anyone who has waited for a file transfer will recognise: synchronising data over slow links. The naive approach is to copy everything. Every file, every time, even if only a single byte has changed. This is not synchronisation. It is capitulation. It is an admission that you cannot determine what has changed, so you transfer the lot and hope the pipe is wide enough.
Tridgell decided that the pipe was not wide enough and never would be. The correct solution was not a faster pipe. It was to stop sending data that the other end already had. The tool he built is rsync. It has been the default answer to file synchronisation on every Unix system for twenty-nine years, and the algorithm at its core remains, in the strictest sense, beautiful.
The Engineer
Tridgell is one of those quietly devastating contributors whom the industry relies upon without quite acknowledging. He wrote rsync for his PhD thesis at the Australian National University. He also wrote Samba, which allows Unix systems to speak Windows file-sharing protocols, a reverse-engineering effort of such thoroughness that Microsoft eventually published the specifications rather than continue losing the arms race. His rsync algorithm influenced the delta compression used in Git's pack files. He wrote ccache, a compiler cache that silently cuts rebuild times by reusing previous compilations. Each project follows the same instinct: identify the redundant work and eliminate it.
The Problem
You have a file on machine A. You have an older version of the same file on machine B. The file is one gigabyte. The change is one kilobyte. The traditional approach transfers one gigabyte. The time required is one gigabyte divided by your bandwidth. If the link is slow, as most links were in 1996 and many still are, you wait. If the file is large, you wait longer. If both, you go for lunch.
The problem is that both machines have almost identical data, but neither knows which parts are identical without comparing them. And comparing them byte-by-byte requires transferring the entire file, which defeats the purpose. You need a way to determine what has changed without looking at every byte. You need checksums. But not just any checksums.
The Algorithm
The rsync algorithm operates in four steps, each elegant in its economy.
First, the receiving machine splits its version of the file into fixed-size blocks and computes two checksums for each block: a fast rolling checksum (cheap to compute, prone to false positives) and a strong cryptographic hash (expensive to compute, essentially certain). These checksums are sent to the sender. The payload is small: a few bytes per block, regardless of how large the file is.
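The receiver's side of this first step can be sketched in a few lines. This is a minimal model, not rsync's actual code: the weak checksum is the simplified two-part sum from the 1996 report, MD5 stands in for the strong hash, and the block size is illustrative (real rsync scales it with file size).

```python
import hashlib

BLOCK = 700  # illustrative block size; real rsync negotiates this


def weak(block: bytes) -> int:
    # Simplified two-part checksum: a = byte sum, b = position-weighted sum.
    a = sum(block) % (1 << 16)
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % (1 << 16)
    return (b << 16) | a


def signatures(old: bytes) -> list:
    # One (weak, strong) pair per fixed-size block of the receiver's file.
    return [(weak(old[o:o + BLOCK]), hashlib.md5(old[o:o + BLOCK]).digest())
            for o in range(0, len(old), BLOCK)]


old = bytes(range(256)) * 4096            # a 1 MiB stand-in file
sigs = signatures(old)
# roughly 20 bytes of checksum per 700-byte block: about 3% of the file
```

The payload the receiver sends is the `sigs` list, and its size depends only on the block count, not on how different the two files turn out to be.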
Second, the sender takes its version of the file and slides a window across it, computing the rolling checksum at each byte position. The rolling checksum is the key innovation. It can be updated incrementally: when the window slides one byte forward, the new checksum is computed from the old checksum, the byte entering the window, and the byte leaving it. No re-reading the entire block. The cost per position is constant.
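The incremental update can be demonstrated directly. This sketch uses the simplified two-part checksum from the 1996 report (names are illustrative); the assertion confirms that sliding the window one byte gives exactly the value a full recomputation would.

```python
M = 1 << 16  # 16-bit modulus, as in the simplified scheme


def weak_checksum(block: bytes) -> int:
    # Full recomputation: a = byte sum, b = position-weighted sum.
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return (b << 16) | a


def roll(old: int, out_byte: int, in_byte: int, n: int) -> int:
    # O(1) update when the window of length n slides one byte forward:
    # subtract the byte leaving, add the byte entering, adjust the weighted sum.
    a = ((old & 0xFFFF) - out_byte + in_byte) % M
    b = (((old >> 16) & 0xFFFF) - n * out_byte + a) % M
    return (b << 16) | a


data = b"the quick brown fox jumps over the lazy dog"
n = 8
s = weak_checksum(data[:n])
for k in range(1, len(data) - n + 1):
    s = roll(s, data[k - 1], data[k + n - 1], n)
    assert s == weak_checksum(data[k:k + n])  # incremental == full recompute
```

The loop touches two bytes per position instead of `n`, which is the whole point: scanning a file costs the same per byte no matter how large the blocks are.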
Third, when the rolling checksum matches one of the receiver’s blocks, the sender computes the strong checksum and compares it. If both match, the block is identical, and the sender records a reference: “use the receiver’s block N here.” If the rolling checksum matches nothing, or the strong checksum disagrees (a false positive), the byte at the front of the window is new; the sender collects these literal bytes and transmits them alongside the block references. Fourth, the receiver reconstructs the file from its existing blocks and the new bytes. The transfer contains only what changed.
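Put together, a toy end-to-end delta fits in a page. This is a sketch under simplifying assumptions, not rsync's wire protocol: the sender recomputes the weak checksum at every offset instead of rolling it, literals are emitted one byte at a time, and weak-checksum collisions in the lookup table are simply overwritten (the strong check still guarantees correctness).

```python
import hashlib

BLOCK = 8      # tiny block size so the example stays readable
M = 1 << 16


def weak(block: bytes) -> int:
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return (b << 16) | a


def delta(new: bytes, sigs: dict) -> list:
    # Sender: emit ("copy", N) where both checksums match a receiver block,
    # ("lit", byte) everywhere else.
    ops, k = [], 0
    while k < len(new):
        window = new[k:k + BLOCK]
        hit = sigs.get(weak(window))
        if hit is not None and hit[1] == hashlib.md5(window).digest():
            ops.append(("copy", hit[0]))   # "use the receiver's block N here"
            k += BLOCK
        else:                              # no weak hit, or a false positive
            ops.append(("lit", new[k:k + 1]))
            k += 1
    return ops


def reconstruct(old: bytes, ops: list) -> bytes:
    # Receiver: rebuild from its own blocks plus the transmitted literals.
    out = bytearray()
    for kind, v in ops:
        out += old[v * BLOCK:(v + 1) * BLOCK] if kind == "copy" else v
    return bytes(out)


old = b"0123456789abcdefghijklmnopqrstuvwxyz" * 4
new = old[:20] + b"XYZ" + old[20:]         # a three-byte insertion mid-file
sigs = {weak(old[o:o + BLOCK]): (o // BLOCK, hashlib.md5(old[o:o + BLOCK]).digest())
        for o in range(0, len(old), BLOCK)}
ops = delta(new, sigs)
assert reconstruct(old, ops) == new        # only the literals crossed the wire
```

Note what the insertion does: a few positions after the new bytes, the sliding window realigns with an old block boundary and the sender goes back to emitting cheap references. That realignment is why an insertion costs a handful of literals, not everything downstream of it.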
A one-gigabyte file with one kilobyte of changes transfers roughly one kilobyte of literal data, plus the compact block checksums. The rest is already there. rsync knows it, proves it, and does not waste the pipe.
The Contrast
Modern “sync” solutions want a cloud account. They want a subscription. They want a proprietary protocol. They want your data on their servers, encrypted with their keys, governed by their terms of service, available until the company pivots, merges, or discovers that your tier is no longer profitable.
rsync wants a source, a destination, and optionally SSH.
rsync -avz source/ destination/
That is the complete invocation. -a for archive mode: preserve permissions, timestamps, symlinks, ownership. -v for verbose: show what is happening. -z for compression during transfer. No configuration file. No daemon requirement. No account. No vendor. No subscription. No terms of service. The man page, one should note, is longer than most modern tools’ dependency lists.
The Legacy
rsync is not merely a tool. It is an algorithm that has been adopted, adapted, and embedded across the industry. Apple’s Time Machine borrowed the hard-link snapshot approach that rsync pioneered. rclone, the cloud storage synchronisation tool, implements the same algorithmic principles. duplicity builds on librsync for incremental encrypted backups, and borgbackup applies a rolling hash of its own to content-defined chunking. The --link-dest flag alone, which creates hard-linked snapshot trees for zero-cost incremental backups, has saved more disk space than most compression algorithms.
Eighteen thousand lines of C. No external dependencies beyond a C library and optionally SSH. Compiles on everything. Runs on everything. Handles interruptions: if the transfer fails halfway, rsync resumes from where it stopped, not from the beginning. This is not a feature that was requested. It is a consequence of the block-level architecture. When your algorithm tracks which blocks have been acknowledged, resume is free.
The Reduction
What makes rsync technically beautiful is the rolling checksum. Everything else (the resume capability, the bandwidth efficiency, the incremental backups) follows from a single insight: you can compute a checksum incrementally, and therefore you can find matching blocks without comparing every byte. The entire algorithm is a consequence of one mathematical property of one hash function.
Tridgell described this in his 1996 technical report, and the industry has been using the result for three decades without improving upon it. There have been faster implementations. There have been alternative checksums. There have been wrappers, frontends, and entire businesses built on top. But the core algorithm (split, checksum, compare, transfer only the differences) has not changed because it does not need to. It was correct in 1996. It is correct now.
One algorithm. One command. Twenty-nine years. rsync does not synchronise your data. It synchronises only the differences. That distinction is the entire tool.