Vivian Voss

Technical Beauty: rsync

unix tooling architecture

Technical Beauty ■ Episode 11

In 1996, Andrew Tridgell had a problem that anyone who has waited for a file transfer will recognise: synchronising data over slow links. The naive approach is to copy everything. Every file, every time, even if only a single byte has changed. This is not synchronisation. It is capitulation. It is an admission that you cannot determine what has changed, so you transfer the lot and hope the pipe is wide enough.

Tridgell decided that the pipe was not wide enough and never would be. The correct solution was not a faster pipe. It was to stop sending data that the other end already had. The tool he built is rsync. It has been the default answer to file synchronisation on every Unix system for twenty-nine years, and the algorithm at its core remains, in the strictest sense, beautiful.

The Engineer

Tridgell is one of those quietly devastating contributors whom the industry relies upon without quite acknowledging. He wrote rsync for his PhD thesis at the Australian National University. He also wrote Samba, which allows Unix systems to speak Windows file-sharing protocols, a reverse-engineering effort of such thoroughness that Microsoft eventually published the specifications rather than continue losing the arms race. His attempt to interoperate with BitKeeper precipitated the creation of Git. He wrote ccache, a compiler cache that silently halves build times. Each project follows the same instinct: identify the redundant work and eliminate it.

The Problem

You have a file on machine A. You have an older version of the same file on machine B. The file is one gigabyte. The change is one kilobyte. The traditional approach transfers one gigabyte. The time required is one gigabyte divided by your bandwidth. If the link is slow, as most links were in 1996 and many still are, you wait. If the file is large, you wait longer. If both, you go for lunch.

The problem is that both machines have almost identical data, but neither knows which parts are identical without comparing them. And comparing them byte-by-byte requires transferring the entire file, which defeats the purpose. You need a way to determine what has changed without looking at every byte. You need checksums. But not just any checksums.

[Figure: The Delta Algorithm. (1) The receiver splits its file into blocks and computes a rolling and a strong checksum for each. (2) The checksums travel to the sender as a tiny payload. (3) The sender rolls a window over its copy and finds the matching blocks. (4) Only the non-matching data is transferred: a 1 GB file with 1 KB changed transfers roughly 1 KB. Rolling checksum: fast but approximate. Strong checksum: slow but certain. Two-pass verification, minimum bandwidth, maximum correctness. The algorithm that made backups survivable since 1996.]

The Algorithm

The rsync algorithm operates in four steps, each elegant in its economy.

First, the receiving machine splits its version of the file into fixed-size blocks and computes two checksums for each block: a fast rolling checksum (cheap to compute, prone to false positives) and a strong cryptographic hash (expensive to compute, essentially certain). These checksums are sent to the sender. The payload is small: a few bytes per block, regardless of how large the file is.
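This first step can be sketched in a few lines of Python. It is an illustration, not rsync's C internals: the block size is fixed arbitrarily here, the weak checksum follows the Adler-32-style scheme from the 1996 report, and MD5 stands in for the strong hash (early rsync used MD4).

```python
import hashlib

BLOCK_SIZE = 700  # rsync chooses a block size heuristically; fixed here for illustration

def weak_checksum(data):
    """Adler-32-style weak checksum from the 1996 report:
    a = sum of bytes, b = position-weighted sum, both mod 2^16,
    packed into a single 32-bit value."""
    a = sum(data) % 65536
    b = sum((len(data) - i) * byte for i, byte in enumerate(data)) % 65536
    return (b << 16) | a

def block_signatures(data):
    """The receiver's side of step one: one (weak, strong) pair per block."""
    sigs = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        sigs.append((weak_checksum(block), hashlib.md5(block).digest()))
    return sigs
```

At four bytes of weak checksum and sixteen bytes of digest per block, the signature list for a gigabyte-sized file is a few tens of kilobytes: the "tiny payload" of the diagram.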

Second, the sender takes its version of the file and slides a window across it, computing the rolling checksum at each byte position. The rolling checksum is the key innovation. It can be updated incrementally: when the window slides one byte forward, the new checksum is computed from the old checksum, the byte entering the window, and the byte leaving it. No re-reading the entire block. The cost per position is constant.
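The incremental update is easiest to see in code. A minimal sketch, assuming the same Adler-32-style weak checksum as above: the checksum is computed from scratch once, then maintained by constant-time updates as the window slides, and the two always agree.

```python
M = 1 << 16  # the modulus from the 1996 report

def weak(data):
    """Weak checksum computed from scratch: O(window length)."""
    a = sum(data) % M
    b = sum((len(data) - i) * x for i, x in enumerate(data)) % M
    return a, b

def roll(a, b, out_byte, in_byte, window_len):
    """Slide the window one byte: constant-time update from the old (a, b),
    the byte leaving the window, and the byte entering it."""
    a = (a - out_byte + in_byte) % M
    b = (b - window_len * out_byte + a) % M
    return a, b

# The rolled checksum equals the one recomputed from scratch at every position.
data = b"the quick brown fox jumps over the lazy dog"
W = 8
a, b = weak(data[:W])
for pos in range(1, len(data) - W + 1):
    a, b = roll(a, b, data[pos - 1], data[pos - 1 + W], W)
    assert (a, b) == weak(data[pos:pos + W])
```

The loop at the bottom is the whole point: a thousand slides cost a thousand constant-time updates, not a thousand full recomputations.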

When the rolling checksum matches one of the receiver’s blocks, the sender computes the strong checksum and compares it. If both match, the block is identical, and the sender records a reference: “use the receiver’s block N here.” If the rolling checksum matches nothing, or the strong checksum exposes a false positive, the byte at the front of the window is new data, and the window slides on. The sender collects these non-matching bytes and transmits them. The receiver reconstructs the file from its existing blocks and the new bytes. The transfer contains only what changed.
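The whole exchange fits in a short sketch. The helper names (`signatures`, `delta`, `rebuild`) and the tiny block size are invented for illustration, and this version recomputes the weak checksum at each position for clarity; real rsync maintains it incrementally as described above.

```python
import hashlib

BLOCK = 4  # tiny block size so the demo is easy to follow; rsync uses hundreds of bytes

def weak(data):
    a = sum(data) % 65536
    b = sum((len(data) - i) * x for i, x in enumerate(data)) % 65536
    return (b << 16) | a

def signatures(old):
    """Receiver: weak checksum -> [(block index, strong hash)] lookup table."""
    table = {}
    for idx in range(len(old) // BLOCK):
        blk = old[idx * BLOCK:(idx + 1) * BLOCK]
        table.setdefault(weak(blk), []).append((idx, hashlib.md5(blk).digest()))
    return table

def delta(new, table):
    """Sender: a list of ('copy', block_index) and ('data', bytes) instructions."""
    ops, literal, pos = [], bytearray(), 0
    while pos + BLOCK <= len(new):
        window = new[pos:pos + BLOCK]
        match = None
        for idx, strong in table.get(weak(window), []):
            if hashlib.md5(window).digest() == strong:  # confirm with the strong hash
                match = idx
                break
        if match is not None:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            pos += BLOCK
        else:
            literal.append(new[pos])  # this byte must actually be sent
            pos += 1
    literal.extend(new[pos:])  # tail shorter than a block is always literal
    if literal:
        ops.append(("data", bytes(literal)))
    return ops

def rebuild(old, ops):
    """Receiver: splice referenced blocks and literal runs back together."""
    out = bytearray()
    for kind, arg in ops:
        out.extend(old[arg * BLOCK:(arg + 1) * BLOCK] if kind == "copy" else arg)
    return bytes(out)
```

Run `rebuild(old, delta(new, signatures(old)))` and the receiver's reconstruction equals the sender's file, with only the `'data'` instructions having crossed the wire as fresh bytes.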

A one-gigabyte file with one kilobyte of changes transfers approximately one kilobyte. The rest is already there. rsync knows it, proves it, and does not waste the pipe.

The Contrast

Modern “sync” solutions want a cloud account. They want a subscription. They want a proprietary protocol. They want your data on their servers, encrypted with their keys, governed by their terms of service, available until the company pivots, merges, or discovers that your tier is no longer profitable.

rsync wants a source, a destination, and optionally SSH.

rsync -avz source/ destination/

That is the complete invocation. -a for archive mode: recurse into directories and preserve permissions, timestamps, symlinks, ownership. -v for verbose: show what is happening. -z for compression during transfer. No configuration file. No daemon requirement. No account. No vendor. No subscription. No terms of service. The man page, one should note, is longer than most modern tools’ dependency lists.

The Legacy

rsync is not merely a tool. It is an algorithm that has been adopted, adapted, and embedded across the industry. Apple’s Time Machine uses the same hard-link snapshot technique that rsync exposes as --link-dest. rclone, the cloud storage synchronisation tool, implements the same algorithmic principles. duplicity builds on librsync for incremental encrypted backups, and borgbackup pursues the same goal with its own content-defined chunking. The --link-dest flag alone, by hard-linking unchanged files into each new snapshot for zero-cost incremental backups, has saved more disk space than most compression algorithms.

Eighteen thousand lines of C. No external dependencies beyond a C library and optionally SSH. Compiles on everything. Runs on everything. Handles interruptions: if the transfer fails halfway, a rerun resumes from roughly where it stopped, not from the beginning, because the delta algorithm recognises the data that already arrived. This is not a feature that was requested. It is a consequence of the block-level architecture. When your algorithm can identify which blocks the other end already holds, resume is free.

The Reduction

What makes rsync technically beautiful is the rolling checksum. Everything else (the resume capability, the bandwidth efficiency, the incremental backups) follows from a single insight: you can compute a checksum incrementally, and therefore you can find matching blocks without comparing every byte. The entire algorithm is a consequence of one mathematical property of one hash function.
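That property can be written down in a few lines. In the notation of the 1996 technical report, for a window of bytes $X_k$ through $X_l$ and modulus $M = 2^{16}$, the weak checksum is

$$a(k,l) = \Bigl(\sum_{i=k}^{l} X_i\Bigr) \bmod M, \qquad b(k,l) = \Bigl(\sum_{i=k}^{l} (l - i + 1)\,X_i\Bigr) \bmod M, \qquad s(k,l) = a(k,l) + 2^{16}\,b(k,l)$$

and sliding the window one byte is a constant-time update:

$$a(k+1,\,l+1) = \bigl(a(k,l) - X_k + X_{l+1}\bigr) \bmod M$$
$$b(k+1,\,l+1) = \bigl(b(k,l) - (l - k + 1)\,X_k + a(k+1,\,l+1)\bigr) \bmod M$$

Two subtractions and two additions per position, regardless of block size. That constant is the whole trick.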

Tridgell and Paul Mackerras described this in their 1996 technical report, and the industry has been using the result for three decades without improving upon it. There have been faster implementations. There have been alternative checksums. There have been wrappers, frontends, and entire businesses built on top. But the core algorithm (split, checksum, compare, transfer only the differences) has not changed because it does not need to. It was correct in 1996. It is correct now.

One algorithm. One command. Twenty-nine years. rsync does not synchronise your data. It synchronises only the differences. That distinction is the entire tool.