Vivian Voss

Technical Beauty: git

git architecture unix

Technical Beauty ■ Episode 02

In April 2005, Linus Torvalds sat down and, in what one might charitably describe as a fit of productive rage, wrote a version control system. Ten days later, git hosted its own source code. Twenty years on, it powers over 100 million repositories on GitHub alone. It has never shipped a breaking change. The binary is roughly 10 MB. It requires no server, no database, and no dependencies beyond libc. The rest of the industry, in the same twenty years, has built an entire civilisation of runners, registries, and orchestration layers on top of it. The foundation weighs less than the scaffolding.

The Provocation

The origin story is, by any reasonable measure, absurd. The Linux kernel had been using BitKeeper, a proprietary distributed version control system, since 2002. Larry McVoy, BitKeeper’s creator, offered free licences to open-source projects on the condition that nobody reverse-engineered the protocol. In 2005, Andrew Tridgell (of Samba fame) did precisely that, telnetted to a BitKeeper server, typed help, and recorded the output. McVoy revoked the free licence. The kernel community, the largest collaborative software project in human history, suddenly had no version control system.

Torvalds evaluated the existing alternatives. CVS was, in his own words, the example of what not to build. Subversion aimed to be “CVS done right,” which he considered to be missing the point entirely. Nothing met his requirements: performance, distributed operation, data integrity, and the ability to handle the kernel’s scale (tens of thousands of files, thousands of contributors, a history measured in hundreds of thousands of commits). So he wrote his own. He named it “git,” which in British English is slang for an unpleasant person, and which Torvalds has cheerfully acknowledged was self-referential.

The Architecture

What makes git remarkable is not the interface. The interface, as anyone who has attempted to explain git rebase --onto to a colleague will confirm, is not git’s strongest suit. The beauty is in the data model.

Git is, at its core, a content-addressable filesystem. Every piece of data is stored as an object, identified by the SHA-1 hash of its content. There are precisely four object types: blobs (file content), trees (directory listings), commits (snapshots with metadata), and tags (named references to commits). That is the entire data model. Four nouns. Everything else in git, every branch, every merge, every rebase, every cherry-pick, is an operation on these four objects.
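The addressing scheme is simple enough to reproduce in a few lines. A minimal sketch in Python (the same arithmetic git performs, not git’s implementation): git names a blob by SHA-1-hashing a header of the form `blob <size>\0` followed by the raw bytes, which is why identical content always lands at the same address.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Compute the object id git assigns to file content.

    Git hashes the framing header "blob <size>\\0" followed by the
    raw bytes; the SHA-1 of that is the object's name in the store.
    """
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# The empty blob always hashes to the same well-known id,
# in every repository on earth:
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the name is derived from the content, deduplication is free: a file that appears unchanged in ten thousand commits is stored exactly once.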

[Diagram: The Object Model — four types; every operation derives from these. A ref (a named pointer such as main or v2.0) points to a commit (snapshot + metadata), which points to a tree (directory listing); the tree points to blobs (content) and subtrees, and each commit points back to its parent commit. The principle: same content → same hash → stored once. Change one byte → different hash → new object. Corruption changes the hash → instant detection. Content-addressable. Immutable. Verifiable. Four types.]

The content-addressable design is what gives git its integrity guarantees. If a single byte changes anywhere in the history, the hash changes. If the hash changes, every object that references it must also change. Corruption is not merely detectable; it is structurally impossible to hide. You cannot tamper with a git repository without altering every hash from the point of tampering to the present. This is not a feature that was bolted on after a security audit. It is the architecture itself.
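The ripple effect is easy to demonstrate with a toy hash chain — a sketch of the principle only, not git’s actual commit format: each “commit” here hashes its content together with its parent’s hash, so editing any historical entry changes every hash downstream of it.

```python
import hashlib

def chain_hashes(entries):
    """Hash each entry together with its parent's hash (toy Merkle chain)."""
    hashes, parent = [], ""
    for entry in entries:
        h = hashlib.sha1((parent + entry).encode()).hexdigest()
        hashes.append(h)
        parent = h
    return hashes

history  = ["add README", "fix typo", "release v1.0"]
tampered = ["add README!", "fix typo", "release v1.0"]  # one byte changed at the root

original = chain_hashes(history)
forged = chain_hashes(tampered)

# Every hash from the tampered entry onward differs -- the edit cannot hide.
print([a != b for a, b in zip(original, forged)])  # [True, True, True]
```

In git the same chaining runs through real objects: a commit names its tree, the tree names its blobs and subtrees, and each commit names its parents. Tampering anywhere forces a rewrite of every hash between that point and the tip.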

The Distribution

Before git, version control was centralised. Subversion, CVS, Perforce: they all operated on the assumption that there was one true repository, stored on a server, and developers would check out working copies. The server was the authority. If the server was down, you could not commit. If the server was lost, you had lost your history. The entire development workflow depended on network connectivity to a single point of failure.

Git inverted this. Every clone is a complete repository. Every developer has the full history, every branch, every tag, every commit, stored locally. You can commit, branch, merge, bisect, and search history on an aeroplane with the Wi-Fi off. The “server” in a git workflow (GitHub, GitLab, Bitbucket) is not architecturally special. It is simply another clone that everyone has agreed to treat as canonical. The protocol does not enforce this convention. The humans do.

[Diagram: Centralised vs Distributed — star topology vs full-copy mesh. SVN / CVS / Perforce: one server with working copies (WC) attached; server down? Work stops. git: every node is a full repository; any node down? Everyone has everything. WC = working copy (no history). Full = complete repository with all history.]

This design decision was not academic. Torvalds built git to solve a specific problem: the Linux kernel has contributors across every timezone, many of them working offline, all of them generating patches that must eventually converge into a single tree. A centralised system would have been a bottleneck. A distributed system turned every contributor into a potential backup. The architecture matched the social structure of the project.

The Custodian

Torvalds wrote git. He did not maintain it. In July 2005, three months after the first commit, Junio C Hamano took over as maintainer. He has held that position for twenty years. This is not a footnote; it is the central fact of git’s success.

Torvalds built a prototype that worked. Hamano turned it into infrastructure that lasts. He has reviewed every merge into the mainline. He has maintained backwards compatibility across twenty years of releases. He has shepherded the codebase through the transition from SHA-1 to SHA-256, a cryptographic migration that most projects would treat as a licence to ship a breaking v2. Git is doing it without breaking a single workflow. Code written against git in 2005 still works in 2025. Not approximately. Exactly.

The industry rarely celebrates maintainers. It celebrates creators, founders, people who “ship fast.” Hamano did not ship fast. He shipped correctly, repeatedly, for two decades. The Linux kernel may be Torvalds’s cathedral, but git is Hamano’s.

The Footprint

The git binary is approximately 10 MB. It depends on libc, zlib for compression, and OpenSSL or a compatible library for transport encryption. That is it. No runtime. No virtual machine. No garbage collector. No package manager pulling in transitive dependencies from a registry that anyone can publish to.

10 MB to track every file you will ever write. To branch instantaneously (a branch is a 41-byte file containing a SHA-1 hash). To merge with a three-way algorithm that handles the majority of conflicts automatically. To bisect a regression across ten thousand commits in O(log n) steps. To search the entire history of a project with git log -S and find the exact commit that introduced or removed a string.
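The bisection claim checks out arithmetically. A sketch of the search in Python (plain binary search over a linear history; the real `git bisect` also copes with branchy histories): finding the first bad commit among ten thousand takes at most ⌈log₂ 10,000⌉ = 14 probes.

```python
def bisect_first_bad(n_commits, is_bad):
    """Binary-search a linear history for the first commit where is_bad is True.

    Returns (index, number_of_probes). Assumes the history flips from good
    to bad exactly once -- the same invariant git bisect relies on.
    """
    lo, hi, probes = 0, n_commits - 1, 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if is_bad(mid):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return lo, probes

# Hypothetical regression introduced at commit 7341 of 10,000:
first_bad, probes = bisect_first_bad(10_000, lambda i: i >= 7341)
print(first_bad, probes <= 14)  # 7341 True
```

Fourteen test runs to isolate one commit in ten thousand. That is the entire trick: the history is ordered, so the search is logarithmic.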

For context: a fresh create-react-app installation (before you have written a single component) weighs approximately 300 MB across 2,839 packages. Git tracks the entire Linux kernel, 35 million lines of code with a history spanning two decades, in a .git directory smaller than that.

The Contrast

Git itself is 10 MB. The infrastructure the industry has built around git is considerably larger. GitLab runners. Docker registries. Kubernetes clusters to host the runners that run the containers that execute the pipelines that test the code that git already tracks. Branch protection rules implemented in a SaaS platform that charges per seat for the privilege of enforcing policies on a tool that is free. Code review interfaces that require a browser, a login, an internet connection, and occasionally a prayer to render the diff that git diff would have shown you locally in milliseconds.

The pipeline breaks every sprint. Not because git broke. Git has never broken. The pipeline breaks because the sixteen layers of abstraction between “developer pushes code” and “code runs in production” each introduce their own failure modes, their own configuration languages, their own versioning schemes, and their own opinions about how software should be deployed. Git’s opinion is simple: here is the content, here is the hash, here is the history. What you do with it is your problem.

This is not git’s limitation. This is git’s discipline. A tool that tries to solve every problem solves none of them well. A tool that solves one problem, content tracking, and solves it completely, becomes the foundation upon which everything else is built. The foundation does not need to know about CI/CD. The CI/CD needs to know about the foundation.

The Longevity

Git has been backwards compatible since 2005. Repositories created twenty years ago can be cloned, inspected, and worked on with the latest release. The pack format has evolved (packfile bitmaps, multi-pack indices, commit graphs), but the fundamental object model has not changed. A blob is still a blob. A tree still references blobs and other trees. A commit still references a tree and zero or more parents.

Twenty years of backwards compatibility in a tool used by millions of developers across every operating system, every programming language, every industry. The mailing list still operates. Patches are still reviewed. The release cadence is still regular. No venture capital. No acquisition. No pivot to enterprise. No pivot at all. Just a content-addressable filesystem that does what it says and says what it does.

10 MB. No server. No database. No dependencies. Twenty years, zero breaking changes. They built what lasts.