Vivian Voss

Technical Beauty: sed

Technical Beauty ■ Episode 23

“In no case is the output to be considered an improvement on Coleridge.”

That footnote appears in a 1978 technical memorandum from Bell Labs. The memorandum describes a stream editor. The test text is Samuel Taylor Coleridge’s Kubla Khan. The author is a psychologist. His name is Lee E. McMahon, and the tool the memorandum describes, which he had built five years earlier, is still running on your machine right now.

The Psychologist

McMahon’s path to Bell Labs does not begin in a computer science department, because in 1960 there were precious few of those. It begins at Harvard, where he completed a PhD in psychology. His thesis: “Grammatical analysis as a part of understanding a sentence.” He studied how human beings parse grammar: how the mind disassembles a sentence into structure, extracts meaning from sequence, resolves ambiguity through context. Then he built tools that do the same thing to text files.

The connection between the disciplines is not incidental. It is the entire point. McMahon understood language as structure before he understood it as data. When he arrived at Bell Labs, initially to work on a statistical analysis of the Federalist Papers (determining disputed authorship through word frequency patterns), he found an environment that took text seriously. Bell Labs in the 1970s was the place where text processing was not a means to an end. It was the end. Unix was built on the premise that text is the universal interface, and McMahon was the rare person who understood both sides of that interface: the human grammar and the machine grammar.

He stayed. He built sed. He built comm. He contributed to grep, to qsort, and to the Bell Labs Datakit network. On the side, he invented the McMahon pairing system for Go tournaments, a variant of the Swiss system still used in competitive play today. He died on 15 February 1989, aged 57. The tools outlived him by decades. They will outlive us all.

The Architecture

The design of sed is so economical that describing it takes longer than understanding it. There are two buffers: the pattern space and the hold space. There is one loop: read a line into the pattern space, apply the command script, output the result, advance to the next line. There is no file loading. A one-gigabyte log file and a one-kilobyte configuration file consume the same amount of memory, because sed holds only the current line plus whatever the script chooses to park in the hold space. One line at a time, one pass through the input, no going back.
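The loop is small enough to sketch. What follows is a minimal model in Python, not sed’s actual source: the names `stream_edit` and `Command` are invented for illustration, and Python’s `re` module stands in for sed’s own regular expressions. It is fed an endless input on purpose, to show that nothing waits for the end of the file.

```python
import itertools
import re
from typing import Callable, Iterable, Iterator, Optional, Sequence

# Hypothetical model: a command takes the pattern space and returns the
# edited pattern space, or None to delete the line (as sed's `d` would).
Command = Callable[[str], Optional[str]]


def stream_edit(lines: Iterable[str], script: Sequence[Command]) -> Iterator[str]:
    """Read a line, apply the script, emit the result, move on."""
    for raw in lines:
        pattern_space: Optional[str] = raw.rstrip("\n")
        for command in script:
            pattern_space = command(pattern_space)
            if pattern_space is None:   # line deleted: start the next cycle
                break
        else:
            yield pattern_space         # end of script: print what is left


script = [
    lambda ps: None if ps.startswith("#") else ps,   # roughly /^#/d
    lambda ps: re.sub("old", "new", ps),             # roughly s/old/new/g
]

# An input that never ends. Only the current line is ever held in memory.
endless = (f"# note {n}" if n % 2 else f"old line {n}" for n in itertools.count(1))

first_three = list(itertools.islice(stream_edit(endless, script), 3))
print(first_three)   # ['new line 2', 'new line 4', 'new line 6']
```

Because `stream_edit` is a generator, asking for three lines of output costs three cycles of the loop and no more, whatever the size of the input behind them.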

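[Figure: sed, the stream architecture. Input (stdin or file) feeds the pattern space, where the current line lives; the command script (s/old/new/g, d, p, ...) is applied to it; the result goes to output (stdout). The hold space is an auxiliary buffer reached through h, g and x. The cycle repeats with the next line until EOF.]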

That is the entire execution model. Twenty-five commands operate on these two buffers. s substitutes. d deletes. p prints. h copies the pattern space to the hold space. g copies the hold space back. x swaps them. q quits. The rest are variations on addressing, branching, and flow control. There is no buffer that grows with the input. There is no parse tree of the document. There is a line, a handful of commands, and a destination.
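The seven commands named above fit in a toy interpreter. This is a sketch under loud assumptions: `Cmd` and `run` are hypothetical names, addresses are limited to a single regular expression, Python’s regex dialect replaces sed’s, and s replaces only the first match, as sed does without the g flag. It models the behaviour described here, not any real implementation.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Sequence, Tuple


@dataclass
class Cmd:
    name: str                          # one of: s d p h g x q
    address: Optional[str] = None      # regex the line must match; None = every line
    args: Tuple[str, str] = ("", "")   # (pattern, replacement), used by `s` only


def run(lines: Iterable[str], script: Sequence[Cmd]) -> Iterator[str]:
    hold = ""                                    # the hold space starts empty
    for raw in lines:
        ps = raw.rstrip("\n")                    # the pattern space
        deleted = quitting = False
        for cmd in script:
            if cmd.address is not None and not re.search(cmd.address, ps):
                continue                         # address does not select this line
            if cmd.name == "s":
                ps = re.sub(cmd.args[0], cmd.args[1], ps, count=1)  # first match only
            elif cmd.name == "d":
                deleted = True
                break                            # drop the line, start the next cycle
            elif cmd.name == "p":
                yield ps                         # extra print, before the autoprint
            elif cmd.name == "h":
                hold = ps                        # pattern space -> hold space
            elif cmd.name == "g":
                ps = hold                        # hold space -> pattern space
            elif cmd.name == "x":
                ps, hold = hold, ps              # swap the two buffers
            elif cmd.name == "q":
                quitting = True
                break                            # finish this cycle, then stop
        if not deleted:
            yield ps                             # autoprint at the end of the cycle
        if quitting:
            return


# x alone: every line comes out one cycle late, and the last is never printed.
swapped = list(run(["a", "b", "c"], [Cmd("x")]))

# h remembers the heading; g pastes it over any line that reads "ditto".
recalled = list(run(["# title", "body", "ditto"],
                    [Cmd("h", address=r"^# "), Cmd("g", address=r"^ditto$")]))

# q prints the current line and stops reading.
stopped = list(run(["a", "stop", "b"], [Cmd("q", address=r"^stop")]))
```

Here `swapped` is `['', 'a', 'b']`, `recalled` is `['# title', 'body', '# title']`, and `stopped` is `['a', 'stop']`. The hold space is the only state that survives from one line to the next.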

The consequence of this design is that sed scales to arbitrary input size without scaling in resource consumption. It does not care whether the input is ten lines or ten billion. It has already forgotten line one by the time it reaches line two. This is not an optimisation. It is the architecture. McMahon did not make sed memory-efficient. He made the concept of memory irrelevant to the problem.

The Implementation

The most telling metric for any long-lived tool is not how much code it required at birth, but how much it required half a century later. The original sed in Version 7 Unix (1979) was 1,600 lines of C. BusyBox’s embedded implementation is 1,299 lines. FreeBSD’s is 2,100. GNU sed 4.9, the most feature-rich variant (with multibyte character support, in-place editing, and POSIX compliance), is 5,000 lines.

[Figure: lines of C across 47 years. V7 Unix (1979): 1,600. BusyBox: 1,299. FreeBSD: 2,100. GNU sed 4.9: 5,000, including multibyte support, -i and POSIX compliance.]

In forty-seven years, the codebase tripled. And that tripling bought Unicode, in-place file editing, and full POSIX compliance: genuine features, not refactoring for the sake of refactoring. Modern software, by comparison, triples every three years, typically without adding anything the user requested. GNU sed 4.9 has seen 51 commits in the last three years. Seventeen commits per year. After half a century of service, the maintenance burden is approximately one commit every three weeks. That is not stagnation. That is completion.
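The figures quoted above can be worked through in a few lines. Nothing here is new data: the constants are the numbers already given in the text, and the variable names are mine.

```python
V7_LINES = 1_600       # Version 7 Unix sed, 1979 (figure quoted above)
GNU_LINES = 5_000      # GNU sed 4.9 (figure quoted above)
COMMITS = 51           # commits over the period quoted above
YEARS = 3

growth_factor = GNU_LINES / V7_LINES          # 3.125
commits_per_year = COMMITS / YEARS            # 17.0
weeks_per_commit = 52 / commits_per_year      # about 3.06

print(f"{growth_factor:.3f}x growth, {commits_per_year:.0f} commits a year, "
      f"one every {weeks_per_commit:.1f} weeks")
```

A growth factor of 3.125 over forty-seven years, and a gap of roughly three weeks between commits.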

Turing Complete

In 2001, Christophe Blaess demonstrated that sed is Turing complete by implementing a Turing machine as a sed script. Twenty-five commands, two buffers, and a loop are sufficient to compute anything computable. The demonstration is not merely theoretical. People have written Sokoban in sed. They have written Tetris. They have written chess engines. Not because these are practical applications of a stream editor, but because the architecture McMahon designed in 1973 is so fundamentally sound that it accidentally became a general-purpose computation engine.

This is the hallmark of technical beauty: a tool designed to solve one problem so cleanly that it turns out to solve a much larger class of problems. McMahon did not set out to build a Turing-complete language. He set out to edit text without opening a file. The Turing completeness is a consequence of getting the primitives right.
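The mechanism behind that result is a substitution that can be repeated for as long as it keeps succeeding, which is what sed’s label and t (branch if a substitution was made) commands provide. The sketch below is not a proof and not Blaess’s construction. It is a small Python illustration, with the hypothetical name `is_balanced`, of one thing that loop can do which a single regular expression cannot: decide whether parentheses are balanced.

```python
import re


def is_balanced(text: str) -> bool:
    """Decide balance by rewriting the buffer until nothing changes.

    Mirrors the sed idiom `:a; s/()//g; ta`: substitute, then branch back
    to the label for as long as the substitution changed something.
    """
    pattern_space = re.sub(r"[^()]", "", text)        # keep only the parentheses
    while True:
        pattern_space, count = re.subn(r"\(\)", "", pattern_space)
        if count == 0:                                  # `t` would not branch here
            break
    return pattern_space == ""                          # anything left is unmatched


print(is_balanced("(a(b)c)"), is_balanced("(()"), is_balanced(")("))
```

Each pass deletes the innermost matched pairs. A balanced string shrinks to nothing, and anything else leaves a residue. A rewrite rule plus a conditional jump is already a small machine.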

The Performance

One might expect a tool from 1973 to have aged into irrelevance on modern hardware. One would be mistaken. David Lyness’s benchmark (processing two billion base64-encoded characters) measured sed at 82.1 million characters per second. awk managed 69.8 million. Python: 69.0 million. PHP: 21.2 million. Of the four tools measured, the one with the smallest codebase, the simplest architecture, and the oldest lineage is also the fastest. This is not a coincidence. It is a consequence. When there is nothing between the input and the output except twenty-five commands and two buffers, there is nothing to slow down.

[Figure: throughput on two billion base64 characters, in million characters per second (higher is better). sed: 82.1. awk: 69.8. Python: 69.0. PHP: 21.2. Source: Lyness benchmark.]
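Rates are easier to feel as waiting time. The sketch below converts the quoted throughput figures into the elapsed time each implies for the full two billion characters. These are derived values, simple division on the numbers above, not additional measurements.

```python
TOTAL_CHARS = 2_000_000_000          # two billion characters, as quoted
RATES = {                            # million characters per second, as quoted
    "sed": 82.1,
    "awk": 69.8,
    "Python": 69.0,
    "PHP": 21.2,
}

# Derived, not measured: the elapsed time implied by each quoted rate.
seconds = {tool: TOTAL_CHARS / (rate * 1_000_000) for tool, rate in RATES.items()}

# Derived: how many times faster sed is than each tool.
speedup = {tool: RATES["sed"] / rate for tool, rate in RATES.items()}

for tool in RATES:
    print(f"{tool:<7}{seconds[tool]:6.1f} s   sed is {speedup[tool]:.2f}x as fast")
```

On these figures sed finishes in a little over twenty-four seconds and PHP in a little over ninety-four.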

The Grammar of the Internet

McMahon’s most enduring contribution may not be sed itself but a few characters of syntax that sed inherited from the ed line editor and carried into every pipeline: s/old/new/. The substitution command. Command letter, delimiter, pattern, delimiter, replacement, delimiter. It is so intuitive that it escaped the terminal entirely.

On Reddit, correcting someone’s spelling is done with s/teh/the/. On IRC, it has been the grammar of correction since the 1990s. On Slack, on Discord, on Mastodon, the syntax persists, understood by people who have never opened a terminal, never heard of sed, never considered that the shorthand they are using reached them by way of a tool written by a psychologist at a telephone company in New Jersey.
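How little machinery the chat convention needs is easy to show. The following is a sketch of a hypothetical helper, `apply_correction`, of the kind a chat bot might use. It is deliberately simpler than sed: the delimiter is fixed to a slash, escaped slashes are not handled, and the pattern is treated as literal text rather than a regular expression.

```python
import re
from typing import Optional

# s/old/new/ with an optional trailing g. Simplifications (assumptions of this
# sketch): "/" is the only delimiter and neither field may contain one.
CORRECTION = re.compile(r"^s/([^/]+)/([^/]*)/(g?)$")


def apply_correction(previous: str, message: str) -> Optional[str]:
    """Apply a chat-style correction to the previous message.

    Returns the corrected text, or None if `message` is not a correction.
    """
    match = CORRECTION.match(message.strip())
    if match is None:
        return None
    old, new, flag = match.groups()
    limit = -1 if flag == "g" else 1       # str.replace: -1 means every occurrence
    return previous.replace(old, new, limit)


said = "I like teh way teh wind blows"
print(apply_correction(said, "s/teh/the/"))    # I like the way teh wind blows
print(apply_correction(said, "s/teh/the/g"))   # I like the way the wind blows
print(apply_correction(said, "good morning"))  # None
```

As in sed, the first match is replaced unless the g flag asks for all of them.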

McMahon the linguist would appreciate the irony. He studied how humans parse grammar. Then his tool handed humans a grammar they adopted without parsing. s/pattern/replacement/ is not taught. It is absorbed. It is the closest thing computing has produced to a natural idiom, a piece of machine syntax so well-designed that it passed into common usage without anyone noticing the transition.

Ninety One-Liners

Eric Pement compiled a collection of ninety sed one-liners, each solving a real text-processing problem in sixty-five characters or fewer. Double-spacing a file. Removing blank lines. Reversing line order. Converting DOS line endings. Extracting the nth line. Each is a complete program. Each fits in a tweet. The collection has been translated into eight programming languages, and in every translation the solution is longer.
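For a sense of the difference in length, here are the five tasks just listed, sketched in Python. The sed forms in the docstrings are the ones commonly given for these tasks, quoted from memory rather than from Pement’s file. The function names are mine, and lines are modelled as strings without their trailing newline.

```python
import itertools
from typing import Iterable, Iterator, List, Optional


def double_space(lines: Iterable[str]) -> Iterator[str]:
    """sed G"""
    for line in lines:
        yield line
        yield ""                     # G appends the (empty) hold space


def remove_blank(lines: Iterable[str]) -> Iterator[str]:
    """sed '/^$/d'"""
    return (line for line in lines if line != "")


def reverse_order(lines: Iterable[str]) -> List[str]:
    """sed '1!G;h;$!d'"""
    return list(lines)[::-1]         # must see the whole input before printing


def dos_to_unix(lines: Iterable[str]) -> Iterator[str]:
    """sed 's/.$//', which assumes every line ends in CR LF.

    This version is more cautious: it strips a CR only when one is present.
    """
    return (line[:-1] if line.endswith("\r") else line for line in lines)


def nth_line(lines: Iterable[str], n: int) -> Optional[str]:
    """sed 'Nq;d', for example sed '52q;d'"""
    return next(itertools.islice(lines, n - 1, n), None)   # stops reading at line n


sample = ["alpha", "", "beta"]
print(list(double_space(sample)))    # ['alpha', '', '', '', 'beta', '']
print(list(remove_blank(sample)))    # ['alpha', 'beta']
print(reverse_order(sample))         # ['beta', '', 'alpha']
print(nth_line(sample, 3))           # beta
```

Reversal is the one exception to the single-line rule. The sed version accumulates the whole input in the hold space, just as the Python version builds a list, because the last line has to be printed first.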

That is the test of a well-designed tool: not whether it can solve the problem, but whether the solution is shorter than the problem statement. Pement’s one-liners pass that test ninety times.

The Footnote

McMahon’s original paper demonstrates sed’s capabilities using Coleridge’s Kubla Khan as test input. He rearranges stanzas, substitutes words, deletes lines, transforms the poem through a sequence of commands. And then, in a footnote that reveals everything about the man’s character, he writes: “In no case is the output to be considered an improvement on Coleridge.”

That footnote is the soul of good engineering. Build a tool powerful enough to rearrange a masterpiece. Have the humility to note that you should not. McMahon understood that the tool is in service of the text, not the other way round. The editor does not improve the content. The editor enables the human to do so. This distinction, between tool and purpose, between capability and judgement, is one that modern software development has largely forgotten.

Twenty-five commands. Two buffers. One pass. No file loading. Turing complete. A substitution syntax that became internet grammar. A codebase that tripled in half a century while the software around it tripled every three years. A tool so complete that seventeen commits per year suffice to maintain it.

A psychologist built that. In 1973. At a telephone company.

Technical beauty is not a function of complexity. It is a function of how little complexity remains after you have solved the problem completely.