Vivian Voss

Technical Beauty: jq

unix tooling

Technical Beauty ■ Episode 25

The Missing Tool

By 2012, JSON had become the undisputed lingua franca of the web. Every API returned it. Every configuration file spoke it. Every log aggregator produced it. And the Unix toolbox, refined over forty years for line-oriented text, had precisely nothing for it.

grep sees lines. awk sees fields. sed sees patterns. None of them understand a tree. A JSON document is not flat. It nests. It branches. It contains arrays within objects within arrays, and the moment you need to extract a value three levels deep, every classic Unix tool reaches for its hammer and discovers the nail is a fractal.

The Unix Toolbox vs Structured Data 40 years of line-oriented tools. Zero for trees. Line-Oriented (1970s-2012) grep → lines matching pattern awk → fields in records sed → stream of characters sort → lines, alphabetically Data model: flat text, delimiters Tree-Structured (2012) jq → nested JSON values {"results": [ {"name": "value", "items": [1,2,3]} ]} Data model: trees, arrays, objects One PhD student added trees. 822 KB.

Stephen Dolan, a PhD student at the University of Cambridge, resolved this quietly in 2012. He did not write a viewer. He did not write a validator. He wrote a language.

A Language, Not a Utility

This is the distinction that earns jq its place among beautiful tools. grep is a filter. sed is a transformer. awk is, admittedly, also a language, but one designed for tabular records. jq is a fully functional programming language with generators, backtracking, and immutable values, compiled to a stack-based bytecode VM. It exists for precisely one purpose: transforming JSON on the command line. stdin in, stdout out. The Unix pipeline, applied to structured data.

The comparison is instructive. To extract a nested value from an API response, Python needs two imports and a runtime:

python3 -c "import json,sys; print(json.load(sys.stdin)['results'][0]['name'])"

jq needs one expression:

jq '.results[0].name'

Same output. 84 characters versus 22. One requires a 50 MB runtime installed on the target machine. The other is an 822 KB binary that does not link against anything except libc. One is a general-purpose language pressed into service for a specific job. The other was born for it.

Same Input. Same Output. Python 84 characters Imports: 2 (json, sys) Runtime: ~50 MB installed Dependencies: libpython, stdlib Startup: interpreter init Model: general-purpose jq 22 characters Imports: 0 Binary: 822 KB (macOS) Dependencies: libc Startup: instant Model: purpose-built One needs a runtime. The other is the runtime.

The Vital Statistics

Approximately 510 KB of C source code, spread across 21 files. A single binary: 822 KB on macOS ARM, 2.1 MB on Linux when statically linked. 33,700+ GitHub stars and 214 contributors, under the MIT licence since its first release in 2012, unchanged.

Those are the numbers. Here is what they mean: jq is smaller than most configuration files in a typical Node.js project. It is smaller than the node_modules folder of a "Hello World" Express application by approximately four orders of magnitude. And it has been parsing JSON reliably since the year Docker was invented, without a single breaking change to its core language.

The Spec That Arrived Twelve Years Late

The language had no formal specification until 2024. For twelve years, jq was defined entirely by its own source code. If you wanted to know what .foo | select(. > 3) did, you read the implementation, or you ran it. The documentation was useful. The source was canonical.

In March 2024, Michael Färber published the first formal specification on arXiv. Twelve years after Dolan wrote the implementation. The specification agreed with the code. Not the other way round.

Rather Unix, that.

The implication is worth pausing on. A language whose behaviour was so consistent, so unsurprising, so obviously correct that it ran for twelve years without a formal spec and nobody noticed the absence. The spec, when it arrived, was descriptive, not prescriptive. It documented what jq already did. Try that with JavaScript.

The Streaming Parser

This is the section where jq quietly demonstrates that being small does not mean being limited.

In regular mode, jq loads the entire JSON document into memory, builds the tree, and operates on it. For a typical API response, this is instant. For ten million JSON items, it consumes approximately 5.5 GB of memory.

In streaming mode (--stream), jq emits path-value pairs as it reads, without building the tree. The same ten million items: 2 MB. At three in the morning, with a 40 GB log file and a production server that would rather not swap, that difference is not academic.

Memory: 10 Million JSON Items Regular Mode (full tree in memory) 5,500 MB Streaming Mode (path-value pairs) 2 MB 2,750x reduction. Same data. Same tool. One flag.

The Sincerest Form of Flattery

When the community decided that jq could be improved, they did not replace the language. They rewrote the engine.

jaq, written in Rust by Michael Färber (the same person who later formalised the spec), is 5 to 10 times faster on some benchmarks. gojq, written in Go, is a pure reimplementation. Both speak the same syntax. Both accept the same filters. Both produce the same output.

This is the strongest evidence of design quality a tool can receive. One does not improve sed by inventing a new grammar. One rewrites the engine. The jq language was so precisely right that two independent reimplementations in two different languages adopted it wholesale. The design was correct the first time. The performance was not.

One Language, Three Engines The design was correct. The performance was improved. jq language functional, generators, backtracking jq C (original, 2012) 822 KB binary 33,700+ stars Stephen Dolan jaq Rust (2021) 5-10x faster Same syntax Michael Färber gojq Go (2019) Pure reimplementation Same syntax itchyny When the design is right, you rewrite the engine. Not the grammar.

The Custodian Changes, the Tool Endures

Dolan left the project around 2019 for Jane Street, where he now works on the OCaml compiler. A PhD in Algebraic Subtyping, supervised by Alan Mycroft at Cambridge, followed by a career designing type systems for a trading firm. The pedigree explains the language: jq feels like it was written by someone who understands type theory. It was.

The community formed jqlang. Version 1.7 arrived in September 2023, after a five-year gap since 1.6. Version 1.8 followed in 2025. The tool did not notice the absence of its creator. It continued parsing. One might observe that this is rather the point of writing a tool that does one thing well.

Timeline 2012 First release MIT licence 2015 v1.5 Major features 2018 v1.6 Last by Dolan 2023 v1.7 jqlang community 2024 Formal spec Färber, arXiv 2025 v1.8 5-year gap The tool did not notice. No formal spec for 12 years. The code was the spec. The spec agreed with the code.

The Point

Forty years of Unix tools for flat text. Forty years of grep and awk and sed and sort and cut and paste, each doing one thing brilliantly, all of them blind to the data format that had quietly taken over the world.

One PhD student at Cambridge added trees. He wrote a functional language with generators and backtracking, compiled it to a bytecode VM, shipped it as an 822 KB binary, and walked away to work on type theory. The community rewrote the engine twice, in two different languages, and kept the grammar untouched. Someone formalised the specification twelve years later and found it agreed with the implementation. The tool spent five years without its creator and continued parsing without complaint.

822 KB. Technical beauty emerges from reduction.