Technical Beauty ✦ Episode 37
A disk is filling up. Somewhere under /var are thousands of stale log files, scattered across directories nobody remembers creating. The task is dreary and familiar: locate the old ones and clear them. One line does it, and reads almost like an instruction to a colleague:
find /var/log -name '*.log' -mtime +30 -delete
No loop, no temporary file, no manual descent into each directory. The tool that reads that line as a single coherent thought has been doing so since 1979, and the way it reads is the whole point of this episode.
The Grammar
Most Unix tools take flags: a verb and a handful of switches that modify it. find is different. find takes an expression.
The arguments after the starting path are not flags in the usual sense; they are terms in a small query language. There are primaries, which are tests or actions: -name matches the file's base name against a glob, -type f selects regular files, -mtime +30 means "modified more than thirty days ago", -size +100M means larger than a hundred megabytes, -newer ref means modified more recently than a reference file. There are operators that combine them: terms written next to each other are joined by an implicit logical AND, -o is OR, ! is NOT, and parentheses group sub-expressions (escaped from the shell as \( ... \)).
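The three operators can be tried out safely in a scratch directory. The following is a minimal sketch, not a recipe: the directory layout and file names are hypothetical, and the counts are piped through tr only because some wc implementations pad their output with spaces.

```shell
# Scratch tree so the example touches nothing real (all names are made up).
tmp=$(mktemp -d)
mkdir "$tmp/src"
touch "$tmp/src/a.conf" "$tmp/src/b.log" "$tmp/src/c.txt"

# Implicit AND: regular files AND name ends in .conf.
and_hits=$(find "$tmp" -type f -name '*.conf' | wc -l | tr -d ' ')

# OR inside a group: regular files whose name ends in .conf OR .log.
or_hits=$(find "$tmp" -type f \( -name '*.conf' -o -name '*.log' \) | wc -l | tr -d ' ')

# NOT applied to the same group: regular files that are neither.
not_hits=$(find "$tmp" -type f ! \( -name '*.conf' -o -name '*.log' \) | wc -l | tr -d ' ')

echo "and=$and_hits or=$or_hits not=$not_hits"
rm -rf "$tmp"
```

The parentheses are what keep -type f applied to both alternatives; without them the AND would bind only to the first -name.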
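find walks the directory tree, and for each file it evaluates the expression from left to right, stopping as soon as the result is decided. Actions such as -print and -delete are terms in the expression like any other, so an action fires only when the tests before it have passed; if the expression names no action, -print is assumed. That is the entire model. "Find every regular file under here, larger than a hundred megabytes, not owned by root, and print it" is one expression, evaluated once per file, composed from parts you already know.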
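Because an action is a term in the expression, where it sits decides what it applies to. A small sketch in a scratch directory (file names are hypothetical) shows the classic surprise: AND binds tighter than OR, so an explicit -print placed after an ungrouped -o belongs only to the last alternative.

```shell
tmp=$(mktemp -d)
touch "$tmp/a.conf" "$tmp/b.log"

# Parsed as:  -name a.conf  OR  ( -name b.log AND -print )
# a.conf satisfies the left side, so the right side, including -print, is never evaluated for it.
ungrouped=$(find "$tmp" -name 'a.conf' -o -name 'b.log' -print)

# Parsed as:  ( -name a.conf OR -name b.log )  AND -print
grouped=$(find "$tmp" \( -name 'a.conf' -o -name 'b.log' \) -print | sort)

printf 'ungrouped:\n%s\n\ngrouped:\n%s\n' "$ungrouped" "$grouped"
rm -rf "$tmp"
```

The ungrouped form lists only b.log; the grouped form lists both files. Nothing here is special-cased: it is the same evaluation rule applied to an action instead of a test.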
This is the reduction the series exists to celebrate. find does not ship a flag for every conceivable query. It ships a grammar, and the grammar composes every query from a small vocabulary of primaries and three operators. The surface you must learn is tiny; the space of things you can express is enormous. That ratio, small vocabulary to large expressivity, is what elegance looks like on a command line.
The Surface
In practice, most of what anyone types is a handful of shapes:
find . -type f -name '*.conf'
find /var/log -mtime +30 -delete
find . -size +100M -exec ls -lh {} +
find . -type d -empty -delete
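Any of the deleting shapes can be rehearsed first by swapping -delete for -print. The sketch below assumes a scratch directory and two hypothetical log files, one of them backdated with touch -t so that -mtime +30 has something to match.

```shell
tmp=$(mktemp -d)
touch "$tmp/fresh.log"
# Backdate one file to 1 January 2000 using the POSIX -t timestamp form.
touch -t 200001010000 "$tmp/stale.log"

# Dry run: identical tests, with -print standing in for -delete.
would_delete=$(find "$tmp" -type f -name '*.log' -mtime +30 -print)
echo "would delete: $would_delete"

# Real run, once the list looks right.
find "$tmp" -type f -name '*.log' -mtime +30 -delete

remaining=$(find "$tmp" -type f | wc -l | tr -d ' ')
echo "remaining files: $remaining"
rm -rf "$tmp"
```

Keeping the tests identical between the two runs is the point: only the final action changes.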
The -exec action deserves a note, because it has two forms and the difference matters. -exec cmd {} \; runs the command once per matched file, substituting the filename for {}. -exec cmd {} + gathers as many matches as the command line allows and runs the command as few times as possible, which is dramatically faster for large match sets. The plus form is the one to reach for by default; the semicolon form is for when the command genuinely takes one argument at a time.
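The difference between the two forms can be made visible by counting invocations. In this sketch the command is a stand-in, sh -c 'echo "$#"', which prints how many arguments it received; the three file names are hypothetical.

```shell
tmp=$(mktemp -d)
touch "$tmp/one" "$tmp/two" "$tmp/three"

# Each invocation of the stand-in prints one line, so counting lines counts invocations.
per_file=$(find "$tmp" -type f -exec sh -c 'echo "$#"' sh {} \; | wc -l | tr -d ' ')
batched=$(find "$tmp" -type f -exec sh -c 'echo "$#"' sh {} + | wc -l | tr -d ' ')

echo "semicolon form: $per_file invocations"
echo "plus form:      $batched invocation"
rm -rf "$tmp"
```

With three files the semicolon form starts three processes and the plus form starts one. On a tree with a hundred thousand matches, that is the whole performance story.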
For everything else, find composes with the rest of the toolbox through the pipe, and it does so safely:
find . -type f -print0 | xargs -0 sha256
-print0 terminates each filename with a null byte instead of a newline, and xargs -0 reads them the same way. This is not a nicety. Filenames on Unix may contain spaces, newlines, and almost any other byte, and the naive idioms (for f in $(ls), or piping plain find output into a tool that splits on whitespace) corrupt or skip such names, occasionally with destructive results. The null-separated pipeline is the correct way to move a list of arbitrary filenames between tools, and find has supported it for decades. Beauty, here, includes correctness: the elegant idiom is also the safe one.
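The failure mode is easy to reproduce. The sketch below creates three hypothetical files, one with a space and one with a newline in its name, and asks a stand-in command (sh -c 'echo "$#"') how many arguments arrive through each pipeline.

```shell
tmp=$(mktemp -d)
nl='
'
touch "$tmp/plain.txt" "$tmp/with space.txt" "$tmp/line${nl}break.txt"

# Whitespace-delimited: xargs splits on the space and on the newline.
naive=$(find "$tmp" -type f | xargs sh -c 'echo "$#"' sh)

# Null-delimited: each filename arrives as exactly one argument.
safe=$(find "$tmp" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "files on disk: 3"
echo "arguments via plain pipe:  $naive"
echo "arguments via -print0/-0:  $safe"
rm -rf "$tmp"
```

The null-delimited pipeline reports three arguments for three files; the plain pipe reports more, because fragments of names are passed along as if they were files of their own.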
On FreeBSD
FreeBSD ships BSD find in the base system, BSD-licensed, at /usr/bin/find. It is the lean, POSIX-clean implementation, and on a freshly installed FreeBSD it is simply present, no package required. The same is true on OpenBSD, NetBSD and macOS, all of which carry a BSD-derived find.
GNU find, part of the GNU findutils package and licensed under the GPL, grew a larger set of primaries over the years (-printf with its own format language, several regex variants, and more) and accreted complexity accordingly. None of that is wrong, and some of the extensions are genuinely handy; it is simply a different point on the curve between "small and POSIX-clean" and "feature-rich". On FreeBSD it is a pkg install findutils away, installed as gfind, for the occasions when a script needs a specific GNU primary. For the daily load, the in-base BSD tool is the whole tool, and that is the version this episode is about: the one that fits in your head.
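A script that does depend on a GNU primary can check what it is running before it relies on one. This is a heuristic sketch, not an established convention: it assumes GNU find answers --version with a banner containing "GNU" and that other implementations reject the option, and the variable names FIND and flavor are made up for the example.

```shell
# Prefer gfind when it is installed, otherwise use whatever find is on PATH.
if command -v gfind >/dev/null 2>&1; then
    FIND=gfind
else
    FIND=find
fi

# Probe: GNU find prints a version banner; BSD find treats --version as an error.
if "$FIND" --version 2>/dev/null | grep -q 'GNU'; then
    flavor=gnu
else
    flavor=other
fi

echo "using $FIND ($flavor)"
```

A script can then branch on flavor, or simply stop with a clear message when the GNU tool it needs is absent.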
The Lineage
Dick Haight wrote find for Version 7 Unix, released in 1979, along with cpio and expr. He worked in what was then the Unix Support Group, the part of Bell Labs charged with turning the research system into something AT&T could support and ship, rather than in the research group where Unix itself was born.
There is a well-aired anecdote, preserved in the Unix history archives, that the researchers were faintly put off by the syntax of the USG tools: find did not read like the other commands, with its prefix-expression notation and its little grammar. It was, by the aesthetic of the research room, slightly foreign. They kept it anyway, because it was useful, and because once you stop expecting it to look like grep and start reading it as a query language, it is not foreign at all; it is consistent with itself.
Forty-seven years later, the expression grammar is essentially unchanged. A find one-liner from a 1980s manual runs today. The modern descendant fd (David Peter, written in Rust, 2017, MIT and Apache licensed) is faster, prettier in its output, and friendlier in its defaults (it ignores .git and respects .gitignore), and it reproduces the very same idea: a small set of predicates over a tree walk. The shape was right the first time.
find is the rare Unix tool that is a little language pretending to be a command. The researchers were right that it reads oddly. They were also right to keep it, because a small grammar that composes every case from a few parts is worth a little oddness at first sight. Learn the vocabulary once, and you can ask the filesystem almost anything, in a sentence.