Vivian Voss

The Program in the Kernel

freebsd security linux ebpf

Last Friday ended on a question: if a kernel will keep adding surface for speed, what stops it from letting you run your own programs inside it? The answer, it turns out, is nothing. That happened a decade ago, and it has a name.

First, the fairness the subject deserves, because the capability itself is one every operator genuinely wants. You have a production box misbehaving. You cannot recompile, you cannot reboot, you certainly cannot attach a debugger to the kernel of a machine doing revenue work. What you want is to ask the running system questions: which process is hammering the disk, what is that latency spike, who keeps opening this file. The ability to trace a live kernel without stopping it is not a luxury. It is the difference between diagnosis and folklore.

Linux answers with eBPF, and eBPF goes further than asking questions. It is a virtual machine inside the kernel: you write a program, the kernel checks it, compiles it to native code, and runs it in the most privileged context the machine has. Ring 0, the floor every other protection stands on. Networking, security tooling, observability: whole industries now run their logic there. Which raises the obvious question, and the answer is the interesting part: what stands between a loaded program and the kernel it runs inside? One thing. A static analyser called the verifier, which inspects the bytecode before it is admitted and proves, or believes it has proved, that the program cannot read or write where it should not.

The verifier is the boundary. So the record of the verifier is the record of the boundary.

The Breach

That record is not what one would wish for a load-bearing wall.

A systematisation-of-knowledge study presented at IEEE S&P counted the eBPF vulnerability classes: of forty-one analysed, twenty-eight, about sixty-eight per cent, were memory-safety bugs, the exact class the verifier exists to make impossible. The wall's job is to stop memory corruption, and memory corruption is the majority of what comes through it.

[Figure: eBPF vulnerability classes analysed (IEEE S&P). Memory-safety: 28; other: 13. About 68% are memory-safety, the exact class the verifier exists to make impossible. Note: since kernel 5.13, unprivileged eBPF ships switched off by default on most distributions.]

The individual cases read like a relay race across a decade. CVE-2017-16995, a sign-extension mistake in the verifier's arithmetic, gave local root and worked on stock distribution kernels. CVE-2020-8835 won at Pwn2Own: the verifier mistracked register bounds, and the researcher walked out of the sandbox and into the kernel. CVE-2021-3490 was the same species, wrong bounds in 32-bit arithmetic, arbitrary kernel read and write. CVE-2021-31440 turned the class into a container escape, exploited from inside a Kubernetes pod.
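The sign-extension species is easy to show in miniature. What follows is a toy model in Python, not kernel code: the helper names are hypothetical and the shape of the original bug is simplified. It shows one 32-bit immediate becoming two different 64-bit values depending on whether it is zero-extended or sign-extended. If the analyser assumes one and the executor does the other, a branch the proof treated as settled goes the other way at run time.

```python
def zero_extend32(imm: int) -> int:
    """Widen a 32-bit pattern to 64 bits by padding the top half with zeros."""
    return imm & 0xFFFFFFFF


def sign_extend32(imm: int) -> int:
    """Widen a 32-bit pattern to 64 bits by copying its sign bit upward."""
    imm &= 0xFFFFFFFF
    if imm & 0x80000000:
        return imm | 0xFFFFFFFF00000000
    return imm


imm = 0xFFFFFFFF  # the 32-bit pattern for -1

# Hypothetical split: the model zero-extends, the executor sign-extends.
analyser_view = zero_extend32(imm)
runtime_view = sign_extend32(imm)

# A branch comparing the register against 0xFFFFFFFF.
model_takes_branch = analyser_view == 0xFFFFFFFF
runtime_takes_branch = runtime_view == 0xFFFFFFFF

print(hex(analyser_view), hex(runtime_view))
print(model_takes_branch, runtime_takes_branch)
```

The model concludes the branch is always taken and never examines the other path. The executor takes exactly that other path, and whatever sits there runs unexamined.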

And the race has not finished. This year's entries: CVE-2026-31413, where the verifier models an OR operation wrongly, tracks a register as zero while the running program holds a real value, and the divergence between what was proved and what executes becomes an out-of-bounds write; and CVE-2026-31525, where the interpreter's signed division mishandles the one value the C standard warns about, and the arithmetic itself lies.
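The value the C standard warns about is the most negative integer divided by minus one: the true quotient does not fit in the type, and C leaves the result undefined. The sketch below models 64-bit signed arithmetic with Python's unbounded integers to show why. The function names are hypothetical, and the results chosen for the two awkward cases are assumptions made for illustration, not a statement of what any kernel or interpreter does.

```python
INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


def fits_int64(value: int) -> bool:
    """True if the value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


def wrap_int64(value: int) -> int:
    """Reduce an unbounded Python int to its 64-bit two's-complement value."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def checked_sdiv64(dividend: int, divisor: int) -> int:
    """Signed 64-bit division with both awkward cases given defined results.

    Assumptions for this sketch only: division by zero yields 0, and
    INT64_MIN / -1 yields INT64_MIN (the wrapped quotient).
    """
    if divisor == 0:
        return 0
    if dividend == INT64_MIN and divisor == -1:
        return INT64_MIN
    # Truncate toward zero, as C does, rather than flooring as Python does.
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    return quotient


true_quotient = INT64_MIN // -1  # exact in Python: 2**63

print(fits_int64(true_quotient))
print(wrap_int64(true_quotient) == INT64_MIN)
print(checked_sdiv64(INT64_MIN, -1) == INT64_MIN)
```

The point of the explicit check is that the answer is decided in one place. An analyser and an executor that each handle this case implicitly can each be right on their own terms and still disagree.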

[Figure: The verifier's record, 2017 to 2026. 2017: CVE-2017-16995, sign-extension flaw in the verifier, local root on stock kernels. 2020: CVE-2020-8835, bounds-tracking flaw, won Pwn2Own, sandbox to kernel. 2021: CVE-2021-3490 and CVE-2021-31440, wrong 32-bit bounds, arbitrary read/write, Kubernetes container escape. 2026: CVE-2026-31413 and CVE-2026-31525, verifier OR-tracking divergence, signed-division mishandling. Caption: Nine years, one class. The bug did not age out; it is structural.]

There is a quieter admission worth more than any CVE list: the kernel's own configuration now ships unprivileged eBPF switched off. Since 5.13 the kernel has carried a build option, BPF_UNPRIV_DEFAULT_OFF, to disable it by default; SUSE flipped it in 2021, Ubuntu followed, and today an unprivileged process on most distributions may not load eBPF at all. The ecosystem that built the feature examined the wall it stood behind and decided ordinary users should not be allowed to touch it. That is not a verdict I have to argue for. It is one the ecosystem reached about itself.

The Pattern

Why does the verifier keep failing? Not because its authors are careless. Because of what it is being asked to do.

The verifier must take an arbitrary program, written by someone else, possibly an attacker, and prove ahead of time that it is safe to run in Ring 0. Proving non-trivial properties of arbitrary programs is undecidable in general; the verifier survives by restricting what programs may do and then reasoning about every path through them, tracking the possible range of every register at every instruction. Each new eBPF capability, every helper, every map type, every instruction class, widens what must be reasoned about. The proof engine grows with the feature set, and a single arithmetic slip anywhere in it, one wrong bound, one mismodelled operation, is not a bug in a tool. It is a hole in the floor of the machine.
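Range tracking, and what one wrong bound does to it, fits in a few lines. This is a deliberately tiny sketch in Python, assuming one register, straight-line code and a 16-slot buffer; every name in it is hypothetical and none of it resembles the real verifier's structure. It contains one sound rule for addition and one that forgets to move the upper bound, and runs the same program past both.

```python
from dataclasses import dataclass

BUF_SIZE = 16  # hypothetical buffer the program indexes into


@dataclass(frozen=True)
class Range:
    lo: int
    hi: int


def and_imm(r: Range, imm: int) -> Range:
    # For non-negative values, x & imm never exceeds imm or x's own maximum.
    return Range(0, min(r.hi, imm))


def add_imm_sound(r: Range, imm: int) -> Range:
    return Range(r.lo + imm, r.hi + imm)


def add_imm_buggy(r: Range, imm: int) -> Range:
    # Deliberately wrong: the upper bound is left where it was.
    return Range(r.lo + imm, r.hi)


def verify(program, add_rule) -> bool:
    """Accept only if every access index is provably inside the buffer."""
    r = Range(0, 2**32 - 1)  # untrusted input: anything in 32 bits
    for op, arg in program:
        if op == "and":
            r = and_imm(r, arg)
        elif op == "add":
            r = add_rule(r, arg)
        elif op == "access":
            if r.lo < 0 or r.hi >= BUF_SIZE:
                return False
        else:
            raise ValueError(f"unknown op: {op}")
    return True


def run(program, value: int) -> list:
    """Execute concretely and return every index the program touched."""
    touched = []
    for op, arg in program:
        if op == "and":
            value &= arg
        elif op == "add":
            value += arg
        elif op == "access":
            touched.append(value)
    return touched


program = [("and", 15), ("add", 8), ("access", None)]

print(verify(program, add_imm_sound))  # sound rule: range is [8, 23]
print(verify(program, add_imm_buggy))  # buggy rule: range is [8, 15]
print(run(program, 15))                # what actually happens
```

With the sound rule the program is refused. With the buggy rule it is admitted, and an input of 15 then touches index 23 of a 16-slot buffer. Nothing about the program changed between the two verdicts; only the proof engine did.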

This is the second face of the pattern this series is tracing. Last week the kernel accreted surface for speed. Here it accreted programmability for capability, and the shape is the same: the feature arrived first, and the containment had to be retrofitted around it, then patched, then patched again. eBPF descends from a packet filter that was taught, one extension at a time, to be a general-purpose machine. Nobody sat down in 1993 and designed a safe way to run foreign programs in the kernel. The capability was negotiated into an architecture that had not planned for it, and the verifier is the shape of that negotiation: a wall built after the rooms, load-bearing from the day it was poured, and mended in every kernel release since. A safety net that becomes the most reliable way through the floor is not, on reflection, doing the job its name promised.

The Limit

Honesty, in both directions.

eBPF is genuinely capable, and the industry around it is not a fashion. XDP pushes packet processing to line rate; Cilium runs the networking of a large share of the world's Kubernetes clusters on it; the observability it enables at scale is real and, for some estates, indispensable. The engineers who built the verifier are doing serious, difficult work, and doing it in public. None of that is in question, and none of it is contradicted by the CVE record either: capability and containment are different properties, and it is possible to admire the first while declining to trust the second.

And FreeBSD does not get to be smug here. DTrace has a destructive mode, dtrace -w, that can modify a running kernel; it is a deliberate, privileged, separately-gated choice, but it exists, and an operator who grants it has granted something serious. FreeBSD has its own advisories and its own bad mornings. The claim on the table is narrower than sainthood: it is that one design keeps producing a class of hole and the other was shaped so the class has nowhere to live.

The BSD Angle

FreeBSD has had the operator's capability, the trace on a live production box, since 2008, when DTrace entered the base system. It arrived from a different direction, and the direction is the whole story.

DTrace was designed at Sun in 2003 as an instrument for production systems, by people who wrote down the safety requirement before the feature list. The D language it exposes is deliberately small: no loops, no arbitrary control flow, read-only against the kernel by default. Probes fire, questions are answered, and the language is constructed so that a runaway or malicious script is not expressible in the first place. It runs privileged, full stop; there was never an unprivileged path to stand behind a proof engine. And on FreeBSD it is jail-aware out of the box: a jid variable in every probe context, so tracing a multi-tenant box respects the boundaries the system already has.

[Figure: Verify what you wrote, or constrain what you can write. eBPF (admit, then verify): admits an arbitrary program; a verifier must prove it safe, per program; the proof engine keeps growing; a verifier can have a bug. DTrace (constrain the language): a small language with no loops; read-only by default; the dangerous program cannot be formed; a missing capability cannot have a bug. Caption: One verifies what you wrote. The other constrains what you can write.]

Put the two designs side by side and the difference is not capability. Both trace a live kernel. The difference is what stands between the operator and Ring 0. eBPF admits arbitrary programs and then tries to verify them: the safety property must be proved, per program, by an analyser that keeps growing. DTrace never admits arbitrary programs: the safety property is enforced by what the language cannot say. One verifies what you wrote. The other constrains what you can write. A verifier can have a bug; a missing capability cannot. Shipping the instrument in the toolbox rather than as a separate download is not glamorous, and designing the language so the dangerous sentence cannot be formed is not exciting. It is merely one fewer thing to get wrong, enforced by construction rather than by proof.
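Safety by construction can be sketched too, with the caveat that this is neither D nor how DTrace is implemented. It is a hypothetical four-node expression language in Python, with invented field names, in which there is no loop node and no store node. Every program is a finite tree, so evaluation ends, and the state it reads is handed over as a read-only view.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Union


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Read:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Greater:
    left: "Expr"
    right: "Expr"


# The whole language. There is no node for a loop and none for a write.
Expr = Union[Const, Read, Add, Greater]


def evaluate(expr: Expr, state: Mapping[str, int]) -> int:
    """Walk a finite tree once; each step descends, so evaluation terminates."""
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, Read):
        return state[expr.name]
    if isinstance(expr, Add):
        return evaluate(expr.left, state) + evaluate(expr.right, state)
    if isinstance(expr, Greater):
        return int(evaluate(expr.left, state) > evaluate(expr.right, state))
    raise TypeError(f"not part of the language: {expr!r}")


# Hypothetical probe context, exposed as a read-only view.
state = MappingProxyType({"bytes_read": 4096, "bytes_written": 512})

predicate = Greater(Add(Read("bytes_read"), Read("bytes_written")), Const(4000))
result = evaluate(predicate, state)
print(result)
```

There is no analyser here to get wrong, because there is nothing to analyse: the question of whether a script loops forever or scribbles on the state never comes up, since neither sentence exists in the grammar.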

That is what planned capability looks like, next to retrofitted capability. Not better engineers. An earlier decision.

The Point

An operator choosing a system for the next decade should ask of every powerful feature not what it does, but where it came from. Was the capability designed with the system, its safety written down first, its language shaped to make the dangerous thing inexpressible? Or did it arrive later, negotiated into an architecture that had not planned for it, with a wall built afterwards to hold it back? Retrofitted capability is not wrong. It is merely never free: somebody has to maintain the wall, forever, and the wall is software too.

The question, as ever in this series, is not what the system can do. It is what it will not do to you, at three in the morning, when the wall has a hole in it. A capability that was planned is one moving part fewer that can fail. Boring is not the absence of engineering. It is the result of engineering done early enough.

Next Friday: when one of these surfaces finally gives, who patches it, and how fast can you trust the patch?