Vivian Voss

The Surface You Never Added


There is a question an engineer asks once the novelty of a system has worn off and the pager has gone off at three in the morning a few times too often. It is not what the system can do. It is what it will not do to you. The two questions sound similar and lead to entirely different operating systems.

This little series is about the second question, seen through security, because security is where the difference shows first and costs most. And it begins with a quiet decision a very large company made in 2023.

Google turned io_uring off. Not deprecated, not merely warned about: switched off entirely in ChromeOS, restricted to a handful of system processes in Android, and disabled on its own production servers. When the firm that runs more Linux than almost anyone alive pulls a kernel feature from across its fleet, it is worth asking what it saw. Disabling a feature across a whole estate is not the reaction of a team that found it mildly inconvenient.

The Breach

io_uring is a genuinely clever piece of machinery. Written by Jens Axboe and merged into Linux 5.1 in 2019, it is a modern asynchronous I/O interface built around two shared-memory rings, a submission queue and a completion queue, through which an application hands the kernel batches of work and collects the results with very few system calls, sometimes none at all. For a database or a busy proxy, that is real throughput.
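The two-ring shape can be made concrete with a toy model in Python. This is not the kernel ABI and not liburing. The class and method names (`Ring`, `ToyUring`, `submit`, `enter`, `reap`) are hypothetical. The "kernel" is an ordinary method, and the only opcodes are a made-up `NOP` and `ADD`. What the model does show is the pattern described above: the application queues several entries, one call hands over the whole batch, and results come back on a second ring.

```python
from collections import namedtuple

# Hypothetical entry layouts, loosely modelled on io_uring's SQE/CQE idea.
Sqe = namedtuple("Sqe", "opcode user_data args")
Cqe = namedtuple("Cqe", "user_data result")


class Ring:
    """Single-producer, single-consumer ring with free-running head/tail indices."""

    def __init__(self, entries):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.slots = [None] * entries
        self.mask = entries - 1  # cheap modulo for power-of-two sizes
        self.head = 0            # advanced only by the consumer
        self.tail = 0            # advanced only by the producer

    def push(self, item):
        if self.tail - self.head == len(self.slots):
            raise BufferError("ring full")
        self.slots[self.tail & self.mask] = item
        self.tail += 1

    def pop(self):
        if self.head == self.tail:
            return None  # ring empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


class ToyUring:
    def __init__(self, entries=8):
        self.sq = Ring(entries)      # application -> toy kernel
        self.cq = Ring(entries * 2)  # toy kernel -> application (io_uring's default CQ is also 2x)
        self.enter_calls = 0         # stands in for real system calls

    def submit(self, opcode, user_data, *args):
        # Queuing work is a plain memory write: no "system call" here.
        self.sq.push(Sqe(opcode, user_data, args))

    def enter(self):
        """One toy 'system call': the toy kernel drains the whole submission ring."""
        self.enter_calls += 1
        while (sqe := self.sq.pop()) is not None:
            if sqe.opcode == "NOP":
                result = 0
            elif sqe.opcode == "ADD":
                result = sum(sqe.args)
            else:
                result = -22  # mimic -EINVAL for an unknown opcode
            self.cq.push(Cqe(sqe.user_data, result))

    def reap(self):
        """Collect every completion currently on the completion ring."""
        out = []
        while (cqe := self.cq.pop()) is not None:
            out.append(cqe)
        return out


if __name__ == "__main__":
    ring = ToyUring(entries=8)
    for i in range(5):
        ring.submit("ADD", i, i, 10)  # user_data=i, args=(i, 10)
    ring.enter()                      # one crossing for five operations
    print(ring.reap(), "enter calls:", ring.enter_calls)
```

In the model, each opcode branch inside `enter` is work performed on the kernel side. That is the property the rest of this piece is concerned with.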

It is also, by the count of the people who hunt these things, the single most exploited subsystem in the recent Linux kernel. In 2023 Google's security team published what it had learnt from its kernel-exploit bounty: of the submissions it received in 2022, about six in ten exploited io_uring, and the programme paid out roughly a million euros for that component alone. Six in ten. The majority of all the kernel break-ins the programme was shown came from one subsystem barely three years old. A feature added for speed had quietly become the most reliable way into the kernel you run, and that is not a number one files under teething trouble.

Figure: Linux kernel exploits submitted to Google's bounty, 2022. About 6 in 10 submitted kernel exploits used io_uring, one subsystem barely three years old; about €1m was paid for that component alone. Source: Google Security Team, kCTF kernel-exploit bounty (oss-security, 2023).

The stream has not dried up. In 2026 the subsystem is still producing serious holes: PinTheft (CVE-2026-43494) chains a reference-counting bug in the kernel's RDS sockets with io_uring to overwrite a setuid-root binary in memory and hand an unprivileged user a root shell; a separate use-after-free in the zero-copy receive path (CVE-2026-43174) corrupts kernel memory through mishandled object lifetimes. These are not the growing pains of something new. They are the steady output of a large, fast-moving surface.

The Pattern

Here is the part that matters when you are choosing what to run, rather than merely what to patch.

io_uring did not arrive in a vacuum. It is the fifth way Linux has offered to wait for input and output: select, then poll, then epoll, then the kernel's native asynchronous I/O interface (io_submit, usually called Linux AIO), then io_uring. Each was added because the previous one was too slow for some workload. Each is still there, because removing an interface that programs depend on is its own kind of breakage. The kernel did not replace its I/O model. It accreted one.
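The accretion is visible even from a high-level language. Python's standard `selectors` module wraps whichever readiness mechanisms the platform provides, and defines a selector class only when the underlying primitive exists. The sketch below lists the classes present on the machine running it. On a typical Linux build you would expect select, poll and epoll; on FreeBSD, select, poll and kqueue. `selectors` does not wrap io_uring, so io_uring does not appear in the list. The helper name `available_selectors` is illustrative.

```python
import selectors

# Selector classes the standard library may define. Each one exists only
# when the corresponding OS primitive is available on this platform.
CANDIDATES = [
    "SelectSelector",   # select(2)
    "PollSelector",     # poll(2)
    "DevpollSelector",  # /dev/poll (Solaris)
    "EpollSelector",    # epoll(7), Linux
    "KqueueSelector",   # kqueue(2), the BSDs and macOS
]


def available_selectors():
    """Return the names of the selector classes this Python build provides."""
    return [name for name in CANDIDATES if hasattr(selectors, name)]


if __name__ == "__main__":
    print("available:", available_selectors())
    # DefaultSelector is an alias for the most efficient class available.
    print("default:  ", selectors.DefaultSelector.__name__)
```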

Figure: Two ways to wait for input and output. Linux, accreted: select, poll, epoll, AIO and io_uring (2019: shared rings, dozens of opcodes, ~60% of exploits). These are five interfaces, all still present. FreeBSD, integrated: kqueue (2000), one kevent interface for files, sockets, signals, timers and processes, unreplaced in 26 years. Linux accreted its I/O model; FreeBSD kept one. The surface you never add cannot be attacked.

io_uring is the sharpest expression of that habit. To go faster, it moves work the application used to do across the boundary and into the kernel. That means asynchronous execution on the kernel side, a shared memory region that the application and the kernel both write to, and a growing catalogue of operations: several dozen opcodes and counting, each a small new thing the kernel will now do on a stranger's behalf. Every one of those is attack surface. The subsystem became a sixty-per-cent problem not because Axboe is a poor engineer (he is plainly an excellent one) but because the architecture rewards adding capability to the kernel and has nobody whose job is to keep that surface small.

This is the first thing an operator should weigh. Defect density is not bad luck. It follows from how a system is built. A kernel that keeps absorbing new privileged surface, for the best of reasons each time, is a kernel whose defect density you cannot bound, because the surface itself is still growing.

The Limit

It would be dishonest to leave it there, and this series will refuse to.

io_uring solves a real problem. For high-throughput, high-IOPS workloads, batching submissions and draining completions with almost no system calls is a measurable win, and the engineering inside it is serious work. FreeBSD's answer, which we are about to meet, does not give you that particular form of asynchronous throughput. If your workload genuinely lives or dies on millions of queued operations a second, io_uring is doing something for you that the alternative does not.
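At bottom, the throughput argument is arithmetic about system calls. Here is a back-of-the-envelope sketch under deliberately simple assumptions:

- Every operation is a read on a descriptor that is already ready.
- A readiness loop pays one wait per batch plus one read per operation.
- A ring pays one submit-and-wait call per batch.
- In submission-queue polling mode, where a kernel thread watches the ring, the ring pays nothing while that thread is awake.

The function names are hypothetical. Real costs also depend on copies, wake-ups and cache behaviour, all of which this model ignores.

```python
import math


def syscalls_readiness(ops, batch):
    """epoll-style loop: one wait per batch, then one read() per ready descriptor."""
    batches = math.ceil(ops / batch)
    return batches + ops


def syscalls_ring(ops, batch, sqpoll=False):
    """Ring-style: one submit-and-wait call per batch, or none with a polling kernel thread."""
    if sqpoll:
        return 0  # steady state only; an idle poller must be woken again
    return math.ceil(ops / batch)


if __name__ == "__main__":
    ops, batch = 1_000_000, 64
    print("readiness loop:", syscalls_readiness(ops, batch))  # 1015625
    print("ring, batched: ", syscalls_ring(ops, batch))        # 15625
    print("ring, polled:  ", syscalls_ring(ops, batch, True))  # 0
```

The gap grows with batch size. That is exactly the win described above, and it is also why the work moves into the kernel.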

And FreeBSD is not a temple. It has its own advisories, its own bugs, its own mornings that begin badly. The argument of this series is not that one system is without fault. It is that defect density is decided by process and planning, which is a duller claim and a far more useful one. I will not call a design brilliant when it accounted for most of the kernel exploits Google's bounty paid out for. I will say, without hedging, exactly what it bought, and let the reader weigh the price.

The BSD Angle

On FreeBSD, the same class of bug does not arise, and the reason is almost embarrassingly simple: the surface it would live on was never added.

FreeBSD waits for input and output through kqueue, written by Jonathan Lemon and shipped in FreeBSD 4.1 in July 2000. One mechanism, kevent, watches files, sockets, signals, timers, file-system changes and processes through a single interface. It has not needed replacing in twenty-six years. macOS and the other BSDs adopted it. It is, by any measure, one of the most boring success stories in systems software.
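In practice, "one mechanism" means that a single kqueue descriptor can carry a socket and a timer side by side. Each is registered as a kevent with a different filter, and one wait call reports whichever has fired. The sketch uses Python's `select.kqueue` and `select.kevent` bindings, which exist only on kqueue platforms such as FreeBSD and macOS. On other platforms the hypothetical helper `one_queue_two_sources` returns `None` rather than pretending. The timer identifier and the 20 ms interval are arbitrary choices.

```python
import select
import socket


def one_queue_two_sources(timer_ms=20):
    """Watch a socket and a timer through one kqueue; return the filter kinds seen."""
    if not hasattr(select, "kqueue"):
        return None  # not a kqueue platform (e.g. Linux)

    kq = select.kqueue()
    a, b = socket.socketpair()
    try:
        b.send(b"ping")  # makes `a` readable straight away
        changes = [
            # Socket readiness, one-shot: reported once, then removed.
            select.kevent(a.fileno(), filter=select.KQ_FILTER_READ,
                          flags=select.KQ_EV_ADD | select.KQ_EV_ONESHOT),
            # Timer: the ident is just an arbitrary number; data is in milliseconds.
            select.kevent(1, filter=select.KQ_FILTER_TIMER,
                          flags=select.KQ_EV_ADD | select.KQ_EV_ONESHOT,
                          data=timer_ms),
        ]
        kq.control(changes, 0)  # register both, fetch nothing yet

        names = {select.KQ_FILTER_READ: "read", select.KQ_FILTER_TIMER: "timer"}
        seen = set()
        for _ in range(5):  # bounded, so a surprise cannot hang the caller
            for ev in kq.control(None, 4, 1.0):  # one wait call for every source
                seen.add(names.get(ev.filter, ev.filter))
            if seen >= {"read", "timer"}:
                break
        return seen
    finally:
        a.close()
        b.close()
        kq.close()


if __name__ == "__main__":
    print(one_queue_two_sources())
```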

Figure: What kqueue does not do. io_uring does the work: two shared-memory rings that both sides write, asynchronous execution inside the kernel, and dozens of opcodes the kernel runs for you. All of it is attack surface. kqueue only notifies: one kevent interface tells you an event is ready, then gets out of the way, leaving nothing to work. Boring is not the absence of engineering. It is the result of it.

The difference that matters for security is what kqueue does not do. It notifies. It tells your program that something is ready and then gets out of the way. It does not run asynchronous work for you on a shared ring. It does not maintain a catalogue of dozens of operations to perform inside the kernel. So it does not offer the attacker the corridor that io_uring does. The event loop your FreeBSD server already runs is the boring one, and boring, here, is the whole of the point.
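The division of labour in a readiness loop fits in a few lines. The sketch below is built on Python's `selectors.DefaultSelector`, which wraps kqueue on FreeBSD and epoll on Linux. The kernel's only job is to report that a socket is readable. The `recv` that follows is the program's own ordinary call, made after the notification. The helper name `serve_once` and the payload are illustrative assumptions.

```python
import selectors
import socket


def serve_once(payload=b"hello"):
    """Wait for readiness, then do the I/O ourselves; return what was read."""
    sel = selectors.DefaultSelector()  # kqueue on FreeBSD, epoll on Linux
    reader, writer = socket.socketpair()
    reader.setblocking(False)
    try:
        sel.register(reader, selectors.EVENT_READ)
        writer.sendall(payload)

        received = b""
        for _ in range(10):  # bounded so a surprise cannot spin forever
            if len(received) >= len(payload):
                break
            # The kernel only notifies: these registered sockets are ready.
            for key, _mask in sel.select(timeout=1.0):
                # The read itself is a separate call the program makes.
                received += key.fileobj.recv(4096)
        return received
    finally:
        sel.close()  # does not close registered sockets, so close them too
        reader.close()
        writer.close()


if __name__ == "__main__":
    print(serve_once())
```

Nothing in this loop asks the kernel to perform an operation on the program's behalf. The kernel reports readiness, and the program decides what to do next.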

This is not a claim that FreeBSD is faster. It is the observation that one project, building one event primitive and keeping it, produced a surface an attacker cannot work because it is not there to be worked.

The Point

When you choose a kernel to run for a decade, the question is not what it can do at peak throughput on benchmark day. It is what it will not do to you at three in the morning, eighteen months in, when a subsystem you never enabled on purpose turns out to be the way in.

A system that keeps adding execution surface, for excellent reasons each time, is a system whose defect density you have agreed not to bound. A single event primitive, unreplaced since 2000, is not the most exciting line on a changelog. That is rather the point. The boring system is the one still running, and boring is not the absence of engineering. It is the result of it.

Next Friday: if a kernel will keep adding surface for speed, what stops it from letting you run your own programs inside it?