Vivian Voss

Nine Years in the Page Cache

security linux freebsd architecture

On 29 April, a researcher at the security firm Theori published the details of a Linux kernel flaw that lets an unprivileged local user become root. The full exploit, written in standard Python, fits in seven hundred and thirty-two bytes. It does not race. It does not crash. It runs on Ubuntu 24.04, on Amazon Linux 2023, on Red Hat Enterprise Linux 10.1, on SUSE 16, on Debian, on Fedora, on Arch: on any distribution shipping a kernel from the last nine years. The flaw has a name. It is called Copy Fail, and one rather hopes it will be the last name of its kind for a while.

This is not, however, the story of a single bug. It is the story of three correct decisions, taken by three different teams, fifteen years apart, that happened to meet in the page cache.

The Breach

The official entry is CVE-2026-31431. The CVSS v3.1 score is 7.8, which the registry classifies as High; in operational terms it means local privilege escalation on every Linux machine an attacker can already log in to as an ordinary user, and on every container sharing a host kernel with one. The kernel security team was notified on 23 March 2026 and the mainline patch landed on 1 April. The CVE was assigned on 22 April, and Theori disclosed publicly on 29 April. Little more than a week from report to patch is, by any reasonable benchmark, fast.
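For readers who like to see where a figure such as 7.8 comes from, here is a minimal sketch of the CVSS v3.1 base-score arithmetic. The vector is an assumption: the advisory's actual vector is not quoted here, so the sketch uses AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H, a common shape for local privilege escalation that happens to produce the same score. The function names are illustrative only.

```python
import math

# Metric weights from the CVSS v3.1 specification for the ASSUMED vector
# AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H (not taken from the advisory itself).
AV_LOCAL, AC_LOW, PR_LOW, UI_NONE = 0.55, 0.77, 0.62, 0.85
C_HIGH = I_HIGH = A_HIGH = 0.56


def roundup(value: float) -> float:
    """Round up to one decimal place, as the v3.1 specification defines it."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_unchanged() -> float:
    """Base score for the assumed vector, with Scope: Unchanged."""
    iss = 1 - (1 - C_HIGH) * (1 - I_HIGH) * (1 - A_HIGH)
    impact = 6.42 * iss
    exploitability = 8.22 * AV_LOCAL * AC_LOW * PR_LOW * UI_NONE
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


def severity(score: float) -> str:
    """Map a base score to the CVSS v3.1 qualitative rating."""
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


score = base_score_scope_unchanged()
rating = severity(score)
```

Under that assumed vector, the local attack vector and the need for an existing account are what keep the score out of the Critical band; the impact terms are already at their maximum.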

The component is the authencesn template inside the Linux kernel crypto subsystem, exposed to userspace through the algif_aead module of the AF_ALG socket interface. AF_ALG is the socket family that lets unprivileged programs ask the kernel to perform cryptographic operations on their behalf. It has been there since 2010, and it is genuinely useful: it gives userland access to kernel-accelerated cryptography without the userland needing to embed a crypto library of its own.
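To make the interface concrete, here is a minimal sketch of the benign, intended use of AF_ALG from Python: asking the kernel to compute a SHA-256 digest. It uses the hash interface rather than the AEAD one discussed in this piece. The function name kernel_sha256 is illustrative, and the sketch assumes a Linux host where the interface has not been disabled; where it is unavailable, the function returns None rather than failing.

```python
import hashlib
import socket
from typing import Optional


def kernel_sha256(data: bytes) -> Optional[bytes]:
    """Hash `data` with the kernel's SHA-256 via AF_ALG; None if unavailable."""
    if not hasattr(socket, "AF_ALG"):
        return None  # not Linux, or Python built without AF_ALG support
    try:
        # The bound socket names the algorithm; accept() yields the
        # operation socket that actually carries data in and digest out.
        with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as alg:
            alg.bind(("hash", "sha256"))
            op, _ = alg.accept()
            with op:
                op.sendall(data)
                return op.recv(32)
    except OSError:
        return None  # interface disabled, module absent, or blocked by a sandbox


digest = kernel_sha256(b"abc")
reference = hashlib.sha256(b"abc").digest()  # userspace result for comparison
```

Note that no privilege is requested at any point: the socket family is open to ordinary users by design, which is exactly what makes it useful and exactly what makes it reachable.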

The exploit chain is short enough to describe in one paragraph. The attacker opens an AF_ALG socket, splices a file they can read (any file readable by the current user) into the socket as the input buffer, and asks the kernel to perform an AEAD operation. Since 2015, AF_ALG has supported AEAD splicing, and the splice mechanism holds direct references to the kernel's cached pages of the spliced file rather than copying their contents. In 2017, a performance commit (the kernel changeset is 72548b093ee3) introduced an in-place optimisation: rather than copying input bytes to a separate output buffer, the AEAD operation could now reuse the same scatterlist for both. And inside the authencesn template — present since 2011 to support IPsec Extended Sequence Numbers — there is a small piece of housekeeping that briefly writes four bytes into the destination buffer as scratch space.

Figure: Three Correct Decisions, Fifteen Years Apart. Timeline: 2010, AF_ALG socket family for crypto; 2011, authencesn four-byte scratch write for IPsec ESN; 2015, AEAD splice with direct references to cached pages; 2017, in-place optimisation (72548b093ee3) with one shared scatterlist; 2026, Copy Fail (CVE-2026-31431). Each decision was correct in isolation; the seam is in their composition.

In isolation, none of these three decisions is wrong. The authencesn scratch write was correct in its original AEAD context, where the destination buffer was the caller's own memory. The AF_ALG AEAD splice support of 2015 was bounded by an out-of-place design that kept input and output separated. The 2017 in-place optimisation was a legitimate performance win in the kernel's own cryptographic paths.

But once all three were in the tree at the same time, the path opened. The four-byte scratch write of authencesn, walking past its expected buffer in the in-place case, lands inside a kmap_local_page mapping of the spliced file's page cache page. The attacker chooses the file, chooses (through splice offset, splice length, and assoclen) which four bytes get written, and chooses the value to write through bytes four to seven of the AAD. The result is a deterministic four-byte primitive against the kernel's in-memory copy of any file the user can read.
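The aliasing at the heart of this can be shown with a toy model in ordinary Python. It is emphatically not kernel code and not the exploit: a bytearray stands in for a cached page, memoryview stands in for a reference held rather than a copy made, and the hypothetical toy_aead function does nothing cryptographic at all. It only shows why a harmless scratch write into "the destination" stops being harmless once the destination and the source are the same memory.

```python
SCRATCH = b"\xde\xad\xbe\xef"  # stand-in for the four scratch bytes


def toy_aead(src: memoryview, dst: memoryview, scratch_at: int) -> None:
    """Toy operation: copy src into dst, then use four bytes of dst as scratch."""
    dst[: len(src)] = bytes(src)
    dst[scratch_at:scratch_at + 4] = SCRATCH


ORIGINAL = b"contents of a read-only file"

# One shared buffer standing in for the cached page of a file.
page = bytearray(ORIGINAL)

# Out-of-place: the input references the page, the output is the caller's
# own memory, so the scratch write lands only in the caller's buffer.
out = bytearray(len(page))
toy_aead(memoryview(page), memoryview(out), 0)
page_after_out_of_place = bytes(page)

# In-place: one buffer serves as both input and output, so the same
# scratch write now lands in the shared page itself.
view = memoryview(page)
toy_aead(view, view, 0)
page_after_in_place = bytes(page)
```

In the first call the page survives intact; in the second its first four bytes are replaced. Nothing about the scratch write changed between the two calls. Only the ownership of the memory underneath it did.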

Figure: The Exploit Chain, in 732 Bytes of Python. Steps: (1) an ordinary user shell, in a container or on the host; (2) open an AF_ALG socket (algif_aead); (3) splice a readable file such as /usr/bin/su into the socket, held by reference rather than copied; (4) request an AEAD operation, with input and output sharing one scatterlist; (5) the authencesn four-byte scratch write lands in the kmap_local_page mapping; (6) a deterministic four-byte write with attacker-chosen offset and value. Result: any four bytes at any offset of any file the user can read, in the page cache only. The disk file is never touched and the page is never marked dirty.

The classical target is /usr/bin/su. The on-disk binary is, in any sane setup, owned by root and not writable by the user. The in-memory copy held by the page cache, however, is what gets executed when the user runs the program. Corrupt the in-memory copy carefully, and the next invocation of su becomes an instruction in the attacker's favour. The disk file remains untouched. The page is never marked dirty. Standard file-integrity tooling — Tripwire, AIDE, the host-based integrity layer of every modern EDR — compares disk hashes and sees nothing.

Figure: Where /usr/bin/su Actually Lives. Left panel, on disk: owned by root, not writable by the user, correct sha256, all checks clear (Tripwire, AIDE, EDR, auditd). Right panel, in the page cache: the in-memory copy that is actually executed, four bytes corrupted by Copy Fail, page not marked dirty. Integrity tools watch the receipt; the cinema is where the show plays.

There is one further detail. Because the page cache is shared between containers and the host, a process inside a container needs nothing more than an ordinary local user account to corrupt the host's cached binaries. The container boundary does not stop this; it was never designed to. The Microsoft Security Response Center's writeup of Copy Fail is unusually direct about the cloud impact, and CERT-EU's Advisory 2026-005 recommends disabling algif_aead until the patched kernel has reached every node, which, given the breadth of affected distributions, may be a longer week than anyone wanted.
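For anyone following that recommendation, a first sanity check is whether the module is currently loaded on a given node. The sketch below reads /proc/modules; the function name module_loaded is illustrative. Two caveats apply. A False result is not proof of safety, because a module that is merely unloaded can be loaded again on demand, and a kernel with the feature built in will not list it as a module at all. Preventing the load is a separate step, commonly done with an install rule in a modprobe.d configuration file, and is not shown here.

```python
from pathlib import Path


def module_loaded(name: str, modules_file: str = "/proc/modules") -> bool:
    """Return True if `name` is listed as a loaded module in `modules_file`."""
    try:
        text = Path(modules_file).read_text()
    except OSError:
        return False  # no procfs (not Linux) or file not readable
    # Each line starts with the module name, followed by size, refcount, etc.
    return any(
        line.split()[0] == name
        for line in text.splitlines()
        if line.strip()
    )


aead_loaded = module_loaded("algif_aead")
```

Run across a fleet, this answers only the narrow question of what is loaded right now, which is why the advisory pairs the workaround with getting the patched kernel onto every node.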

Figure: The Boundary the Container Does Not Defend. Containers A and C (workloads), container B (compromised, local user) and the host tools all sit above one shared page cache per host kernel, which lies below namespaces, seccomp, capability drops and cgroups. One host kernel serves every container above it.

The current patch picture, as of the morning this piece is being written, has Arch, Fedora and Amazon Linux on patched kernels, with Ubuntu, SUSE and Red Hat closing in; Debian's stable line is the slowest, which is consistent with Debian's general posture and not, in this instance, a criticism. The kernel mailing list has been busy and the maintainers have been, as ever, exemplary.

The Pattern

It is tempting, when a flaw of this severity surfaces, to look for someone to be cross with. There is rarely anyone. The Linux kernel crypto subsystem is maintained by careful people who review each patch within the boundary of their own subsystem. The authencesn maintainer of 2011 was correct. The AF_ALG maintainer of 2015 was correct. The performance maintainer of 2017 was correct. Each of them did the engineering due diligence the moment asked of them. None of them was in a position to see how their decision would combine with the others, years later, into a four-byte primitive.

This is the structural observation. Composite vulnerabilities of this kind do not have a single author. They emerge where responsibility lines cross. A kernel that absorbs tens of thousands of commits per year is maintained by hundreds of subsystem stewards; each is excellent within their own remit; the gaps between remits are where the seams sit. And the seams, on occasion, show.

The cross-container aspect is the second seam. Container isolation on Linux is, as the orchestrator-without-orchestrators essays of the last decade have noted, a curated arrangement of kernel namespaces and seccomp filters and capability drops, not a kernel-native partition. The page cache is below all of those abstractions, which is precisely why this particular primitive crosses container boundaries. None of the container runtimes is at fault — they cannot constrain what the kernel does not partition.

The Linux kernel community has, in fact, been having a quiet conversation about AF_ALG itself. A 2025 LKML patch series titled Document the deprecation of AF_ALG proposes — in the trademark restraint of kernel English — that new use cases should prefer userspace cryptography libraries. The conversation is unresolved, the interface is still in the tree, and Copy Fail has now given the conversation a rather sharper outline.

The Limit

Honesty asks for a few qualifications.

The 2017 in-place optimisation was technically sound. The performance numbers were real and the review at the time was thorough. The patch did exactly what its commit message said it did. The flaw is not in any one of the three decisions; it is in their composition.

FreeBSD has its own CVEs. The src-tree security advisories — the FreeBSD-SA series — are a public record one can consult without flinching. Some of them are awkward; one or two are embarrassing. The FreeBSD security team does not pretend otherwise, and neither does anyone who has done a fair amount of FreeBSD work. The point of what follows in the next section is not that one operating system is safer than another. The point is that composite vulnerabilities want composite responsibility models, and the two stacks have chosen different shapes of responsibility.

And there is one further limit worth noting. The Linux kernel security team triaged and patched Copy Fail in little more than a week. Whatever one thinks of the structural model, the responsiveness of the people inside it deserves the editorial credit it does not always receive. Where the seams do show, the seams get sewn.

The BSD Angle

A reader curious about how the same problem looks in a different kernel finds three things in FreeBSD.

The first is the crypto interface. FreeBSD exposes kernel cryptography to userspace through /dev/crypto, a device node, not a socket family. There is no AF_ALG-equivalent splice path. The OpenCrypto framework, descended from OpenBSD's OCF, was built in a different idiom; the in-place AEAD optimisation as deployed in the Linux kernel does not exist there in the same form. This is strictly an observation about architectural shape, not a claim about absolute safety.

The second is Capsicum. FreeBSD's capability-mode sandbox lets a process explicitly relinquish access to global namespaces and operate only on file descriptors with explicitly granted capabilities. A program that does not need a kernel-crypto facility never opens it; a program that does open it holds a capability that other processes do not. The attacker corridor is narrowed structurally before any specific bug is reached. Capsicum was added to FreeBSD 9 in 2012 and is now in active use by base-system tools such as tcpdump, dhclient, auditdistd, kdump, ctld, iscsid and several others. It is, in a quiet way, the kind of capability hygiene that composite vulnerabilities benefit from.

The third is jails. Containerisation in FreeBSD is a kernel-native primitive, not a daemon-mediated arrangement of namespaces. There is no privileged daemon to compromise. Page-cache sharing across jail boundaries is structurally different; VNET gives each jail an independent network stack as a kernel subsystem, not as an external CNI plugin. None of this is a guarantee against every vulnerability — FreeBSD has had jail-relevant CVEs and will have more — but the shape of the boundary is different, and the same composite primitive does not, as far as one can see, have an equivalent path.

Figure: Two Kernels, Two Shapes of Attack Surface. Linux: crypto via the AF_ALG socket (algif_aead, splice plus in-place); isolation via namespaces, seccomp and capability drops, a curated arrangement rather than a partition; page cache shared across containers and host; the composite path opens. FreeBSD: crypto via /dev/crypto, with no AF_ALG splice path; Capsicum capability sandbox in tcpdump, dhclient, kdump and others, narrowing the corridor structurally; jails plus VNET, kernel-native with no privileged daemon; the equivalent path does not arise.

On FreeBSD, the equivalent scenario does not arise in the same form because the parts that compose it are not present in the same arrangement. That is the honest version of the comparison.

The Point

There is no real disagreement that the Copy Fail patch should land everywhere it has not already landed. By the time this piece is published, most readers' production fleets will be on patched kernels, and the immediate operational question will be a familiar one: did our automation roll the kernel update through every node, including the awkward ones?
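That question can be put to an inventory directly. The sketch below assumes one already collects the running kernel release (the output of uname -r) from each node. The node names and release strings are hypothetical placeholders, not real patched versions, and the set of approved releases has to come from each distribution's own advisory. Comparing against an explicit set, rather than parsing version numbers, sidesteps the fact that distributions backport fixes without changing the upstream version.

```python
import platform
from typing import Dict, List, Set


def unpatched_nodes(running: Dict[str, str], patched: Set[str]) -> List[str]:
    """Return, sorted, the nodes whose running kernel is not a known patched release."""
    return sorted(
        node for node, release in running.items() if release not in patched
    )


# Hypothetical inventory: node name -> running kernel release on that node.
running_kernels = {
    "web-01": "6.x.y-patched-example",
    "web-02": "6.x.y-patched-example",
    "batch-07": "6.x.y-old-example",  # update installed, node never rebooted
}
patched_releases = {"6.x.y-patched-example"}

stragglers = unpatched_nodes(running_kernels, patched_releases)
local_release = platform.release()  # what this machine is actually running
```

The awkward nodes tend to be the ones where the package was installed but the reboot never happened, which is why the check looks at the running release and not at the installed package.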

But the longer point is not that one. It is the structural observation. Three correct decisions, taken by three different teams, fifteen years apart, meeting at a seam that no single review could have seen. The recommendation that follows from that observation is not "patch and move on". It is check the seams.

Fifteen years from now will be 2041. Wherever responsibility crosses team boundaries in the stack you maintain, a patch you accept this afternoon may, by then, sit alongside two other patches accepted by two other teams in two other years. None of you will be in a position to see it. The most one can do is make the seams as narrow as the architecture will tolerate, and keep an eye on which interfaces are quietly being deprecated by their own maintainers.

One rather hopes the next nine years are a little duller in this regard than the last.