The Unix Way ■ Episode 06
The Inheritance Problem
A process is compromised. A buffer overflow, a malformed packet, a dependency with quiet ambitions. The attacker now controls execution. What happens next depends entirely on what the process can reach, and on a stock Unix system, the answer has been the same since 1969: everything the user can touch. Every file. Every socket. Every device. The compromised process inherits the full ambient authority of the user who launched it.
This is not a bug. It is the original Unix security model, and it served admirably when "users" meant researchers at Bell Labs who could be trusted not to run hostile code from the internet. The internet, regrettably, had other plans.
Two operating systems decided to fix this. They chose opposite philosophies. One removed the doors from the room. The other hired a bouncer and handed him a clipboard.
FreeBSD: Capsicum
In 2010, Robert Watson and Jonathan Anderson at the University of Cambridge presented a paper that won Best Student Paper at USENIX Security. The core insight was disarmingly simple: rather than listing what a process may not do, remove everything and hand back precisely what it needs.
The result was Capsicum, compiled into FreeBSD 10.0 by default in 2014. The API is one function that matters: cap_enter(). One syscall. Irreversible.
#include <sys/capsicum.h>
#include <err.h>
#include <fcntl.h>

/* The application's capture loop, defined elsewhere. */
void write_packets(int fd);

int main(void)
{
    int fd = open("/var/log/capture.pcap", O_WRONLY);
    if (fd < 0)
        err(1, "open");

    /* Restrict fd: write and seek only */
    cap_rights_t rights;
    cap_rights_init(&rights, CAP_WRITE, CAP_SEEK);
    if (cap_rights_limit(fd, &rights) < 0)
        err(1, "cap_rights_limit");

    /* Enter capability mode. No return. Ever. */
    if (cap_enter() < 0)
        err(1, "cap_enter");

    /*
     * The process loses all access to global namespaces.
     * No filesystem. No new sockets. No new processes.
     * What remains: only the file descriptors already held,
     * with rights explicitly granted above.
     */
    write_packets(fd);
    return 0;
}
There is no cap_exit(). There is no escalation path, no
privilege restoration, no polite request form for processes that would
like their authority back. The kernel sets a flag. The flag does not unset.
The process enters a world that contains precisely its open file descriptors,
restricted to the operations explicitly granted. The filesystem, the network,
the process table: they do not merely become inaccessible. For the sandboxed
process, they cease to exist.
The model is subtraction. Start with everything, remove everything, hand back precisely what is needed. The process cannot escape. Not because a filter stops it, but because the door no longer exists. One might observe that this is rather difficult to bypass.
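What "the door no longer exists" looks like from inside the process can be sketched in a few lines. The helper below is hypothetical (its name and return convention are assumptions made for illustration, not part of Capsicum): it forks a child, enters capability mode there, and reports what happens when the child tries to open a file. The compile-time guard is only there so the file still builds on systems without Capsicum, where the probe simply reports that it could not run.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#ifdef __FreeBSD__
#include <sys/capsicum.h>
#endif

/*
 * Hypothetical probe: fork a child, enter capability mode in it, then try
 * to open `path`. With dirfd < 0 the child uses plain open(); otherwise it
 * uses openat() relative to a directory descriptor opened beforehand.
 *
 * Returns 0 if the open succeeded, the child's errno if it failed, and -1
 * if the probe could not run (including on systems without Capsicum).
 */
int probe_open_in_capability_mode(int dirfd, const char *path)
{
    assert(path != NULL);
#ifdef __FreeBSD__
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (cap_enter() != 0)
            _exit(255);                 /* kernel without capability mode */
        int fd = (dirfd < 0) ? open(path, O_RDONLY)
                             : openat(dirfd, path, O_RDONLY);
        _exit(fd >= 0 ? 0 : errno);     /* report errno via exit status */
    }
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status) ||
        WEXITSTATUS(status) == 255)
        return -1;
    return WEXITSTATUS(status);
#else
    (void)dirfd;
    errno = ENOSYS;                     /* Capsicum is FreeBSD-only */
    return -1;
#endif
}
```

On FreeBSD, probe_open_in_capability_mode(-1, "/etc/passwd") should come back as ECAPMODE: the global filesystem namespace is gone. Passing a descriptor for /etc that was opened before the fork, together with the relative name "passwd", should return 0, because the lookup starts from something the process already holds. The authority lives in the descriptor, not in the path.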
Linux: seccomp-bpf
On Linux, the answer arrived in two stages. In 2005, Andrea Arcangeli added seccomp strict mode: four syscalls permitted (read, write, exit, sigreturn), everything else killed the process. Elegant, certainly. Also almost completely unusable for anything that needed to do actual work.
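Strict mode is small enough to try in a forked child. The sketch below is a minimal experiment, not anything from the kernel sources: run_in_strict_mode() and the two child bodies are hypothetical names. One child stays within the four permitted calls, the other makes a single harmless getpid(). Note the raw SYS_exit: glibc's _exit() uses exit_group, which is not on the list.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stays within the four permitted syscalls: write, then exit. */
static void well_behaved(void)
{
    static const char msg[] = "strict mode: write() still works\n";
    if (write(STDERR_FILENO, msg, sizeof msg - 1) < 0) {
        /* nothing useful to do about it in here */
    }
    syscall(SYS_exit, 0);   /* raw exit; glibc's _exit() uses exit_group */
}

/* Steps outside the list with the most harmless call available. */
static void misbehaving(void)
{
    syscall(SYS_getpid);    /* not one of the four: the kernel sends SIGKILL */
    syscall(SYS_exit, 0);   /* never reached */
}

/*
 * Run `body` in a forked child under seccomp strict mode.
 * Returns 0 for a clean exit, the signal number if the child was killed,
 * and -1 if the experiment could not be set up.
 */
int run_in_strict_mode(void (*body)(void))
{
    assert(body != NULL);
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(127);     /* no filter installed, so _exit() is safe */
        body();
        syscall(SYS_exit, 1);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

run_in_strict_mode(well_behaved) returns 0. run_in_strict_mode(misbehaving) returns SIGKILL. A process that cannot ask for its own PID is secure, certainly, and not good for much else.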
In 2012, Will Drewry introduced seccomp-bpf in Linux 3.5: a BPF programme that inspects every syscall at runtime and decides whether to allow, deny, or kill. This was genuinely useful. It was also a fundamentally different philosophy.
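#include <err.h>
#include <fcntl.h>
#include <seccomp.h>

/* The application's capture loop, defined elsewhere. */
void write_packets(int fd);

int main(void)
{
    /* Open the output file first: open() is not on the allowlist below */
    int fd = open("/var/log/capture.pcap", O_WRONLY);
    if (fd < 0)
        err(1, "open");

    /* Default: kill on any syscall not explicitly allowed */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL)
        errx(1, "seccomp_init");

    /* Allowlist: only these syscalls permitted */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    /* Load the filter */
    if (seccomp_load(ctx) < 0)
        errx(1, "seccomp_load");

    /*
     * From here:
     * - Only listed syscalls work
     * - But: all existing FDs retain full rights
     * - An allowed read() can read ANY open FD
     * - The filter checks the call, not the target
     */
    write_packets(fd);
    return 0;
}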
The model is filtration. The process retains full ambient authority. A filter sits between the process and the kernel, checking each call against a list. The bouncer checks your name at the door. He does not check what you do once you are inside.
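The bouncer's blind spot can be shown without libseccomp. The sketch below installs a raw BPF allowlist (write and exit_group only) in a forked child, then checks three things: a write to one inherited descriptor, a write to another, and an attempt to open a new file. The function names and the bitmask convention are hypothetical, chosen for this illustration, and the architecture guard covers only x86-64 and AArch64.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* constant for this architecture"
#endif

/* Allowlist: write and exit_group pass, everything else fails with EPERM. */
static int install_write_only_filter(void)
{
    struct sock_filter filter[] = {
        /* Refuse to interpret syscall numbers from a foreign ABI */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Compare the syscall number against the list */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof filter / sizeof filter[0]),
        .filter = filter,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/*
 * Returns a bitmask reported by a filtered child, or -1 on setup failure:
 *   bit 0: write() to fd_a succeeded
 *   bit 1: write() to fd_b succeeded
 *   bit 2: opening a new file was refused with EPERM
 */
int probe_filtered_child(int fd_a, int fd_b)
{
    assert(fd_a >= 0 && fd_b >= 0);
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_write_only_filter() != 0)
            _exit(64);
        int result = 0;
        if (write(fd_a, "a", 1) == 1)
            result |= 1;
        if (write(fd_b, "b", 1) == 1)
            result |= 2;
        if (open("/etc/hostname", O_RDONLY) < 0 && errno == EPERM)
            result |= 4;
        _exit(result);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status) ||
        WEXITSTATUS(status) >= 64)
        return -1;
    return WEXITSTATUS(status);
}
```

Handed the write ends of two pipes, the probe returns 7: the new open is refused, and both writes succeed, because the rule said "write" and nothing about where. A filter can compare raw argument values such as the descriptor number, but it cannot follow pointers, and a number is not a right. Whatever the descriptor permitted before the filter was loaded, it permits after.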
Docker's default seccomp profile blocks approximately 44 of 300+ syscalls. The remaining 250 or more pass through. The difference between an allowlist and a blocklist is the difference between "you may enter rooms 3 and 7" and "you may enter any room except 12 and 15." One of these gets more dangerous as the building adds floors.
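The rooms analogy reduces to a single choice: the action a filter takes when a syscall is not on its list. The sketch below builds the same one-entry profile twice, once as a blocklist and once as an allowlist, and then makes a call the profile author never listed. Everything named here (install_profile, unlisted_syscall_passes, the policy enum) is hypothetical, and getppid merely stands in for "a syscall added after the profile was written".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* constant for this architecture"
#endif

enum policy { BLOCKLIST, ALLOWLIST };

/*
 * Build a one-entry profile around `listed_nr`.
 *   BLOCKLIST: listed_nr is refused, every other syscall is allowed.
 *   ALLOWLIST: listed_nr is allowed, every other syscall is refused.
 * exit_group is always allowed so the child can report back.
 */
static int install_profile(enum policy policy, unsigned int listed_nr)
{
    const unsigned int refuse = SECCOMP_RET_ERRNO | EPERM;
    const unsigned int on_hit = (policy == ALLOWLIST) ? SECCOMP_RET_ALLOW : refuse;
    const unsigned int on_miss = (policy == ALLOWLIST) ? refuse : SECCOMP_RET_ALLOW;
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, listed_nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, on_hit),
        BPF_STMT(BPF_RET | BPF_K, on_miss),   /* the default action */
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof filter / sizeof filter[0]),
        .filter = filter,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/*
 * Returns 1 if a syscall the profile never mentions gets through,
 * 0 if the filter refuses it, and -1 if the experiment could not run.
 */
int unlisted_syscall_passes(enum policy policy)
{
    assert(policy == BLOCKLIST || policy == ALLOWLIST);
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_profile(policy, __NR_openat) != 0)
            _exit(2);
        /* Stand-in for a syscall added after the profile was written */
        long r = syscall(SYS_getppid);
        _exit(r >= 0 ? 1 : 0);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status) ||
        WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status);
}
```

unlisted_syscall_passes(BLOCKLIST) returns 1 and unlisted_syscall_passes(ALLOWLIST) returns 0. The two profiles differ only in their last instruction. The blocklist's safety depends on its author having heard of every syscall that will ever exist.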
The Same Browser, Two Philosophies
The most instructive comparison is not theoretical. It is Chromium.
Every web browser faces the same threat: rendering untrusted content from the internet inside a process that has access to the local system. The solution, universally, is sandboxing. Chromium sandboxes its renderer processes so that a compromised tab cannot read your files, steal your session cookies, or pivot to the network.
On FreeBSD, Chromium uses Capsicum. The renderer opens the resources it needs (fonts, shared memory, IPC channels to the browser process), then calls cap_enter(). From that moment, the renderer exists in a world that contains precisely its open file descriptors, restricted to the operations it requires. An attacker who compromises a renderer tab inherits a process that can talk to the browser over a pipe and draw pixels. Nothing else exists.
On Linux, Chromium uses seccomp-bpf combined with namespaces. The renderer loads a BPF filter that blocks dangerous syscalls, then enters a restricted namespace. The engineering is sound. The result is effective. But the BPF programme must enumerate every syscall that should be blocked or allowed. New kernel versions add new syscalls. Each new syscall is, by default, a gap in every seccomp profile that uses a blocklist. The filter is a living document. It must be maintained.
Same browser. Same threat model. Same problem. Capsicum does not care how
many syscalls the kernel adds. After cap_enter(), a new syscall
that opens files does not work because the process is in capability mode.
The restriction is structural, not enumerative. The kernel can gain a
thousand new syscalls and the sandbox holds, because the sandbox is not a
list of things you cannot do. It is the absence of the ability to do them.
The CVE That Proved the Point
In 2022, CVE-2022-30594 demonstrated precisely why the architectural difference matters. On Linux kernels before 5.17.2, PTRACE_SEIZE allowed local attackers to set PT_SUSPEND_SECCOMP, bypassing seccomp filters entirely. The filter was correct. Every rule was properly written. The mechanism to enforce it was not.
Capsicum's model is structurally immune to this class of attack. There is no filter to suspend. There is no enforcement layer that can be circumvented. The process is in capability mode. The global namespace does not exist. You cannot bypass a door that is not there, regardless of how creative your lock-picking skills might be.
The filter was correct. The mechanism to enforce it was not. This is the fundamental difference between subtraction and filtration.
The Epistemological Divide
The difference between Capsicum and seccomp is not a difference of implementation. It is a difference of epistemology, and it is worth spending a moment on why this matters beyond the technical detail.
seccomp asks: "What should this process not be allowed to call?" This requires you to know, in advance, every dangerous thing a process might do. Every new kernel version, every new syscall, every new attack vector must be anticipated and added to the filter. On x86-64, the Linux kernel had a little over 310 syscalls in 2012. Its syscall numbers now run past 450. Each addition is a potential gap in every seccomp profile that uses a blocklist.
Capsicum asks: "What does this process actually need?" Then it removes everything else. You do not enumerate the threats. You enumerate the requirements. The set of things a process needs is small, knowable, and stable. The set of things a process might abuse grows with every kernel release. One of these sets is rather easier to manage than the other.
The Practical Evidence
FreeBSD's base system quietly demonstrates what Capsicum makes possible.
tcpdump opens the capture device and output file, restricts
their rights, enters capability mode. A compromised tcpdump cannot read
/etc/shadow. Cannot open a reverse shell. Cannot reach the
network. The attack surface is: a read-only BPF device and a write-only
output file.
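The per-descriptor half of that pattern does not even require capability mode: cap_rights_limit() takes effect immediately. A minimal sketch, assuming FreeBSD (the helper name and return convention are hypothetical, and the guard only keeps the file compiling elsewhere):

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#ifdef __FreeBSD__
#include <sys/capsicum.h>
#endif

/*
 * Hypothetical helper: limit `fd` to CAP_READ, then try to write to it.
 * Returns 0 if the write went through, the errno if it was refused,
 * and -1 if the limit could not be applied (or there is no Capsicum).
 */
int write_errno_after_read_only_limit(int fd)
{
    assert(fd >= 0);
#ifdef __FreeBSD__
    cap_rights_t rights;
    cap_rights_init(&rights, CAP_READ);
    if (cap_rights_limit(fd, &rights) != 0)
        return -1;
    if (write(fd, "x", 1) == 1)
        return 0;                   /* would mean the limit did nothing */
    return errno;
#else
    errno = ENOSYS;                 /* Capsicum is FreeBSD-only */
    return -1;
#endif
}
```

On FreeBSD, calling this on a descriptor that was perfectly writable a moment ago (the write end of a pipe, say) should return ENOTCAPABLE. The descriptor itself now carries the restriction, so it holds whichever code path reaches for it, including an attacker's.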
dhclient follows the same pattern. Open the socket, open the
lease file, enter capability mode. The DHCP client, which runs as root and
handles network input from untrusted sources, is sandboxed to precisely the
resources it requires. On most Linux distributions, a compromised dhclient
can read anything the dhclient user can read, which, given that it typically
runs as root, is everything.
A partial list of Capsicum-enabled tools in FreeBSD base is quietly
instructive: tcpdump, dhclient,
hastd, auditdistd, gzip,
and OpenSSH. These are precisely the tools an attacker targets first:
network-facing, parsing untrusted input, often running as root. On FreeBSD,
they are precisely the tools that have the least to offer once compromised.
The Point
Both approaches improve security. Meaningfully. seccomp-bpf protects millions of containers, phones, and browsers every day. Dismissing it would be ignorant. But the question, as ever, is architectural: would you rather patch the filter, or remove what needs filtering?
Capsicum eliminates ambient authority. seccomp restricts it. One locks the door and removes it from the hinges. The other hires a bouncer and hopes the guest list is complete.
The door that does not exist cannot be opened. Rather reassuring, that.