02 March 2026

Capsicum vs seccomp: Process Sandboxing

freebsd linux security unix

The Unix Way ■ Episode 06

The Inheritance Problem

A process is compromised. A buffer overflow, a malformed packet, a dependency with quiet ambitions. The attacker now controls execution. What happens next depends entirely on what the process can reach, and on a stock Unix system, the answer has been the same since 1969: everything the user can touch. Every file. Every socket. Every device. The compromised process inherits the full ambient authority of the user who launched it.

This is not a bug. It is the original Unix security model, and it served admirably when "users" meant researchers at Bell Labs who could be trusted not to run hostile code from the internet. The internet, regrettably, had other plans.

Two operating systems decided to fix this. They chose opposite philosophies. One removed the doors from the room. The other hired a bouncer and handed him a clipboard.

FreeBSD: Capsicum

In 2010, Robert Watson and Jonathan Anderson at the University of Cambridge presented a paper that won Best Student Paper at USENIX Security. The core insight was disarmingly simple: rather than listing what a process may not do, remove everything and hand back precisely what it needs.

The result was Capsicum, compiled into FreeBSD 10.0 by default in 2014. The API is one function that matters: cap_enter(). One syscall. Irreversible.

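#include <sys/capsicum.h>
#include <err.h>
#include <fcntl.h>

/* Defined elsewhere: writes captured packets to fd. */
void write_packets(int fd);

int main(void)
{
    /* Acquire resources while paths can still be resolved */
    int fd = open("/var/log/capture.pcap", O_WRONLY);
    if (fd < 0)
        err(1, "open");

    /* Restrict fd: write and seek only */
    cap_rights_t rights;
    cap_rights_init(&rights, CAP_WRITE, CAP_SEEK);
    if (cap_rights_limit(fd, &rights) < 0)
        err(1, "cap_rights_limit");

    /* Enter capability mode. No return. Ever. */
    if (cap_enter() < 0)
        err(1, "cap_enter");

    /*
     * The process loses all access to global namespaces.
     * No opening files by path. No binding or connecting to
     * network addresses. No reaching other processes by PID.
     * What remains: only the file descriptors already held,
     * with rights explicitly granted above.
     */

    write_packets(fd);
    return 0;
}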

There is no cap_exit(). There is no escalation path, no privilege restoration, no polite request form for processes that would like their authority back. The kernel sets a flag. The flag does not unset. The process enters a world that contains precisely its open file descriptors, restricted to the operations explicitly granted. The filesystem, the network, the process table: they do not merely become inaccessible. For the sandboxed process, they cease to exist.

The model is subtraction. Start with everything, remove everything, hand back precisely what is needed. The process cannot escape. Not because a filter stops it, but because the door no longer exists. One might observe that this is rather difficult to bypass.

Linux: seccomp-bpf

On Linux, the answer arrived in two stages. In 2005, Andrea Arcangeli added seccomp strict mode: four syscalls permitted (read, write, exit, sigreturn), everything else killed the process. Elegant, certainly. Also almost completely unusable for anything that needed to do actual work.

In 2012, Will Drewry introduced seccomp-bpf in Linux 3.5: a BPF programme that inspects every syscall at runtime and decides whether to allow, deny, or kill. This was genuinely useful. It was also a fundamentally different philosophy.

#include <fcntl.h>
#include <stdio.h>
#include <seccomp.h>

/* Defined elsewhere: writes captured packets to fd. */
void write_packets(int fd);

int main(void)
{
    /* Open the output file first: open() is not on the allowlist */
    int fd = open("/var/log/capture.pcap", O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Default: kill on any syscall not explicitly allowed */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL)
        return 1;

    /* Allowlist: only these syscalls permitted */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    /* Load the filter */
    if (seccomp_load(ctx) < 0)
        return 1;

    /*
     * From here:
     *   - Only listed syscalls work
     *   - But: all existing FDs retain full rights
     *   - An allowed read() can read ANY open FD
     *   - The filter checks the call, not the target
     */

    write_packets(fd);
    return 0;
}

The model is filtration. The process retains full ambient authority. A filter sits between the process and the kernel, checking each call against a list. The bouncer checks your name at the door. He does not check what you do once you are inside.

Docker's default seccomp profile is technically an allowlist, but a generous one: it disables roughly 44 of 300+ syscalls and lets the rest through. The difference between an allowlist and a blocklist is the difference between "you may enter rooms 3 and 7" and "you may enter any room except 12 and 15." One of these gets more dangerous as the building adds floors.

The Same Tool, Two Philosophies

The most instructive comparison is not theoretical. It is tcpdump.

tcpdump captures network packets. It typically starts as root, because it must open a capture device (a BPF device on FreeBSD, a packet socket on Linux). It parses untrusted network data from the wire. It is precisely the kind of tool an attacker dreams of compromising: root privileges, network-facing, parsing arbitrary input. Both FreeBSD and Linux sandbox it. They chose opposite approaches.

On FreeBSD, tcpdump uses Capsicum. It opens the BPF capture device and the output file, restricts their rights with cap_rights_limit(), then calls cap_enter(). From that moment, only the already-opened BPF descriptor and the output file remain. The filesystem does not exist. The network does not exist. New connections cannot be made. A compromised tcpdump on FreeBSD can read packets from one device and write to one file. Nothing else.

On Linux, tcpdump uses seccomp-bpf. A filter list decides which syscalls pass. The allowed calls retain full authority over every open file descriptor. An allowed read() can read any open FD. The filter checks the call, not the target. Same tool. Same threat model. One removes access. The other filters calls.

The structural difference matters when the kernel grows. Capsicum does not care how many syscalls the kernel adds. After cap_enter(), a new syscall that opens files does not work because the process is in capability mode. The restriction is structural, not enumerative. The kernel can gain a thousand new syscalls and the sandbox holds, because the sandbox is not a list of things you cannot do. It is the absence of the ability to do them.

The CVE That Proved the Point

In 2022, CVE-2022-30594 demonstrated precisely why the architectural difference matters. On Linux kernels before 5.17.2, PTRACE_SEIZE allowed local attackers to set PT_SUSPEND_SECCOMP, bypassing seccomp filters entirely. The filter was correct. Every rule was properly written. The mechanism to enforce it was not.

Capsicum's model is structurally immune to this class of attack. There is no filter to suspend. There is no enforcement layer that can be circumvented. The process is in capability mode. The global namespace does not exist. You cannot bypass a door that is not there, regardless of how creative your lock-picking skills might be.

A correct filter is only as strong as the mechanism enforcing it. This is the fundamental difference between subtraction and filtration.

The Epistemological Divide

The difference between Capsicum and seccomp is not a difference of implementation. It is a difference of epistemology, and it is worth spending a moment on why this matters beyond the technical detail.

A seccomp blocklist asks: "What should this process not be allowed to call?" This requires you to know, in advance, every dangerous thing a process might do. Every new kernel version, every new syscall, every new attack vector must be anticipated and added to the filter. The Linux kernel had 335 syscalls in 2012. It has over 450 today. Each addition is a potential gap in every seccomp profile that uses a blocklist.

Capsicum asks: "What does this process actually need?" Then it removes everything else. You do not enumerate the threats. You enumerate the requirements. The set of things a process needs is small, knowable, and stable. The set of things a process might abuse grows with every kernel release. One of these sets is rather easier to manage than the other.

The Practical Evidence

tcpdump is not alone. FreeBSD's base system quietly demonstrates what Capsicum makes possible across its most security-critical tools.

dhclient follows the same pattern. Open the socket, open the lease file, enter capability mode. The DHCP client, which runs as root and handles network input from untrusted sources, is sandboxed to precisely the resources it requires. On most Linux distributions, a compromised dhclient can read anything the dhclient user can read, which, given that it typically runs as root, is everything.

A sample of the Capsicum-enabled tools in FreeBSD base is instructive: tcpdump, dhclient, hastd, auditdistd, gzip, and OpenSSH. These are precisely the tools an attacker targets first: network-facing, parsing untrusted input, often running as root. On FreeBSD, they are precisely the tools that have the least to offer once compromised.

The Point

Both approaches improve security. Meaningfully. seccomp-bpf protects millions of containers, phones, and browsers every day. Dismissing it would be ignorant. But the question, as ever, is architectural: would you rather patch the filter, or remove what needs filtering?

Capsicum eliminates ambient authority. seccomp restricts it. One locks the door and removes it from the hinges. The other hires a bouncer and hopes the guest list is complete.

The door that does not exist cannot be opened. Rather reassuring, that.

The numbers: Capsicum permits approximately 190 of ~567 syscalls in capability mode. seccomp via Docker's default profile blocks ~44 of 300+. Capsicum overhead: one kernel flag per process, negligible. seccomp-bpf: per-syscall filter evaluation at runtime. CVE-2022-30594 bypassed seccomp entirely on Linux pre-5.17.2. Capsicum's model is not susceptible to this class of attack because there is no filter to suspend. Linux is not standing still: Landlock (merged in 5.13, 2021) adds filesystem sandboxing that moves closer to Capsicum's capability model. The original Capsicum paper (Watson, Anderson et al., USENIX Security 2010) remains the definitive explanation of the capability model.