Vivian Voss

Why We Restart to Fix It

architecture freebsd kubernetes devops

On Second Thought ⊣ Episode 10

The pager has gone off. Memory on the auth service is climbing in a way it should not be. You SSH in, you observe nothing in particular, you kubectl delete pod. The pod comes back, memory is fresh, the graph flattens. The on-call channel returns to silence. Nobody asks what was wrong.

This is the tenth episode of On Second Thought, a series about the daily routines we perform without ever quite deciding to. Today's routine is the one that runs at the top of half the world's incident response: when the machine misbehaves, restart it. Have you tried turning it off and on again. The line is a joke, until it is a runbook, until it is the production strategy, until it is the only debugging step we still know.

The Axiom

The reflex is universal. A stuck container, a frozen browser tab, a JVM that has been ruminating on a class loader for forty minutes, a Kafka consumer that fell off a partition, a connection pool that quietly stopped reaping idle handles. The runbook for half the world's incidents has three steps, and the first two are window-dressing for the third. We accept this as the natural response to a system that fails, in the same way we accept that traffic jams are simply the price of having cars. On second thought, both deserve a second thought.

The strange thing is not that we restart. Restart is, in the right architecture, a perfectly reasonable response to a particular class of fault. The strange thing is that the restart has become the diagnosis, and that we have built an entire generation of platforms around the assumption that it would.

[Figure: The Restart Reflex, End to End. Pager fires (memory climbing), SSH in (observe nothing), kubectl delete pod (the diagnosis), pod returns (memory fresh), alert clears (graph flattens), nobody asks what was wrong (the question that was never written down).]

The Origin

The reflex has two parents, and we mostly inherited only one of them.

The first is the consumer-electronics tradition, codified in The IT Crowd's catchphrase but older than the show. A Sky+ box from 2004, a Windows laptop, a wireless router: a device that has wandered into an unknown state can be returned to a known one only by power-cycling it, because the device offers no language in which to be asked where it has gone wrong. The assumption was reasonable for the hardware of its time. It was honest about its limits: the only legible interface to the device's interior was the off switch.

The second tradition is the one engineers built deliberately to replace the first, and it is the one we mostly forgot.

In 1986, at Ericsson's Computer Science Laboratory in Stockholm, Joe Armstrong, Robert Virding and Mike Williams began work on what became Erlang, a language for the kinds of telephone exchanges that simply could not be allowed to go down. The constraint was not academic. A switch that dropped calls cost regulatory fines and lost contracts. A switch that ran for a year between reboots was the point. The best-known product of that work, the AXD301 ATM switch, runs on roughly two million lines of Erlang and is the system most often cited for "nine nines" of reliability: on the order of 31 milliseconds of downtime per year. The figure is contested in the way every figure of that shape is contested: whether the measurement was apples-to-apples, whether it included planned maintenance, whether the operational data was systematically collected. The architecture that produced it, however, is uncontested, and it is the architecture that matters here.
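The arithmetic behind the "nine nines" figure can be checked in a few lines. This is a minimal sketch, assuming a year of 365.25 days; the function name is illustrative and nothing here comes from Ericsson's own measurements.

```python
# Assumption: one year is 365.25 days (31,557,600 seconds).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_seconds_per_year(nines: int) -> float:
    """Downtime per year permitted by an availability of `nines` nines."""
    unavailability = 10.0 ** -nines  # nine nines -> 1e-9
    return SECONDS_PER_YEAR * unavailability


for n in (3, 5, 9):
    ms = downtime_seconds_per_year(n) * 1000
    print(f"{n} nines: {ms:.1f} ms of downtime per year")
```

For nine nines this comes to roughly 31.6 milliseconds, which is where the "on the order of 31 milliseconds" figure originates.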

Armstrong's principle, on the surface, looked exactly like the consumer tradition: when a process gets into a bad state, terminate it. He called it "let it crash", and the phrase has done more damage to the idea than any critic could. Read as a slogan it sounds like the Sky+ box: when in doubt, kill it. Read as architecture, it is the opposite.

[Figure: Two Parents. We Kept One. Consumer tradition (Sky+ box from 2004, Windows laptop, wireless router): the power button, no language to be asked, honest about its limits; what we inherited. Stockholm tradition (Erlang, 1986): worker process, crash as message, supervisor with a written strategy (one-for-one, one-for-all, rest-for-one), escalation limit leading up the tree, a failure is data; what we left in Stockholm.]

Three properties make it architecture.

First, processes are isolated. An Erlang process is not a thread, and not a coroutine; it has its own heap, its own message queue, and shares nothing mutable with any other process. When one crashes, it cannot corrupt the state of another, because there is no shared state to corrupt. A crash takes itself with it and nothing else.

Second, every worker has a supervisor. The supervisor is not a vague concept; it is a specific process, with a specific role, defined in OTP, the standard Erlang library. When a worker crashes, the crash is delivered to its supervisor as a message. The supervisor decides what to do.
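The first two properties, isolation and the crash arriving as a message, can be approximated with ordinary OS processes. This is only an analogy: an Erlang process is far lighter than an OS process, and the ExitMessage shape, the worker source and the names below are hypothetical, not OTP's.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ExitMessage:
    """What the supervising side receives when a worker ends (hypothetical shape)."""
    worker: str
    returncode: int
    reason: str  # last line of the worker's stderr, if any


# The worker owns its state, corrupts it, then dies with an uncaught error.
WORKER_SOURCE = """
state = {"sessions": 3}
state["sessions"] = None          # the worker wanders into a bad state
raise RuntimeError("bad state")   # and crashes instead of limping on
"""


def run_worker(name: str, source: str) -> ExitMessage:
    """Run `source` in a separate OS process and report how it ended."""
    done = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
    )
    lines = done.stderr.strip().splitlines()
    return ExitMessage(name, done.returncode, lines[-1] if lines else "")


supervisor_state = {"sessions": 3}  # lives in the parent process only
msg = run_worker("auth-worker", WORKER_SOURCE)
print(msg)               # the crash, delivered as data
print(supervisor_state)  # untouched by the worker's corruption
```

The parent never shares memory with the worker, so the worker's bad state dies with it, and what the parent holds afterwards is a value it can make a decision about.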

Third, the supervisor decides according to a written strategy. The strategies have names: one-for-one (restart only the crashed worker), one-for-all (restart all siblings), rest-for-one (restart the crashed worker and every sibling started after it). Every supervisor has a maximum restart frequency, and when the frequency is exceeded, the supervisor itself crashes, which delivers the failure to its supervisor, one level up. A failure escalates up a tree, not a runbook. The rule that handles it was written years before the outage.
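The three strategies and the restart limit fit in a small simulation. This is a sketch in Python rather than OTP itself: the class, its field names and the exact window arithmetic are assumptions made for illustration, and no real processes are started.

```python
from dataclasses import dataclass, field


class EscalatedFailure(Exception):
    """Raised when a supervisor exceeds its restart limit and gives up."""


@dataclass
class Supervisor:
    children: list[str]            # child names, in start order
    strategy: str = "one_for_one"  # or "one_for_all", "rest_for_one"
    max_restarts: int = 3          # restarts tolerated ...
    window: float = 5.0            # ... within this many seconds
    restart_times: list[float] = field(default_factory=list)

    def to_restart(self, crashed: str) -> list[str]:
        """Which children the written strategy says to restart."""
        i = self.children.index(crashed)
        if self.strategy == "one_for_one":
            return [crashed]
        if self.strategy == "one_for_all":
            return list(self.children)
        if self.strategy == "rest_for_one":
            return self.children[i:]  # the crashed child and those started after it
        raise ValueError(f"unknown strategy: {self.strategy}")

    def handle_exit(self, crashed: str, now: float) -> list[str]:
        """Apply the strategy, or escalate if restarts are too frequent."""
        # Keep only the restarts that still fall inside the sliding window.
        self.restart_times = [t for t in self.restart_times if now - t <= self.window]
        self.restart_times.append(now)
        if len(self.restart_times) > self.max_restarts:
            raise EscalatedFailure(
                f"{crashed}: more than {self.max_restarts} restarts in {self.window}s"
            )
        return self.to_restart(crashed)


sup = Supervisor(["db", "cache", "api"], strategy="rest_for_one", max_restarts=2)
print(sup.handle_exit("cache", now=0.0))  # -> ['cache', 'api']
print(sup.handle_exit("cache", now=1.0))  # -> ['cache', 'api']
try:
    sup.handle_exit("cache", now=2.0)     # third restart inside five seconds
except EscalatedFailure as exc:
    print("escalated:", exc)
```

In a real tree the exception would be the supervisor's own exit, received by the supervisor one level up and handled by that supervisor's strategy in turn.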

[Figure: OTP Supervisor Tree. A top supervisor (rest-for-one) sits above supervisor A (one-for-one), where only the crashed worker restarts, and supervisor B (one-for-all), where all siblings restart together; exceeding the escalation limit passes the failure up. A crash is delivered as a message, not noticed as an alarm. The rule that handles it was written years before the outage.]

Let it crash, in its proper form, is not "have you tried turning it off and on again." It is "we have already decided what to do when this fails, and we wrote it down." The restart is the same gesture. The contract underneath is wholly different.

The Cost

What we kept of let-it-crash is the let-it-crash. What we left in Stockholm is the supervisor.

The first cost lands daily. Restart is the diagnosis. The pod comes back, the alert clears, the day is shipped. The cause of the alert is not investigated, because nothing in the response made room for investigation: the runbook said restart, the restart worked, the page closed. A memory leak, a file-descriptor exhaustion, a lock contention, a queue backing up because a downstream service is throttling, each leaves the same heartbeat-recovery signature on a dashboard, and each needs a different fix. The restart erases the question that distinguishes them. The bug remains exactly as resident in the code as it was before the pager went off, with the small refinement that the team is now slightly more trained to ignore it.

[Figure: Four Different Causes, One Heartbeat-Recovery Signature. Memory leak, file-descriptor exhaustion, lock contention and a queue backing up behind a throttled downstream all follow the same path: liveness probe fails, same dashboard signature, kill and replace pod, cause not captured. Each cause needs a different fix. The restart treats them as one. Self-healing, in the paracetamol sense.]

The second cost is structural. We have built whole platforms on the assumption. Kubernetes liveness and readiness probes are, in the honest reading, a contract that the orchestrator will rotate the symptoms while the cause goes unexamined. A container that fails its liveness check is killed and restarted. There is no concept, in the standard Kubernetes flow, of capturing the dying process's state, of preserving the crash for later inspection, of asking why before the next pod is scheduled. "Self-healing" is the marketing term for this, and it is accurate in the sense that a person who takes paracetamol every four hours has a self-healing headache. The symptom keeps disappearing. The cause has not been touched.
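What "asking why before the next pod is scheduled" might look like, in miniature: write the evidence down first, restart second. This is a sketch under stated assumptions, not a Kubernetes feature. The function names are hypothetical, the process inspects only itself, the open-descriptor count relies on the Linux /proc filesystem, and the thread stacks come from CPython's sys._current_frames.

```python
import json
import os
import sys
import tempfile
import threading
import time
import traceback
from pathlib import Path


def snapshot_self() -> dict:
    """Collect cheap evidence about the current process (stdlib only)."""
    frames = sys._current_frames()  # CPython: thread id -> current stack frame
    fd_dir = "/proc/self/fd"        # Linux-specific; absent elsewhere
    return {
        "taken_at": time.time(),
        "pid": os.getpid(),
        "open_fds": len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else None,
        "threads": {
            t.name: traceback.format_stack(frames[t.ident])
            for t in threading.enumerate()
            if t.ident in frames
        },
    }


def restart_with_evidence(restart, evidence_dir: Path, snapshot=snapshot_self) -> Path:
    """Write a snapshot to disk, then call `restart`. Returns the snapshot path."""
    evidence_dir.mkdir(parents=True, exist_ok=True)
    path = evidence_dir / f"snapshot-{os.getpid()}-{time.time_ns()}.json"
    path.write_text(json.dumps(snapshot(), indent=2))
    restart()  # only after the evidence is safely on disk
    return path


# Usage: the "restart" here is a stand-in callable that just records the call.
calls = []
with tempfile.TemporaryDirectory() as tmp:
    saved = restart_with_evidence(lambda: calls.append("restarted"), Path(tmp))
    evidence = json.loads(saved.read_text())
    print(saved.name, sorted(evidence))
```

The ordering is the whole design: the restart still happens, but it no longer erases the only record of what the process was doing when it was killed.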

The third cost is institutional. A team that restarts to fix gets very good at restarting and never gets good at diagnosing. The post-incident review produces a runbook with an additional command. The runbook is consulted the next time the pager goes off; another command is added; the team's collective intuition about the system shifts from "what is this system actually doing" to "what sequence of recovery steps clears the current alert". In the worst case, the only conjecture anybody had about why the alert ever fired leaves quietly with the last engineer who maintained the service, and the new on-call rotation inherits the runbook but not the model.

The fourth cost is the one this series exists to point at: we have stopped expecting our systems to be debuggable. The restart was a shortcut, originally; we took it because diagnosing the live system was hard, and the restart was cheap, and the bug was small. We then built more software on top of that shortcut, and more on top of that, until "you cannot reasonably diagnose this in production" stopped being an embarrassment and started being a feature description.

The Question

There is software in operation that does not work this way, and the alternative is older and quieter than the current default.

In 2015 WhatsApp was reported to be serving some 900 million users with around fifty engineers. Its backend is Erlang. The supervisor model from Stockholm runs the company's production. Crashes happen. They are caught by their supervisors. The strategies are written. The escalation tree handles the rest. The engineers do not spend their day power-cycling boxes; they spend it writing the rules under which the boxes manage themselves. It was a small, deliberate team operating a system that, by every comparable measure, ought to have required many times its size to keep running. The supervisor architecture is why.

In the unixoid tradition, the FreeBSD base provides the operator's half of the same picture. init and rc.d share Stockholm's habit of writing things down: explicit start, explicit dependency and, where a service is run under daemon(8) with its restart option, explicit recovery. A service has a script that says how it starts and what must be up before it starts; run under daemon(8), it also says what to do when it dies. When a service on a FreeBSD machine misbehaves, the operator has dtrace to follow what the kernel and user-space code are actually doing, ktrace to record system calls for later inspection, procstat and fstat to read what a process is holding, post-mortem core dumps that survive the crash and can be examined at leisure, and a kernel that will, with some precision, tell you what process held what lock at what time. The reboot is available, on FreeBSD as everywhere else. It is rarely the first reach, because the system is willing to speak, and the operator has been trained to listen.

[Figure: When the Service Misbehaves, What Does the Operator Have? Kubernetes (default): kubectl delete pod is the only reach; cause not captured, probe rotates symptom, runbook gains a line, the diagnosis becomes the runbook. FreeBSD base: dtrace, ktrace, procstat, fstat, post-mortem core dump, what process held what lock when; reboot available, rarely the first reach; the system is willing to speak.]

So the honest question is not whether to keep the restart. The runbooks have it for a reason and they are not foolish. The restart, in a supervisor architecture, is a perfectly normal recovery step. The question is the one we did not write down: in a system that fails, was the restart the answer, or the moment the question got dropped?

A restart, on second thought, is not a tool. It is a measurement. It tells you, with some precision, how much of the cause you decided you could afford to leave unknown.