Tales from the Bare Metal ☞ Episode 03
« Thou shalt not let a regex run unbounded anywhere! »
13:42 UTC on Tuesday, 2 July 2019. A Cloudflare engineer deploys a single new managed WAF rule. Within seconds, every Cloudflare server in the world is at 100% CPU and incoming HTTP traffic stops moving. Twenty-seven minutes later, with traffic restored, the engineers will find that the rule was a regex of forty-five characters.
This is the story of how a regular expression that compiles cleanly, passes its tests, deploys via a globally distributed key-value store in seconds, and is enabled in simulation mode rather than blocking mode, takes down a sizeable fraction of the public internet for the duration of a coffee break. The interesting part is not the regex. The interesting part is the architecture that gave the regex unrestricted access to every CPU core on the network.
The Incident, in UTC
- 13:42 Engineer deploys the new managed WAF rule. Automatic deployment via Quicksilver, Cloudflare's distributed key-value store, propagates the change worldwide in seconds
- 13:45 First PagerDuty alert: synthetic WAF test fails
- 14:00 WAF identified as the root cause of the global CPU saturation
- 14:02 A "global terminate" of the WAF Managed Rulesets is proposed
- 14:07 Global terminate executed
- 14:09 Traffic and CPU return to normal worldwide
- 14:52 WAF re-enabled globally after the offending rule is rolled back and the change is tested
Twenty-seven minutes from rule push to traffic restoration. Two further hours before the WAF Managed Rulesets are fully re-enabled.
The Diagnosis
The offending rule was a managed XSS-detection regular expression in Cloudflare's WAF. Its full form is a forty-five-character pattern; the catastrophic part is one sub-expression:
(?:.*=.*)
Reduced to its essence:
.*.*=.*
Two greedy quantifiers (.*) followed by a
literal (=) followed by another greedy
quantifier. When the engine matches this against a long
input that does not contain =, it has to try
every possible way of splitting the input between the two
.* quantifiers before concluding failure. For
an input of length n, there are O(n2)
possible splits to try. For longer inputs against more
complex variants of this pattern, the worst case escalates
rapidly. This is the classic shape of
catastrophic backtracking, a failure mode
known under the acronym ReDoS (regular
expression denial of service).
Cloudflare's WAF was implemented in Lua, running inside nginx (the OpenResty stack), with PCRE (Perl Compatible Regular Expressions) doing the actual regex matching. PCRE evaluates patterns by backtracking and has no built-in runtime budget; once a match attempt begins, it runs until it succeeds or exhausts the search space.
On a single host, a single pathological input produces a single CPU-bound match attempt. Multiplied by every HTTP request that hits the WAF, multiplied by every CPU core on every Cloudflare edge server worldwide, this is exactly the global saturation that took the network down.
The Context
Three systemic conditions shaped the afternoon.
The author was solving for coverage, not performance. The rule was new; the goal was to catch a wider class of XSS payloads. The test harness verified that the rule matched the patterns it was supposed to match and did not match the patterns it was supposed to ignore. CPU-time profiling on pathological inputs was not part of the harness, and Cloudflare's postmortem notes explicitly that previous WAF builds had not logged any CPU regression of the kind that would have flagged the new rule. The rule looked fine to every gatekeeper between author and edge.
The engine had no runaway protection. PCRE has no built-in CPU budget; this is a property of the algorithm class (backtracking matchers cannot bound their own runtime in general). Cloudflare had previously had a CPU-usage protection mechanism in the WAF, but it had been removed for unrelated reasons and was never reinstated. The combination of an unbounded engine and a removed wrapper-level guard meant the runtime budget was, in practice, infinite.
The deployment path had no staged rollout for managed rules. Quicksilver pushes changes globally in seconds. This is a feature for emergency kill switches: when something has to be turned off everywhere at once, seconds matter. For ordinary rule changes, it is a feature that converts a local bug into a global outage. There was no canary stage, no per-data-centre soak time, no automated rollback on CPU regression. And the dashboard and API that customers and operators would have used to disable the rule manually ran on the same Cloudflare edge network; once the edge was saturated, the control plane was unreachable too.
The pattern across the three is familiar: each system was reasonable on its own. Quicksilver was excellent for what it was designed for. PCRE was an industry-standard regex engine. The XSS rule was a reasonable response to a known attack class. The failure mode lived in the spaces between them, where no single component owned the question "what is the worst case here?"
The Principle
The unixoid response to "this code path could run forever" is to give it a budget that the kernel enforces. The detailed answer has three parts.
Use an engine with runtime guarantees. The
linear-time regex engines, of which
Google's RE2
and the Rust regex crate are the two
production-grade examples, both refuse at compile time to
accept patterns whose worst case is exponential, and both
run any pattern they do accept in time proportional to the
input. Cloudflare's preventive measures, announced in the
postmortem,
include switching the WAF from PCRE to either RE2 or Rust
regex. The linear-time class was first published as a
theoretical result by
Ken Thompson in 1968
and has been available in production form since RE2 was
open-sourced in 2010. It is not new technology; it is
technology that has not yet displaced its older sibling
everywhere it should.
Where the engine cannot be changed, cap it from
outside. On FreeBSD,
rctl(8)
enforces resource limits on processes, jails and users:
CPU time, memory, file descriptors, and several others. A
regex worker run inside a jail with
rctl -a jail:waf:cputime:deny=200ms cannot
eat more than 200 milliseconds of CPU per process before
the kernel terminates it. On any Unix,
ulimit -t in the parent shell or a separate
worker process per request with a kill-after timeout
provides a coarser version of the same guarantee. The
principle is older than software: the operator pays for an
unbounded computation; the kernel is the only place that
can refuse on the operator's behalf.
Treat global propagation as the kill switch, not the default. Quicksilver propagating worldwide in seconds is the right tool for "turn this off everywhere" or "patch the active CVE". It is the wrong default for "deploy a new rule": the cost of a slow rollout (minutes per region) is the upper bound on the blast radius of a bug. Cloudflare's SOP changes announced after the outage include a staged rollout path for managed rules while retaining the emergency global lane. Two paths, two purposes, two SLAs.
The FreeBSD context is worth pausing on. The BSD camp has
been arguing for runtime-bounded execution at the kernel
boundary for decades: rctl for CPU and memory
caps, jails for
blast-radius containment,
Capsicum for
capability-level sandboxing, the Unix-traditional
discipline of one job per process with a clean exit path.
None of this needed to be invented in 2019. It needed to
be applied at the point where the WAF was making decisions
in the request hot path. The Cloudflare team has applied
the equivalent answers since; the lesson, for the rest of
the industry, is that the equivalent answers exist and
have been quietly proven for a long time.
Where It Travels
Cloudflare runs the WAF, but the pattern is everywhere a regex meets user input:
- Every Web Application Firewall. Cloudflare, AWS WAF, F5, ModSecurity, custom-built rules
- Every intrusion detection or prevention system. Suricata, Zeek and snort all run rule sets containing regex against packet payloads or extracted strings
- Every centralised logging stack. Splunk, Elastic and Datadog and similar systems regex inbound log lines for parsing and routing. A pathological message can melt the parsing tier
- Every JSON or GraphQL schema validator that supports
patternkeywords. A schema author can write a ReDoS into an API definition without realising it - Every text-classification or content-moderation pipeline that runs regex on user-submitted content as a first-pass filter
- Every Python
reapplication on a streaming source, where one bad input can stall the whole stream - Every dashboard that runs on the platform it monitors. This is not the regex lesson; it is its travelling companion. A status page, a kill-switch console or a deployment dashboard that depends on the same infrastructure it controls will be unreachable exactly when it is needed
Forty-five characters of expression, twenty-seven minutes of internet. The engine did exactly what the regex asked for. The architecture decided what exactly meant.