13 May 2026 Read on LinkedIn

The Regex That Ran Unbounded

security performance architecture freebsd

Tales from the Bare Metal ☞ Episode 03

« Thou shalt not let a regex run unbounded anywhere! »

13:42 UTC on Tuesday, 2 July 2019. A Cloudflare engineer deploys a single new managed WAF rule. Within seconds, every Cloudflare server in the world is at 100% CPU and incoming HTTP traffic stops moving. Twenty-seven minutes later, with traffic restored, the engineers will find that the rule was a regex of forty-five characters.

This is the story of how a regular expression that compiles cleanly, passes its tests, deploys via a globally distributed key-value store in seconds, and is enabled in simulation mode rather than blocking mode, takes down a sizeable fraction of the public internet for the duration of a coffee break. The interesting part is not the regex. The interesting part is the architecture that gave the regex unrestricted access to every CPU core on the network.

The Incident, in UTC

13:42 Engineer deploys the new managed WAF rule. Automatic deployment via Quicksilver, Cloudflare's distributed key-value store, propagates the change worldwide in seconds
13:45 First PagerDuty alert: synthetic WAF test fails
14:00 WAF identified as the root cause of the global CPU saturation
14:02 A "global terminate" of the WAF Managed Rulesets is proposed
14:07 Global terminate executed
14:09 Traffic and CPU return to normal worldwide
14:52 WAF re-enabled globally after the offending rule is rolled back and the change is tested

Twenty-seven minutes from rule push to traffic restoration. Two further hours before the WAF Managed Rulesets are fully re-enabled.

The Diagnosis

The offending rule was a managed XSS-detection regular expression in Cloudflare's WAF. Its full form is a forty-five-character pattern; the catastrophic part is one sub-expression:

(?:.*=.*)

Reduced to its essence:

.*.*=.*

Two greedy quantifiers (.*) followed by a literal (=) followed by another greedy quantifier. When the engine matches this against a long input that does not contain =, it has to try every possible way of splitting the input between the two .* quantifiers before concluding failure. For an input of length n, there are O(n²) possible splits to try. For longer inputs against more complex variants of this pattern, the worst case escalates rapidly. This is the classic shape of catastrophic backtracking, a failure mode known under the acronym ReDoS (regular expression denial of service).

Cloudflare's WAF was implemented in Lua, running inside nginx (the OpenResty stack), with PCRE (Perl Compatible Regular Expressions) doing the actual regex matching. PCRE evaluates patterns by backtracking and has no built-in runtime budget; once a match attempt begins, it runs until it succeeds or exhausts the search space.

On a single host, a single pathological input produces a single CPU-bound match attempt. Multiplied by every HTTP request that hits the WAF, multiplied by every CPU core on every Cloudflare edge server worldwide, this is exactly the global saturation that took the network down.

The Context

Three systemic conditions shaped the afternoon.

The author was solving for coverage, not performance. The rule was new; the goal was to catch a wider class of XSS payloads. The test harness verified that the rule matched the patterns it was supposed to match and did not match the patterns it was supposed to ignore. CPU-time profiling on pathological inputs was not part of the harness, and Cloudflare's postmortem notes explicitly that previous WAF builds had not logged any CPU regression of the kind that would have flagged the new rule. The rule looked fine to every gatekeeper between author and edge.

The engine had no runaway protection. PCRE has no built-in CPU budget; this is a property of the algorithm class (backtracking matchers cannot bound their own runtime in general). Cloudflare had previously had a CPU-usage protection mechanism in the WAF, but it had been removed for unrelated reasons and was never reinstated. The combination of an unbounded engine and a removed wrapper-level guard meant the runtime budget was, in practice, infinite.

The deployment path had no staged rollout for managed rules. Quicksilver pushes changes globally in seconds. This is a feature for emergency kill switches: when something has to be turned off everywhere at once, seconds matter. For ordinary rule changes, it is a feature that converts a local bug into a global outage. There was no canary stage, no per-data-centre soak time, no automated rollback on CPU regression. And the dashboard and API that customers and operators would have used to disable the rule manually ran on the same Cloudflare edge network; once the edge was saturated, the control plane was unreachable too.

The pattern across the three is familiar: each system was reasonable on its own. Quicksilver was excellent for what it was designed for. PCRE was an industry-standard regex engine. The XSS rule was a reasonable response to a known attack class. The failure mode lived in the spaces between them, where no single component owned the question "what is the worst case here?"

The Principle

The unixoid response to "this code path could run forever" is to give it a budget that the kernel enforces. The detailed answer has three parts.

Use an engine with runtime guarantees. The linear-time regex engines, of which Google's RE2 and the Rust regex crate are the two production-grade examples, both refuse at compile time to accept patterns whose worst case is exponential, and both run any pattern they do accept in time proportional to the input. Cloudflare's preventive measures, announced in the postmortem, include switching the WAF from PCRE to either RE2 or Rust regex. The linear-time class was first published as a theoretical result by Ken Thompson in 1968 and has been available in production form since RE2 was open-sourced in 2010. It is not new technology; it is technology that has not yet displaced its older sibling everywhere it should.

Where the engine cannot be changed, cap it from outside. On FreeBSD, rctl(8) enforces resource limits on processes, jails and users: CPU time, memory, file descriptors, and several others. A regex worker run inside a jail with rctl -a jail:waf:cputime:deny=200ms cannot eat more than 200 milliseconds of CPU per process before the kernel terminates it. On any Unix, ulimit -t in the parent shell or a separate worker process per request with a kill-after timeout provides a coarser version of the same guarantee. The principle is older than software: the operator pays for an unbounded computation; the kernel is the only place that can refuse on the operator's behalf.

Treat global propagation as the kill switch, not the default. Quicksilver propagating worldwide in seconds is the right tool for "turn this off everywhere" or "patch the active CVE". It is the wrong default for "deploy a new rule": the cost of a slow rollout (minutes per region) is the upper bound on the blast radius of a bug. Cloudflare's SOP changes announced after the outage include a staged rollout path for managed rules while retaining the emergency global lane. Two paths, two purposes, two SLAs.

The FreeBSD context is worth pausing on. The BSD camp has been arguing for runtime-bounded execution at the kernel boundary for decades: rctl for CPU and memory caps, jails for blast-radius containment, Capsicum for capability-level sandboxing, the Unix-traditional discipline of one job per process with a clean exit path. None of this needed to be invented in 2019. It needed to be applied at the point where the WAF was making decisions in the request hot path. The Cloudflare team has applied the equivalent answers since; the lesson, for the rest of the industry, is that the equivalent answers exist and have been quietly proven for a long time.

Where It Travels

Cloudflare runs the WAF, but the pattern is everywhere a regex meets user input:

Every Web Application Firewall. Cloudflare, AWS WAF, F5, ModSecurity, custom-built rules
Every intrusion detection or prevention system. Suricata, Zeek and snort all run rule sets containing regex against packet payloads or extracted strings
Every centralised logging stack. Splunk, Elastic and Datadog and similar systems regex inbound log lines for parsing and routing. A pathological message can melt the parsing tier
Every JSON or GraphQL schema validator that supports pattern keywords. A schema author can write a ReDoS into an API definition without realising it
Every text-classification or content-moderation pipeline that runs regex on user-submitted content as a first-pass filter
Every Python re application on a streaming source, where one bad input can stall the whole stream
Every dashboard that runs on the platform it monitors. This is not the regex lesson; it is its travelling companion. A status page, a kill-switch console or a deployment dashboard that depends on the same infrastructure it controls will be unreachable exactly when it is needed

Forty-five characters of expression, twenty-seven minutes of internet. The engine did exactly what the regex asked for. The architecture decided what exactly meant.

The 27-minute shape, the preventive answer:

The clock (UTC, 2 July 2019): 13:42 rule pushed via Quicksilver; 13:45 synthetic-test alerts; 14:00 WAF identified; 14:07 global terminate; 14:09 traffic and CPU normal; 14:52 WAF re-enabled after rollback
The offending sub-expression: (?:.*=.*), simplified .*.*=.*. Two greedy quantifiers in sequence; classic catastrophic backtracking on PCRE
Cloudflare's preventive measures (per the John Graham-Cumming postmortem): switch to RE2 or Rust regex (both linear-time); reintroduce CPU-usage protection; performance profiling in CI on pathological inputs; staged rollout SOP for managed rules with the global-deploy lane retained for kill switches; separate the control-plane dashboard from the edge network it controls
No data loss, no breach. CPU saturation on every edge core; control plane sharing fate with the network. The wrong engine in the wrong code path with the wrong rollout