Vivian Voss

The Regex That Ran Unbounded

security performance architecture freebsd

Tales from the Bare Metal ☞ Episode 03

« Thou shalt not let a regex run unbounded anywhere! »

13:42 UTC on Tuesday, 2 July 2019. A Cloudflare engineer deploys a single new managed WAF rule. Within seconds, every Cloudflare server in the world is at 100% CPU and incoming HTTP traffic stops moving. Twenty-seven minutes later, with traffic restored, the engineers will find that the rule was a regex of forty-five characters.

This is the story of how a regular expression that compiles cleanly, passes its tests, deploys via a globally distributed key-value store in seconds, and is enabled in simulation mode rather than blocking mode, takes down a sizeable fraction of the public internet for the duration of a coffee break. The interesting part is not the regex. The interesting part is the architecture that gave the regex unrestricted access to every CPU core on the network.

The Incident, in UTC

  • 13:42 Engineer deploys the new managed WAF rule. Automatic deployment via Quicksilver, Cloudflare's distributed key-value store, propagates the change worldwide in seconds
  • 13:45 First PagerDuty alert: synthetic WAF test fails
  • 14:00 WAF identified as the root cause of the global CPU saturation
  • 14:02 A "global terminate" of the WAF Managed Rulesets is proposed
  • 14:07 Global terminate executed
  • 14:09 Traffic and CPU return to normal worldwide
  • 14:52 WAF re-enabled globally after the offending rule is rolled back and the change is tested

Twenty-seven minutes from rule push to traffic restoration. Two further hours before the WAF Managed Rulesets are fully re-enabled.

2 July 2019, the 27-Minute Shape 27 minutes, traffic stopped 13:42 rule pushed via Quicksilver 13:45 synthetic test alerts 14:00 WAF identified 14:07 global terminate 14:09 traffic restored 14:52 WAF re-enabled

The Diagnosis

The offending rule was a managed XSS-detection regular expression in Cloudflare's WAF. Its full form is a forty-five-character pattern; the catastrophic part is one sub-expression:

(?:.*=.*)

Reduced to its essence:

.*.*=.*

Two greedy quantifiers (.*) followed by a literal (=) followed by another greedy quantifier. When the engine matches this against a long input that does not contain =, it has to try every possible way of splitting the input between the two .* quantifiers before concluding failure. For an input of length n, there are O(n2) possible splits to try. For longer inputs against more complex variants of this pattern, the worst case escalates rapidly. This is the classic shape of catastrophic backtracking, a failure mode known under the acronym ReDoS (regular expression denial of service).

Two Engine Classes on the Same Bad Input input length CPU time PCRE backtracking, exponential worst case RE2 / Rust regex linear time, refuses exponential patterns at compile time

Cloudflare's WAF was implemented in Lua, running inside nginx (the OpenResty stack), with PCRE (Perl Compatible Regular Expressions) doing the actual regex matching. PCRE evaluates patterns by backtracking and has no built-in runtime budget; once a match attempt begins, it runs until it succeeds or exhausts the search space.

On a single host, a single pathological input produces a single CPU-bound match attempt. Multiplied by every HTTP request that hits the WAF, multiplied by every CPU core on every Cloudflare edge server worldwide, this is exactly the global saturation that took the network down.

The Context

Three systemic conditions shaped the afternoon.

The author was solving for coverage, not performance. The rule was new; the goal was to catch a wider class of XSS payloads. The test harness verified that the rule matched the patterns it was supposed to match and did not match the patterns it was supposed to ignore. CPU-time profiling on pathological inputs was not part of the harness, and Cloudflare's postmortem notes explicitly that previous WAF builds had not logged any CPU regression of the kind that would have flagged the new rule. The rule looked fine to every gatekeeper between author and edge.

The engine had no runaway protection. PCRE has no built-in CPU budget; this is a property of the algorithm class (backtracking matchers cannot bound their own runtime in general). Cloudflare had previously had a CPU-usage protection mechanism in the WAF, but it had been removed for unrelated reasons and was never reinstated. The combination of an unbounded engine and a removed wrapper-level guard meant the runtime budget was, in practice, infinite.

The deployment path had no staged rollout for managed rules. Quicksilver pushes changes globally in seconds. This is a feature for emergency kill switches: when something has to be turned off everywhere at once, seconds matter. For ordinary rule changes, it is a feature that converts a local bug into a global outage. There was no canary stage, no per-data-centre soak time, no automated rollback on CPU regression. And the dashboard and API that customers and operators would have used to disable the rule manually ran on the same Cloudflare edge network; once the edge was saturated, the control plane was unreachable too.

Three Conditions, One Outage .*.*=.* PCRE, two greedy quantifiers 1 ■ Author solving for coverage CPU-time profiling not in the harness; no prior CPU regression logged 2 ■ Engine without budget PCRE backtracks until done; the earlier CPU-usage guard had been removed and not reinstated 3 ■ Deployment without canary Quicksilver global in seconds; the dashboard ran on the same saturated edge it would have switched off No single component owned the question "what is the worst case here?" The failure lived in the spaces between them.

The pattern across the three is familiar: each system was reasonable on its own. Quicksilver was excellent for what it was designed for. PCRE was an industry-standard regex engine. The XSS rule was a reasonable response to a known attack class. The failure mode lived in the spaces between them, where no single component owned the question "what is the worst case here?"

The Principle

The unixoid response to "this code path could run forever" is to give it a budget that the kernel enforces. The detailed answer has three parts.

Use an engine with runtime guarantees. The linear-time regex engines, of which Google's RE2 and the Rust regex crate are the two production-grade examples, both refuse at compile time to accept patterns whose worst case is exponential, and both run any pattern they do accept in time proportional to the input. Cloudflare's preventive measures, announced in the postmortem, include switching the WAF from PCRE to either RE2 or Rust regex. The linear-time class was first published as a theoretical result by Ken Thompson in 1968 and has been available in production form since RE2 was open-sourced in 2010. It is not new technology; it is technology that has not yet displaced its older sibling everywhere it should.

Where the engine cannot be changed, cap it from outside. On FreeBSD, rctl(8) enforces resource limits on processes, jails and users: CPU time, memory, file descriptors, and several others. A regex worker run inside a jail with rctl -a jail:waf:cputime:deny=200ms cannot eat more than 200 milliseconds of CPU per process before the kernel terminates it. On any Unix, ulimit -t in the parent shell or a separate worker process per request with a kill-after timeout provides a coarser version of the same guarantee. The principle is older than software: the operator pays for an unbounded computation; the kernel is the only place that can refuse on the operator's behalf.

Treat global propagation as the kill switch, not the default. Quicksilver propagating worldwide in seconds is the right tool for "turn this off everywhere" or "patch the active CVE". It is the wrong default for "deploy a new rule": the cost of a slow rollout (minutes per region) is the upper bound on the blast radius of a bug. Cloudflare's SOP changes announced after the outage include a staged rollout path for managed rules while retaining the emergency global lane. Two paths, two purposes, two SLAs.

Three Layers, One Budget 1 ■ Engine with runtime guarantees RE2 (Google, 2010) and Rust regex crate — refuse exponential patterns at compile time, linear in input 2 ■ Cap from outside the engine FreeBSD rctl(8) for CPU and memory; ulimit -t; one worker per request with a hard kill-after timeout 3 ■ Global propagation as kill switch, not default Two paths, two SLAs; control plane on a separate failure domain from the network it controls

The FreeBSD context is worth pausing on. The BSD camp has been arguing for runtime-bounded execution at the kernel boundary for decades: rctl for CPU and memory caps, jails for blast-radius containment, Capsicum for capability-level sandboxing, the Unix-traditional discipline of one job per process with a clean exit path. None of this needed to be invented in 2019. It needed to be applied at the point where the WAF was making decisions in the request hot path. The Cloudflare team has applied the equivalent answers since; the lesson, for the rest of the industry, is that the equivalent answers exist and have been quietly proven for a long time.

Where It Travels

Cloudflare runs the WAF, but the pattern is everywhere a regex meets user input:

  • Every Web Application Firewall. Cloudflare, AWS WAF, F5, ModSecurity, custom-built rules
  • Every intrusion detection or prevention system. Suricata, Zeek and snort all run rule sets containing regex against packet payloads or extracted strings
  • Every centralised logging stack. Splunk, Elastic and Datadog and similar systems regex inbound log lines for parsing and routing. A pathological message can melt the parsing tier
  • Every JSON or GraphQL schema validator that supports pattern keywords. A schema author can write a ReDoS into an API definition without realising it
  • Every text-classification or content-moderation pipeline that runs regex on user-submitted content as a first-pass filter
  • Every Python re application on a streaming source, where one bad input can stall the whole stream
  • Every dashboard that runs on the platform it monitors. This is not the regex lesson; it is its travelling companion. A status page, a kill-switch console or a deployment dashboard that depends on the same infrastructure it controls will be unreachable exactly when it is needed

Forty-five characters of expression, twenty-seven minutes of internet. The engine did exactly what the regex asked for. The architecture decided what exactly meant.