Vivian Voss

The Unit That Crossed a Boundary

architecture freebsd tooling devops

Tales from the Bare Metal ☞ Episode 04

« Thou shalt not measure science in imperial units! »

23 September 1999. The Mars Climate Orbiter fires its main engine to enter orbit around Mars, passes behind the planet as planned, and is never heard from again. The spacecraft cost 193 million dollars to build, part of a 327.6 million dollar mission. It travelled 670 million kilometres across nine months of deep space, and it was lost to a number with no unit written on it.

This is the cleanest example in the engineering record of why a quantity is not the same thing as a number.

The Incident

The Mars Climate Orbiter launched on 11 December 1998. Its job was to enter a stable orbit around Mars and study the planet's atmosphere, also serving as a communications relay for the Mars Polar Lander that would follow.

The mission proceeded normally for nine months. On 23 September 1999, the orbiter executed its Mars Orbit Insertion burn: a planned firing of the main engine to slow the craft enough for Mars' gravity to capture it. The burn was designed to place the orbiter at a closest approach of 226 km above the Martian surface, comfortably above the atmosphere.

The craft passed behind Mars, as expected, and signal was lost, as expected. It was never reacquired. Post-failure reconstruction showed that the trajectory had brought the orbiter to approximately 57 km above the surface, deep within the atmosphere. A spacecraft built for the vacuum of orbit does not survive atmospheric entry at orbital speed. It either burned up or was torn apart; either way, it was gone.

Two Trajectories, 169 km Apart planned: 226 km safely above the atmosphere — stable orbit actual: 57 km deep in the atmosphere — craft does not survive

The Mars Climate Orbiter Mishap Investigation Board released its Phase I report on 10 November 1999. The root cause was stated plainly, and it was not a hardware fault, not a navigation error in the usual sense, not a launch problem. It was a unit mismatch at a software interface.

The Diagnosis

The trajectory of a spacecraft is adjusted over its journey by small thruster firings. Each firing produces an impulse, and the magnitude of that impulse must be fed into the navigation software so the predicted trajectory stays accurate.

Two organisations wrote the two halves of this loop. Lockheed Martin, in Colorado, built the spacecraft and the ground software that calculated the impulse from each thruster firing. NASA's Jet Propulsion Laboratory, in California, ran the navigation software that consumed those impulse figures and computed the resulting trajectory.

Lockheed Martin's software produced the impulse in pound-force seconds. This is an imperial unit: a pound-force is the force exerted by gravity on a one-pound mass, and a pound-force second is that force applied for one second. JPL's navigation software expected the impulse in newton-seconds. This is the metric (SI) unit: a newton is the force that accelerates one kilogram at one metre per second squared, and a newton-second is that force applied for one second.

One pound-force second equals 4.45 newton-seconds. The two numbers describe the same physical impulse, but the number representing it differs by a factor of 4.45 depending on which unit you mean. The interface specification required metric. Lockheed Martin's software, for the specific file in question, produced imperial. JPL's software read the imperial numbers as though they were metric, and so every impulse was interpreted as 4.45 times smaller than it actually was. The corrections were systematically wrong, in the same direction, for nine months, until the predicted 226 km insertion altitude was, in reality, 57 km.

The Number Crossed. The Unit Did Not. Lockheed Martin ground software (Colorado) output: 1.00 lbf·s interface (spec: metric) 1.00 no unit attached NASA JPL trajectory software (California) read as: 1.00 N·s 1 lbf·s = 4.45 N·s every impulse read 4.45× too small, in the same direction, for nine months

The single most important sentence in the whole account is this: the number was correct. The software did not miscalculate. The bits that crossed the interface were the right bits. What was missing was the label that said what those bits meant, and each side filled in the missing label with its own assumption.

The Context

A unit mismatch sounds like the kind of thing that should be caught in an afternoon. The interesting question is not how the mistake was made, but how it survived nine months and 670 million kilometres without being caught. Three conditions allowed it.

Three Conditions, One Lost Mission 1 ■ The spec was a document, not a check the interface spec required metric; nothing enforced it at runtime. A document assigns blame after the fact 2 ■ The warning was below the escalation threshold cruise navigators saw the drift months early; raised informally, never escalated; each correction looked within tolerance 3 ■ The test coverage stopped at the boundary each system correct in isolation; no end-to-end test crossed the handoff, which is exactly where the fault lived

First, the interface specification existed and was clear: it required metric. The imperial output was a deviation from spec. But nothing enforced the specification at runtime. The specification was a document, not a check. A document does not stop a wrong number; it only assigns blame after the wrong number has done its work.

Second, the discrepancy was visible before the loss. Navigators at JPL noticed during the cruise that the trajectory was not behaving quite as the models predicted; small corrections were needed more often than expected. The signal was there in the data, months before arrival. It was discussed informally but never escalated into a formal anomaly investigation, partly because each individual correction was within tolerance and the cumulative drift looked like ordinary navigation noise until it was too late.

Third, there was no end-to-end test. No test ran the Lockheed Martin impulse calculation and the JPL trajectory calculation against the same manoeuvre and compared the result to an independent reference. Each system was tested in isolation and behaved correctly in isolation. The fault lived only in the handoff between them, which is precisely the region that isolated unit tests do not cover.

These three conditions recur in almost every interface failure. The spec is a document not a check; the warning signal is visible but below the escalation threshold; the test coverage stops at the boundary rather than crossing it.

The Principle

Science settled the question of units a century ago. The Système International (SI) is the one coherent measurement system the entire scientific and engineering world shares, precisely so that a number computed by one team means the same thing to another. The Mars Climate Orbiter was lost because one half of the programme had not honoured that settlement: it still computed in pound-force seconds, an imperial unit with no place in serious engineering work.

The lesson is direct. In scientific and engineering contexts, measure in metric. There is no defence for imperial units in work where a factor of 4.45 decides whether a spacecraft enters orbit or enters the ground. Every serious laboratory, the United States included, standardised on SI for exactly this reason, and the orbiter is the canonical demonstration of what the alternative costs. Imperial units are a regional convention for everyday life; they are not a tool for computing trajectories, and the moment a programme treats them as one, it has built a 4.45 into its maths and dared the universe to find it.

There is a second guard worth stating, because metric alone is not sufficient. Two systems both using metric can still fail if one means seconds and the other milliseconds, if one means metres and the other kilometres, if one means bytes and the other kibibytes. So the unit must also travel with the number, enforced where the two systems meet.

Make the Unit Travel With the Number 1 ■ Types the compiler checks Rust newtypes: NewtonSeconds(f64) ≠ PoundForceSeconds(f64) • F# units of measure: float<N*s> ≠ float<lbf*s> 2 ■ Fields the parser demands the unit is a mandatory field, not an optional comment; the bare number becomes unrepresentable 3 ■ Names the variable carries never timeout = 30; write timeout_ms = 30 or Duration::from_secs(30). The unit lives where the reader cannot miss it

The first rule needs no tooling: in science, measure in metric. The second rule catches what the first cannot: name the unit at every boundary regardless. The orbiter needed both, and had neither.

On FreeBSD this discipline is visible in the small, ordinary places. dd bs=1M states the block size with its unit; the bare number would be ambiguous. sysctl values are documented with their units, and the tunables in loader.conf name their dimensions. The convention across the base system is that a number with a physical meaning is rarely written naked. This is not glamorous and it has prevented an enormous number of quiet disasters. The boundary is where the unit must be loudest, because the boundary is exactly where two assumptions meet and discover they disagree.

Where It Travels

The Mars Climate Orbiter is a spacecraft, which makes the failure feel exotic. It is not exotic. The same failure ships in ordinary software every day:

  • Every API that passes a duration as a bare integer. Is 30 seconds or milliseconds? The function name setTimeout(30) does not say, and the two readings differ by a factor of a thousand
  • Every configuration value that takes a size without a suffix. Is cache_size = 1000 bytes, kilobytes, or entries?
  • Every function signature with a parameter named distance, interval, rate or size and no unit anywhere in the type or the name
  • Every CSV handed from one team's export to another team's import, where column 7 is "amount" and nobody agreed whether it is gross or net, dollars or cents
  • Every number that means one thing on the left of an interface and another thing on the right, because the interface carried the number faithfully and the meaning not at all

The Same Mistake, Smaller: Gigabyte and Gibibyte

The unit mismatch does not even need two systems as different as imperial and metric. It hides inside a single system, in the gap between decimal and binary.

A gigabyte, by the SI definition, is 1,000,000,000 bytes; "giga" is the standard decimal prefix for a billion, the same prefix as in gigahertz or gigawatt. A gibibyte, defined by the IEC in 1998, is 1,073,741,824 bytes: two to the power of thirty, the nearest binary round number. The two differ by about 7.4%, and the gap widens with every step up the ladder: terabyte versus tebibyte is about 10%, petabyte versus pebibyte about 12.6%.

The Gap Widens Up the Ladder kB vs KiB +2.4% MB vs MiB +4.9% GB vs GiB +7.4% TB vs TiB +10% PB vs PiB +12.6% A "1 TB" decimal drive (1,000,000,000,000 bytes) displays as ~931 GiB. Prefer KiB, MiB, GiB, TiB when you mean powers of two. The IEC defined them in 1998.

The disk industry sells in decimal. A "1 TB" drive holds 1,000,000,000,000 bytes, exactly as labelled. Many operating systems, Windows most stubbornly, then display that capacity in binary while still calling the result "GB". The drive shows up as roughly 931 "GB", the customer concludes they have been short-changed by seven per cent, and a support ticket is born. Nobody has lied; two definitions of the same word have simply met at a boundary without introducing themselves.

It is the Mars Climate Orbiter with the stakes turned down from a spacecraft to a slightly disappointing disk. The IEC defined a complete binary ladder in 1998, parallel to the decimal one at every rung: kibibyte (KiB) beside kilobyte (kB), mebibyte (MiB) beside megabyte (MB), gibibyte (GiB) beside gigabyte (GB), and on through tebibyte (TiB), pebibyte (PiB) and beyond. The recommendation is the same discipline as everywhere else in this story. If you mean the binary value, write GiB, MiB, TiB. If you mean the decimal value, write GB, MB, TB. Never write one while meaning the other. The standard has been available for over a quarter of a century; the only thing still keeping GB where GiB belongs is habit, and habit is precisely what put pound-force seconds where newton-seconds were expected.

A 193-million-dollar spacecraft was lost because one team did its sums in imperial units. The number was right. The unit was from the wrong century. In science and engineering, measure in metric; and where two systems must still meet, name the unit of every number that crosses between them, so that the compiler or the variable name holds what no shared assumption ever safely will.