In 2013, a German data scientist named David Kriesel discovered something rather unsettling about Xerox WorkCentre scanners: they were silently changing numbers in scanned documents. A 6 became an 8. No error message. No warning. No red flag of any kind. Just a quiet substitution, filed away and forwarded to the next person in the chain.
The cause was JBIG2 compression, which uses pattern matching to reduce file size. When two characters look sufficiently similar, the algorithm treats them as identical. The 6 looked enough like an 8 (to the algorithm, not to any human) and so the 8 it became. Years of invoices, construction plans, medical records, and tax documents were corrupted before anyone noticed. Because, of course, the output looked perfectly normal.
That is the distinguishing feature of a shallow error: it passes inspection.
The Fluent Mistranslation
Fast-forward to the present. DeepL translates "composability" as "Kombinierbarkeit", combinability. Close. Plausible. The sort of thing a reviewer who is not a domain specialist would wave through without a second glance. And so it propagates: through technical documentation, review cycles, international teams. Everyone reads fluently. Nobody notices the semantic drift.
"Composability" is a precise term. It means the ability of components to be assembled into larger structures whilst retaining their individual interfaces, a property central to Unix philosophy, functional programming, and modular system design. "Combinability" merely means things can be put together. A bag of Lego is combinable. Unix pipes are composable. The distinction matters, but only to people who know what they are looking for. Everyone else reads on, satisfied.
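The distinction can be made concrete in a few lines of code: composable pieces retain their individual interfaces, so any pipeline built from them has that same interface again. A minimal sketch, with invented function names, assuming plain string-to-string stages:

```python
def strip_ws(s: str) -> str:
    return s.strip()

def lower(s: str) -> str:
    return s.lower()

def compose(*fns):
    """Chain str -> str functions left to right. Because every stage
    keeps the same interface, the composed result is str -> str again."""
    def piped(x):
        for f in fns:
            x = f(x)
        return x
    return piped

normalize = compose(strip_ws, lower)   # composable: interfaces retained
print(normalize("  Hello, World  "))   # prints "hello, world"
```

This is the Unix-pipe property in miniature: `normalize` can itself be fed into a further `compose` without anyone needing to know what it is made of. Merely combinable parts offer no such guarantee.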
We laugh at badly translated Chinese instruction manuals. "Do not the cat ironing." Splendid. But those are the safe errors, so obviously wrong that they invite scepticism by default. Nobody reads "Please to enjoy the happy washing experience" and trusts the electrical safety instructions that follow. The absurdity is its own warning label.
The fluent mistranslation is the dangerous one. It passes review precisely because it reads well.
The Error Spectrum
Not all errors are created equal, and the ones we worry about least are frequently the ones that cause the most damage.
On the left: errors that announce themselves. A syntax error halts the build. A 500 status code triggers the on-call rota. A garbled translation provokes laughter, then correction. These are loud failures. They cost time, never trust.
On the right: errors that arrive dressed for the occasion. A number that has been quietly swapped. A term that has been subtly shifted. A paragraph that reads beautifully and says something almost, but not quite, correct. These are shallow errors. They cost trust, eventually. By the time anyone notices, the damage has compounded.
Coherence Engines
Simon Wardley's observation about large language models is worth stating plainly: they are coherence engines, not truth engines. Their function is to produce output that reads as plausible. Not output that is correct. Output that sounds correct. The distinction is the entire problem.
A coherence engine and a pattern-matching compressor have more in common than one might suppose. JBIG2 looked at a 7, compared it statistically to an 8, and concluded they were close enough. An LLM looks at "composability," compares it statistically to its nearest neighbours in the target language, and concludes "Kombinierbarkeit" is close enough. Both systems optimise for plausibility. Neither system has a mechanism for semantic verification. The error surface is identical: the output looks right, nobody checks, and the corruption propagates.
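The shared failure mode can be sketched in a few lines. This is a toy illustration, not the actual JBIG2 algorithm: the bitmaps, the similarity measure, and the threshold are all invented, but the structure (substitute the closest stored pattern, never check meaning) is the point.

```python
# A 3x5 bitmap of an "8" already stored in the pattern dictionary,
# and a freshly scanned "6" that differs from it in a single cell.
STORED = {"8": (1,1,1,
                1,0,1,
                1,1,1,
                1,0,1,
                1,1,1)}
SCANNED_6 = (1,1,1,
             1,0,0,   # the only cell that differs from the stored "8"
             1,1,1,
             1,0,1,
             1,1,1)

def similarity(a, b):
    """Fraction of cells on which two bitmaps agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def compress(scanned_glyphs, dictionary, threshold=0.9):
    """Replace each scanned glyph with the first stored pattern that is
    'close enough'. A near-match silently wins; nothing checks meaning."""
    out = []
    for bitmap in scanned_glyphs:
        for label, pattern in dictionary.items():
            if similarity(bitmap, pattern) >= threshold:
                out.append(label)      # quiet substitution, no warning
                break
        else:
            out.append("?")            # no match: at least this is loud
    return "".join(out)

print(compress([SCANNED_6], STORED))   # prints "8": the 6 is filed as an 8
```

The scanned 6 agrees with the stored 8 on 14 of 15 cells, about 93 per cent, which clears the 90 per cent threshold. The substitution happens, the output looks like a digit, and nothing downstream has any reason to object.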
The Missing Assertion
Test-driven development introduced a discipline that software engineering now takes for granted: the assertion. An explicit expectation, stated in code, that fails loudly when reality diverges from intent. The entire value of a test suite lies not in the tests that pass, but in the tests that don't, because a loud failure is a caught failure.
Translation has no such mechanism. Neither does most AI-generated content. There is no assertion layer between the output and the consumer. No expect(translation).toPreserveSemantics(). The reviewer is the test suite, and the reviewer (being human, being busy, being presented with fluent prose) is precisely the sort of test suite that waves green on plausibility alone.
TDD gave software engineering its rigour through one simple insight: make the failure visible. The shallow error is dangerous precisely because nothing in the pipeline is designed to surface it.
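No such assertion layer exists for translation, but one can imagine what a crude version would look like: a glossary of protected terms whose required renderings are checked loudly. Everything below is hypothetical (the glossary entry, the exception, the check itself), and "Komponierbarkeit" as the preferred rendering is itself an assumption a real glossary would have to settle:

```python
GLOSSARY = {
    # source term -> required rendering (the choice of "Komponierbarkeit"
    # is an assumption here; a real glossary would have to settle it)
    "composability": "Komponierbarkeit",
}

class SemanticsError(AssertionError):
    """Raised when a protected term loses its required translation."""

def assert_preserves_terms(source: str, translation: str, glossary=GLOSSARY):
    """Fail loudly when a protected source term appears but its required
    rendering does not. Crude, but it turns a shallow error into a loud one."""
    for term, required in glossary.items():
        if term in source.lower() and required.lower() not in translation.lower():
            raise SemanticsError(f"'{term}' must be rendered as '{required}'")

# A fluent but drifted translation now fails instead of passing review:
try:
    assert_preserves_terms(
        "Composability is central to Unix philosophy.",
        "Kombinierbarkeit ist zentral fuer die Unix-Philosophie.",
    )
except SemanticsError as exc:
    print("loud failure:", exc)
```

A glossary check catches only the drift someone anticipated, which is exactly the limitation of any test suite. The value is not coverage; it is that the failure, when it occurs, is visible.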
The Uncomfortable Implication
We have built an industry around loud errors. Linters, type checkers, continuous integration, monitoring dashboards, alerting systems, all of them optimised for failures that announce themselves. We are exceptionally good at catching the errors that were never going to cause lasting damage in the first place.
The shallow errors, the ones that read well, that pass review, that propagate through documentation and decision-making and downstream systems, have no tooling. No pipeline. No dashboard. They survive precisely because they do not trigger anything.
The dangerous errors are not the ones that fail loudly. They are the ones that read well.
A badly translated manual is a joke. A fluently mistranslated specification is a lawsuit. The distance between the two is not quality. It is visibility.