Vivian Voss

AI Code Generation: The Hallucination Tax

javascript performance security architecture

Performance-Fresser ■ Episode 20

"AI will write your code! 55 per cent faster! Ship in half the time!"

METR ran a randomised controlled trial. Sixteen experienced developers. 246 tasks. Mature codebases averaging one million lines of code. Not toy projects. Not LeetCode exercises. Real production systems with real dependencies and real architectural decisions. Result: developers using AI were 19 per cent slower. Not faster. Slower.

The developers themselves believed they were 20 per cent faster. They were not. One does admire the confidence.

The 2026 Follow-Up

In February 2026, METR published an update. The returning developers from the original study, same people, same codebases, now showed an estimated speedup of -18 per cent. The confidence interval: -38 per cent to +9 per cent. It crosses zero. The newly recruited developers: -4 per cent, with a confidence interval of -15 per cent to +9 per cent. Also crosses zero.

METR acknowledged significant selection bias in their experimental design. Their conclusion was not "AI now works." It was: the experimental design can no longer reliably measure the effect, so they are changing methodology. One does appreciate a research team that publishes its own limitations rather than its own press release.

The gap appears to be narrowing. But neither the 2025 nor the 2026 data supports the marketing claim of "55 per cent faster." The confidence intervals crossing zero means the effect could be positive, negative, or nonexistent. The data does not yet say what either side would like it to say. The cheerleaders cannot claim victory. The sceptics cannot claim vindication. The honest answer is: we do not know yet. One does find that the most inconvenient findings are usually the most accurate.

The Hallucination

19.6 per cent of AI-recommended packages do not exist. Nearly one in five imports point to packages that were never published, never registered, never conceived by a human. The study analysed 756,000 code samples across 16 models. This is not an edge case. It is the baseline. 43 per cent of those hallucinated packages reappear consistently across re-queries. The AI does not guess randomly. It hallucinates with conviction, and it hallucinates the same things repeatedly, which is rather the worst combination of attributes one could design if one were trying.

Attackers have noticed. The technique is called slopsquatting: register a package on npm or PyPI matching the name the AI consistently hallucinates. The next developer who accepts the AI's suggestion installs malware. The model's confidence becomes a supply chain attack vector. One does appreciate an industry that invents new vulnerability categories at this pace.
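One cheap defence is to refuse any import the project has not already declared. The sketch below is illustrative, not a real tool: `extractPackages`, the sample manifest, and the package name `express-date-utils` are all made up for the example. It flags bare import specifiers that appear in suggested code but not in `package.json`, before anyone runs `npm install` on a name the model may have invented.

```javascript
// Minimal sketch: flag imported package names that are not declared
// in package.json. All names here are illustrative assumptions.

function extractPackages(source) {
  // Match bare specifiers in require()/import statements; relative paths
  // (starting with "." or "/") are ignored.
  const pattern = /(?:require\(|from\s+)['"]([^'"./][^'"]*)['"]/g;
  const names = new Set();
  for (const match of source.matchAll(pattern)) {
    // Reduce deep or scoped imports to the package name,
    // e.g. "lodash/fp" -> "lodash", "@scope/pkg/sub" -> "@scope/pkg".
    const parts = match[1].split('/');
    names.add(match[1].startsWith('@') ? parts.slice(0, 2).join('/') : parts[0]);
  }
  return names;
}

function undeclaredPackages(source, manifest) {
  const declared = new Set([
    ...Object.keys(manifest.dependencies ?? {}),
    ...Object.keys(manifest.devDependencies ?? {}),
  ]);
  return [...extractPackages(source)].filter((name) => !declared.has(name));
}

// Example: an AI suggestion importing a package nobody ever declared.
const suggestion = `
import express from 'express';
import { parseDates } from 'express-date-utils';
`;
const manifest = { dependencies: { express: '^4.19.0' } };
console.log(undeclaredPackages(suggestion, manifest)); // [ 'express-date-utils' ]
```

A plausible-sounding name that is absent from the manifest is exactly the slopsquatting surface: the check costs nothing and fails loudly instead of installing silently.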

The Marketing vs The Measurement

  • -19% actual speed with AI (METR RCT); developers believed +20%
  • 19.6% of recommended packages do not exist (756K samples, 16 models)
  • 40% of Copilot output contains vulnerabilities (NYU, 1,692 programs)
  • 29% trust AI accuracy, down from 40% (Stack Overflow 2025, 65K respondents)
  • 45% say debugging AI-generated code takes longer
  • 61% say AI output "looks correct but is not reliable" (Qodo)

Sources: METR (2025), USENIX, NYU, Stack Overflow 2025.

40 per cent of GitHub Copilot's generated code contains security vulnerabilities. NYU tested 89 scenarios generating 1,692 programs. Stanford found that developers with AI access write significantly less secure code than those without, whilst being more confident that their code is secure. The study measured 47 developers across Python, JavaScript, and C. The less they questioned the AI, the more vulnerabilities they introduced. Rather inconvenient finding, that.

The Complexity Tax

Here is what the benchmarks reveal but the marketing omits: AI performs measurably worse on complex, abstracted code. Framework-specific conventions, proprietary APIs, deep dependency chains: these are the contexts where hallucination rates climb.

20.41 per cent of code hallucinations stem from incorrect API usage. The more framework-specific the API, the more the model confuses conventions, invents methods that do not exist, and mixes patterns from different versions. A model asked to write Express middleware will occasionally reference methods from Koa. A model writing React components will import hooks from Vue. The confidence is unwavering. The imports are imaginary. Higher Halstead complexity, larger vocabulary, deeper abstraction: all correlate directly with higher failure rates in LLM-generated code.
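The convention clash is easy to see side by side. A hedged sketch follows; neither framework is imported, and these are plain functions showing the two middleware shapes a model can blur together:

```javascript
// Illustrative shapes only -- no framework is imported here.

// Express convention: three arguments (req, res, next); you respond via `res`.
function expressLogger(req, res, next) {
  console.log(`${req.method} ${req.url}`);
  next(); // hand off to the next middleware
}

// Koa convention: two arguments (ctx, next); you respond via `ctx`,
// and next() returns a promise that should be awaited.
async function koaLogger(ctx, next) {
  console.log(`${ctx.method} ${ctx.url}`);
  await next();
}

// A hallucinated hybrid -- Koa's ctx inside an Express app -- parses fine:
//   app.use((ctx, next) => { ctx.body = 'ok'; next(); });
// Express would pass `req` in as `ctx`, the assignment lands silently on the
// request object, no response is ever sent, and the request hangs.
```

Nothing here is syntactically wrong, which is the point: the failure is conventional, not grammatical, so it survives the compile step and dies in production.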

More Complexity = More Hallucinations

  • JavaScript: 21.3% hallucinated imports
  • Python: 15.8% hallucinated imports
  • Vanilla code: less to hallucinate about
  • 3.2M npm packages offer more hallucination surface than 500K PyPI packages

Vanilla code in the language's standard library produces cleaner AI output. Not because the AI is smarter. Because there is less to hallucinate about. Fewer abstractions, fewer proprietary patterns, fewer opportunities for the model to confidently fabricate something that compiles but does not work. JavaScript illustrates this precisely: 21.3 per cent hallucinated imports versus 15.8 per cent in Python. More packages in the ecosystem means more hallucination surface. The complexity you built for humans to struggle with is now the complexity AI struggles with too.

The Model

Not all models are equal, and the tooling matters as much as the model behind it.

Copilot autocompletes lines. It predicts the next token based on your current file. An agentic model in a proper development environment reasons about architecture, reads your project structure, navigates across files, and understands context at a system level. The difference is not incremental. It is categorical. Comparing Copilot's autocomplete to an agentic coding assistant is like comparing spell-check to a copy editor. Both involve text. The similarity ends there.

Less than 44 per cent of AI-generated code was accepted in the METR study. The developers spent more time evaluating, adjusting, and discarding suggestions than they would have spent writing the code themselves. The tool that was meant to remove friction became the friction.

Choosing the right model for the right task is engineering. Using whatever ships with your IDE is hope.

The Code Quality Decline

GitClear analysed 211 million lines of code and measured the impact of AI adoption on code quality:

Code Quality After AI Adoption (GitClear, 211M lines)

  • Refactoring collapsed: from 25% of changed lines to under 10% (2021 to 2024)
  • Code cloning surged 48%: from 8.3% to 12.3%
  • Code churn doubled versus the 2021 baseline

Refactoring collapsed: from 25 per cent of changed lines in 2021 to under 10 per cent in 2024. Code cloning surged: copy-pasted lines rose from 8.3 per cent to 12.3 per cent, a 48 per cent increase. Code churn doubled: lines reverted or updated within two weeks doubled versus the 2021 baseline.

The AI generates code faster. The code gets replaced faster. The net effect on the codebase is not acceleration. It is accumulation of disposable code that nobody refactors because the AI will just generate more.

The Trust Erosion

The developer community is noticing. Stack Overflow's 2025 survey (65,000 respondents) tells the story:

  • 84 per cent use or plan to use AI tools (adoption is not the problem)
  • Only 29 per cent trust AI accuracy (down from 40 per cent the previous year)
  • Favourability dropped from 72 per cent to 60 per cent
  • 75 per cent still prefer asking a human over trusting AI output
  • 77 per cent say "vibe coding" is not part of their professional work

The industry adopted the tool before it understood the tool. Now the understanding is catching up, and confidence is dropping. Not because AI got worse. Because expectations met reality.

The Lever

The solution is not abandoning AI. It is reducing what the AI must navigate.

Write lean code: close to the language's own idioms, minimal abstractions, no framework magic. A function that does one thing, named clearly, understood where it is read. The AI generates better output for this code because there is less to hallucinate about. The human reviews it faster because there is less to misunderstand. This is not nostalgia. It is architecture for the AI era. The same principles that made code maintainable for humans now make it reviewable when generated by machines.
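What "lean" means in practice can be as small as this. The sketch below is one possible example, using nothing beyond the standard `Intl.DateTimeFormat` API: one function, one job, no date framework for a model to hallucinate methods onto.

```javascript
// Lean, vanilla date formatting: standard library only, nothing
// version-specific to confuse, nothing proprietary to fabricate.
function formatDate(date, locale = 'en-GB') {
  return new Intl.DateTimeFormat(locale, {
    year: 'numeric',
    month: 'long',
    day: 'numeric',
    timeZone: 'UTC', // pin the zone so the output is deterministic
  }).format(date);
}

console.log(formatDate(new Date(Date.UTC(2026, 1, 14)))); // "14 February 2026"
```

A reviewer can verify this in seconds, and a model completing code around it has exactly one convention to follow. That is the whole lever.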

61 per cent say AI produces code that "looks correct but is not reliable." The hallucination is not in the model. It is in the expectation that a tool can navigate complexity you have not mastered yourself.

Write lean. The AI will follow.

METR: 19% slower with AI (developers believed 20% faster). 19.6% of recommended packages do not exist. 40% of Copilot output contains vulnerabilities. Refactoring collapsed from 25% to under 10%. Trust in AI accuracy: 29%, down from 40%. The answer is not more AI. It is less complexity. Write lean. The AI will follow.