Why AI Hallucinations Are Not a Bug to Be Patched

A new paper from OpenAI and Georgia Tech proves what regulated industries need to hear: hallucinations are mathematically inevitable, not an engineering failure waiting to be fixed. Here's what that means for how you build.

A new research paper from OpenAI and Georgia Tech deserves attention from anyone deploying AI in a context where accuracy matters. If you work in banking, healthcare, or the public sector, that means you.

The paper, “Why Language Models Hallucinate” by Kalai, Nachum, Vempala, and Zhang, makes a deceptively simple argument: language models hallucinate because the entire training and evaluation pipeline rewards guessing over admitting uncertainty.

Hallucinations are not a mysterious emergent behaviour. They are a predictable, mathematically demonstrable consequence of how these systems are built and measured.

For those of us building AI-powered solutions for regulated industries, this is not just an academic curiosity. It fundamentally shapes how we should think about deploying these systems responsibly.

The Core Argument

The paper establishes a formal connection between language model generation and binary classification. The key insight is elegant: even if the training data were entirely error-free, the statistical objectives optimised during pre-training would still produce errors. The authors demonstrate that generative error rates are bounded below by roughly twice the misclassification rate of a simpler yes-or-no task - essentially, the model’s ability to distinguish valid outputs from invalid ones.
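Stated symbolically, and only as a rough paraphrase of the formal result (the paper’s precise theorem carries additional terms and technical conditions), the relationship is:

\[ \mathrm{err}_{\mathrm{generation}} \;\gtrsim\; 2 \times \mathrm{err}_{\mathrm{classification}} \]

where the classification error is the model’s error rate on the yes-or-no question “is this candidate output valid?”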

In plain terms: if a model cannot reliably tell the difference between a true statement and a plausible falsehood in a given domain, it will inevitably generate plausible falsehoods in that domain. This is not a failure of engineering. It is a mathematical certainty.

The paper then goes further, arguing that post-training (the reinforcement learning and fine-tuning phase that is supposed to make models more helpful and accurate) actually reinforces the problem. Why? Because the benchmarks and evaluations used to measure model quality systematically penalise uncertainty. A model that says “I don’t know” scores worse on most leaderboards than one that guesses confidently and gets it right half the time. The incentive structure rewards confident guessing, which is precisely the behaviour we call hallucination.
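A toy calculation makes the incentive concrete. The numbers below are illustrative, not taken from the paper: under a plain accuracy metric, a model that guesses whenever it is unsure always outscores one that abstains, no matter how low its confidence.

```python
# Illustrative only: expected benchmark score under plain 0/1 accuracy grading,
# where an abstention ("I don't know") scores 0, a correct guess scores 1,
# and a wrong guess also scores 0.

def expected_score_guess(p_correct: float) -> float:
    """Expected score if the model always guesses, with probability p_correct of being right."""
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0

def expected_score_abstain() -> float:
    """Expected score if the model says 'I don't know'."""
    return 0.0

for p in (0.5, 0.2, 0.05):
    print(f"confidence {p:.0%}: guess scores {expected_score_guess(p):.2f}, "
          f"abstain scores {expected_score_abstain():.2f}")

# Even at 5% confidence, guessing beats abstaining under this grading scheme,
# so training and evaluation pressure pushes models toward confident guessing.
```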

Why this matters for regulated industries

The implications for anyone deploying AI in a regulated context are significant.

First, it confirms what many of us have suspected: hallucinations cannot be eliminated through scale alone. Bigger models, more training data, and more compute will reduce the rate of hallucination, but they will not eliminate it. The paper demonstrates that certain categories of error - what the authors call “arbitrary-fact hallucinations” - arise from information that simply cannot be distinguished from noise in the training data, regardless of model size.

Second, it provides a rigorous basis for the architectural position we have long advocated at Marino: that probabilistic AI systems should not be making unsupervised decisions in high-stakes environments. When the EU AI Act requires human oversight for high-risk AI systems, it is not bureaucratic caution. It is a rational response to a well-characterised technical limitation.

Third, the paper’s analysis applies broadly. It is not specific to any particular model architecture, training approach, or provider. It applies to Transformer-based models, to reasoning systems, and to retrieval-augmented generation. The fundamental dynamic is structural: generation is harder than classification, and evaluation incentives reward overconfidence.

Evaluation problem = governance problem

One of the most provocative arguments in the paper is that the hallucination problem is partly socio-technical. The authors argue that modifying how existing benchmarks score uncertain responses would do more to reduce hallucination than creating new hallucination-specific evaluations. If models were rewarded for saying “I’m not confident” rather than penalised for it, the training incentives would shift accordingly.
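One way to implement that shift, sketched here as an illustration rather than as the paper’s exact proposal, is to keep abstentions at zero but penalise wrong answers, so that guessing only pays when the model’s confidence clears a stated threshold.

```python
# Illustrative scoring rule: abstaining scores 0, a correct answer scores 1,
# and a wrong answer is penalised by t / (1 - t). Under this rule, answering has
# positive expected value only when the model's confidence exceeds the threshold t.

def expected_score(p_correct: float, t: float, abstain: bool) -> float:
    """Expected score for answering (or abstaining) under a confidence-target scoring rule."""
    if abstain:
        return 0.0
    penalty = t / (1.0 - t)
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

t = 0.75  # the evaluation states: "only answer if you are more than 75% confident"
for p in (0.9, 0.75, 0.5):
    answer = expected_score(p, t, abstain=False)
    print(f"confidence {p:.0%}: answer {answer:+.2f} vs abstain +0.00 "
          f"-> {'answer' if answer > 0 else 'abstain'}")

# The rational policy is now to abstain below the threshold, which is exactly
# the behaviour that plain accuracy metrics punish.
```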

This resonates strongly with the challenges we see in regulated industries. When an AI system is deployed in a clinical or financial context, the evaluation criteria should reflect the real-world cost of errors. A credit scoring model that confidently produces an incorrect assessment is far more dangerous than one that flags uncertainty and defers to a human reviewer. A clinical decision support tool that hedges appropriately is more valuable than one that always produces a definitive answer, some of which will inevitably be wrong.
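That trade-off can be made explicit in evaluation. The cost figures below are placeholders we chose for illustration; in practice they would come from your own risk assessment.

```python
# Illustrative cost-weighted evaluation: a confidently wrong answer is treated as far
# more expensive than deferring to a human reviewer. All cost figures are placeholders.

COSTS = {
    "correct": 0.0,            # no intervention needed
    "abstain": 1.0,            # cost of routing the case to a human reviewer
    "confident_wrong": 50.0,   # cost of an unchecked wrong decision reaching a customer
}

def average_cost(outcomes: dict[str, int]) -> float:
    """Average per-case cost, given counts of each outcome type."""
    total_cases = sum(outcomes.values())
    return sum(COSTS[kind] * count for kind, count in outcomes.items()) / total_cases

# System A answers everything; System B defers when unsure.
system_a = {"correct": 90, "abstain": 0, "confident_wrong": 10}
system_b = {"correct": 85, "abstain": 14, "confident_wrong": 1}

print(f"System A (always answers):   {average_cost(system_a):.2f} per case")
print(f"System B (defers when unsure): {average_cost(system_b):.2f} per case")

# On plain accuracy, A looks better (90% vs 85%); on cost-weighted terms, B wins clearly.
```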

For organisations building their own AI evaluation frameworks (as the EU AI Act’s conformity assessment requirements will increasingly demand), this paper provides a strong theoretical basis for including uncertainty handling as a first-class evaluation criterion.

Self-hosting and the control advantage

The hallucination problem also strengthens the case for self-hosted, open-weight models in regulated environments. When you control the model, you control the evaluation. You can design your testing regime to penalise overconfidence rather than reward it. You can fine-tune for appropriate uncertainty expression in your specific domain. You can build guardrails that intercept low-confidence outputs before they reach end users.
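As a minimal sketch of that last idea: the estimator below uses the geometric-mean token probability as a deliberately crude confidence proxy (a verifier model or self-consistency checks are alternatives), and everything except the routing logic is an assumption to adapt to your own stack.

```python
# Minimal guardrail sketch: withhold low-confidence model outputs from end users and
# route them to human review instead. The confidence estimator here is a crude
# placeholder; substitute whatever estimator your own evaluation shows to be reliable.

import math
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # tuned per domain and per risk assessment

@dataclass
class GuardrailResult:
    answer: str | None
    needs_human_review: bool
    confidence: float

def confidence_from_logprobs(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability: a simple, imperfect confidence proxy."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def guarded_answer(raw_answer: str, token_logprobs: list[float]) -> GuardrailResult:
    """Release the model's answer only if its estimated confidence clears the threshold."""
    score = confidence_from_logprobs(token_logprobs)
    if score < CONFIDENCE_THRESHOLD:
        # Low confidence: do not show the answer; escalate to a human reviewer.
        return GuardrailResult(answer=None, needs_human_review=True, confidence=score)
    return GuardrailResult(answer=raw_answer, needs_human_review=False, confidence=score)

# Example: middling token probabilities (mean log-prob -1.0, roughly 0.37) get escalated.
print(guarded_answer("The limit is €10,000.", [-0.9, -1.2, -0.4, -1.5]))
```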

The open-weight ecosystem has matured to the point where this is practical. Models like Mistral, Llama, Qwen, and Kimi K2.5 are capable enough for production use, and the hardware required to run them has become genuinely accessible. A single high-end GPU server can handle inference for most enterprise workloads, and EU-hosted infrastructure options mean you can maintain full data sovereignty while doing so.
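To make the control point concrete, here is a rough sketch of the client side, assuming you are serving an open-weight model behind an OpenAI-compatible endpoint (vLLM is one common way to do this; the URL, model name, and log-probability support are assumptions to check against your own deployment). Because you own the endpoint, you can pull per-token log-probabilities out of every response and wire them into the guardrail above.

```python
# Sketch of querying a self-hosted, OpenAI-compatible endpoint (for example one served
# by vLLM). The model name, URL, and port are placeholders for your own deployment, and
# whether log-probabilities are returned depends on the server version you run.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # placeholder open-weight model
    messages=[{"role": "user", "content": "Summarise the customer's complaint."}],
    logprobs=True,  # request per-token log-probabilities alongside the text
)

choice = response.choices[0]
print(choice.message.content)

# Feed these into the guardrail sketched in the previous section.
token_logprobs = [t.logprob for t in choice.logprobs.content]
```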

This is not about avoiding the cloud. It is about having the architectural freedom to implement the kind of nuanced evaluation and guardrailing that the hallucination problem demands, and that regulators will increasingly expect.

What we take from this

At Marino, we have always maintained that AI should be the assistant, not the pilot - particularly in regulated environments. This paper provides the mathematical underpinning for that position.

The practical takeaways for our clients are clear. Do not assume hallucinations will be solved by the next model release. Build your systems with the expectation that probabilistic AI will sometimes be confidently wrong, and design your human oversight accordingly. Evaluate your AI systems not just on accuracy but on how they handle uncertainty. And consider whether the architectural choices you are making (including whether to self-host) give you the control you need to manage this inherent limitation responsibly.

Hallucinations are not going away. But with the right architecture, governance, and evaluation frameworks, they can be managed. That is the work we are doing every day.

If you are thinking about how to deploy AI responsibly in a regulated context, get in touch at marinosoftware.com/get-in-touch.
