A hallucination is a factual error or inaccuracy in the output of an LLM, often involving a non-existent entity, object, relationship, or event.
An AI system is safe if it is free of hallucinations; that is, its users never see a hallucination, and nothing its users see depends logically on one.
...there is an inherent statistical lower bound on the rate at which pretrained language models hallucinate certain types of facts, having nothing to do with the transformer LM architecture or data quality. For “arbitrary” facts whose veracity cannot be determined from the training data, we show that hallucinations must occur at a certain rate for language models that satisfy a statistical calibration condition appropriate for generative language models. Specifically, if the maximum probability of any fact is bounded, we show that the probability of generating a hallucination is close to the fraction of facts that occur exactly once in the training data (a “Good-Turing” estimate), even assuming ideal training data without errors.
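To make the “Good-Turing” quantity concrete, here is a minimal sketch (not code from the paper) that computes the singleton rate of a toy fact corpus: the number of facts observed exactly once, divided by the total number of fact observations. The function name and toy data are illustrative; the paper's formal definition of this rate may differ in detail, but this is the classical Good-Turing form the excerpt invokes.

```python
from collections import Counter

def good_turing_singleton_rate(fact_observations):
    """Classic Good-Turing estimate of the unseen-fact mass:
    facts observed exactly once, divided by total observations."""
    counts = Counter(fact_observations)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(fact_observations)

# Toy corpus: each string stands for one statement of a fact in the training data.
corpus = [
    "fact_a", "fact_a", "fact_a",  # a well-attested fact
    "fact_b", "fact_b",
    "fact_c",                      # singletons: facts stated exactly once
    "fact_d",
]
print(good_turing_singleton_rate(corpus))  # 2 singletons / 7 observations ≈ 0.29
```

On this toy corpus the estimate is 2/7 ≈ 0.29: two of the seven fact occurrences are singletons, so under the calibration condition described above, roughly that fraction of generated “arbitrary” facts would be expected to be hallucinations.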