LLM Behavior · Finance

A Taxonomy of LLM Hallucinations in Financial Analysis

December 27, 2025 · 10 min read

"Hallucination" is used as a catch-all term for LLM errors. But not all hallucinations are equal. Some are obvious and harmless. Others are subtle and can cost you money.

After reviewing LLM outputs from our financial analyses, we've identified five distinct levels. The levels are ordered by how much domain expertise you need to catch them.

The Five Levels

L1

Fabricated Citations

Low danger · Easy to detect
Example

"According to the company's Q3 2025 10-Q filed on October 15, 2025..."

Defense

Check if the filing exists on SEC EDGAR. Takes 30 seconds. If the date or document doesn't exist, the quote is fabricated.

Level 1 hallucinations are the easiest to catch. The model invents a source that doesn't exist. You can verify by checking whether the document is real.

We see this most often when models try to be "helpful" by providing specific citations they weren't trained on. Perplexity has an advantage here — it actually searches and cites, rather than fabricating.
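If you want to make that 30-second check programmatic, here's a minimal sketch against SEC EDGAR's public submissions API (the CIK shown is just an example; EDGAR asks automated clients to identify themselves via the User-Agent header):

```python
import requests

def filing_exists(cik: str, form_type: str, filing_date: str) -> bool:
    """Return True if EDGAR lists a filing of this type on this date.

    cik: zero-padded 10-digit CIK, e.g. "0000320193".
    form_type: e.g. "10-Q".
    filing_date: "YYYY-MM-DD", the date the model claims.
    """
    url = f"https://data.sec.gov/submissions/CIK{cik}.json"
    # EDGAR asks automated clients to identify themselves in the User-Agent.
    headers = {"User-Agent": "research-script contact@example.com"}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    recent = resp.json()["filings"]["recent"]
    # 'form' and 'filingDate' are parallel lists of the company's recent filings.
    return any(
        form == form_type and date == filing_date
        for form, date in zip(recent["form"], recent["filingDate"])
    )

# "Q3 2025 10-Q filed on October 15, 2025" -> check it before trusting it:
# filing_exists("0000320193", "10-Q", "2025-10-15")
```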

L2

Wrong Numbers

Medium danger · Medium difficulty to detect
Example

"Revenue was $4.2 billion in Q3, up 12% year-over-year."

Defense

Cross-check against two sources (aggregator + primary filing). If the number appears in only one source, be suspicious.

Level 2 is wrong numbers. The document exists, but the model misreads or invents the figures. This is harder to catch because you need to actually read the source.

Common failure modes:

  • Confusing revenue with contribution ex-TAC (we saw this with Criteo)
  • Using TTM numbers when quarterly was asked for
  • Mixing up fiscal years with calendar years
  • Getting the sign wrong on growth rates
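
The cross-check from the Defense box is mechanical enough to script. A minimal sketch, with illustrative numbers and a tolerance chosen arbitrarily to absorb rounding differences between sources:

```python
def reconcile(metric: str, aggregator_value: float, filing_value: float,
              tolerance: float = 0.01) -> bool:
    """Flag a metric when two independent sources disagree.

    aggregator_value: the number from a data aggregator.
    filing_value: the number read directly from the primary filing.
    tolerance: allowed relative difference (1% here, to absorb rounding).
    """
    if filing_value == 0:
        return aggregator_value == 0
    diff = abs(aggregator_value - filing_value) / abs(filing_value)
    if diff > tolerance:
        print(f"FLAG: {metric} differs by {diff:.1%} across sources")
        return False
    return True

# Illustrative values only: a model claims Q3 revenue of $4.2B, the filing says $4.02B.
reconcile("Q3 revenue", 4.2e9, 4.02e9)   # flags a ~4.5% discrepancy
```
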
L3

Omitted Context

High danger · Hard to detect
Example

"The company has $800M in cash, providing a strong balance sheet."

Defense

Check the full balance sheet. Cash without liabilities is meaningless. Look for 'client funds,' lease liabilities, pension obligations.

Level 3 is where things get dangerous. The number is correct, but crucial context is missing. The statement is technically true but misleading.

Real Example: PayPal
An LLM reported PayPal had "$15 billion in cash." True — but that cash includes customer funds held on behalf of users. Net of liabilities, the usable cash was closer to $3 billion. The headline number overstated financial flexibility by 5x.

Level 3 hallucinations require you to know what context is missing. If you don't already understand payment processor accounting, you won't catch this.
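
For the cash example, the adjustment is simple once you know which line items to net out. A sketch with round, illustrative numbers in the spirit of the PayPal case (the real figures come from the balance sheet, and the relevant categories vary by business):

```python
def usable_cash(cash_and_equivalents: float,
                client_funds: float,
                near_term_obligations: float) -> float:
    """Cash the company can actually deploy, net of money it merely holds
    for customers or already owes. All inputs in the same currency unit."""
    return cash_and_equivalents - client_funds - near_term_obligations

# Illustrative only, loosely in the spirit of the PayPal example:
headline = 15e9            # "cash" as reported by the model
held_for_customers = 11e9  # funds held on behalf of users (not the company's)
obligations = 1e9          # e.g. leases, near-term debt
print(usable_cash(headline, held_for_customers, obligations))  # 3e9, 5x below the headline
```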

L4

Plausible Reasoning Errors

High danger · Hard to detect
Example

"With 22% FCF yield and aggressive buybacks, shareholder returns are excellent."

Defense

Run the full math. Check if buybacks exceed SBC dilution. 22% gross yield means nothing if 15% is lost to stock-based compensation.

Level 4 is faulty reasoning that sounds correct. The model cites real numbers and makes an inference that seems logical but is actually wrong.

We built the "Treadmill Test" specifically to catch this. Many companies with high buyback yields are actually treading water or going backwards when you account for dilution.

Why This Is Hard to Catch
Level 4 requires you to run the math yourself. The model's arithmetic might even be correct — the error is in the logic of what to include or exclude. You need to know the right framework, not just verify the numbers.
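
To make that concrete, here's the core arithmetic as a sketch. The function name and dollar figures are illustrative, not our full internal checklist, and we're reading the example's "15% lost to SBC" as percentage points of yield:

```python
def treadmill_test(buybacks: float, sbc: float, market_cap: float) -> float:
    """Net buyback yield: repurchases minus SBC dilution, as a fraction of market cap.

    Positive  -> the share count is genuinely shrinking.
    Near zero -> the company is treading water.
    Negative  -> buybacks don't even cover dilution.
    """
    return (buybacks - sbc) / market_cap

# Hypothetical dollar figures:
print(treadmill_test(buybacks=900e6, sbc=600e6, market_cap=4_000e6))  # 0.075

# Reading the example above as percentage points of yield:
gross_fcf_yield, sbc_drag = 0.22, 0.15
print(gross_fcf_yield - sbc_drag)  # 0.07 -> shareholders keep ~7%, not 22%
```
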
L5

Structural Misunderstanding

Critical danger · Hard to detect
Example

"Criteo's pivot to Retail Media makes it a growth company trading at value multiples."

Defense

Nearly impossible to detect without deep familiarity. The only real defense: read voraciously, question assumptions, and always be willing to update your understanding.

Level 5 is the worst. The model understands the surface facts but misunderstands the structural dynamics of the business or industry.

In the Criteo example: yes, Retail Media is growing. But we calculated that 13% of revenue growing at 20% cannot mathematically offset 87% of revenue declining at 5%. The "pivot" narrative is true in direction but false in magnitude.
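The arithmetic is short enough to show. Using the round percentages above (segment weights and growth rates as quoted, not exact reported figures):

```python
# Revenue-weighted blended growth across the two segments described above.
retail_media_share, retail_media_growth = 0.13, 0.20   # 13% of revenue, growing ~20%
legacy_share, legacy_growth = 0.87, -0.05              # 87% of revenue, shrinking ~5%

blended = retail_media_share * retail_media_growth + legacy_share * legacy_growth
print(f"{blended:.2%}")  # about -1.75%: total revenue still shrinks despite the pivot
```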

Level 5 errors are nearly impossible to detect — short of developing deep familiarity yourself. And here's the uncomfortable truth: even "experts" get structural calls wrong. The market is full of professional analysts who missed obvious-in-hindsight structural shifts.

The Real Defense Against L5
There's no silver bullet. The best you can do is: read voraciously, defend yourself against common cognitive biases, seek out opposing viewpoints, and always be ready to update your priors when new information emerges. This is a continuous practice, not a credential.

Defense by Level

| Level | Detection Method | Who Can Catch It |
|---|---|---|
| L1 — Fake Citations | Check if source exists | Anyone with internet access |
| L2 — Wrong Numbers | Cross-check 2+ sources | Anyone who reads the filing |
| L3 — Missing Context | Check what's NOT mentioned | Someone who knows what to look for |
| L4 — Reasoning Errors | Run the math yourself | Someone who knows the right framework |
| L5 — Structural Errors | Deep familiarity + continuous learning | Nearly impossible — requires ongoing study |

How Our Framework Addresses Each Level

  • L1 (Fake Citations): We direct models to use aggregators over SEC EDGAR (aggregators are indexed better). Perplexity provides real-time citations.
  • L2 (Wrong Numbers): We mandate cross-source reconciliation in Stage 0. If the numbers don't match, the discrepancy is flagged.
  • L3 (Missing Context): We have specific checks for client funds, lease liabilities, SBC, and other commonly omitted items built into the prompt.
  • L4 (Reasoning Errors): The Treadmill Test, NCI check, and other mandatory calculations force explicit math. Harder for models to hide bad logic.
  • L5 (Structural Errors): Honestly? We can't reliably catch these either. We mark what we've manually checked vs. LLM-only, but structural understanding requires ongoing learning — not a checklist.

The Uncomfortable Truth

We can catch L1-L3 reliably with our current framework. L4 we catch most of the time with mandatory calculations. L5 is where we have to be honest: we can only catch structural errors in domains we understand.

If an LLM makes a structural error about oil drilling physics, we might miss it. If it makes a structural error about tech sector business models, we're more likely to catch it.

This is why we include a "What We Checked" section in every analysis. We want readers to know exactly what has been manually verified vs. what's still LLM-only — so you can decide what additional verification you need to do.

The Meta-Lesson
The higher the level of hallucination, the more you need to do your own thinking. No prompt engineering, multi-model validation, or "expert review" fully replaces your own informed judgment. That's why we show our work — so you can evaluate the reasoning yourself.

Acknowledgment: This taxonomy is based on errors we've actually encountered. As we see new failure modes, we'll update both this post and our verification framework.