"Hallucination" is used as a catch-all term for LLM errors. But not all hallucinations are equal. Some are obvious and harmless. Others are subtle and can cost you money.
After reviewing LLM outputs from our financial analyses, we've identified five distinct levels, ordered by how much domain expertise you need to catch them.
The Five Levels
Level 1: Fabricated Citations
"According to the company's Q3 2025 10-Q filed on October 15, 2025..."
Check if the filing exists on SEC EDGAR. Takes 30 seconds. If the date or document doesn't exist, the citation is fabricated.
Level 1 hallucinations are the easiest to catch. The model invents a source that doesn't exist. You can verify by checking whether the document is real.
We see this most often when models try to be "helpful" by inventing specific citations for documents they never actually ingested. Perplexity has an advantage here: it searches and cites real sources rather than fabricating them.
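If you'd rather script the check than click through EDGAR, the submissions endpoint returns a company's filing history as JSON. A minimal sketch using the `requests` library; the CIK and contact address are placeholders you'd fill in yourself:

```python
import requests

# Sketch only: look up the company's 10-digit CIK on EDGAR first, and put a
# real contact address in the User-Agent (EDGAR asks for one).
CIK = "0000000000"  # placeholder -- the company's zero-padded CIK
url = f"https://data.sec.gov/submissions/CIK{CIK}.json"
resp = requests.get(url, headers={"User-Agent": "research contact@example.com"})
resp.raise_for_status()

recent = resp.json()["filings"]["recent"]
claimed_form, claimed_date = "10-Q", "2025-10-15"  # what the model asserted

exists = any(
    form == claimed_form and date == claimed_date
    for form, date in zip(recent["form"], recent["filingDate"])
)
print("Filing found" if exists else "No such filing -- likely fabricated")
```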
Level 2: Wrong Numbers
"Revenue was $4.2 billion in Q3, up 12% year-over-year."
Cross-check against two sources (aggregator + primary filing). If the number appears in only one source, be suspicious.
Level 2 is wrong numbers. The document exists, but the model misreads or invents the figures. This is harder to catch because you need to actually read the source.
Common failure modes:
- Confusing revenue with contribution ex-TAC (we saw this with Criteo)
- Using TTM numbers when quarterly was asked for
- Mixing up fiscal years with calendar years
- Getting the sign wrong on growth rates
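The cross-check itself is easy to script once you've pulled the figures. A minimal sketch with placeholder numbers; the helper and the values are illustrative, not part of our framework:

```python
# Placeholder figures -- fill in from your aggregator and the primary filing.
claimed = 4.2e9          # what the model reported
aggregator = 4.2e9       # value from the data aggregator
primary_filing = 3.95e9  # value read directly from the 10-Q

def reconciles(a: float, b: float, tol: float = 0.01) -> bool:
    """True if two figures agree within a relative tolerance (default 1%)."""
    return abs(a - b) <= tol * max(abs(a), abs(b))

if not (reconciles(claimed, aggregator) and reconciles(claimed, primary_filing)):
    print("Mismatch: check whether the model grabbed the wrong line item "
          "(contribution ex-TAC vs revenue, TTM vs quarterly, wrong fiscal year).")
```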
Level 3: Omitted Context
"The company has $800M in cash, providing a strong balance sheet."
Check the full balance sheet. Cash without liabilities is meaningless. Look for 'client funds,' lease liabilities, pension obligations.
Level 3 is where things get dangerous. The number is correct, but crucial context is missing. The statement is technically true but misleading.
Level 3 hallucinations require you to know what context is missing. If you don't already understand payment processor accounting, you won't catch this.
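A toy version of the net-cash check, with invented numbers just to show how quickly the picture flips once you net out what the cash actually has to cover:

```python
# All figures are invented for illustration; pull the real ones from the
# balance sheet and its footnotes.
cash                = 800e6
client_funds        = 300e6  # held on behalf of customers, not the company's money
total_debt          = 450e6
lease_liabilities   = 120e6  # often sits outside headline "debt"
pension_obligations =  60e6

net_cash = (cash - client_funds) - total_debt - lease_liabilities - pension_obligations
print(f"Headline cash: ${cash/1e6:.0f}M; net of obligations: ${net_cash/1e6:.0f}M")
# "A strong balance sheet" with $800M of cash can be net-negative once you
# subtract what isn't the company's and what it already owes.
```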
Level 4: Plausible Reasoning Errors
"With 22% FCF yield and aggressive buybacks, shareholder returns are excellent."
Run the full math. Check if buybacks exceed SBC dilution. 22% gross yield means nothing if 15% is lost to stock-based compensation.
Level 4 is faulty reasoning that sounds correct. The model cites real numbers and makes an inference that seems logical but is actually wrong.
We built the "Treadmill Test" specifically to catch this. Many companies with high buyback yields are actually treading water or going backwards when you account for dilution.
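A back-of-the-envelope sketch of that math, with invented numbers; this illustrates the idea, not the framework's exact formula:

```python
# Invented figures: a company with a 22% FCF yield that hands much of it
# back out as stock-based compensation.
market_cap = 1_000e6
fcf        =   220e6   # 22% gross FCF yield
buybacks   =   220e6   # all FCF routed to buybacks
sbc        =   150e6   # stock-based compensation issued over the same period

gross_fcf_yield = fcf / market_cap                 # 22%
net_buyback_yield = (buybacks - sbc) / market_cap  # 7%: the part that actually shrinks the share count
print(f"Gross FCF yield: {gross_fcf_yield:.0%}; buyback yield net of SBC: {net_buyback_yield:.0%}")
```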
Level 5: Structural Misunderstanding
"Criteo's pivot to Retail Media makes it a growth company trading at value multiples."
Nearly impossible to detect without deep familiarity. The only real defense: read voraciously, question assumptions, and always be willing to update your understanding.
Level 5 is the worst. The model understands the surface facts but misunderstands the structural dynamics of the business or industry.
In the Criteo example: yes, Retail Media is growing. But we calculated that 13% of revenue growing at 20% cannot mathematically offset 87% of revenue declining at 5%. The "pivot" narrative is true in direction but false in magnitude.
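Spelled out, that's one line of arithmetic:

```python
# Segment weights and growth rates as quoted above (Criteo example).
retail_media = 0.13 * 0.20   # 13% of revenue growing ~20%
legacy       = 0.87 * -0.05  # 87% of revenue declining ~5%
blended = retail_media + legacy
print(f"Blended revenue growth: {blended:.2%}")  # about -1.75%: the pivot doesn't offset the decline
```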
Level 5 errors are nearly impossible to detect — short of developing deep familiarity yourself. And here's the uncomfortable truth: even "experts" get structural calls wrong. The market is full of professional analysts who missed obvious-in-hindsight structural shifts.
Defense by Level
| Level | Detection Method | Who Can Catch It |
|---|---|---|
| L1 — Fake Citations | Check if source exists | Anyone with internet access |
| L2 — Wrong Numbers | Cross-check 2+ sources | Anyone who reads the filing |
| L3 — Missing Context | Check what's NOT mentioned | Someone who knows what to look for |
| L4 — Reasoning Errors | Run the math yourself | Someone who knows the right framework |
| L5 — Structural Errors | Deep familiarity + continuous learning | Nearly impossible — requires ongoing study |
How Our Framework Addresses Each Level
- L1 (Fake Citations): We direct models to use aggregators over SEC EDGAR (aggregators are indexed better). Perplexity provides real-time citations.
- L2 (Wrong Numbers): We mandate cross-source reconciliation in Stage 0; numbers that don't match get flagged.
- L3 (Missing Context): We have specific checks for client funds, lease liabilities, SBC, and other commonly omitted items built into the prompt.
- L4 (Reasoning Errors): The Treadmill Test, NCI check, and other mandatory calculations force explicit math, which makes it harder for models to hide bad logic.
- L5 (Structural Errors): Honestly? We can't reliably catch these either. We mark what we've manually checked vs. LLM-only, but structural understanding requires ongoing learning — not a checklist.
The Uncomfortable Truth
We can catch L1-L3 reliably with our current framework. L4 we catch most of the time with mandatory calculations. L5 is where we have to be honest: we can only catch structural errors in domains we understand.
If an LLM makes a structural error about oil drilling physics, we might miss it. If it makes a structural error about tech sector business models, we're more likely to catch it.
This is why we include a "What We Checked" section in every analysis. We want readers to know exactly what has been manually verified vs. what's still LLM-only — so you can decide what additional verification you need to do.
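Purely as illustration, that section boils down to a record like the one below; the field names and entries are invented for this post, not the actual template:

```python
# Hypothetical sketch of a "What We Checked" record; fields are invented.
what_we_checked = {
    "filing_exists_on_edgar":      "manually verified",
    "revenue_vs_primary_filing":   "manually verified (aggregator + 10-Q)",
    "lease_and_client_fund_items": "manually verified",
    "treadmill_test_math":         "manually verified",
    "segment_economics_narrative": "LLM-only -- do your own verification",
}
```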
Acknowledgment: This taxonomy is based on errors we've actually encountered. As we see new failure modes, we'll update both this post and our verification framework.