Frameworks feel rigorous. Stages, checkpoints, required tables, scoring rubrics — they create the appearance of systematic analysis. And that appearance is dangerous, because it makes you trust the output more than you should.
The errors weren't in the reasoning.
The errors were in the numbers.
And the frameworks made the numbers look verified when they weren't.
Yes, we're aware this phrasing is an LLM-ism. We used it intentionally. Read more about LLM-isms and when they're problematic →
Where We Started: The One-Off Prompt Problem
When we started this project, we knew single-prompt LLM queries were unreliable. Ask an LLM to "analyze this stock" and you get something that sounds authoritative but is often shallow, inconsistent, and occasionally fabricated.
- Confident fabrications that sound plausible
- Training data cutoffs create blind spots
- Can't tell verified from guessed
- Tendency to tell you what you want
Our initial solution was to bounce LLM responses off each other (we cover this in depth in Why We Use Multiple LLMs). If four models analyze the same company and three agree on a conclusion, that's more credible than one model's opinion. The multi-model approach worked — disagreements surfaced blind spots, and consensus built confidence.
But the outputs were inconsistent. One model would focus on valuation, another on governance, a third on liquidity. Comparing them was like comparing apples to philosophical treatises about fruit.
Building the Frameworks
So we built frameworks. The Roadkill Radar came first — a structured approach for evaluating beaten-down stocks. Then the Prospectus Probe for newly public companies.
Both frameworks specified stages, required outputs, decision hierarchies, and classification taxonomies. Models couldn't wander off into whatever interested them — they had to answer specific questions in a specific order.
The results improved immediately. Analyses became comparable. Disagreements became tractable ("Gemini rates governance as concerning, ChatGPT rates it clean — why?"). Classifications became reproducible.
We were feeling good. The methodology was maturing. We added base layers and equity layers. We refined the multi-model synthesis process. We built more sophisticated critique-and-response cycles.
Then we built the Fugazi Filter — a forensic accounting framework for assessing companies where the core question isn't "is it cheap?" but "can we trust the numbers?"
And the Fugazi Filter broke our confidence.
What the Models Told Us
During the Fugazi Filter development, we ran the standard multi-model critique cycle. Four models proposed framework structures, we synthesized them, then each model critiqued the synthesis.
The critiques were unusually pointed. Here's what Gemini said about our v0.1 draft:
"The framework is conceptually 9/10 but operationally 4/10. It asks the LLM to act as a quant rather than a forensic linguist."
This was the key insight. We had designed frameworks that asked LLMs to do things they couldn't reliably do — and the framework structure made those failures invisible.
Opus added:
"The most effective fraud detection often comes from a small number of high-signal indicators examined with deep attention, not 30+ factors examined superficially."
We had over-engineered. More stages, more tables, more required outputs — it felt rigorous, but it was actually diluting attention across too many factors.
The Core Problem: Pseudo-Precision
When we applied the Fugazi Filter to an actual company (Carvana), the synthesis confidently reported specific numbers:
- "$1.4B insider selling by the Garcia family"
- "ABS 61+ day delinquencies at 3.93% (doubled from 1.76%)"
- "PIK interest deferring ~$430M/year until 2026"
These numbers were treated as verified facts. They appeared in tables. They drove classifications. They looked authoritative.
Then one of our models — GPT-5.2 — challenged them:
"The synthesis uses quantified claims as 'E2 evidence' without consistently citing primary sources. Where did the $1.4B figure come from? What trustee report shows the 3.93% delinquency rate? What indenture language confirms the PIK mechanics?"
Opus responded honestly:
"My original figures came from web search results citing secondary sources. These are not primary sources. The delinquency deterioration trend is directionally correct based on multiple corroborating sources, but the precise figures should be verified against primary trustee data."
Why Frameworks Amplify This Problem
Without a framework, LLM outputs are obviously approximate. "The company has significant debt" is clearly qualitative. You know to verify it.
With a framework that requires specific numbers — "Net Debt / EBITDA: ___x" — the output looks verified. The framework creates a professional-looking table. The number appears precise. The appearance of rigor increases trust.
But the number might be:
- Parsed from a document the LLM accessed — which might be stale or misread
- Retrieved from training data — which has a cutoff date
- Calculated mentally — LLMs are bad at arithmetic
- Hallucinated with confidence — the worst case, but indistinguishable from the others
We cannot distinguish between these cases just by looking at the output. The framework's structure makes all four look identical.
The Three Framework Failure Modes
The Fugazi Filter critique cycle crystallized three interconnected problems that can affect any structured LLM framework:
1. Over-Engineering
Given freedom to design, LLMs add stages, scores, gates, and rubrics without considering operational cost. Each additional component is another opportunity for errors to compound. The v0.1 Fugazi Filter had 8 stages, 2 checkpoints, 6 categories, and 4 detailed scoring rubrics. By the end, later stages were getting less attention as the context window filled.
2. Multi-Task Conflation
Early Fugazi Filter versions tried to answer three questions at once: Can we trust the numbers? (integrity) Can it survive stress? (fragility) Is the market pricing this? (valuation). These require different data and different analytical methods. Collapsing them into one classification destroyed nuance. A transparent but over-levered company got lumped with a suspected fraud.
3. Pseudo-Precision
Frameworks that require specific numbers create the appearance of verification. "EV/EBITDA: 4.2x" looks more trustworthy than "the company appears cheap." But the LLM may have pulled that 4.2x from a stale source, calculated it incorrectly, or fabricated it entirely. The table format hides the uncertainty.
What We're Doing About It
We can't eliminate these problems entirely — not without building a completely different kind of system. But we can be more honest about them.
Qualitative over quantitative. Instead of requiring "EV/EBITDA: ___x", we ask "Does the valuation appear significantly below sector and historical norms?" The LLM becomes a pattern-recognition engine, not a calculator. It's what they're good at.
Source tagging. Every factual claim should indicate where it came from — primary source (SEC filing), secondary source (news article citing the filing), or estimated. We can't always enforce this, but asking for it surfaces uncertainty.
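The source-tagging idea can be sketched as a small data structure. This is an illustrative sketch, not the Fugazi Filter's actual schema: the class names, tag levels, and the `is_verified` rule are our assumptions about one reasonable way to make a claim's evidentiary basis explicit.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Hypothetical tag levels -- illustrative, not the framework's real taxonomy.
class SourceLevel(Enum):
    PRIMARY = "primary"      # e.g. the SEC filing or trustee report itself
    SECONDARY = "secondary"  # e.g. a news article citing the filing
    ESTIMATED = "estimated"  # model inference with no citation at all

@dataclass
class Claim:
    text: str
    source_level: SourceLevel
    citation: Optional[str] = None  # document identifier when known

    def is_verified(self) -> bool:
        # Only primary-sourced claims with an actual citation count as verified;
        # everything else is treated as approximate.
        return self.source_level is SourceLevel.PRIMARY and self.citation is not None

# A secondary-sourced figure like the delinquency rate stays "approximate".
claim = Claim(
    text="ABS 61+ day delinquencies at 3.93%",
    source_level=SourceLevel.SECONDARY,
    citation="news summary of trustee report",
)
print(claim.is_verified())  # False -- secondary source, treat as approximate
```

Even when the model can't be forced to fill these fields honestly, requiring the field at all surfaces the cases where no citation exists.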
Explicit "don't know" paths. Our frameworks used to treat "Data Not Found" as a failure. Now it's a valid output. If a model can't verify something, saying so is more valuable than guessing.
Dual-axis separation. The Fugazi Filter now outputs Integrity Risk and Fragility Risk separately, with a distinct interpretation matrix. A company can be LOW integrity risk and HIGH fragility risk (honest but stressed) or the reverse (fraudulent but well-funded). Different action implications.
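The dual-axis idea amounts to a small interpretation matrix. A minimal sketch, with made-up quadrant labels (the framework's actual wording and action implications are not reproduced here):

```python
# Hypothetical interpretation matrix for (integrity risk, fragility risk).
# The quadrant descriptions are illustrative assumptions, not the real matrix.
INTERPRETATION = {
    ("LOW", "LOW"):   "Trustworthy and stable: analyze valuation normally",
    ("LOW", "HIGH"):  "Honest but stressed: a solvency question, not a fraud question",
    ("HIGH", "LOW"):  "Suspect but well-funded: numbers may mislead for a long time",
    ("HIGH", "HIGH"): "Suspect and fragile: highest-risk quadrant",
}

def interpret(integrity_risk: str, fragility_risk: str) -> str:
    """Map the two independent risk ratings to a combined reading."""
    return INTERPRETATION[(integrity_risk, fragility_risk)]

print(interpret("LOW", "HIGH"))
```

The point of keeping the axes separate is visible in the matrix itself: collapsing it to one score would merge the off-diagonal quadrants, which call for opposite responses.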
Evidence-level requirements. For high-stakes classifications, we now require specific evidence thresholds — not just "concerns exist" but "concerns are documented in [specific source type]." This doesn't guarantee accuracy, but it makes the evidentiary basis explicit.
What We Still Can't Solve
We're being honest here: some problems require architectural changes we haven't made yet.
Without a structured data layer — code that actually pulls SEC filings, parses specific fields, and presents them as verified inputs — any quantitative output from an LLM is suspect. The model might be right. It might be wrong. We can't tell.
The current workaround is to treat numbers as approximate and weight conclusions by evidence quality. A synthesis that says "four models agree this company has concerning governance, based on disclosed related-party transactions in the 10-K" is more reliable than one that says "the Treadmill Test score is 7.3%."
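The workaround of weighting conclusions by evidence quality can be sketched as a simple scaling rule. The weights below are illustrative assumptions, not calibrated values from our system:

```python
# Hypothetical evidence weights -- illustrative, not calibrated.
EVIDENCE_WEIGHT = {"primary": 1.0, "secondary": 0.5, "estimated": 0.2}

def conclusion_weight(models_agreeing: int, evidence_level: str) -> float:
    """Scale model consensus by the quality of the evidence behind it."""
    return models_agreeing * EVIDENCE_WEIGHT[evidence_level]

# Four models agreeing on a claim grounded in a primary filing carries far
# more weight than four models agreeing on an unsourced figure.
print(conclusion_weight(4, "primary"))
print(conclusion_weight(4, "estimated"))
```

The numbers matter less than the ordering: consensus alone never outranks consensus plus a primary source.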
Is this enough? We don't know. We're running more analyses, tracking where the frameworks fail, and iterating. The goal isn't perfection — it's being honest about the limitations while extracting what value we can.
The Takeaway
If you're building LLM-based analysis systems, here's what we'd tell you:
- Frameworks are necessary — without structure, multi-model comparison is meaningless
- Frameworks create new risks — specifically, the appearance of rigor masking actual uncertainty
- Play to LLM strengths — pattern recognition, synthesis, counter-arguments — not calculation or data verification
- Make uncertainty explicit — source tagging, evidence levels, "don't know" as a valid output
- Have your models critique the framework — they'll tell you what's operationally broken
We started this project trying to reduce LLM errors. We ended up learning that the errors we could see — the obvious hallucinations, the inconsistent reasoning — were less dangerous than the errors we couldn't see: numbers that looked verified because the framework formatted them nicely.
The fix isn't to abandon frameworks. It's to build frameworks that are honest about what they can and can't verify. Confidence in numbers ≠ accuracy of numbers. Structure ≠ rigor. Precision ≠ correctness.
We're still learning. But at least now we know what we're watching for.
Further Reading
The Fugazi Filter
The framework that broke our confidence — and what we learned rebuilding it.
Why We Use Multiple LLMs
The case for never trusting a single model's output on anything important.
A Taxonomy of LLM Hallucinations
Not all hallucinations are created equal. Some are dangerous, some are obvious.