Methodology · LLM Behavior

Why We Use Multiple LLMs (And You Should Too)

December 28, 2025 · 8 min read

Every analysis on this site runs through at least four different AI models. This takes longer. It costs more. It creates contradictions we have to reconcile. We do it anyway.

Here's why single-model outputs are riskier than they appear, and what we've learned from watching models disagree.

The Problem With Single-Model Trust

When you ask one LLM a question, you get one answer delivered with uniform confidence. The model doesn't say "I'm 90% sure about this but 40% sure about that." Everything sounds equally authoritative.

This is a problem because LLMs are wrong in ways that are hard to detect:

  • Hallucinations — Confident fabrications that sound plausible
  • Training biases — Systematic blind spots from what data was (or wasn't) in training
  • Sycophancy — Tendency to tell you what you want to hear
  • Recency bias — Over-weighting recent context vs. training knowledge

The Confidence Trap

A single LLM will give you an answer that sounds 95% confident whether it's drawing on solid training data or making things up. You can't tell from the output.

What Multiple Models Reveal

When you run the same prompt through Claude, GPT, Gemini, and Perplexity, three outcomes are possible:

Agreement

All models converge on the same answer. This doesn't guarantee correctness, but it means the answer is consistent with multiple training distributions. Higher confidence warranted.

Partial Disagreement

Models agree on the core facts but differ on interpretation or emphasis. This is often the most informative outcome — it shows you the range of reasonable views.

Contradiction

Models give incompatible answers to a factual question. This is a red flag — at least one model is wrong, possibly all of them. Manual verification required.
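
For factual, numeric questions, the agreement-versus-contradiction call can be roughed out automatically: extract the headline number from each answer and check the spread against the median. The sketch below is an illustration, not our pipeline; the extract_number helper and the 10% tolerance are assumptions, and interpretation-level disagreement still needs a human read.

```python
import re
from statistics import median

def extract_number(answer: str) -> float | None:
    """Pull the first number (e.g. '13%' or '25.3x') out of a free-text answer."""
    match = re.search(r"\d+(?:\.\d+)?", answer)
    return float(match.group()) if match else None

def compare_numeric_answers(answers: dict[str, str], tolerance: float = 0.10) -> dict:
    """For a factual numeric question, flag models that stray from the consensus.

    `tolerance` is the relative deviation from the median we still count as agreement.
    """
    values = {model: extract_number(text) for model, text in answers.items()}
    numbers = [v for v in values.values() if v is not None]
    if len(numbers) < 2:
        return {"verdict": "not enough comparable answers"}
    consensus = median(numbers)
    outliers = [model for model, v in values.items()
                if v is not None and abs(v - consensus) / consensus > tolerance]
    return {
        "consensus": consensus,
        "outliers": outliers,  # empty list means the models agree
        "verdict": "agreement" if not outliers else "contradiction: verify manually",
    }

# The Criteo example below: three models cluster at ~13%, GPT is the outlier.
print(compare_numeric_answers({
    "claude": "~13% of total revenue",
    "gpt": "Approximately 18% and growing",
    "gemini": "13-14% of Contribution ex-TAC",
    "perplexity": "About 13% of revenue",
}))
# -> {'consensus': 13.0, 'outliers': ['gpt'], 'verdict': 'contradiction: verify manually'}
```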

Real Examples From Our Research

These are actual disagreements we encountered when analyzing stocks:

Q: What is Criteo's Retail Media revenue as a percentage of total?

Claude: ~13% of total revenue based on latest filings
GPT: Approximately 18% and growing
Gemini: 13-14% of Contribution ex-TAC
Perplexity: About 13% of revenue, cited Q3 2025 report

What happened: GPT hallucinated a higher number. Claude and Gemini agreed on 13%, which matched the SEC filing. Perplexity confirmed with a citation. If we'd only used GPT, we'd have overstated the growth engine by 40%.

Q: Does Chord Energy have a moat?

Claude: Narrow moat based on Williston Basin acreage position and operational efficiency
GPT: No moat — commodity business with no pricing power
Gemini: Wide moat — 4-mile lateral technology creates structural cost advantage

What happened: This is a judgment call, not a factual question. The disagreement is informative — it shows the range of reasonable perspectives. We presented all three in the final analysis rather than picking one.

Q: What's Adobe's forward P/E ratio?

Claude: ~28x based on analyst estimates
GPT: 25.3x per latest consensus
Gemini: Approximately 27x forward earnings
Perplexity: 25.8x citing Yahoo Finance data from today

What happened: Small disagreements on a dynamic number. Perplexity had the most recent data (an advantage of search-augmented models). The spread (25-28x) was narrow enough not to materially change the analysis, but we used Perplexity's figure as the primary input.

Different Models, Different Personalities

After running many analyses, we've noticed consistent personality differences:

Model      | Tendency                               | Best For
Claude     | Careful, hedges, expresses uncertainty | Nuanced analysis, edge cases
GPT        | Conventional, synthesis-focused        | Baseline / "consensus" view
Gemini     | Contrarian, challenges assumptions     | Stress-testing, finding weaknesses
Perplexity | Source-focused, cites data             | Recent facts, verification

The Ensemble Advantage

Each model has blind spots. Claude is sometimes too cautious. GPT sometimes hallucinates confidently. Gemini sometimes over-rotates to contrarian takes. Perplexity's underlying model is weaker on reasoning. Together, they balance out.
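
One practical use of the table above is as a routing map: pick the model whose tendency matches the job, and fall back to the full ensemble when unsure. The sketch below is illustrative; the task keys and model identifiers are placeholders, not our actual configuration.

```python
# Illustrative task -> model routing, mirroring the table above.
# Task keys and model identifiers are placeholders, not our actual configuration.
MODEL_ROUTING = {
    "nuanced_analysis": "claude",      # careful, expresses uncertainty
    "baseline_consensus": "gpt",       # conventional, synthesis-focused
    "stress_test": "gemini",           # contrarian, challenges assumptions
    "recent_facts": "perplexity",      # search-augmented, cites sources
}

def pick_model(task: str) -> str:
    # Anything we haven't classified goes to the full ensemble by default.
    return MODEL_ROUTING.get(task, "all_four")

print(pick_model("recent_facts"))  # -> "perplexity"
```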

When Disagreement Is the Point

Some questions don't have objectively correct answers. "Is this stock a good value?" depends on assumptions about future growth, discount rates, and competitive dynamics.

For these questions, multi-model disagreement is a feature. It shows you the range of defensible positions. If Claude says "bullish," Gemini says "bearish," and GPT says "neutral," you're seeing the actual uncertainty in the question — not a false consensus.

Our job is to surface this disagreement clearly, not to paper over it.

The Practical Takeaway

If you're using LLMs for anything important:

  1. Run the same prompt through multiple models. Even two models are better than one; a minimal sketch of the fan-out follows this list.
  2. Pay attention to where they disagree. That's where you need to do manual verification.
  3. Use disagreement as an uncertainty signal. If models can't agree, neither should you — without further research.
  4. Match models to tasks. Use search-augmented models for recent data, reasoning models for analysis.
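
Point 1 is mechanical enough to script. Below is a minimal sketch of the fan-out using the OpenAI and Anthropic Python SDKs; the model names are examples rather than the exact versions we run, and the Gemini and Perplexity calls, which follow the same shape with their own clients, are omitted for brevity.

```python
"""Fan one prompt out to multiple models and collect the answers side by side."""
from openai import OpenAI
import anthropic

def ask_gpt(prompt: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # example model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def ask_claude(prompt: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",  # example model name
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def fan_out(prompt: str) -> dict[str, str]:
    """Same prompt, every model; the disagreements in the result are the signal."""
    askers = {"gpt": ask_gpt, "claude": ask_claude}  # add Gemini/Perplexity clients here
    return {name: ask(prompt) for name, ask in askers.items()}

if __name__ == "__main__":
    question = "What is Criteo's Retail Media revenue as a percentage of total?"
    for model, answer in fan_out(question).items():
        print(f"{model}: {answer}\n")
```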

Single-model outputs feel authoritative because they're uniform. That uniformity is a lie. Reality is messier. Multiple models reveal the mess.

Note: We don't claim multi-model consensus eliminates error. Four models can be wrong in the same way if they share training data biases. Human verification remains essential for high-stakes claims.