LLM Behavior · Methodology · Research

Auditing the Auditors: What We Learned Running 4 LLMs Through a Structured Debate

January 11, 2026 · 12 min read

Evolution Note
This post documents an earlier stage of our thinking. We've since evolved from "which vendor models to use" to "which personas and cognitive styles to apply." The behavioral observations here remain useful, but our methodology now emphasizes role design over model selection. See Why Personas Matter More Than Models and the Analysis Ensemble.
The Discovery
LLMs have personalities — not in the "sentient AI" sense, but in the "predictable behavioral patterns" sense. We ran four frontier models through a structured debate and watched them disagree, correct each other, and occasionally get things wrong. The patterns were consistent enough to predict.

We learned two things: your prompts matter more than memory features, and each model has systematic biases you can predict and compensate for.

See the methodology in action

Our equity analyses run through this exact multi-model process.

The Experiment

We ran an analysis of Carvana (CVNA) through our committee process: Opus 4.5, ChatGPT-5.2, Gemini 3, and Perplexity. Two different analytical lenses (Fugazi Filter for accounting integrity, Gravy Gauge for revenue durability). Six stages per lens — initial analysis, synthesis, critique, debate questions, responses, and convergence.
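The committee structure above can be sketched as a loop over lenses, stages, and models. This is a hypothetical reconstruction, not our actual tooling: `ask_model` is a placeholder for a manual prompt into a consumer chat interface, and the stage prompts are heavily simplified.

```python
# Hypothetical sketch of the committee process described above.
# `ask_model` stands in for a manual chat-interface request; there was no API.

MODELS = ["Opus 4.5", "ChatGPT-5.2", "Gemini 3", "Perplexity"]
LENSES = ["Fugazi Filter", "Gravy Gauge"]
STAGES = ["initial", "synthesis", "critique", "debate_questions",
          "responses", "convergence"]

def ask_model(model: str, prompt: str) -> str:
    """Placeholder for one chat-interface request."""
    return f"[{model} output for: {prompt[:40]}...]"

def run_lens(lens: str, sources: list[str]) -> dict[str, list[str]]:
    transcript: dict[str, list[str]] = {}
    context = f"Lens: {lens}. Sources: {', '.join(sources)}."
    for stage in STAGES:
        # Stage 1 is "unpolluted": models see only the source documents.
        # Later stages also see the other models' prior outputs.
        seen = "" if stage == "initial" else str(transcript)
        transcript[stage] = [
            ask_model(m, f"{context} Stage: {stage}. {seen}") for m in MODELS
        ]
    return transcript

outputs = {lens: run_lens(lens, ["10-K", "10-Q", "earnings call"])
           for lens in LENSES}
# Each lens yields 6 stages x 4 models = 24 per-model outputs.
```

The key property is that Stage 1 runs with no cross-model context, which is what makes the eight "unpolluted" analyses comparable.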

The Data

  • 66 markdown files generated across the full analysis
  • 8 independent Stage 1 analyses — the "unpolluted" outputs before any model saw another's work
  • 4 frontier models debating across 6 structured stages
  • Consumer chat interfaces — memory features potentially active

The Questions

  1. Are the models using memory about us?

     We ran these through consumer chat interfaces where memory features exist. Is our history bleeding into the outputs?

  2. Do the models have stable personalities?

     Not "consciousness" — just predictable behavioral tendencies that show up consistently.

Finding 1: Memory Didn't Matter

We found no evidence of user-specific personalization in any output. Zero references to prior conversations. No "based on your interest in..." framing. No style adaptation to match previous interactions.

Every output referenced only the source documents we provided (10-K, 10-Q, earnings call). The structure was identical to what you'd get from a fresh account.

Why This Happened
Our prompts are highly structured: required sections, explicit evidence levels, specific output format. The structure appears to force models into "analyst mode" that excludes personal context. Memory features may not activate when prompts are sufficiently rigid.

We don't know if this generalizes. Maybe memory matters for conversational prompts but not structured ones. Maybe our user history wasn't populated enough to surface. But for structured analytical work, the prompt template seems to dominate.
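As an illustration of what "highly structured" means here, a minimal prompt-builder in this spirit might look like the following. The section names, evidence levels, and wording are invented for the example; they are not our actual templates.

```python
# Illustrative only: a rigid template with required sections, explicit
# evidence levels, and a fixed output format, in the spirit described above.

REQUIRED_SECTIONS = ["Findings", "Evidence", "Counterarguments", "Classification"]
EVIDENCE_LEVELS = ["DIRECT", "INFERRED", "SPECULATIVE"]  # invented labels

def build_prompt(company: str, lens: str, documents: list[str]) -> str:
    sections = "\n".join(f"## {s}" for s in REQUIRED_SECTIONS)
    return (
        f"Role: equity analyst applying the {lens} lens to {company}.\n"
        f"Use ONLY these sources: {', '.join(documents)}.\n"
        f"Tag every claim with an evidence level from {EVIDENCE_LEVELS}.\n"
        f"Output exactly these sections, in order:\n{sections}\n"
        "Do not reference prior conversations or user history.\n"
    )

prompt = build_prompt("Carvana (CVNA)", "Fugazi Filter",
                      ["10-K", "10-Q", "earnings call"])
```

The hypothesis from Finding 1 is that a template this rigid leaves no slot for memory to fill: every sentence the model emits is already spoken for by a required section.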

Finding 2: The Personalities Are Real

Each model showed consistent behavioral signatures across both lenses and all stages. These weren't random variations — they were patterns.

Opus 4.5 (Claude)

The Thorough Hedger

Traits:
  • Longest outputs (25KB vs. 8KB for Gemini)
  • 10+ numbered findings per section
  • Explicit evidence levels on every claim
  • Heavy hedging: 'While no definitive evidence...'
Blind spots:
  • Excessive hedging dilutes signal clarity
  • May understate clear concerns to avoid appearing alarmist

ChatGPT-5.2 (GPT)

The Mechanistic Reviser

Traits:
  • Focuses on how mechanisms work ('the plumbing')
  • Initially aggressive on severity labels
  • High willingness to revise when challenged
  • Uniquely fact-checks externally (SEC.gov, Justia)
Blind spots:
  • Initial overreach on governance labels
  • May overcorrect when challenged

Gemini 3

The Bold Narrator

Traits:
  • Shortest outputs — gets to the point
  • Colorful language ('chemically enhanced earnings')
  • Most aggressive initial stance
  • Strong conceptual framing
Blind spots:
  • May conflate 'fragile' with 'fraudulent'
  • Colorful language can overstate severity
  • Fewer citations makes claims harder to verify

Perplexity

The Academic Pessimist

Traits:
  • Academic/research-paper style
  • Citation-heavy with inline references
  • Tends toward pessimistic readings
  • Good at surfacing litigation and regulatory risks
Blind spots:
  • Prone to specific factual errors
  • Called Grant Thornton 'Big-4 type' — it's mid-tier
  • Citation obsession may prioritize sourcing over analysis

The Patterns in Action

Here's how the same question generated different responses — and what that reveals:

Question: How should we classify Carvana's governance alignment?

"MISALIGNED. The Garcia family's 84% voting control through dual-class shares... creates severe conflicts of interest. However, the assessment is MISALIGNED rather than CAPTURED because disclosures are adequate and no proven abuse exists."

Opus

"CAPTURED. The setup is structurally captured: minority shareholders have limited ability to enforce discipline if incentives diverge."

ChatGPT (initial)

"MISALIGNED. The governance structure is designed to entrench the Garcia family's control and facilitate value transfer to their private interests."

Gemini

ChatGPT started at "CAPTURED" — a stronger label meaning demonstrated abuse, not just structural conflict. Through the critique phase, Opus and Perplexity pushed back: "CAPTURED requires proven abuse, not just concentrated control." ChatGPT revised to MISALIGNED.

The same dynamic played out with Gemini on accounting severity:

Model        Initial         Final           Outcome
Opus         QUESTIONABLE    QUESTIONABLE    Held
ChatGPT      CAPTURED        MISALIGNED      Revised
Gemini       CONCERNING      QUESTIONABLE    Revised
Perplexity   QUESTIONABLE    QUESTIONABLE    Held

Gemini initially rated accounting integrity as "CONCERNING" — implying potential manipulation. After Opus noted that "aggressive but legal isn't the same as manipulation," Gemini revised to QUESTIONABLE. The colorful language ("chemically enhanced earnings") stayed, but the classification calibrated to match the evidence.

The Error That Got Caught
Perplexity described Grant Thornton (Carvana's auditor) as a "Big-4 type firm." Gemini caught this in critique: "Grant Thornton is mid-tier, not Big 4. This distinction matters for risk assessment." Perplexity acknowledged the error. A single-model workflow would have propagated it.

What This Means For You

If you're using LLMs for serious analysis, here's what we'd suggest:

  1. Don't worry about memory (for structured work). Highly structured prompts seem to override personalization. If you're worried, the prompt template is the fix, not disabling memory.
  2. Expect specific biases from specific models. Gemini runs hot on severity. ChatGPT overreaches on governance. Perplexity gets facts wrong. Build your process to catch these.
  3. Use Opus or Perplexity as your anchor. In our run, both were calibrated to the eventual consensus from the start. ChatGPT and Gemini needed to be "talked down."
  4. Let ChatGPT fact-check. It was the only model that went to SEC.gov, Justia, and Reuters to verify claims. That caught the "no insider selling" error in our synthesis.
  5. Keep Gemini's frames, not its labels. The "chemically enhanced earnings" and "circular financing" concepts were brilliant — they became central to the final analysis. The initial severity ratings were not.

The Committee Worked

We're genuinely pleased that the six-stage process caught every issue we could identify. The structured debate forced models to defend their positions with evidence, and when the evidence didn't support their initial stance, they revised.

What the process caught

  • 5 position overreaches — models starting with severity labels that exceeded what the evidence supported
  • 1 factual error — Perplexity's mischaracterization of the auditor's tier
  • 1 scope creep issue — a classification that went beyond the lens's mandate

The final classifications — QUESTIONABLE accounting, MISALIGNED governance, CONDITIONAL revenue durability — weren't an average of the four models. They were the positions that survived critique from all perspectives, which represents a meaningfully different kind of confidence than what you get from a single model.

We don't know if this generalizes beyond financial analysis, or whether different prompting approaches would surface different personality patterns, or whether model updates will shift these tendencies over time. What we do know is that for this specific task, with this structure, the models behaved predictably — and the process caught the failures before they reached the final output.

This report was generated by the Runchey Research AI Ensemble using primary SEC data and reviewed by Matthew Runchey for accuracy.

This analysis is for educational purposes only and does not constitute investment advice. See our Editorial Integrity & Disclosure Policy and Terms of Service.
