We learned two things: your prompts matter more than memory features, and each model has systematic biases you can predict and compensate for.
See the methodology in action
Our equity analyses run through this exact multi-model process.
The Experiment
We ran an analysis of Carvana (CVNA) through our committee process: Opus 4.5, ChatGPT-5.2, Gemini 3, and Perplexity. Two different analytical lenses (Fugazi Filter for accounting integrity, Gravy Gauge for revenue durability). Six stages per lens — initial analysis, synthesis, critique, debate questions, responses, and convergence.
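The shape of that process can be sketched as a simple enumeration. This is a minimal illustration, not our production tooling; the identifier strings are placeholders we made up for the sketch.

```python
from itertools import product

# Placeholder identifiers -- the real pipeline's internal names differ.
MODELS = ["opus-4.5", "chatgpt-5.2", "gemini-3", "perplexity"]
LENSES = ["fugazi_filter", "gravy_gauge"]  # accounting integrity, revenue durability
STAGES = ["initial_analysis", "synthesis", "critique",
          "debate_questions", "responses", "convergence"]

def plan_runs():
    """Enumerate every (lens, stage, model) cell of the committee process."""
    return [{"lens": lens, "stage": stage, "model": model}
            for lens, stage, model in product(LENSES, STAGES, MODELS)]

runs = plan_runs()
# 2 lenses x 6 stages x 4 models = 48 cells per analysis.
# Stage 1 alone yields 2 x 4 = 8 independent "unpolluted" outputs,
# matching the count reported below.
stage1 = [r for r in runs if r["stage"] == "initial_analysis"]
```

The point of enumerating the grid up front is that Stage 1 cells can be run in isolation, before any model sees another's work.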
The Data
- 66 markdown files generated across the full analysis
- 8 independent Stage 1 analyses — the "unpolluted" outputs before any model saw another's work
- 4 frontier models debating across 6 structured stages
- Consumer chat interfaces — memory features potentially active
The Questions
1. Are the models using memory about us?
   We ran these through consumer chat interfaces where memory features exist. Is our history bleeding into the outputs?
2. Do the models have stable personalities?
   Not "consciousness" — just predictable behavioral tendencies that show up consistently.
Finding 1: Memory Didn't Matter
We found no evidence of user-specific personalization in any output. Zero references to prior conversations. No "based on your interest in..." framing. No style adaptation to match previous interactions.
Every output referenced only the source documents we provided (10-K, 10-Q, earnings call). The structure was identical to what you'd get from a fresh account.
We don't know if this generalizes. Maybe memory matters for conversational prompts but not structured ones. Maybe our user history wasn't populated enough to surface. But for structured analytical work, the prompt template seems to dominate.
Finding 2: The Personalities Are Real
Each model showed consistent behavioral signatures across both lenses and all stages. These weren't random variations — they were patterns.
Opus 4.5 (Claude)
The Thorough Hedger
- Longest outputs (25KB vs. 8KB for Gemini)
- 10+ numbered findings per section
- Explicit evidence levels on every claim
- Heavy hedging: 'While no definitive evidence...'
- Excessive hedging dilutes signal clarity
- May understate clear concerns to avoid appearing alarmist
ChatGPT-5.2 (GPT)
The Mechanistic Reviser
- Focuses on how mechanisms work ('the plumbing')
- Initially aggressive on severity labels
- High willingness to revise when challenged
- Uniquely fact-checks externally (SEC.gov, Justia)
- Initial overreach on governance labels
- May overcorrect when challenged
Gemini 3
The Bold Narrator
- Shortest outputs — gets to the point
- Colorful language ('chemically enhanced earnings')
- Most aggressive initial stance
- Strong conceptual framing
- May conflate 'fragile' with 'fraudulent'
- Colorful language can overstate severity
- Fewer citations makes claims harder to verify
Perplexity
The Academic Pessimist
- Academic/research-paper style
- Citation-heavy with inline references
- Tends toward pessimistic readings
- Good at surfacing litigation and regulatory risks
- Prone to specific factual errors
- Called Grant Thornton 'Big-4 type' — it's mid-tier
- Citation obsession may prioritize sourcing over analysis
The Patterns in Action
Here's how the same question generated different responses — and what that reveals:
Question: How should we classify Carvana's governance alignment?
"MISALIGNED. The Garcia family's 84% voting control through dual-class shares... creates severe conflicts of interest. However, the assessment is MISALIGNED rather than CAPTURED because disclosures are adequate and no proven abuse exists."
— Opus
"CAPTURED. The setup is structurally captured: minority shareholders have limited ability to enforce discipline if incentives diverge."
— ChatGPT (initial)
"MISALIGNED. The governance structure is designed to entrench the Garcia family's control and facilitate value transfer to their private interests."
— Gemini
ChatGPT started at "CAPTURED" — a stronger label meaning demonstrated abuse, not just structural conflict. Through the critique phase, Opus and Perplexity pushed back: "CAPTURED requires proven abuse, not just concentrated control." ChatGPT revised to MISALIGNED.
The same dynamic played out with Gemini on accounting severity:
Gemini initially rated accounting integrity as "CONCERNING" — implying potential manipulation. After Opus noted that "aggressive but legal isn't the same as manipulation," Gemini revised to QUESTIONABLE. The colorful language ("chemically enhanced earnings") stayed, but the classification calibrated to match the evidence.
What This Means For You
If you're using LLMs for serious analysis, here's what we'd suggest:
- Don't worry about memory (for structured work). Highly structured prompts seem to override personalization. If you're worried, the prompt template is the fix, not disabling memory.
- Expect specific biases from specific models. Gemini runs hot on severity. ChatGPT overreaches on governance. Perplexity gets facts wrong. Build your process to catch these.
- Use Opus or Perplexity as your anchor. In our run, both were calibrated to the eventual consensus from the start. ChatGPT and Gemini needed to be "talked down."
- Let ChatGPT fact-check. It was the only model that went to SEC.gov, Justia, and Reuters to verify claims. That caught the "no insider selling" error in our synthesis.
- Keep Gemini's frames, not its labels. The "chemically enhanced earnings" and "circular financing" concepts were brilliant — they became central to the final analysis. The initial severity ratings were not.
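One way to operationalize "anchor on a calibrated model and catch overreach" is to compare each model's severity label against the anchor's and route the harsher ones into the critique stage. A minimal sketch, using the governance labels from the example above; the function names, and the ALIGNED label below CAPTURED/MISALIGNED, are our own assumptions, not part of the published lens.

```python
# Severity labels ordered mildest to harshest. MISALIGNED and CAPTURED
# come from the article; ALIGNED is an assumed baseline label.
SEVERITY = ["ALIGNED", "MISALIGNED", "CAPTURED"]

def rank(label):
    return SEVERITY.index(label)

def flag_overreach(labels, anchor="opus"):
    """Return models whose label is harsher than the anchor's --
    candidates for the critique stage, not automatic rejections."""
    baseline = rank(labels[anchor])
    return [model for model, label in labels.items()
            if model != anchor and rank(label) > baseline]

# Stage 1 governance labels from the Carvana run described above.
stage1_labels = {"opus": "MISALIGNED", "chatgpt": "CAPTURED",
                 "gemini": "MISALIGNED", "perplexity": "MISALIGNED"}
flagged = flag_overreach(stage1_labels)  # only ChatGPT's CAPTURED gets flagged
```

The flag is deliberately one-directional: a harsher-than-anchor label triggers a demand for evidence, which is exactly what happened to ChatGPT's initial CAPTURED call.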
The Committee Worked
What we're genuinely pleased about is that the 6-stage process caught every issue we could identify. The structured debate forced models to defend their positions with evidence, and when the evidence didn't support their initial stance, they revised.
What the process caught
- 5 position overreaches — models starting with severity labels that exceeded what the evidence supported
- 1 factual error — Perplexity's mischaracterization of the auditor's tier
- 1 scope creep issue — a classification that went beyond the lens's mandate
The final classifications — QUESTIONABLE accounting, MISALIGNED governance, CONDITIONAL revenue durability — weren't an average of the four models. They were the positions that survived critique from all perspectives, which represents a meaningfully different kind of confidence than what you get from a single model.
We don't know if this generalizes beyond financial analysis, or whether different prompting approaches would surface different personality patterns, or whether model updates will shift these tendencies over time. What we do know is that for this specific task, with this structure, the models behaved predictably — and the process caught the failures before they reached the final output.
Further Reading
Why Personas Matter More Than Models
Our evolved thinking: roles matter more than vendors
Analysis Ensemble
The 13 personas that orchestrate our analysis process
LLM-isms
Patterns that signal shallow pattern-matching
The Fugazi Filter
How we assess accounting integrity
Go to Equities
See the multi-model methodology applied to real companies