# Model Calibration
How well do our predictions match reality? Calibration metrics show whether models are overconfident, underconfident, or well-calibrated.
## Calibration Results

### Resolved Markets

#### vs External Consensus
| Market | Our Score | Consensus | Result |
|---|---|---|---|
| MOH: What will Molina Healthcare's 2026 full-year diluted EPS guidance midpoint be? | 30.1 | 28.8 | Missed |
## Scoring Methodology

### Brier Score (Binary)
For binary yes/no predictions, we use the Brier Score:
- 0.00 — Perfect prediction (100% confident and correct)
- 0.25 — Maximally uncertain (50% prediction)
- 1.00 — Worst possible (100% confident and wrong)
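The three anchor values above follow directly from the definition: the Brier score is the squared difference between the predicted probability and the 0/1 outcome. A minimal sketch:

```python
def brier_score(prob: float, outcome: int) -> float:
    """Squared error between a predicted probability and the 0/1 outcome.

    Lower is better: 0.0 is a perfect prediction, 1.0 the worst possible.
    """
    return (prob - outcome) ** 2

# 100% confident and correct -> 0.00
assert brier_score(1.0, 1) == 0.0
# Maximally uncertain (50%) -> 0.25 regardless of the outcome
assert brier_score(0.5, 1) == 0.25
assert brier_score(0.5, 0) == 0.25
# 100% confident and wrong -> 1.00
assert brier_score(1.0, 0) == 1.0
```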
### Weighted Interval Score (Continuous)
For continuous predictions (revenue, margins, etc.), we use the Weighted Interval Score (WIS), which evaluates the full distribution:
- Rewards narrow intervals when the actual value falls within them
- Penalizes overconfidence (narrow but wrong)
- Lower score = better prediction
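The full WIS averages interval scores over several nominal coverage levels; this document does not specify which levels or weights are used. The score for a single central (1 - alpha) interval, which exhibits the reward/penalty behavior listed above, can be sketched as:

```python
def interval_score(lower: float, upper: float, alpha: float, y: float) -> float:
    """Score for a central (1 - alpha) prediction interval [lower, upper].

    The width term rewards narrow intervals; the penalty terms charge
    2/alpha per unit the actual value y lands outside the interval.
    Lower is better.
    """
    score = upper - lower                      # narrower interval, lower score
    if y < lower:
        score += (2.0 / alpha) * (lower - y)   # overshot: interval too high
    elif y > upper:
        score += (2.0 / alpha) * (y - upper)   # undershot: interval too low
    return score

# A narrow interval containing the truth beats a wide one...
assert interval_score(28.0, 30.0, 0.2, 29.0) < interval_score(20.0, 38.0, 0.2, 29.0)
# ...but narrow-and-wrong (overconfidence) is penalized heavily
assert interval_score(28.0, 30.0, 0.2, 35.0) > interval_score(20.0, 38.0, 0.2, 35.0)
```

The example intervals (28-30 vs 20-38 at alpha = 0.2) are illustrative values, not figures from the results above.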
## Calibration Curve
A perfectly calibrated model's predictions should match outcomes:
- Events predicted at 30% should occur ~30% of the time
- Events predicted at 70% should occur ~70% of the time
We group predictions into buckets (0-10%, 10-20%, etc.) and compare predicted rates to actual outcomes.
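The bucketing step can be sketched as follows; the bucket-boundary convention and the handling of edge values are assumptions, not something this document specifies:

```python
from collections import defaultdict

def calibration_curve(predictions, outcomes, n_buckets=10):
    """Group (probability, 0/1 outcome) pairs into equal-width buckets.

    Returns {bucket_index: (mean predicted rate, observed rate)}; for a
    well-calibrated model the two rates should be close in every bucket.
    """
    buckets = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        # Round via integer percent to dodge float edge cases, and clamp
        # so p == 1.0 falls in the top bucket rather than past it.
        idx = min(int(round(p * 100)) * n_buckets // 100, n_buckets - 1)
        buckets[idx].append((p, y))
    return {
        idx: (sum(p for p, _ in pairs) / len(pairs),   # mean predicted rate
              sum(y for _, y in pairs) / len(pairs))   # observed rate
        for idx, pairs in sorted(buckets.items())
    }

# Toy data: three ~30% predictions, three ~70% predictions
preds = [0.30, 0.35, 0.32, 0.70, 0.72, 0.68]
outs  = [1, 0, 0, 1, 1, 0]
curve = calibration_curve(preds, outs)
```

With real volume, comparing the two rates per bucket yields the reliability diagram described above.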
## Model Comparison
Each market receives predictions from multiple models:

- Deep reasoning, handles edge cases and complex scenarios
- Balanced approach, good at pattern recognition
- Fast, pattern-focused, captures obvious signals
The aggregate prediction uses the median across all model runs. Model agreement is calculated as 1 minus normalized standard deviation — higher agreement suggests more confidence in the prediction.
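A sketch of that aggregation. The normalizer for the standard deviation is an assumption on our part: for predictions bounded in [0, 1], the population standard deviation can be at most 0.5, so we divide by that; the document itself does not say how it normalizes.

```python
import statistics

def aggregate(runs: list[float]) -> tuple[float, float]:
    """Combine model runs into (median point estimate, agreement in [0, 1]).

    Agreement = 1 - normalized standard deviation; identical runs give 1.0,
    maximally spread runs give 0.0. The 0.5 normalizer assumes runs are
    probabilities in [0, 1] (an assumption, not from the source).
    """
    point = statistics.median(runs)            # robust to a single outlier run
    spread = statistics.pstdev(runs)
    agreement = 1.0 - min(spread / 0.5, 1.0)   # clamp so agreement stays >= 0
    return point, agreement

point, agreement = aggregate([0.62, 0.65, 0.70])
```

Using the median (rather than the mean) keeps one wildly divergent run from dragging the aggregate; the agreement score then flags exactly those cases where the runs disagree.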