Where the Models Break

Jun 08, 2026

The previous posts in this series covered how win probability models work across six sports - the Markov chains behind baseball and tennis, the Poisson framework for soccer and hockey, the random walk in basketball, and the machine learning approach forced by football’s enormous state space. Each sport’s model reflects the structure of its game, and within those structures, the models are remarkably good.

But every model has boundaries. Win probability estimates are displayed to millions of viewers as clean numbers - 73.2%, 41.8% - that suggest a precision the underlying models don’t actually possess. The numbers are estimates, not measurements. And in specific, identifiable situations, those estimates break down in ways that matter.

This post catalogs the cross-sport failure modes. Understanding where models break is as important as understanding how they work - particularly if, like SportChartz, you’re building systems that use win probability as an input to further analysis.

The garbage time problem

Every sport has garbage time - stretches of play where the outcome is effectively decided and the competitive dynamics that the model assumes no longer hold. The specifics vary by sport, but the pattern is universal.

Basketball

When one NBA team leads by 25+ points in the fourth quarter, the leading team clears its bench. Starters rest. Reserves who typically play 5 minutes per game are now playing extended minutes against the other team’s reserves. The scoring dynamics change completely: the per-possession expected points drop, the variance in shooting quality increases, and neither team is executing its real offensive or defensive scheme.

A standard win probability model trained on all game data treats these possessions identically to competitive possessions. The result is a model that correctly assigns 99%+ win probability to the leading team but slightly misestimates the shape of the remaining probability distribution.

Football

Football’s version of garbage time is more insidious because it starts earlier. When a team leads by 21+ points in the second half, the trailing team abandons its normal offensive approach. Run-heavy teams start passing on every down. Leading teams run the ball on every play to drain clock. The Expected Points framework, calibrated on competitive play, doesn’t capture these strategic shifts.

The practical consequence: win probability estimates during blowouts are approximately correct in magnitude (the leading team is going to win) but the specific numbers (99.2% vs. 98.7%) reflect dynamics that no longer exist.

Why it matters

Garbage time affects model calibration at the extremes. When evaluating a model’s Brier score or calibration curve, the extreme bins (>95% and <5% predicted probability) are populated partly by garbage-time observations. If the model says 97% and the actual rate is 96%, the miscalibration is partly driven by garbage time dynamics that the model wasn’t designed for.
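As a rough illustration of the check described above, here is a minimal sketch of a binned calibration table and a Brier score in plain Python. The function names, the 20-bin default, and the toy data are assumptions made for the example, not part of any particular model’s evaluation code.

```python
def calibration_table(preds, outcomes, n_bins=20):
    """Group predictions into equal-width bins and compare the mean
    predicted probability with the observed win rate in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue  # skip empty bins
        mean_pred = sum(p for p, _ in members) / len(members)
        win_rate = sum(y for _, y in members) / len(members)
        rows.append({
            "lo": i / n_bins,
            "hi": (i + 1) / n_bins,
            "n": len(members),
            "mean_pred": mean_pred,
            "win_rate": win_rate,
            "gap": mean_pred - win_rate,  # positive gap = overconfident
        })
    return rows


def brier_score(preds, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(preds)


# Toy data (made up): 100 predictions of 97%, of which 96 were wins.
toy_preds = [0.97] * 100
toy_outcomes = [1] * 96 + [0] * 4
for row in calibration_table(toy_preds, toy_outcomes):
    print(row)
```

Running the same table twice, once on all observations and once with garbage-time observations excluded, is one way to see how much of the gap in the extreme bins comes from blowouts.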

For prediction market applications, garbage-time miscalibration is mostly harmless - the position is already decided. But for player evaluation metrics (EPA, WPA), garbage time creates noise that inflates or deflates individual players’ contributions based on the context they played in rather than the quality of their play.

The end-of-game strategy problem

In basketball, football, and hockey, the final minutes of a game involve strategic behavior that fundamentally differs from the rest of the game. Models trained on full-game data apply assumptions of stationarity that break precisely when the stakes are highest.

Basketball: intentional fouling

The final 90-120 seconds of a close basketball game operate under a different protocol. Trailing teams foul intentionally to stop the clock. Leading teams may foul to prevent three-point attempts. The scoring process shifts from field goals (variable value, clock-consuming possessions) to free throws (fixed value, minimal time consumed).

FiveThirtyEight addressed this by building a separate decision-tree model for the endgame. Most other models don’t - they extrapolate the Brownian motion or logistic regression framework into a regime where its assumptions are violated. The result is systematic error in the most important moments of the game.
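To make the extrapolation concrete, here is a minimal sketch of a Brownian motion win probability in the spirit of Stern (1994). The default drift and sigma values are illustrative placeholders, not fitted parameters, and the function name is an assumption for this example.

```python
import math


def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def brownian_win_prob(lead, frac_elapsed, drift=0.0, sigma=12.0):
    """P(team wins) when the score difference follows a Brownian motion.

    lead:         current lead in points (negative if trailing)
    frac_elapsed: fraction of the game already played, in [0, 1]
    drift:        expected full-game margin for this team (assumed value)
    sigma:        std. dev. of the full-game margin (assumed value)
    """
    remaining = 1.0 - frac_elapsed
    if remaining <= 0.0:
        # Game over: the lead decides it; a tie is treated as a coin flip.
        return 1.0 if lead > 0 else (0.5 if lead == 0 else 0.0)
    z = (lead + remaining * drift) / (sigma * math.sqrt(remaining))
    return normal_cdf(z)


# A 3-point lead with one minute left in a 48-minute game.
print(brownian_win_prob(lead=3, frac_elapsed=47 / 48))
```

Nothing in this formula knows about fouling, free throws, or timeouts. It treats the final minute as a scaled-down copy of the first 47, which is exactly the assumption that fails in the endgame.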

Football: two-minute drill

In the final two minutes of each half, offensive play selection, tempo, and clock management all change. The hurry-up offense compresses 30-second play gaps into 10-15 seconds. Timeout usage becomes strategic. The relative value of scoring (immediate) versus ball control (drain clock) shifts based on whether the team is leading or trailing.

Models like nflfastR handle this through the time-remaining variable, which the XGBoost trees can split on. But the training data for two-minute-drill situations is sparse relative to normal play, and the strategic interactions (timeout usage, onside kick decisions, spike-the-ball choices) create decision branches that the model may not have enough data to learn.

Hockey: empty net

The goalie pull in hockey’s final minutes creates a scoring-rate discontinuity - the trailing team’s offense gains an extra attacker while the leading team faces an empty net they can shoot at from anywhere. The Poisson model’s assumption of a constant scoring rate breaks completely.

The empirical success rate of goalie pulls is approximately 15% when trailing by one goal. But the gap between the mathematically optimal timing of the pull (models suggest around 5 minutes remaining) and actual practice (coaches typically pull with 2-3 minutes remaining) represents a systematic strategic inefficiency that the model can quantify but most implementations don’t account for.
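The size of the discontinuity can be sketched with two competing Poisson processes. This is a simplified illustration: it treats the game as decided by the next goal, and every rate below is a placeholder chosen for the example rather than an estimate from data.

```python
import math


def prob_trailing_scores_first(rate_for, rate_against, minutes):
    """P(the trailing team scores the next goal before time expires).

    With independent Poisson scoring at rate_for (trailing team) and
    rate_against (leading team), both in goals per minute, the next goal
    arrives at rate (rate_for + rate_against) and belongs to the trailing
    team with probability rate_for / (rate_for + rate_against).
    """
    total = rate_for + rate_against
    if total == 0.0:
        return 0.0
    return (rate_for / total) * (1.0 - math.exp(-total * minutes))


# Placeholder rates in goals per minute (assumptions, not estimates).
EVEN_STRENGTH = 0.04       # each team at even strength
PULLED_FOR = 0.10          # trailing team with the extra attacker
PULLED_AGAINST = 0.25      # leading team shooting at an empty net

minutes_left = 2.0
constant_rate = prob_trailing_scores_first(EVEN_STRENGTH, EVEN_STRENGTH, minutes_left)
goalie_pulled = prob_trailing_scores_first(PULLED_FOR, PULLED_AGAINST, minutes_left)
print(constant_rate, goalie_pulled)
```

A model that keeps using the even-strength rates after the pull is computing the first number when the game is actually being played under the second.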

The rare-state estimation problem

Every model must make predictions for game states that occur infrequently. The less data a model has for a specific state, the more uncertain the estimate - but the displayed win probability rarely communicates this uncertainty.

Football’s sparse states

A 4th-and-8 from the opponent’s 37-yard line, trailing by 5, with 4:23 remaining in the third quarter, 2 timeouts for the offense and 1 for the defense. This is a completely plausible game state. It has probably occurred a handful of times in recorded NFL history. The model’s win probability estimate for this state is an interpolation from nearby states, not a direct observation.

XGBoost handles this through its tree structure - it generalizes from states with similar features. But the uncertainty around the estimate is much larger than for common states (1st-and-10 from your own 25, first quarter, tied game). The model provides a point estimate without confidence intervals, giving the false impression that all predictions carry the same precision.

Baseball’s extreme states

Baseball’s state space is small enough that most cells are well-populated. But extreme score differentials (10+ runs) with unusual base-out states in late innings are sparse. A team trailing by 11 runs in the bottom of the 8th with bases loaded and no outs is an unusual combination. The historical lookup may rely on a very small number of observations, making the 0.2% win probability estimate essentially a guess rather than a well-estimated probability.
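One way to see how little a sparse cell pins down is to put an interval around the lookup. The sketch below uses the Wilson score interval for a binomial proportion. The function name and the example counts are assumptions for illustration, not figures from any historical database.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate observed as `wins` out of `n`.
    z=1.96 corresponds to roughly 95% coverage."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is the whole range
    p_hat = wins / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical cells: a rare state seen 40 times with no comebacks,
# and a common state seen 4,000 times with 2,200 wins.
print(wilson_interval(0, 40))
print(wilson_interval(2200, 4000))
```

Even with zero comebacks observed, a cell with only a few dozen entries is compatible with a true win probability of several percent, which is a very different claim from 0.2%.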

Tennis at deuce in the fifth set

Tennis win probability models are exact given the input parameters, but they’re sensitive to the point-win probability estimate. In a fifth-set tiebreak (or no-tiebreak final set) between closely matched players, a 1-percentage-point difference in the assumed point-win probability can swing the match-win probability by 5-10 percentage points. The model’s output is precise but only as accurate as its single input parameter.

The calibration problem

A calibration problem exists when a model’s predicted probabilities don’t match observed frequencies. Every model has calibration imperfections, but they’re often hidden in aggregate Brier scores.

Home advantage miscalibration

FiveThirtyEight’s RAPTOR basketball model systematically overestimates home team win probability - predicting home wins at ~70% when the actual rate is ~61%. This is a large systematic bias that compounds across every game in the season.

The source is likely in the RAPTOR player ratings, which may overweight home-court performance when projecting to game outcomes. But because the miscalibration is systematic rather than random, it’s detectable only by examining calibration curves, not aggregate accuracy metrics.

The confidence gap

Many models are overconfident at the extremes. A model that predicts 90% should be right 90% of the time. If it’s right 85% of the time for predictions in the 88-92% range, it’s overconfident. Miscalibration of this kind is common across sports, and it doesn’t always run in the same direction:

Football models tend to be slightly overconfident in early-game predictions because they overweight the current score when most of the game remains
Basketball models tend to be overconfident during the first half because the random walk hasn’t converged enough for the drift (team quality) to be distinguishable from noise
Soccer models tend to be underconfident on draws because the independent Poisson assumption underestimates draw probability

These biases are correctable through post-hoc calibration (isotonic regression, Platt scaling), but many deployed models don’t apply these corrections.
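For readers who haven’t seen it, isotonic regression can be sketched in a few lines with the pool-adjacent-violators algorithm. This is a minimal step-function version written for illustration. The function names are assumptions, and a production system would more likely use a library implementation fit on held-out games.

```python
from collections import defaultdict


def fit_isotonic(preds, outcomes):
    """Pool-adjacent-violators: returns a list of (upper_pred, calibrated)
    steps whose calibrated values are non-decreasing in the prediction."""
    # Aggregate outcomes for identical predictions first.
    totals = defaultdict(lambda: [0.0, 0])
    for p, y in zip(preds, outcomes):
        totals[p][0] += y
        totals[p][1] += 1
    blocks = []  # each block: [sum of outcomes, count, largest prediction]
    for p in sorted(totals):
        s, c = totals[p]
        blocks.append([s, c, p])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s2, c2, upper = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += c2
            blocks[-1][2] = upper
    return [(upper, s / c) for s, c, upper in blocks]


def apply_isotonic(steps, p):
    """Look up the calibrated value for a raw prediction p."""
    for upper, value in steps:
        if p <= upper:
            return value
    return steps[-1][1]  # above the largest seen prediction


# Toy data (made up): the model said 90% a hundred times and won 85.
steps = fit_isotonic([0.9] * 100, [1] * 85 + [0] * 15)
print(apply_isotonic(steps, 0.9))
```

The fitted map only reorders nothing and rescales everything: it preserves the ranking of game states while pulling the stated probabilities toward the observed frequencies.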

The independence assumption

Multiple sports’ models rely on an assumption of independence between events. This assumption is approximately true but generates systematic errors in specific situations.

Momentum and serial correlation

The Markov property - the future depends only on the present state, not the path taken to get there - is assumed by baseball’s transition matrices and tennis’s hierarchical model. If momentum exists (the probability of winning the next point or at-bat is affected by having won the previous one), the Markov assumption introduces error.

Klaassen and Magnus (2001) tested this directly in tennis and found small but statistically significant deviations from independence at break points and set points, consistent with a psychological pressure effect that slightly reduces the server’s point-win probability in high-leverage moments.

In basketball, Gabel and Redner (2012) found that the exponential distribution of inter-scoring intervals holds well - suggesting that short-run momentum in basketball scoring is weak or nonexistent at the aggregate level. But this finding applies to the overall scoring process; it doesn’t rule out momentum effects for specific players or specific situations within possessions.

The independence assumption is a reasonable approximation. But “reasonable approximation” means “systematically wrong in specific situations that may be identifiable.” For a platform applying technical analysis to win probability curves, these small serial correlations are exactly the kind of signal that might show up in the chart.
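A first-pass check for serial correlation is to compare outcome rates conditional on the previous outcome. The sketch below is a simplified illustration, not the method used in the papers cited above, and the function name and sample sequence are assumptions for the example.

```python
def conditional_win_rates(points):
    """Given a sequence of 1 (won) / 0 (lost) for consecutive points,
    return (P(win | previous won), P(win | previous lost))."""
    after_win = [cur for prev, cur in zip(points, points[1:]) if prev == 1]
    after_loss = [cur for prev, cur in zip(points, points[1:]) if prev == 0]

    def rate(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return rate(after_win), rate(after_loss)


# Made-up sequence of a server's points within one match.
sample = [1, 1, 0, 1, 0, 0, 1, 1]
p_after_win, p_after_loss = conditional_win_rates(sample)
print(p_after_win, p_after_loss)
```

Under the Markov assumption the two rates should agree up to sampling noise. One caution: pooling sequences from players of different strength can produce an apparent difference on its own, so a check like this is best run within a single server or matchup.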

Event clustering

In soccer and hockey, goals sometimes cluster - a team that scores one goal is more likely to score another in the next few minutes than the baseline Poisson rate would predict. This may reflect psychological effects (deflation of the conceding team, momentum of the scoring team) or tactical effects (teams opening up after falling behind).

The independent Poisson model treats each goal as arriving independently. If goals cluster, the model will underestimate the probability of multiple-goal swings in short time windows - exactly the events that create the most dramatic win probability movements.
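The baseline the model is working from is easy to compute. The sketch below gives the Poisson probability of at least k goals in a short window, then repeats the calculation with a temporarily elevated rate. The per-minute rate and the boost factor are hypothetical values chosen only to show the direction of the effect.

```python
import math


def prob_at_least_k_goals(rate_per_min, minutes, k):
    """P(at least k goals in the window) for a Poisson process."""
    lam = rate_per_min * minutes
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


# Hypothetical inputs: a team averaging 1.4 goals per 90 minutes,
# and a post-goal rate boosted by 50% (an assumption, not a finding).
base_rate = 1.4 / 90
boosted_rate = base_rate * 1.5

window = 10  # minutes
print(prob_at_least_k_goals(base_rate, window, 2))
print(prob_at_least_k_goals(boosted_rate, window, 2))
```

If clustering is real, the second number is closer to the truth in the minutes after a goal, and a model using the first will be surprised by quick doubles more often than it expects.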

The draw and overtime problem

Soccer and hockey both force models to handle outcomes beyond simple win/loss, and both create structural challenges.

Soccer draws

The independent Poisson model systematically underestimates draw frequency by 2-4 percentage points. The Dixon-Coles correction and bivariate Poisson extensions reduce but don’t eliminate this bias. In leagues where 25-27% of matches end in draws, a 2-point miscalibration means roughly 8 additional misclassified matches per season in a 380-match league.

The draw is the hardest outcome to predict because it requires both teams to score exactly the same number of goals - a condition that sits on the thin diagonal of the scoreline probability matrix. Small errors in the estimated scoring rates for either team shift probability mass on and off this diagonal, creating draw-probability estimates that are inherently less stable than win-probability estimates.
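The diagonal can be computed directly. Here is a minimal sketch of match outcome probabilities under the independent Poisson model, with no Dixon-Coles adjustment. The scoring rates in the example are made up, and the function names are assumptions for illustration.

```python
import math


def poisson_pmf(k, lam):
    """P(exactly k goals) for a Poisson rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probs(home_rate, away_rate, max_goals=15):
    """Return (home win, draw, away win) by summing the scoreline matrix.
    Scorelines above max_goals are ignored; their mass is negligible
    for typical soccer scoring rates."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home += p
            elif h == a:
                draw += p  # the thin diagonal of the matrix
            else:
                away += p
    return home, draw, away


# Made-up expected goals for two evenly matched sides, then a small nudge.
print(outcome_probs(1.3, 1.3))
print(outcome_probs(1.4, 1.2))
```

Comparing the two printed rows shows how a modest shift in the estimated rates moves mass between the diagonal and the two triangles on either side of it.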

Hockey overtime and shootout

The transition from regulation hockey (5-on-5, continuous play, Poisson-amenable) to overtime (3-on-3, sudden death, dramatically elevated scoring rate) to shootout (sequential 1-on-1 duels, Bernoulli trials) requires three separate sub-models joined at discontinuous boundaries.

The win probability curve exhibits visible jumps at these transition points. A team that was at 45% win probability with 30 seconds left in a tied regulation game might jump to 48% entering overtime (or to slightly above 50% if they have home ice or a quality advantage that manifests differently in 3-on-3 play) and then shift again entering the shootout based on goaltender quality and shooter depth.

No single model handles the full trajectory smoothly. The result is a win probability curve that’s well-estimated within each phase but potentially miscalibrated at the transitions.

The uncertainty about uncertainty

Perhaps the most fundamental problem across all sports is that win probability models report point estimates without uncertainty bands. A model that says 65% doesn’t tell you whether it’s confident that the true probability is between 63% and 67%, or whether it might plausibly be anywhere from 55% to 75%.

Sources of uncertainty

Parameter estimation uncertainty: The model’s coefficients (logistic regression) or tree structures (XGBoost) are estimated from finite data. Different training samples would produce slightly different models.
Model specification uncertainty: The choice of model (logistic regression vs. XGBoost vs. Poisson) affects the output. Different modeling choices produce different probabilities for the same game state.
State measurement uncertainty: In football, the exact yard line is measured imprecisely. In basketball, whether a team “has possession” during a loose ball is ambiguous. These state measurement errors propagate into win probability errors.
Non-stationarity: Team quality changes within a game (fatigue, injuries, substitutions) and across a season. A model trained on early-season data may not apply to late-season games, and a model calibrated on regular-season data may not apply to playoffs.

A 2024 simulation study

Gorman et al. (2024, arXiv:2406.16171) conducted a simulation study examining how difficult it is to estimate win probability accurately. They found that even with large amounts of data, the inherent randomness in sports outcomes means that win probability estimates carry substantial uncertainty. Two perfectly specified models can disagree by 5-10 percentage points on the same game state simply due to sampling variation in the training data.
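The effect of sampling variation alone can be illustrated with a toy simulation. This is not the method of the study above. It simply draws two independent samples of games from the same game state with a known true win probability and measures how far apart the two empirical estimates land. All names and parameters are assumptions for the example.

```python
import random


def disagreement_quantile(true_p, n_obs, n_trials=300, q=0.9, seed=0):
    """Simulate two 'models' that each estimate the win probability of one
    game state from n_obs independent games, and return the q-th quantile
    of the absolute gap between their estimates."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    gaps = []
    for _ in range(n_trials):
        est_a = sum(rng.random() < true_p for _ in range(n_obs)) / n_obs
        est_b = sum(rng.random() < true_p for _ in range(n_obs)) / n_obs
        gaps.append(abs(est_a - est_b))
    gaps.sort()
    return gaps[int(q * (n_trials - 1))]


# How the gap shrinks as the state becomes better populated.
for n in (50, 500, 5000):
    print(n, disagreement_quantile(true_p=0.65, n_obs=n))
```

Both simulated models here are correctly specified by construction, so any gap between them is pure sampling noise. Real models add specification and measurement error on top.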

This is not a flaw in any particular model. It’s a property of the problem. Win probability is an unobservable quantity - we never know the “true” probability that a team will win from a given state. We can only estimate it. And the estimation carries irreducible uncertainty that is rarely communicated to the audience.

Cross-sport summary of failure modes

| Failure Mode | Baseball | Basketball | Football | Soccer | Hockey | Tennis |
| --- | --- | --- | --- | --- | --- | --- |
| Garbage time | Minimal | Significant | Significant | Moderate | Moderate | N/A |
| End-of-game strategy | Minimal | Severe | Significant | Moderate | Significant (empty net) | Minimal |
| Rare state estimation | Low (small state space) | Low | High | Moderate | Moderate | Low (exact model) |
| Independence assumption | Weak violation | Weak violation | Moderate violation | Moderate (goal clustering) | Moderate (goal clustering) | Weak violation |
| Draw/overtime | N/A (extra innings) | Overtime model gap | Overtime model gap | Draw underestimation | OT/shootout discontinuity | N/A |
| Home advantage bias | Low | Moderate | Moderate | Moderate | Moderate | N/A |
| Calibration at extremes | Low | Moderate | Moderate-High | Low-Moderate | Low-Moderate | Low |
| Pitcher/goalie variance | Significant | N/A | N/A | Moderate | Significant | N/A |

Why this matters for SportChartz

Win probability is the raw input to everything SportChartz does. The transformation from win probability to a price-like signal, the reindexing from clock time to game events, the technical indicators computed on the transformed data - all of it sits on top of a win probability curve that carries the imperfections described in this post.

This isn’t a reason to distrust the system. It’s a reason to understand what the system is built on. A technical indicator that detects a divergence between the signal and RSI is detecting a pattern in transformed win probability data. If the underlying win probability estimate is noisy or biased in the current game state - late-game fouling, garbage time, a rare state with sparse historical data - the indicator’s signal may reflect model noise rather than genuine game dynamics.

The CMTs who read the charts well understand this intuitively. They know that a Bollinger squeeze in the first quarter of a blowout means something different than a squeeze in the fourth quarter of a 2-point game. The game context informs how much to trust the chart, and the chart is only as trustworthy as the win probability model underneath it.

Building a multi-sport platform means understanding these failure modes across sports, because the same transformation applied to football win probability and basketball win probability will inherit different types of error. The chart looks the same. The confidence behind the data driving it is not the same.

That’s what this series was built to document.

Neal Foster is Co-Founder & CTO of SportChartz and Founder & Partner of Vybe Capital.

References

Dixon, M.J. and Coles, S.G. (1997). “Modelling Association Football Scores and Inefficiencies in the Football Betting Market.” Journal of the Royal Statistical Society: Series C, 46(2), 265-280.

FiveThirtyEight. “How Our NBA Predictions Work.” https://fivethirtyeight.com/methodology/how-our-nba-predictions-work/

Gabel, A. and Redner, S. (2012). “Random Walk Picture of Basketball Scoring.” Journal of Quantitative Analysis in Sports, 8(1).

Gorman, E. et al. (2024). “Exploring the Difficulty of Estimating Win Probability: A Simulation Study.” arXiv:2406.16171.

Jamieson, K. and Goldblatt, N. (2024). “Blown Leads in the NFL.” Wharton Sports Analytics.

Klaassen, F.J.G.M. and Magnus, J.R. (2001). “On the Independence and Identical Distribution of Points in Tennis.” Journal of the American Statistical Association, 96(454), 500-509.

Stern, H.S. (1994). “A Brownian Motion Model for the Progress of Sports Scores.” Journal of the American Statistical Association, 89(427), 1128-1134.
