A single regression model trained on NBA game logs predicts that Joel Embiid will play 11 minutes in a game where he's listed as OUT. The model has never seen a confident zero. Every row in the training data has some minutes played, because the standard NBA API endpoint only returns logs for games where a player was active. The model knows what 28 minutes looks like and what 34 minutes looks like. It has no idea what zero looks like.
This is the root problem behind the two-stage minutes engine in CourtVision.
The Training Data Gap
The NBA API's PlayerGameLog endpoint returns one row per game for every game a player appeared in. If Embiid sits, there's no row. If Tyrese Maxey plays 38 minutes, there's a row. The dataset is survivor-biased: it only contains games where players actually played.
Train a regressor on this dataset, feed it features for a player who's clearly going to sit, and the model interpolates. It finds the nearest region of the feature space and returns a plausible minutes number, never zero. For a scenario engine that needs to know whether a player will play before predicting how many minutes they'll get, this is a fundamental flaw.
The fix was ingest-side, not model-side. The ingestion pipeline in ingest.py creates synthetic zero-minute rows for every player on a team's roster who doesn't appear in the game log. If the Sacramento Kings played on January 15th and De'Aaron Fox's PlayerID doesn't appear in the API response, the pipeline inserts a row: PlayerID=Fox, GameID=..., Minutes=0, PTS=0, REB=0, AST=0, Status=INACTIVE. This gives the classifier real negative examples to learn from.
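The backfill step can be sketched roughly like this. The column names and the `add_dnp_rows` helper are assumptions for illustration; the real ingest.py may structure this differently:

```python
import pandas as pd

def add_dnp_rows(game_log: pd.DataFrame, roster: pd.DataFrame) -> pd.DataFrame:
    """For each game, insert zero-minute rows for rostered players
    who don't appear in that game's log. Column names are hypothetical."""
    filled = []
    for game_id, log in game_log.groupby("GameID"):
        team_ids = log["TeamID"].unique()
        expected = roster[roster["TeamID"].isin(team_ids)]
        missing = expected[~expected["PlayerID"].isin(log["PlayerID"])]
        # Synthetic negative examples: the player was on the roster but never logged
        dnp = pd.DataFrame({
            "PlayerID": missing["PlayerID"],
            "GameID": game_id,
            "Minutes": 0, "PTS": 0, "REB": 0, "AST": 0,
            "Status": "INACTIVE",
        })
        filled.append(pd.concat([log, dnp], ignore_index=True))
    return pd.concat(filled, ignore_index=True)
```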
The Two-Stage Architecture
Stage A is a HistGradientBoostingClassifier. It takes the feature vector and outputs play_probability, the probability that the player will appear in the game at all. The features it receives are the same ones the regressor sees: Minutes_Avg, Rest_Days, Is_Home, Opponent_Pace, USG_Avg, Games_Played, plus one-hot encoded player role (Star, Starter, Rotation, Bench).
Stage B is a HistGradientBoostingRegressor, trained only on rows where Minutes > 0. If you train it on all rows including the synthetic zeros, the zeros dominate: there are more DNP rows than active rows for bench players, and the regressor learns to underpredict minutes for everyone.
At inference, the pipeline calls both models. If the classifier returns play_probability >= 0.5, the regressor's output becomes the minutes prediction. If it's below 0.5, minutes are set to zero and the stats predictor never runs. The code for this sits in predict_minutes():
import numpy as np

play_proba = classifier.predict_proba(X)[:, 1]  # Stage A: P(player appears)
minutes_pred = regressor.predict(X)             # Stage B: minutes, if playing
final_pred = np.where(play_proba >= threshold, minutes_pred, 0)
There's a second filter downstream that I didn't anticipate needing. The predict_player_performance() method in ScenarioEngine has a ghost player check: if the regressor predicts fewer than 20 minutes, the player gets dropped from the output entirely. Without this, the prediction sheets were cluttered with bench players projected for 12-14 minutes and 4.2 points. Valid predictions technically, but not useful for prop betting. The 20-minute cutoff was chosen empirically. Below 20 minutes, the MAE on points prediction climbs sharply because low-minute players have high variance: they might score 2 or they might score 14 depending on whether the game is a blowout.
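The filter itself is simple. The prediction-dict shape and `filter_ghost_players` name here are assumptions; only the 20-minute cutoff comes from the system described above:

```python
MIN_MINUTES_CUTOFF = 20  # empirical: below this, PTS MAE climbs sharply

def filter_ghost_players(predictions):
    """Drop players projected below the minutes cutoff.
    Assumes each prediction is a dict with a 'minutes' key."""
    return [p for p in predictions if p["minutes"] >= MIN_MINUTES_CUTOFF]
```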
The Leakage Problem
Every rolling statistic in the feature set uses .expanding().mean().shift(1). This is the single most important design decision in the feature engineering module, and I almost got it wrong.
The expanding window calculates the cumulative average of all games up to the current row. The .shift(1) moves the result down one row, so game N's features contain the average of games 1 through N-1, never game N itself. Without the shift, the model trains on features that include the target game's outcome. It looks like a strong model during training. It fails completely in production because you never have the current game's stats when making a prediction before tipoff.
I added a validation function (validate_no_leakage) that takes a specific player, picks game 5, manually calculates the average of games 0-4, and compares it against the PTS_Avg feature value for game 5. If they don't match within 0.01, the function prints a leakage warning. It caught a bug in the first version where I'd applied the shift before the expanding window instead of after.
The usage rate calculation has the same shift:
# Shift inside each player's group; a global .shift(1) applied after
# reset_index would leak the previous player's last value into the
# next player's first game
df['USG_Avg'] = (df.groupby('PlayerID')['USG']
                   .transform(lambda s: s.expanding().mean().shift(1)))
USG is an approximation: (FGA + 0.44 * FTA) / Minutes * 48, capped at 50% to prevent division artifacts from very short appearances. The 0.44 coefficient is a standard basketball analytics adjustment that accounts for and-one free throws and technical free throws not being possessions.
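As a formula in code, under the stated approximation (the `approximate_usage` name is illustrative):

```python
import numpy as np

def approximate_usage(fga, fta, minutes):
    """Per-48 usage proxy: (FGA + 0.44*FTA) / Minutes * 48, capped at 50
    to tame division artifacts from very short appearances."""
    minutes = np.asarray(minutes, dtype=float)
    raw = ((np.asarray(fga) + 0.44 * np.asarray(fta))
           / np.where(minutes > 0, minutes, np.nan) * 48)
    return np.clip(raw, 0, 50)
```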
Where the Usage Vacuum Fits
The scenario engine's core premise is that when a high-usage player sits, their possessions don't disappear. They redistribute to teammates based on position. If Embiid (Big, ~32% usage) is OUT, his teammates in the Big position group absorb 60% of that missing usage. Guards and Wings split the remaining 40%.
This redistribution feeds back into the feature vector. The usg_avg field for each remaining player is boosted by the appropriate share of the missing player's usage. The stats predictor then takes the boosted usage and the projected minutes (also affected, because starters typically play more minutes when a star sits) and generates new PTS, REB, AST predictions.
The 60/40 split is hardcoded in generate_injury_scenario(). I chose this over a data-driven approach for a specific reason: the dataset is too small to reliably learn redistribution weights per team per position. With 2,847 game-player records after feature engineering, there aren't enough "star sits" examples per team to fit a regression on redistribution patterns. The 60/40 approximation comes from league-wide studies of usage redistribution in publicly available basketball analytics research. It's wrong in specific cases (the Warriors' motion offense distributes more evenly than a Sixers team centered on Embiid), but it's directionally correct across the league.
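A sketch of the redistribution logic, assuming players are carried as dicts with `position_group` and `usg_avg` keys (the real generate_injury_scenario() may use a different shape):

```python
SAME_GROUP_SHARE = 0.60   # missing player's position group absorbs 60%
OTHER_GROUP_SHARE = 0.40  # everyone else splits the remaining 40%

def redistribute_usage(missing_usage, missing_group, players):
    """Boost teammates' usage averages when a player sits."""
    same = [p for p in players if p["position_group"] == missing_group]
    other = [p for p in players if p["position_group"] != missing_group]
    for group, share in ((same, SAME_GROUP_SHARE), (other, OTHER_GROUP_SHARE)):
        if group:
            bump = missing_usage * share / len(group)  # split evenly within group
            for p in group:
                p["usg_avg"] += bump
    return players
```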
What Surprised Me
The classifier trains on the full dataset including synthetic zeros. The regressor trains only on games with Minutes > 0. I expected the regressor to be the harder model to train. It wasn't. The classifier was worse because the class distribution is heavily imbalanced: most players in the dataset played in most games. The synthetic zeros for DNP players create some negative examples, but for stars and starters, there are almost no negative examples. The classifier achieves a high AUC-ROC but has poor recall on the rare "will not play" class. In practice, this doesn't matter much because the ghost player filter catches the edge cases, but it means the play_probability output is overconfident for healthy starters. It says 98% when the real probability might be 92%.
The other surprise was the MAE on the stats predictor. I expected PTS to be harder to predict than REB or AST because scoring has higher variance. It didn't play out that way: PTS MAE settled at 5-6 points, and REB and AST errors hovered in the same range. The stats model is a MultiOutputRegressor wrapping three separate HistGradientBoostingRegressor instances, one per target. Minutes is the dominant feature for all three targets: once you get minutes roughly right, the stats predictions follow within a band. The Opponent_Pace feature barely moved the needle, partly because I'm using a placeholder value of 100.0 for every game instead of the actual opponent pace. Fixing that is on the list.
If I were starting this system over, I'd build the classifier and regressor as a single pipeline with scikit-learn's Pipeline class instead of running them as separate scripts with separate model files saved to disk. The current approach (train each model in isolation, serialize to pickle, load both at inference) works but adds unnecessary file I/O and version coupling. A combined pipeline would guarantee that both models are always trained on the same feature set from the same data split.