EST. 2026  ·  A CS 451 DATA SCIENCE STUDY
VOLUME I
NUMBER ONE
PAPER EDITION

The Prospect & Almanac

— projecting four-year NBA value from college statistics, recruiting profiles, and a tree of shallow boosted decisions, from the draft classes of 2000 to 2022 —
EDITOR
Colin Hanley
CS 451

ABSTRACT

We ask how much of an NBA player's early-career impact, measured here by four-year cumulative Value Over Replacement Player (VORP), can be recovered by modeling their college statistics and high school recruiting information. We assembled a dataset of 1,042 college players drafted into the NBA from 2000 to 2022, collected their college statistics, and engineered features from them: position-relative z-scores, volume-weighted shooting metrics, and a defensive versatility metric. Using this dataset, we trained three regression models (Ridge, XGBoost, and an MLP) to predict four-year cumulative VORP. The XGBoost model performed best, with an MAE of 1.67 VORP, an $R^2$ of 0.202, and an AUC of 0.591 on a binarized VORP classification task (positive vs. negative). In addition, we ran unsupervised k-means clustering within position groups, which identified eight player archetypes whose mean cumulative VORP differs by factors of ten. We conclude that the value of purely statistical college scouting, at least with our current methods, is bounded. Future work might incorporate the strength of schedule each player faced and more detailed physical measurement data.

§ I

Build a Prospect

LIVE · RIDGE COEFFICIENTS

The Ridge model is interpretable enough that we can ship its weights to your browser and let you drive the inputs yourself. The sliders are initialized at the positional mean; move them to change the prospect's outlook.

Position
Points / game 13.0
Rebounds / game 5.5
Assists / game 2.2
Steals / game 1.0
Blocks / game 0.8
TS% (shooting eff.) 0.567
Win Shares / 40 0.178
3-pt attempt rate 0.26
3-pt % (raw) 0.330
Recruit rank (1=best, 101=unrkd) 64
College seasons 3
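
As a rough illustration of what the sliders drive, a Ridge prediction is an intercept plus a weighted sum of the inputs. The sketch below assumes the inputs are standardized before weighting. The intercept, standard deviations, and coefficients are placeholders invented for the example, not the fitted values; only the three means are taken from the slider defaults above.

```python
# Hypothetical exported weights: every coefficient and std below is a
# placeholder for illustration, not the study's fitted Ridge model.
INTERCEPT = 0.5
FEATURES = {
    # name: (mean, std, coefficient)
    "ppg": (13.0, 4.0, 0.30),
    "ts_pct": (0.567, 0.04, 0.20),
    "ws_per_40": (0.178, 0.05, 0.45),
}

def predict_vorp(inputs):
    """Linear prediction: intercept + sum(coefficient * standardized input)."""
    total = INTERCEPT
    for name, (mean, std, coef) in FEATURES.items():
        z = (inputs[name] - mean) / std  # assumed standardization step
        total += coef * z
    return total

# A prospect sitting exactly at the mean gets the intercept back.
baseline = {name: mean for name, (mean, _, _) in FEATURES.items()}
# Moving one slider by one (hypothetical) std shifts the output by that coefficient.
bumped = dict(baseline, ws_per_40=0.228)

print(round(predict_vorp(baseline), 3))
print(round(predict_vorp(bumped), 3))
```

Because the model is linear, each slider contributes independently, which is what makes the weights small enough and simple enough to evaluate in the browser.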
§ III

The Models, on Trial

TEMPORAL SPLIT · TRAIN 2000–2018 · TEST 2019–2022
Regression — predicting 4-year VORP
Model MAE RMSE

Classification — predicting positive 4-year VORP
Model Accuracy F1 AUC
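
For readers who want the mechanics behind these tables, here is a minimal sketch of a temporal split and of the MAE, $R^2$, and AUC calculations, written in plain Python. The row layout (`draft_year`, `vorp`), the function names, and the toy numbers are assumptions made for the example; this is not the study's pipeline or its data.

```python
def temporal_split(rows, first_test_year=2019):
    """Train on classes drafted before the cutoff, test on the rest."""
    train = [r for r in rows if r["draft_year"] < first_test_year]
    test = [r for r in rows if r["draft_year"] >= first_test_year]
    return train, test

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy rows, invented for illustration only.
rows = [
    {"draft_year": 2017, "vorp": 1.2}, {"draft_year": 2018, "vorp": -0.3},
    {"draft_year": 2019, "vorp": 2.0}, {"draft_year": 2020, "vorp": -0.5},
    {"draft_year": 2021, "vorp": 0.4}, {"draft_year": 2022, "vorp": -1.0},
]
train, test = temporal_split(rows)
y_true = [r["vorp"] for r in test]
y_pred = [1.0, 0.0, -0.2, -0.5]          # stand-in model outputs
labels = [v > 0 for v in y_true]         # binarized target: positive VORP
print(mae(y_true, y_pred), r2(y_true, y_pred), auc(labels, y_pred))
```

Splitting by draft year rather than at random keeps later classes out of training, so the test score reflects how the model would fare on prospects it could not have seen.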

Three estimators were set against one another: a plain Ridge regression for its interpretability, an XGBoost tuned by random search for its raw accuracy, and a shallow MLP for its ability to fit non-linearities we haven't named. The boosted trees win, narrowly, on every metric that matters. Below are the coefficients and gain-based importances the models settled on.

XGBoost · Top 15 by gain
§ IV

The Archetypes

K-MEANS · PCA PROJECTION
Principal Components I (horiz.) × II (vert.)

What the clusters actually are.

The unsupervised k-means doesn't know anything about NBA outcomes; it only sees the college-era features. At this resolution it carves the draft pool into a handful of recognizable basketball types, each with its own mean four-year value. Archetype names are derived on the fly from whichever stats most distinguish the cluster.
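
One plausible way to derive a name from "whichever stats most distinguish the cluster" is to rank features by how far the cluster's mean sits from the pool mean, in standard-deviation units, and label the cluster with the top few. The sketch below is an assumption about that step, not the almanac's actual naming code; the function name, the `High`/`Low` wording, and all numbers are illustrative.

```python
def name_archetype(cluster_means, pool_means, pool_stds, top_n=2):
    """Label a cluster by the features whose mean deviates most from the pool."""
    gaps = {
        feat: (cluster_means[feat] - pool_means[feat]) / pool_stds[feat]
        for feat in cluster_means
    }
    # Rank by absolute gap so strongly low features count as much as high ones.
    top = sorted(gaps, key=lambda f: abs(gaps[f]), reverse=True)[:top_n]
    return " / ".join(f"{'High' if gaps[f] > 0 else 'Low'} {f}" for f in top)

# Illustrative values only.
pool_means = {"blocks": 0.8, "assists": 2.2, "three_pt_rate": 0.26}
pool_stds = {"blocks": 0.6, "assists": 1.2, "three_pt_rate": 0.10}
rim_protector = {"blocks": 2.0, "assists": 1.0, "three_pt_rate": 0.10}

print(name_archetype(rim_protector, pool_means, pool_stds))
```

Because the label depends only on the cluster's own feature means, it needs no NBA outcome data, consistent with the clustering itself being unsupervised.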

§ V

The Prospects — an index of all 1,042

Player College Yr Pos PPG RPG APG TS% WS/40 Rec Act. VORP Pred. VORP P(VORP+)
Sources

Basketball-Reference.com — NBA draft history, college career totals, advanced metrics, four-year NBA outcomes.

247Sports composite — high school recruiting rankings and grades.

Method

Temporal train/test split at 2019: models are trained on the 2000–2018 draft classes and tested on 2019–2022. Features engineered: position-relative z-scores, interactions, and a volume-weighted three-point percentage that shrinks low-attempt players toward the league mean.
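
Two of these features can be sketched briefly. The shrinkage formula below is one common way to pull low-volume percentages toward a mean, by adding a fixed number of pseudo-attempts at the league rate; it is an assumption, since the exact formula is not given here. The `prior_attempts` value, the league rate of 0.33, and the sample players are hypothetical.

```python
from statistics import mean, pstdev

def shrunk_three_pt_pct(makes, attempts, league_pct, prior_attempts=100):
    """Blend a player's 3P% with the league mean; fewer attempts means more shrinkage."""
    return (makes + prior_attempts * league_pct) / (attempts + prior_attempts)

def position_z_scores(players, stat):
    """Z-score `stat` against players at the same position only."""
    by_pos = {}
    for p in players:
        by_pos.setdefault(p["pos"], []).append(p[stat])
    scores = []
    for p in players:
        vals = by_pos[p["pos"]]
        sd = pstdev(vals)
        scores.append((p[stat] - mean(vals)) / sd if sd > 0 else 0.0)
    return scores

# A 5-for-10 shooter is pulled most of the way back to the league rate,
# while a 200-for-500 shooter keeps most of the raw 40%.
low_volume = shrunk_three_pt_pct(5, 10, 0.33)
high_volume = shrunk_three_pt_pct(200, 500, 0.33)

players = [
    {"pos": "G", "rpg": 3.0}, {"pos": "G", "rpg": 5.0},
    {"pos": "C", "rpg": 8.0}, {"pos": "C", "rpg": 10.0},
]
print(round(low_volume, 3), round(high_volume, 3))
print(position_z_scores(players, "rpg"))
```

Scoring within position means a guard with 5 rebounds and a center with 10 can both register as above average for their role, rather than the guard being penalized against big men.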

Models: XGBoost (best), Ridge (interpretable), MLP (comparison).

Set in

Fraunces (display) and IBM Plex Sans / Mono (text and figures).

All predictions generated by the exported Ridge coefficients, executed entirely in the reader's browser.
