ABSTRACT
We ask how much of an NBA player's early career impact, measured here by four-year cumulative Value Over Replacement Player (VORP), can be recovered by modeling their collegiate statistics and high school recruiting information. We assembled a dataset of 1,042 college players drafted into the NBA from 2000 to 2022, collected their college statistics, and engineered features from them (position-relative z-scores, volume-weighted shooting metrics, and a defensive versatility metric). On this dataset we trained three regression models (Ridge, XGBoost, and an MLP) to predict four-year cumulative VORP. The XGBoost model performed best, with an MAE of 1.67 VORP, an $R^2$ of .202, and an AUC of .591 on a binarized VORP classification task (positive vs. negative). In addition, unsupervised k-means clustering within position groups identified eight player archetypes whose mean cumulative VORPs differ by an order of magnitude. We conclude that the value of purely statistical college scouting with our current methods is bounded. Future work might incorporate strength of schedule and more detailed physical measurement data.
Build a Prospect
The Ridge model is interpretable enough that we can ship its weights to your browser and let you drive the inputs yourself. The sliders are initialized at positional means; move them to change the prospect's outlook.
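Stripped of the UI, a Ridge prediction is just a dot product of exported weights and slider values, which is what makes client-side evaluation feasible. A minimal sketch; the coefficients, intercept, and feature names below are made up for illustration, not the fitted model:

```python
# Illustrative exported Ridge weights (hypothetical values).
coefficients = {
    "ppg_z": 0.42,         # position-relative scoring z-score
    "ts_pct_z": 0.31,      # position-relative true-shooting z-score
    "ws40_z": 0.55,        # win shares per 40 minutes, z-scored
    "recruit_grade": 0.18, # high school composite recruiting grade
}
intercept = -0.75

def predict_vorp(features: dict) -> float:
    """Linear prediction: intercept + sum of weight * feature value."""
    return intercept + sum(coefficients[k] * v for k, v in features.items())

# A prospect one-to-two standard deviations above positional means.
prospect = {"ppg_z": 1.2, "ts_pct_z": 0.8, "ws40_z": 1.5, "recruit_grade": 0.9}
print(round(predict_vorp(prospect), 3))
```

The same arithmetic ported to JavaScript is all the in-browser widget needs, since no model runtime is required for a linear model.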
The 2026 Big Board
A top-to-bottom ranking of the incoming draft class, with each prospect's four-year VORP projected by the XGBoost model trained on 2000–2018 and validated on 2019–2022. The top ten are shown in full below. Click any entry for the dossier — stat profile, SHAP drivers, and the five most similar historical draftees.
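One plausible recipe for the "five most similar historical draftees" lookup is nearest neighbors in standardized feature space; the distance metric, feature count, and data below are illustrative assumptions, not the article's exact method:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_hist = rng.normal(size=(100, 6))  # stand-in for the historical feature matrix
x_new = rng.normal(size=(1, 6))     # stand-in for an incoming prospect

# Standardize so no single stat dominates the distance.
scaler = StandardScaler().fit(X_hist)
nn = NearestNeighbors(n_neighbors=5, metric="euclidean")
nn.fit(scaler.transform(X_hist))

dist, idx = nn.kneighbors(scaler.transform(x_new))
print(idx[0])  # row indices of the five closest historical draftees
```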
The Models, on Trial
| Model | R² | MAE | RMSE |
|---|---|---|---|
| Model | Accuracy | F1 | AUC |
|---|---|---|---|
Three estimators were set against one another: a plain Ridge regression for its interpretability, a random-search-tuned XGBoost for its raw accuracy, and a shallow MLP for its ability to fit non-linearities we haven't named. The boosted trees win, narrowly, on every metric that matters. Below, the coefficients and gain-based importances each model settled on.
XGBoost · Top 15 by gain
The Archetypes
What the clusters actually are.
The unsupervised k-means doesn't know anything about NBA outcomes; it only sees the college-era features. At this resolution it carves the draft pool into a handful of recognizable basketball types, each with its own mean four-year value. Archetype names are derived on the fly from whichever stats most distinguish the cluster.
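A minimal sketch of within-group clustering and on-the-fly archetype naming: cluster one position group on standardized features, then label each cluster by the stats whose centroid values sit farthest from the group mean (zero, after standardization). Feature names, cluster count, and data are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

feature_names = ["ppg", "rpg", "apg", "3pa_rate", "blk_pct"]  # illustrative
rng = np.random.default_rng(1)
X_guards = rng.normal(size=(120, 5))  # stand-in for guards' college features

Xz = StandardScaler().fit_transform(X_guards)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(Xz)

# Name each archetype by its two most distinguishing stats.
for c, center in enumerate(km.cluster_centers_):
    top = np.argsort(-np.abs(center))[:2]
    label = " / ".join(
        f"{'high' if center[i] > 0 else 'low'} {feature_names[i]}" for i in top
    )
    print(f"cluster {c}: {label}")
```

Because the clustering never sees VORP, any spread in mean NBA value across clusters is out-of-sample with respect to what k-means optimized.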
The Prospects — an index of all 1,042
| Player | College | Yr | Pos | PPG | RPG | APG | TS% | WS/40 | Rec | Act. VORP | Pred. VORP ▼ | P(VORP+) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sources
Basketball-Reference.com — NBA draft history, college career totals, advanced metrics, four-year NBA outcomes.
247Sports composite — high school recruiting rankings and grades.
Method
Temporal train/test split at 2019. Features engineered: position-relative z-scores, interactions, and a volume-weighted three-point percentage that shrinks low-attempt players toward the league mean.
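The exact shrinkage formula isn't spelled out above; one common choice is empirical-Bayes-style shrinkage with a pseudo-attempt prior, sketched here with illustrative parameter values (`league_mean` and `prior_attempts` are assumptions, not the fitted quantities):

```python
def shrunk_3p_pct(makes: int, attempts: int,
                  league_mean: float = 0.345, prior_attempts: int = 100) -> float:
    """Shrink a raw 3P% toward the league mean in proportion to how few
    attempts it is based on; high-volume shooters keep their raw rate."""
    return (makes + league_mean * prior_attempts) / (attempts + prior_attempts)

print(round(shrunk_3p_pct(5, 10), 3))    # 10 attempts: heavily shrunk
print(round(shrunk_3p_pct(90, 200), 3))  # 200 attempts: near the raw 45%
```

The effect is that a 50% shooter on 10 attempts projects close to the league mean, while a 45% shooter on 200 attempts barely moves, which is exactly the low-attempt regularization the text describes.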
Models: XGBoost (best), Ridge (interpretable), MLP (comparison).
Set in
Fraunces (display) and IBM Plex Sans / Mono (text and figures).
Interactive "Build a Prospect" predictions are generated by the exported Ridge coefficients, executed entirely in the reader's browser.