The Prospect Almanac — A CS 451 Study

ABSTRACT

We ask how much of an NBA player's early-career impact, measured here by four-year cumulative Value Over Replacement Player (VORP), can be recovered by modeling collegiate statistics and high school recruiting information. We assembled a dataset of 1,042 college players drafted into the NBA from 2000 to 2022 and engineered features from their college statistics: position-relative z-scores, volume-weighted shooting metrics, and a defensive versatility metric. Using this dataset, we trained three regression models (Ridge, XGBoost, and an MLP) to predict four-year cumulative VORP. The XGBoost model performed best, with an MAE of 1.67 VORP, an $R^2$ of 0.202, and an AUC of 0.591 on a binarized VORP classification task (positive vs. negative). In addition, we ran unsupervised k-means clustering within position groups, which identified eight player archetypes whose mean cumulative VORP differs by factors of ten. From this modeling, we conclude that the value of purely statistical college scouting, at least with our current methods, is bounded. Future work might incorporate strength of schedule and more detailed physical measurement data.

§ I

Build a Prospect

LIVE · RIDGE COEFFICIENTS

The Ridge model is interpretable enough that we can ship its weights to your browser and let you drive the inputs yourself. The sliders are initialized at the positional mean; move them to change the prospect's outlook.

Position

Guard Forward Center

Points / game 13.0

Rebounds / game 5.5

Assists / game 2.2

Steals / game 1.0

Blocks / game 0.8

TS% (shooting eff.) 0.567

Win Shares / 40 0.178

3-pt attempt rate 0.26

3-pt % (raw) 0.330

Recruit rank (1=best, 101=unrkd) 64

College seasons 3
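
As a rough illustration of what the sliders drive: a Ridge model's prediction is an intercept plus a weighted sum of its inputs, so moving one slider shifts the output by that input's weight times the change. The sketch below assumes the model consumes raw slider values (the real model may use engineered features such as position-relative z-scores), and every weight, the intercept, and the names `EXAMPLE_WEIGHTS` and `ridge_predict` are hypothetical, not the exported coefficients.

```python
from typing import Dict

# Hypothetical weights for illustration only; the real exported
# coefficients are not given in the text.
EXAMPLE_WEIGHTS: Dict[str, float] = {
    "ppg": 0.05,
    "rpg": 0.04,
    "apg": 0.06,
    "ts_pct": 2.0,
    "recruit_rank": -0.004,
}
EXAMPLE_INTERCEPT = -1.0


def ridge_predict(features: Dict[str, float],
                  weights: Dict[str, float],
                  intercept: float) -> float:
    """Return intercept + sum(weight * feature) over the model's features."""
    missing = set(weights) - set(features)
    if missing:
        raise KeyError(f"missing features: {sorted(missing)}")
    return intercept + sum(w * features[name] for name, w in weights.items())


# A subset of the slider defaults shown above.
prospect = {"ppg": 13.0, "rpg": 5.5, "apg": 2.2,
            "ts_pct": 0.567, "recruit_rank": 64}
baseline = ridge_predict(prospect, EXAMPLE_WEIGHTS, EXAMPLE_INTERCEPT)

# Moving one slider changes the prediction by weight * delta.
prospect_up = dict(prospect, apg=3.2)
delta = ridge_predict(prospect_up, EXAMPLE_WEIGHTS, EXAMPLE_INTERCEPT) - baseline
```

Because the model is linear, `delta` depends only on the assists weight and the size of the slider move, which is what makes the Ridge weights easy to read directly.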

§ III

The Models, on Trial

TEMPORAL SPLIT · TRAIN 2000–2018 · TEST 2019–2022

Regression — predicting 4-year VORP
Model	R²	MAE	RMSE

Classification — predicting *positive* 4-year VORP
Model	Accuracy	F1	AUC

Three estimators were set against one another: a plain Ridge regression for its interpretability, an XGBoost model tuned by random search for its raw accuracy, and a shallow MLP for its ability to fit non-linearities we haven't named. The boosted trees win, narrowly, on every metric that matters. Below are the coefficients and gain-based importances each model settled on.
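
For readers who want to reproduce the scoring, here is a minimal sketch of the temporal split and the reported metrics in plain Python. The row layout (a `draft_year` key), the function names, and the toy numbers are assumptions for illustration; they are not the study's data or code. AUC is computed here as the probability that a randomly chosen positive-VORP player receives a higher score than a randomly chosen non-positive one, with ties counted as half.

```python
import math
from typing import Dict, List, Sequence, Tuple


def temporal_split(rows: Sequence[dict], first_test_year: int = 2019
                   ) -> Tuple[List[dict], List[dict]]:
    """Train on draft classes before first_test_year, test on the rest."""
    train = [r for r in rows if r["draft_year"] < first_test_year]
    test = [r for r in rows if r["draft_year"] >= first_test_year]
    return train, test


def regression_metrics(y_true: Sequence[float],
                       y_pred: Sequence[float]) -> Dict[str, float]:
    """R^2, MAE and RMSE; assumes y_true is not constant."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
    }


def auc(is_positive: Sequence[bool], scores: Sequence[float]) -> float:
    """Pairwise AUC; needs at least one positive and one negative."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data, not results from the study.
toy_rows = [{"draft_year": y} for y in (2000, 2010, 2018, 2019, 2022)]
train_rows, test_rows = temporal_split(toy_rows)

y_true = [1.0, -1.0, 3.0, -0.5]
y_pred = [0.5, -0.5, 2.0, 0.5]
metrics = regression_metrics(y_true, y_pred)
toy_auc = auc([v > 0 for v in y_true], y_pred)
```

Splitting by draft year rather than at random keeps later draft classes out of training, so the test metrics reflect how the models would have done on players they had not yet seen.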

XGBoost · Top 15 by gain

§ IV

The Archetypes

K-MEANS · PCA PROJECTION · HOVER TO NAME

Principal Components I (horiz.) × II (vert.)

What the clusters actually are.

The unsupervised k-means doesn't know anything about NBA outcomes; it only sees the college-era features. At this resolution it carves the draft pool into a handful of recognizable basketball types, each with its own mean four-year value. Archetype names are derived on the fly from whichever stats most distinguish the cluster.
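
The text does not spell out the naming rule, so the following is only one plausible reading: standardize each feature within a position group, average the standardized values over a cluster's members, and name the cluster after the features whose averages sit farthest from zero. The cluster labels are assumed to come from an earlier k-means run, and the function names and toy rows are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def zscore_rows(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Standardize every feature across the rows of one position group."""
    feats = list(rows[0])
    mu = {f: mean([r[f] for r in rows]) for f in feats}
    # Fall back to 1.0 when a feature is constant to avoid dividing by zero.
    sd = {f: pstdev([r[f] for r in rows]) or 1.0 for f in feats}
    return [{f: (r[f] - mu[f]) / sd[f] for f in feats} for r in rows]


def archetype_name(z_rows: List[Dict[str, float]], labels: List[int],
                   cluster: int, top_n: int = 2) -> str:
    """Name a cluster after the features farthest from the group mean."""
    members = [z for z, lab in zip(z_rows, labels) if lab == cluster]
    centroid = {f: mean([m[f] for m in members]) for f in members[0]}
    ranked = sorted(centroid, key=lambda f: abs(centroid[f]), reverse=True)
    return " / ".join(("high " if centroid[f] > 0 else "low ") + f
                      for f in ranked[:top_n])


# Made-up players; labels stand in for the output of a k-means run.
toy_rows = [
    {"apg": 6.0, "bpg": 0.2, "rpg": 4.0},
    {"apg": 5.0, "bpg": 0.3, "rpg": 5.0},
    {"apg": 1.0, "bpg": 2.0, "rpg": 4.5},
    {"apg": 1.5, "bpg": 2.5, "rpg": 4.5},
]
toy_labels = [0, 0, 1, 1]
z_rows = zscore_rows(toy_rows)
name0 = archetype_name(z_rows, toy_labels, cluster=0)
name1 = archetype_name(z_rows, toy_labels, cluster=1)
```

In this toy case the first cluster is named for its high assists and low blocks and the second for the reverse, while rebounds drop out because both clusters sit at the group mean.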

§ V

The Prospects — an index of all 1,042

CLICK ROW FOR DOSSIER

Player	College	Yr	Pos	PPG	RPG	APG	TS%	WS/40	Rec	Act. VORP	Pred. VORP ▼	P(VORP+)

Sources

Basketball-Reference.com — NBA draft history, college career totals, advanced metrics, four-year NBA outcomes.

247Sports composite — high school recruiting rankings and grades.

Method

Temporal train/test split at 2019. Features engineered: position-relative z-scores, interactions, and a volume-weighted three-point percentage that shrinks low-attempt players toward the league mean.
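
The exact shrinkage formula is not stated here. A common way to get this behavior, shown as a sketch, is to blend a player's makes and attempts with a fixed number of pseudo-attempts taken at the league mean. The `prior_attempts` value of 100, the league mean of 0.33, and the function name are assumptions, not values from the study.

```python
def shrunk_three_pt_pct(makes: float, attempts: float,
                        league_mean: float,
                        prior_attempts: float = 100.0) -> float:
    """Three-point percentage pulled toward league_mean at low volume.

    prior_attempts is a hypothetical strength parameter: the number of
    league-average attempts added to every player's record.
    """
    if makes < 0 or attempts < 0 or makes > attempts:
        raise ValueError("need 0 <= makes <= attempts")
    return (makes + prior_attempts * league_mean) / (attempts + prior_attempts)


LEAGUE_MEAN = 0.33  # assumed value for illustration

# Both players shot 50% raw, but on very different volume.
low_volume = shrunk_three_pt_pct(5, 10, LEAGUE_MEAN)
high_volume = shrunk_three_pt_pct(150, 300, LEAGUE_MEAN)
no_attempts = shrunk_three_pt_pct(0, 0, LEAGUE_MEAN)
```

With these assumed settings the 5-for-10 shooter lands near the league mean, the 150-for-300 shooter keeps most of the raw percentage, and a player with no attempts is assigned the league mean exactly.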

Models: XGBoost (best), Ridge (interpretable), MLP (comparison).

Set in

Fraunces (display) and IBM Plex Sans / Mono (text and figures).

All predictions are generated by the exported Ridge coefficients and computed entirely in the reader's browser.
