Bayes nets → utility → VPI → counterfactuals · 3 interactive widgets · CMSI 3300 + 4320

The Price of Knowing You

An ad agent wants to show you the right ad. To do it well, it has to climb a ladder: from seeing correlations, to deciding under uncertainty, to asking what a fact about you is worth, all the way to imagining the ad it didn't show. This is a topologically sorted tour of that ladder: Bayesian networks → utility → decision networks → value of perfect information → interventions → counterfactuals. Each rung has a worked example you can poke at.

The spine

One agent appears twice: the Defendotron, a real homework assignment I wrote and have since modernized so it runs again. At rung 2 it asks "which ad, and is your demographic worth buying?" (value of information). At rung 3 the same agent asks "given you clicked, would you have clicked under a different ad?" (a counterfactual). Watching it climb is the whole story.

The framing follows Pearl's Ladder of Causation as taught by Andrew Forney (LMU CMSI 4320), with the probabilistic machinery from Russell & Norvig's AIMA. The three tiers:

| Tier | Question | Example | Denizen |
|---|---|---|---|
| 3 · imagine | What if $X$ had been $x'$? | "Would I be home had I left the 405?" | Structural Causal Models |
| 2 · do | What if I do $X=x$? | "Will exercise lower my cholesterol?" | Causal Bayes nets · RL |
| 1 · see | What if I see $X=x$? | "Is symptom $X$ tied to disease $Y$?" | Bayesian networks |

Each tier subsumes the ones below it. A tool stuck on tier 1 cannot answer a tier-2 question, which is why even a fluent associational model (an LLM) stumbles on "what would have happened if…".

Act I · Seeing
Bayesian networks — what do I believe?

Start at the bottom. A Bayesian network answers "what do I believe?" by storing a big joint distribution cheaply: a DAG of cause-ish arrows, plus one small conditional probability table (CPT) per node. The joint factorizes along the structure:

$$P(X_1,\dots,X_n)=\prod_i P\!\left(X_i \mid \text{parents}(X_i)\right)$$

Our running example (Forney's, and the one we'll ride all the way up): Stressed, Vapes, Exercises, develops Heart-disease, with $S\to V$, $S\to E$, $V\to H$, $E\to H$. Instead of $2^4-1=15$ free numbers, we store four little CPTs:

$$P(S)\,P(V\!\mid\!S)\,P(E\!\mid\!S)\,P(H\!\mid\!V,E)$$
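To make the factorization concrete, here is a minimal Python sketch. The CPT numbers are hypothetical placeholders chosen for illustration (they are not Forney's), and the names `bern` and `joint` are my own. It builds any joint entry from the four local tables and checks that nine stored numbers are enough to define all sixteen joint probabilities.

```python
from itertools import product

# Hypothetical CPTs for illustration only; each entry is P(node = 1 | parents).
P_S1 = 0.4
P_V1_GIVEN_S = {0: 0.2, 1: 0.6}
P_E1_GIVEN_S = {0: 0.7, 1: 0.3}
P_H1_GIVEN_VE = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.7, (1, 1): 0.4}


def bern(p_one, value):
    """Probability that a binary variable takes `value`, given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one


def joint(s, v, e, h):
    """P(S=s, V=v, E=e, H=h) as the product of the four local CPT entries."""
    return (bern(P_S1, s)
            * bern(P_V1_GIVEN_S[s], v)
            * bern(P_E1_GIVEN_S[s], e)
            * bern(P_H1_GIVEN_VE[(v, e)], h))


# The 16 joint entries sum to 1, yet only 1 + 2 + 2 + 4 = 9 numbers were stored.
total = sum(joint(*world) for world in product((0, 1), repeat=4))
n_stored = 1 + len(P_V1_GIVEN_S) + len(P_E1_GIVEN_S) + len(P_H1_GIVEN_VE)
print(round(total, 12), n_stored)
```

The saving is small with four binary variables (9 versus 15) but grows quickly: the full table doubles with every added variable, while each CPT only grows with the number of parents of its own node.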

d-separation — independence you can read off the picture

The graph also tells you what's independent of what, via three local rules. Conditioning on the middle node blocks a fork ($X\!\leftarrow\!Z\!\to\!Y$) or a chain ($X\!\to\!Z\!\to\!Y$), but conditioning on a collider ($X\!\to\!Z\!\leftarrow\!Y$) opens it: observing a common effect makes its causes dependent ("explaining away"). Open paths that carry no causal influence, whether an unblocked fork or a conditioned-on collider, are the seed of the spurious correlations we'll fight in Act III.
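The collider rule is the least intuitive of the three, so a tiny numeric check may help. The sketch below uses a toy model of my own (two independent fair coins $X$, $Y$ and a common effect $Z = X \lor Y$), not anything from the course materials, and counts equally likely worlds.

```python
from itertools import product

# Toy collider X -> Z <- Y: X and Y are independent fair coins, Z = X OR Y.
# All four (x, y) combinations are equally likely, so counting worlds is enough.
worlds = [(x, y, int(x or y)) for x, y in product((0, 1), repeat=2)]


def prob(event, given=lambda w: True):
    """P(event | given) by counting equally likely worlds."""
    kept = [w for w in worlds if given(w)]
    return sum(1 for w in kept if event(w)) / len(kept)


def x_is_1(w):
    return w[0] == 1


p_x = prob(x_is_1)                                                  # marginal
p_x_given_y = prob(x_is_1, lambda w: w[1] == 1)                     # collider not observed
p_x_given_z = prob(x_is_1, lambda w: w[2] == 1)                     # collider observed
p_x_given_z_and_y = prob(x_is_1, lambda w: w[2] == 1 and w[1] == 1)

print(p_x, p_x_given_y, p_x_given_z, p_x_given_z_and_y)
```

Without $Z$, learning $Y$ leaves the belief in $X$ at 1/2. Once $Z=1$ is observed the belief in $X$ rises to 2/3, and then learning $Y=1$ pulls it back to 1/2: $Y$ has "explained away" the effect.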

Inference — asking the network questions

Given evidence $e$, we want $P(Q\mid e)$. Forney's four-step recipe: label the query $Q$, the evidence $e$, and the hidden variables $Y$; find $P(Q,e)=\sum_y P(Q,e,y)$ from the factorization; find $P(e)=\sum_q P(q,e)$; divide. That's enumeration (AIMA's enumeration_ask); variable elimination is the same idea with the sums pushed inward. When the network is too big for exact inference, you sample (likelihood weighting, Gibbs).
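The recipe above fits in a few lines of Python. This is a sketch under assumptions: the CPT values are hypothetical, and `enumeration_posterior` is an illustrative name of mine, not AIMA's `enumeration_ask` or the widget's code. It sums the factorized joint over all 16 worlds that agree with the evidence.

```python
from itertools import product

VARS = ("S", "V", "E", "H")

# Hypothetical CPTs for illustration only; each entry is P(node = 1 | parents).
P_S1 = 0.4
P_V1_GIVEN_S = {0: 0.2, 1: 0.6}
P_E1_GIVEN_S = {0: 0.7, 1: 0.3}
P_H1_GIVEN_VE = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.7, (1, 1): 0.4}


def bern(p_one, value):
    return p_one if value == 1 else 1.0 - p_one


def joint(w):
    """Joint probability of one complete world w = {"S": s, "V": v, "E": e, "H": h}."""
    return (bern(P_S1, w["S"])
            * bern(P_V1_GIVEN_S[w["S"]], w["V"])
            * bern(P_E1_GIVEN_S[w["S"]], w["E"])
            * bern(P_H1_GIVEN_VE[(w["V"], w["E"])], w["H"]))


def enumeration_posterior(query, evidence):
    """P(query = 1 | evidence) by exact enumeration over all 16 worlds."""
    p_query_and_e = p_e = 0.0
    for values in product((0, 1), repeat=len(VARS)):
        w = dict(zip(VARS, values))
        if any(w[name] != val for name, val in evidence.items()):
            continue                      # world contradicts the evidence
        p = joint(w)
        p_e += p                          # accumulates P(e)
        if w[query] == 1:
            p_query_and_e += p            # accumulates P(Q = 1, e)
    return p_query_and_e / p_e            # the final "divide" step


print(enumeration_posterior("H", {}))                  # prior risk, about 0.32
print(enumeration_posterior("H", {"V": 1}))            # about 0.57
print(enumeration_posterior("H", {"V": 1, "S": 1}))    # about 0.61
```

With these made-up numbers, fixing $V=1$ raises the heart-disease belief and adding $S=1$ shifts it again, which is the qualitative behavior the widget is meant to show.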

▶ Widget 1 — Bayesian-network inference

The vaping → heart-disease network. Click to fix evidence (—, 0, or 1) on any variable; the posteriors $P(\cdot=1\mid e)$ for the rest update by exact enumeration over all 16 worlds. Try fixing $V=1$: heart-disease risk jumps. Then add $S=1$ and watch it shift again.

Act II · Deciding
Utility & decision networks — what should I do?

Beliefs don't act. To decide, add what you want: a utility $U(s)$ over outcomes. A rational agent picks the action with the highest expected utility:

$$\mathrm{EU}(a\mid e)=\sum_s P(s\mid e,a)\,U(s),\qquad \mathrm{MEU}(e)=\max_a \mathrm{EU}(a\mid e).$$

A decision network is a Bayes net plus decision nodes (your actions) and a utility node. You run inference inside an argmax over actions. That's AIMA's DecisionNetwork — and it's exactly what the Defendotron does.
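A minimal sketch of "inference inside an argmax", assuming the inference step has already produced $P(s\mid e,a)$ for each action. The function names, the two ads, and every number below are hypothetical; this is not the Defendotron's eu() or meu() code.

```python
def expected_utility(action, outcome_probs, utility):
    """EU(a | e) = sum over outcomes s of P(s | e, a) * U(s)."""
    return sum(p * utility[s] for s, p in outcome_probs[action].items())


def max_expected_utility(outcome_probs, utility):
    """Return (best action, MEU) by taking the argmax of EU over all actions."""
    best = max(outcome_probs, key=lambda a: expected_utility(a, outcome_probs, utility))
    return best, expected_utility(best, outcome_probs, utility)


# Hypothetical posteriors P(outcome | e, action) and a hypothetical utility table.
outcome_probs = {
    "ad1": {"click": 0.30, "no_click": 0.70},
    "ad2": {"click": 0.25, "no_click": 0.75},
}
utility = {"click": 10.0, "no_click": -1.0}

best_ad, meu = max_expected_utility(outcome_probs, utility)
print(best_ad, round(meu, 4))
```

In a real decision network the `outcome_probs` table would be recomputed by Bayes-net inference for each candidate action and each piece of evidence; here it is hard-coded to keep the argmax visible.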

The Defendotron — a decision network on real data

For CMSI 3300 I built an ad-selection agent: learn a Bayes net from a demographic CSV, then choose which ad to serve to maximize expected payoff. It has hand-rolled eu(), meu(), and vpi(). The catch Forney's students hit years later: it was written against pomegranate 0.x and a 2022 pgmpy, and stopped running. So I modernized it (BicScore→BIC, BayesianNetwork→DiscreteBayesianNetwork, fixed a utility-variable indexing bug) and re-ran it on the original adbot-data.csv:

# pgmpy 1.1.2 — verified MEU + VPI on the original adbot-data.csv
MEU(no evidence)        → serve {Ad1:1, Ad2:0}   E[utility] ≈ 746.8
MEU(T=0, G=0)           → serve {Ad1:0, Ad2:0}   E[utility] ≈ 797.0
VPI("G" | -)            ≈ 20.77      # gender is worth ~21 utility units to learn
VPI("F" | -)            ≈ 0          # feature F can't change the decision → worthless
VPI("G" | H=0,T=1,P=0)  ≈ 66.8       # in context, the same fact is worth ~3× as much

The reproduction

The modernized run reproduced the original test-suite's headline numbers — MEU≈746.8 (author's value 746.72) and VPI(G)=20.77 exactly. Where the structure-learner lands on a slightly different DAG across pgmpy versions, the downstream VPI shifts too (and can even dip numerically negative — which is why you clamp it). That sensitivity is itself the lesson: VPI is only as stable as the graph beneath it.

Value of Perfect Information — the crown jewel

Now the question this whole project is named for. Before you act, is it worth learning one more fact? The value of perfect information of observing $E'$ is the expected best-you-could-do after seeing it, minus the best-you-could-do now — denominated in utility:

$$\mathrm{VPI}_e(E')=\Big[\textstyle\sum_v P(E'\!=\!v\mid e)\,\mathrm{MEU}\!\big(e\cup\{E'\!=\!v\}\big)\Big]-\mathrm{MEU}(e).$$

Three properties worth holding onto: VPI is never negative (information can't hurt in expectation); it's non-additive (two facts can be redundant or synergistic); and it is zero exactly when the fact can't change your decision — VPI measures decision-relevance, not Shannon information. That's why VPI(F)≈0 above: F is informative about the world but irrelevant to which ad wins.

My hand-rolled vpi() turns out to implement the identical formula to AIMA's InformationGatheringAgent.vpi() (Fig 16.9) — I re-derived the textbook on real data without knowing it.

▶ Widget 2 — VPI calculator

A clean ad-targeting net: a hidden viewer trait $X$ (young/old), two ads, and a click. Drag the click-probabilities and the base rate. The widget computes the best ad blind, the best ad if you could see $X$ first, and the gap — the dollar value of knowing $X$. Push the sliders until one ad dominates regardless of $X$ and watch VPI collapse to $0$.

Default slider settings:

  • $P(X=\text{young})=0.50$
  • $P(\text{click}\mid A,\text{young})=0.80$
  • $P(\text{click}\mid A,\text{old})=0.20$
  • $P(\text{click}\mid B,\text{young})=0.30$
  • $P(\text{click}\mid B,\text{old})=0.70$
  • click value (in dollars)
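The widget's arithmetic can be sketched in Python using the slider defaults shown above. The click value of 1.0 is my assumption (the page's own value is not shown here), and `meu` and `vpi_of_x` are illustrative names rather than the widget's actual source.

```python
def meu(p_young, p_click, click_value):
    """Best expected payoff over the two ads when the belief P(X = young) is p_young."""
    return max(
        click_value * (p_young * p_click[(ad, "young")]
                       + (1 - p_young) * p_click[(ad, "old")])
        for ad in ("A", "B")
    )


def vpi_of_x(p_young, p_click, click_value=1.0):
    """VPI(X) = E over X of [MEU once X is known] minus MEU without knowing X."""
    blind = meu(p_young, p_click, click_value)
    informed = (p_young * meu(1.0, p_click, click_value)
                + (1 - p_young) * meu(0.0, p_click, click_value))
    return max(0.0, informed - blind)   # clamp tiny negative rounding error


# Slider defaults from the widget.
defaults = {("A", "young"): 0.80, ("A", "old"): 0.20,
            ("B", "young"): 0.30, ("B", "old"): 0.70}
# Hypothetical setting where ad A wins for both kinds of viewer.
dominated = {("A", "young"): 0.80, ("A", "old"): 0.20,
             ("B", "young"): 0.30, ("B", "old"): 0.10}

print(round(vpi_of_x(0.50, defaults), 4))
print(round(vpi_of_x(0.50, dominated), 4))
```

At the defaults both ads earn 0.50 per impression when served blind, while seeing $X$ first earns $0.5\cdot0.80+0.5\cdot0.70=0.75$, so knowing $X$ is worth 0.25 of a click. In the dominated setting ad A is best either way, the decision never changes, and VPI is 0.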

Act III · Doing
Why correlations aren't enough

The Defendotron is powerful, but the Bayes net underneath it lives on tier 1: it knows how variables covary, not what happens if you intervene. Ask it "does vaping cause heart disease?" and the naive answer $P(H\!=\!1\mid V\!=\!1)-P(H\!=\!1\mid V\!=\!0)$ is wrong, for three reasons:

  • Observational equivalence: $S\!\to\!V$ and $V\!\to\!S$ encode the same independences but opposite causes.
  • Spurious correlation: observing $V$ (rather than setting it) lets information leak along the back-path $V\!\leftarrow\!S\!\to\!E\!\to\!H$, which is not part of the effect we want.
  • Latent confounders: variables we never recorded that drive two we did.

Forney's tagline: causal queries demand causal assumptions, encoded in the model's structure. Promote the Bayes net to a Causal Bayesian Network, where an arrow $\text{Parents}(Y)\to Y$ means $Y=f_Y(\text{Parents}(Y))$ — forcing a cause changes the effect, but not vice-versa.

The do-operator — surgery on the graph

Observing differs from intervening. We write $do(X=x)$ for forcing $X$ to $x$ apart from its normal causes. Structurally, $do(X=x)$ builds the mutilated graph $G_x$: the original graph with every inbound edge to $X$ cut. $X$ is now set by fiat, so its parents no longer explain it:

| world | graph | factorization |
|---|---|---|
| observe (nature) | $S\!\to\!V,\,S\!\to\!E,\,V\!\to\!H,\,E\!\to\!H$ | $P(S)\,P(V\!\mid\!S)\,P(E\!\mid\!S)\,P(H\!\mid\!V,E)$ |
| $do(V=v)$ | cut $S\!\to\!V$ | $P(S)\,\underline{\ 1\ }\,P(E\!\mid\!S)\,P(H\!\mid\!V\!=\!v,E)$ |

The $P(V\mid S)$ factor drops out, because that is the only mechanism the intervention touches (this is the truncated factorization). Every other CPT is reused unchanged, which is why a Markovian causal effect is computable from purely observational CPTs.
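To see that observing and intervening really give different numbers, here is a small sketch on the vaping net. The CPT values are hypothetical and the function names are mine. `p_h1_given_see` conditions in the ordinary way, while `p_h1_given_do` applies the truncated factorization by leaving out the $P(V\mid S)$ factor.

```python
# Hypothetical CPTs for illustration only; each entry is P(node = 1 | parents).
P_S1 = 0.4
P_V1_GIVEN_S = {0: 0.2, 1: 0.6}
P_E1_GIVEN_S = {0: 0.7, 1: 0.3}
P_H1_GIVEN_VE = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.7, (1, 1): 0.4}


def bern(p_one, value):
    return p_one if value == 1 else 1.0 - p_one


def p_h1_given_see(v):
    """P(H=1 | V=v): ordinary conditioning, so seeing V also updates the belief in S."""
    num = den = 0.0
    for s in (0, 1):
        for e in (0, 1):
            weight = bern(P_S1, s) * bern(P_V1_GIVEN_S[s], v) * bern(P_E1_GIVEN_S[s], e)
            den += weight
            num += weight * P_H1_GIVEN_VE[(v, e)]
    return num / den


def p_h1_given_do(v):
    """P(H=1 | do(V=v)): same sum with the P(V | S) factor replaced by 1."""
    return sum(bern(P_S1, s) * bern(P_E1_GIVEN_S[s], e) * P_H1_GIVEN_VE[(v, e)]
               for s in (0, 1) for e in (0, 1))


naive_gap = p_h1_given_see(1) - p_h1_given_see(0)
causal_effect = p_h1_given_do(1) - p_h1_given_do(0)
print(round(naive_gap, 4), round(causal_effect, 4))
```

With these made-up numbers the naive gap is about 0.39 and the interventional effect about 0.35. The difference is the part of the correlation that flows through stress rather than through vaping itself.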

The payoff is striking: you can compute the answer to a randomized experiment you never ran — provided your causal assumptions hold.

Backdoor & adjustment

When a confounder muddies $X\to Y$, block the non-causal back-paths by conditioning on an admissible set $Z$ (the backdoor criterion):

$$P(Y\mid do(X\!=\!x))=\sum_z P(Y\mid X\!=\!x,Z\!=\!z)\,P(Z\!=\!z).$$

That adjustment formula is the heart of CMSI 4320's Assignment 4 ("Back Doors, Do, and other Funny Sounding Things"): translate an English query to a $do$-expression, then hand-compute it — e.g. $P(H\!=\!1\mid do(E\!=\!1),S\!=\!1)$ on the vaping net. A randomized controlled trial is a physical $do()$; the algebra lets us emulate one from observational data.
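As a sketch of how that example query unpacks on the vaping net (my own derivation from the graph in this article, not the assignment's official solution): $do(E\!=\!1)$ cuts $S\!\to\!E$, so the truncated factorization is $P(s,v,h\mid do(E\!=\!1))=P(s)\,P(v\mid s)\,P(h\mid v,E\!=\!1)$. Conditioning on $S\!=\!1$ and summing out $V$ gives

$$P(H\!=\!1\mid do(E\!=\!1),S\!=\!1)=\sum_v P(V\!=\!v\mid S\!=\!1)\,P(H\!=\!1\mid V\!=\!v,E\!=\!1).$$

The backdoor reading agrees. The only back-path from $E$ to $H$ is $E\!\leftarrow\!S\!\to\!V\!\to\!H$, and $S$ blocks it, so within the stratum $S\!=\!1$ the intervention and the observation coincide:

$$P(H\!=\!1\mid do(E\!=\!1),S\!=\!1)=P(H\!=\!1\mid E\!=\!1,S\!=\!1),$$

which expands to the same sum because $V$ is independent of $E$ given $S$.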

Act IV · Imagining
Structural causal models & the top rung

The most expressive model writes down the mechanisms themselves. A structural causal model $M=\langle U,V,F\rangle$ has endogenous variables $V$, exogenous "luck" variables $U$ (one per $V$, $E[U_i]=0$), and structural equations $V_i\gets f_{V_i}(\text{pa}_i,U_{V_i})$. Fix the $U$ and everything is determined — that determinism is what makes counterfactuals computable. When two variables secretly share a $U$ (in the linear case, $\mathrm{Cov}(U_X,U_Y)\neq 0$), you have an unobserved confounder.

Counterfactuals — abduction, action, prediction

A counterfactual conditions on what actually happened and asks about a world where one thing was different — two worlds sharing the same luck. You can't just intervene (that throws away the evidence about the luck); you must first recover it. The three-step engine, on a linear afterschool→GPA model ($Z\gets 3X+U_Z,\ Y\gets 2X+4Z+U_Y$):

  1. Abduction: solve for the $U$ from the observed facts.
  2. Action: apply $do(X=x')$, replacing $X$'s equation to form the mutilated model, and carry the $U$ forward.
  3. Prediction: recompute the outcome.
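The three steps above can be sketched directly for the linear model $Z\gets 3X+U_Z,\ Y\gets 2X+4Z+U_Y$. The function name and the factual numbers in the usage lines are hypothetical; only the two structural equations come from the text.

```python
def counterfactual(x, z, y, x_prime):
    """Return (Z, Y) in the world where X had been x_prime, given the factual (x, z, y)."""
    # 1. Abduction: invert the structural equations to recover this unit's luck.
    u_z = z - 3 * x
    u_y = y - 2 * x - 4 * z

    # 2. Action: do(X = x_prime) overrides X's own equation; U_Z and U_Y carry over.
    x_cf = x_prime

    # 3. Prediction: push the new X and the old luck through the same equations.
    z_cf = 3 * x_cf + u_z
    y_cf = 2 * x_cf + 4 * z_cf + u_y
    return z_cf, y_cf


# Hypothetical factual story: X = 1 hour of program, Z = 4 hours of homework, Y = 19.
z_cf, y_cf = counterfactual(x=1, z=4, y=19, x_prime=2)
print(z_cf, y_cf)
```

For this made-up student abduction gives $U_Z=1$ and $U_Y=1$, so one extra hour of the program yields $Z'=7$ and $Y'=33$. Setting `x_prime` equal to the factual `x` returns the factual `z` and `y` unchanged, which is a quick sanity check on the abduction step.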

▶ Widget 3 — Counterfactual stepper

A student did $X$ hours of an afterschool program, $Z$ hours of homework, and saw a GPA change $Y$. What would $Y$ have been with $X'$ hours instead? The model is linear, so abduction, action, and prediction are exact algebra. Edit the factual story and the counterfactual $X'$, then step through. The twin networks share the abducted luck $U$.

Widget inputs: the factual $X$, $Z$, $Y$ and the counterfactual $X'$.

The ladder closes — VPI becomes ETT

Here is the payoff the project was built around. CMSI 4320's Assignment 5 ("Ifs, Onlys, and Buts") pushes counterfactuals into named quantities. The headline one is the Effect of Treatment on the Treated:

$$\mathrm{ETT}=E\big[Y_{x'}-Y_{x}\ \big|\ X=x\big]$$

— among those who actually took action $x$, what would have happened under $x'$? And the assignment's worked instance is a web-advertising system: demographics $\to$ ad shown $\to$ clickthrough. Given the people who clicked, would they have clicked under a different ad?
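For intuition, ETT has a closed form in the linear afterschool model from Act IV ($Z\gets 3X+U_Z,\ Y\gets 2X+4Z+U_Y$). This is only an illustration on that toy model, not the assignment's advertising system. Substituting $Z$ into $Y$ under $do(X=x)$ gives

$$Y_x = 2x + 4\,(3x+U_Z) + U_Y = 14x + 4U_Z + U_Y,$$

so the luck terms cancel in the difference and

$$\mathrm{ETT}=E\big[Y_{x'}-Y_{x}\ \big|\ X=x\big]=14\,(x'-x).$$

The cancellation is a property of linearity. In a nonlinear model such as click/no-click, $Y_{x'}-Y_x$ depends on $U$, and the expectation has to be taken over the distribution of $U$ given $X=x$, which is exactly what the abduction step supplies.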

The spine, completed

That is the Defendotron's own question, one rung higher. At tier 2 it asked "is your demographic worth buying before I pick an ad?" (VPI). At tier 3 the same ad agent asks "given you clicked, would you have clicked under the ad I didn't show?" (ETT). Same machinery — beliefs, utilities, the value of information — climbed from seeing to imagining. The ad agent walked the whole ladder.

The same act yields attributable risk and probability of necessity (the drug-liability question: of those harmed after taking the drug, what fraction were harmed because of it?), and Forney's research thread, MABUC — multi-armed bandits with unobserved confounders, where an agent uses its own intent (a counterfactual signal) to beat a confounded slot machine. That's the bridge from this curriculum into reinforcement learning.

Where this shows up

The same ladder, eight domains:

| Domain | Tier-1/2 (see / decide) | Tier-3 (imagine) |
|---|---|---|
| Advertising (the spine) | Defendotron: which ad? VPI of a demographic | ETT: would they have clicked under another ad? |
| Medicine | vaping → heart-disease; exercise → cholesterol | drug-liability attributable risk; physician drug choice |
| Public policy | agricultural subsidy decision | terrorism-predictor causal structure |
| Education | study-habits MDP | afterschool → GPA counterfactual |
| Hiring | screen on predicted performance | counterfactual fairness: same hire under a different background? |
| Navigation | taxi-dispatch / grid-world MDP | "home by now had I left the 405?" |
| Genomics | DNA-sequence HMM (Viterbi) | |
| Risk under uncertainty | tiger-doors POMDP; value of listening | |

Sibling project

The tier-2 bridge — MDPs, bandits, reinforcement learning — is its own story. In robot-arm-sim a crawler teaches itself to walk by Q-learning, and the punchline is the same as here: a model-based method (one that can imagine roads not taken) beats every model-free one. Decision theory and that RL project are two halves of the same ladder.

What's mine vs. what's cited. The Defendotron ad agent (the eu/meu/vpi engine, structure learning, and the modernization that got it running again) is my CMSI 3300 homework, author-confirmed in the source. The causal-hierarchy framing, the vaping and afterschool examples, and the do/counterfactual machinery are from Andrew Forney's CMSI 4320 lectures and assignments (Pearl's Ladder of Causation). The inference and VPI reference implementations are from Russell & Norvig's AIMA (probability.py, making_simple_decision4e.py, Ch. 13–16). The three widgets above are my own, written to mirror the verified Python.

Built as a node of a larger indexed archive of my coursework. Reproduce the Defendotron with pip install pgmpy pandas and the original adbot-data.csv.