Bayes nets → utility → VPI → counterfactuals · 3 interactive widgets · 5 parts · ~30 min read · built & verified by Scott Nelson

The Value of Knowing You

An ad agent wants to show you the right ad, and to do it well it has to climb a ladder: from seeing patterns, to deciding under uncertainty, to putting a number on what one more fact about you is worth, all the way up to imagining the ad it never showed. This page is a topologically sorted curriculum for that climb, and every rung has a worked example you can poke at. The destination is genuinely exciting: by the end, the machinery answers questions courts, central banks, and doctors ask every day.

The spine

One agent, the Defendotron (a real homework assignment of mine, since modernized so it runs again), appears twice. At rung 2 it asks "which ad, and is your demographic worth buying?" (the value of information). At rung 3 the same agent asks "given you clicked, would you have clicked under a different ad?" (a counterfactual). Watching one agent climb the whole ladder is the story.

What I built

The Defendotron's eu()/meu()/vpi() engine, modernized from dead pomegranate-0.x and 2022-pgmpy code to pgmpy 1.1.2, fixing a latent indexing bug on the way.
Every number printed on this page is verified against fresh runs.
The three interactive widgets below, vanilla JS mirroring the verified Python. Full provenance.

Prerequisites

Comfort with conditional probability notation $P(A\mid B)$, summation notation $\sum$, and reading a diagram of circles and arrows. That's it; every other object used on this page is defined on this page. If $P(A\mid B)$ is new, any introduction to Bayes' rule is the right warm-up.

The ladder we're climbing

The organizing image is the Ladder of Causation from Judea Pearl and Dana Mackenzie's The Book of Why (2018), the popular companion to Pearl's Causality. The version here follows Andrew Forney's classroom rendering (LMU CMSI 4320), with the probabilistic machinery from Russell & Norvig's AIMA:

Tier	Question	Example	Denizen
3 · imagine	What if $X$ had been $x'$?	"Would I be home by now had I gotten off the 405?"	Structural Causal Models
2 · do	What if I do $X=x$?	"Will exercise lower my cholesterol?"	Causal Bayes nets · RL
1 · see	What if I see $X=x$?	"Is symptom $X$ tied to disease $Y$?"	Bayesian networks

Pearl & Mackenzie, The Book of Why (2018). Each tier subsumes the ones below it; a tool stuck on tier 1 cannot answer tier-2 or tier-3 questions, which is arguably why a purely associational model (an LLM trained on text alone) stumbles on "what would have happened if…".

Act I · Seeing
Bayesian networks: what do I believe?

First, the words

Three definitions carry this whole act. They're short, and everything later leans on them.

Definition · joint distribution

A random variable is a quantity that takes one of several values with some probability ($S$ below is "is this person stressed?", values $\{0,1\}$). The joint distribution $P(X_1,\dots,X_n)$ assigns a probability to every combination of values at once; it is the complete statistical description of the system. The catch is its size: $n$ binary variables need a table with $2^n$ rows ($2^n-1$ free numbers, since they must sum to 1). Four variables: 15 numbers, fine. Thirty variables: over a billion. The joint is the thing we want and cannot afford to store raw.

Definition · DAG

A directed acyclic graph is a set of nodes (one per variable) connected by one-way arrows, with no cycles: you can never leave a node, follow arrows, and return to it. No variable is its own ancestor. The nodes pointing into $X$ are its parents, written $\mathrm{pa}(X)$.

Definition · what an arrow claims (at this tier)

At tier 1, an arrow $S\!\to\!V$ is a claim about direct statistical dependence: $V$'s distribution is written as a function of $S$, and once you know a variable's parents, its non-descendants tell you nothing more about it (the local Markov property). An arrow is not yet a claim that $S$ causes $V$. That upgrade is precisely what Act III earns. Hold the distinction now: association is about what co-occurs; causation is about what would change if you reached in and changed something. The same data can support arrows drawn in either direction (we'll meet this as "observational equivalence"), and a useful slogan to carry: correlation does not imply causation, but there is no correlation without causation somewhere generating it.

A Bayesian network is a DAG plus one small conditional probability table (CPT) per node: $P(X_i \mid \mathrm{pa}(X_i))$. The promise: those small tables, multiplied together, reconstruct the entire unaffordable joint.

The SEVH story: where the arrows come from

Our running network models heart-disease risk in a population with four binary variables: Stressed, Vapes, Exercises, Heart disease. The graph is $S\!\to\!V$, $S\!\to\!E$, $V\!\to\!H$, $E\!\to\!H$. But a graph should never be handed down from the sky; each arrow is a posited, defensible modeling claim, the kind a clinician or epidemiologist supplies:

$S\to V$: chronic stress plausibly drives people toward nicotine; knowing someone's stress level changes our expectation that they vape.
$S\to E$: stress reshapes exercise habits; in the lecture's population the stressed are actually likelier to hit the gym as an outlet ($P(E{=}1\mid S{=}1)=0.8$ versus $0.4$ among the unstressed). The dependence runs through lifestyle, not chance.
$V\to H$ and $E\to H$: vaping loads the cardiovascular system; exercise protects it. Heart disease "listens to" both (Pearl's phrase: an effect listens to its causes to decide its value).

The SEVH network. Blue: the direct edge $V\to H$. Red: the back-path $V\leftarrow S\to E\to H$, open by default, through which association can flow even if vaping did nothing.

The arrows we refused to draw are claims too, and just as important: no $V\!\to\!E$ (vaping doesn't make anyone exercise; the vaping–exercise correlation in data is fully explained by their shared parent $S$), and no $H\!\to\!S$ (in the time window we model, the disease hasn't yet fed back into stress). One subtlety about "information flow": the arrows are one-way as structural claims, but evidence travels both directions along them. Observing $V=1$ raises our belief in $S=1$ (reading the arrow backwards is legitimate inference), and through $S$ it shifts our belief about $E$. The graph disciplines these flows; it doesn't forbid them.

From graph to numbers: the factorization, term by term

Start from something that is always true, graph or no graph, the chain rule of probability. For any ordering of the variables:

$$P(X_1,\dots,X_n)=\prod_{i=1}^{n} P\!\left(X_i \mid X_1,\dots,X_{i-1}\right).$$

Here $i$ indexes the variables one at a time, and each factor conditions on everything before it. No savings yet; the last factor is as big as ever. The graph's gift is that each variable only needs its parents, not its entire history:

$$P(X_1,\dots,X_n)=\prod_{i=1}^{n} P\!\left(X_i \mid \mathrm{pa}(X_i)\right) \qquad\text{(the Markov factorization)}$$

where $i$ runs over the $n$ nodes in any order that respects the arrows (parents before children), and each factor $P(X_i\mid\mathrm{pa}(X_i))$ is exactly one CPT. Watch it play out on SEVH, ordering $S,V,E,H$:

$$\begin{aligned} P(S,V,E,H) &= P(S)\;P(V\mid S)\;P(E\mid S,V)\;P(H\mid S,V,E) &&\text{chain rule}\\[2pt] &= P(S)\;P(V\mid S)\;P(E\mid S)\;P(H\mid V,E) &&\text{graph} \end{aligned}$$

Two deletions happened on that second line, and each is one of our modeling claims cashing out: $P(E\mid S,V)$ became $P(E\mid S)$ because $E$'s only parent is $S$ (the missing $V\!\to\!E$ arrow), and $P(H\mid S,V,E)$ became $P(H\mid V,E)$ because $H$'s parents are exactly $\{V,E\}$ (no direct $S\!\to\!H$ arrow; stress reaches the heart only through behavior). Count the parameters: the raw joint needed $2^4-1=15$ numbers; the factored form needs $1+2+2+4=9$ (one number for $P(S{=}1)$, two for $P(V{=}1\mid S)$, two for $P(E{=}1\mid S)$, four for $P(H{=}1\mid V,E)$). Nine instead of fifteen is cute; at thirty variables the same move is the difference between a billion numbers and a few hundred.
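The parameter count can be checked mechanically. A minimal sketch, assuming every variable is binary and writing the graph as a plain dict of parent lists; the names `PARENTS`, `free_parameters` and `raw_joint_parameters` are illustrative and not taken from the Defendotron code:

```python
# Each binary node needs one free number per combination of its parents' values.
PARENTS = {"S": [], "V": ["S"], "E": ["S"], "H": ["V", "E"]}


def free_parameters(parents):
    """Free parameters of the factored form: sum over nodes of 2**(number of parents)."""
    return sum(2 ** len(pa) for pa in parents.values())


def raw_joint_parameters(n):
    """Free parameters of an unfactored joint over n binary variables."""
    return 2 ** n - 1


factored = free_parameters(PARENTS)
raw = raw_joint_parameters(len(PARENTS))
print(factored, raw)              # 9 15
print(raw_joint_parameters(30))   # 1073741823
```

Every arrow you decline to draw removes a parent from some list, which is where each saving comes from.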

d-separation: independence you can read off the picture

The structure also tells you what is independent of what, by three local rules about paths. Conditioning on the middle node blocks a fork ($X\!\leftarrow\!Z\!\to\!Y$) and a chain ($X\!\to\!Z\!\to\!Y$); but a collider ($X\!\to\!Z\!\leftarrow\!Y$) is blocked by default and conditioning on it opens it: observing a common effect makes its causes compete to explain it ("explaining away"). An unblocked fork is the confounding we fight in Act III. The collider quirk is the mirror trap: a correlation you manufacture by conditioning. Berkson's paradox is this rule loose in the world: among hospital inpatients, two diseases independent in the general population anticorrelate, because admission is their common effect. Any dataset filtered on an outcome has a collider built into its sampling; we will meet it again as survivorship bias in Part V.

Two predictions you can test in Widget 1 below. First, the fork: fix $S{=}1$ and note $P(E{=}1)$. Decide before clicking: will adding $V{=}1$ move it? Second, the collider: fix $H{=}1$ and note $P(V{=}1)$. Will learning the patient exercises ($E{=}1$) raise or lower your belief that they vape? Commit to answers, then use the preset buttons to check.
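For readers who would rather check with code than with clicks, the same two predictions can be tested by brute-force summation. This is a sketch under a loud assumption: only $P(E{=}1\mid S)$ is stated in the text, so the remaining CPT entries below are chosen to be consistent with the figures quoted on this page and may not match the widget's actual tables. The helper names are illustrative.

```python
from itertools import product

# ASSUMED CPTs: only P(E=1|S) comes from the text; the rest are assumptions
# picked to agree with numbers quoted elsewhere on this page.
P_S1 = 0.7
P_V1 = {0: 0.2, 1: 0.7}                                       # P(V=1 | S)
P_E1 = {0: 0.4, 1: 0.8}                                       # P(E=1 | S)
P_H1 = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.8, (1, 1): 0.6}   # P(H=1 | V, E)


def bern(p1, value):
    """Probability that a binary variable takes `value`, given P(value=1)=p1."""
    return p1 if value == 1 else 1.0 - p1


def joint(s, v, e, h):
    """One world's probability via the Markov factorization."""
    return (bern(P_S1, s) * bern(P_V1[s], v)
            * bern(P_E1[s], e) * bern(P_H1[(v, e)], h))


def prob(query, evidence):
    """P(query | evidence) for dicts like {"E": 1}, summing over all 16 worlds."""
    num = den = 0.0
    for s, v, e, h in product((0, 1), repeat=4):
        world = {"S": s, "V": v, "E": e, "H": h}
        if all(world[k] == val for k, val in evidence.items()):
            p = joint(s, v, e, h)
            den += p
            if all(world[k] == val for k, val in query.items()):
                num += p
    return num / den


# Fork test: does V carry news about E once S is known?
fork_without_v = prob({"E": 1}, {"S": 1})
fork_with_v = prob({"E": 1}, {"S": 1, "V": 1})

# Collider test: does E carry news about V once H is known?
collider_without_e = prob({"V": 1}, {"H": 1})
collider_with_e = prob({"V": 1}, {"H": 1, "E": 1})

print(fork_without_v, fork_with_v)
print(collider_without_e, collider_with_e)
```

If the first pair matches and the second pair does not, the picture and the arithmetic agree.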

Asking the network questions: inference, as an outline

Given evidence $e$ (things we observed), we want the posterior $P(Q\mid e)$ for a query variable $Q$. The by-hand method, enumeration, is a four-step outline:

Label the variables: query $Q$, evidence $e$, and everything else hidden, $Y$. The target is $P(Q\mid e)=P(Q,e)/P(e)$.
Numerator. Compute $P(Q,e)=\sum_{y} P(Q,e,y)$, where each term in the sum is a full joint probability, read directly off the Markov factorization (a product of CPT lookups).
Denominator. Compute $P(e)=\sum_{q} P(q,e)$, reusing the very table you built in step 2 and summing over the query's values.
Divide: $P(Q\mid e)=\dfrac{P(Q,e)}{P(e)}$.

Everything fancier is an efficiency upgrade on that outline: variable elimination pushes the sums inward so you never materialize the whole table, and when networks get too large for exact answers, sampling (rejection, likelihood weighting, Gibbs) approximates the same posterior by simulating worlds and counting. The widget below runs the honest four-step version, because with four binary variables there are only sixteen worlds to sum.
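The four-step outline translates almost line for line into code. A sketch, again assuming CPT values: only $P(E{=}1\mid S)$ is given in the text, and the other entries are assumptions chosen so the output agrees with the 0.649 and 0.640 quoted for Widget 1. The function name `posterior` is illustrative.

```python
from itertools import product

VARS = ("S", "V", "E", "H")

# ASSUMED CPTs (see lead-in): consistent with the page's quoted posteriors,
# not copied from the widget.
P_S1 = 0.7
P_V1 = {0: 0.2, 1: 0.7}                                       # P(V=1 | S)
P_E1 = {0: 0.4, 1: 0.8}                                       # P(E=1 | S)
P_H1 = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.8, (1, 1): 0.6}   # P(H=1 | V, E)


def bern(p1, value):
    return p1 if value == 1 else 1.0 - p1


def joint(w):
    """Full joint probability of one world (a dict), as a product of CPT lookups."""
    return (bern(P_S1, w["S"]) * bern(P_V1[w["S"]], w["V"])
            * bern(P_E1[w["S"]], w["E"]) * bern(P_H1[(w["V"], w["E"])], w["H"]))


def posterior(query, evidence):
    # Step 1: label. Everything that is neither query nor evidence is hidden.
    hidden = [x for x in VARS if x != query and x not in evidence]
    # Step 2: numerator P(Q=q, e), summing the joint over hidden assignments.
    numerator = {}
    for q in (0, 1):
        numerator[q] = sum(
            joint({**evidence, query: q, **dict(zip(hidden, ys))})
            for ys in product((0, 1), repeat=len(hidden))
        )
    # Step 3: denominator P(e), reusing the table from step 2.
    p_e = sum(numerator.values())
    # Step 4: divide.
    return {q: numerator[q] / p_e for q in (0, 1)}


risk_v = posterior("H", {"V": 1})[1]
risk_vs = posterior("H", {"V": 1, "S": 1})[1]
print(f"{risk_v:.3f}")    # 0.649
print(f"{risk_vs:.3f}")   # 0.640
```

Variable elimination would compute the same two numbers while avoiding the full loop over hidden assignments; at sixteen worlds the difference is not worth the extra code.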

▶ Widget 1 · Bayesian-network inference (1 · see)

Question: $P(Q\mid e)$, a belief over worlds as observed. The vaping → heart-disease network with the lecture's CPTs. Click to fix evidence (—, 0, or 1) on any variable; the posteriors $P(\cdot=1\mid e)$ for the rest update by exact enumeration over all 16 worlds. Try fixing $V=1$: heart-disease risk jumps. Then add $S=1$: risk dips slightly (0.649 to 0.640), because stress raises the chance of protective exercise here.

Check yourself · Act I

SEVH needs how many free parameters, versus the raw joint? 9 versus 15; every saving is a missing arrow.
Observing $V{=}1$ raises belief in $S{=}1$. Which way does the arrow point, and why is reading it backwards legitimate? $S\to V$; arrows are one-way structural claims, but evidence flows both ways along them.
Which path rule does explaining away come from? The collider: blocked by default, opened by conditioning on the common effect.

Act II · Deciding
Utility, belief, and the first step toward causation

This act is the hinge of the whole curriculum. Acts I and III/IV get the press (Bayes nets! counterfactuals!), but the move that happens here, from passively describing the world to acting in it, is where The Book of Why locates the difference between seeing and doing. An agent that chooses an action is already, implicitly, asking a tier-2 question: "what happens if I do this?" We will build the choosing machinery first with associational tools, see how far it gets (surprisingly far), and let its limits push us up the ladder. Two new objects are needed: a preference you can score, and a belief you can compute with.

Utility theory: what do I want?

A utility function $U(s)$ assigns a number to each outcome, encoding preference. The von Neumann–Morgenstern theorem says that if an agent's preferences are coherent (completeness, transitivity, continuity, independence), they behave as if maximizing the expectation of some $U$:

$$\mathrm{EU}(a\mid e)=\sum_s P(s\mid e,a)\,U(s),\qquad \mathrm{MEU}(e)=\max_a \mathrm{EU}(a\mid e).$$

Read it in words: for each action $a$, imagine every outcome $s$, weight how much you'd like it by how likely it is, and sum. Then pick the action with the best total. That maximum, $\mathrm{MEU}(e)$, is the score of acting optimally given what you currently know.

What is a belief? (the landmark)

The formula above quietly fused two things: $U(s)$, a desire, and $P(s\mid e,a)$, a belief. We've been using beliefs all through Act I without saying what one is. It's worth a real answer, because the same object climbs every rung from here. Three lenses:

Statistics. A belief is a degree of credence: a probability distribution over the ways the world could be, updated by Bayes' rule as evidence arrives. This is the subjective (Bayesian) reading of probability, where $P$ measures confidence rather than long-run frequency.
Computation. A belief is a data structure: the factored posterior a Bayes net maintains, or the "belief state" an agent carries. In partially observable decision problems (POMDPs), the belief state is provably a sufficient statistic: an agent that tracks only its posterior over hidden states loses nothing for optimal action. Belief is what a rational agent is licensed to compress its whole history into.
Neuroscience. The Bayesian-brain research program treats perception itself as inference: neural populations encode distributions, perception combines prior and likelihood, and prediction errors drive updates. The dopamine system's reward-prediction-error signal mirrors, almost equation for equation, the temporal-difference error of reinforcement learning. Wetware appears to run a cousin of the very update rules in this curriculum.

And the deepest answer runs the arrow backwards. You might worry that "degree of belief" is circular: where do the numbers come from? Frank Ramsey and Bruno de Finetti's move: measure belief by behavior. Your degree of belief in an event is the betting odds you'd accept on it. If your credences violate the probability axioms, a bookie can assemble a set of bets you'd accept that loses you money no matter what happens (a "Dutch book"), so coherence forces beliefs to be probabilities. Leonard Savage completed the construction: an agent whose preferences satisfy a short list of rationality axioms behaves exactly as if it has a unique probability distribution (its beliefs) and a utility function (its desires), and maximizes expected utility. So belief doesn't merely combine with weighted utility; in the foundations, belief emerges from utility-weighted choice. Decision theory isn't built on top of probability; the two co-emerge from coherent preference.
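The Dutch-book argument fits in a two-bet toy. A sketch under stated assumptions: the credences of 0.6 and 0.6 are hypothetical, and the agent is assumed to regard a ticket that pays 1 if an event occurs as fairly priced at its credence in that event.

```python
# Hypothetical incoherent credences: P(rain) + P(no rain) = 1.2, which exceeds 1.
credence = {"rain": 0.6, "no_rain": 0.6}

# The bookie sells one ticket per event at the agent's own "fair" price.
price_paid = sum(credence.values())


def net_result(outcome):
    """Agent's net gain if `outcome` occurs, having bought both tickets."""
    payout = sum(1.0 for event in credence if event == outcome)
    return payout - price_paid


results = {outcome: net_result(outcome) for outcome in credence}
print(results)   # negative in every outcome: a guaranteed loss
```

Exactly one ticket pays whatever happens, so the agent always collects 1 after paying 1.2. Credences that sum to 1 close the gap.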

Landmark · carry this up the ladder

A belief is a probability distribution over ways a world could be. The ladder never changes that object; it changes which worlds the distribution ranges over. Rung 1: the world as observed, $P(s\mid e)$. Rung 2: worlds we would create, $P(s\mid do(a))$. Rung 3: worlds that never happened, $P(s_{x'}\mid e)$ (read $s_{x'}$ as: the value $s$ would have taken had $X$ been forced to $x'$; Act IV defines this properly). Every formula in Acts III and IV is this landmark pointed at a new set of worlds.

Decision networks: belief + preference + choice

A decision network (influence diagram) is a Bayes net extended with decision nodes (the actions under your control) and a utility node. The agent runs Act I inference inside an argmax over actions. This one pattern is everywhere once you see it:

an ER triage system: beliefs about acuity from vitals; utilities over outcomes; choose who's seen next;
a central bank (we'll come back to this one): beliefs about inflation's path; utilities over the dual mandate; set the policy rate;
a spam filter: beliefs about spamminess; asymmetric utilities (a lost real email hurts more than a spam message that slips through); deliver or quarantine.

Different stakes, same skeleton: beliefs × utilities → argmax. Our worked instance is deliberately smaller and a little mischievous.

The Defendotron: a decision network on real data

For CMSI 3300 I built an ad-selection agent that learns a Bayes net from a demographic CSV, then chooses which ad to serve to maximize expected payoff. It has hand-rolled eu(), meu(), and vpi(). The catch, years later: it was written against pomegranate 0.x and a 2022 pgmpy, and it stopped running. So I modernized it (BicScore→BIC, BayesianNetwork→DiscreteBayesianNetwork). The modernization also surfaced a latent bug: eu() looked up the utility table by the first decision variable's name rather than the utility node's, and the modernized run turned that silent slip into a hard KeyError. Diagnosing it meant tracing how the learned network indexes its factors; reproducing the original test suite's numbers afterward is what confirmed the fix. One reading note: G, F, T, H, P below are the ad dataset's anonymized demographic columns (G is gender); they have nothing to do with Act I's $S,V,E,H$, and in particular this H is not heart disease. The re-run on the original adbot-data.csv:

# pgmpy 1.1.2: verified MEU + VPI on the original adbot-data.csv
MEU(no evidence)        → serve {Ad1:1, Ad2:0}   E[utility] ≈ 746.8
MEU(T=0, G=0)           → serve {Ad1:0, Ad2:0}   E[utility] ≈ 797.0
VPI("G" | -)            ≈ 20.77      # gender is worth ~21 utility units to learn
VPI("F" | -)            ≈ 0          # feature F can't change the decision → worthless
VPI("G" | H=0,T=1,P=0)  ≈ 66.8       # in context, the same fact is worth 3× more

The reproduction

The modernized run reproduced the original test suite's headline numbers: MEU≈746.8 (author's value 746.72) and VPI(G)=20.77 exactly. Where the structure learner lands on a slightly different DAG across pgmpy versions, the downstream VPI shifts too (and can even dip numerically negative, which is why you clamp it). That sensitivity is itself a lesson: VPI is only as stable as the graph beneath it.

Value of Perfect Information: the crown jewel

Now the question this page is named for. Before you act, is it worth learning one more fact? The value of perfect information of observing $E'$ is the expected best-you-could-do after seeing it, minus the best-you-could-do now, denominated in utility:

$$\mathrm{VPI}_e(E')=\Big[\textstyle\sum_v P(E'\!=\!v\mid e)\,\mathrm{MEU}\!\big(e\cup\{E'\!=\!v\}\big)\Big]-\mathrm{MEU}(e).$$

Three properties worth holding onto: VPI is never negative (information can't hurt in expectation; exactly true inside a fixed, correct model, while an estimated model can produce numerically negative VPI, which is why the Defendotron clamps it); it's non-additive (two facts can be redundant or synergistic); and it is zero exactly when the fact can't change your decision. VPI measures decision-relevance, not Shannon information. That's why VPI(F)≈0 above: F is informative about the world but irrelevant to which ad wins.

My hand-rolled vpi() turns out to implement the same formula as AIMA's InformationGatheringAgent.vpi() (Fig 16.9); I re-derived the textbook on real data without knowing it.

Before you drag anything, the widget's opening position is a derivation you can do by hand. Blind: $EU(A)=\$10\,(0.5\cdot 0.8+0.5\cdot 0.2)=\$5.00$, and $EU(B)=\$5.00$ too, a dead tie. Informed: young means serve A and earn $\$8.00$; old means serve B and earn $\$7.00$; so $\mathrm{MEU}=0.5(8)+0.5(7)=\$7.50$. The gap is $\mathrm{VPI}=\$2.50$, exactly what the widget shows on load. Confirm it, then drag.
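The hand derivation above can be replayed in a few lines. This is a minimal sketch of the EU, MEU and VPI formulas on the widget's opening position, not the Defendotron's code: the function names echo eu()/meu()/vpi() only for readability, and the belief is represented as a plain dict (an assumption of this sketch).

```python
# Opening position, as read from the derivation above.
P_YOUNG = 0.5
P_CLICK = {("A", "young"): 0.8, ("A", "old"): 0.2,
           ("B", "young"): 0.3, ("B", "old"): 0.7}
CLICK_VALUE = 10.0
ADS = ("A", "B")


def eu(ad, belief):
    """Expected utility of serving `ad` under a belief {trait: probability}."""
    return CLICK_VALUE * sum(p * P_CLICK[(ad, x)] for x, p in belief.items())


def meu(belief):
    """Best achievable expected utility under `belief`, and the ad achieving it."""
    best = max(ADS, key=lambda ad: eu(ad, belief))
    return eu(best, belief), best


def vpi(prior):
    """Expected MEU after observing X minus MEU now, clamped at zero."""
    informed = sum(p * meu({x: 1.0})[0] for x, p in prior.items())
    return max(0.0, informed - meu(prior)[0])


prior = {"young": P_YOUNG, "old": 1.0 - P_YOUNG}
print(f"{eu('A', prior):.2f} {eu('B', prior):.2f}")   # 5.00 5.00
print(f"{vpi(prior):.2f}")                            # 2.50
```

The clamp is redundant in an exact model like this one; it is kept to mirror the guard the text describes for estimated models. Setting both of ad B's click probabilities below ad A's makes A dominate regardless of $X$, and `vpi` returns 0.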

▶ Widget 2 · VPI calculator (2 · decide)

Question: is the MEU after seeing $X$ worth more than the MEU now? A clean ad-targeting net: a hidden viewer trait $X$ (young/old), two ads, and a click. Drag the click-probabilities and the base rate. The widget computes the best ad blind, the best ad if you could see $X$ first, and the gap: the dollar value of knowing $X$. Push the sliders until one ad dominates regardless of $X$ and watch VPI collapse to $0$.

Slider	Default
$P(X=\text{young})$	0.50
$P(\text{click}\mid A,\text{young})$	0.80
$P(\text{click}\mid A,\text{old})$	0.20
$P(\text{click}\mid B,\text{young})$	0.30
$P(\text{click}\mid B,\text{old})$	0.70
click value (dollars, min 0)	10

Check yourself · Act II

What two objects does $\mathrm{EU}(a\mid e)$ fuse? A belief $P(s\mid e,a)$ and a desire $U(s)$.
Why is $\mathrm{VPI}(F)=0$ even though $F$ is informative about the world? VPI measures decision-relevance, not Shannon information; $F$ cannot change which ad wins.

Branch off the ladder · applied playground

Want to use this machinery rather than just read it? The applied decision-theory playground is a companion page of worked word-problems (insurance, markets, medicine, and law), each solved step by step, with a live break-even calculator you can drag. Same engine (expected utility, $\mathrm{MEU}$, and the value of information), pointed at real decisions.

Act III · Doing
Why seeing isn't enough

The Defendotron is powerful, but it lives on tier 1: it knows how variables covary, not what happens if you act. The same is true of the SEVH network as built so far. Ask it "does vaping cause heart disease?" and the naive answer $P(H\!=\!1\mid V\!=\!1)-P(H\!=\!1\mid V\!=\!0)$ is wrong, for three reasons:

Observational equivalence: $S\!\to\!V$ and $V\!\to\!S$ encode the same independences but opposite causes, so data alone cannot pick between them.
Spurious correlation: the back-path $V\!\leftarrow\!S\!\to\!E\!\to\!H$ is open by default, so association leaks through it unless something blocks it, and the naive comparison blocks nothing.
Latent confounders: variables we never recorded may drive two that we did.

Forney's tagline: causal queries demand causal assumptions, encoded in the model's structure. Promote the Bayes net to a Causal Bayesian Network, where an arrow $\text{Parents}(Y)\to Y$ now means $Y=f_Y(\text{Parents}(Y))$: forcing a cause changes the effect, but not vice versa. The arrows finally mean what Act I warned you they didn't yet.

The do-operator: surgery on the graph

Observing differs from intervening. We write $do(X=x)$ for forcing $X$ to $x$ regardless of its normal causes. Structurally, $do(X=x)$ builds the mutilated graph $G_x$: the original graph with every inbound edge to $X$ cut. $X$ is now set by fiat, so its parents no longer explain it:

world	graph	factorization
observe (nature)	$S\!\to\!V,\,S\!\to\!E,\,V\!\to\!H,\,E\!\to\!H$	$P(S)\,P(V\!\mid\!S)\,P(E\!\mid\!S)\,P(H\!\mid\!V,E)$
$do(V=v)$	cut $S\!\to\!V$	$P(S)\,\cdot\,1\,\cdot\,P(E\!\mid\!S)\,P(H\!\mid\!V\!=\!v,E)$

The truncated factorization: under $do(V=v)$, the factor $P(V\mid S)$ is deleted and everything else survives unchanged.

Graph surgery: under $do(V{=}v)$ the inbound edge $S\to V$ is cut. The back-path can no longer reach $V$, so whatever association remains between $V$ and $H$ is causal.

About that lone 1 in the second row: in the observational world, $V$ carries the factor $P(V\mid S)$ because $V$ listens to $S$. The intervention unplugs that wire. $V$ now takes the value $v$ with certainty, so its factor becomes the constant $1$, while $S$ keeps its prior, $E$ keeps $P(E\mid S)$, and $H$ keeps $P(H\mid V{=}v,E)$. Deleting a factor is the algebraic shadow of deleting an arrow. And because only $V$'s own mechanism changed, a Markovian causal effect is computable from purely observational CPTs; you can calculate the result of a randomized experiment you never ran, provided the causal assumptions in the graph hold.

Backdoor & adjustment, with the intuition first

Picture association as water flowing through every unblocked path between $X$ and $Y$, in either direction along the pipes. The causal effect is only the water moving along directed paths $X\!\to\!\dots\!\to\!Y$. Any path that enters $X$ through an inbound arrow (through $X$'s "back door", like $V\!\leftarrow\!S\!\to\!E\!\to\!H$) leaks non-causal association into your comparison. Conditioning is the valve system: conditioning on a node in a fork or chain closes that pipe; conditioning on a collider opens one that was shut. The backdoor criterion asks for a set $Z$ that closes every back-door pipe without opening any collider and without touching descendants of $X$. Find one, and what's left flowing is causation:

$$P(Y \mid do(X{=}x)) = \sum_z P(Y \mid X{=}x, Z{=}z)\; P(Z{=}z) \qquad\text{(adjustment formula)}$$

Concretely, for "does vaping cause heart disease?": comparing vapers to non-vapers naively mixes two differences, the effect of vaping and the fact that vapers are more often stressed (and stress also shifts exercise habits). The fix is exactly what a careful epidemiologist would do by hand: compare vapers to non-vapers within the stressed group, compare them within the unstressed group, then average the two answers, weighting by how common stress is in the population. That sentence is the adjustment formula. One subtlety that rewards a pause: the weight is $P(z)$, the population's stress distribution, not $P(z\mid x)$, the stress distribution among vapers. Using the latter would smuggle the confounding right back in; the whole point of $do$ is to ask what happens when vaping is assigned independently of stress.

On the vaping net the whole argument fits in four lines, every number read off Widget 1's CPTs. Doing: $P(H{=}1\mid do(V{=}1))=\sum_s P(s)\sum_e P(e\mid s)\,P(H{=}1\mid V{=}1,e) =0.3(0.72)+0.7(0.64)=0.664$, and $do(V{=}0)$ gives $0.164$. In that sum the $0.72$ term is the unstressed stratum ($S{=}0$, weight $0.3$) and the $0.64$ term is the stressed one ($S{=}1$, weight $0.7$), the same $0.640$ Widget 1 reports for $V{=}1,S{=}1$. Seeing: the observational $P(H{=}1\mid V{=}1)=0.649$ and $P(H{=}1\mid V{=}0)=0.183$. So seeing estimates a risk difference of $0.649-0.183=0.466$ while doing gives the true causal $0.664-0.164=0.500$; the gap is exactly the confounding leaking through $S$. The numbers differ even in this four-node toy, which is the entire point of the rung.
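The seeing-versus-doing contrast comes down to which weight multiplies each stress stratum. A sketch that computes both, with one loud assumption: $P(S)$, $P(E\mid S)$ and the stratum risks are recoverable from this page, but $P(V\mid S)$ is not stated anywhere, so the values below are assumptions picked to reproduce the quoted 0.649 and 0.183. Function names are illustrative.

```python
# P_V1 is ASSUMED (not given in the text); the other tables are consistent
# with numbers quoted on this page but are likewise not copied from the widget.
P_S = {0: 0.3, 1: 0.7}
P_V1 = {0: 0.2, 1: 0.7}                                       # P(V=1 | S), assumed
P_E1 = {0: 0.4, 1: 0.8}                                       # P(E=1 | S)
P_H1 = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.8, (1, 1): 0.6}   # P(H=1 | V, E)


def p_h1_given_v_s(v, s):
    """P(H=1 | V=v, S=s): sum out E using P(E | S)."""
    return sum((P_E1[s] if e else 1 - P_E1[s]) * P_H1[(v, e)] for e in (0, 1))


def p_h1_do(v):
    """Adjustment over Z = {S}: each stratum weighted by P(s)."""
    return sum(P_S[s] * p_h1_given_v_s(v, s) for s in (0, 1))


def p_h1_see(v):
    """Observational P(H=1 | V=v): each stratum weighted by P(s | v) instead."""
    p_v_given_s = {s: (P_V1[s] if v else 1 - P_V1[s]) for s in (0, 1)}
    p_v = sum(P_S[s] * p_v_given_s[s] for s in (0, 1))
    return sum(P_S[s] * p_v_given_s[s] / p_v * p_h1_given_v_s(v, s)
               for s in (0, 1))


print(f"do:  {p_h1_do(1):.3f} {p_h1_do(0):.3f} diff {p_h1_do(1) - p_h1_do(0):.3f}")
print(f"see: {p_h1_see(1):.3f} {p_h1_see(0):.3f} diff {p_h1_see(1) - p_h1_see(0):.3f}")
# do:  0.664 0.164 diff 0.500
# see: 0.649 0.183 diff 0.466
```

The two functions differ in a single factor, $P(s)$ versus $P(s\mid v)$, which is the subtlety about weights in code form.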

Simpson's paradox

A treatment can look worse in the pooled table yet better within every stratum; the classic kidney-stone dataset does exactly this, because the harder cases were given the better treatment. Both tables are computed correctly from the same data, and rung 1 cannot tell you which to believe. The graph can: if $Z$ causes both treatment and outcome, stratify; if $Z$ is a mediator or a collider, stratifying is what creates the error. "Which table do I trust" is a rung-2 question, and the most common way published statistics mislead.
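The reversal is easy to reproduce. The sketch below uses the success counts as they are commonly reported for the kidney-stone study (treat them as illustrative if your copy of the data differs), and the adjusted rate assumes stone size is a confounder, which is a claim about the graph rather than something the table can show.

```python
# (successes, patients) by treatment and stone size, as commonly reported.
DATA = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, total):
    return successes / total


def pooled_rate(treatment):
    """Success rate ignoring stone size (the pooled table)."""
    s = sum(x[0] for x in DATA[treatment].values())
    n = sum(x[1] for x in DATA[treatment].values())
    return rate(s, n)


def adjusted_rate(treatment):
    """Adjustment formula with Z = stone size: sum_z P(success | t, z) P(z)."""
    n_all = sum(n for t in DATA.values() for (_, n) in t.values())
    total = 0.0
    for size in ("small", "large"):
        p_z = sum(DATA[t][size][1] for t in DATA) / n_all
        total += rate(*DATA[treatment][size]) * p_z
    return total


for size in ("small", "large"):
    print(size, round(rate(*DATA["A"][size]), 3), round(rate(*DATA["B"][size]), 3))
print("pooled", round(pooled_rate("A"), 3), round(pooled_rate("B"), 3))
print("adjusted", round(adjusted_rate("A"), 3), round(adjusted_rate("B"), 3))
```

A wins inside both strata and after adjustment, yet loses in the pooled row, because A was given disproportionately to the large-stone cases.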

This is the substance of CMSI 4320's Assignment 4 ("Back Doors, Do, and other Funny Sounding Things"): translate an English query into a $do$-expression, find an admissible $Z$, and hand-compute, e.g., $P(H\!=\!1\mid do(E\!=\!1),S\!=\!1)$ on the vaping net. The adjustment formula is how you emulate a randomized trial with observational data when the graph permits (the trial itself appears in Part V's data table as a physical $do()$).

Digression · the biggest do() in economics

When a central bank raises its policy rate, that is a literal intervention: $do(\text{rate}{=}r)$ propagates through banks' funding costs into loan pricing, and credit creation slows. The reason this is hard to measure is pure Act III: rates and credit covary in the data, but the bank also reacts to the economy (it hikes when credit booms), which is confounding by indication, the same structure as a physician prescribing the strongest drug to the sickest patients. Economists' fixes are causal-inference moves wearing macro clothing: high-frequency "surprise" components of rate announcements serve as quasi-randomization (a natural $do()$), and narrative records of policy deliberations are used to argue back-door paths shut. The counterfactual version ("had the Fed held rates in 2022, would inflation have persisted?") needs Act IV machinery: a structural model of the economy, which is what DSGE models are, SCMs at national scale.

Check yourself · Act III

The adjustment weight is $P(z)$, not $P(z\mid x)$; what does using $P(z\mid x)$ smuggle back in? The confounding: it reimports the stress distribution among vapers instead of assigning vaping independently of stress.
Why is $P(H{=}1\mid V{=}1)=0.649$ but $P(H{=}1\mid do(V{=}1))=0.664$? Seeing $V{=}1$ also raises belief in stress, and the stressed exercise more here, which protects; the intervention instead assigns vaping across the population's actual stress mix.

Act IV · Imagining
Structural causal models & the top rung

The most expressive model writes down the mechanisms themselves. A structural causal model $M=\langle U,V,F\rangle$ has endogenous variables $V$, exogenous "luck" variables $U$ (one per $V$ in the Markovian case), and structural equations $V_i\gets f_{V_i}(\text{pa}_i,U_{V_i})$. Fix the $U$ and everything is determined; that determinism is what makes counterfactuals computable. When two variables secretly share a $U$ (in the linear case, $\mathrm{Cov}(U_X,U_Y)\neq 0$), you have an unobserved confounder. And recall the Act II landmark: a belief is a distribution over ways a world could be. The SCM is the machine that generates those worlds, one per setting of $U$, which is exactly why it can range over worlds that never happened.

Counterfactuals: abduction, action, prediction

A counterfactual conditions on what actually happened and asks about a world where one thing was different: two worlds sharing the same background luck. You cannot just intervene (that throws away the evidence about the luck); you must first recover it. Our worked model is a linear SCM of an afterschool program's effect on grades. Define the variables precisely:

Variable	Meaning	Structural equation
$X$	hours per week the student attends the afterschool program	$X \gets U_X$ (chosen outside the model)
$Z$	hours per week of homework completed (the program assigns and supervises homework, so $X$ drives $Z$)	$Z \gets 3X + U_Z$
$Y$	change in GPA quality points since enrolling (positive = improvement)	$Y \gets 2X + 4Z + U_Y$
$U_X,U_Z,U_Y$	everything else, bundled per variable: aptitude, sleep, home environment, luck	exogenous, mean 0

The path coefficients say: each program-hour directly adds 2 quality points (mentoring), and each homework-hour adds 4; the program also generates 3 homework-hours per program-hour. Now the query: a student did $X{=}1$ hour, completed $Z{=}2$ homework-hours, and actually lost 2 quality points ($Y{=}{-}2$). What would their GPA change have been with $X{=}3$ hours instead? The three-step engine:

Abduction. Solve this student's luck from the observed facts: $U_Z = Z-3X = -1$ (they did one homework-hour fewer than the program predicts) and $U_Y = Y-2X-4Z = -12$ (a large unlucky residual: something outside the model was dragging this student's grades hard).
Action. Intervene $do(X{=}3)$ on the mutilated model, carrying the abducted $U_Z, U_Y$ forward unchanged; the counterfactual world keeps the same luck.
Prediction. Recompute: $Z' = 3(3)+(-1) = 8$, then $Y' = 2(3)+4(8)+(-12) = 26$.
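The three steps are short enough to write out directly. A sketch of the engine for this specific linear model; the function names `abduct` and `predict` are illustrative, and the coefficients are the ones in the table above.

```python
def abduct(x, z, y):
    """Step 1: recover this student's exogenous terms from the observed facts."""
    u_z = z - 3 * x              # from Z <- 3X + U_Z
    u_y = y - 2 * x - 4 * z      # from Y <- 2X + 4Z + U_Y
    return u_z, u_y


def predict(x, u_z, u_y):
    """Steps 2 and 3: force X=x, keep the abducted luck, recompute Z then Y."""
    z = 3 * x + u_z
    y = 2 * x + 4 * z + u_y
    return z, y


u_z, u_y = abduct(x=1, z=2, y=-2)
z_cf, y_cf = predict(x=3, u_z=u_z, u_y=u_y)
print(u_z, u_y, z_cf, y_cf)   # -1 -12 8 26
```

As a sanity check, calling `predict` with the factual $X{=}1$ and the same luck returns the observed $Z{=}2$, $Y{=}{-}2$: the counterfactual world differs from the real one only in the forced value.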

Because the model is linear, every step is exact algebra, which is why this is the act's widget: edit the factual story, watch the twin worlds share their $U$.

▶ Widget 3 · Counterfactual stepper (3 · imagine)

Question: in the world where the facts happened, what would $Y$ have been had $X$ been $X'$? Variables: $X$ = weekly hours in the afterschool program · $Z$ = weekly homework hours · $Y$ = GPA change in quality points · $U_Z,U_Y$ = this student's bundled "everything else". Edit the factual story and the counterfactual $X'$, then step through Abduction → Action → Prediction. The twin networks share the abducted luck $U$.


The courtroom: "but for", necessity, and sufficiency

Here is a framing where counterfactuals stop being philosophy and start deciding verdicts. Tort law's standard test of causation is the but-for test: but for the defendant's act, would the harm have occurred? That sentence is literally a rung-3 query. With exposure $X$ and harm $Y$, the court is asking about $Y_{X=0}$ for a plaintiff whose actual history is $X{=}1, Y{=}1$. Read $Y_{X=0}$ as: the value $Y$ would have taken for this same individual had $X$ been forced to $0$. Same person, same exogenous luck $U$, different action. Subscript means intervened; the conditioning bar means observed. So $P(Y_{X=0}{=}0\mid X{=}1,Y{=}1)$ mixes a world we make with facts we saw, which is exactly what the three-step engine computes: abduct $U$ from the facts, then intervene. Pearl gives the two faces of this question names:

$$\mathrm{PN} = P\big(Y_{X=0}{=}0 \;\big|\; X{=}1,\,Y{=}1\big) \qquad \mathrm{PS} = P\big(Y_{X=1}{=}1 \;\big|\; X{=}0,\,Y{=}0\big)$$

Probability of necessity (PN): among those who were exposed and harmed, how likely is it the harm would not have happened but for the exposure? This is the liability question, and the civil "preponderance of the evidence" standard reads naturally as $\mathrm{PN} > 0.5$: more likely than not, the exposure was necessary for this plaintiff's harm.
Probability of sufficiency (PS): among the unexposed and unharmed, how likely is it the exposure would have produced the harm? This is the regulator's question: how dangerous is it to introduce this thing to people currently fine?

The distinction has teeth. A drug can be rarely necessary (most injured users would have been injured anyway) yet often sufficient (it reliably injures the healthy), or the reverse; liability and regulation hinge on different counterfactuals. And this is exactly where my CMSI 4320 Assignment 5 drug-liability problem lived: under monotonicity (the drug never prevents the harm) plus exogeneity (the exposure unconfounded with the harm), PN collapses to the excess risk ratio, $\big(P(y\mid x)-P(y\mid x')\big)/P(y\mid x)$, the classic "attributable fraction among the exposed", computable from data. Drop either assumption and point identification from raw conditionals fails: with confounding you also need the experimental quantity $P(y\mid do(x'))$, and without monotonicity PN can only be bounded; we return to those bounds on the advanced shelf. Event attribution science reports the same quantity under another name: the fraction of attributable risk behind a headline like "climate change doubled the odds of this heatwave" is this excess risk ratio, computed against a counterfactual Earth that exists only inside a structural model. The Book of Why devotes a chapter to exactly this courtroom logic.
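To make the excess risk ratio concrete, here is a small sketch. The function name and the two risk figures are hypothetical; the formula is the one quoted above and is only valid under monotonicity plus exogeneity.

```python
def excess_risk_ratio(p_y_given_x, p_y_given_not_x):
    """PN = (P(y|x) - P(y|x')) / P(y|x), valid ONLY under monotonicity + exogeneity."""
    if not (0.0 < p_y_given_x <= 1.0 and 0.0 <= p_y_given_not_x <= 1.0):
        raise ValueError("inputs must be probabilities, with P(y|x) > 0")
    if p_y_given_not_x > p_y_given_x:
        # Under the two assumptions P(y|x) >= P(y|x') must hold.
        raise ValueError("P(y|x') > P(y|x): the assumed monotonicity + exogeneity fail")
    return (p_y_given_x - p_y_given_not_x) / p_y_given_x

# HYPOTHETICAL risks: harm in 2% of the exposed, 1% of the unexposed.
pn = excess_risk_ratio(0.02, 0.01)
print(pn)
```

A doubled risk therefore lands at $\mathrm{PN} = 0.5$, exactly on the preponderance threshold; anything above a doubling clears it, which is why "more than doubled" matters so much in these arguments.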

Courts also fail one rung down. Presenting $P(\text{evidence}\mid\text{innocent})$ as if it were $P(\text{innocent}\mid\text{evidence})$ is the prosecutor's fallacy: in the Sally Clark case, a one-in-73-million figure for two natural infant deaths was allowed to stand in for the probability of innocence, ignoring both the base rate (double SIDS, while rare, is far more common than double infanticide) and the dependence between the two deaths. That is the gallery's "base-rate reasoning" entry: rung 1, misread.

The ladder closes: VPI becomes ETT

The same assignment pushes counterfactuals into one more named quantity, the Effect of Treatment on the Treated:

$$\mathrm{ETT}=E\big[Y_{x'}-Y_{x}\ \big|\ X=x\big]$$

Among those who actually took action $x$, what would have happened under $x'$? (Written here with the alternative first, since our question asks about the ad not shown; Pearl's convention, $E[Y_x - Y_{x'}\mid X=x]$, is the negative of this.) The assignment's worked instance is a web-advertising system: demographics, then ad shown, then clickthrough. Given the people who clicked, would they have clicked under a different ad?
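A sketch of how that quantity could be computed for the ad system, under one loud assumption: demographics block every backdoor path from ad to click, so that $E[Y_{x'}\mid X{=}x]=\sum_d P(y\mid x',d)\,P(d\mid x)$. The tables, ad names, and the `ett` helper are hypothetical illustrations, not the assignment's data.

```python
def ett(p_d_given_x, click_rate, x, x_alt):
    """E[Y_{x_alt} - Y_x | X = x], ASSUMING demographics satisfy the backdoor criterion.

    Reweights each demographic's click-rate difference by how common that
    demographic is among the people who were actually shown ad x.
    """
    return sum(
        p_d * (click_rate[(x_alt, d)] - click_rate[(x, d)])
        for d, p_d in p_d_given_x[x].items()
    )

# HYPOTHETICAL P(demographic | ad shown) under the logged ad policy.
p_d_given_x = {
    "A": {"young": 0.8, "old": 0.2},
    "B": {"young": 0.3, "old": 0.7},
}
# HYPOTHETICAL P(click | ad, demographic).
click_rate = {
    ("A", "young"): 0.30, ("B", "young"): 0.10,
    ("A", "old"): 0.05,   ("B", "old"): 0.20,
}

ett_a_to_b = ett(p_d_given_x, click_rate, "A", "B")  # people shown A, imagined under B
ett_b_to_a = ett(p_d_given_x, click_rate, "B", "A")  # people shown B, imagined under A
print(ett_a_to_b, ett_b_to_a)
```

With these made-up numbers both values come out negative: each audience would have clicked less under the ad it did not see, which is what a well-targeted policy should look like.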

The spine, completed

That is the Defendotron's own question, one rung higher. At tier 2 it asked "is your demographic worth buying before I pick an ad?" (VPI). At tier 3 the same ad agent asks "given you clicked, would you have clicked under the ad I didn't show?" (ETT). The same machinery of beliefs, utilities, and the value of information has climbed from seeing to imagining. The ad agent walked the whole ladder.

Check yourself · Act IV

The model says a generic student scores $Y=14X$, so $do(X{=}3)$ gives $42$; why does our student get only $26$? Abduction found $U_Z=-1$ and $U_Y=-12$. Rung 3 keeps this student's luck; rung 2 averages over everyone's.
In $P(Y_{X=0}{=}0\mid X{=}1,Y{=}1)$, which part is a world we make and which is facts we saw? The subscript $X{=}0$ is the intervention; the conditioning bar holds the observed history. Mixing them is exactly what abduction-action-prediction computes.
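For the first question, one way to see where the gap comes from is to substitute $Z = 3X + U_Z$ into $Y = 2X + 4Z + U_Y$, the same two equations that abduction solved:

$$Y = 2X + 4(3X + U_Z) + U_Y = 14X + 4U_Z + U_Y$$

$$Y_{X=3} = 14(3) + 4(-1) + (-12) = 42 - 4 - 12 = 26$$

The $42$ is the zero-luck term; the remaining $-16$ is this student's abducted luck, a quarter of it arriving through homework hours and the rest hitting grades directly.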

Part V · In the wild

The curriculum is done; this last part is the payoff tour, ordered from familiar to frontier. First the gallery of domains, then the two practical questions every new reader should ask (where do the graphs come from? what data do you need?), and finally the advanced shelf.

Field guide · which rung is this claim on?

"linked to / associated with" is rung 1. Ask: what shared cause could generate it, and was the sample filtered on a common effect (a collider)?
"causes / cuts risk by" needs rung 2. Ask: where did the $do()$ come from: randomization, a natural experiment, or adjustment, and adjusted for what?
"would have / root cause / to blame" is rung 3. Ask: what model supplied the counterfactual? An incident postmortem's "root cause" is a but-for test, and "five whys" is abduction done by hand.

The gallery

Domain	Tier-1/2 (see / decide / do)	Tier-3 (imagine)
Advertising (the spine)	Defendotron: which ad? VPI of a demographic	ETT: would they have clicked under another ad?
A/B testing & recommenders	an A/B test is a physical $do()$; a recommender trained on its own logs inherits its past policy's confounding, so rung-1 retraining amplifies its own biases	uplift modeling: spend only on persuadables, users whose click the ad causes; "would have clicked anyway" is sufficiency without necessity
Central banking	$do(\text{raise rate})$: transmission into credit creation; surprise-based identification	"Had the Fed held rates, would inflation have persisted?" (structural macro models)
Medicine	vaping → heart-disease; exercise → cholesterol; treatment choice under uncertainty	drug-liability PN; "would this patient have recovered untreated?"
Law	screening evidence; base-rate reasoning	but-for causation; necessity vs sufficiency; preponderance as PN > 0.5
Public policy	quasi-experiments: difference-in-differences, regression discontinuity, instrumental variables (the central-bank surprise method is an IV); each hunts a natural $do()$ in observational data	policy counterfactuals ("had the program not run…")
Education	study-habits decisions	afterschool → GPA counterfactual (the widget)
Hiring	screen on predicted performance	counterfactual fairness: same hire under a different background?
Navigation & ops	taxi-dispatch and grid-world decisions	"home by now had I left the 405?"
Genomics	DNA-sequence HMMs (Viterbi); Mendelian randomization: genotype as nature's coin flip, a free $do()$ for exposures you could never ethically randomize	—
Risk under uncertainty	tiger-doors POMDP; paying for one more observation (VPI again)	—

Where do the DAGs come from?

Every result above conditioned on "given the graph". Three honest sources, usually combined:

Domain expertise. The SEVH story scaled up: a human who understands the mechanisms posits each arrow and, just as deliberately, each absence. In serious applications this is an elicitation exercise with experts, and every edge should survive the challenge "defend this arrow, and defend the arrows you didn't draw."
Score-based learning. Treat structure as a search problem: propose a DAG, score how well it explains the data penalized for complexity (e.g. BIC), and hill-climb through edge-additions/deletions/reversals. This is exactly what the Defendotron does (HillClimbSearch + BIC over adbot-data.csv).
Constraint-based learning. Run conditional independence tests on the data, build the skeleton of edges consistent with them, then orient what logic allows (colliders first). This is the PC algorithm; I used it via CMU's Tetrad on an exam-anxiety study in CMSI 4320's Assignment 4.

And the structural limit from Act III applies with force: observational data identifies a graph only up to its equivalence class. Some arrows can never be oriented by associations alone; they need an intervention, time order (causes precede effects), or background knowledge ("biological sex cannot be caused by weight"). Structure learning is where Act I's association and Act III's causation actually meet.
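Both the score-based recipe and the equivalence-class limit show up on a two-variable toy. The sketch below is a hand-rolled BIC for binary columns, not pgmpy's implementation; the datasets are hypothetical counts, and a real hill-climb would wrap this scorer in a loop that proposes edge additions, deletions, and reversals.

```python
import math
from collections import Counter

def log_likelihood(data, child, parents):
    """Maximized log-likelihood of one column given its parent columns."""
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    return sum(n * math.log(n / parent_counts[pa]) for (pa, _), n in joint.items())

def bic(data, structure):
    """BIC of a DAG given as {column: parent columns}. ASSUMES binary variables."""
    ll = sum(log_likelihood(data, c, ps) for c, ps in structure.items())
    # One free probability per parent configuration of each binary node.
    n_params = sum(2 ** len(ps) for ps in structure.values())
    return ll - 0.5 * math.log(len(data)) * n_params

EMPTY = {0: (), 1: ()}      # A   B  (no edge)
A_TO_B = {0: (), 1: (0,)}   # A -> B
B_TO_A = {0: (1,), 1: ()}   # B -> A

# HYPOTHETICAL datasets of (a, b) rows.
dependent = [(0, 0)] * 60 + [(0, 1)] * 10 + [(1, 0)] * 5 + [(1, 1)] * 25
independent = [(0, 0)] * 25 + [(0, 1)] * 25 + [(1, 0)] * 25 + [(1, 1)] * 25

for name, data in [("dependent", dependent), ("independent", independent)]:
    scores = {label: bic(data, s) for label, s in
              [("empty", EMPTY), ("A->B", A_TO_B), ("B->A", B_TO_A)]}
    print(name, scores)
```

On the dependent data the edge wins; on the independent data the complexity penalty makes the empty graph win. In both cases A→B and B→A tie exactly, which is the equivalence class in miniature: no amount of this data can orient the arrow.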

What data feeds which rung?

The ladder's tiers correspond to kinds of data, and knowing which kind you hold tells you which questions you've earned:

Data	Rung it natively answers	Character
Observational (logs, surveys, medical records, market data)	1 · seeing	Cheap, plentiful, and confounded; the world chose who got "treated."
Experimental (RCTs, A/B tests)	2 · doing	A physical $do()$: randomization cuts every inbound arrow to the treatment. Expensive, slow, sometimes unethical or impossible (you cannot randomize smoking, or interest rates).
Counterfactual	3 · imagining	Does not exist. No dataset contains $Y_{x'}$ for an individual who received $x$; the road not taken is never logged.

That last row is the deep one. Act III's adjustment machinery exists precisely to promote rung-1 data to rung-2 answers when the graph permits; that's its whole job. Rung 3 can never be reached by data alone: counterfactual questions always borrow strength from a model (the SCM's structural equations), with data serving to pin down the model's parameters and your individual's exogenous luck.

Practical notes for the data you collect:

timestamps help orient arrows (causes precede effects);
record the variables your experts name as confounders before you need them;
filtering a dataset on a common effect is conditioning on a collider, and it will manufacture associations out of nothing (survivorship bias is d-separation's revenge; Act I's Berkson example is the same trap).
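The collider warning in the last note can be checked by exact enumeration, with no sampling noise. This is a toy under stated assumptions: two independent fair binary traits, and a hypothetical filter that keeps a row whenever either trait is present (the common effect $C = X \lor Y$).

```python
from itertools import product

def correlation(pairs):
    """Pearson correlation of equally weighted (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
    var_y = sum((y - mean_y) ** 2 for _, y in pairs) / n
    return cov / (var_x * var_y) ** 0.5

# Four equally likely cells: X and Y are independent fair coins.
population = list(product([0, 1], repeat=2))

# Filter on the collider: keep only rows where C = (X or Y) is true.
kept = [(x, y) for x, y in population if x or y]

corr_all = correlation(population)
corr_kept = correlation(kept)
print(corr_all, corr_kept)
```

The full population has correlation $0$; the filtered one has correlation $-0.5$, created purely by dropping the $(0,0)$ cell.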

So which data are "most applicable" for counterfactual questions? The honest answer: a combination. Observational data captures how the world behaves under nature's choices; experimental data captures behavior under forced choices; an SCM stitches them together. Remarkably, the pair constrains counterfactual quantities more tightly than either alone.

The advanced shelf

Saved for last, as promised: the live edge of this material.

Bounding the unidentifiable. When assumptions like monotonicity fail, PN and PS cannot be computed exactly, but Tian & Pearl showed that combining observational and experimental data yields provable bounds: a court can sometimes know $\mathrm{PN}\in[0.8, 1.0]$ without any untestable modeling leap, which is more than enough for a preponderance standard.
ETT identification. The treated-group counterfactual from the spine is identifiable from observational data under backdoor-style conditions; CMSI 4320's "counterfactual backdoor" material works out when the twin network lets adjustment reach across worlds.
Twin networks as a computation device. The factual and counterfactual worlds drawn as one graph, sharing exogenous $U$ nodes; counterfactual inference becomes ordinary Bayes-net inference on the doubled graph (this is exactly what Widget 3 renders).
Counterfactuals inside the decision loop: MABUC. Forney's research thread, multi-armed bandits with unobserved confounders: an agent that conditions on its own intent (what it was about to do is a symptom of the confounder) earns a counterfactual signal worth real regret reductions. This is the bridge from this page back into reinforcement learning, and the frontier where "the value of knowing" becomes "the value of knowing yourself."
Where it's heading. Counterfactual fairness (would the decision differ had the applicant's protected attribute differed?), harm attribution for autonomous systems (a robotic but-for test), and personalized medicine's holy grail, the effect of treatment on this patient, which is ETT asked at the bedside.
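For the first item on the shelf, the bounds can be sketched directly. As I understand Tian & Pearl's result, with the observational joint over exposure and harm plus the experimental $P(y\mid do(x'))$, the probability of necessity satisfies $\max\{0,\,[P(y)-P(y_{x'})]/P(x,y)\} \le \mathrm{PN} \le \min\{1,\,[P(y'_{x'})-P(x',y')]/P(x,y)\}$. The function name and every number below are hypothetical.

```python
def pn_bounds(p_xy, p_x_noty, p_notx_y, p_notx_noty, p_y_do_notx):
    """Bounds on PN from observational joint P(X, Y) and experimental P(y | do(x')).

    No monotonicity or exogeneity is assumed; the price is an interval
    instead of a point value.
    """
    total = p_xy + p_x_noty + p_notx_y + p_notx_noty
    if abs(total - 1.0) > 1e-9:
        raise ValueError("the four joint probabilities must sum to 1")
    if p_xy <= 0.0:
        raise ValueError("PN is undefined when nobody is both exposed and harmed")
    p_y = p_xy + p_notx_y                      # observational P(y)
    lower = max(0.0, (p_y - p_y_do_notx) / p_xy)
    upper = min(1.0, ((1.0 - p_y_do_notx) - p_notx_noty) / p_xy)
    return lower, upper

# HYPOTHETICAL inputs: P(x,y), P(x,y'), P(x',y), P(x',y'), then P(y | do(x')).
lower, upper = pn_bounds(0.30, 0.20, 0.10, 0.40, p_y_do_notx=0.15)
print(lower, upper)
```

With these made-up inputs the interval sits entirely above $0.5$, so the preponderance question is settled even though PN itself is never pinned to a single number.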

Sibling project

The tier-2 bridge into sequential decisions (MDPs, bandits, reinforcement learning) is its own story. In robot-arm-sim a crawler teaches itself to walk by Q-learning, and the punchline is the same as here: a model-based method, one that can imagine roads not taken, beats every model-free one. Decision theory and that RL project are two halves of the same ladder.

Appendix · the cheat sheet

Symbol	Reads as	Rung	Defined in
$P(Q\mid e)$	belief in $Q$ given what was observed	1	Inference
$\mathrm{EU},\ \mathrm{MEU}$	expected utility of an action; the best achievable	1–2	Utility
$\mathrm{VPI}_e(E')$	what learning $E'$ is worth before acting	2	VPI
$do(X{=}x)$	force $X{=}x$; cut its inbound arrows	2	do-operator
$\sum_z P(Y\mid x,z)P(z)$	emulate a trial: stratify, then average by $P(z)$	2	Backdoor
$\langle U,V,F\rangle$	mechanisms plus luck: a world generator	3	SCM
$Y_x$	$Y$ had $X$ been forced to $x$, same luck $U$	3	Courtroom
$\mathrm{PN},\ \mathrm{PS}$	was it necessary? would it have sufficed?	3	Courtroom
$\mathrm{ETT}$	effect of treatment on those actually treated	3	Ladder closes