
Probabilistic Graphical Models


The goal of causal inference is to figure out a causal relationship, or the lack thereof, between two sets of variables: causes and outcomes. It is thus only natural to think of how we determine whether any particular variable is a cause or an outcome. It is often relatively straightforward to determine what an outcome variable is, as this determination is based on our subjective interest. For instance, a typical outcome variable in medicine is the disease-free survival rate within some number of years since diagnosis or treatment. This choice is natural, since this variable, or its quantity, is what we want to maximize. It is however much less clear how to determine which variable should be considered a cause. For instance, in the classical example of ‘smoking causes lung cancer’, what makes us choose ‘whether someone smokes cigarettes’ as a cause variable rather than ‘a mutation in a particular gene’? It becomes even more mind-boggling once we realize that this choice of ‘smoking’ as a cause means that we have decided to ignore many other variables, such as ‘whether a farmer decided to grow tobacco’. It is thus an important, if not the most important, job for practitioners of causal inference to convincingly argue why some variables are included and others are omitted. They must also argue why some of the included variables are considered potential causes and why they chose a particular variable as an outcome. This process can be thought of as defining a small universe in which causal inference is performed. There are many different ways to define and describe such a universe, and in this lecture note, we largely stick to using a probabilistic graphical model, or a corresponding structural causal model, to describe each universe, which is the main topic of this chapter.

Probabilistic Graphical Models

In this course, we rely on probabilistic graphical models quite extensively to describe the statistical and causal relationships among random variables. A probabilistic graphical model, also referred to as a Bayesian graphical model, is a directed graph [math]G=(V,E)[/math], where [math]V[/math] is a set of vertices/nodes and [math]E[/math] is a set of directed edges. Each node [math]v \in V[/math] corresponds to a random variable, and each edge [math]e = (v_s, v_e)[/math] represents the dependence of [math]v_e[/math] on [math]v_s[/math]. We assume throughout the rest of this course that this graph is acyclic, that is, there is no cycle within the graph; in other words, [math]G[/math] is a directed acyclic graph. For each node [math]v \in V[/math], we define a probability distribution [math]p_v(v | \mathrm{pa}(v))[/math] over this variable conditioned on all its parent nodes

[[math]] \begin{align} \mathrm{pa}(v) = \left\{ v' \in V | (v', v) \in E \right\}. \end{align} [[/math]]

We can then write a joint distribution over all the variables as

[[math]] \begin{align} p_V (V) = \prod_{v \in V} p_v(v | \mathrm{pa}(v)), \end{align} [[/math]]

following the chain rule of probabilities. When [math]\mathrm{pa}(v) = \emptyset[/math], i.e. when [math]v[/math] has no incoming edges, we call the associated distribution [math]p_v(v)[/math] a prior distribution. With [math]P[/math] the set of all conditional probabilities [math]p_v[/math], we can denote any probabilistic graphical model as a triplet [math](V, E, P)[/math]. From this joint distribution, we can derive all kinds of conditional distributions by marginalizing out variables and applying the definition of a conditional probability. If we are not interested in a particular node [math]\bar{v} \in V[/math], we can marginalize out this variable by

[[math]] \begin{align} p(V \backslash \{\bar{v}\}) = \sum_{\bar{v}} p_V (V). \end{align} [[/math]]

If [math]\bar{v}[/math] is a continuous random variable, we replace [math]\sum[/math] with [math]\int[/math]. We can always turn a joint probability into a conditional probability by

[[math]] \begin{align} p(V \backslash \{\tilde{v}\} | \tilde{v}) = \frac{p_V(V)}{p_{\tilde{v}}(\tilde{v})}. \end{align} [[/math]]
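To make these manipulations concrete, here is a minimal sketch in Python (the variables and numbers are our own illustrations, not from the text) that stores a joint distribution over two binary variables as a table, then recovers a marginal by summing and a conditional by dividing, exactly as in the two identities above.

```python
# A minimal sketch: a joint distribution over two binary variables stored
# as a table, with marginalization and conditioning applied to it.
# The probabilities here are illustrative assumptions.
import numpy as np

# p_joint[a, b] = p(v = a, v' = b); rows index v, columns index v'.
p_joint = np.array([[0.30, 0.10],
                    [0.20, 0.40]])

# Marginalization: sum out v to obtain p(v').
p_vprime = p_joint.sum(axis=0)           # [0.5, 0.5]

# Conditioning: p(v | v') = p(v, v') / p(v').
p_v_given_vprime = p_joint / p_vprime    # broadcasting divides each column

print(p_vprime)
print(p_v_given_vprime)                  # each column sums to 1
```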


Using the definition of the conditional probability, we can write marginalization in the following way:

[[math]] \begin{align} p(V \backslash \{\bar{v}\}) = \sum_{\bar{v}} \frac{p_V(V)}{p_{\bar{v}}(\bar{v})} p_{\bar{v}}(\bar{v}) = \sum_{\bar{v}} p(V \backslash \{\bar{v}\} | \bar{v}) p_{\bar{v}}(\bar{v}). \end{align} [[/math]]

Marginalization corresponds to computing the weighted sum of the conditional probability of the remaining variables according to the marginal probability of the variable being marginalized out.

Let us assume we can readily sample from any conditional probability [math]p_v[/math] with [math]v\in V[/math]. We can then draw a sample from the joint distribution by sweeping through all the variables breadth-first from the roots, i.e., in a topological order of the graph. That is,

[[math]] \begin{align} \tilde{v} \sim p_v (v | \tilde{\mathrm{pa}}(v)), \end{align} [[/math]]

where [math]\tilde{\mathrm{pa}}(v) = \left\{ \tilde{v}' \sim p_{v'}(v' | \tilde{\mathrm{pa}}(v')) | v' \in \mathrm{pa}(v) \right\}[/math]. This procedure is called ancestral sampling and is an exact, unbiased way to sample from the joint distribution. If we set aside efficiency, ancestral sampling is an extremely powerful tool, as it allows us to sample from any marginal distribution as well as any conditional distribution. In the former case, we simply discard the draws that correspond to the uninteresting variables (those being marginalized out). In the latter case, we only keep samples whose values, corresponding to the conditioning variables (those on the right hand side of [math]|[/math]), are precisely those that we want the distribution to be conditioned on. Neither approach is efficient, and it is often much better to use a more sophisticated sampling algorithm, often based on the idea of Markov chain Monte Carlo (MCMC).

Any probability distribution can be expressed as a table (though this table may have infinitely many rows) that consists of two columns. The first column holds the value of interest and the second column its probability (either density or mass). The probability function [math]p_v[/math] above works by hashing [math]v[/math] into a row index in this table and retrieving the associated probability, i.e. [math]p_v: \mathbb{V} \to \mathbb{R}_+[/math]. This function satisfies the following normalization property:

[[math]] \begin{align} 1 = \begin{cases} \sum_{v \in \mathbb{V}} p_v (v),&\text{ if } v \text{ is discrete.} \\ \int_{v \in \mathbb{V}} p_v (v) \mathrm{d}v,&\text{ if } v \text{ is continuous.} \\ \end{cases} \end{align} [[/math]]
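As an illustration of both ideas, distributions stored as tables and ancestral sampling over them, the following sketch (a toy three-node chain of our own; the probability tables are made up) draws joint samples parent-first, then performs marginal inference by discarding and conditional inference by rejection, as described above.

```python
# A minimal sketch of ancestral sampling on the chain v1 -> v2 -> v3,
# with every conditional distribution stored as a table. The graph and
# the probabilities are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

p_v1 = np.array([0.6, 0.4])              # prior over v1 (no parents)
p_v2_given_v1 = np.array([[0.9, 0.1],    # row index = value of v1
                          [0.3, 0.7]])
p_v3_given_v2 = np.array([[0.8, 0.2],    # row index = value of v2
                          [0.5, 0.5]])

def ancestral_sample():
    # Sample parents before children (a topological order of the DAG).
    v1 = rng.choice(2, p=p_v1)
    v2 = rng.choice(2, p=p_v2_given_v1[v1])
    v3 = rng.choice(2, p=p_v3_given_v2[v2])
    return v1, v2, v3

samples = [ancestral_sample() for _ in range(10000)]

# Marginal inference by discarding: estimate p(v3 = 1).
p_v3_hat = np.mean([s[2] for s in samples])

# Conditional inference by rejection: estimate p(v3 = 1 | v1 = 0).
p_v3_given_v1_hat = np.mean([s[2] for s in samples if s[0] == 0])

print(p_v3_hat, p_v3_given_v1_hat)
```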


This view allows us to effectively turn a set of samples back into the original distribution from which they were drawn (of course, with some variance). Let [math]S = \left( \tilde{v}_1, \tilde{v}_2, \ldots, \tilde{v}_N \right)[/math] be a multiset of samples drawn from an unknown distribution over a discrete variable [math]v[/math] (without loss of generality). We can then recover the original distribution by constructing a table where each row is

[[math]] \begin{align} \left(v, \left. \sum_{n=1}^N \mathbb{1}(\tilde{v}_n = v) \right/ N \right), \end{align} [[/math]]

with [math]v \in \mathbb{V}[/math]. This corresponds to maximum likelihood learning without any regularization. We can think of the contents of all these rows as the parameters of the model we have built to approximate the underlying distribution from which [math]S[/math] was drawn. This view will come in handy later when we discuss using a deep neural network, instead of an explicit table, to represent a probability distribution.

A major downside of this explicit table-based approach is that the estimated probability [math]q_v(v) = 0[/math] for all [math]v \not \in S[/math]. Such an extreme value (a probability of [math]0[/math]) should not be used when we estimate these probabilities from a small number of data points, since we cannot rule out the possibility that we simply did not see a particular instance because we did not draw enough samples. Furthermore, this prevents us from properly defining a conditional probability [math]p_{v'} (v' | v)[/math], since

[[math]] \begin{align} p_{v'} (v' | v) = \frac{p_{v',v} (v', v)}{p_v(v)}. \end{align} [[/math]]

If [math]v[/math] is set such that [math]p_v(v) = 0[/math], this conditional probability is not well defined. We thus have to regularize maximum likelihood learning.

The probabilistic graphical model is a good way to abstract out the details of how individual probability distributions are implemented when studying causal inference in machine learning, as it frees us from worrying about learning until it is absolutely necessary. Learning in this context refers to inferring the underlying distribution from which data points were drawn, and the table-based approach above is the most naive one, which is not very useful in practice. We can however for now assume that this table-based approach works well, and continue studying causal inference with already-inferred conditional distributions.
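Below is a minimal sketch of the table-based estimator together with one simple regularizer, additive (Laplace) smoothing; the data, the support size and the pseudo-count are our own illustrative assumptions, not anything prescribed above.

```python
# A minimal sketch of the table-based (counting) estimator, with and
# without additive smoothing. Values and constants are illustrative.
import numpy as np

K = 4                                     # size of the discrete support
rng = np.random.default_rng(1)
S = rng.choice(K, size=20, p=[0.7, 0.2, 0.1, 0.0])  # a small sample

counts = np.bincount(S, minlength=K)

# Unregularized maximum likelihood: q(v) = count(v) / N.
q_mle = counts / counts.sum()             # unseen values get probability 0

# Additive smoothing with pseudo-count alpha > 0 removes the zeros.
alpha = 1.0
q_smooth = (counts + alpha) / (counts.sum() + alpha * K)

print(q_mle)      # may contain exact zeros
print(q_smooth)   # strictly positive everywhere
```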

Structural Causal Models

Although a directed edge in a probabilistic graphical model looks like it encodes the order in which the variables are generated, this is not necessarily so from the perspective of probability. According to Bayes' rule,

[[math]] \begin{align} p_v(v | v') = \frac{p_{v'}(v' | v) p_v(v)}{p_{v'}(v')}. \end{align} [[/math]]

This implies that we can always flip the arrow of the edge between [math]v[/math] and [math]v'[/math] without altering the joint and conditional probabilities.[Notes 1] This lack of direct relevance of the direction of each edge in [math]G[/math] to the joint probability causes a great deal of confusion when we try to use the probabilistic graphical model in the context of causal inference and reasoning. Instead, we may want to use a slightly different way to express the same generative process underlying a given probabilistic graphical model [math]G=(V,E)[/math]. We do so by writing the process of sampling a value for each variable [math]v \in V[/math], rather than its distribution, as the combination of a deterministic function [math]f_v[/math] and external (exogenous) noise [math]\epsilon_v[/math]:

[[math]] \begin{align} v \leftarrow f_{v}(\mathrm{pa}(v), \epsilon_v). \end{align} [[/math]]

This says that the value of [math]v[/math] is computed based on the values of its parent nodes [math]\mathrm{pa}(v)[/math] and external noise [math]\epsilon_v[/math] by the deterministic function [math]f_{v}[/math]. This formulation is much more explicit about the generating process than before, since the use of the function [math]f_v[/math] clearly suggests that perturbing the output of the function would not change the input to the function, although perturbing the input to the function would change the output. For instance, you can imagine that [math]f_v[/math] corresponds to pushing a book on your desk using your hand with force [math]v'[/math], and that [math]v[/math] encodes the new position of the book. [math]\epsilon_v[/math] can be an unknown level of friction caused by your earlier (but forgotten) choice of desk. Changing the force [math]v'[/math] of course affects the new position of the book [math]v[/math], together with some changing level of [math]\epsilon_v[/math], but changing the position of the book would not change the force that has yet to be applied to it. We call this representation of the generative process a structural causal model.

Similarly to the probabilistic graphical model above, we can represent any structural causal model as a triplet [math](V, F, U)[/math], where [math]V[/math] is a set of variables, [math]F[/math] is a set of corresponding functions and [math]U[/math] is a set of corresponding noise variables. Any structural causal model can be turned into a probabilistic graphical model by using the change of variables, i.e., [math]f_{v}(\mathrm{pa}(v), \cdot)[/math], and assuming a known prior distribution over the external noise variable [math]\epsilon_v[/math]. Or, more simply, we can do so because we can find a distribution over [math]v \sim f_v (\mathrm{pa}(v), \epsilon_v)[/math].

We can draw samples from any given structural causal model, just like with the probabilistic graphical models above, by ancestral sampling. Once we have samples from the joint distribution, we can infer various conditional distributions. Among these, the posterior distribution over the external noise variables is of particular interest for the purpose of counterfactual reasoning. Let [math]q(U)[/math] be the distribution over all the [math]\epsilon_v[/math]'s that led to a particular configuration [math]\hat{V}[/math]. This posterior distribution [math]q(U)[/math] can be thought of as the distribution over the external (uncontrollable) factors that led to the particular outcome. We can then answer the question of what would have happened to a target variable [math]v[/math] had some of the variables been set differently, by fixing the external factors to follow [math]q(U)[/math], setting those variables to their alternative values, and recomputing the remaining variables according to their structural equations. This corresponds to counterfactual reasoning (had I done something differently, what would have happened?)
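To make the asymmetry explicit, here is a minimal sketch (our own two-variable example with an invented mechanism [math]f_v[/math]) of a structural causal model together with a hard intervention: forcing the value of [math]v[/math] leaves [math]v'[/math] untouched, while observational draws show the two co-varying.

```python
# A minimal sketch of a two-variable structural causal model v' -> v,
# written as deterministic functions of parents plus exogenous noise.
# The mechanism and the intervention value are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)

def f_vprime(eps_vprime):
    return eps_vprime

def f_v(vprime, eps_v):
    return 2.0 * vprime + eps_v          # illustrative mechanism

def sample(intervene_v=None):
    # Ancestral sampling through the structural equations.
    eps_vprime, eps_v = rng.normal(), rng.normal()
    vprime = f_vprime(eps_vprime)
    v = intervene_v if intervene_v is not None else f_v(vprime, eps_v)
    return vprime, v

# Observational draws: v and v' co-vary through f_v.
obs = np.array([sample() for _ in range(10000)])

# Interventional draws: setting v by force leaves v' untouched, because
# perturbing the output of f_v does not change its input.
do = np.array([sample(intervene_v=5.0) for _ in range(10000)])

print(np.corrcoef(obs[:, 0], obs[:, 1])[0, 1])   # clearly non-zero
print(do[:, 0].mean(), do[:, 0].std())           # still ~N(0, 1)
```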

Learning and a generative process

Learning, in machine learning, refers to the process by which we configure a black-box predictive model to capture the predictive relationship between a set of variables. In the simplest form, let us consider having two variables: input [math]v'[/math] and output [math]v[/math]. [math]g_\theta[/math] is a predictive function that models the relationship between [math]v'[/math] and [math]v[/math] by mapping an instance of [math]v'[/math] to the corresponding [math]v[/math], i.e., [math]g_{\theta}(v')[/math] is the prediction of [math]v[/math] given [math]v'[/math]. [math]\theta[/math] is a collection of parameters that a learning algorithm configures to make [math]g[/math] as predictive of [math]v[/math] given [math]v'[/math] as possible.

Learning starts from data, which is a set of examples drawn from an unknown data generating process [math]\mathcal{G}[/math]. This data generating process can be described using either a probabilistic graphical model or, equivalently, a structural causal model. Let [math]D=\left\{ (v_1, v_1'), \ldots, (v_N, v_N') \right\}[/math] be our training dataset. The goal is then to use this dataset to infer [math]\theta[/math] that would make [math]g[/math] highly predictive of [math]v[/math] given [math]v'[/math]. There are many ways to measure how predictive [math]g[/math] is given a pair [math](v, v')[/math], but here we use the log probability:

[[math]] \begin{align} r(\theta; v,v') = \log p(v | v'; g_\theta(v')). \end{align} [[/math]]

This means that [math]g_{\theta}(v')[/math] parameterizes a conditional distribution over [math]v[/math] given [math]v'[/math], or equivalently, [math]g_{\theta}(v')[/math] outputs a conditional distribution over [math]v[/math]. If this quantity is large, it means that [math]v[/math] is highly probable under the predictive distribution produced by [math]g[/math] given [math]v'[/math]. Learning then amounts to solving the following optimization problem:

[[math]] \begin{align} \arg\max_{\theta} \frac{1}{N} \sum_{n=1}^N r(\theta; v_n, v_n'). \end{align} [[/math]]
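As one concrete instantiation (our own choice of model, not the text's prescription), the sketch below lets [math]g_\theta(v')[/math] parameterize a Bernoulli distribution over a binary [math]v[/math] and ascends the average log-likelihood by plain gradient ascent; the synthetic data and learning rate are illustrative.

```python
# A minimal sketch of the learning problem: g_theta(v') outputs the
# parameter of a Bernoulli distribution over a binary v, and we ascend
# the average log-likelihood. Data and constants are illustrative.
import numpy as np

rng = np.random.default_rng(3)

# Synthetic training set D = {(v_n, v'_n)} from a known process.
vprime = rng.normal(size=1000)
true_p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * vprime)))
v = (rng.uniform(size=1000) < true_p).astype(float)

theta = np.zeros(2)                       # [bias, weight]

def predict(theta, vprime):
    # g_theta(v'): the Bernoulli parameter of the distribution over v.
    return 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * vprime)))

for _ in range(2000):
    p = predict(theta, vprime)
    # Gradient of (1/N) sum_n log p(v_n | v'_n; theta).
    grad = np.array([np.mean(v - p), np.mean((v - p) * vprime)])
    theta += 0.5 * grad                   # plain gradient ascent

print(theta)                              # approaches [0.5, 2.0]
```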

Once learning is over, we can readily compute

[[math]] \begin{align} p (v|v') \approx p(v|v'; g_{\theta}(v')) = p(v|v'; \theta). \end{align} [[/math]]


If we assume that [math]p(v | v'; \theta)[/math] is a good approximation of [math]p(v|v')[/math], we can use the former in place of the latter without much worry. In other words, learning corresponds to figuring out the conditional distribution [math]p(v|v')[/math] from data. This data was produced from the underlying data generating process [math]\mathcal{G}[/math], which may have more variables in addition to [math]v[/math] and [math]v'[/math]. From this description of learning, it is immediately clear how this can be a replacement for the table-based approach from earlier, and that the earlier table-based approach was a special case of this learning-based approach, where [math]\theta[/math] corresponds to all the entries of the table. Once we take this learning-based approach, we free ourselves from having to explicitly construct a large table, and we can also use various regularization techniques to avoid the issue of [math]0[/math] probabilities as well as benefit from generalization. Let [math]\mathcal{G}=(V, E, P)[/math] with [math]v, v' \in V[/math] and [math]\overline{V} = V \backslash \left\{ v, v' \right\}[/math]. Then,

[[math]] \begin{align} p(v | v') = \frac{\sum_{\overline{V}} p(\{v,v'\} \cup \overline{V})}{\sum_{\{v\} \cup \overline{V}} p(\{v,v'\} \cup \overline{V})}. \end{align} [[/math]]

That is, we marginalize out all variables in [math]\overline{V}[/math] and then divide the result by the marginal probability of [math]v'[/math] to get the conditional probability of [math]v[/math] given [math]v'[/math]. On one hand, this tells us that learning allows us to recover any arbitrary conditional distribution induced by an underlying data generating process, as long as we have data produced by that same data generating process. On the other hand, this also tells us that there can be many different data generating processes that result in exactly the same conditional distribution over a given pair of variables. In fact, as long as we do not use all the variables from the generating process, there will always be ambiguities that cannot be resolved by the learning procedure laid out above. As an example, consider the following two structural causal models:

Causal Model 1.

[[math]] \begin{align} &v' \leftarrow \epsilon_{v'} \\ &v^l \leftarrow v' + a + \epsilon_{v^l} \\ &v^r \leftarrow v' + b + \epsilon_{v^r} \\ &v \leftarrow v^l + v^r + \epsilon_{v}, \end{align} [[/math]]

where [math]\epsilon_{v^l}[/math] and [math]\epsilon_{v^r}[/math] both follow standard Normal distributions ([math]\mathcal{N}(0, 1^2)[/math]). [math]\epsilon_{v}[/math] also follows a standard Normal distribution.

Causal Model 2.

[[math]] \begin{align} &v' \leftarrow \epsilon_{v'} \\ &v^c \leftarrow 2v' + a + b + \epsilon_{v^c} \\ &v \leftarrow v^c + \epsilon_{v}, \end{align} [[/math]]

where [math]\epsilon_{v^c} \sim \mathcal{N}(0, 2)[/math]. Then, [math]p(v|v') = \mathcal{N}(v; 2v'+a+b, 3)[/math] for both causal models, although the two models are very different. This ambiguity plays an important role in both causal inference and so-called out-of-distribution generalization. We will study both more carefully later in the course.
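A quick simulation (our own sanity check, with arbitrary values of [math]a[/math], [math]b[/math] and [math]v'[/math]) confirms that the two models induce matching conditional means and variances.

```python
# A quick simulation checking that the two structural causal models above
# induce the same conditional p(v | v'), here with a = 1, b = 2, v' = 0.7.
import numpy as np

rng = np.random.default_rng(4)
a, b, vprime, N = 1.0, 2.0, 0.7, 200000   # condition on a fixed v'

# Causal Model 1: v' -> v_l, v' -> v_r, (v_l, v_r) -> v.
v_l = vprime + a + rng.normal(size=N)
v_r = vprime + b + rng.normal(size=N)
v1 = v_l + v_r + rng.normal(size=N)

# Causal Model 2: v' -> v_c -> v.
v_c = 2.0 * vprime + a + b + rng.normal(scale=np.sqrt(2.0), size=N)
v2 = v_c + rng.normal(size=N)

# Both should match N(2 v' + a + b, 3): mean 4.4, variance 3.
print(v1.mean(), v1.var())
print(v2.mean(), v2.var())
```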

Latent variable models are not necessarily causal models

When we build a predictive model of [math]v[/math] given [math]v'[/math], there are variables that are left out from the original data generating process. Those unused variables may be ones for which we simply threw away the observations, because they were not of interest, or ones we never observed in the first place. These unobserved variables contribute to the complexity of [math]p(v|v')[/math] via the process of marginalization. It is not trivial to build a predictive model, or equivalently in our context a deep neural network, that outputs a highly complex predictive distribution, as such complex distributions often do not have easily parametrizable analytical forms. In these cases, there are two commonly used approaches: (1) autoregressive models and (2) latent variable models. The former relies on the chain rule of probabilities and decomposes the conditional probability into a product of coordinate-wise conditional probabilities:

[[math]] \begin{align} p(v|v'; \theta) = \prod_{i=1}^d p(v_i | v_{\lt i}, v'; \theta), \end{align} [[/math]]

where [math]v = [v_1, \ldots, v_d][/math] and [math]v_{\lt i} = [v_1, \ldots, v_{i-1}][/math]. By assuming that each coordinate-wise conditional probability has a simpler form, we can use a simple deep neural network to approximate a highly complex joint conditional probability. Because each conditional probability on the right hand side is parametrized with the same set of parameters [math]\theta[/math], it is possible to build such an autoregressive model for variable-sized observations, that is, where [math]\mathrm{dim}(v)[/math] is not fixed a priori. This approach is behind the recently successful and popular large-scale language models [1].

Unlike autoregressive modeling, latent variable models explicitly introduce an unobserved variable [math]u[/math] that represents the missing portion of the underlying data generating process. Just like the missing (unobserved) portion gave rise to a highly complex predictive distribution via the process of marginalization, the introduced (anonymous) latent variable [math]u[/math] does the same:

[[math]] \begin{align} p(v|v'; \theta) = \sum_{u} p_u(u) p(v|v', u; \theta). \end{align} [[/math]]

If [math]u[/math] is continuous, we replace [math]\sum[/math] with [math]\int[/math]. Because this marginalization is almost always intractable, it is natural to resort to sampling-based approximation. And because we are often interested in gradient-based learning, we consider sampling-based gradient approximation:

[[math]] \begin{align} \nabla \log p(v|v'; \theta) &= \sum_u \frac{p_u (u) p(v|v', u; \theta) \nabla \log p(v|v', u; \theta) } {\sum_{u'} p_u(u') p(v|v', u'; \theta)} \\ &= \sum_u p(u | v, v'; \theta) \nabla \log p(v|v', u; \theta) \\ &\approx \frac{1}{M} \sum_{m=1}^M \nabla \log p(v|v', u^m; \theta), \end{align} [[/math]]

where [math]u^m[/math] is the [math]m[/math]-th posterior sample. It is however just as challenging to compute the posterior distribution [math]p(u | v, v'; \theta)[/math] or to sample from it. It is thus more common these days to maximize a lower bound to [math]\log p(v|v'; \theta)[/math] while amortizing approximate posterior inference into a separate neural network. This is called a variational autoencoder [2].

Despite the seeming similarity, these latent variables are not necessarily related to actual variables in the data generating process: they may or may not be. If they do correspond to actual variables in the data generating process that we simply did not observe or decided not to use data from, we may be able to derive conditions under which we can identify these unobserved variables and their associated distributions from partial data alone. It is however more common and realistic to think of these latent variables as a means to improving the expressive power of a deep neural network. In other words, latent variables are useful even if there are truly only two variables, [math]v[/math] and [math]v'[/math], in the data generating process, since the true [math]p_v(v|v')[/math] may be complicated on its own. Latent variable models are thus a useful tool for modeling any complex distribution, which is why we have spent a bit of time discussing them here.
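To see the gradient identity above in action, here is a minimal sketch with a toy model of our own devising: a binary latent [math]u[/math] and a Gaussian likelihood, for which the posterior is computable by enumeration, so the exact posterior-weighted gradient can be compared against its Monte Carlo approximation.

```python
# A minimal sketch of the posterior-weighted gradient identity, for a
# discrete latent u in {0, 1} and a Gaussian likelihood
# p(v | v', u; theta) = N(v; theta[u] + v', 1). All values illustrative.
import numpy as np

rng = np.random.default_rng(5)
p_u = np.array([0.5, 0.5])                # prior over the latent u
theta = np.array([0.0, 3.0])
v, vprime = 2.0, 0.5                      # one observed pair

def loglik(u):
    return -0.5 * (v - theta[u] - vprime) ** 2 - 0.5 * np.log(2 * np.pi)

def grad_loglik(u):
    # Gradient w.r.t. theta of log p(v | v', u; theta): only theta[u] moves.
    g = np.zeros(2)
    g[u] = v - theta[u] - vprime
    return g

# Exact: posterior p(u | v, v'; theta) by enumeration, then the weighted sum.
w = p_u * np.exp([loglik(0), loglik(1)])
posterior = w / w.sum()
grad_exact = posterior[0] * grad_loglik(0) + posterior[1] * grad_loglik(1)

# Monte Carlo: average the gradients at posterior samples u^m.
us = rng.choice(2, size=100000, p=posterior)
grad_mc = np.mean([grad_loglik(u) for u in us], axis=0)

print(grad_exact, grad_mc)                # close to each other
```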

Summary

In this chapter, we have established the very foundations on which we can discuss causal inference and its relationships to machine learning in the rest of the course:

  • A brief discussion on the necessity of defining a universe over which causal inference is performed;
  • Two (equivalent) ways to define a universe, probabilistic graphical models and structural causal models;
  • What learning is, once the universe is defined.

Based on these, we begin our journey into causal inference in machine learning.

General references

Cho, Kyunghyun (2024). "A Brief Introduction to Causal Inference in Machine Learning". arXiv:2405.08793 [cs.LG].

Notes

  1. We will assume from here on that [math]p(x) \gt 0[/math] for any [math]x[/math] and [math]p[/math].

References

  1. "Language models are few-shot learners" (2020). Advances in neural information processing systems 33. 
  2. "Auto-encoding variational bayes" (2013). arXiv preprint arXiv:1312.6114.