Regression: Causal Inference can be Trivial

Assume for now that we are given a set of data points drawn from this graph [math]G[/math]:

[[math]] \begin{align} D = \left\{ (a_1, y_1, x_1), \ldots, (a_N, y_N, x_N) \right\}. \end{align} [[/math]]

For every instance, we observe all three of the action [math]a[/math], the outcome [math]y[/math], and the covariate [math]x[/math]. Furthermore, we assume that all these data points were drawn from the same fixed distribution

[[math]] \begin{align} p^*(a, y, x) = p^*(x) p^*(a|x) p^*(y|a,x) \end{align} [[/math]]

and that [math]N[/math] is large. In this case, we can use a non-parametric estimator, such as a table, a deep neural network or gradient boosted trees, to reverse-engineer each individual conditional distribution from this large dataset [math]D[/math]. This is just like what we discussed earlier in the section Learning and a generative process. Among the three distributions above, we are only interested in learning [math]p^*(x)[/math] and [math]p^*(y|a,x)[/math] from data, resulting in [math]p(x;\theta)[/math] and [math]p(y|a,x;\theta)[/math], where [math]\theta[/math] refers to the parameters of each deep neural network.[Notes 1] Once learning is over, we can use these estimates to approximate the ATE as

[[math]] \begin{align} \mathrm{ATE} &\approx \sum_{y} y \mathbb{E}_{x \sim p(x;\theta)} \left[ p(y|a=1, x; \theta) \right] - \sum_{y} y \mathbb{E}_{x \sim p(x;\theta)} \left[ p(y|a=0, x; \theta) \right] \\ &= \sum_{y} y \mathbb{E}_{x \sim p(x;\theta)} \left[ p(y|a=1, x; \theta) - p(y|a=0, x; \theta) \right]. \end{align} [[/math]]
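
To make this plug-in estimator concrete, below is a minimal sketch in Python. It assumes a binary outcome, uses scikit-learn's gradient boosted trees as the non-parametric learner for [math]p(y|a,x)[/math], and, rather than fitting a separate model [math]p(x;\theta)[/math], simply averages over the observed covariates (the empirical distribution). The function name estimate_ate and the synthetic data-generating process at the bottom are illustrative choices, not part of the notes.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def estimate_ate(X, a, y):
    """Plug-in (regression-based) ATE estimate for a binary outcome.

    X: (N, d) covariates, a: (N,) binary actions, y: (N,) binary outcomes.
    The empirical distribution of the observed x's stands in for p(x; theta).
    """
    # Fit a non-parametric model of p(y | a, x) on the observed triples.
    model = GradientBoostingClassifier().fit(np.column_stack([a, X]), y)

    # p(y=1 | a=1, x) and p(y=1 | a=0, x) for every observed x.
    # Column 1 of predict_proba assumes the positive class is labeled 1.
    p1 = model.predict_proba(np.column_stack([np.ones(len(X)), X]))[:, 1]
    p0 = model.predict_proba(np.column_stack([np.zeros(len(X)), X]))[:, 1]

    # Average the difference over x, approximating E_{x ~ p(x; theta)}[.].
    return float(np.mean(p1 - p0))

# Synthetic data drawn from p*(x) p*(a|x) p*(y|a,x), where x confounds a and y.
rng = np.random.default_rng(0)
N = 20_000
x = rng.binomial(1, 0.5, size=N)              # p*(x)
a = rng.binomial(1, 0.2 + 0.6 * x)            # p*(a|x): x makes treatment more likely
y = rng.binomial(1, 0.1 + 0.3 * a + 0.4 * x)  # p*(y|a,x): true ATE is 0.3

print(estimate_ate(x.reshape(-1, 1), a, y))   # adjusted estimate, close to 0.3
print(y[a == 1].mean() - y[a == 0].mean())    # naive difference, biased by x (about 0.54)
```

Because [math]y[/math] is binary here, [math]\sum_y y\, p(y|a,x;\theta)[/math] reduces to [math]p(y=1|a,x;\theta)[/math]; the last line illustrates why regressing [math]y[/math] on [math]a[/math] alone is not enough when [math]x[/math] is a confounder.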


There are two conditions that make this regression-based approach to causal inference work:

  • No unobserved confounder: we observe the covariate [math]x[/math] in [math]G[/math];
  • Large [math]N[/math]: we have enough examples to infer [math]p^*(y|a,x)[/math] with a low variance.

If there is any dimension of [math]x[/math] that is not observed in the dataset, it is impossible for any learner to correctly infer either [math]p^*(y|a,x)[/math] or [math]p^*(x)[/math]. “Correctly” here refers in particular to identifying the true [math]p^*(y|a,x)[/math]. This is not really important if the goal is to approximate the conditional probability [math]p^*(y|a)[/math], since we can simply drop [math]x[/math] and use the [math](a,y)[/math] pairs. It is however critical if the goal is to approximate the interventional probability [math]p^*(y|\mathrm{do}(a))[/math], because this requires us to (approximately) access [math]p^*(y|a,x)[/math].

A large [math]N[/math] is necessary for two reasons. First, the problem may be ill-posed when [math]N[/math] is small. Consider rewriting [math]p(y|a,x)[/math] as

[[math]] \begin{align} p(y|a,x) = \frac{p(y,a,x)}{p(a|x)p(x)}. \end{align} [[/math]]

This quantity has both [math]p(a|x)[/math] and [math]p(x)[/math] in its denominator. If [math]N[/math] is so small that we do not observe all possible combinations of [math](a,x)[/math] for which [math]p(x) \gt 0[/math], this conditional probability is not well-defined. This connects to one of the major assumptions in causal inference, called positivity, which we will discuss further later in the semester. The second, perhaps less important, reason is that the variance of the estimator is often inversely proportional to [math]N[/math]. That is, with a larger [math]N[/math], we can approximate [math]p^*(y|a,x)[/math] with a lower variance. The variance of this estimate is critical, as it directly translates into the variance of the ATE estimate. If the variance of the ATE is high, we cannot draw a confident conclusion about whether the treatment is effective.

In short, this section tells us that causal inference can be done trivially by statistical regression when the following conditions are satisfied:

  • There are no unobserved confounders: we observe every variable.
  • Positivity: all combinations of [math](a,x)[/math] with [math]p(x) \gt 0[/math] are observed in the data (see the sketch after this list).
  • We have enough data.
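
The following is a minimal sketch of an empirical check of the positivity condition from the second item above, assuming discrete covariates; the function check_positivity and its interface are illustrative, not from the notes.

```python
import numpy as np

def check_positivity(X, a):
    """Empirical positivity check for discrete covariates.

    X: (N, d) array of discrete covariates, a: (N,) array of actions.
    For every covariate configuration x that occurs in the data (so that the
    empirical p(x) > 0), verify that every action value also occurs together
    with that configuration; otherwise p(y | a, x) cannot be estimated there.
    Returns a list of (x configuration, missing actions) violations.
    """
    all_actions = set(np.unique(a))
    violations = []
    for x_conf in np.unique(X, axis=0):
        # Rows whose covariates exactly match this configuration.
        mask = np.all(X == x_conf, axis=1)
        missing = all_actions - set(np.unique(a[mask]))
        if missing:
            violations.append((tuple(x_conf), missing))
    return violations  # an empty list means positivity holds empirically
```

With continuous covariates, exact matching like this is not meaningful, and positivity is typically assessed through estimated propensity scores instead, which we return to later in the semester.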

Unfortunately, it is rare that all these conditions are satisfied in real life.

General references

Cho, Kyunghyun (2024). "A Brief Introduction to Causal Inference in Machine Learning". arXiv:2405.08793 [cs.LG].

Notes

  1. Although there is no reason to prefer deep neural networks over random forests or other non-parametric learners, we will largely stick to deep neural networks, as I like them more.