guide:Ffdca025ef

From Stochiki
<div class="d-none"><math>
\newcommand{\indep}[0]{\ensuremath{\perp\!\!\!\perp}}
\newcommand{\dpartial}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\abs}[1]{\left| #1 \right|}
\newcommand\autoop{\left(}
\newcommand\autocp{\right)}
\newcommand\autoob{\left[}
\newcommand\autocb{\right]}
\newcommand{\vecbr}[1]{\langle #1 \rangle}
\newcommand{\ui}{\hat{\imath}}
\newcommand{\uj}{\hat{\jmath}}
\newcommand{\uk}{\hat{k}}
\newcommand{\V}{\vec{V}}
\newcommand{\half}[1]{\frac{#1}{2}}
\newcommand{\recip}[1]{\frac{1}{#1}}
\newcommand{\invsqrt}[1]{\recip{\sqrt{#1}}}
\newcommand{\halfpi}{\half{\pi}}
\newcommand{\windbar}[2]{\Big|_{#1}^{#2}}
\newcommand{\rightinfwindbar}[0]{\Big|_{0}^\infty}
\newcommand{\leftinfwindbar}[0]{\Big|_{-\infty}^0}
\newcommand{\state}[1]{\large\protect\textcircled{\textbf{\small#1}}}
\newcommand{\shrule}{\\ \centerline{\rule{13cm}{0.4pt}}}
\newcommand{\tbra}[1]{$\bra{#1}$}
\newcommand{\tket}[1]{$\ket{#1}$}
\newcommand{\tbraket}[2]{$\braket{#1}{#2}$}
\newcommand{\infint}[0]{\int_{-\infty}^{\infty}}
\newcommand{\rightinfint}[0]{\int_0^\infty}
\newcommand{\leftinfint}[0]{\int_{-\infty}^0}
\newcommand{\wavefuncint}[1]{\infint|#1|^2}
\newcommand{\ham}[0]{\hat{H}}
\newcommand{\mathds}{\mathbb}</math></div>


Assume for now that we are given a set of data points drawn from this graph <math>G</math>:
<math display="block">
\begin{align}
  D = \left\{ (a_1, y_1, x_1), \ldots, (a_N, y_N, x_N) \right\}.
\end{align}
</math>
For every instance, we observe all three variables: the action <math>a</math>, the outcome <math>y</math> and the covariate <math>x</math>. Furthermore, we assume all these data points were drawn from the same fixed distribution
<math display="block">
\begin{align}
  p^*(a, y, x) = p^*(x) p^*(a|x) p^*(y|a,x)
\end{align}
</math>
and that <math>N</math> is large.
In this case, we can use a non-parametric estimator, such as tables, deep neural networks or gradient-boosted trees, to reverse-engineer each individual conditional distribution from this large dataset <math>D</math>. This is just like what we discussed earlier in \S[[guide:598a4e5342#sec:learning |Learning and a generative process]]. Among the three distributions above, we are only interested in learning <math>p^*(x)</math> and <math>p^*(y|a,x)</math> from data, resulting in <math>p(x;\theta)</math> and <math>p(y|a,x;\theta)</math>, where <math>\theta</math> refers to the parameters of each deep neural network.<ref group="Notes" >
  Although there is no reason to prefer deep neural networks over random forests or other non-parametric learners, we will largely stick to deep neural networks, as I like them more.
</ref>
Once learning is over, we can use it to approximate ATE as
<math display="block">
\begin{align}
  \mathrm{ATE}
  &\approx
  \sum_{y}
  y
  \mathbb{E}_{x \sim p(x;\theta)} \left[
  p(y|a=1, x; \theta)
  \right]
  -
  \sum_{y}
  y
  \mathbb{E}_{x \sim p(x;\theta)} \left[
  p(y|a=0, x; \theta)
  \right]
  \\
  &=
  \sum_{y}
  y
  \mathbb{E}_{x \sim p(x;\theta)} \left[
  p(y|a=1, x; \theta)
  -
  p(y|a=0, x; \theta)
  \right].
\end{align}
</math>
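As a concrete sketch of this plug-in estimator, here is a minimal simulation with everything binary and a tabular (count-based) estimate of each conditional, in the spirit of the non-parametric estimators mentioned above. The data-generating probabilities below are hypothetical choices for illustration, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary data drawn from p*(x) p*(a|x) p*(y|a,x).
N = 200_000
x = rng.binomial(1, 0.5, N)
a = rng.binomial(1, 0.2 + 0.6 * x)            # treatment depends on the confounder x
y = rng.binomial(1, 0.1 + 0.3 * a + 0.4 * x)  # the true effect of a on y is +0.3

def q(a_val, x_val):
    """Tabular (empirical-frequency) estimate of p(y=1 | a, x)."""
    mask = (a == a_val) & (x == x_val)
    return y[mask].mean()

# ATE ~= E_{x ~ p(x)}[ p(y=1|a=1,x) - p(y=1|a=0,x) ],
# with the expectation over x taken under the empirical p(x).
p_x1 = x.mean()
ate = (1 - p_x1) * (q(1, 0) - q(0, 0)) + p_x1 * (q(1, 1) - q(0, 1))

# The naive contrast E[y|a=1] - E[y|a=0] ignores x and is confounded.
naive = y[a == 1].mean() - y[a == 0].mean()
```

Under this generating process the regression-based estimate lands near the true effect of <math>0.3</math>, while the naive contrast, which conditions on <math>a</math> instead of intervening on it, is inflated by the confounder.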
There are two conditions that make this regression-based approach to causal inference work:
<ul><li> ''No unobserved confounders'': we observe the covariate <math>x</math> in <math>G</math>;
  </li>
<li> Large <math>N</math>: we have enough examples to infer <math>p^*(y|a,x)</math> with a low variance.
</li>
</ul>
If any dimension of <math>x</math> is not observed in the dataset, it is impossible for any learner to infer either <math>p^*(y|a,x)</math> or <math>p^*(x)</math> correctly. “Correctly” here refers in particular to identifying the true <math>p^*(y|a,x)</math>. This does not matter if the goal is to approximate the conditional probability <math>p^*(y|a)</math>, since we can simply drop <math>x</math> and use <math>(a,y)</math> pairs. It is, however, critical if the goal is to approximate the interventional probability <math>p^*(y|\mathrm{do}(a))</math>, because this requires (approximate) access to <math>p^*(y|a,x)</math>.
Large <math>N</math> is necessary for two reasons. First, the problem may be ill-posed when <math>N</math> is small. Consider rewriting <math>p(y|a,x)</math> as
<math display="block">
\begin{align}
  p(y|a,x) = \frac{p(y,a,x)}{p(a|x)p(x)}.
\end{align}
</math>
This quantity has both <math>p(a|x)</math> and <math>p(x)</math> in the denominator. If <math>N</math> is so small that we do not observe all possible combinations of <math>(a,x)</math> for which <math>p(x)  >  0</math>, this conditional probability is not well-defined. This connects to one of the major assumptions in causal inference, called ''positivity'', which we will discuss further later in the semester.
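A small sketch of this failure mode, using a hypothetical assignment rule in which units with <math>x=1</math> are always treated: the cell <math>(a=0, x=1)</math> then stays empty no matter how large <math>N</math> grows, and the tabular estimate of <math>p(y|a,x)</math> is simply undefined there.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x = rng.binomial(1, 0.5, N)
# Positivity violated: units with x = 1 are always treated, so p(a=0|x=1) = 0.
a = np.where(x == 1, 1, rng.binomial(1, 0.5, N))
y = rng.binomial(1, 0.2 + 0.5 * a * x)  # any outcome model works for this point

def p_y1_given(a_val, x_val):
    """Empirical p(y=1 | a, x); None when the (a, x) cell is unobserved."""
    mask = (a == a_val) & (x == x_val)
    if mask.sum() == 0:
        return None  # 0/0: the conditional probability is not well-defined
    return y[mask].mean()

undefined_cell = p_y1_given(0, 1)  # no data with a = 0 and x = 1
ok_cell = p_y1_given(1, 1)
```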
The second, perhaps less important, reason is that the variance of the estimator is often inversely proportional to <math>N</math>: with larger <math>N</math>, we can approximate <math>p^*(y|a,x)</math> with lower variance. The variance of this estimate is critical, as it directly translates into the variance of the ATE estimate. If the variance of the ATE estimate is high, we cannot draw a confident conclusion about whether the treatment is effective.
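To see this numerically: for one fixed <math>(a,x)</math> cell, the tabular estimate of <math>p^*(y=1|a,x)</math> is a sample mean, whose variance <math>p(1-p)/n</math> shrinks linearly in the cell's sample size. The cell probability <math>0.7</math> below is an arbitrary choice for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.7  # hypothetical p*(y=1|a,x) for one fixed (a, x) cell

def estimator_variance(n, trials=5000):
    """Variance of the empirical-frequency estimate across repeated samples of size n."""
    estimates = rng.binomial(n, p_true, size=trials) / n
    return estimates.var()

v_small = estimator_variance(100)   # close to 0.7 * 0.3 / 100
v_large = estimator_variance(1000)  # close to 0.7 * 0.3 / 1000
```

Ten times the data cuts the estimator's variance by roughly a factor of ten, and that reduction carries over to the variance of the resulting ATE estimate.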
This section tells us that causal inference can be done trivially by statistical regression when the following conditions are satisfied:
<ul><li> '''No unobserved confounders''': we observe every variable.
  </li>
<li> '''Positivity''': all possible combinations of <math>(a,y,x)</math> are observed in the data.
  </li>
<li> We have enough data.
</li>
</ul>
Unfortunately, it is rare that all these conditions are satisfied in real life.
==General references==
{{cite arXiv|last1=Cho|first1=Kyunghyun|year=2024|title=A Brief Introduction to  Causal Inference in Machine Learning|eprint=2405.08793|class=cs.LG}}
==Notes==
{{Reflist|group=Notes}}

Latest revision as of 00:18, 19 May 2024
