<div class="d-none"><math> | |||
\newcommand{\indep}[0]{\ensuremath{\perp\!\!\!\perp}} | |||
\newcommand{\dpartial}[2]{\frac{\partial #1}{\partial #2}} | |||
\newcommand{\abs}[1]{\left| #1 \right|} | |||
\newcommand\autoop{\left(} | |||
\newcommand\autocp{\right)} | |||
\newcommand\autoob{\left[} | |||
\newcommand\autocb{\right]} | |||
\newcommand{\vecbr}[1]{\langle #1 \rangle} | |||
\newcommand{\ui}{\hat{\imath}} | |||
\newcommand{\uj}{\hat{\jmath}} | |||
\newcommand{\uk}{\hat{k}} | |||
\newcommand{\V}{\vec{V}} | |||
\newcommand{\half}[1]{\frac{#1}{2}} | |||
\newcommand{\recip}[1]{\frac{1}{#1}} | |||
\newcommand{\invsqrt}[1]{\recip{\sqrt{#1}}} | |||
\newcommand{\halfpi}{\half{\pi}} | |||
\newcommand{\windbar}[2]{\Big|_{#1}^{#2}} | |||
\newcommand{\rightinfwindbar}[0]{\Big|_{0}^\infty} | |||
\newcommand{\leftinfwindbar}[0]{\Big|_{-\infty}^0} | |||
\newcommand{\state}[1]{\large\protect\textcircled{\textbf{\small#1}}} | |||
\newcommand{\shrule}{\\ \centerline{\rule{13cm}{0.4pt}}} | |||
\newcommand{\tbra}[1]{$\bra{#1}$} | |||
\newcommand{\tket}[1]{$\ket{#1}$} | |||
\newcommand{\tbraket}[2]{$\braket{#1}{#2}$}
\newcommand{\infint}[0]{\int_{-\infty}^{\infty}}
\newcommand{\rightinfint}[0]{\int_0^\infty}
\newcommand{\leftinfint}[0]{\int_{-\infty}^0}
\newcommand{\wavefuncint}[1]{\infint|#1|^2}
\newcommand{\ham}[0]{\hat{H}}
\newcommand{\mathds}{\mathbb}</math></div>
Besides curiosity, the goal of causal inference is to use the inferred causal relationship for better outcomes in the future. Once we estimate <math>\mathbb{E}_G\left[ y | \mathrm{do}(a)\right]</math> using an RCT, we would simply choose the following action for all future subjects:
<math display="block">
\begin{align}
\hat{a} = \arg\max_{a \in \mathcal{A}} \mathbb{E}_G\left[ y | \mathrm{do}(a)\right],
\end{align}
</math>
where <math>\mathcal{A}</math> is the set of all possible actions. This approach, however, has one downside: we had to give an ineffective treatment (e.g. a placebo) to many trial participants, who thereby lost the opportunity to have a better outcome (e.g. protection against the infectious disease).
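As a minimal illustration of this decision rule (the action names and numbers below are hypothetical, not taken from any actual trial):
<syntaxhighlight lang="python">
# Hypothetical estimates of E_G[y | do(a)] obtained from a completed RCT;
# the action names and values are made up purely for illustration.
effect_estimates = {"placebo": 0.12, "low_dose": 0.43, "high_dose": 0.57}

# Deploy the single action with the largest estimated interventional outcome.
best_action = max(effect_estimates, key=effect_estimates.get)
print(best_action)  # -> high_dose
</syntaxhighlight>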
Consider an RCT where subjects arrive and are tested serially, that is, one at a time.
If <math>t</math> subjects have participated in the RCT so far, we have
<math display="block">
\begin{align}
D = \left\{ (a_1, y_1), \ldots, (a_t, y_t) \right\}.
\end{align}
</math>
Based on <math>D</math>, we can estimate the outcome of each action by
<math display="block">
\begin{align}
\hat{y}_t(a)
=
\frac{\sum_{t'=1}^t
\mathds{1}(a_{t'} = a)
y_{t'}}
{\sum_{t''=1}^t \mathds{1}(a_{t''} = a)}.
\end{align}
</math>
This estimate would be unbiased (correct on average) if every <math>a_{t'}</math> was drawn, independently of the covariate <math>x_{t'}</math>, from one and the same assignment distribution <math>q(a)</math>.<ref group="Notes" >
This is another constraint on an RCT: every subject must be assigned according to the same assignment policy <math>q(a)</math>.
</ref>
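This estimator is simply a per-action average of the observed outcomes. A minimal sketch, assuming <math>D</math> is stored as a list of <code>(action, outcome)</code> pairs (the function name is ours):
<syntaxhighlight lang="python">
def estimate_outcome(D, a):
    """hat{y}_t(a): average outcome over the pairs (a_1, y_1), ..., (a_t, y_t)
    in which action a was assigned; returns None if a was never assigned."""
    outcomes = [y for a_prime, y in D if a_prime == a]
    return sum(outcomes) / len(outcomes) if outcomes else None
</syntaxhighlight>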
More generally, the bias (the degree of incorrectness) shrinks as the following fraction grows:
<math display="block">
\begin{align}
\epsilon_{\leq t} = \frac{1}{t}
\sum_{t'=1}^t \mathds{1}(a_{t'} \text{ was drawn independently of } x_{t'}).
\end{align}
</math>
If <math>\epsilon_{\leq t} = 1</math>, the estimate is unbiased, corresponding to causal inference.
If <math>\epsilon_{\leq t} = 0</math>, what we have is not interventional but conditional.
Assuming <math>\epsilon_{\leq t} \gg 0</math> and <math>t \gg 1</math>, we have a reasonable causal estimate of the outcome <math>y</math> given each action <math>a</math>. Then, in order to maximize the outcome for the next subject (<math>x_{t+1}</math>), we want to assign them to an action sampled from the following Boltzmann distribution:
<math display="block"> | |||
\begin{align} | |||
q_t(a) = | |||
\frac{\exp\left( \frac{1}{\beta_t} \hat{y}_t(a) \right)} | |||
{\sum_{a' \in \mathcal{A}} \exp\left( \frac{1}{\beta_t} \hat{y}_t(a') \right)}, | |||
\end{align} | |||
</math> | |||
where <math>\beta_t \in [0, \infty)</math> is a temperature parameter.
When <math>\beta_t \to \infty</math> (a high temperature), this is equivalent to sampling the action <math>a_{t+1}</math> from a uniform distribution, which
implies that we do not trust the causal estimates of the outcomes, perhaps due to small <math>t</math>.
On the other hand, when <math>\beta_t \to 0</math> (a low temperature), the best-outcome action would be selected, as
<math display="block"> | |||
\begin{align} | |||
q_t(a) | |||
=_{\beta \to \infty} | |||
\begin{cases} | |||
1, &\text{ if } \hat{y}(a) = \max_{a'} \hat{y}_t(a') \\ | |||
0, &\text{ otherwise} | |||
\end{cases} | |||
\end{align} | |||
</math> | |||
assuming there is a unique action that leads to the best outcome. In this case, we fully trust our causal estimates of the outcomes and simply choose the best action accordingly, which corresponds to ''outcome maximization''.
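A small sketch of sampling from this Boltzmann (softmax) distribution, assuming <math>\beta_t > 0</math>; the function name and the max-subtraction trick for numerical stability are ours:
<syntaxhighlight lang="python">
import math
import random

def boltzmann_sample(y_hat, beta):
    """Sample an action with probability proportional to exp(y_hat[a] / beta).

    y_hat : dict mapping each action a to its current estimate of the outcome.
    beta  : temperature; a very large beta approaches uniform sampling, while a
            small beta concentrates on the action with the largest estimate.
    """
    actions = list(y_hat)
    scaled = [y_hat[a] / beta for a in actions]
    m = max(scaled)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(actions, weights=weights, k=1)[0]
</syntaxhighlight>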
We now combine these two in order to trade off causal inference against outcome maximization. At time <math>t</math>, we sample the action <math>a_t</math> for a new participant from
<math display="block"> | |||
\begin{align} | |||
\label{eq:bandit_policy} | |||
q_t(a) | |||
= | |||
\epsilon_t \frac{1}{|\mathcal{A}|} | |||
+ | |||
(1-\epsilon_t) | |||
\frac{\exp\left( \frac{1}{\beta_t} \hat{y}_t(a) \right)} | |||
{\sum_{a' \in \mathcal{A}} \exp\left( \frac{1}{\beta_t} \hat{y}_t(a') \right)}, | |||
\end{align} | |||
</math> | |||
where <math>\epsilon_t \in [0, 1]</math> and <math>|\mathcal{A}|</math> is the number of all possible actions.
We can sample from this mixture distribution in two steps (a code sketch follows the list):
<ul><li> Sample <math>e_t \in \left\{0,1\right\}</math> from a Bernoulli distribution with mean <math>\epsilon_t</math>.
</li>
<li> Check <math>e_t</math>:
<ul style{{=}}"list-style-type:lower-roman"><li>If <math>e_t=1</math>, we choose <math>a_t</math> uniformly at random.</li>
<li>If <math>e_t=0</math>, we sample <math>a_t</math> from the Boltzmann distribution above, i.e. with probability proportional to <math>\exp\left( \frac{1}{\beta_t} \hat{y}_t(a) \right)</math>.</li>
</ul>
</li>
</ul>
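Putting the two steps together, a sketch of drawing one assignment from the mixture policy, reusing the hypothetical <code>boltzmann_sample</code> from the earlier sketch:
<syntaxhighlight lang="python">
import random

def assign_action(y_hat, epsilon, beta):
    """Draw a_t from q_t(a) = epsilon * (1/|A|) + (1 - epsilon) * Boltzmann.

    Returns the chosen action together with the exploration indicator e_t,
    which is worth logging for the bias check discussed below.
    """
    explored = random.random() < epsilon        # e_t ~ Bernoulli(epsilon)
    if explored:
        action = random.choice(list(y_hat))     # explore: uniform over all actions
    else:
        action = boltzmann_sample(y_hat, beta)  # exploit: Boltzmann over estimates
    return action, int(explored)
</syntaxhighlight>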
As we continue the RCT according to this assignment policy, we assign participants increasingly more to actions with better outcomes, because our causal estimate improves over time. We nevertheless ensure that participants are randomly assigned to actions at a reasonable rate <math>\epsilon_t</math>, in order to estimate the causal quantity rather than the statistical quantity. It is common to start with <math>\epsilon_t \approx 1</math> and gradually anneal it toward <math>0</math>, as we want to ensure we quickly estimate the correct causal effect early on. It is also usual to start with a large <math>\beta_t \geq 1</math> and anneal it toward <math>0</math>, as the early estimate of the causal effect is often not trustworthy.
When the first component (the uniform distribution) is selected, we say that we are exploring; otherwise, we say we are exploiting. <math>\epsilon_t</math> is a hyperparameter that allows us to trade off exploration against exploitation, while <math>\beta_t</math> expresses how much we trust the current estimate of the causal effect.
This whole procedure is a variant of EXP-3, which stands for the exponential-weight algorithm for exploration and exploitation<ref name="allesiardo2017non">{{cite journal|last1=Allesiardo|first1=Robin|last2=Féraud|first2=Raphaël|last3=Maillard|first3=Odalric-Ambrym|journal=International Journal of Data Science and Analytics|year=2017|title=The non-stationary stochastic multi-armed bandit problem|volume=3|publisher=Springer}}</ref> and is used to solve the ''multi-armed bandit problem''. However, with an appropriate choice of <math>\epsilon_t</math> and <math>\beta_t</math>, we obtain, as a special case, an RCT that can estimate the causal effect of the action on the outcome. For instance, we can use the following schedules for these two hyperparameters:
<math display="block"> | |||
\begin{align} | |||
\epsilon_t = \beta_t = | |||
\begin{cases} | |||
1,& \text{ if } t < T \\ | |||
0,& \text{ if } t \geq T | |||
\end{cases} | |||
\end{align} | |||
</math> | |||
with a large <math>T \gg 1</math>.
We can however choose smoother schedules for <math>\epsilon_t</math> and <math>\beta_t</math> to strike a better compromise between causal inference and outcome maximization, so as to avoid assigning too many subjects to a placebo (that is, ineffective) group.
The choice of <math>\epsilon_t</math> and <math>\beta_t</math> also affects the bias-variance trade-off. Although this is beyond the scope of this course, it is easy to guess that higher <math>\epsilon_t</math> and higher <math>\beta_t</math> lead to a higher variance but a lower bias, and vice versa.
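One way such schedules could be written down (the step schedule above, plus a purely illustrative smooth alternative; note that <math>\beta_t = 0</math> is the greedy limit and would need to be special-cased in a sampler like the one sketched earlier):
<syntaxhighlight lang="python">
import math

def step_schedule(t, T):
    """The step schedule above: epsilon_t = beta_t = 1 while t < T (a plain RCT),
    and 0 afterwards (pure, greedy outcome maximization)."""
    return 1.0 if t < T else 0.0

def smooth_schedule(t, floor=0.05, rate=1e-3):
    """An illustrative smoother alternative: decay toward a small positive floor
    instead of all the way to 0, so that some exploration always remains."""
    return floor + (1.0 - floor) * math.exp(-rate * t)
</syntaxhighlight>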
'''A never-ending trial.'''
A major assumption that must be satisfied for an RCT is stationarity. Both the causal distribution <math>p^*(y|a,x)</math> and the covariate distribution <math>p^*(x)</math> must be stationary, in that they do not change during or after the trial. Especially when these distributions drift after the trial, that is, after running the trial with <math>T</math> participants, our causal estimate as well as any decision based on it becomes less accurate. We see such instances often in the real world. For instance, as viruses mutate, the effectiveness of vaccination wanes over time, even though the very same vaccine was found to be effective in an earlier RCT.
When the underlying conditional distributions are all stationary, we do not need to keep the entire set of collected data points in order to compute the approximate causal effect, because
<math display="block">
\begin{align}
\hat{y}_t(a)
=
\frac{\sum_{t'=1}^t
\mathds{1}(a_{t'} = a)
y_{t'}}
{\sum_{t''=1}^t \mathds{1}(a_{t''} = a)}
=
\frac{\sum_{t'=1}^{t-1}
\mathds{1}(a_{t'} = a)}{\sum_{t'=1}^t
\mathds{1}(a_{t'} = a)} \hat{y}_{t-1}(a)
+
\frac{
\mathds{1}(a_{t} = a)}{\sum_{t'=1}^t
\mathds{1}(a_{t'} = a)}
y_{t}.
\end{align}
</math>
In other words, we can keep just a single scalar <math>\hat{y}_t(a)</math> per action to maintain the causal effect estimate over time.
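A sketch of this constant-memory update, keeping one running mean and one count per action (the variable and function names are ours):
<syntaxhighlight lang="python">
def update_running_mean(y_hat, counts, a_t, y_t):
    """Fold the t-th observation (a_t, y_t) into the per-action running means.

    y_hat and counts are dicts keyed by action; only the assigned action changes.
    """
    counts[a_t] = counts.get(a_t, 0) + 1
    n = counts[a_t]
    prev = y_hat.get(a_t, 0.0)
    # hat{y}_t(a) = ((n-1)/n) * hat{y}_{t-1}(a) + (1/n) * y_t, as in the recursion above.
    y_hat[a_t] = (n - 1) / n * prev + y_t / n
</syntaxhighlight>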
We can tweak this recursive formula to cope with slowly drifting underlying distributions by emphasizing recent data points more than older ones. This can be implemented with an exponential moving average,<ref group="Notes" >
The ‘exponential-weight’ in EXP-3 comes from this choice.
</ref>
as follows:
<math display="block"> | |||
\begin{align} | |||
\hat{y}_t(a) | |||
= | |||
\begin{cases} | |||
\eta \hat{y}_{t-1}(a) | |||
+ | |||
(1-\eta) | |||
y_t,& \text{ if }a_t = a | |||
\\ | |||
\hat{y}_{t-1}(a),& \text{ if }a_t \neq a | |||
\end{cases} | |||
\end{align} | |||
</math> | |||
where <math>\eta \in [0, 1)</math>.
As <math>\eta \to 0</math>, we consider an increasingly smaller window into the past, trusting only the most recent outcome observed for a given action. On the other hand, as <math>\eta \to 1</math>, we discount what happens now and rely instead on what we already believe the causal effect of an action <math>a</math> to be.
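The corresponding sketch with an exponential moving average in place of the exact running mean (the function name and the default <math>\eta</math> are ours):
<syntaxhighlight lang="python">
def update_ema(y_hat, a_t, y_t, eta=0.9):
    """Exponential moving average: only the assigned action's estimate changes."""
    prev = y_hat.get(a_t, y_t)  # the first observation initializes the estimate
    y_hat[a_t] = eta * prev + (1.0 - eta) * y_t
</syntaxhighlight>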
By keeping track of the causal effect with an exponential moving average, we can run the trial continuously. When doing so, we have to be careful in choosing the schedules of <math>\epsilon_t</math> and <math>\beta_t</math>. Unlike before, <math>\epsilon_t</math> should not be monotonically annealed toward <math>0</math>, as earlier exploration may not be useful later once the underlying distributions drift. <math>\beta_t</math> should not be annealed toward <math>0</math> either, as the estimate of the causal effect at any moment cannot be fully trusted due to unanticipated drift of the underlying distributions. It is thus sensible to set both <math>\beta_t</math> and <math>\epsilon_t</math> to reasonably large constants.
'''Checking the bias.'''
At time <math>t</math>, we have accumulated
<math display="block">
\begin{align}
D_t = \left\{ (e_1, a_1, y_1), \ldots, (e_t, a_t, y_t) \right\},
\end{align}
</math>
where <math>e_{t'}</math> indicates whether we explored (<math>1</math>) or exploited (<math>0</math>) at time <math>t'</math>.
We can then obtain an unbiased estimate of the causal effect of <math>a</math> on <math>y</math> by only using the triplets <math>(e,a,y)</math> for which <math>e=1</math>. That is,
<math display="block"> | |||
\begin{align} | |||
\tilde{y}_t(a) | |||
= | |||
\frac{\sum_{t'=1}^t | |||
\mathds{1}(e_{t'} = 1) | |||
\mathds{1}(a_{t'} = a) | |||
y_{t'}} | |||
{\sum_{t''=1}^t | |||
\mathds{1}(e_{t''} = 1) | |||
\mathds{1}(a_{t''} = a) | |||
}. | |||
\end{align} | |||
</math> | |||
This estimate is unbiased, unlike <math>\hat{y}_t(a)</math> from EXP-3 above, since we only used the action-outcome pairs for which the action was selected randomly from the same uniform distribution. Assuming a large <math>t</math> (so as to minimize the impact of high variance), we can then compute the (noisy) bias of the causal effect estimated and used by EXP-3 above as
<math display="block"> | |||
\begin{align} | |||
b_t = (\hat{y}_t(a) - \tilde{y}_t(a))^2. | |||
\end{align} | |||
</math> | |||
Of course, this estimate of the bias is noisy, especially when <math>t</math> is small, since the effective number of data points used to estimate <math>\tilde{y}_t</math> is on average
<math display="block">
\begin{align}
\sum_{t'=1}^t \epsilon_{t'} \leq t.
\end{align}
</math>
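A sketch of this bias check, assuming <math>D_t</math> is stored as a list of <code>(e, a, y)</code> triplets as above (the function names are ours):
<syntaxhighlight lang="python">
def exploration_only_estimate(D, a):
    """tilde{y}_t(a): average outcome over trials where action a was assigned
    while exploring (e = 1)."""
    outcomes = [y for e, a_prime, y in D if e == 1 and a_prime == a]
    return sum(outcomes) / len(outcomes) if outcomes else None

def estimated_squared_bias(y_hat_a, D, a):
    """b_t = (hat{y}_t(a) - tilde{y}_t(a))^2; noisy when exploration rounds are few."""
    tilde = exploration_only_estimate(D, a)
    return None if tilde is None else (y_hat_a - tilde) ** 2
</syntaxhighlight>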
==General references== | |||
{{cite arXiv|last1=Cho|first1=Kyunghyun|year=2024|title=A Brief Introduction to Causal Inference in Machine Learning|eprint=2405.08793|class=cs.LG}} | |||
==Notes== | |||
{{Reflist|group=Notes}} | |||
==References== | |||
{{reflist}}