Causal Inference vs. Outcome Maximization

[math] \newcommand{\indep}[0]{\ensuremath{\perp\!\!\!\perp}} \newcommand{\dpartial}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\abs}[1]{\left| #1 \right|} \newcommand\autoop{\left(} \newcommand\autocp{\right)} \newcommand\autoob{\left[} \newcommand\autocb{\right]} \newcommand{\vecbr}[1]{\langle #1 \rangle} \newcommand{\ui}{\hat{\imath}} \newcommand{\uj}{\hat{\jmath}} \newcommand{\uk}{\hat{k}} \newcommand{\V}{\vec{V}} \newcommand{\half}[1]{\frac{#1}{2}} \newcommand{\recip}[1]{\frac{1}{#1}} \newcommand{\invsqrt}[1]{\recip{\sqrt{#1}}} \newcommand{\halfpi}{\half{\pi}} \newcommand{\windbar}[2]{\Big|_{#1}^{#2}} \newcommand{\rightinfwindbar}[0]{\Big|_{0}^\infty} \newcommand{\leftinfwindbar}[0]{\Big|_{-\infty}^0} \newcommand{\state}[1]{\large\protect\textcircled{\textbf{\small#1}}} \newcommand{\shrule}{\\ \centerline{\rule{13cm}{0.4pt}}} \newcommand{\tbra}[1]{$\bra{#1}$} \newcommand{\tket}[1]{$\ket{#1}$} \newcommand{\tbraket}[2]{$\braket{1}{2}$} \newcommand{\infint}[0]{\int_{-\infty}^{\infty}} \newcommand{\rightinfint}[0]{\int_0^\infty} \newcommand{\leftinfint}[0]{\int_{-\infty}^0} \newcommand{\wavefuncint}[1]{\infint|#1|^2} \newcommand{\ham}[0]{\hat{H}} \newcommand{\mathds}{\mathbb}[/math]

Besides curiosity, the goal of causal inference is to use the inferred causal relationship for better outcomes in the future. Once we estimate [math]\mathbb{E}_G\left[ y | \mathrm{do}(a)\right][/math] using an RCT, we would simply choose the following action for all future subjects:

[[math]] \begin{align} \hat{a} = \arg\max_{a \in \mathcal{A}} \mathbb{E}_G\left[ y | \mathrm{do}(a)\right], \end{align} [[/math]]

where [math]\mathcal{A}[/math] is a set of all possible actions. This approach, however, has one downside: we had to give an incorrect treatment (e.g. a placebo) to many trial participants, who thereby lost the opportunity to obtain a better outcome (e.g. protection against the infectious disease). Consider an RCT where subjects arrive and are tested serially, that is, one at a time. If [math]t[/math] subjects have participated in the RCT so far, we have

[[math]] \begin{align} D = \left\{ (a_1, y_1), \ldots, (a_t, y_t) \right\}. \end{align} [[/math]]

Based on [math]D[/math], we can estimate the outcome of each action by

[[math]] \begin{align} \hat{y}_t(a) = \frac{\sum_{t'=1}^t \mathds{1}(a_{t'} = a) y_{t'}} {\sum_{t''=1}^t \mathds{1}(a_{t''} = a)}. \end{align} [[/math]]
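For concreteness, here is a minimal Python sketch of how this per-action estimate could be computed from the accumulated pairs; the function name, the data layout, and the example values are our own illustrative choices, not part of the text.

```python
from collections import defaultdict

def estimate_outcomes(D):
    """Per-action empirical mean outcome, i.e. the estimate y_hat_t(a).

    D is a list of (action, outcome) pairs collected so far.
    """
    totals = defaultdict(float)  # running sum of outcomes per action
    counts = defaultdict(int)    # number of subjects assigned to each action
    for a, y in D:
        totals[a] += y
        counts[a] += 1
    return {a: totals[a] / counts[a] for a in counts}

# Example: estimate_outcomes([("vaccine", 1), ("placebo", 0), ("vaccine", 0)])
# returns {"vaccine": 0.5, "placebo": 0.0}.
```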

This estimate [math]\hat{y}_t(a)[/math] would be unbiased (correct on average) if every [math]a_{t'}[/math] was drawn, independently of the covariate [math]x_{t'}[/math], from the same action distribution [math]q(a)[/math].[Notes 1] More generally, the bias (the degree of incorrectness) would be proportional to [math]1 - \epsilon_{\leq t-1}[/math], where

[[math]] \begin{align} \epsilon_{\leq t-1} = \frac{1}{t} \sum_{t'=1}^t \mathds{1}(a_{t'} \text{ was drawn independently of } x_{t'}). \end{align} [[/math]]

If [math]\epsilon_{\leq t-1} = 1[/math], the estimate is unbiased, corresponding to causal inference. If [math]\epsilon_{\leq t-1} = 0[/math], what we have is not an interventional estimate but a conditional (purely statistical) one. Assuming [math]\epsilon_{\leq t-1} \gg 0[/math] and [math]t \gg 1[/math], we have a reasonable causal estimate of the outcome [math]y[/math] given each action [math]a[/math]. Then, in order to maximize the outcome for the next subject ([math]x_{t+1}[/math]), we want to assign them to an action sampled from the following Boltzmann (softmax) distribution:

[[math]] \begin{align} q_t(a) = \frac{\exp\left( \frac{1}{\beta_t} \hat{y}_t(a) \right)} {\sum_{a' \in \mathcal{A}} \exp\left( \frac{1}{\beta_t} \hat{y}_t(a') \right)}, \end{align} [[/math]]

where [math]\beta_t \in [0, \infty)[/math] is a temperature parameter. When [math]\beta_t \to \infty[/math] (a high temperature), this is equivalent to sampling the action [math]a_{t+1}[/math] from a uniform distribution, which implies that we do not trust the causal estimates of the outcomes, perhaps due to small [math]t[/math]. On the other hand, when [math]\beta_t \to 0[/math] (a low temperature), the best-outcome action would be selected, as

[[math]] \begin{align} q_t(a) =_{\beta_t \to 0} \begin{cases} 1, &\text{ if } \hat{y}_t(a) = \max_{a'} \hat{y}_t(a') \\ 0, &\text{ otherwise} \end{cases} \end{align} [[/math]]

assuming there is a unique action that leads to the best outcome. In this case, we fully trust our causal estimates of the outcomes and simply choose the best action accordingly, which corresponds to outcome maximization. We now combine these two in order to trade off causal inference against outcome maximization. At time [math]t[/math], we sample the action [math]a_t[/math] for a new participant from

[[math]] \begin{align} \label{eq:bandit_policy} q_t(a) = \epsilon_t \frac{1}{|\mathcal{A}|} + (1-\epsilon_t) \frac{\exp\left( \frac{1}{\beta_t} \hat{y}_t(a) \right)} {\sum_{a' \in \mathcal{A}} \exp\left( \frac{1}{\beta_t} \hat{y}_t(a') \right)}, \end{align} [[/math]]

where [math]\epsilon_t \in [0, 1][/math] and [math]|\mathcal{A}|[/math] is the number of all possible actions. We can sample from this mixture distribution in two steps (see the code sketch after the list below):

  • Sample [math]e_t \in \left\{0,1\right\}[/math] from a Bernoulli distribution of mean [math]\epsilon_t[/math].
  • Check [math]e_t[/math]
    • If [math]e_t=1[/math], we uniformly choose [math]a_t[/math] at random.
    • If [math]e_t=0[/math], we sample [math]a_t[/math] from the Boltzmann distribution above, i.e. with probability proportional to [math]\exp\left(\hat{y}_t(a)/\beta_t\right)[/math].
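The following is a minimal Python sketch of this two-step procedure, assuming the current estimates [math]\hat{y}_t(a)[/math] are kept in a dictionary; the function and variable names are illustrative only.

```python
import math
import random

def sample_action(y_hat, epsilon, beta):
    """Sample a_t from the mixture policy q_t defined above.

    y_hat   : dict mapping each action to its current estimate y_hat_t(a)
    epsilon : probability of exploring (uniform random assignment)
    beta    : temperature of the Boltzmann (softmax) component
    Returns the chosen action together with the exploration indicator e_t.
    """
    actions = list(y_hat)
    # Step 1: draw e_t from a Bernoulli distribution with mean epsilon.
    e = 1 if random.random() < epsilon else 0
    if e == 1:
        # Explore: uniform assignment, as in a classical RCT.
        return random.choice(actions), e
    if beta <= 0.0:
        # beta -> 0 corresponds to greedily picking the best estimated action.
        return max(actions, key=y_hat.get), e
    # Exploit: Boltzmann distribution over the current estimates.
    logits = [y_hat[a] / beta for a in actions]
    m = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - m) for l in logits]
    return random.choices(actions, weights=weights)[0], e

# Example: sample_action({"vaccine": 0.7, "placebo": 0.3}, epsilon=0.2, beta=0.1)
```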

As we continue the RCT according to this assignment policy, we assign participants increasingly to actions with better outcomes, because our causal estimate improves over time. We nevertheless ensure that participants are randomly assigned to actions at a reasonable rate [math]\epsilon_t[/math], in order to keep estimating the causal quantity rather than the statistical quantity. It is common to start with a large [math]\epsilon_t \approx 1[/math] and gradually anneal it toward [math]0[/math], as we want to ensure we quickly estimate the correct causal effect early on. It is also usual to start with a large [math]\beta_t \geq 1[/math] and anneal it toward [math]0[/math], as the early estimate of the causal effect is often not trustworthy. When the first component (the uniform distribution) is selected, we say that we are exploring; otherwise, we say that we are exploiting. [math]\epsilon_t[/math] is a hyperparameter that allows us to compromise between exploration and exploitation, while [math]\beta_t[/math] is how we express our belief in the current estimate of the causal effect. This whole procedure is a variant of EXP-3, which stands for the exponential-weight algorithm for exploration and exploitation [1], and is used to solve the multi-armed bandit problem. However, with an appropriate choice of [math]\epsilon_t[/math] and [math]\beta_t[/math], we recover an RCT as a special case, one that can estimate the causal effect of the action on the outcome. For instance, we can use the following schedules for these two hyperparameters:

[[math]] \begin{align} \epsilon_t = \beta_t = \begin{cases} 1,& \text{ if } t \lt T \\ 0,& \text{ if } t \geq T \end{cases} \end{align} [[/math]]

with a large [math]T \gg 1[/math]. We can, however, choose smoother schedules for [math]\epsilon_t[/math] and [math]\beta_t[/math] to strike a better compromise between causal inference and outcome maximization, for instance to avoid assigning too many subjects to a placebo (that is, ineffective) group. The choice of [math]\epsilon_t[/math] and [math]\beta_t[/math] also affects the bias-variance trade-off. Although this is out of the scope of this course, it is easy to guess that a higher [math]\epsilon_t[/math] and a higher [math]\beta_t[/math] lead to a higher variance but a lower bias, and vice versa.
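As an illustration, the hard schedule above and one possible smoother alternative could be written as follows; the exponential decay with a floor is merely one plausible choice among many, not one prescribed by the text.

```python
def hard_schedule(t, T=1000):
    """Explore uniformly for the first T participants, then act greedily."""
    value = 1.0 if t < T else 0.0
    return value, value  # (epsilon_t, beta_t)

def smooth_schedule(t, decay=0.995, floor=0.05):
    """Anneal epsilon_t and beta_t gradually instead of switching abruptly."""
    value = max(floor, decay ** t)
    return value, value  # (epsilon_t, beta_t)
```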

A never-ending trial. A major assumption that must be satisfied for an RCT is stationarity. Both the causal distribution [math]p^*(y|a,x)[/math] and the covariate distribution [math]p^*(x)[/math] must be stationary, in that they do not change either during or after the trial. In particular, when these distributions drift after the trial, that is, after running the trial with [math]T[/math] participants, our causal estimate, as well as any decision based on it, becomes less accurate. We see such instances often in the real world: as viruses mutate, the effectiveness of vaccination wanes over time, even though the very same vaccine was found to be effective by an earlier RCT. When the underlying conditional distributions are all stationary, we do not need to keep the entire set of collected data points in order to compute the approximate causal effect, because

[[math]] \begin{align} \hat{y}_t(a) = \frac{\sum_{t'=1}^t \mathds{1}(a_{t'} = a) y_{t'}} {\sum_{t''=1}^t \mathds{1}(a_{t''} = a)} = \frac{\sum_{t'=1}^{t-1} \mathds{1}(a_{t'} = a)}{\sum_{t'=1}^t \mathds{1}(a_{t'} = a)} \hat{y}_{t-1}(a) + \frac{ \mathds{1}(a_{t} = a)}{\sum_{t'=1}^t \mathds{1}(a_{t'} = a)} y_{t}. \end{align} [[/math]]

In other words, for each action we only need to keep a running count and the single scalar [math]\hat{y}_t(a)[/math] in order to maintain the causal effect estimate over time. We can tweak this recursive formula to cope with slowly drifting underlying distributions by emphasizing recent data points more than older ones. This can be implemented with an exponential moving average,[Notes 2] as follows:

[[math]] \begin{align} \hat{y}_t(a) = \begin{cases} \eta \hat{y}_{t-1}(a) + (1-\eta) y_t,& \text{ if }a_t = a \\ \hat{y}_{t-1}(a),& \text{ if }a_t \neq a \end{cases} \end{align} [[/math]]

where [math]\eta \in [0, 1)[/math]. As [math]\eta \to 0[/math], we consider an increasingly short window into the past and trust the most recent outcome more than what we have previously observed for a particular action. On the other hand, as [math]\eta \to 1[/math], we trust the accumulated estimate of the causal effect of an action [math]a[/math] more than what we observe now. By keeping track of the causal effect with an exponential moving average, we can run the trial continuously. When doing so, we have to be careful in choosing the schedules of [math]\epsilon_t[/math] and [math]\beta_t[/math]. Unlike before, [math]\epsilon_t[/math] should not be monotonically annealed toward [math]0[/math], as earlier exploration may not be useful later once the underlying distributions drift. [math]\beta_t[/math] should not be annealed toward [math]0[/math] either, as the estimate of the causal effect we have at any moment cannot be fully trusted due to the unanticipated drift of the underlying distributions. It is thus reasonable to simply set both [math]\beta_t[/math] and [math]\epsilon_t[/math] to reasonably large constants.
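Both update rules, the exact recursive mean for the stationary case and the exponential moving average for the drifting case, can be sketched as follows; the dictionary-based bookkeeping is our own illustrative choice.

```python
def update_running_mean(y_hat, counts, a, y):
    """Exact recursive update of y_hat_t(a); valid under stationarity."""
    counts[a] = counts.get(a, 0) + 1
    n = counts[a]
    y_hat[a] = y_hat.get(a, 0.0) * (n - 1) / n + y / n

def update_ema(y_hat, a, y, eta=0.9):
    """Exponential-moving-average update for slowly drifting distributions."""
    if a in y_hat:
        y_hat[a] = eta * y_hat[a] + (1.0 - eta) * y
    else:
        y_hat[a] = y  # the first observation initializes the estimate
```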

Checking the bias. At time [math]t[/math], we have accumulated

[[math]] \begin{align} D_t = \left\{ (e_1, a_1, y_1), \ldots, (e_t, a_t, y_t) \right\}, \end{align} [[/math]]

where [math]e_{t'}[/math] indicates whether we explored ([math]1[/math]) or exploited ([math]0[/math]) at time [math]t'[/math]. We can then obtain an unbiased estimate of the causal effect of [math]a[/math] on [math]y[/math] by using only the triplets [math](e,a,y)[/math] for which [math]e=1[/math]. That is,

[[math]] \begin{align} \tilde{y}_t(a) = \frac{\sum_{t'=1}^t \mathds{1}(e_{t'} = 1) \mathds{1}(a_{t'} = a) y_{t'}} {\sum_{t''=1}^t \mathds{1}(e_{t''} = 1) \mathds{1}(a_{t''} = a) }. \end{align} [[/math]]

This estimate is unbiased, unlike [math]\hat{y}_t(a)[/math] from EXP-3 above, since we only used action-outcome pairs for which the action was selected randomly from the same uniform distribution. Assuming a large [math]t[/math] (so as to minimize the impact of the high variance), we can then compute the (noisy) bias of the causal effect estimated and used by EXP-3 above as

[[math]] \begin{align} b_t = (\hat{y}_t(a) - \tilde{y}_t(a))^2. \end{align} [[/math]]

Of course, this estimate of the bias is noisy, especially so when [math]t[/math] is small, since the effective number of data points used to estimate [math]\tilde{y}_t[/math] is on average

[[math]] \begin{align} \sum_{t'=1}^t \epsilon_{t'} \leq t. \end{align} [[/math]]
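A Python sketch of this check: we recompute the unbiased estimate [math]\tilde{y}_t(a)[/math] from the exploration-only triplets and compare it against the estimate maintained by EXP-3; the function names are, again, illustrative.

```python
from collections import defaultdict

def explore_only_estimate(D):
    """Unbiased estimate y_tilde_t(a), using only the explored triplets.

    D is a list of (e, a, y) triplets, where e = 1 means the action was
    assigned uniformly at random (exploration).
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for e, a, y in D:
        if e == 1:
            totals[a] += y
            counts[a] += 1
    return {a: totals[a] / counts[a] for a in counts}

def estimated_bias(y_hat, y_tilde):
    """Noisy per-action bias b_t of the EXP-3 estimate."""
    return {a: (y_hat[a] - y_tilde[a]) ** 2 for a in y_tilde if a in y_hat}
```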


General references

Cho, Kyunghyun (2024). "A Brief Introduction to Causal Inference in Machine Learning". arXiv:2405.08793 [cs.LG].

Notes

  1. This is another constraint on RCT, that every subject must be assigned according to the same assignment policy [math]q(a)[/math].
  2. ‘exponential-weight’ in EXP-3 comes from this choice.

References

  1. "The non-stationary stochastic multi-armed bandit problem" (2017). International Journal of Data Science and Analytics 3. Springer.