Revision as of 01:23, 19 May 2024 by Admin

When Some Confounders are Observed

[math] \newcommand{\indep}[0]{\ensuremath{\perp\!\!\!\perp}} \newcommand{\dpartial}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\abs}[1]{\left| #1 \right|} \newcommand\autoop{\left(} \newcommand\autocp{\right)} \newcommand\autoob{\left[} \newcommand\autocb{\right]} \newcommand{\vecbr}[1]{\langle #1 \rangle} \newcommand{\ui}{\hat{\imath}} \newcommand{\uj}{\hat{\jmath}} \newcommand{\uk}{\hat{k}} \newcommand{\V}{\vec{V}} \newcommand{\half}[1]{\frac{#1}{2}} \newcommand{\recip}[1]{\frac{1}{#1}} \newcommand{\invsqrt}[1]{\recip{\sqrt{#1}}} \newcommand{\halfpi}{\half{\pi}} \newcommand{\windbar}[2]{\Big|_{#1}^{#2}} \newcommand{\rightinfwindbar}[0]{\Big|_{0}^\infty} \newcommand{\leftinfwindbar}[0]{\Big|_{-\infty}^0} \newcommand{\state}[1]{\large\protect\textcircled{\textbf{\small#1}}} \newcommand{\shrule}{\\ \centerline{\rule{13cm}{0.4pt}}} \newcommand{\tbra}[1]{$\bra{#1}$} \newcommand{\tket}[1]{$\ket{#1}$} \newcommand{\tbraket}[2]{$\braket{1}{2}$} \newcommand{\infint}[0]{\int_{-\infty}^{\infty}} \newcommand{\rightinfint}[0]{\int_0^\infty} \newcommand{\leftinfint}[0]{\int_{-\infty}^0} \newcommand{\wavefuncint}[1]{\infint|#1|^2} \newcommand{\ham}[0]{\hat{H}} \newcommand{\mathds}{\mathbb}[/math]

The assignment of an individual [math]x[/math] to a particular treatment option [math]a[/math] is called a policy. In the case of RCT, this policy was a uniform distribution over all possible actions [math]a[/math], and in the case of EXP-3, it was a mixture of a uniform policy and an effect-proportional policy from Eq.eq:bandit_policy. In both cases, the policy was not conditioned on the covariate [math]x[/math], meaning that no information about the individual was used for assignment. This is how we addressed the issue of unobserved confounders. Such an approach is however overly restrictive in many cases, as some treatments may only be effective for a subset of the population that shares a certain trait. For instance, consider the problem of inferring the effect of a monoclonal antibody therapeutic called Trastuzumab (or Herceptin) for breast cancer on the disease-free survival of a patient[1]. If we ran RCT without taking into account any covariate information as above, we would very likely not see any positive effect on the patient's disease-free survival, because Trastuzumab was specifically designed to work for HER2-positive breast cancer. That is, only breast cancer patients with an over-expressed ERBB2 gene (which encodes the HER2 receptor) would benefit from Trastuzumab. In such cases, we are interested in the conditional average treatment effect (CATE) from earlier, that is, in answering the question of what causal effect Trastuzumab has on patients given their gene expression profile. The CATE of Trastuzumab given an over-expressed ERBB2 gene would be positive, while the CATE without over-expression of ERBB2 would be essentially zero. 
When we observe some confounders, such as the gene expression profile of a subject in the example above, but do not observe all the other confounders, we can mix RCT (Randomized Controlled Trials) and regression (Regression: Causal Inference can be Trivial) to estimate the causal effect conditioned on the observed confounders. The graph [math]G[/math] with the partially-observed confounder is depicted below, with the unobserved [math]x[/math] and the observed [math]x'[/math]: \begin{center} \begin{tikzpicture}

  \node[obs] (a) {[math]a[/math]};   \node[obs, right=1cm of a] (y) {[math]y[/math]};   \node[latent, above=0.5cm of a, xshift=0.5cm] (x) {[math]x[/math]};   \node[obs, right=0.5cm of x] (xc) {[math]x'[/math]};   
  \edge{x}{a}; 
\edge{x}{y};
\edge{xc}{a}; 
\edge{xc}{y};
\edge{a}{y};

\end{tikzpicture} \end{center} Each subject in RCT then corresponds to the action-outcome-observed-covariate triplet [math](a_t, y_t, x'_t)[/math]. Assume we have enrolled and experimented with [math]t[/math] participants so far, resulting in

[[math]] \begin{align} D_t = \left\{ (a_1, y_1, x'_1), \ldots, (a_t, y_t, x'_t) \right\}. \end{align} [[/math]]

Let [math]\hat{x}'[/math] be the condition of interest, such as the over-expression of ERBB2. We can then create a subset [math]D_t(\hat{x}')[/math] as

[[math]] \begin{align} D_t(\hat{x}') = \left\{ (a, y, x') \in D_t | x' = \hat{x}' \right\} \subseteq D_t. \end{align} [[/math]]


We can then use this subset [math]D_t(\hat{x}')[/math] as if it were a dataset collected from an ordinary RCT, in order to estimate the conditional causal effect, as follows.

[[math]] \begin{align} \hat{y}_t(a|x') \approx \frac{ \sum_{(a_i,y_i,x_i') \in D_t(x')} \mathds{1}(a_i=a) y_i } { \sum_{(a_j,y_j,x_j') \in D_t(x')} \mathds{1}(a_j=a) }. \end{align} [[/math]]
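As a minimal sketch, the subset-based estimator above can be implemented directly. The trial data, arm labels, and HER2 statuses below are hypothetical, purely for illustration:

```python
def estimate_effect(data, a, x_cond):
    """Estimate y_hat_t(a | x') by averaging the outcomes of trials in
    the subset D_t(x'): covariate matches x' and the assigned arm is a."""
    # data: list of (action, outcome, observed covariate) triplets
    subset = [(a_i, y_i) for (a_i, y_i, x_i) in data if x_i == x_cond]
    outcomes = [y_i for (a_i, y_i) in subset if a_i == a]
    if not outcomes:
        return None  # arm a never tried under x': positivity violated
    return sum(outcomes) / len(outcomes)

# Hypothetical RCT data: (action in {0, 1}, outcome, HER2 status)
D = [(1, 0.9, "HER2+"), (0, 0.2, "HER2+"), (1, 0.8, "HER2+"),
     (1, 0.3, "HER2-"), (0, 0.3, "HER2-")]
print(estimate_effect(D, a=1, x_cond="HER2+"))  # (0.9 + 0.8) / 2 ≈ 0.85
```

The estimator returns `None` when the requested arm was never tried under the requested condition, making the positivity requirement explicit.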

Just like what we did earlier, we can easily turn this into a recursive version in order to save memory:

[[math]] \begin{align} \hat{y}_t(a|x') = \begin{cases} \hat{y}_{t-1}(a|x'),&\text{ if } x'_t \neq x' \\ \frac{ \hat{y}_{t-1}(a|x') \sum_{(a_j,y_j,x_j') \in D_{t-1}(x')} \mathds{1}(a_j=a) + y_t \mathds{1}(a_t = a) } { \sum_{(a_k,y_k,x_k') \in D_{t}(x')} \mathds{1}(a_k=a) },&\text{ if }x'_t = x' \end{cases} \end{align} [[/math]]
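A sketch of this recursive update in Python: instead of storing the full dataset, we keep only a running count and estimate per action under the condition of interest (the class and attribute names are illustrative):

```python
class RecursiveCATE:
    """Incremental version of the subset estimator: for each action a,
    keep the matching-trial count n[a] and running average est[a]
    under a fixed condition x', so D_t never needs to be stored."""
    def __init__(self, x_cond):
        self.x_cond = x_cond
        self.n = {}    # action -> count of matching trials with a_t = a
        self.est = {}  # action -> running average of outcomes

    def update(self, a, y, x):
        if x != self.x_cond:
            return  # x'_t != x': the estimate carries over unchanged
        if a not in self.n:
            self.n[a], self.est[a] = 0, 0.0
        # y_hat_t = (y_hat_{t-1} * n_{t-1} + y_t) / n_t, rearranged
        self.n[a] += 1
        self.est[a] += (y - self.est[a]) / self.n[a]
```

Feeding it the same triplets in sequence reproduces the batch subset average while using [math]O(|\mathcal{A}|)[/math] memory.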

Similarly to earlier, we can instead use an exponential moving average in order to cope with the distribution drift over the sequential RCT, as

[[math]] \begin{align} \hat{y}_t(a|x') = \begin{cases} \hat{y}_{t-1}(a|x'),&\text{ if } x'_t \neq x' \\ \eta_t \hat{y}_{t-1}(a|x') + (1-\eta_t) y_t \mathds{1}(a_t = a), &\text{ if }x'_t = x' \end{cases} \end{align} [[/math]]
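A minimal sketch of one such moving-average step, assuming the trial's covariate has already matched [math]x'[/math]. Note that it updates only the chosen arm's estimate, a common simplification of the rule above that leaves untried arms untouched rather than decaying them toward zero:

```python
def ema_update(est, a, y, a_t, eta=0.9):
    """One exponential-moving-average step of y_hat_t(a | x') for a
    trial whose covariate matched x'. Only the arm actually pulled
    (a == a_t) blends in the new outcome; others keep their estimate."""
    if a == a_t:
        return eta * est + (1 - eta) * y
    return est

# Illustrative sequence of matched trials, tracking arm a = 1
est = 0.0
for a_t, y_t in [(1, 1.0), (0, 0.4), (1, 1.0)]:
    est = ema_update(est, a=1, y=y_t, a_t=a_t, eta=0.5)
print(est)  # 0.5 after the first trial, unchanged, then 0.75
```

The step size `eta` here is fixed; in practice [math]\eta_t[/math] can be scheduled to trade off adaptivity against variance.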

We can then use these running estimates of the causal effects to build an assignment policy [math]\pi_t(a|x')[/math] that now depends on the observed covariate [math]x'[/math]. If we go back to the earlier example of Trastuzumab, this policy would increasingly assign participants with an over-expressed ERBB2 gene to the treatment arm, while it would remain largely uniform for the rest of the population.

A Parametrized Causal Effect. Up until this point, partially-observed confounders do not look like anything special. It is effectively running multiple RCT's in parallel, with an individual RCT for each covariate configuration. There is no benefit to running these RCT's in parallel relative to running them in sequence. We could however imagine a scenario where the former is more beneficial than the latter, and we consider one such case here. Assume that [math]x'[/math] is a multi-dimensional vector, i.e., [math]x' \in \mathbb{R}^d[/math], and that the true causal effect of each action [math]a[/math] is a linear function of the observed covariate [math]x'[/math]:

[[math]] \begin{align} \hat{y}^*(a|x') = \theta^*(a)^\top x' + b^*(a) = \sum_{i=1}^d \theta_i^*(a) x'_i + b^*(a). \end{align} [[/math]]

This means that each dimension [math]x'_i[/math] of the covariate has an additive effect on the expected outcome [math]\hat{y}(a|x')[/math], weighted by the associated coefficient [math]\theta_i^*(a)[/math], and that the effect [math]\theta_i^*(a) x'_i[/math] of each dimension on the expected outcome is independent of the other dimensions' effects. As an example, consider estimating the effect of weight lifting on overall health. The action is whether to perform weight lifting each day, and the outcome is a measure of the subject's overall health. Each dimension [math]x'_i[/math] refers to a habit of the person. For instance, it could be the habit of smoking, and the corresponding dimension [math]x'_i[/math] encodes the number of cigarettes the subject smokes a day. Another habit could be jogging, and the corresponding dimension would encode the number of minutes the subject runs a day. Smoking is associated with a negative coefficient regardless of [math]a[/math]. On the other hand, jogging is associated with a negative coefficient when [math]a=1[/math], because an excessive level of workout leads to frequent injuries, while it is associated with a positive coefficient when [math]a=0[/math]. Of course, some of these habits may have nonlinear effects. Running just the right duration each day in addition to weight lifting could lead to a better health outcome. It is however reasonable to assume linearity as a first-order approximation. We can estimate the coefficients [math]\theta(a)[/math] and the bias [math]b(a)[/math] by regression from Regression: Causal Inference can be Trivial by solving

[[math]] \begin{align} \min_{\theta(a), b(a)} \frac{1}{2} \sum_{t'=1}^t \mathds{1}(a_{t'} = a) \left( y_{t'} - \theta(a)^\top x'_{t'} - b(a) \right)^2. \end{align} [[/math]]
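This per-action least-squares problem has a closed-form solution. A sketch with NumPy, on synthetic, noise-free data whose true coefficients (smoking and jogging effects under [math]a=1[/math]) are made up for illustration:

```python
import numpy as np

def fit_effect(X, y, actions, a):
    """Fit theta(a) and b(a) by least squares on the trials with a_t == a.
    X: (n, d) observed covariates, y: (n,) outcomes, actions: (n,) arms."""
    mask = actions == a
    # append a column of ones so the intercept b(a) is fit jointly
    A = np.column_stack([X[mask], np.ones(mask.sum())])
    coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
    return coef[:-1], coef[-1]  # theta(a), b(a)

# Hypothetical trials: dims are (cigarettes/day, minutes jogged/day)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
actions = rng.integers(0, 2, size=200)
# true linear effects: smoking hurts under both arms; jogging hurts
# under a=1 (injuries on top of lifting) but helps under a=0
y = np.where(actions == 1, -0.5 * X[:, 0] - 0.2 * X[:, 1] + 3.0,
                           -0.5 * X[:, 0] + 0.3 * X[:, 1] + 1.0)
theta1, b1 = fit_effect(X, y, actions, a=1)
print(np.round(theta1, 3), round(b1, 3))  # recovers [-0.5, -0.2] and 3.0
```

Since the synthetic outcomes are exactly linear, the fit recovers the generating coefficients up to floating-point error.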

Instead of keeping a count for each and every possible [math]x'[/math], we now keep only [math]\theta(a)[/math] for each action [math]a[/math]. This has an obvious advantage of requiring only [math]O(d)[/math] memory rather than [math]O(2^d)[/math]. More importantly, however, the estimated causal effect generalizes to unseen covariate configurations. Let us continue with the example of having smoking and jogging as two dimensions of [math]x'[/math]. During RCT, we may have seen participants who either smoke or jog but never both. Because of the linearity, the estimated causal effect predictor,

[[math]] \begin{align} \theta(a)_{\mathrm{smoke}} x'_{\mathrm{smoke}} + \theta(a)_{\mathrm{run}} x'_{\mathrm{run}}, \end{align} [[/math]]

generalizes to participants who both smoke and jog, as well as to those who neither smoke nor jog. This case of a linear causal effect suggests that we can rely on the power of generalization in machine learning in order to bypass the strong assumption of positivity. Even if we do not observe a covariate or an associated action, a parametrized causal effect predictor can generalize to those unseen cases. We will discuss this potential further later in the semester.


When there are many possible actions. Assume we do not observe any confounder, that is, there is no [math]x'[/math]. Then, at each time step, RCT is nothing but estimating a single scalar for each action. Let [math]\mathcal{A}[/math] be the set of all actions and [math]|\mathcal{A}|[/math] the cardinality of this action set. Then, at any time [math]t[/math] of running an RCT, the number of data points we can use to estimate the causal effect of a particular action is

[[math]] \begin{align*} N(a) = \sum_{t'=1}^t \mathds{1}(a_{t'} = a) \approx t p_a(a), \end{align*} [[/math]]

where [math]p_a(a)[/math] is the probability of selecting the action [math]a[/math] during randomization. Just like the case above with partially observed confounders, the variance of the estimate of the causal effect of an individual action increases dramatically as the number of possible actions increases, since each action receives fewer data points. We must have some extra information (context) about these actions in order to break out of this issue. Let [math]c(a) \in \mathbb{R}^d[/math] be the context of the action [math]a[/math]. For instance, each dimension of [math]c(a)[/math] corresponds to the amount of one ingredient for making the perfect steak seasoning, such as salt, pepper, garlic and others. Then, each action [math]a[/math] corresponds to a unique combination of these ingredients. In this case, the causal effect of any particular action can be thought of as mapping [math]c(a)[/math] to the outcome [math]\hat{y}(a)[/math] associated with [math]a[/math]. If we assume this mapping is linear[2], we can write it as

[[math]] \begin{align} \hat{y}(a) = c(a)^\top {\theta^*} + b^*, \end{align} [[/math]]

where [math]\theta^* \in \mathbb{R}^d[/math] and [math]b^* \in \mathbb{R}[/math]. Similarly to the case above where there was an observed confounder, with linearity we do not need to maintain the causal estimate for each and every possible action, which amounts to [math]|\mathcal{A}|[/math] numbers, but only the effect of each dimension of the action context on the outcome, which amounts to [math]d[/math] numbers. When [math]d \ll |\mathcal{A}|[/math], we gain a significant improvement in the variance of our estimates. Furthermore, just like what we saw above, we benefit from compositionality, or compositional generalization. For instance, if the effects of salt and pepper on the final quality of the seasoning are independent and additive, we can accurately estimate the effect of having both salt and pepper even when all tested seasonings had either salt or pepper but never both. Let [math]c(a)=[s_{\mathrm{salt}}, s_{\mathrm{pepper}}][/math], and assume [math]s_{\mathrm{salt}}, s_{\mathrm{pepper}} \in \{0, 1\}[/math] and that all past trials were such that [math]s_{\mathrm{salt}}=0[/math] or [math]s_{\mathrm{pepper}} = 0[/math]. We can approximate [math]\theta^*_{\mathrm{salt}}[/math] and [math]\theta^*_{\mathrm{pepper}}[/math] from these past trials, and due to the linearity assumption, we can now compute the causal effect of [math]c(a)=[1, 1][/math], as

[[math]] \begin{align} \hat{\theta}_{\mathrm{salt}} + \hat{\theta}_{\mathrm{pepper}}. \end{align} [[/math]]

This would not have been possible without the linearity, or more generally compositionality, because this particular action of adding both salt and pepper has never been seen before, i.e., it violates the positivity assumption. This is yet another example of overcoming the violation of positivity by generalization. At this point, one sees a clear connection between having some confounders observed and having many actions associated with their contexts. This is because they are simply two sides of the same coin. I leave it to you to think about why this is the case.
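The salt-and-pepper case can be checked numerically. In this sketch the contexts, the true coefficients [math]\theta^*[/math] and bias [math]b^*[/math], and the outcomes are all made up for illustration; note that no past trial has both ingredients:

```python
import numpy as np

# Hypothetical past trials: contexts c(a) = [salt, pepper], never both
C = np.array([[0, 0], [1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
# outcomes generated by an assumed linear effect: theta* = [2.0, 1.5], b* = 0.5
y = C @ np.array([2.0, 1.5]) + 0.5

# fit theta and b by least squares (intercept via a column of ones)
A = np.column_stack([C, np.ones(len(C))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
theta, b = coef[:2], coef[2]

# predict the never-before-seen combination c(a) = [1, 1]
y_both = theta @ np.array([1.0, 1.0]) + b
print(round(y_both, 3))  # linearity gives 2.0 + 1.5 + 0.5 = 4.0
```

Even though [math]c(a)=[1,1][/math] violates positivity with respect to the past trials, the linear model composes the two separately-estimated ingredient effects.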

General references

Cho, Kyunghyun (2024). "A Brief Introduction to Causal Inference in Machine Learning". arXiv:2405.08793 [cs.LG].

References

  1. "Trastuzumab: triumphs and tribulations" (2007). Oncogene 26. Nature Publishing Group. 
  2. Li, Lihong; Chu, Wei; Langford, John; Schapire, Robert E (2010). "A contextual-bandit approach to personalized news article recommendation". Proceedings of the 19th International Conference on World Wide Web.