Prediction vs. Causal Inference
The major difference between prediction and causal inference is the goal. The goal of prediction is to predict which value a particular variable, in our case often the outcome variable, would take given that we have observed the values of the other variables. The goal of causal inference, on the other hand, is to know which value the outcome variable would take had we intervened on the action variable. This difference implies that causal inference may not be the best way to predict what would happen based on what we have observed.

In the example of birds vs. branches above, if our goal is good prediction, we would certainly be open to using the location of the branch as one of the features as well. Even if a large portion of the bird in a picture is occluded by, e.g., leaves, we may be able to accurately predict that there is a bird in the picture by noticing the horizontal branch near the bottom of the tree. This branch feature is clearly not a causal feature, but it nevertheless helps us make a better prediction. In short, if I knew that the picture was taken in a forest, I would rely on both the beak and the branch's location to determine whether there is a bird in the picture. This is however a brittle strategy, as my prediction ability would certainly degrade had the picture been taken somewhere else.

The invariant predictor [math]q(y | g(x))[/math] from above is thus likely sub-optimal for prediction under any particular environment, although it may be the right distribution for computing the causal effect of [math]x[/math] on [math]y[/math]. This is because the invariant predictor only explains a part of [math]y[/math] (marked red below), while ignoring the path (marked blue below) that is opened by conditioning on the collider:

\begin{center}
\begin{tikzpicture}
\node[latent] (x) {[math]x[/math]};
\node[draw, rectangle, below=0.5cm of x, xshift=0.5cm] (g) {[math]g[/math]};
\node[latent, right=0.5cm of g] (gx) {[math]x'[/math]};
\node[latent, right=2cm of x] (y) {[math]y[/math]};
\node[obs, above=0.5cm of x, xshift=1cm] (z) {[math]z[/math]};
\edge{x}{g};
\edge{g}{gx};
\edge[color=red]{gx}{y};
\edge[color=blue]{x}{z};
\edge[color=blue]{y}{z};
\end{tikzpicture}
\end{center}

Given an environment [math]z = \hat{z}[/math], we must capture both the correlation arising from [math]g(x)\to y[/math] and that arising from [math]x \to \hat{z} \leftarrow y[/math] in order to properly predict what value [math]y[/math] is likely to take given [math]x[/math]. This can be addressed by introducing an environment-dependent feature extractor [math]h_{\hat{z}}(x)[/math] that is orthogonal to the invariant feature extractor [math]g(x)[/math]. We can impose such orthogonality (or independence) when learning [math]h_{\hat{z}}(x)[/math] by solving

[math]\max_{q,\, h_{\hat{z}}}\;\; \mathbb{E}_{(x,y) \sim p_{\hat{z}}(x,y)} \left[ \log q(y \mid g(x), h_{\hat{z}}(x)) \right][/math]
with a given [math]g[/math]. [math]h_{\hat{z}}[/math] would then capture only the information about [math]y[/math] that was not already captured by [math]g[/math], leading to the orthogonality. This however assumes that [math]q[/math] is constrained to the point that it cannot simply ignore [math]g(x)[/math] entirely. This view allows us to use a small number of labelled examples from a new environment at test time to quickly learn the environment-specific feature extractor [math]h_z[/math], while having learned the environment-invariant feature extractor [math]g[/math] at training time from a diverse set of environments. One can view such a scheme as meta-learning or transfer learning, although neither of these concepts is well defined. It is possible to flip the process described here to obtain an environment-invariant feature extractor [math]g[/math], if we already have an environment-dependent feature extractor [math]h_z[/math], by solving

[math]\max_{q,\, g}\;\; \sum_{\hat{z}} \mathbb{E}_{(x,y) \sim p_{\hat{z}}(x,y)} \left[ \log q(y \mid g(x), h_{\hat{z}}(x)) \right][/math]
assuming again that [math]q[/math] is constrained to the point that it cannot simply ignore [math]h_z(x)[/math] entirely. This flipped approach has been used to build a predictive model that is free of a known societal bias for which a detector can easily be constructed~[1].
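To make the two-stage recipe above concrete, here is a minimal PyTorch-style sketch, offered as an illustration rather than as the procedure from the text: an invariant feature extractor [math]g[/math] is assumed to be already trained and is frozen, and an environment-specific extractor [math]h_{\hat{z}}[/math] is fit, together with the predictor [math]q[/math], on a few labelled examples from the new environment by maximizing [math]\log q(y \mid g(x), h_{\hat{z}}(x))[/math]. All module names, dimensions, and training details are assumptions; in particular, the sketch does not implement the constraint that keeps [math]q[/math] from ignoring [math]g(x)[/math] entirely, which would have to be added, e.g., by limiting the capacity of the [math]h_{\hat{z}}[/math] pathway.

\begin{verbatim}
# Minimal sketch (PyTorch); module names, sizes and hyper-parameters are illustrative.
# Stage 2 of the scheme in the text: the invariant extractor g is frozen, and an
# environment-specific extractor h is learned, together with the predictor q,
# by maximizing log q(y | g(x), h(x)) on labelled data from the new environment.
import torch
import torch.nn as nn


class Predictor(nn.Module):
    """q(y | g(x), h(x)): a classifier over the concatenated feature vectors."""

    def __init__(self, g_dim: int, h_dim: int, n_classes: int):
        super().__init__()
        self.out = nn.Linear(g_dim + h_dim, n_classes)

    def forward(self, g_feat: torch.Tensor, h_feat: torch.Tensor) -> torch.Tensor:
        return self.out(torch.cat([g_feat, h_feat], dim=-1))


def adapt_to_environment(g: nn.Module, h: nn.Module, q: Predictor,
                         loader, n_steps: int = 200, lr: float = 1e-3):
    """Fit h and q on the new environment while keeping g fixed."""
    for p in g.parameters():            # freeze the invariant extractor g
        p.requires_grad_(False)
    g.eval()
    opt = torch.optim.Adam(list(h.parameters()) + list(q.parameters()), lr=lr)
    nll = nn.CrossEntropyLoss()         # negative log-likelihood of q(y | ...)
    step = 0
    while step < n_steps:
        for x, y in loader:
            logits = q(g(x), h(x))      # q conditions on both g(x) and h(x)
            loss = nll(logits, y)       # minimize -log q(y | g(x), h(x))
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step >= n_steps:
                break
    return h, q
\end{verbatim}

The flipped, debiasing variant swaps the roles of the two modules: the known bias detector [math]h_z[/math] is the frozen one, and [math]g[/math] and [math]q[/math] are trained, so that [math]g[/math] is pushed to explain only what the bias feature does not.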
General references
Cho, Kyunghyun (2024). "A Brief Introduction to Causal Inference in Machine Learning". arXiv:2405.08793 [cs.LG].