More properties of the conditional expectation

Theorem

Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]\mathcal{G}_1\subset\F[/math] and [math]\mathcal{G}_2\subset\F[/math] be two sub-[math]\sigma[/math]-algebras of [math]\F[/math]. Then [math]\mathcal{G}_1[/math] and [math]\mathcal{G}_2[/math] are independent if and only if for every positive [math]\mathcal{G}_2[/math]-measurable r.v. [math]X[/math] (or for every [math]X\in L^1(\Omega,\mathcal{G}_2,\p)[/math], or for every [math]X=\one_A[/math] with [math]A\in \mathcal{G}_2[/math]) we have

[[math]] \E[X\mid\mathcal{G}_1]=\E[X]. [[/math]]

Proof

We only need to prove that the statement in the brackets for indicator functions, which is the weakest of the three conditions, implies that [math]\mathcal{G}_1[/math] and [math]\mathcal{G}_2[/math] are independent. Assume that for all [math]A\in \mathcal{G}_2[/math] we have that

[[math]] \E[\one_A\mid\mathcal{G}_1]=\p[A] [[/math]]

and recall that, by the defining property of the conditional expectation, for all [math]B\in \mathcal{G}_1[/math] we have that

[[math]] \E[\one_B\one_A]=\E[\E[\one_A\mid\mathcal{G}_1]\one_B]. [[/math]]

Inserting [math]\E[\one_A\mid\mathcal{G}_1]=\p[A][/math] on the right-hand side gives [math]\p[A\cap B]=\E[\one_B\one_A]=\E[\p[A]\one_B]=\p[A]\p[B][/math] for all [math]A\in\mathcal{G}_2[/math] and [math]B\in\mathcal{G}_1[/math], and hence the claim follows.

■

Let [math]Z[/math] and [math]Y[/math] be two real-valued r.v.'s. Then [math]Z[/math] and [math]Y[/math] are independent if and only if for every Borel measurable [math]h[/math] with [math]\E[\vert h(Z)\vert] \lt \infty[/math] we have [math]\E[h(Z)\mid Y]=\E[h(Z)][/math]. To see this, apply the theorem with [math]\mathcal{G}_1=\sigma(Y)[/math] and [math]\mathcal{G}_2=\sigma(Z)[/math], and note that all r.v.'s in [math]L^1(\Omega,\mathcal{G}_2,\p)[/math] are of the form [math]h(Z)[/math] with [math]\E[\vert h(Z)\vert] \lt \infty[/math]. In particular, if [math]Z[/math] and [math]Y[/math] are independent and [math]Z\in L^1(\Omega,\F,\p)[/math], we get [math]\E[Z\mid Y]=\E[Z][/math].

Be aware that the latter equation alone does not imply that [math]Y[/math] and [math]Z[/math] are independent. For example, take [math]Z\sim\mathcal{N}(0,1)[/math] and [math]Y=\vert Z\vert[/math]. For every bounded Borel measurable [math]h[/math] we get [math]\E[Zh(\vert Z\vert )]=0[/math], because [math]z\mapsto zh(\vert z\vert)[/math] is odd and the law of [math]Z[/math] is symmetric. Thus [math]\E[Z\mid \vert Z\vert]=0=\E[Z][/math], but [math]Z[/math] and [math]\vert Z\vert[/math] are not independent.
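The counterexample with [math]Z\sim\mathcal{N}(0,1)[/math] and [math]Y=\vert Z\vert[/math] can be checked numerically. The following is only a minimal Monte Carlo sketch: the test function `h`, the events [math]\{Z \gt 1\}[/math] and [math]\{\vert Z\vert \gt 1\}[/math], the sample size and the seed are all illustrative assumptions and are not taken from the text.

```python
import math
import random


def check_counterexample(n=200_000, seed=0):
    """Monte Carlo look at Z ~ N(0,1) and Y = |Z| (sample size and seed are arbitrary)."""
    rng = random.Random(seed)
    zs = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Hypothetical bounded test function h; any bounded Borel function would do.
    def h(y):
        return 1.0 + math.cos(y)

    # Estimate E[Z h(|Z|)]; it should be close to 0 because z -> z h(|z|) is odd.
    mean_zh = sum(z * h(abs(z)) for z in zs) / n

    # Estimate P[Z > 1, |Z| > 1] and the product P[Z > 1] * P[|Z| > 1].
    p_a = sum(1 for z in zs if z > 1.0) / n
    p_b = sum(1 for z in zs if abs(z) > 1.0) / n
    p_ab = sum(1 for z in zs if z > 1.0 and abs(z) > 1.0) / n
    return mean_zh, p_ab, p_a * p_b


if __name__ == "__main__":
    mean_zh, p_joint, p_product = check_counterexample()
    print("E[Z h(|Z|)] estimate:", mean_zh)
    print("P[Z>1, |Z|>1] estimate:", p_joint)
    print("P[Z>1] P[|Z|>1] estimate:", p_product)
```

If the two probability estimates differ clearly while the first estimate stays near zero, this is consistent with [math]\E[Z\mid\vert Z\vert]=0[/math] holding without independence.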

Theorem

Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]X[/math] and [math]Y[/math] be two r.v.'s on that space with values in measurable spaces [math]E[/math] and [math]F[/math] respectively. Assume that [math]X[/math] is independent of the sub-[math]\sigma[/math]-algebra [math]\mathcal{G}\subset \F[/math] and that [math]Y[/math] is [math]\mathcal{G}[/math]-measurable. Then for every measurable map [math]g:E\times F\to \R_+[/math] we have

[[math]] \E[g(X,Y)\mid\mathcal{G}]=\int_E g(x,Y)d\p_X(x), [[/math]]

where [math]\p_X[/math] is the law of [math]X[/math] and the right hand side has to be understood as a function [math]\phi(Y)[/math] with

[[math]] \phi:y\mapsto \int_E g(x,y)d\p_X(x). [[/math]]

Proof

Since [math]\phi(Y)[/math] is [math]\mathcal{G}[/math]-measurable, we need to show that for every positive [math]\mathcal{G}[/math]-measurable r.v. [math]Z[/math] we get that

[[math]] \E[g(X,Y)Z]=\E[\phi(Y)Z]. [[/math]]

Let us denote by [math]\p_{(X,Y,Z)}[/math] the distribution of [math](X,Y,Z)[/math] on [math]E\times F\times \R_+[/math]. Since [math](Y,Z)[/math] is [math]\mathcal{G}[/math]-measurable and [math]X[/math] is independent of [math]\mathcal{G}[/math], we have [math]\p_{(X,Y,Z)}=\p_X\otimes \p_{(Y,Z)}[/math]. Thus, by Fubini's theorem for nonnegative functions,

[[math]] \begin{align*} \E[g(X,Y)Z]&=\int_{E\times F\times \R_+}g(x,y)z\,d\p_{(X,Y,Z)}(x,y,z)=\int_{F\times \R_+}z\left(\int_E g(x,y)d\p_X(x)\right) d\p_{(Y,Z)}(y,z)\\ &=\int_{F\times\R_+}z\phi(y)d\p_{(Y,Z)}(y,z)=\E[Z \phi(Y)]. \end{align*} [[/math]]

■
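To make the formula [math]\E[g(X,Y)\mid\mathcal{G}]=\phi(Y)[/math] concrete, here is a small simulation sketch with [math]\mathcal{G}=\sigma(Y)[/math]. The choices are assumptions made only for illustration: [math]X[/math] uniform on [math][0,1][/math], [math]Y[/math] uniform on [math]\{1,2,3\}[/math] and independent of [math]X[/math], and [math]g(x,y)=x^y[/math], so that [math]\phi(y)=\int_0^1x^ydx=1/(y+1)[/math]. Averaging [math]g(X,Y)[/math] over the samples with a fixed value of [math]Y[/math] should then approximate [math]\phi[/math] at that value.

```python
import random


def phi(y):
    # phi(y) = integral over [0, 1] of x**y dx = 1 / (y + 1) for X ~ Uniform(0, 1).
    return 1.0 / (y + 1)


def conditional_means(n=200_000, seed=1):
    """Average g(X, Y) = X**Y separately over the samples with Y = 1, 2, 3."""
    rng = random.Random(seed)
    sums = {1: 0.0, 2: 0.0, 3: 0.0}
    counts = {1: 0, 2: 0, 3: 0}
    for _ in range(n):
        x = rng.random()            # X ~ Uniform(0, 1), drawn independently of Y
        y = rng.choice((1, 2, 3))   # Y generates the sigma-algebra G in this sketch
        sums[y] += x ** y           # g(x, y) = x**y is nonnegative
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}


if __name__ == "__main__":
    for y, estimate in sorted(conditional_means().items()):
        print("Y =", y, "empirical:", estimate, "phi(y):", phi(y))
```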

Important examples

We need to take a look at two important examples.

Variables with densities

Let [math](X,Y)[/math] be a r.v. with values in [math]\R^m\times \R^n[/math]. Assume that [math](X,Y)[/math] has density [math]P(x,y)[/math], i.e. for all Borel measurable maps [math]h:\R^m\times\R^n\to \R_+[/math] we have

[[math]] \E[h(X,Y)]=\int_{\R^m\times\R^n}h(x,y)P(x,y)dxdy. [[/math]]

The density of [math]Y[/math] is given by

[[math]] Q(y)=\int_{\R^m}P(x,y)dx. [[/math]]

We want to compute [math]\E[h(X)\mid Y][/math] for some measurable map [math]h:\R^m\to\R_+[/math]. For every measurable map [math]g:\R^n\to\R_+[/math] we have

[[math]] \begin{align*} \E[h(X)g(Y)]&=\int_{\R^m\times\R^n}h(x)g(y)P(x,y)dxdy=\int_{\R^n}\left(\int_{\R^m} h(x)P(x,y)dx\right) g(y)dy\\ &=\int_{\R^n}\frac{1}{Q(y)}\left(\int_{\R^m}h(x)P(x,y)dx\right)g(y)Q(y)\one_{\{Q(y) \gt 0\}}dy\\ &=\int_{\R^n}\varphi(y)g(y)Q(y)\one_{\{Q(y) \gt 0\}}dy=\E[\varphi(Y)g(Y)], \end{align*} [[/math]]

where

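[[math]] \varphi(y)=\begin{cases}\frac{1}{Q(y)}\int_{\R^m}h(x)P(x,y)dx,& Q(y) \gt 0\\ h(0),& Q(y)=0.\end{cases} [[/math]]

Here we used that [math]P(x,y)=0[/math] for almost every [math]x[/math] whenever [math]Q(y)=0[/math], so the inner integral vanishes on [math]\{Q(y)=0\}[/math]. Since [math]g[/math] was arbitrary and [math]\varphi(Y)[/math] is [math]\sigma(Y)[/math]-measurable, we conclude that [math]\E[h(X)\mid Y]=\varphi(Y)[/math].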

Proposition

For [math]y\in\R^n[/math], let [math]\nu(y,dx)[/math] be the probability measure on [math]\R^m[/math] defined by

[[math]] \nu(y,dx)=\begin{cases}\frac{1}{Q(y)}P(x,y)dx,& Q(y) \gt 0\\ \delta_0(dx),& Q(y)=0\end{cases} [[/math]]

Then for all measurable maps [math]h:\R^m\to\R_+[/math] we get

[[math]] \E[h(X)\mid Y]=\int_{\R^m} h(x)\nu(Y, dx), [[/math]]

where the right-hand side has to be understood as [math]\varphi(Y)[/math], with

[[math]] \varphi(y)=\int_{\R^m} h(x)\nu(y,dx). [[/math]]

In the literature, one often writes, by abuse of notation,

[[math]] \E[h(X)\mid Y=y]=\int_{\R^m}h(x)\nu(y,dx), [[/math]]

and [math]\nu(y,dx)[/math] is called the conditional distribution of [math]X[/math] given [math]Y=y[/math] (even though in general we have [math]\p[Y=y]=0[/math]).
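As a worked illustration of the conditional distribution [math]\nu(y,dx)[/math], the sketch below evaluates [math]\int h(x)\nu(y,dx)[/math] numerically for [math]m=n=1[/math]. The density [math]P(x,y)=x+y[/math] on the unit square is a hypothetical example chosen here, not one from the text; for it one can check by hand that [math]Q(y)=\tfrac12+y[/math] and [math]\E[X\mid Y=y]=(\tfrac13+\tfrac y2)/(\tfrac12+y)[/math] for [math]y\in[0,1][/math]. The helper names and the midpoint rule are likewise illustrative choices.

```python
def joint_density(x, y):
    # Hypothetical example density P(x, y) = x + y on the unit square, 0 elsewhere.
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0


def integrate_x(f, steps=2000):
    # Midpoint rule on [0, 1]; enough here since the density vanishes for x outside [0, 1].
    dx = 1.0 / steps
    return sum(f((i + 0.5) * dx) for i in range(steps)) * dx


def cond_expectation(h, y):
    """Approximate the integral of h(x) nu(y, dx), following the two cases of nu."""
    q = integrate_x(lambda x: joint_density(x, y))   # marginal density Q(y)
    if q == 0.0:
        return h(0.0)                                # nu(y, dx) is the Dirac mass at 0
    return integrate_x(lambda x: h(x) * joint_density(x, y)) / q


if __name__ == "__main__":
    y = 0.25
    print("numerical E[X | Y = y]:", cond_expectation(lambda x: x, y))
    print("closed form           :", (1.0 / 3.0 + y / 2.0) / (0.5 + y))
```

For a value of [math]y[/math] outside [math][0,1][/math] the marginal density vanishes, and the function falls back to [math]h(0)[/math], matching the convention [math]\nu(y,dx)=\delta_0(dx)[/math].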

The Gaussian case

Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]X,Y_1,...,Y_p\in L^2(\Omega,\F,\p)[/math]. We saw that [math]\E[X\mid Y_1,...,Y_p][/math] is the orthogonal projection of [math]X[/math] on [math]L^2(\Omega,\sigma(Y_1,...,Y_p),\p)[/math]. Since this conditional expectation is [math]\sigma(Y_1,...,Y_p)[/math]-measurable, it is of the form [math]\varphi(Y_1,...,Y_p)[/math]. In general, [math]L^2(\Omega,\sigma(Y_1,...,Y_p),\p)[/math] is infinite-dimensional, so it is hard to obtain [math]\varphi[/math] explicitly. We also saw that [math]\varphi(Y_1,...,Y_p)[/math] is the best approximation of [math]X[/math] in the [math]L^2[/math] sense by an element of [math]L^2(\Omega,\sigma(Y_1,...,Y_p),\p)[/math]. Moreover, it is well known that the best [math]L^2[/math]-approximation of [math]X[/math] by an affine function of [math]Y_1,...,Y_p[/math] is the orthogonal projection of [math]X[/math] on the vector space spanned by [math]\one,Y_1,...,Y_p[/math], i.e. the r.v. [math]\alpha_0+\alpha_1 Y_1+\dotsm +\alpha_p Y_p[/math] whose coefficients minimize

[[math]] \E[(X-(\alpha_0+\alpha_1 Y_1+\dotsm +\alpha_p Y_p))^2]. [[/math]]

In general, this is different from the orthogonal projection on [math]L^2(\Omega,\sigma(Y_1,...,Y_p),\p)[/math], except in the Gaussian case, i.e. when [math](X,Y_1,...,Y_p)[/math] is a Gaussian vector.
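The difference between the affine projection and the conditional expectation can be seen in a small simulation with [math]p=1[/math]. This is only a sketch under assumptions chosen for illustration: [math]Y\sim\mathcal{N}(0,1)[/math], a Gaussian pair built as [math]X=2Y+\varepsilon[/math] with independent standard normal noise [math]\varepsilon[/math], and a non-Gaussian pair [math]X=Y^2[/math]. The coefficients are estimated with the usual least-squares formulas [math]\alpha_1=\mathrm{Cov}(X,Y)/\mathrm{Var}(Y)[/math] and [math]\alpha_0=\E[X]-\alpha_1\E[Y][/math].

```python
import random


def affine_fit(xs, ys):
    """Return (alpha0, alpha1) minimizing the empirical mean of (x - alpha0 - alpha1 * y)**2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    alpha1 = cov_xy / var_y
    return mean_x - alpha1 * mean_y, alpha1


def mean_squared_error(xs, predictions):
    return sum((x - p) ** 2 for x, p in zip(xs, predictions)) / len(xs)


def compare(n=100_000, seed=2):
    rng = random.Random(seed)
    ys = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Gaussian case: X = 2Y + independent N(0,1) noise, so (X, Y) is a Gaussian
    # vector and E[X | Y] = 2Y is itself affine in Y.
    xs_gauss = [2.0 * y + rng.gauss(0.0, 1.0) for y in ys]
    gauss_coeffs = affine_fit(xs_gauss, ys)

    # Non-Gaussian case: X = Y**2, so E[X | Y] = Y**2 has zero error, while the
    # fitted affine function is expected to be close to the constant 1.
    xs_square = [y * y for y in ys]
    a0, a1 = affine_fit(xs_square, ys)
    affine_error = mean_squared_error(xs_square, [a0 + a1 * y for y in ys])
    return gauss_coeffs, (a0, a1), affine_error


if __name__ == "__main__":
    gauss_coeffs, square_coeffs, affine_error = compare()
    print("Gaussian pair, fitted (alpha0, alpha1):", gauss_coeffs)
    print("X = Y^2, fitted (alpha0, alpha1):", square_coeffs)
    print("X = Y^2, mean squared error of the affine fit:", affine_error)
```

In the second case the affine fit leaves a mean squared error near [math]\mathrm{Var}(Y^2)=2[/math], whereas the conditional expectation [math]Y^2[/math] reproduces [math]X[/math] exactly.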

General references

Moshayedi, Nima (2020). "Lectures on Probability Theory". arXiv:2010.16280 [math.PR].