The Conditional Expectation
Conditional probability
Let [math](\Omega,\F,\p)[/math] be a probability space and let [math]A,B\in\F[/math] such that [math]\p[B] \gt 0[/math]. Then the conditional probability[a] of [math]A[/math] given [math]B[/math] is defined as
The important fact here is that the map [math]\F\to [0,1][/math], [math]A\mapsto \p[A\mid B][/math] defines a new probability measure on [math]\F[/math], called the conditional probability given [math]B[/math]. There are several facts which we need to recall:
- If [math]A_1,...,A_n\in\F[/math] and if [math]\p\left[\bigcap_{k=1}^nA_k\right] \gt 0[/math], then
[[math]] \p\left[\bigcap_{k=1}^nA_k\right]=\prod_{j=1}^n\p\left[A_j\Big|\bigcap_{k=1}^{j-1}A_k\right]. [[/math]]
- Let [math](E_n)_{n\geq 1}[/math] be a measurable partition of [math]\Omega[/math], i.e. for all [math]n\geq 1[/math] we have that [math]E_n\in\F[/math] and for [math]n\not=m[/math] we get [math]E_n\cap E_m=\varnothing[/math] and [math]\bigcup_{n\geq 1}E_n=\Omega[/math]. Now for [math]A\in \F[/math] we get
[[math]] \p[A]=\sum_{n\geq 1}\p[A\mid E_n]\p[E_n]. [[/math]]
- (Bayes' formula)[b] Let [math](E_n)_{n\geq 1}[/math] be a measurable partition of [math]\Omega[/math] and [math]A\in\F[/math] with [math]\p[A] \gt 0[/math]. Then
[[math]] \p[E_n\mid A]=\frac{\p[A\mid E_n]\p[E_n]}{\sum_{m\geq 1}\p[A\mid E_m]\p[E_m]}. [[/math]]
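The two formulas above are easy to verify on a finite partition. The following minimal Python sketch uses made-up priors [math]\p[E_n][/math] and likelihoods [math]\p[A\mid E_n][/math] and checks the law of total probability and Bayes' formula with exact rational arithmetic.

```python
from fractions import Fraction as F

# Hypothetical finite partition E_1, E_2, E_3 of Omega with prior
# probabilities P[E_n] and likelihoods P[A | E_n]; all numbers made up.
prior = [F(1, 2), F(1, 3), F(1, 6)]          # P[E_n], sums to 1
likelihood = [F(1, 10), F(1, 2), F(9, 10)]   # P[A | E_n]

# Law of total probability: P[A] = sum_n P[A | E_n] P[E_n]
p_A = sum(l * q for l, q in zip(likelihood, prior))

# Bayes' formula: P[E_n | A] = P[A | E_n] P[E_n] / P[A]
posterior = [l * q / p_A for l, q in zip(likelihood, prior)]

print(p_A, posterior)
```

As expected, the posterior weights form a probability vector, i.e. they sum to 1.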
We can reformulate the definition of the conditional probability to obtain
Discrete construction of the conditional expectation
Let [math]X[/math] and [math]Y[/math] be two r.v.'s on a probability space [math](\Omega,\F,\p)[/math]. Let [math]Y[/math] take values in [math]\R[/math] and [math]X[/math] take values in a countable discrete set [math]\{x_1,x_2,...,x_n,...\}[/math]. The goal is to describe the expectation of the r.v. [math]Y[/math] given the observed r.v. [math]X[/math]. For instance, let [math]X=x_j\in\{x_1,x_2,...,x_n,...\}[/math]. We then look at the set [math]\{\omega\in\Omega\mid X(\omega)=x_j\}[/math] rather than at the whole of [math]\Omega[/math]. For [math]\Lambda\in\F[/math], we thus define
a new probability measure [math]\Q[/math], provided that [math]\p[X=x_j] \gt 0[/math]. It then makes more sense to compute
rather than
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]X:\Omega\to\{x_1,x_2,...,x_n,...\}[/math] be a r.v. taking values in a discrete set and let [math]Y[/math] be a real valued r.v. on that space. If [math]\p[X=x_j] \gt 0[/math], we can define the conditional expectation of [math]Y[/math] given [math]\{X=x_j\}[/math] to be
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]X[/math] be a r.v. on that space with values in [math]\{x_1,x_2,...,x_n,...\}[/math] and let [math]Y[/math] also be a r.v. with values in [math]\{y_1,y_2,...,y_n,...\}[/math]. If [math]\p[X=x_j] \gt 0[/math], we can write the conditional expectation of [math]Y[/math] given [math]\{X=x_j\}[/math] as
Apply the definitions above to obtain
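The discrete definitions above can be computed directly from a joint law. The table and the helper name `cond_exp_Y_given` below are hypothetical, chosen only to illustrate the formula [math]\E[Y\mid X=x_j]=\sum_k y_k\,\p[Y=y_k\mid X=x_j][/math].

```python
from fractions import Fraction as F

# Hypothetical joint law of (X, Y) on {0,1} x {1,2,3}, given as a table
# joint[(x, y)] = P[X = x, Y = y]; the numbers are made up.
joint = {(0, 1): F(1, 8), (0, 2): F(1, 8), (0, 3): F(1, 4),
         (1, 1): F(1, 4), (1, 2): F(1, 8), (1, 3): F(1, 8)}

def cond_exp_Y_given(x):
    """E[Y | X = x] = sum_k y_k P[Y = y_k | X = x]."""
    p_x = sum(q for (xx, _), q in joint.items() if xx == x)   # P[X = x]
    return sum(y * q / p_x for (xx, y), q in joint.items() if xx == x)

# sanity check: E[ E[Y|X] ] = E[Y]  (here P[X=0] = P[X=1] = 1/2)
EY = sum(y * q for (_, y), q in joint.items())
assert EY == sum(cond_exp_Y_given(x) * F(1, 2) for x in (0, 1))

print(cond_exp_Y_given(0), cond_exp_Y_given(1))
```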
Now let again [math]X[/math] be a r.v. with values in [math]\{x_1,x_2,...,x_n,...\}[/math] and [math]Y[/math] a real valued r.v. The next step is to define [math]\E[Y\mid X][/math] as a function [math]f(X)[/math]. Therefore we introduce the function
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]X[/math] be a countably valued r.v. and let [math]Y[/math] be a real valued r.v. The conditional expectation of [math]Y[/math] given [math]X[/math] is defined by
The above definition does not define [math]\E[Y\mid X][/math] everywhere but rather almost everywhere, since on each set [math]\{X=x\}[/math], where [math]\p[X=x]=0[/math], its value is arbitrary.
Example
Let[d] [math]X\sim\Pi(\lambda)[/math]. Consider a tossing game in which, when [math]X=n[/math], we perform [math]n[/math] independent tosses of a coin, where each toss yields 1 with probability [math]p\in[0,1][/math] and 0 with probability [math]1-p[/math]. Define also [math]S[/math] to be the r.v. giving the total number of 1's obtained in the game. Therefore, if [math]X=n[/math] is given, [math]S[/math] is binomially distributed with parameters [math](p,n)[/math]. We want to compute
- [math]\E[S\mid X][/math]
- [math]\E[X\mid S][/math]
It is more natural to ask for the expected number of 1's obtained in the whole game knowing how many tosses were played; the reverse is a bit more difficult. Logically, we may also note that we always have [math]S\leq X[/math], because we cannot obtain more 1's than the number of tosses played.
- First we compute [math]\E[S\mid X=n][/math]: if [math]X=n[/math], we know that [math]S[/math] is binomially distributed with parameters [math](p,n)[/math] ([math]S\sim \B(p,n)[/math]) and therefore we already know[e]
[[math]] \E[S\mid X=n]=pn. [[/math]]Now we need to identify the function [math]f[/math] defined as in (2) by[[math]] \begin{align*} f:\N&\longrightarrow\R\\ n&\longmapsto pn. \end{align*} [[/math]]Therefore we get by definition[[math]] \E[S\mid X]=pX. [[/math]]
-
Next we want to compute [math]\E[X\mid S=k][/math]: For [math]n\geq k[/math] we have
[[math]] \p[X=n\mid S=k]=\frac{\p[S=k\mid X=n]\p[X=n]}{\p[S=k]}=\frac{\binom{n}{k} p^k(1-p)^{n-k}e^{-\lambda}\frac{\lambda^n}{n!}}{\sum_{m=k}^\infty\binom{m}{k}p^k(1-p)^{m-k}e^{-\lambda}\frac{\lambda^m}{m!}}, [[/math]]since [math]\{S=k\}=\bigsqcup_{m\geq k}\{S=k,X=m\}[/math]. By some algebra we obtain[[math]] \frac{\binom{n}{k}p^k(1-p)^{n-k}e^{-\lambda}\frac{\lambda^n}{n!}}{\sum_{m=k}^\infty\binom{m}{k}p^k(1-p)^{m-k}e^{-\lambda}\frac{\lambda^m}{m!}}=\frac{(\lambda(1-p))^{n-k}e^{-\lambda(1-p)}}{(n-k)!}, [[/math]]i.e. given [math]S=k[/math], the r.v. [math]X-k[/math] is Poisson distributed with parameter [math]\lambda(1-p)[/math]. Hence we get[[math]] \E[X\mid S=k]=\sum_{n\geq k}n\p[X=n\mid S=k]=k+\lambda(1-p). [[/math]]Therefore [math]\E[X\mid S]=S+\lambda(1-p)[/math].
Continuous construction of the conditional expectation
Now we want to define [math]\E[Y\mid X][/math], where [math]X[/math] is no longer assumed to be countably valued. To this end, we recall the following two facts:
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]X:(\Omega,\F,\p)\to (\R^n,\B(\R^n),\lambda)[/math] be a r.v. on that space. The [math]\sigma[/math]-Algebra generated by [math]X[/math] is given by
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]X:(\Omega,\F,\p)\to(\R^n,\B(\R^n),\lambda)[/math] be a r.v. on that space and let [math]Y[/math] be a real valued r.v. on that space. [math]Y[/math] is measurable with respect to [math]\sigma(X)[/math] if and only if there exists a Borel measurable function [math]f:\R^n\to\R[/math] such that
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]Y\in L^2(\Omega,\F,\p)[/math]. Then the conditional expectation of [math]Y[/math] given [math]X[/math] is the unique element [math]\hat Y\in L^2(\Omega,\sigma(X),\p)[/math] such that for all [math]Z\in L^2(\Omega,\sigma(X),\p)[/math]
[math]\hat Y[/math] is the orthogonal projection of [math]Y[/math] onto [math]L^2(\Omega,\sigma(X),\p)[/math].
Since [math]X[/math] takes values in [math]\R^n[/math], there exists a Borel measurable function [math]f:\R^n\to\R[/math] such that
Now let [math]\mathcal{G}\subset\F[/math] be a sub [math]\sigma[/math]-Algebra of [math]\F[/math] and consider the space [math]L^2(\Omega,\mathcal{G},\p)\subset L^2(\Omega,\F,\p)[/math]. It is clear that [math]L^2(\Omega,\mathcal{G},\p)[/math] is a Hilbert space and thus we can project to it.
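On a finite probability space this projection is concrete: if [math]\mathcal{G}[/math] is generated by a partition, the [math]\mathcal{G}[/math]-measurable functions are those constant on each block, and the orthogonal projection of [math]Y[/math] averages [math]Y[/math] over each block. A minimal sketch with made-up values, checking the defining property [math]\E[YZ]=\E[\hat YZ][/math] for the generating indicators:

```python
from fractions import Fraction as F

# Finite sketch: Omega = {0,...,5} with uniform P, and G generated by the
# partition {0,1,2} | {3,4,5}; the values of Y below are made up.
P = {w: F(1, 6) for w in range(6)}
blocks = [(0, 1, 2), (3, 4, 5)]
Y = {0: 1, 1: 4, 2: 1, 3: 2, 4: 8, 5: 2}

# Orthogonal projection onto L^2(Omega, G, P): G-measurable functions are
# constant on blocks, so the projection averages Y over each block.
Yhat = {}
for B in blocks:
    pB = sum(P[w] for w in B)
    avg = sum(Y[w] * P[w] for w in B) / pB
    for w in B:
        Yhat[w] = avg

# Defining property: E[Y Z] = E[Yhat Z] for G-measurable test r.v.'s Z,
# here the indicators of the generating blocks.
for B in blocks:
    assert sum(Y[w] * P[w] for w in B) == sum(Yhat[w] * P[w] for w in B)
print(Yhat)
```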
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]Y\in L^2(\Omega,\F,\p)[/math] and let [math]\mathcal{G}\subset\F[/math] be a sub [math]\sigma[/math]-Algebra of [math]\F[/math]. Then the conditional expectation of [math]Y[/math] given [math]\mathcal{G}[/math] is defined as the unique element [math]\E[Y\mid \mathcal{G}]\in L^2(\Omega,\mathcal{G},\p)[/math] such that for all [math]Z\in L^2(\Omega,\mathcal{G},\p)[/math]
In (3) or (1), it is enough[f] to restrict the test r.v. [math]Z[/math] to the class of r.v.'s of the form
The conditional expectation is an element of [math]L^2[/math], so it is only defined a.s. and not uniquely everywhere. In particular, any statement like [math]\E[Y\mid\mathcal{G}]\geq0[/math] or [math]\E[Y\mid \mathcal{G}]=Z[/math] has to be understood with an implicit a.s.
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]Y\in L^2(\Omega,\F,\p)[/math] and let [math]\mathcal{G}\subset \F[/math] be a sub [math]\sigma[/math]-Algebra of [math]\F[/math].
- If [math]Y\geq 0[/math], then [math]\E[Y\mid \mathcal{G}]\geq 0[/math]
- [math]\E[\E[Y\mid\mathcal{G}]]=\E[Y][/math]
- The map [math]Y\mapsto\E[Y\mid\mathcal{G}][/math] is linear.
For [math](i)[/math] take [math]Z=\one_{\{\E[Y\mid\mathcal{G}] \lt 0\}}[/math] to obtain
Now we want to extend the definition of the conditional expectation to r.v.'s in [math]L^1(\Omega,\F,\p)[/math] or in [math]L^+(\Omega,\F,\p)[/math], the space of non-negative r.v.'s allowing the value [math]\infty[/math].
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]Y\in L^+(\Omega,\F,\p)[/math] and let [math]\mathcal{G}\subset \F[/math] be a sub [math]\sigma[/math]-Algebra of [math]\F[/math]. Then there exists a unique element [math]\E[Y\mid \mathcal{G}]\in L^+(\Omega,\mathcal{G},\p)[/math] such that for all [math]X\in L^+(\Omega,\mathcal{G},\p)[/math]
If [math]Y\geq 0[/math] and [math]Y\in L^2(\Omega,\F,\p)[/math], then we define [math]\E[Y\mid\mathcal{G}][/math] as before. If [math]X\in L^+(\Omega,\mathcal{G},\p)[/math], then [math]X_n=X\land n[/math] is in [math]L^2(\Omega,\mathcal{G},\p)[/math], is positive, and [math]X_n\uparrow X[/math] for [math]n\to\infty[/math]. Using the monotone convergence theorem we get
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]Y\in L^1(\Omega,\F,\p)[/math] and let [math]\mathcal{G}\subset\F[/math] be a sub [math]\sigma[/math]-Algebra of [math]\F[/math]. Then there exists a unique element [math]\E[Y\mid \mathcal{G}]\in L^1(\Omega,\mathcal{G},\p)[/math] such that for every [math]X[/math] bounded and [math]\mathcal{G}[/math]-measurable
- If [math]Y\geq 0[/math], then [math]\E[Y\mid\mathcal{G}]\geq 0[/math]
- The map [math]Y\mapsto \E[Y\mid\mathcal{G}][/math] is linear.
We will only prove the existence, since the rest is exactly the same as before. Write [math]Y=Y^+-Y^-[/math] with [math]Y^+,Y^-\in L^1(\Omega,\F,\p)[/math] and [math]Y^+,Y^-\geq 0[/math]. So [math]\E[Y^+\mid\mathcal{G}][/math] and [math]\E[Y^-\mid\mathcal{G}][/math] are well defined. Now we set
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]X\in L^1(\Omega,\F,\p)[/math] be a r.v. on that space. Then
Take equation (4) and set [math]Z=\one_\Omega[/math].
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]X\in L^1(\Omega,\F,\p)[/math] be a r.v. on that space. Then
We can always write [math]X=X^+-X^-[/math] and also [math]\vert X\vert=X^++X^-[/math]. Therefore we get
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]Y\in L^1(\Omega,\F,\p)[/math] be a r.v. on that space and assume that [math]Y[/math] is independent of the sub [math]\sigma[/math]-Algebra [math]\mathcal{G}\subset\F[/math], i.e. [math]\sigma(Y)[/math] is independent of [math]\mathcal{G}[/math]. Then
Let [math]Z[/math] be a bounded and [math]\mathcal{G}[/math]-measurable r.v. and therefore [math]Y[/math] and [math]Z[/math] are independent. Hence we get
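The proposition can be illustrated on a finite space: below, a made-up [math]Y[/math] is chosen independent of the [math]\sigma[/math]-Algebra generated by a two-block partition, and the block averages of [math]Y[/math] (i.e. [math]\E[Y\mid\mathcal{G}][/math]) indeed all equal [math]\E[Y][/math].

```python
from fractions import Fraction as F

# Finite sketch: Omega = {0,1,2,3} with uniform P.  G is generated by the
# partition {0,1} | {2,3}, and Y is independent of G: each value of Y
# splits evenly across both blocks.  The numbers are made up.
P = F(1, 4)
Y = [1, 2, 1, 2]
blocks = [(0, 1), (2, 3)]

# check independence of sigma(Y) and G on the generating events
for y in (1, 2):
    for B in blocks:
        p_joint = sum(P for w in B if Y[w] == y)
        p_y = sum(P for w in range(4) if Y[w] == y)
        assert p_joint == p_y * F(1, 2)     # P[{Y=y} ∩ B] = P[Y=y] P[B]

# E[Y | G] is the block average, which here equals E[Y] on every block
EY = sum(Y[w] * P for w in range(4))
for B in blocks:
    assert sum(F(Y[w], len(B)) for w in B) == EY
print(EY)
```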
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]X[/math] and [math]Y[/math] be two r.v.'s on that space and let [math]\mathcal{G}\subset\F[/math] be a sub [math]\sigma[/math]-Algebra of [math]\F[/math]. Assume further that at least one of these two holds:
- [math]X,Y[/math] and [math]XY[/math] are in [math]L^1(\Omega,\F,\p)[/math] with [math]X[/math] being [math]\mathcal{G}[/math]-measurable.
- [math]X\geq 0[/math], [math]Y\geq 0[/math] with [math]X[/math] being [math]\mathcal{G}[/math]-measurable.
Then
For [math](ii)[/math] assume first that [math]X,Y\geq 0[/math]. Let [math]Z[/math] be a positive and [math]\mathcal{G}[/math]-measurable r.v. Then we can obtain
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math](Y_n)_{n\geq 1}[/math] be a sequence of r.v.'s on that space and let [math]\mathcal{G}\subset\F[/math] be a sub [math]\sigma[/math]-Algebra of [math]\F[/math]. Then we have:
- (Monotone convergence) Assume that [math](Y_n)_{n\geq 1}[/math] is a sequence of positive r.v.'s such that [math]\lim_{n\to\infty}\uparrow Y_n=Y[/math] a.s. Then
[[math]] \lim_{n\to\infty}\E[Y_n\mid\mathcal{G}]=\E[Y\mid \mathcal{G}]. [[/math]]
- (Fatou) Assume that [math](Y_n)_{n\geq 1}[/math] is a sequence of positive r.v.'s. Then
[[math]] \E[\liminf_n Y_n\mid\mathcal{G}]\leq\liminf_n\E[Y_n\mid\mathcal{G}]. [[/math]]
- (Dominated convergence) Assume that [math]Y_n\xrightarrow{n\to\infty}Y[/math] a.s. and that there exists [math]Z\in L^1(\Omega,\F,\p)[/math] such that [math]\vert Y_n\vert\leq Z[/math] for all [math]n[/math]. Then
[[math]] \lim_{n\to\infty}\E[Y_n\mid \mathcal{G}]=\E[Y\mid\mathcal{G}]. [[/math]]
We will only prove [math](i)[/math], since [math](ii)[/math] and [math](iii)[/math] are proved in a similar way (it's a good exercise to do the proof). Since [math](Y_n)_{n\geq 1}[/math] is an increasing sequence, it follows that
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]\varphi:\R\to\R[/math] be a real, convex function. Let [math]X \in L^1(\Omega,\F,\p)[/math] such that [math]\varphi(X)\in L^1(\Omega,\F,\p)[/math]. Then
Exercise.
Example
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]\varphi(x)=x^2[/math] and let [math]X\in L^2(\Omega,\F,\p)[/math]. Then
for all sub [math]\sigma[/math]-Algebras [math]\mathcal{G}\subset \F[/math].
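On each block of a partition generating [math]\mathcal{G}[/math], conditional Jensen for [math]\varphi(x)=x^2[/math] reduces to the elementary inequality (block mean)[math]^2\leq[/math] block mean of squares. A minimal sketch with made-up conditional probabilities and values:

```python
from fractions import Fraction as F

# Conditional Jensen for phi(x) = x^2, checked on a single block of a
# partition generating G; the probabilities and values are made up.
probs = [F(1, 4), F(1, 4), F(1, 2)]   # conditional probabilities on block
vals = [F(1), F(3), F(-2)]            # values of X on the block

m1 = sum(q * x for q, x in zip(probs, vals))        # E[X | block]
m2 = sum(q * x * x for q, x in zip(probs, vals))    # E[X^2 | block]

# phi(E[X|G]) <= E[phi(X)|G] on this block
assert m1 * m1 <= m2
print(m1 * m1, m2)
```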
Let [math](\Omega,\F,\p)[/math] be a probability space. Let [math]X\in L^1(\Omega,\F,\p)[/math] be a positive r.v. on that space. Let [math]\mathcal{C}\subset\mathcal{G}\subset \F[/math] be a tower of sub [math]\sigma[/math]-Algebras of [math]\F[/math]. Then
Let [math]Z[/math] be a bounded and [math]\mathcal{C}[/math]-measurable r.v. Then we obtain
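The tower property can also be checked directly on a small finite example, taking [math]\mathcal{C}[/math] to be the trivial [math]\sigma[/math]-Algebra so that conditioning on [math]\mathcal{C}[/math] is plain expectation; all values below are made up.

```python
from fractions import Fraction as F

# Tower property on Omega = {0,1,2,3} with uniform P.  G is generated by
# the partition {0,1} | {2,3}; C = {emptyset, Omega} is the trivial
# sigma-algebra, so C ⊂ G ⊂ F.  The values of X are made up.
P = F(1, 4)
X = [F(5), F(1), F(2), F(4)]

# E[X | G]: average of X over each G-block
EG = {}
for B in [(0, 1), (2, 3)]:
    avg = sum(X[w] for w in B) / len(B)
    for w in B:
        EG[w] = avg

# Since C is trivial, E[ E[X|G] | C ] is the plain expectation of E[X|G]...
lhs = sum(EG[w] * P for w in range(4))
# ...and must equal E[X | C] = E[X].
rhs = sum(X[w] * P for w in range(4))
assert lhs == rhs
print(lhs, rhs)
```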
General references
Moshayedi, Nima (2020). "Lectures on Probability Theory". arXiv:2010.16280 [math.PR].
Notes
- One can look it up for more details in the stochastics I part.
- Use the previous facts for the proof of Bayes' formula. One can also look it up in the stochastics I part.
- One also has to notice that if [math]A[/math] and [math]B[/math] are two independent events, then [math]\p[A\mid B]=\frac{\p[A\cap B]}{\p[B]}=\frac{\p[A]\p[B]}{\p[B]}=\p[A][/math]
- Recall that this means that [math]X[/math] is Poisson distributed: [math]\p[X=k]=e^{-\lambda}\frac{\lambda^k}{k!}[/math] for [math]k\in\N[/math]
- If [math]X\sim \B(p,n)[/math] then [math]\E[X]=pn[/math]. For further calculation, one can look it up in the stochastics I notes
- Since we can always consider linear combinations of [math]\one_A[/math] and then apply density theorems to it
- because for [math]Y\in L^2[/math] and [math]U\in L^2[/math] we get [math]Y\geq U\Longrightarrow Y-U\geq 0\Longrightarrow \E[Y\mid\mathcal{G}]\geq \E[U\mid\mathcal{G}][/math]
- Note that for any [math]W\in L^+[/math] with [math]\E[W] \lt \infty[/math], the set [math]E[/math] on which [math]W=\infty[/math] is a null set. For suppose not; then [math]\E[W]\geq \E[\infty \one_E]=\infty\,\p[E]=\infty[/math] since [math]\p[E] \gt 0[/math], a contradiction
- Recall the classical limit theorems for integrals. Monotone convergence: let [math](f_n)_{n\geq 1}[/math] be an increasing sequence of positive and measurable functions and let [math]f=\lim_{n\to\infty}\uparrow f_n[/math]. Then [math]\int fd\mu=\lim_{n\to\infty}\int f_nd\mu[/math]. Fatou: let [math](f_n)_{n\geq 1}[/math] be a sequence of measurable and positive functions. Then [math]\int\liminf_n f_n d\mu\leq \liminf_n \int f_nd\mu[/math]. Dominated convergence: let [math](f_n)_{n\geq 1}[/math] be a sequence of integrable functions with [math]\vert f_n\vert\leq g[/math] for all [math]n[/math] with [math]g[/math] integrable. Denote [math]f=\lim_{n\to\infty}f_n[/math]. Then [math]\lim_{n\to\infty}\int f_nd\mu=\int fd\mu[/math]