Moments of Random Variables

[math] \newcommand{\R}{\mathbb{R}} \newcommand{\A}{\mathcal{A}} \newcommand{\B}{\mathcal{B}} \newcommand{\N}{\mathbb{N}} \newcommand{\C}{\mathbb{C}} \newcommand{\Rbar}{\overline{\mathbb{R}}} \newcommand{\Bbar}{\overline{\mathcal{B}}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\E}{\mathbb{E}} \newcommand{\p}{\mathbb{P}} \newcommand{\one}{\mathds{1}} \newcommand{\0}{\mathcal{O}} \newcommand{\mat}{\textnormal{Mat}} \newcommand{\sign}{\textnormal{sign}} \newcommand{\CP}{\mathcal{P}} \newcommand{\CT}{\mathcal{T}} \newcommand{\CY}{\mathcal{Y}} \newcommand{\F}{\mathcal{F}} \newcommand{\mathds}{\mathbb}[/math]

Moments and Variance

Let [math](\Omega,\A,\p)[/math] be a probability space. Let [math]X[/math] be a r.v. and let [math]p\geq 1[/math] be an integer (or even a real number). The [math]p[/math]-th moment of [math]X[/math] is by definition [math]\E[X^p][/math], which is well defined whenever [math]X\geq 0[/math] or [math]\E[\vert X\vert^p] \lt \infty[/math], where the latter condition means by definition

[[math]] \E[\vert X\vert^p]=\int_\Omega\vert X(\omega)\vert^pd\p(\omega) \lt \infty. [[/math]]

When [math]p=1[/math], we get the expected value. We say that [math]X[/math] is centered if [math]\E[X]=0[/math]. The spaces [math]L^p(\Omega,\A,\p)[/math] for [math]p\in[1,\infty)[/math] are defined as we have seen in the course Measure and Integral. From Hölder's inequality we can observe that

[[math]] \E[\vert XY\vert]\leq \E[\vert X\vert^p]^{\frac{1}{p}}\E[\vert Y\vert^q]^{\frac{1}{q}}, [[/math]]

whenever [math]p,q\geq 1[/math] and [math]\frac{1}{p}+\frac{1}{q}=1[/math]. If we take [math]Y=1[/math] above, we obtain

[[math]] \E[\vert X\vert]\leq \E[\vert X\vert^p]^{\frac{1}{p}}, [[/math]]

which means [math]\|X\|_1\leq \|X\|_p[/math]. This extends to [math]\|X\|_r\leq \|X\|_p[/math] whenever [math]1\leq r\leq p[/math], and it follows that [math]L^p(\Omega,\A,\p)\subset L^r(\Omega,\A,\p)[/math]. For [math]p=q=2[/math] we get the Cauchy-Schwarz inequality as follows

[[math]] \E[\vert XY\vert]\leq \E[\vert X\vert^2]^{\frac{1}{2}}\E[\vert Y\vert^2]^{\frac{1}{2}}. [[/math]]

With [math]Y=1[/math] we have [math]\E[\vert X\vert]^2\leq \E[X^2][/math].
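
The monotonicity of norms claimed above can be checked directly (a short derivation, not spelled out in the original text): for [math]1\leq r \lt p[/math] (the case [math]r=p[/math] is trivial), apply Hölder's inequality to the product [math]\vert X\vert^r\cdot 1[/math] with the conjugate exponents [math]\frac{p}{r}[/math] and [math]\frac{p}{p-r}[/math], which gives

[[math]] \E[\vert X\vert^r]\leq \E\left[\vert X\vert^{r\cdot\frac{p}{r}}\right]^{\frac{r}{p}}\E[1]^{\frac{p-r}{p}}=\E[\vert X\vert^p]^{\frac{r}{p}}. [[/math]]

Taking the [math]r[/math]-th root yields [math]\|X\|_r\leq \|X\|_p[/math]; here it is essential that [math]\p[/math] is a probability measure, so that [math]\E[1]=1[/math].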

Definition (Variance)

Let [math](\Omega,\A,\p)[/math] be a probability space. Consider a r.v. [math]X\in L^2(\Omega,\A,\p)[/math]. The variance of [math]X[/math] is defined as

[[math]] Var(X)=\E[(X-\E[X])^2] [[/math]]
and the standard deviation of [math]X[/math] is given by

[[math]] \sigma_X=\sqrt{Var(X)}. [[/math]]

Informally, the variance measures how much [math]X[/math] spreads around its mean [math]\E[X][/math]. Note that [math]Var(X)=0[/math] if and only if [math]X[/math] is a.s. constant.
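
As a quick illustration (a standard example, not taken from the original notes), let [math]X[/math] take the values [math]+1[/math] and [math]-1[/math], each with probability [math]\frac{1}{2}[/math]. Then

[[math]] \E[X]=\tfrac{1}{2}\cdot 1+\tfrac{1}{2}\cdot(-1)=0,\qquad Var(X)=\E[(X-\E[X])^2]=\E[X^2]=1,\qquad \sigma_X=1. [[/math]]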

Proposition


[[math]] Var(X)=\E[X^2]-\E[X]^2\qquad\text{and for all $a\in\R$ we get}\qquad\E[(X-a)^2]=Var(X)+(\E[X]-a)^2. [[/math]]

Consequently, we get

[[math]] Var(X)=\inf_{a\in\R}\E[(X-a)^2]. [[/math]]


Show Proof

[[math]] Var(X)=\E[(X-\E[X])^2]=\E[X^2-2\E[X]X+\E[X]^2]=\E[X^2]-2\E[X]\E[X]+\E[X]^2=\E[X^2]-\E[X]^2. [[/math]]
Moreover, we have

[[math]] \E[(X-a)^2]=\E[((X-\E[X])+(\E[X]-a))^2]=Var(X)+2(\E[X]-a)\underbrace{\E[X-\E[X]]}_{=0}+(\E[X]-a)^2=Var(X)+(\E[X]-a)^2, [[/math]]
which implies that for all [math]a\in\R[/math]

[[math]] \E[(X-a)^2]\geq Var(X) [[/math]]
and there is equality when [math]a=\E[X][/math].
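
To see the formula [math]Var(X)=\E[X^2]-\E[X]^2[/math] in action (a standard example, added here for illustration), let [math]X[/math] be a Bernoulli r.v. with [math]\p[X=1]=p[/math] and [math]\p[X=0]=1-p[/math]. Since [math]X^2=X[/math], we get

[[math]] \E[X]=\E[X^2]=p,\qquad Var(X)=\E[X^2]-\E[X]^2=p-p^2=p(1-p). [[/math]]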

Remark

It follows that if [math]X[/math] is centered (i.e. [math]\E[X]=0[/math]), we get [math]Var(X)=\E[X^2][/math]. Moreover, the following two simple inequalities are used very often; a short numerical illustration is given after the proof below.

  • (Markov inequality) If [math]X\geq 0[/math] and [math]a \gt 0[/math] then
    [[math]] \boxed{\p[X \gt a]\leq \frac{1}{a}\E[X]}. [[/math]]
  • (Chebyshev inequality) If [math]X\in L^2(\Omega,\A,\p)[/math] and [math]a \gt 0[/math] then
    [[math]] \boxed{\p[\vert X-\E[X]\vert \gt a]\leq \frac{1}{a^2}Var(X)}. [[/math]]

Show Proof

We want to show both inequalities.

  • Note that
    [[math]] \p[X\geq a]=\E[\one_{\{X\geq a\}}]\leq \E\left[\frac{X}{a}\underbrace{\one_{\{X\geq a\}}}_{\leq 1}\right]\leq \E\left[\frac{X}{a}\right]. [[/math]]
  • The second inequality follows from the first one, since [math]\vert X-\E[X]\vert^2[/math] is a nonnegative r.v., and hence
    [[math]] \p[\vert X-\E[X]\vert\geq a]=\p[\vert X-\E[X]\vert^2\geq a^2]\leq \frac{1}{a^2}\underbrace{\E[\vert X-\E[X]\vert^2]}_{Var(X)}. [[/math]]
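
For instance (a standard consequence, added for illustration), taking [math]a=2\sigma_X[/math] in Chebyshev's inequality shows that any r.v. [math]X\in L^2(\Omega,\A,\p)[/math] with [math]\sigma_X \gt 0[/math] satisfies

[[math]] \p[\vert X-\E[X]\vert \gt 2\sigma_X]\leq \frac{Var(X)}{4\sigma_X^2}=\frac{1}{4}, [[/math]]

i.e. [math]X[/math] deviates from its mean by more than two standard deviations with probability at most [math]\frac{1}{4}[/math].
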
Definition (Covariance)

Let [math](\Omega,\A,\p)[/math] be a probability space. Consider two r.v.'s [math]X,Y\in L^2(\Omega,\A,\p)[/math]. The covariance of [math]X[/math] and [math]Y[/math] is defined as

[[math]] Cov(X,Y)=\E[(X-\E[X])(Y-\E[Y])]=\E[XY]-\E[X]\E[Y]. [[/math]]

If [math]X=(X_1,...,X_d)\in\R^d[/math] is a r.v. such that [math]\forall i\in\{1,...,d\}[/math], [math]X_i\in L^2(\Omega,\A,\p)[/math], then the covariance matrix of [math]X[/math] is defined as

[[math]] K_X=\left( Cov(X_i,X_j)\right)_{1\leq i,j\leq d}. [[/math]]

Informally speaking, the covariance of [math]X[/math] and [math]Y[/math] measures how [math]X[/math] and [math]Y[/math] vary together. Note that [math]Cov(X,X)=Var(X)[/math] and from Cauchy-Schwarz we get

[[math]] \left\vert Cov(X,Y)\right\vert\leq \sqrt{Var(X)}\cdot\sqrt{Var(Y)}. [[/math]]

The map [math](X,Y)\mapsto Cov(X,Y)[/math] is a symmetric bilinear form on [math]L^2(\Omega,\A,\p)[/math]. We also note that [math]K_X[/math] is symmetric and positive semi-definite, i.e. if [math]\lambda_1,...,\lambda_d\in\R[/math], [math]\lambda=(\lambda_1,...,\lambda_d)^T[/math], then

[[math]] \left\langle K_X\lambda,\lambda\right\rangle=\sum_{i,j=1}^d\lambda_i\lambda_jCov(X_i,X_j)\geq 0. [[/math]]

Indeed, we compute

[[math]] \begin{align*} \sum_{i,j=1}^d\lambda_i\lambda_jCov(X_i,X_j)&=Var\left(\sum_{j=1}^d\lambda_jX_j\right)=\E\left[\left(\sum_{j=1}^d\lambda_jX_j-\E\left[\sum_{j=1}^d\lambda_j X_j\right]\right)^2\right]\\ &=\E\left[\left(\sum_{j=1}^d\lambda_jX_j-\sum_{j=1}^d\lambda_j\E[X_j]\right)^2\right]=\E\left[\left(\sum_{j=1}^d \lambda_j(X_j-\E[X_j])\right)^2\right]\\ &=\E\left[\sum_{j=1}^d\lambda_j(X_j-\E[X_j])\sum_{i=1}^d\lambda_i(X_i-\E[X_i])\right]\\ &=\sum_{i,j=1}^d\lambda_i\lambda_j\E[(X_i-\E[X_i])(X_j-\E[X_j])]\geq 0. \end{align*} [[/math]]
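
For [math]d=2[/math] the positivity condition reads (written out here for concreteness; this computation is not in the original text)

[[math]] \left\langle K_X\lambda,\lambda\right\rangle=\lambda_1^2Var(X_1)+2\lambda_1\lambda_2Cov(X_1,X_2)+\lambda_2^2Var(X_2)\geq 0\qquad\text{for all }(\lambda_1,\lambda_2)\in\R^2. [[/math]]

Viewing the left-hand side as a quadratic polynomial in [math]\lambda_1[/math] (when [math]Var(X_1) \gt 0[/math]), its discriminant must be nonpositive, which recovers the bound [math]Cov(X_1,X_2)^2\leq Var(X_1)Var(X_2)[/math] obtained above from Cauchy-Schwarz.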

Linear Regression

Let [math](\Omega,\A,\p)[/math] be a probability space. Let [math]X,Y_1,...,Y_n[/math] be r.v.'s in [math]L^2(\Omega,\A,\p)[/math]. We want the best approximation of [math]X[/math] as an affine function of [math]Y_1,...,Y_n[/math]. More precisely we want to minimize

[[math]] \E[(X-(\beta_0+\beta_1Y_1+...+\beta_nY_n))^2] [[/math]]

over all possible choices of [math](\beta_0,\beta_1,...,\beta_n)[/math].
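
As a sanity check (an observation added here, not part of the original notes), with no regressors at all the problem reduces to the earlier proposition on the variance:

[[math]] \inf_{\beta_0\in\R}\E[(X-\beta_0)^2]=Var(X),\qquad\text{attained at }\beta_0=\E[X]. [[/math]]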

Proposition

Let [math](\Omega,\A,\p)[/math] be a probability space. Let [math]X,Y_1,...,Y_n\in L^2(\Omega,\A,\p)[/math] be r.v.'s. Then

[[math]] \inf_{(\beta_0,\beta_1,...,\beta_n)\in\R^{n+1}}\E[(X-(\beta_0+\beta_1Y_1+...+\beta_nY_n))^2]=\E[(X-Z)^2], [[/math]]
where [math]Z=\E[X]+\sum_{j=1}^n\alpha_j(Y_j-\E[Y_j])[/math] and the [math]\alpha_j[/math]'s are solutions to the system

[[math]] \sum_{j=1}^n\alpha_jCov(Y_j,Y_k)=Cov(X,Y_k),\qquad 1\leq k\leq n. [[/math]]
In particular if [math]K_Y[/math] is invertible, we have [math]\alpha=K_Y^{-1}Cov(X,Y)[/math], where [math]\alpha=(\alpha_1,...,\alpha_n)^T[/math] and

[[math]] Cov(X,Y)=\begin{pmatrix}Cov(X,Y_1)\\ \vdots \\ Cov(X,Y_n)\end{pmatrix}. [[/math]]


Show Proof

Let [math]H[/math] be the linear subspace of [math]L^2(\Omega,\A,\p)[/math] spanned by [math]\{1,Y_1,...,Y_n\}[/math]. Then we know that the r.v. [math]Z[/math], which minimizes

[[math]] \|X-U\|_2^2=\E[(X-U)^2] [[/math]]
for [math]U\in H[/math], is the orthogonal projection of [math]X[/math] on [math]H[/math]. We can thus write

[[math]] Z=\alpha_0+\sum_{j=1}^n\alpha_j(Y_j-\E[Y_j]). [[/math]]
The orthogonality of [math]X-Z[/math] to [math]H[/math] implies in particular [math]\E[(X-Z)\cdot 1]=0[/math]. Therefore [math]\E[X]=\E[Z][/math] and thus [math]\alpha_0=\E[X][/math]. Moreover, orthogonality to each [math]Y_k[/math] gives [math]\E[(X-Z)(Y_k-\E[Y_k])]=0[/math] for all [math]k\in\{1,...,n\}[/math], and inserting the expression for [math]Z[/math] this becomes [math]Cov(X,Y_k)=\sum_{j=1}^n\alpha_jCov(Y_j,Y_k)[/math], which is exactly the stated system.

When [math]n=1[/math] (writing [math]Y=Y_1[/math] and assuming [math]Var(Y) \gt 0[/math]), we have

[[math]] Z=\E[X]+\frac{Cov(X,Y)}{Var(Y)}(Y-\E[Y]). [[/math]]
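
This follows directly from the system in the proposition (a one-line check, spelled out here for convenience): for [math]n=1[/math] the system reduces to the single equation

[[math]] \alpha_1 Cov(Y,Y)=Cov(X,Y),\qquad\text{i.e.}\qquad \alpha_1=\frac{Cov(X,Y)}{Var(Y)}, [[/math]]

so that [math]Z=\E[X]+\alpha_1(Y-\E[Y])[/math] as claimed.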

General references

Moshayedi, Nima (2020). "Lectures on Probability Theory". arXiv:2010.16280 [math.PR].