<div class="d-none"><math>
\newcommand{\R}{\mathbb{R}}
\newcommand{\A}{\mathcal{A}}
\newcommand{\B}{\mathcal{B}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\C}{\mathbb{C}}
\newcommand{\Rbar}{\overline{\mathbb{R}}}
\newcommand{\Bbar}{\overline{\mathcal{B}}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\p}{\mathbb{P}}
\newcommand{\one}{\mathds{1}}
\newcommand{\0}{\mathcal{O}}
\newcommand{\mat}{\textnormal{Mat}}
\newcommand{\sign}{\textnormal{sign}}
\newcommand{\CP}{\mathcal{P}}
\newcommand{\CT}{\mathcal{T}}
\newcommand{\CY}{\mathcal{Y}}
\newcommand{\F}{\mathcal{F}}
\newcommand{\mathds}{\mathbb}</math></div>
===Moments and Variance===
Let <math>(\Omega,\A,\p)</math> be a probability space. Let <math>X</math> be a r.v. and let <math>p\geq 1</math> be an integer (or even a real number). The <math>p</math>-th moment of <math>X</math> is by definition <math>\E[X^p]</math>, which is well defined when <math>X\geq 0</math> or when <math>\E[\vert X\vert^p] < \infty</math>, which by definition means that


<math display="block">
\E[\vert X\vert^p]=\int_\Omega\vert X(\omega)\vert^pd\p(\omega) < \infty.
</math>
When <math>p=1</math>, we get the expected value. We say that <math>X</math> is ''centered'' if <math>\E[X]=0</math>. The spaces <math>L^p(\Omega,\A,\p)</math> for <math>p\in[1,\infty)</math> are defined as in the course ''Measure and Integral''. From Hölder's inequality we observe that
<math display="block">
\E[\vert XY\vert]\leq \E[\vert X\vert^p]^{\frac{1}{p}}\E[\vert Y\vert^q]^{\frac{1}{q}},
</math>
whenever <math>p,q\geq 1</math> and <math>\frac{1}{p}+\frac{1}{q}=1</math>. If we take <math>Y=1</math> above, we obtain
<math display="block">
\E[\vert X\vert]\leq \E[\vert X\vert^p]^{\frac{1}{p}},
</math>
which means <math>\|X\|_1\leq \|X\|_p</math>. More generally, <math>\|X\|_r\leq \|X\|_p</math> whenever <math>1\leq r\leq p</math>, and it follows that <math>L^p(\Omega,\A,\p)\subset L^r(\Omega,\A,\p)</math>. For <math>p=q=2</math> we recover the Cauchy-Schwarz inequality:
<math display="block">
\E[\vert XY\vert]\leq \E[\vert X\vert^2]^{\frac{1}{2}}\E[\vert Y\vert^2]^{\frac{1}{2}}.
</math>
With <math>Y=1</math> we have <math>\E[\vert X\vert]^2\leq\E[X^2]</math>.
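These inequalities are easy to illustrate numerically. The following is a minimal Monte Carlo sketch in Python/NumPy (the distributions, the sample size and the helper <code>lp_norm</code> are arbitrary choices, for illustration only) estimating the empirical <math>L^p</math> norms of a sampled r.v. and checking the norm monotonicity and the Cauchy-Schwarz inequality.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.exponential(scale=1.0, size=n)   # a nonnegative r.v. with all moments finite
Y = rng.normal(size=n)

def lp_norm(Z, p):
    """Empirical L^p norm E[|Z|^p]^(1/p)."""
    return np.mean(np.abs(Z) ** p) ** (1.0 / p)

# ||X||_1 <= ||X||_2 <= ||X||_4 (monotonicity of L^p norms on a probability space)
print(lp_norm(X, 1) <= lp_norm(X, 2) <= lp_norm(X, 4))

# Cauchy-Schwarz: E[|XY|] <= E[X^2]^(1/2) E[Y^2]^(1/2)
print(np.mean(np.abs(X * Y)) <= lp_norm(X, 2) * lp_norm(Y, 2))
</syntaxhighlight>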
{{definitioncard|Variance|Let <math>(\Omega,\A,\p)</math> be a probability space. Consider a r.v. <math>X\in L^2(\Omega,\A,\p)</math>. The variance of <math>X</math> is defined as
<math display="block">
Var(X)=\E[(X-\E[X])^2]
</math>
and the standard deviation of <math>X</math> is given by
<math display="block">
\sigma_X=\sqrt{Var(X)}.
</math>
}}
{{alert-info |
Informally, the variance represents the deviation of <math>X</math> around its mean <math>\E[X]</math>. Note that <math>Var(X)=0</math> if and only if <math>X</math> is a.s. constant.
}}
{{proofcard|Proposition|prop-1|
<math display="block">
Var(X)=\E[X^2]-\E[X]^2 \quad\text{and, for all } a\in\R, \quad \E[(X-a)^2]=Var(X)+(\E[X]-a)^2.
</math>
Consequently, we get
<math display="block">
Var(X)=\inf_{a\in\R}\E[(X-a)^2].
</math>
|
<math display="block">
Var(X)=\E[(X-\E[X])^2]=\E[X^2-2\E[X]X+\E[X]^2]=\E[X^2]-2\E[X]\E[X]+\E[X]^2=\E[X^2]-\E[X]^2.
</math>
Moreover, we have
<math display="block">
\E[(X-a)^2]=\E\left[\left((X-\E[X])+(\E[X]-a)\right)^2\right]=Var(X)+(\E[X]-a)^2,
</math>
since the cross term <math>2(\E[X]-a)\E[X-\E[X]]</math> vanishes. This implies that for all <math>a\in\R</math>
<math display="block">
\E[(X-a)^2]\geq Var(X)
</math>
and there is equality when <math>a=\E[X]</math>.}}
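One can also check the two identities of this proposition by simulation. The sketch below (a Python/NumPy illustration; the gamma distribution and the grid of candidate values of <math>a</math> are arbitrary choices) compares the two expressions for <math>Var(X)</math> and verifies that <math>\E[(X-a)^2]</math> is minimized near <math>a=\E[X]</math>.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
X = rng.gamma(shape=2.0, scale=3.0, size=200_000)

# Var(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2
var_def = np.mean((X - X.mean()) ** 2)
var_alt = np.mean(X ** 2) - X.mean() ** 2
print(np.isclose(var_def, var_alt))

# E[(X - a)^2], as a function of a, is minimized at a = E[X]
a_grid = np.linspace(X.mean() - 5.0, X.mean() + 5.0, 201)
mse = [np.mean((X - a) ** 2) for a in a_grid]
print(np.isclose(a_grid[np.argmin(mse)], X.mean(), atol=0.1))
</syntaxhighlight>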
{{alert-info |
\label{random3}
It follows that if <math>X</math> is centered (i.e. <math>\E[X]=0</math>), we get <math>Var(X)=\E[X^2]</math>. Moreover, the following two simple inequalities are very often used.
<ul style{{=}}"list-style-type:lower-roman"><li>(''Markov inequality'') If <math>X\geq 0</math> and <math>a > 0</math> then
<math display="block">
\boxed{\p[X > a]\leq \frac{1}{a}\E[X]}.
</math>
</li>
<li>(''Chebyshev inequality'') For <math>a > 0</math>,
<math display="block">
\boxed{\p[\vert X-\E[X]\vert > a]\leq \frac{1}{a^2}Var(X)}.
</math>
</li>
</ul>
}}
\begin{proof}[Proof of [[#random3 |Remark]]] We want to show both inequalities. Since <math>\p[X > a]\leq \p[X\geq a]</math>, it is enough to bound the latter.
<ul style{{=}}"list-style-type:lower-roman"><li>Note that
<math display="block">
\p[X\geq a]=\E[\one_{\{X\geq a\}}]\leq \E\left[\frac{X}{a}\underbrace{\one_{\{X\geq a\}}}_{\leq 1}\right]\leq \E\left[\frac{X}{a}\right].
</math>
</li>
<li>This follows from (i) applied to the nonnegative r.v. <math>\vert X-\E[X]\vert^2</math>, since
<math display="block">
\p[\vert X-\E[X]\vert\geq a]=\p[\vert X-\E[X]\vert^2\geq a^2]\leq \frac{1}{a^2}\underbrace{\E[\vert X-\E[X]\vert^2]}_{Var(X)}.
</math>
</li>
</ul>
\end{proof}
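Both bounds are straightforward to sanity-check by simulation. The sketch below (an illustrative Monte Carlo check with an exponential r.v.; any nonnegative distribution with finite variance would do) compares empirical tail probabilities with the Markov and Chebyshev bounds.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(scale=2.0, size=500_000)   # X >= 0 with E[X] = 2 and Var(X) = 4

for a in (1.0, 3.0, 10.0):
    markov_lhs = np.mean(X > a)                      # P[X > a]
    markov_rhs = X.mean() / a                        # E[X] / a
    cheb_lhs = np.mean(np.abs(X - X.mean()) > a)     # P[|X - E[X]| > a]
    cheb_rhs = X.var() / a ** 2                      # Var(X) / a^2
    print(markov_lhs <= markov_rhs, cheb_lhs <= cheb_rhs)
</syntaxhighlight>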
{{definitioncard|Covariance|Let <math>(\Omega,\A,\p)</math> be a probability space. Consider two r.v.'s <math>X,Y\in L^2(\Omega,\A,\p)</math>. The covariance of <math>X</math> and <math>Y</math> is defined as
<math display="block">
Cov(X,Y)=\E[(X-\E[X])(Y-\E[Y])]=\E[XY]-\E[X]\E[Y].
</math>
}}
If <math>X=(X_1,...,X_d)\in\R^d</math> is a r.v. such that <math>\forall i\in\{1,...,d\}</math>, <math>X_i\in L^2(\Omega,\A,\p)</math>, then the covariance matrix of <math>X</math> is defined as
<math display="block">
K_X=\left( Cov(X_i,X_j)\right)_{1\leq i,j\leq d}.
</math>
Informally speaking, the covariance between <math>X</math> and <math>Y</math> measures the correlation between <math>X</math> and <math>Y</math>. Note that <math>Cov(X,X)=Var(X)</math> and from Cauchy-Schwarz we get
<math display="block">
\left\vert Cov(X,Y)\right\vert\leq \sqrt{Var(X)}\cdot\sqrt{Var(Y)}.
</math>
The map <math>(X,Y)\mapsto Cov(X,Y)</math> is a bilinear form on <math>L^2(\Omega,\A,\p)</math>. We also note that <math>K_X</math> is symmetric and positive semi-definite, i.e. if <math>\lambda_1,...,\lambda_d\in\R</math> and <math>\lambda=(\lambda_1,...,\lambda_d)^T</math>, then
<math display="block">
\left\langle K_X\lambda,\lambda\right\rangle=\sum_{i,j=1}^d\lambda_i\lambda_jCov(X_i,X_j)\geq 0.
</math>
Indeed, we have
<math display="block">
\begin{align*}
\sum_{i,j=1}^d\lambda_i\lambda_jCov(X_i,X_j)&=Var\left(\sum_{j=1}^d\lambda_jX_j\right)=\E\left[\left(\sum_{j=1}^d\lambda_jX_j-\E\left[\sum_{j=1}^d\lambda_j X_j\right]\right)^2\right]\\
&=\E\left[\left(\sum_{j=1}^d\lambda_jX_j-\sum_{j=1}^d\lambda_j\E[X_j]\right)^2\right]=\E\left[\left(\sum_{j=1}^d \lambda_j(X_j-\E[X_j])\right)^2\right]\\
&=\E\left[\sum_{j=1}^d\lambda_j(X_j-\E[X_j])\sum_{i=1}^d\lambda_i(X_i-\E[X_i])\right]\\
&=\sum_{i,j=1}^d\lambda_i\lambda_j\E[(X_i-\E[X_i])(X_j-\E[X_j])]\geq 0.
\end{align*}
</math>
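This positivity can also be observed numerically: a sample covariance matrix is symmetric positive semi-definite, and the quadratic form <math>\left\langle K_X\lambda,\lambda\right\rangle</math> agrees with the variance of the linear combination <math>\sum_{j}\lambda_jX_j</math>. The following Python/NumPy sketch (the Gaussian vector and its covariance are arbitrary illustrative choices) checks both facts on simulated data.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 100_000
true_cov = np.array([[2., 1., 0., 0.],
                     [1., 2., 1., 0.],
                     [0., 1., 2., 1.],
                     [0., 0., 1., 2.]])
X = rng.multivariate_normal(mean=np.zeros(d), cov=true_cov, size=n)  # rows are samples in R^d

K_X = np.cov(X, rowvar=False, bias=True)     # empirical (Cov(X_i, X_j))_{i,j}
lam = rng.normal(size=d)

quad_form = lam @ K_X @ lam                  # <K_X lambda, lambda>
var_combo = np.var(X @ lam)                  # Var(sum_j lambda_j X_j), empirically
print(np.isclose(quad_form, var_combo))
print(np.all(np.linalg.eigvalsh(K_X) >= -1e-12))   # positive semi-definiteness
</syntaxhighlight>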
\begin{exer} If <math>A</math> is an <math>n\times d</math> matrix and <math>Y=AX</math>, prove that
<math display="block">
K_Y=AK_XA^T.
</math>
\end{exer}
{{alert-info |
Set <math>X=(X_1,...,X_d)^T</math> and <math>XX^T=(X_iX_j)_{1\leq i,j\leq d}</math>. Then, informally (this is exact when <math>X</math> is centered),
<math display="block">
K_X=\E[XX^T]=(\E[X_iX_j])_{1\leq i,j\leq d},
</math>
and for <math>Y=AX</math> we get
<math display="block">
K_Y=\E[AXX^TA^T]=A\E[XX^T]A^T=AK_XA^T.
</math>
}}
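The transformation rule <math>K_Y=AK_XA^T</math> from the exercise can likewise be verified on simulated data. The sketch below (with an arbitrary correlation structure for <math>X</math> and an arbitrary matrix <math>A</math>) applies <math>A</math> to each sample and compares the two covariance matrices.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 200_000
mixing = np.array([[1., .5, 0.],
                   [0., 1., .5],
                   [0., 0., 1.]])
X = rng.normal(size=(n, d)) @ mixing          # correlated samples of X (rows)
A = np.array([[1., 2., -1.],
              [0., 1.,  3.]])                 # any matrix, here of size 2 x 3

Y = X @ A.T                                   # Y = A X, applied sample by sample
K_X = np.cov(X, rowvar=False, bias=True)
K_Y = np.cov(Y, rowvar=False, bias=True)
print(np.allclose(K_Y, A @ K_X @ A.T))        # K_Y = A K_X A^T
</syntaxhighlight>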
===Linear Regression===
Let <math>(\Omega,\A,\p)</math> be a probability space. Let <math>X,Y_1,...,Y_n</math> be r.v.'s in <math>L^2(\Omega,\A,\p)</math>. We want the best approximation of <math>X</math> as an affine function of <math>Y_1,...,Y_n</math>. More precisely we want to minimize
<math display="block">
\E[(X-(\beta_0+\beta_1Y_1+...+\beta_nY_n))^2]
</math>
over all possible choices of <math>(\beta_0,\beta_1,...,\beta_n)</math>.
{{proofcard|Proposition|prop-2|Let <math>(\Omega,\A,\p)</math> be a probability space. Let <math>X,Y_1,...,Y_n\in L^2(\Omega,\A,\p)</math> be r.v.'s. Then
<math display="block">
\inf_{(\beta_0,\beta_1,...,\beta_n)\in\R^{n+1}}\E[(X-(\beta_0+\beta_1Y_1+...+\beta_nY_n))^2]=\E[(X-Z)^2],
</math>
where <math>Z=\E[X]+\sum_{j=1}^n\alpha_j(Y_j-\E[Y_j])</math> and the <math>\alpha_j</math>'s are solutions to the system
<math display="block">
\sum_{j=1}^n\alpha_jCov(Y_j,Y_k)=Cov(X,Y_k),\qquad 1\leq k\leq n.
</math>
In particular, if <math>K_Y=\left(Cov(Y_j,Y_k)\right)_{1\leq j,k\leq n}</math> is invertible, we have <math>\alpha=K_Y^{-1}Cov(X,Y)</math>, where <math>\alpha=(\alpha_1,...,\alpha_n)^T</math> and
<math display="block">
Cov(X,Y)=\begin{pmatrix}Cov(X,Y_1)\\ \vdots \\ Cov(X,Y_n)\end{pmatrix}.
</math>
|Let <math>H</math> be the linear subspace of <math>L^2(\Omega,\A,\p)</math> spanned by <math>\{1,Y_1,...,Y_n\}</math>. Then we know that the r.v. <math>Z</math>, which minimizes
<math display="block">
\|X-U\|_2^2=\E[(X-U)^2]
</math>
for <math>U\in H</math>, is the orthogonal projection of <math>X</math> on <math>H</math>. We can thus write
<math display="block">
Z=\alpha_0+\sum_{j=1}^n\alpha_j(Y_j-\E[Y_j]).
</math>
The orthogonality of <math>X-Z</math> to <math>H</math> gives in particular <math>\E[(X-Z)\cdot 1]=0</math>. Therefore <math>\E[X]=\E[Z]=\alpha_0</math>, i.e. <math>\alpha_0=\E[X]</math>. Moreover, we get <math>\E[(X-Z)(Y_k-\E[Y_k])]=0</math> for all <math>k\in\{1,...,n\}</math>, which means <math>Cov(X,Y_k)=Cov(Z,Y_k)=\sum_{j=1}^n\alpha_jCov(Y_j,Y_k)</math>, and this is exactly the stated system.}}
{{alert-info |
When <math>n=1</math> (writing <math>Y=Y_1</math>) and <math>Var(Y) > 0</math>, we have
<math display="block">
Z=\E[X]+\frac{Cov(X,Y)}{Var(Y)}(Y-\E[Y]).
</math>
}}
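The proposition can be illustrated numerically as well. In the sketch below (the affine model, the Gaussian noise and all variable names are arbitrary illustrative choices) the coefficients <math>\alpha=K_Y^{-1}Cov(X,Y)</math> computed from sample covariances coincide with the slope coefficients of an ordinary least-squares fit with an intercept.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(5)
n_samples, n = 200_000, 3
mixing = np.array([[1., .3, 0.],
                   [0., 1., .3],
                   [0., 0., 1.]])
Y = rng.normal(size=(n_samples, n)) @ mixing                  # predictors Y_1, ..., Y_n
X = 2.0 + Y @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n_samples)  # target r.v. X

# alpha = K_Y^{-1} Cov(X, Y), built from sample covariances
K_Y = np.cov(Y, rowvar=False, bias=True)
cov_XY = np.array([np.cov(X, Y[:, k], bias=True)[0, 1] for k in range(n)])
alpha = np.linalg.solve(K_Y, cov_XY)

# Ordinary least squares on (1, Y_1, ..., Y_n) recovers the same slopes;
# its intercept corresponds to E[X] - sum_j alpha_j E[Y_j]
design = np.column_stack([np.ones(n_samples), Y])
beta, *_ = np.linalg.lstsq(design, X, rcond=None)
print(np.allclose(alpha, beta[1:]))
</syntaxhighlight>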
==General references==
{{cite arXiv|last=Moshayedi|first=Nima|year=2020|title=Lectures on Probability Theory|eprint=2010.16280|class=math.PR}}
