<div class="d-none"><math>
\newcommand{\R}{\mathbb{R}}
\newcommand{\A}{\mathcal{A}}
\newcommand{\B}{\mathcal{B}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\C}{\mathbb{C}}
\newcommand{\Rbar}{\overline{\mathbb{R}}}
\newcommand{\Bbar}{\overline{\mathcal{B}}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\p}{\mathbb{P}}
\newcommand{\one}{\mathds{1}}
\newcommand{\0}{\mathcal{O}}
\newcommand{\mat}{\textnormal{Mat}}
\newcommand{\sign}{\textnormal{sign}}
\newcommand{\CP}{\mathcal{P}}
\newcommand{\CT}{\mathcal{T}}
\newcommand{\CY}{\mathcal{Y}}
\newcommand{\F}{\mathcal{F}}
\newcommand{\mathds}{\mathbb}</math></div>
===Moments and Variance===
Let <math>(\Omega,\A,\p)</math> be a probability space. Let <math>X</math> be a r.v. and let <math>p\geq 1</math> be an integer (or even a real number). The <math>p</math>-th moment of <math>X</math> is by definition <math>\E[X^p]</math>, which is well defined when <math>X\geq 0</math> or when
<math display="block">
\E[\vert X\vert^p]=\int_\Omega\vert X(\omega)\vert^p\, d\p(\omega) < \infty.
</math>
When <math>p=1</math>, we recover the expected value. We say that <math>X</math> is ''centered'' if <math>\E[X]=0</math>. The spaces <math>L^p(\Omega,\A,\p)</math> for <math>p\in[1,\infty)</math> are defined as in the course ''Measure and Integral''. From Hölder's inequality we observe that
<math display="block">
\E[\vert XY\vert]\leq \E[\vert X\vert^p]^{\frac{1}{p}}\E[\vert Y\vert^q]^{\frac{1}{q}},
</math>
whenever <math>p,q\geq 1</math> and <math>\frac{1}{p}+\frac{1}{q}=1</math>. If we take <math>Y=1</math> above, we obtain
<math display="block">
\E[\vert X\vert]\leq \E[\vert X\vert^p]^{\frac{1}{p}},
</math>
which means <math>\|X\|_1\leq \|X\|_p</math>. More generally, on a probability space <math>\|X\|_r\leq \|X\|_p</math> whenever <math>1\leq r\leq p</math>, and it follows that <math>L^p(\Omega,\A,\p)\subset L^r(\Omega,\A,\p)</math>. For <math>p=q=2</math> we obtain the Cauchy-Schwarz inequality
<math display="block">
\E[\vert XY\vert]\leq \E[\vert X\vert^2]^{\frac{1}{2}}\E[\vert Y\vert^2]^{\frac{1}{2}}.
</math>
With <math>Y=1</math> this gives <math>\E[\vert X\vert]^2\leq\E[X^2]</math>.
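As a quick numerical sanity check (not part of the lecture notes as such), the norm monotonicity <math>\|X\|_r\leq\|X\|_p</math> and the Cauchy-Schwarz inequality can be verified on a small finite probability space; the point masses and values below are illustrative choices, a sketch rather than anything canonical.

```python
# Verify ||X||_1 <= ||X||_2 <= ||X||_4 and E[|XY|] <= ||X||_2 ||Y||_2
# on a 3-point probability space (illustrative data).

def lp_norm(values, probs, p):
    """||X||_p = E[|X|^p]^(1/p) for a discrete r.v. given by (values, probs)."""
    return sum(q * abs(x) ** p for x, q in zip(values, probs)) ** (1.0 / p)

probs = [0.2, 0.5, 0.3]          # a probability measure on 3 points
X = [-1.0, 2.0, 4.0]             # values of X on those points
Y = [3.0, -0.5, 1.0]             # values of Y on the same points

# norm monotonicity on a probability space
assert lp_norm(X, probs, 1) <= lp_norm(X, probs, 2) <= lp_norm(X, probs, 4)

# Cauchy-Schwarz, i.e. Hoelder with p = q = 2
E_absXY = sum(q * abs(x * y) for x, y, q in zip(X, Y, probs))
assert E_absXY <= lp_norm(X, probs, 2) * lp_norm(Y, probs, 2)
```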
{{definitioncard|Variance|Let <math>(\Omega,\A,\p)</math> be a probability space. Consider a r.v. <math>X\in L^2(\Omega,\A,\p)</math>. The variance of <math>X</math> is defined as
<math display="block">
Var(X)=\E[(X-\E[X])^2]
</math>
and the standard deviation of <math>X</math> is given by
<math display="block">
\sigma_X=\sqrt{Var(X)}.
</math>
}}
{{alert-info |
Informally, the variance represents the deviation of <math>X</math> around its mean <math>\E[X]</math>. Note that <math>Var(X)=0</math> if and only if <math>X</math> is a.s. constant.
}}
{{proofcard|Proposition|prop-1|
<math display="block">
Var(X)=\E[X^2]-\E[X]^2,\qquad\text{and for all $a\in\R$ we get}\qquad\E[(X-a)^2]=Var(X)+(\E[X]-a)^2.
</math>
Consequently, we get
<math display="block">
Var(X)=\inf_{a\in\R}\E[(X-a)^2].
</math>
|
<math display="block">
Var(X)=\E[(X-\E[X])^2]=\E[X^2-2\E[X]X+\E[X]^2]=\E[X^2]-2\E[X]\E[X]+\E[X]^2=\E[X^2]-\E[X]^2.
</math>
Moreover, since the cross term <math>2\E[(X-\E[X])(\E[X]-a)]</math> vanishes, we have
<math display="block">
\E[(X-a)^2]=\E[((X-\E[X])+(\E[X]-a))^2]=Var(X)+(\E[X]-a)^2,
</math>
which implies that for all <math>a\in\R</math>
<math display="block">
\E[(X-a)^2]\geq Var(X)
</math>
and there is equality when <math>a=\E[X]</math>.}}
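Both identities of the proposition are easy to confirm numerically on a discrete distribution; the weights below are an illustrative choice, not tied to anything in the notes.

```python
# Check Var(X) = E[X^2] - E[X]^2 and E[(X-a)^2] = Var(X) + (E[X]-a)^2,
# so that a -> E[(X-a)^2] is minimized at a = E[X] (illustrative data).

probs = [0.25, 0.25, 0.5]
X = [0.0, 1.0, 3.0]

EX  = sum(q * x for x, q in zip(X, probs))              # E[X]
EX2 = sum(q * x * x for x, q in zip(X, probs))          # E[X^2]
var = sum(q * (x - EX) ** 2 for x, q in zip(X, probs))  # Var(X)

assert abs(var - (EX2 - EX ** 2)) < 1e-12

for a in [-1.0, 0.0, EX, 2.5]:
    msd = sum(q * (x - a) ** 2 for x, q in zip(X, probs))
    assert abs(msd - (var + (EX - a) ** 2)) < 1e-12     # exact identity
    assert msd >= var - 1e-12                           # Var is the infimum
```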
{{proofcard|Remark|random3|It follows that if <math>X</math> is centered (i.e. <math>\E[X]=0</math>), we get <math>Var(X)=\E[X^2]</math>. Moreover, the following two simple inequalities are very often used. <ul style{{=}}"list-style-type:lower-roman"><li>(''Markov's inequality'') If <math>X\geq 0</math> and <math>a > 0</math> then
<math display="block">
\boxed{\p[X \geq a]\leq \frac{1}{a}\E[X]}.
</math>
</li>
<li>(''Chebyshev's inequality'') For <math>a > 0</math>,
<math display="block">
\boxed{\p[\vert X-\E[X]\vert \geq a]\leq \frac{1}{a^2}Var(X)}.
</math>
</li>
</ul>|We want to show both inequalities.
<ul style{{=}}"list-style-type:lower-roman"><li>Note that
<math display="block">
\p[X\geq a]=\E[\one_{\{X\geq a\}}]\leq \E\left[\frac{X}{a}\underbrace{\one_{\{X\geq a\}}}_{\leq 1}\right]\leq \E\left[\frac{X}{a}\right].
</math>
</li>
<li>This follows from (i) applied to the positive r.v. <math>\vert X-\E[X]\vert^2</math>, since
<math display="block">
\p[\vert X-\E[X]\vert\geq a]=\p[\vert X-\E[X]\vert^2\geq a^2]\leq \frac{1}{a^2}\underbrace{\E[\vert X-\E[X]\vert^2]}_{Var(X)}.
</math>
</li>
</ul>
}}
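The two tail bounds can be sanity-checked numerically; the distribution below is an illustrative choice (nonnegative, as Markov's inequality requires).

```python
# Check Markov's and Chebyshev's inequalities on a discrete distribution.

probs = [0.1, 0.4, 0.3, 0.2]
X = [0.0, 1.0, 2.0, 5.0]                       # nonnegative values

EX  = sum(q * x for x, q in zip(X, probs))     # E[X]
var = sum(q * (x - EX) ** 2 for x, q in zip(X, probs))

for a in [1.0, 2.0, 4.0]:
    tail = sum(q for x, q in zip(X, probs) if x >= a)
    assert tail <= EX / a + 1e-12              # Markov: P[X >= a] <= E[X]/a
    dev = sum(q for x, q in zip(X, probs) if abs(x - EX) >= a)
    assert dev <= var / a ** 2 + 1e-12         # Chebyshev: P[|X-E[X]| >= a] <= Var/a^2
```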
{{definitioncard|Covariance|Let <math>(\Omega,\A,\p)</math> be a probability space. Consider two r.v.'s <math>X,Y\in L^2(\Omega,\A,\p)</math>. The covariance of <math>X</math> and <math>Y</math> is defined as
<math display="block">
Cov(X,Y)=\E[(X-\E[X])(Y-\E[Y])]=\E[XY]-\E[X]\E[Y].
</math>
}}
If <math>X=(X_1,...,X_d)\in\R^d</math> is a r.v. such that <math>\forall i\in\{1,...,d\}</math>, <math>X_i\in L^2(\Omega,\A,\p)</math>, then the covariance matrix of <math>X</math> is defined as
<math display="block">
K_X=\left( Cov(X_i,X_j)\right)_{1\leq i,j\leq d}.
</math>
Informally speaking, the covariance between <math>X</math> and <math>Y</math> measures the correlation between <math>X</math> and <math>Y</math>. Note that <math>Cov(X,X)=Var(X)</math> and from Cauchy-Schwarz we get
<math display="block">
\left\vert Cov(X,Y)\right\vert\leq \sqrt{Var(X)}\cdot\sqrt{Var(Y)}.
</math>
The application <math>(X,Y)\mapsto Cov(X,Y)</math> is a bilinear form on <math>L^2(\Omega,\A,\p)</math>. We also note that <math>K_X</math> is symmetric and positive, i.e. if <math>\lambda_1,...,\lambda_d\in\R</math>, <math>\lambda=(\lambda_1,...,\lambda_d)^T</math>, then
<math display="block">
\left\langle K_X\lambda,\lambda\right\rangle=\sum_{i,j=1}^d\lambda_i\lambda_jCov(X_i,X_j)\geq 0.
</math>
Indeed, we get
<math display="block">
\begin{align*}
\sum_{i,j=1}^d\lambda_i\lambda_jCov(X_i,X_j)&=Var\left(\sum_{j=1}^d\lambda_jX_j\right)=\E\left[\left(\sum_{j=1}^d\lambda_jX_j-\E\left[\sum_{j=1}^d\lambda_j X_j\right]\right)^2\right]\\
&=\E\left[\left(\sum_{j=1}^d\lambda_jX_j-\sum_{j=1}^d\lambda_j\E[X_j]\right)^2\right]=\E\left[\left(\sum_{j=1}^d \lambda_j(X_j-\E[X_j])\right)^2\right]\\
&=\E\left[\sum_{j=1}^d\lambda_j(X_j-\E[X_j])\sum_{i=1}^d\lambda_i(X_i-\E[X_i])\right]\\
&=\sum_{i,j=1}^d\lambda_i\lambda_j\E[(X_i-\E[X_i])(X_j-\E[X_j])]\geq 0.
\end{align*}
</math>
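The identity <math>\langle K_X\lambda,\lambda\rangle=Var\big(\sum_j\lambda_jX_j\big)\geq 0</math> can be confirmed numerically for a two-dimensional random vector; the joint distribution below is an illustrative choice.

```python
# Check that the quadratic form of the covariance matrix K_X equals
# Var(sum_j lambda_j X_j), hence is >= 0 (illustrative 2-d example).

probs = [0.3, 0.3, 0.4]                        # a 3-point sample space
X1 = [1.0, -2.0, 0.5]
X2 = [0.0, 1.0, -1.0]

def E(vals):
    return sum(q * v for v, q in zip(vals, probs))

def cov(u, v):
    mu, mv = E(u), E(v)
    return sum(q * (a - mu) * (b - mv) for a, b, q in zip(u, v, probs))

K = [[cov(X1, X1), cov(X1, X2)],
     [cov(X2, X1), cov(X2, X2)]]              # covariance matrix K_X

for lam in [(1.0, 0.0), (0.5, -1.5), (-2.0, 3.0)]:
    quad = sum(lam[i] * lam[j] * K[i][j] for i in range(2) for j in range(2))
    S = [lam[0] * a + lam[1] * b for a, b in zip(X1, X2)]  # sum_j lambda_j X_j
    assert abs(quad - cov(S, S)) < 1e-12       # quadratic form = Var(S)
    assert quad >= -1e-12                      # positive semidefinite
```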
===Linear Regression===
Let <math>(\Omega,\A,\p)</math> be a probability space. Let <math>X,Y_1,...,Y_n</math> be r.v.'s in <math>L^2(\Omega,\A,\p)</math>. We want the best approximation of <math>X</math> as an affine function of <math>Y_1,...,Y_n</math>. More precisely we want to minimize
<math display="block">
\E[(X-(\beta_0+\beta_1Y_1+...+\beta_nY_n))^2]
</math>
over all possible choices of <math>(\beta_0,\beta_1,...,\beta_n)</math>.
{{proofcard|Proposition|prop-2|Let <math>(\Omega,\A,\p)</math> be a probability space. Let <math>X,Y_1,...,Y_n\in L^2(\Omega,\A,\p)</math> be r.v.'s. Then
<math display="block">
\inf_{(\beta_0,\beta_1,...,\beta_n)\in\R^{n+1}}\E[(X-(\beta_0+\beta_1Y_1+...+\beta_nY_n))^2]=\E[(X-Z)^2],
</math>
where <math>Z=\E[X]+\sum_{j=1}^n\alpha_j(Y_j-\E[Y_j])</math> and the <math>\alpha_j</math>'s are solutions to the system
<math display="block">
\sum_{j=1}^n\alpha_jCov(Y_j,Y_k)=Cov(X,Y_k),\qquad 1\leq k\leq n.
</math>
In particular if <math>K_Y</math> is invertible, we have <math>\alpha=K_Y^{-1}Cov(X,Y)</math>, where
<math display="block">
Cov(X,Y)=\begin{pmatrix}Cov(X,Y_1)\\ \vdots \\ Cov(X,Y_n)\end{pmatrix}.
</math>
|Let <math>H</math> be the linear subspace of <math>L^2(\Omega,\A,\p)</math> spanned by <math>\{1,Y_1,...,Y_n\}</math>. Then we know that the r.v. <math>Z</math>, which minimizes
<math display="block">
\|X-U\|_2^2=\E[(X-U)^2]
</math>
for <math>U\in H</math>, is the orthogonal projection of <math>X</math> on <math>H</math>. We can thus write
<math display="block">
Z=\alpha_0+\sum_{j=1}^n\alpha_j(Y_j-\E[Y_j]).
</math>
The orthogonality of <math>X-Z</math> to the constant <math>1\in H</math> can be written as <math>\E[(X-Z)\cdot 1]=0</math>. Therefore <math>\E[X]=\E[Z]</math> and thus <math>\alpha_0=\E[X]</math>. Moreover, we get <math>\E[(X-Z)(Y_k-\E[Y_k])]=0</math> for all <math>k\in\{1,...,n\}</math>, i.e. <math>\E[((X-\E[X])-(Z-\E[Z]))(Y_k-\E[Y_k])]=0</math>, which says exactly that <math>Cov(X,Y_k)=Cov(Z,Y_k)=\sum_{j=1}^n\alpha_jCov(Y_j,Y_k)</math>.}}
{{alert-info |
When <math>n=1</math>, we have
<math display="block">
Z=\E[X]+\frac{Cov(X,Y)}{Var(Y)}(Y-\E[Y]).
</math>
}}
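The <math>n=1</math> formula can be checked numerically: among all affine predictors <math>\beta_0+\beta_1 Y</math>, the choice <math>\beta_1=Cov(X,Y)/Var(Y)</math>, <math>\beta_0=\E[X]-\beta_1\E[Y]</math> attains the smallest mean-squared error. The joint distribution below is an illustrative choice.

```python
# Check the n = 1 linear regression formula on a discrete joint law.

probs = [0.2, 0.3, 0.5]
X = [1.0, 2.5, 4.0]
Y = [0.0, 1.0, 3.0]

def E(vals):
    return sum(q * v for v, q in zip(vals, probs))

def cov(u, v):
    mu, mv = E(u), E(v)
    return sum(q * (a - mu) * (b - mv) for a, b, q in zip(u, v, probs))

b1 = cov(X, Y) / cov(Y, Y)                     # Cov(X,Y) / Var(Y)
b0 = E(X) - b1 * E(Y)                          # so Z = b0 + b1*Y has E[Z] = E[X]
best = sum(q * (x - (b0 + b1 * y)) ** 2 for x, y, q in zip(X, Y, probs))

# any other affine predictor does at least as badly
for c0, c1 in [(0.0, 1.0), (1.0, 0.5), (2.0, 0.0), (b0 + 0.1, b1 - 0.2)]:
    err = sum(q * (x - (c0 + c1 * y)) ** 2 for x, y, q in zip(X, Y, probs))
    assert err >= best - 1e-12
```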
==General references==
{{cite arXiv|last=Moshayedi|first=Nima|year=2020|title=Lectures on Probability Theory|eprint=2010.16280|class=math.PR}}