Previously we studied transformations of a single random variable and how to compute the distribution of the transformed variable. Here we are primarily concerned with situations where a function of several random variables, sometimes called a ''change of variables'', has a (joint) cumulative distribution function or (joint) density function that can be computed simply in terms of the corresponding cumulative distribution function or joint density function of the random variables it depends on.
 
==The Setup ==
 
We let <math>X_1,\ldots,X_n</math> denote a sequence of random variables and let
 
<math display="block">
 
\mathbf{Y} = [Y_1,\ldots,Y_m] = T(X_1,\ldots,X_n) = T(\mathbf{X})
 
</math>
 
denote the change of variables.
 
== Monotone One-to-One Transformations ==
 
A transformation <math>T</math> is ''one-to-one'' when <math>T(\mathbf{x}) = T( \mathbf{y}) </math> implies that <math> \mathbf{x} = \mathbf{y}</math>: <math>T </math> maps two distinct points to two distinct points. If <math>T</math> is one-to-one then <math>T</math> has a left-inverse <math>T^{-1}</math>:
 
<math display="block">
 
T^{-1}(T(\mathbf{x})) = \mathbf{x}.
 
</math>
 
===Strictly Increasing ===
 
If <math>T</math> is strictly increasing, in the sense that

<math display="block">

  \mathbf{x} \leq \mathbf{y} \Leftrightarrow T(\mathbf{x}) \leq T(\mathbf{y})\,, \quad \text{where } \mathbf{x} \leq \mathbf{y} \text{ means } x_i \leq y_i \,\text{ for all }\, i,

</math>

then the events <math>\{T(\mathbf{X}) \leq \mathbf{y}\}</math> and <math>\{\mathbf{X} \leq T^{-1}(\mathbf{y})\}</math> coincide, so

<math display="block">

\begin{align}

\operatorname{P}[T(\mathbf{X}) \leq \mathbf{y} ] = \operatorname{P}[\mathbf{X} \leq T^{-1}(\mathbf{y})] = F_{\textbf{X}}[T^{-1}(\mathbf{y})]

\end{align}

</math>
 
and we obtain the simple relation
 
<math display="block">
 
\begin{equation}
 
\label{transform-rel-up}
 
\mathbf{X} \mapsto T(\mathbf{X})=\textbf{Y}  \implies F_{\textbf{Y}} = F_{\textbf{X}} \circ T^{-1}.
 
\end{equation}
 
</math>
 
===Example===
 
Suppose that <math>\mathbf{X} = [X_1,X_2] </math> and let
 
<math display="block">
 
\mathbf{Y} = [Y_1,Y_2]=T(X_1,X_2) = [e^{2X_1 + X_2},e^{X_1 + 2X_2}] = \exp[A(\mathbf{X})]
 
</math>
 
with <math>\exp</math> applied componentwise and <math>A</math> the linear transformation

<math display = "block">
A(x_1,x_2) = [2x_1 + x_2, x_1 + 2x_2].
</math>
 
The inverse of the transformation, <math>T^{-1}</math>,  equals 
 
<math display="block">
 
A^{-1} \circ \log \, , \quad A^{-1}(\mathbf{y}) = [(2y_1 - y_2)/3,(2y_2 - y_1)/3].
 
</math>
 
By \ref{transform-rel-up},  the cumulative distribution function of <math>\textbf{Y}</math>  equals <math>F_{\textbf{Y}} = F_{\textbf{X}} \circ A^{-1} \circ \log.</math>
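Relation \ref{transform-rel-up} is easy to check by simulation. The sketch below is purely illustrative: it assumes <math>X_1</math> and <math>X_2</math> are independent standard normals and uses the componentwise-increasing map <math>T(x_1,x_2) = [e^{x_1},e^{x_2}]</math>, neither of which comes from the example above.

```python
import math

import numpy as np

# Illustrative assumptions (not from the text): X1, X2 are i.i.d. standard
# normal and T(x1, x2) = [exp(x1), exp(x2)] is componentwise increasing,
# so T^{-1} is the componentwise log.
rng = np.random.default_rng(0)
n = 200_000
X = rng.standard_normal((n, 2))
Y = np.exp(X)                              # Y = T(X)

y = np.array([1.5, 2.0])                   # point at which to evaluate F_Y

# Monte Carlo estimate of F_Y(y) from the transformed sample.
F_Y_mc = np.mean((Y[:, 0] <= y[0]) & (Y[:, 1] <= y[1]))

# Exact F_X(T^{-1}(y)) = Phi(log y1) * Phi(log y2) by independence.
def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

F_X_at_Tinv = Phi(math.log(y[0])) * Phi(math.log(y[1]))

print(F_Y_mc, F_X_at_Tinv)
```

The Monte Carlo estimate of <math>F_{\textbf{Y}}(\mathbf{y})</math> and the exact value of <math>F_{\textbf{X}}(T^{-1}(\mathbf{y}))</math> agree up to sampling error.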
 
===Strictly Decreasing ===
 
If the transformation <math>T</math> is strictly decreasing (here we take <math>X</math> one-dimensional) and the distribution function <math>F_{X}</math> of <math>X</math> is ''continuous'', then

<math display="block">

\operatorname{P}[T(X)\leq y ] =  \operatorname{P}[X \geq T^{-1}(y)] = 1 - F_{X}[T^{-1}(y)]

</math>

and the distribution function of <math>Y</math> equals <math>1 - F_{X} \circ T^{-1} </math>.
 
==Smooth One-to-One Transformations ==
 
We assume the following:
 
#<math>X_1,\ldots,X_n</math> have a joint density.
#The transformation <math>T</math> is [[wikipedia:one-to-one|one-to-one]].
#The transformation <math>T</math> is differentiable ''almost surely'': the set of points where <math>T</math> is not differentiable has probability 0.
 
To illustrate condition 3, suppose that <math>X_1 </math> and <math>X_2</math> are uniformly distributed on [0,1] and <math>T(X_1,X_2) = [X_2,X_1]</math>. Then <math>T</math> is defined on the square region {{nowrap|[0,1] x [0,1]}} and the set of points where it is differentiable has probability 1:
 
<math display="block">
 
\operatorname{P}(\{0 \lt X_1 \lt 1\} \cap \{0 \lt X_2 \lt 1\}) = 1.
 
</math>
 
If the conditions above are met, we can actually compute the joint density of <math>\textbf{Y}</math>:
 
<math display="block">
f_{\textbf{Y}}(\mathbf{y}) = \frac{ f_{\textbf{X}}(T^{-1}(\mathbf{y}))}{|\operatorname{det}[J_T(T^{-1}(\mathbf{y}))]|}\,, \mathbf{y} = [y_1,\ldots,y_n].
</math>
 
Here <math>J_T</math> denotes the [[wikipedia:Jacobian_matrix_and_determinant|Jacobian]] matrix of <math>T</math> and <math>\textrm{det}[J_T]</math> its [[wikipedia:determinant|determinant]]. We can express the density of <math>\textbf{Y}</math> more succinctly (recall that <math>\circ</math> denotes function composition):
 
<math display="block">
 
\begin{equation}
 
\label{transform-density}
 
f_{\textbf{Y}} = \left(f_{\textbf{X}} \cdot |\operatorname{det}[J_T]|^{-1}\right) \circ T^{-1}
 
\end{equation}
 
</math>
 
===Example===
 
Let <math>\textbf{U} = [U_1,U_2]</math> be uniformly distributed on the square {{nowrap|[0,1]x[0,1]}} and let
 
<math display="block">
 
\textbf{Y} = [\exp(U_1 + U_2),\exp(U_1 - U_2)] = \exp[A(\mathbf{U})]
 
</math>
 
with <math>A</math> the linear transformation <math>A(x_1,x_2) = [x_1 + x_2,x_1 - x_2]</math>. The inverse of the transformation, <math>T^{-1}</math>,  equals 
 
<math display="block">
 
A^{-1} \circ \log\, , \quad A^{-1}(\mathbf{y}) =1/2 [y_1 + y_2,y_1-y_2].
 
</math>
 
We compute the determinant of the Jacobian of the transformation
 
<math display="block">
 
\begin{align}
 
J_T &= \left(\begin{matrix}
 
\frac{\partial}{\partial x_1} e^{x_1 + x_2} & \frac{\partial}{\partial x_2} e^{x_1 + x_2} \\
 
\frac{\partial}{\partial x_1}e^{x_1 - x_2} & \frac{\partial}{\partial x_2} e^{x_1 - x_2}\\
 
\end{matrix}\right) \\
 
& = \left(\begin{matrix}
 
e^{x_1 + x_2} & e^{x_1 + x_2} \\
 
e^{x_1 - x_2} & - e^{x_1 - x_2}  \\
 
\end{matrix}\right)
 
\end{align}
 
</math>
 
and its corresponding determinant equals <math> -2e^{2x_1}</math>. Recalling \ref{transform-density}, the density function of <math>\textbf{Y}</math> equals
 
<math display="block">
 
\begin{cases}\frac{1}{2}\exp[-\log(y_1) -\log(y_2)] = \frac{1}{2y_1y_2} \ & \mathbf{y} \in D \\
 
0 & \text{otherwise}
 
\end{cases}
</math>
 
with <math>D</math> equal to the region <math>T([0,1] \times [0,1])</math> (the image of the unit square under the transformation). Expressing <math>x_1</math> and <math>x_2</math> in terms of <math>y_1</math> and <math>y_2</math>, the region <math>D</math> can be described as follows:
 
<math display="block">
 
\begin{align*}
 
0 \lt \frac{\log(y_1) + \log(y_2)}{2} \lt 1 &\Leftrightarrow 1 \lt y_1y_2 \lt e^2 \\
 
0 \lt \frac{\log(y_1) - \log(y_2)}{2} \lt 1 &\Leftrightarrow  1 \lt \frac{y_1}{y_2} \lt e^2 \\
 
\end{align*}
 
</math>
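As a sanity check (a simulation sketch, not part of the derivation), we can verify numerically that the density <math>\frac{1}{2y_1y_2}</math> integrates to 1 over the region <math>D</math> described above:

```python
import numpy as np

# Monte Carlo check that f_Y(y) = 1/(2*y1*y2) integrates to 1 over
# D = {1 < y1*y2 < e^2, 1 < y1/y2 < e^2}.
rng = np.random.default_rng(0)
e = np.e
n = 2_000_000

# D is contained in the box (1, e^2) x (1/e, e), since
# y1 = exp(u1 + u2) and y2 = exp(u1 - u2) with u1, u2 in (0, 1).
y1 = rng.uniform(1.0, e**2, n)
y2 = rng.uniform(1.0 / e, e, n)
box_area = (e**2 - 1.0) * (e - 1.0 / e)

in_D = (y1 * y2 > 1) & (y1 * y2 < e**2) & (y1 / y2 > 1) & (y1 / y2 < e**2)
integral = box_area * np.mean(np.where(in_D, 1.0 / (2.0 * y1 * y2), 0.0))
print(integral)  # close to 1
```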
 
==Other Methods ==
 
We present other ways of calculating the cumulative distribution function of transformed random variables or the probability of events associated with said variables.
 
===Direct Evaluation ===
 
The most direct way to compute probabilities such as <math>\operatorname{P}[\textbf{Y} \in A] </math> is to express them in terms of events associated with the initial random variable <math>\textbf{X}</math>:
 
<math display="block">
 
\operatorname{P}[\textbf{Y} \in A ] = \operatorname{P}[\textbf{X} \in B]\,, \quad B = \{x \,:\, T(x) \in A\}.
 
</math>
 
====Example 1 ====
 
If <math>Y = X_1 + X_2= T(\mathbf{X})</math>, then <math>\operatorname{P}[Y \leq y] = \operatorname{P}[\mathbf{X} \in B ] </math> with <math>B</math> the set of points
 
<math display="block">
 
B = \{[x_1,x_2] \, : \, x_1 + x_2  \leq y \}.
 
</math>
 
====Example 2 ====
 
If <math>\textbf{Y} = [X_1 + X_2,X_1 - X_2] = T(\mathbf{X}) </math>, then <math>\operatorname{P}[Y_1 \leq y_1,Y_2 \leq y_2] = \operatorname{P}[\textbf{X} \in B ] </math> with <math>B</math> the set of points
 
<math display="block">
 
B = \{[x_1,x_2] \, : \, x_1 + x_2  \leq y_1\,, x_1 - x_2 \leq y_2\}.
 
</math>
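A minimal numerical sketch of the direct method, assuming (purely for illustration) that <math>X_1</math> and <math>X_2</math> are independent and uniform on [0,1], with <math>y_1 = 1</math> and <math>y_2 = 0</math>:

```python
import numpy as np

# Illustrative assumption (not from the text): X1, X2 i.i.d. uniform on [0,1].
rng = np.random.default_rng(0)
n = 1_000_000
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.uniform(0.0, 1.0, n)

# The event {Y1 <= 1, Y2 <= 0} expressed directly through
# B = {(x1, x2) : x1 + x2 <= 1, x1 - x2 <= 0}.
p_hat = np.mean((x1 + x2 <= 1.0) & (x1 - x2 <= 0.0))

# By symmetry the exact probability is 1/4: the triangle {x1 + x2 <= 1}
# has area 1/2, and the constraint x1 <= x2 cuts it in half.
print(p_hat)  # close to 0.25
```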
 
===Conditioning ===
 
We limit ourselves to the case of two variables: <math>\mathbf{X} = [X_1,X_2] </math>. The conditioning method for computing probabilities associated with <math>\textbf{Y}</math> is to condition on <math>X_1</math> or <math>X_2</math>:
 
<math display="block">
 
\begin{equation}
 
\label{cond-method-eqn}
\operatorname{P}(\textbf{Y} \in A) = \operatorname{E}[\operatorname{P}(\textbf{Y} \in A \mid X_i)]\,, i=1,2.
 
\end{equation}
</math>
 
<math>\mathbf{X}</math> typically has a joint density function and then \ref{cond-method-eqn} also equals
 
<math display="block">
 
\int_{-\infty}^{\infty}\operatorname{P}(\textbf{Y} \in A \mid X_i = x) \, f_{X_i}(x) \, dx \,, i=1,2
 
</math>
 
with <math>f_{X_i}</math> the marginal density of <math>X_i</math>.
 
 
====Example 1 ====
 
Suppose <math>X_1</math> and <math>X_2</math> are mutually independent. Suppose further that <math>X_1</math> has density function
 
<math display="block">
 
f_{X_1}(x) = \begin{cases}
 
\frac{\alpha}{x^{\alpha + 1}} & x \geq 1 \\
 
0 & x \lt 1
 
\end{cases}
 
</math>
 
and <math>X_2 </math> is uniformly distributed on [0,1]. We consider the random variable <math>Y = X_1 X_2 </math>. Conditioning on <math>X_2</math>, we have
 
<math display="block">
 
f_Y(y\mid X_2 = \theta) = \begin{cases}

\frac{\alpha \theta^{\alpha}}{y^{\alpha + 1}} & y \geq \theta \\

0 & y \lt \theta

\end{cases}
 
</math>
 
and thus the density function of <math>Y</math> equals
 
<math display="block">
 
\alpha y^{-(\alpha + 1)} \, \int_0^{\min(y,1)} \theta^{\alpha} \, d\theta = \begin{cases} \frac{\alpha}{\alpha +1 } & 0 \lt y \lt 1 \\

\frac{\alpha}{\alpha + 1} \cdot y^{-(\alpha + 1)} & y \geq 1

\end{cases}
 
</math>
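We can check the derived density by simulation. Since it is constant and equal to <math>\alpha/(\alpha+1)</math> on <math>(0,1)</math>, we must have <math>\operatorname{P}(Y \leq 1) = \alpha/(\alpha+1)</math>; the sketch below verifies this for <math>\alpha = 2</math>:

```python
import numpy as np

# Monte Carlo check of the density of Y = X1 * X2 with alpha = 2:
# P(Y <= 1) should equal alpha/(alpha + 1) = 2/3.
rng = np.random.default_rng(0)
alpha = 2.0
n = 1_000_000

u = rng.uniform(0.0, 1.0, n)
x1 = (1.0 - u) ** (-1.0 / alpha)   # inverse-CDF sample: P(X1 > x) = x^(-alpha)
x2 = rng.uniform(0.0, 1.0, n)
y = x1 * x2

p_hat = np.mean(y <= 1.0)
print(p_hat)  # close to 2/3
```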
 
====Example 2====
 
Suppose <math>X</math> is uniform on the square {{nowrap|[0,1]x[0,1]}} and let
 
<math display="block">
 
Y = X_1/X_2.
 
</math>
 
Conditional on <math>X_2</math>, <math>Y</math> is uniformly distributed on the interval <math>[0,1/X_2] </math>. Furthermore, <math>X_2</math> is uniformly distributed on {{nowrap|[0,1]}}. By \ref{cond-method-eqn}, we have
 
<math display="block">
 
\begin{align*}

\operatorname{P}(Y \leq y) &=

\int_0^1 x\,\min(y,1/x)\, dx  \\

&= \begin{cases}

\int_0^1 xy \, dx = \frac{y}{2} & 0 \lt y \lt 1 \\

\int_0^{1/y} xy \,dx + \int_{1/y}^1 \, dx = 1 - \frac{1}{2y} & y \geq 1.

\end{cases}

\end{align*}
 
</math>
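A quick simulation check of this cumulative distribution function (evaluating it at <math>y = 1/2</math> and <math>y = 2</math>, where it equals <math>1/4</math> and <math>3/4</math> respectively):

```python
import numpy as np

# Monte Carlo check of P(Y <= y) for Y = X1/X2 with X1, X2 i.i.d. uniform
# on [0,1]: the CDF is y/2 for 0 < y < 1 and 1 - 1/(2y) for y >= 1.
rng = np.random.default_rng(0)
n = 1_000_000
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.uniform(0.0, 1.0, n)
y = x1 / x2

p_half = np.mean(y <= 0.5)   # CDF formula gives 0.5/2 = 0.25
p_two = np.mean(y <= 2.0)    # CDF formula gives 1 - 1/4 = 0.75
print(p_half, p_two)
```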
 
==Sums of Independent Random Variables ==
 
The probability distribution of the sum of two or more independent random variables is the '''convolution''' of their individual distributions. The term is motivated by the fact that the probability mass function or probability density function of a sum of random variables is the convolution of their corresponding probability mass functions or probability density functions respectively.
 
===Convolutions ===
 
==== Continuous ====
 
The convolution of <math>f</math> and <math>g</math> is written <math>f*g</math>, using an [[wikipedia:asterisk|asterisk]] or star. It is defined as the integral of the product of the two functions after one is reversed and shifted. As such, it is a particular kind of [[wikipedia:integral transform|integral transform]]:
 
<math display="block">
 
\begin{align}
 
(f * g )(t) &\stackrel{\mathrm{def}}{=}\ \int_{-\infty}^\infty f(\tau)\, g(t - \tau)\, d\tau \\
 
&= \int_{-\infty}^\infty f(t-\tau)\, g(\tau)\, d\tau \\
 
&= (g * f)(t).
 
\end{align}
 
</math>
 
While the symbol <math>t</math> is used above, it need not represent the time domain.  But in that context, the convolution formula can be described as a weighted average of the function <math>f(\tau)</math> at the moment <math>t</math> where the weighting is given by <math>g(-\tau)</math> simply shifted by amount <math>t</math>. As <math>t</math> changes, the weighting function emphasizes different parts of the input function.
 
For functions <math>f</math>, <math>g</math> [[wikipedia:Support (mathematics)|supported]] on only <math>[0, \infty)</math> (i.e., zero for negative arguments), the integration limits can be truncated, resulting in
 
<math display="block">
 
(f * g )(t) = \int_{0}^{t} f(\tau)\, g(t - \tau)\, d\tau
\ \ \ \mathrm{for} \ \ f, g : [0, \infty) \to \mathbb{R}
 
</math>
 
====Discrete ====
 
For functions <math>f</math>, <math>g</math> defined on the set <math>\mathbb{Z}</math> of integers, the '''discrete convolution''' of <math>f</math> and <math>g</math> is given by:<ref>{{harvnb|Damelin|Miller|2011|p=232}}</ref>
 
<math display="block">
\begin{align}
 
(f * g)[n]\, & \stackrel{\mathrm{def}}{=}\ \sum_{m=-\infty}^\infty f[m]\, g[n - m] \\
 
&= \sum_{m=-\infty}^\infty f[n-m]\, g[m] \\
 
&= (g * f)[n].
 
\end{align}
 
</math>
 
When <math>g</math> has finite support in the set <math>\{-M,-M+1,\dots,M-1,M\}</math> (representing, for instance, a [[wikipedia:finite impulse response|finite impulse response]]), a finite summation may be used:<ref>{{cite book |last1=Press |first1=William H. |last2=Flannery |first2=Brian P.|last3=Teukolsky |first3=Saul A. |last4=Vetterling |first4=William T. |title= Numerical Recipes in Pascal|edition= |year=1989 |publisher=Cambridge University Press |isbn=0-521-37516-9|page=450}}</ref><math display="block">(f*g)[n]=\sum_{m=-M}^M f[n-m]g[m].</math>
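For example, the probability mass function of the sum of two independent fair dice is the discrete convolution of their individual probability mass functions; <code>np.convolve</code> computes exactly this finite sum:

```python
import numpy as np

# pmf of a fair die on {1, ..., 6}
die = np.full(6, 1.0 / 6.0)

# np.convolve implements the finite discrete convolution, so this is the
# pmf of the sum of two independent dice; entry k corresponds to the
# total k + 2 (indices start at 1 + 1).
two_dice = np.convolve(die, die)

print(two_dice[5])   # P(sum = 7) = 6/36
print(two_dice.sum())
```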
 
===Continuous Distributions ===
 
If <math>X_1,\ldots,X_n</math> denotes a sequence of mutually independent continuous random variables each having a probability density function, then the sum  <math> Y = \sum_{i=1}^n X_i  </math> has probability density function <math> f_{X_1} * \cdots * f_{X_n}</math>. Here is a table giving the distribution for sums of well-known continuous distributions:
 
<table class = "table">
 
<tr>
<th>
<math>X_i</math>
</th>
 
<th>
<math>Y</math>
</th>
</tr>
 
<tr>
<td><math>\textrm{Normal}(\mu_i, \sigma^2_i)</math></td>
<td><math>\textrm{Normal}(\mu_1 + \cdots + \mu_n, \sigma_1^2 + \cdots + \sigma^2_n)</math></td>
</tr>

<tr>
<td><math>\textrm{Cauchy}(a_i,\gamma_i)</math></td>
<td><math>\textrm{Cauchy}(a_1+\cdots+a_n,\gamma_1 +\cdots+\gamma_n)</math></td>
</tr>

<tr>
<td><math>\textrm{Exponential}(\theta)</math></td>
<td><math>\textrm{Gamma}(n, \theta)</math></td>
</tr>

<tr>
<td><math>\textrm{Chi-Square}(r_i)</math></td>
<td><math>\textrm{Chi-Square}(r_1 + \cdots + r_n) </math></td>
</tr>
</table>
 
To demonstrate the convolution technique, we will give a derivation of the result described in the third row in the table above (sum of exponentials). The proof is by [[wikipedia:Mathematical_induction|induction]] on <math>n</math>. For <math>n = 1 </math>, there is nothing to show. If
 
<math display="block">
 
Y_n = X_1 + \ldots + X_n
 
</math>
 
then we need to show that <math>Y_{n+1}</math> is <math>\textrm{Gamma}(n+1, \theta)</math> given that <math>Y_n</math> is <math>\textrm{Gamma}(n,\theta)</math>. It suffices to show the result for <math>\theta = 1 </math> since the general case then follows by scaling. Using the convolution technique, the density of <math>Y_{n+1}</math> equals:
 
<math display="block">
\begin{align}
f_{Y_n}*f_{X_{n+1}}(y) & \propto \int_0^y x^{n-1} e^{-x} e^{x-y} dx \\
 
& \propto e^{-y} \cdot \int_0^y x^{n-1} dx \\
 
& \propto e^{-y} \, y^{n}.  \\
 
\end{align}
 
</math>
 
This is the density function of a <math>\textrm{Gamma}(n+1,1)</math> and thus, by induction, we have completed the derivation.
 
{{alert-info|The symbol <math>\propto</math> means ''proportional to'' and is used to signify that the expression to the left of it is proportional to the expression to the right of it.}}
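The result can be checked by simulation. For integer <math>n</math> the <math>\textrm{Gamma}(n,1)</math> distribution (the Erlang distribution) has the closed-form cumulative distribution function <math>1 - e^{-t}\sum_{k=0}^{n-1} t^k/k!</math>, which the following sketch compares against a Monte Carlo estimate:

```python
import math

import numpy as np

# Monte Carlo check that the sum of n independent Exponential(1) variables
# is Gamma(n, 1): for integer n (the Erlang case) the Gamma CDF has the
# closed form P(Y_n <= t) = 1 - exp(-t) * sum_{k<n} t^k / k!.
rng = np.random.default_rng(0)
n, t = 5, 4.0
samples = rng.exponential(1.0, size=(1_000_000, n)).sum(axis=1)

p_mc = np.mean(samples <= t)
p_exact = 1.0 - math.exp(-t) * sum(t**k / math.factorial(k) for k in range(n))
print(p_mc, p_exact)
```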
 
===Discrete Distributions ===
 
Similarly, if <math>X_1, \ldots, X_n </math> denotes a sequence of mutually independent ''integer'' valued random variables, then the sum
 
<math display="block">
 
Y = \sum_{i=1}^n X_i
 
</math>
 
has probability mass function <math>p_{Y} = p_{X_1} * \cdots * p_{X_n}</math>. Here is a table giving the distribution for sums of well-known discrete distributions:
 
<table class = "table">
 
<tr>
<th>
<math>X_i</math>
</th>
 
<th>
<math>Y</math>
</th>
</tr>
 
<tr>
<td><math>\textrm{Bernoulli}(p)</math></td>
<td><math>\textrm{Binomial}(n,p)</math></td>
 
</tr>
 
<tr>
 
<td><math>\textrm{Binomial}(n_i,p)</math></td>
<td><math>\textrm{Binomial}(n_1 + \cdots + n_n,p) </math></td>
 
</tr>
 
<tr>
<td><math>\textrm{NegativeBinomial}(n_i,p)</math></td>
<td><math>\textrm{NegativeBinomial}(n_1 + \cdots + n_n, p)</math></td>
</tr>
 
<tr>
<td><math>\textrm{Geometric}(p)</math></td>
<td><math>\textrm{NegativeBinomial}(n,p) </math> </td>
</tr>
 
<tr>
<td><math>\textrm{Poisson}(\lambda_i)</math></td>
<td><math>\textrm{Poisson}(\lambda_1 + \ldots + \lambda_n)</math></td>
 
</tr>
 
</table>
 
To demonstrate the convolution technique, we will give a derivation of the result described in the second row in the table above (sum of binomials). The proof is by induction on <math>n</math>. For <math> n = 1 </math> there is nothing to show. If <math>Y_n = \sum_{i=1}^n X_i</math>, <math>s_n = \sum_{i=1}^n n_i</math>, and we set <math>q= 1-p</math> and <math>\operatorname{P}[Y_{n+1} = y] = p(y) </math>, then
 
<math display="block">
\begin{align}
p(y) &= \sum_{m=0}^{y} \binom{s_n}{m} p^m q^{s_n-m} \binom{n_{n+1}}{y-m} p^{y-m}q^{n_{n+1}-(y-m)}  \\

\label{conv-bin-1}  &= p^{y} q^{s_{n+1} -y} \sum_{m=0}^y \binom{s_{n}}{m} \binom{n_{n+1}}{y-m}  \\

\label{conv-bin-2}  &= p^{y} q^{s_{n+1} -y} \binom{s_{n+1}}{y}.
 
\end{align}
</math>
 
To go from \ref{conv-bin-1} to \ref{conv-bin-2}, we have used the combinatorial identity ([[wikipedia:Vandermonde's identity|Vandermonde's identity]])
 
<math display="block">
 
\sum_{m=0}^y \binom{N_1}{m} \binom{N_2}{y-m} = \binom{N_1 + N_2}{y}.
 
</math>
 
\ref{conv-bin-2} indicates that <math>Y_{n+1}</math> is a binomial with parameters <math>s_{n+1}</math> and <math>p</math>. By induction, we have completed the derivation.
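The derivation can also be verified numerically: convolving the probability mass functions of <math>\textrm{Binomial}(n_1,p)</math> and <math>\textrm{Binomial}(n_2,p)</math> reproduces the probability mass function of <math>\textrm{Binomial}(n_1+n_2,p)</math> (a sketch with illustrative parameters):

```python
import math

import numpy as np

def binom_pmf(n, p):
    # pmf vector of Binomial(n, p) on {0, ..., n}
    return np.array([math.comb(n, k) * p**k * (1 - p)**(n - k)
                     for k in range(n + 1)])

p = 0.3
n1, n2 = 4, 7

# Convolving the pmfs of Binomial(n1, p) and Binomial(n2, p) should give
# the pmf of Binomial(n1 + n2, p), as derived above.
conv = np.convolve(binom_pmf(n1, p), binom_pmf(n2, p))
direct = binom_pmf(n1 + n2, p)

print(np.max(np.abs(conv - direct)))  # numerically zero
```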
 
==Notes==
{{reflist}}
 
==References==
*{{cite book |last1=Hogg |first1=Robert V. |authorlink1=wikipedia:Robert V. Hogg |last2=McKean |first2=Joseph W. |last3=Craig |first3=Allen T. |title=Introduction to mathematical statistics |edition=6th |publisher=Prentice Hall |url=http://www.pearsonhighered.com/educator/product/Introduction-to-Mathematical-Statistics/9780130085078.page |location=Upper Saddle River, New Jersey  |year=2004 | page=692 |ISBN=978-0-13-008507-8 |mr=467974}}
*{{cite web |url= https://en.wikipedia.org/w/index.php?title=Convolution&oldid=1064916310 |title= Convolution |author = Wikipedia contributors |website= Wikipedia |publisher= Wikipedia |access-date = 28 January 2022 }}
