Generating Functions for Discrete Distributions

[math] \newcommand{\NA}{{\rm NA}} \newcommand{\mat}[1]{{\bf#1}} \newcommand{\exref}[1]{\ref{##1}} \newcommand{\secstoprocess}{\all} \newcommand{\mathds}{\mathbb}[/math]

So far we have considered in detail only the two most important attributes of a random variable, namely, the mean and the variance. We have seen how these attributes enter into the fundamental limit theorems of probability, as well as into all sorts of practical calculations. We have seen that the mean and variance of a random variable contain important information about the random variable, or, more precisely, about the distribution function of that variable. Now we shall see that the mean and variance do not contain all the available information about the density function of a random variable. To begin with, it is easy to give examples of different distribution functions which have the same mean and the same variance. For instance, suppose [math]X[/math] and [math]Y[/math] are random variables, with distributions

[[math]] p_X = \pmatrix{ 1 & 2 & 3 & 4 & 5 & 6\cr 0 & 1/4 & 1/2 & 0 & 0 & 1/4\cr}, [[/math]]


[[math]] p_Y = \pmatrix{ 1 & 2 & 3 & 4 & 5 & 6\cr 1/4 & 0 & 0 & 1/2 & 1/4 & 0\cr}. [[/math]]

Then with these choices, we have [math]E(X) = E(Y) = 7/2[/math] and [math]V(X) = V(Y) = 9/4[/math], and yet certainly [math]p_X[/math] and [math]p_Y[/math] are quite different density functions. This raises a question: If [math]X[/math] is a random variable with range [math]\{x_1, x_2, \ldots\}[/math] of at most countable size, and distribution function [math]p = p_X[/math], and if we know its mean [math]\mu = E(X)[/math] and its variance [math]\sigma^2 = V(X)[/math], then what else do we need to know to determine [math]p[/math] completely?
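A quick numerical check confirms these values; the following Python sketch computes the mean and variance of each distribution directly from the definitions:

```python
# The two distributions from the text, as dictionaries mapping value -> probability.
pX = {1: 0, 2: 1/4, 3: 1/2, 4: 0, 5: 0, 6: 1/4}
pY = {1: 1/4, 2: 0, 3: 0, 4: 1/2, 5: 1/4, 6: 0}

def mean(p):
    # E(X) = sum of x * p(x)
    return sum(x * px for x, px in p.items())

def variance(p):
    # V(X) = E((X - mu)^2)
    m = mean(p)
    return sum((x - m) ** 2 * px for x, px in p.items())
```

Both distributions give a mean of [math]7/2 = 3.5[/math] and a variance of [math]9/4 = 2.25[/math], even though the distributions themselves are quite different.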

Moments

A nice answer to this question, at least in the case that [math]X[/math] has finite range, can be given in terms of the moments of [math]X[/math], which are numbers defined as follows:

[[math]] \begin{eqnarray*} \mu_k &=& k \mbox{th}\,\,\mbox{moment~of}\,\, X\\ &=& E(X^k) \\ &=& \sum_{j = 1}^\infty (x_j)^k p(x_j)\ , \end{eqnarray*} [[/math]]

provided the sum converges. Here [math]p(x_j) = P(X = x_j)[/math].


In terms of these moments, the mean [math]\mu[/math] and variance [math]\sigma^2[/math] of [math]X[/math] are given simply by

[[math]] \begin{eqnarray*} \mu &=& \mu_1, \\ \sigma^2 &=& \mu_2 - \mu_1^2\ , \end{eqnarray*} [[/math]]

so that a knowledge of the first two moments of [math]X[/math] gives us its mean and variance. But a knowledge of all the moments of [math]X[/math] determines its distribution function [math]p[/math] completely.

Moment Generating Functions

To see how this comes about, we introduce a new variable [math]t[/math], and define a function [math]g(t)[/math] as follows:

[[math]] \begin{eqnarray*} g(t) &=& E(e^{tX}) = \sum_{j = 1}^\infty e^{tx_j} p(x_j) \\ &=& E\left(\sum_{k = 0}^\infty \frac {X^k t^k}{k!} \right) \\ &=& \sum_{k = 0}^\infty \frac {\mu_k t^k}{k!}\ . \end{eqnarray*} [[/math]]

We call [math]g(t)[/math] the moment generating function for [math]X[/math], and think of it as a convenient bookkeeping device for describing the moments of [math]X[/math]. Indeed, if we differentiate [math]g(t)[/math] [math]n[/math] times and then set [math]t = 0[/math], we get [math]\mu_n[/math]:

[[math]] \begin{eqnarray*} \left. \frac {d^n}{dt^n} g(t) \right|_{t = 0} &=& g^{(n)}(0) \\ &=& \left. \sum_{k = n}^\infty \frac {k!\, \mu_k t^{k - n}} {(k - n)!\, k!} \right|_{t = 0} \\ &=& \mu_n\ . \end{eqnarray*} [[/math]]

It is easy to calculate the moment generating function for simple examples.

Examples

Example Suppose [math]X[/math] has range [math]\{1,2,3,\ldots,n\}[/math] and [math]p_X(j) = 1/n[/math] for [math]1 \leq j \leq n[/math] (uniform distribution). Then

[[math]] \begin{eqnarray*} g(t) &=& \sum_{j = 1}^n \frac 1n e^{tj} \\ &=& \frac 1n (e^t + e^{2t} +\cdots+ e^{nt}) \\ &=& \frac {e^t (e^{nt} - 1)} {n (e^t - 1)}\ . \end{eqnarray*} [[/math]]

If we use the expression on the right-hand side of the second line above, then it is easy to see that

[[math]] \begin{eqnarray*} \mu_1 &=& g'(0) = \frac 1n (1 + 2 + 3 + \cdots + n) = \frac {n + 1}2, \\ \mu_2 &=& g''(0) = \frac 1n (1 + 4 + 9+ \cdots + n^2) = \frac {(n + 1)(2n + 1)}6\ , \end{eqnarray*} [[/math]]

and that [math]\mu = \mu_1 = (n + 1)/2[/math] and [math]\sigma^2 = \mu_2 - \mu_1^2 = (n^2 - 1)/12[/math].
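These formulas are easy to confirm numerically. The sketch below (taking [math]n = 6[/math], an arbitrary illustrative choice) compares the defining sum with the closed form and records the first two moments predicted by the text:

```python
import math

n = 6  # an illustrative choice; the identities hold for any n >= 1

def g(t):
    # defining sum: g(t) = (1/n)(e^t + e^{2t} + ... + e^{nt})
    return sum(math.exp(t * j) for j in range(1, n + 1)) / n

def g_closed(t):
    # closed form e^t (e^{nt} - 1) / (n (e^t - 1)); valid for t != 0
    return math.exp(t) * (math.exp(n * t) - 1) / (n * (math.exp(t) - 1))

mu1 = (n + 1) / 2                 # expected value of g'(0)
mu2 = (n + 1) * (2 * n + 1) / 6   # expected value of g''(0)
```

Approximating the derivatives at 0 by central differences, [math](g(h) - g(-h))/2h[/math] and [math](g(h) - 2g(0) + g(-h))/h^2[/math] for small [math]h[/math], reproduces [math]\mu_1[/math] and [math]\mu_2[/math] to several digits.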

Example Suppose now that [math]X[/math] has range [math]\{0,1,2,3,\ldots,n\}[/math] and [math]p_X(j) = {n \choose j} p^j q^{n - j}[/math] for [math]0 \leq j \leq n[/math] (binomial distribution). Then

[[math]] \begin{eqnarray*} g(t) &=& \sum_{j = 0}^n e^{tj} {n \choose j} p^j q^{n - j} \\ &=& \sum_{j = 0}^n {n \choose j} (pe^t)^j q^{n - j} \\ &=& (pe^t + q)^n\ . \end{eqnarray*} [[/math]]

Note that

[[math]] \begin{eqnarray*} \mu_1 = g'(0) &=& \left. n(pe^t + q)^{n - 1}pe^t \right|_{t = 0} = np\ , \\ \mu_2 = g''(0) &=& n(n - 1)p^2 + np\ , \end{eqnarray*} [[/math]]

so that [math]\mu = \mu_1 = np[/math], and [math]\sigma^2 = \mu_2 - \mu_1^2 = np(1 - p)[/math], as expected.
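Again the closed form can be verified by direct computation; a brief sketch, with illustrative parameters [math]n = 10[/math] and [math]p = 0.3[/math]:

```python
import math

n, p = 10, 0.3  # illustrative parameters
q = 1 - p

def g(t):
    # defining sum: sum_j e^{tj} C(n,j) p^j q^{n-j}
    return sum(math.exp(t * j) * math.comb(n, j) * p**j * q**(n - j)
               for j in range(n + 1))

def g_closed(t):
    # the binomial theorem collapses the sum to (pe^t + q)^n
    return (p * math.exp(t) + q) ** n
```

A central-difference estimate of [math]g'(0)[/math] then returns [math]np = 3[/math], as expected.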

Example Suppose [math]X[/math] has range [math]\{1,2,3,\ldots\}[/math] and [math]p_X(j) = q^{j - 1}p[/math] for all [math]j[/math] (geometric distribution). Then

[[math]] \begin{eqnarray*} g(t) &=& \sum_{j = 1}^\infty e^{tj} q^{j - 1}p \\ &=& \frac {pe^t}{1 - qe^t}\ . \end{eqnarray*} [[/math]]

Here

[[math]] \begin{eqnarray*} \mu_1 &=& g'(0) = \left. \frac {pe^t}{(1 - qe^t)^2} \right|_{t = 0} = \frac 1p\ , \\ \mu_2 &=& g''(0) = \left. \frac {pe^t + pqe^{2t}}{(1 - qe^t)^3} \right|_{t = 0} = \frac {1 + q}{p^2}\ , \end{eqnarray*} [[/math]]

[math]\mu = \mu_1 = 1/p[/math], and [math]\sigma^2 = \mu_2 - \mu_1^2 = q/p^2[/math], as computed in Example.
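The geometric series above converges only when [math]qe^t \lt 1[/math]; within that range a long partial sum agrees with the closed form. A sketch with illustrative parameters:

```python
import math

p, q = 0.4, 0.6  # illustrative parameters

def g_partial(t, terms=2000):
    # partial sum of p e^t (q e^t)^{j-1}; the series converges only when q e^t < 1
    r = q * math.exp(t)
    return sum(p * math.exp(t) * r ** (j - 1) for j in range(1, terms + 1))

def g_closed(t):
    return p * math.exp(t) / (1 - q * math.exp(t))
```

At [math]t = 0.2[/math] we have [math]qe^t \approx 0.73 \lt 1[/math], so 2000 terms are far more than enough for agreement to machine precision; a central difference of the closed form at 0 recovers [math]\mu_1 = 1/p = 2.5[/math].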

Example Let [math]X[/math] have range [math]\{0,1,2,3,\ldots\}[/math] and let [math]p_X(j) =e^{-\lambda}\lambda^j/j![/math] for all [math]j[/math] (Poisson distribution with mean [math]\lambda[/math]). Then

[[math]] \begin{eqnarray*} g(t) &=& \sum_{j = 0}^\infty e^{tj} \frac {e^{-\lambda}\lambda^j}{j!} \\ &=& e^{-\lambda} \sum_{j = 0}^\infty \frac {(\lambda e^t)^j}{j!} \\ &=& e^{-\lambda} e^{\lambda e^t} = e^{\lambda(e^t - 1)}\ . \end{eqnarray*} [[/math]]

Then

[[math]] \begin{eqnarray*} \mu_1 &=& g'(0) = \left. e^{\lambda(e^t - 1)}\lambda e^t \right|_{t = 0} = \lambda\ ,\\ \mu_2 &=& g''(0) = \left. e^{\lambda(e^t - 1)} (\lambda^2 e^{2t} + \lambda e^t) \right|_{t = 0} = \lambda^2 + \lambda\ , \end{eqnarray*} [[/math]]

[math]\mu = \mu_1 = \lambda[/math], and [math]\sigma^2 = \mu_2 - \mu_1^2 = \lambda[/math]. The variance of the Poisson distribution is easier to obtain in this way than directly from the definition (as was done in Exercise).
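A numerical sketch (with the illustrative value [math]\lambda = 2.5[/math]) comparing a truncated sum with the closed form, and checking that the variance equals the mean:

```python
import math

lam = 2.5  # illustrative parameter

def g_partial(t, terms=100):
    # truncated sum of e^{tj} e^{-lam} lam^j / j!; the tail is negligible here
    return sum(math.exp(t * j) * math.exp(-lam) * lam**j / math.factorial(j)
               for j in range(terms))

def g_closed(t):
    return math.exp(lam * (math.exp(t) - 1))
```

Central-difference estimates of [math]g'(0)[/math] and [math]g''(0)[/math] give [math]\mu_1 \approx \lambda[/math] and [math]\mu_2 - \mu_1^2 \approx \lambda[/math], matching the calculation above.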

Moment Problem

Using the moment generating function, we can now show, at least in the case of a discrete random variable with finite range, that its distribution function is completely determined by its moments.

Theorem

Let [math]X[/math] be a discrete random variable with finite range [math]\{x_1,x_2,\ldots, x_n\}[/math], distribution function [math]p[/math], and moment generating function [math]g[/math]. Then [math]g[/math] is uniquely determined by [math]p[/math], and conversely.

Show Proof

We know that [math]p[/math] determines [math]g[/math], since

[[math]] g(t) = \sum_{j = 1}^n e^{tx_j} p(x_j)\ . [[/math]]
Conversely, assume that [math]g(t)[/math] is known. We wish to determine the values of [math]x_j[/math] and [math]p(x_j)[/math], for [math]1 \le j \le n[/math]. We assume, without loss of generality, that [math]p(x_j) \gt 0[/math] for [math]1 \le j \le n[/math], and that

[[math]] x_1 \lt x_2 \lt \ldots \lt x_n\ . [[/math]]
We note that [math]g(t)[/math] is differentiable for all [math]t[/math], since it is a finite linear combination of exponential functions. If we compute [math]g'(t)/g(t)[/math], we obtain

[[math]] {{x_1 p(x_1) e^{tx_1} + \ldots + x_n p(x_n) e^{t x_n}}\over{p(x_1) e^{tx_1} + \ldots + p(x_n)e^{tx_n}}}\ . [[/math]]
Dividing both top and bottom by [math]e^{tx_n}[/math], we obtain the expression

[[math]] {{x_1 p(x_1) e^{t(x_1-x_n)} + \ldots + x_n p(x_n)}\over{p(x_1) e^{t(x_1 - x_n)} + \ldots + p(x_n)}}\ . [[/math]]
Since [math]x_n[/math] is the largest of the [math]x_j[/math]'s, this expression approaches [math]x_n[/math] as [math]t[/math] goes to [math]\infty[/math]. So we have shown that

[[math]] x_n = \lim_{t \rightarrow \infty} {{g'(t)}\over{g(t)}}\ . [[/math]]
To find [math]p(x_n)[/math], we simply divide [math]g(t)[/math] by [math]e^{tx_n}[/math] and let [math]t[/math] go to [math]\infty[/math]. Once [math]x_n[/math] and [math]p(x_n)[/math] have been determined, we can subtract [math]p(x_n) e^{tx_n}[/math] from [math]g(t)[/math], and repeat the above procedure with the resulting function, obtaining, in turn, [math]x_{n-1}, \ldots, x_1[/math] and [math]p(x_{n-1}), \ldots, p(x_1)[/math].

If we delete the hypothesis that [math]X[/math] have finite range in the above theorem, then the conclusion is no longer necessarily true.
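The first step of this procedure is easy to carry out numerically. The sketch below uses a hypothetical three-point distribution (the support points and weights are chosen purely for illustration) and recovers the largest support point and its weight from [math]g[/math] alone:

```python
import math

# A hypothetical three-point distribution; the recovery code sees only g(t).
support = [(-1.0, 0.2), (0.5, 0.5), (2.0, 0.3)]

def g(t):
    return sum(px * math.exp(t * x) for x, px in support)

def largest_point(g, t_big=40.0, h=1e-6):
    """Recover x_n = lim_{t->inf} g'(t)/g(t) and p(x_n) = lim g(t) e^{-t x_n},
    approximating the limits by one large t and g' by a central difference."""
    gp = (g(t_big + h) - g(t_big - h)) / (2 * h)
    x = gp / g(t_big)
    px = g(t_big) * math.exp(-t_big * x)
    return x, px
```

Subtracting [math]p(x_n)e^{tx_n}[/math] from [math]g(t)[/math] and repeating recovers the remaining points in turn, as in the proof, though in floating-point arithmetic the subtraction loses precision rapidly, so the later stages are best carried out exactly.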

Ordinary Generating Functions

In the special but important case where the [math]x_j[/math] are all nonnegative integers, [math]x_j = j[/math], we can prove this theorem in a simpler way. In this case, we have

[[math]] g(t) = \sum_{j = 0}^n e^{tj} p(j)\ , [[/math]]

and we see that [math]g(t)[/math] is a polynomial in [math]e^t[/math]. If we write [math]z = e^t[/math], and define the function [math]h[/math] by

[[math]] h(z) = \sum_{j = 0}^n z^j p(j)\ , [[/math]]

then [math]h(z)[/math] is a polynomial in [math]z[/math] containing the same information as [math]g(t)[/math], and in fact

[[math]] \begin{eqnarray*} h(z) &=& g(\log z)\ , \\ g(t) &=& h(e^t)\ . \end{eqnarray*} [[/math]]

The function [math]h(z)[/math] is often called the ordinary generating function for [math]X[/math]. Note that [math]h(1) = g(0) = 1[/math], [math]h'(1) = g'(0) = \mu_1[/math], and [math]h''(1) = g''(0) - g'(0) = \mu_2 - \mu_1[/math]. It follows from all this that if we know [math]g(t)[/math], then we know [math]h(z)[/math], and if we know [math]h(z)[/math], then we can find the [math]p(j)[/math] by Taylor's formula:

[[math]] \begin{eqnarray*} p(j) &=& \mbox{coefficient~of}\,\, z^j \,\, \mbox{in}\,\, h(z) \\ &=& \frac{h^{(j)}(0)}{j!}\ . \end{eqnarray*} [[/math]]


For example, suppose we know that the moments of a certain discrete random variable [math]X[/math] are given by

[[math]] \begin{eqnarray*} \mu_0 &=& 1\ , \\ \mu_k &=& \frac12 + \frac{2^k}4\ , \qquad \mbox{for}\,\, k \geq 1\ . \end{eqnarray*} [[/math]]

Then the moment generating function [math]g[/math] of [math]X[/math] is

[[math]] \begin{eqnarray*} g(t) &=& \sum_{k = 0}^\infty \frac{\mu_k t^k}{k!} \\ &=& 1 + \frac12 \sum_{k = 1}^\infty \frac{t^k}{k!} + \frac14 \sum_{k = 1}^\infty \frac{(2t)^k}{k!} \\ &=& \frac14 + \frac12 e^t + \frac14 e^{2t}\ . \end{eqnarray*} [[/math]]

This is a polynomial in [math]z = e^t[/math], and

[[math]] h(z) = \frac14 + \frac12 z + \frac14 z^2\ . [[/math]]

Hence, [math]X[/math] must have range [math]\{0,1,2\}[/math], and [math]p[/math] must have values [math]\{1/4,1/2,1/4\}[/math].
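As a check on this example, the recovered distribution does reproduce the given moments:

```python
# The distribution read off from h(z): p(0) = 1/4, p(1) = 1/2, p(2) = 1/4.
p = {0: 0.25, 1: 0.5, 2: 0.25}

def moment(k):
    # mu_k = sum of x^k p(x); note 0**0 == 1 in Python, so moment(0) == 1
    return sum(x**k * px for x, px in p.items())
```

For every [math]k \geq 1[/math] this gives [math]\mu_k = 1/2 + 2^k/4[/math], as required.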

Properties

Both the moment generating function [math]g[/math] and the ordinary generating function [math]h[/math] have many properties useful in the study of random variables, of which we can consider only a few here. In particular, if [math]X[/math] is any discrete random variable and [math]Y = X + a[/math], then

[[math]] \begin{eqnarray*} g_Y(t) &=& E(e^{tY}) \\ &=& E(e^{t(X + a)}) \\ &=& e^{ta} E(e^{tX}) \\ &=& e^{ta} g_X(t)\ , \end{eqnarray*} [[/math]]

while if [math]Y = bX[/math], then

[[math]] \begin{eqnarray*} g_Y(t) &=& E(e^{tY}) \\ &=& E(e^{tbX}) \\ &=& g_X(bt)\ . \end{eqnarray*} [[/math]]

In particular, if

[[math]] X^* = \frac{X - \mu}\sigma\ , [[/math]]

then (see Exercise)

[[math]] g_{X^*}(t) = e^{-\mu t/\sigma} g_X\left( \frac t\sigma \right)\ . [[/math]]
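This relation is easy to verify numerically for any concrete distribution. The sketch below uses an arbitrary three-point distribution (chosen purely for illustration) and evaluates both sides of the identity:

```python
import math

# An arbitrary illustrative distribution.
p = {1: 0.25, 2: 0.25, 3: 0.5}
mu = sum(x * px for x, px in p.items())
sigma = math.sqrt(sum((x - mu) ** 2 * px for x, px in p.items()))

def g_X(t):
    return sum(px * math.exp(t * x) for x, px in p.items())

def g_Xstar(t):
    # directly from the definition X* = (X - mu)/sigma
    return sum(px * math.exp(t * (x - mu) / sigma) for x, px in p.items())
```

Comparing `g_Xstar(t)` with `math.exp(-mu * t / sigma) * g_X(t / sigma)` at any [math]t[/math] shows agreement to machine precision.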

If [math]X[/math] and [math]Y[/math] are independent random variables and [math]Z = X + Y[/math] is their sum, with [math]p_X[/math], [math]p_Y[/math], and [math]p_Z[/math] the associated distribution functions, then we have seen in Sums of Independent Random Variables that [math]p_Z[/math] is the convolution of [math]p_X[/math] and [math]p_Y[/math], and we know that convolution involves a rather complicated calculation. But for the generating functions we have instead the simple relations

[[math]] \begin{eqnarray*} g_Z(t) &=& g_X(t) g_Y(t)\ , \\ h_Z(z) &=& h_X(z) h_Y(z)\ , \end{eqnarray*} [[/math]]

that is, [math]g_Z[/math] is simply the product of [math]g_X[/math] and [math]g_Y[/math], and similarly for [math]h_Z[/math]. To see this, first note that if [math]X[/math] and [math]Y[/math] are independent, then [math]e^{tX}[/math] and [math]e^{tY}[/math] are independent (see Exercise), and hence

[[math]] E(e^{tX} e^{tY}) = E(e^{tX}) E(e^{tY})\ . [[/math]]

It follows that

[[math]] \begin{eqnarray*} g_Z(t) &=& E(e^{tZ}) = E(e^{t(X + Y)}) \\ &=& E(e^{tX}) E(e^{tY}) \\ &=& g_X(t) g_Y(t)\ , \end{eqnarray*} [[/math]]

and, replacing [math]t[/math] by [math]\log z[/math], we also get

[[math]] h_Z(z) = h_X(z) h_Y(z)\ . [[/math]]

Example If [math]X[/math] and [math]Y[/math] are independent discrete random variables with range [math]\{0,1,2,\ldots,n\}[/math] and binomial distribution

[[math]] p_X(j) = p_Y(j) = {n \choose j} p^j q^{n - j}\ , [[/math]]

and if [math]Z = X + Y[/math], then we know (cf. Sums of Independent Random Variables) that the range of [math]Z[/math] is

[[math]] \{0,1,2,\ldots,2n\} [[/math]]

and [math]Z[/math] has binomial distribution

[[math]] p_Z(j) = (p_X * p_Y)(j) = {2n \choose j} p^j q^{2n - j}\ . [[/math]]

Here we can easily verify this result by using generating functions. We know that

[[math]] \begin{eqnarray*} g_X(t) = g_Y(t) &=& \sum_{j = 0}^n e^{tj} {n \choose j} p^j q^{n - j} \\ &=& (pe^t + q)^n\ , \end{eqnarray*} [[/math]]

and

[[math]] h_X(z) = h_Y(z) = (pz + q)^n\ . [[/math]]

Hence, we have

[[math]] g_Z(t) = g_X(t) g_Y(t) = (pe^t + q)^{2n}\ , [[/math]]

or, what is the same,

[[math]] \begin{eqnarray*} h_Z(z) &=& h_X(z) h_Y(z) = (pz + q)^{2n} \\ &=& \sum_{j = 0}^{2n} {2n \choose j} (pz)^j q^{2n - j}\ , \end{eqnarray*} [[/math]]

from which we can see that the coefficient of [math]z^j[/math] is just [math]p_Z(j) = {2n \choose j} p^j q^{2n - j}[/math].
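The same verification can be done entirely by computation: convolving the binomial distribution with itself must give the coefficients of [math](pz + q)^{2n}[/math]. A sketch with illustrative parameters:

```python
import math

n, p = 5, 0.3  # illustrative parameters
q = 1 - p
pX = [math.comb(n, j) * p**j * q**(n - j) for j in range(n + 1)]

# the convolution p_Z = p_X * p_Y (here p_Y = p_X)
pZ = [0.0] * (2 * n + 1)
for i, a in enumerate(pX):
    for j, b in enumerate(pX):
        pZ[i + j] += a * b

# coefficients of (pz + q)^{2n}, read off with the binomial theorem
pZ_from_gf = [math.comb(2 * n, j) * p**j * q**(2 * n - j) for j in range(2 * n + 1)]
```

The two lists agree term by term, and each sums to 1.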

Example If [math]X[/math] and [math]Y[/math] are independent discrete random variables with the non-negative integers [math]\{0,1,2,3,\ldots\}[/math] as range, and with geometric distribution function

[[math]] p_X(j) = p_Y(j) = q^j p\ , [[/math]]

then

[[math]] g_X(t) = g_Y(t) = \frac p{1 - qe^t}\ , [[/math]]

and if [math]Z = X + Y[/math], then

[[math]] \begin{eqnarray*} g_Z(t) &=& g_X(t) g_Y(t) \\ &=& \frac{p^2}{1 - 2qe^t + q^2 e^{2t}}\ . \end{eqnarray*} [[/math]]

If we replace [math]e^t[/math] by [math]z[/math], we get

[[math]] \begin{eqnarray*} h_Z(z) &=& \frac{p^2}{(1 - qz)^2} \\ &=& p^2 \sum_{k = 0}^\infty (k + 1) q^k z^k\ , \end{eqnarray*} [[/math]]

and we can read off the values of [math]p_Z(j)[/math] as the coefficients of [math]z^j[/math] in this expansion for [math]h_Z(z)[/math], even though [math]h_Z(z)[/math] is not a polynomial in this case. The distribution [math]p_Z[/math] is a negative binomial distribution (see Important Distributions).
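Here too the convolution can be checked directly against the coefficients of the generating function; a sketch with an illustrative [math]p[/math] and an arbitrary truncation point:

```python
p = 0.4   # illustrative parameter
q = 1 - p
N = 30    # arbitrary truncation: check the first N coefficients

# geometric distribution on {0, 1, 2, ...}
pX = [q**j * p for j in range(N)]

# convolution of the two geometric distributions, truncated to indices < N
pZ = [0.0] * N
for i in range(N):
    for j in range(N - i):
        pZ[i + j] += pX[i] * pX[j]

# coefficients of h_Z(z) = p^2 / (1 - qz)^2 = p^2 * sum (k+1) q^k z^k
pZ_from_gf = [p**2 * (k + 1) * q**k for k in range(N)]
```

Each convolution value [math]p_Z(k) = \sum_{i=0}^k q^i p \cdot q^{k-i} p = (k+1)q^k p^2[/math] matches the corresponding series coefficient.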

Here is a more interesting example of the power and scope of the method of generating functions.

Heads or Tails

Example In the coin-tossing game discussed in Example, we now consider the question “When is Peter first in the lead?”


Let [math]X_k[/math] describe the outcome of the [math]k[/math]th trial in the game

[[math]] X_k = \left \{ \matrix{ +1, &\mbox{if}\,\, k{\rm th}\,\, \mbox{toss~is~heads}, \cr -1, &\mbox{if}\,\, k{\rm th}\,\, \mbox{toss~is~tails.}\cr}\right. [[/math]]

Then the [math]X_k[/math] are independent random variables describing a Bernoulli process. Let [math]S_0 = 0[/math], and, for [math]n \geq 1[/math], let

[[math]] S_n = X_1 + X_2 + \cdots + X_n\ . [[/math]]

Then [math]S_n[/math] describes Peter's fortune after [math]n[/math] trials, and Peter is first in the lead after [math]n[/math] trials if [math]S_k \leq 0[/math] for [math]1 \leq k \lt n[/math] and [math]S_n = 1[/math].


Now this can happen when [math]n = 1[/math], in which case [math]S_1 = X_1 = 1[/math], or when [math]n \gt 1[/math], in which case [math]S_1 = X_1 = -1[/math]. In the latter case, [math]S_k = 0[/math] for [math]k = n - 1[/math], and perhaps for other [math]k[/math] between 1 and [math]n[/math]. Let [math]m[/math] be the least such value of [math]k[/math]; then [math]S_m = 0[/math] and [math]S_k \lt 0[/math] for [math]1 \leq k \lt m[/math]. In this case Peter loses on the first trial, regains his initial position in the next [math]m - 1[/math] trials, and gains the lead in the next [math]n - m[/math] trials.


Let [math]p[/math] be the probability that the coin comes up heads, and let [math]q = 1-p[/math]. Let [math]r_n[/math] be the probability that Peter is first in the lead after [math]n[/math] trials. Then from the discussion above, we see that

[[math]] \begin{eqnarray*} r_n &=& 0\ , \qquad \mbox{if}\,\, n\,\, \mbox{even}, \\ r_1 &=& p \qquad (= \mbox{probability~of~heads~in~a~single~toss)}, \\ r_n &=& q(r_1r_{n-2} + r_3r_{n-4} +\cdots+ r_{n-2}r_1)\ , \qquad \mbox{if} \ n \gt 1,\ n\ \mbox{odd}. \end{eqnarray*} [[/math]]
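These relations can be checked against a direct simulation of the game. The sketch below computes the [math]r_n[/math] from the recursion and compares them with an empirical estimate (the simulation parameters are arbitrary illustrative choices):

```python
import random

def first_lead_probs(p, n_max):
    # r_n from the recursion: r_n = q(r_1 r_{n-2} + r_3 r_{n-4} + ... + r_{n-2} r_1)
    q = 1 - p
    r = {1: p}
    for n in range(3, n_max + 1, 2):
        r[n] = q * sum(r[k] * r[n - 1 - k] for k in range(1, n - 1, 2))
    return r

def simulated_probs(p, trials, cutoff, rng):
    # empirical distribution of the first time Peter's fortune reaches +1
    counts = {}
    for _ in range(trials):
        s = 0
        for n in range(1, cutoff + 1):
            s += 1 if rng.random() < p else -1
            if s == 1:
                counts[n] = counts.get(n, 0) + 1
                break
    return {n: c / trials for n, c in counts.items()}
```

For [math]p = 1/2[/math] the recursion gives [math]r_1 = 1/2[/math], [math]r_3 = 1/8[/math], [math]r_5 = 1/16[/math], and a seeded simulation agrees to within sampling error.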

Now let [math]T[/math] describe the time (that is, the number of trials) required for Peter to take the lead. Then [math]T[/math] is a random variable, and since [math]P(T = n) = r_n[/math], [math]r[/math] is the distribution function for [math]T[/math]. We introduce the generating function [math]h_T(z)[/math] for [math]T[/math]:

[[math]] h_T(z) = \sum_{n = 0}^\infty r_n z^n\ . [[/math]]

Then, by using the relations above, we can verify the relation

[[math]] h_T(z) = pz + qz(h_T(z))^2\ . [[/math]]

If we solve this quadratic equation for [math]h_T(z)[/math], we get

[[math]] h_T(z) = \frac{1 \pm \sqrt{1 - 4pqz^2}}{2qz} = \frac{2pz}{1 \mp \sqrt{1 - 4pqz^2}}\ . [[/math]]

Of these two solutions, we want the one that has a convergent power series in [math]z[/math] (i.e., that is finite for [math]z = 0[/math]). Hence we choose

[[math]] h_T(z) = \frac{1 - \sqrt{1 - 4pqz^2}}{2qz} = \frac{2pz}{1 + \sqrt{1 - 4pqz^2}}\ . [[/math]]

Now we can ask: What is the probability that Peter is ever in the lead? This probability is given by (see Exercise)

[[math]] \begin{eqnarray*} \sum_{n = 0}^\infty r_n &=& h_T(1) = \frac{1 - \sqrt{\mathstrut1 - 4pq}}{2q} \\ &=& \frac{1 - |p - q|}{2q} \\ &=& \left \{ \begin{array}{ll} p/q, & \mbox{if $p \lt q$}, \\ 1, & \mbox{if $p \geq q$}, \end{array}\right. \end{eqnarray*} [[/math]]

so that Peter is sure to be in the lead eventually if [math]p \geq q[/math].


How long will it take? That is, what is the expected value of [math]T[/math]? This value is given by

[[math]] E(T) = h_T'(1) = \left \{ \matrix { 1/(p - q), & \mbox{if}\,\, p \gt q, \cr \infty, & \mbox{if}\,\, p = q.\cr}\right. [[/math]]

This says that if [math]p \gt q[/math], then Peter can expect to be in the lead by about [math]1/(p - q)[/math] trials, but if [math]p = q[/math], he can expect to wait a long time.
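Both limits can be confirmed numerically from the recursion for [math]r_n[/math] alone; a sketch (the truncation point 2001 is an arbitrary choice, large enough for the sums to have converged):

```python
def first_lead_probs(p, n_max):
    # the recursion r_n = q(r_1 r_{n-2} + ... + r_{n-2} r_1) for the text's r_n
    q = 1 - p
    r = {1: p}
    for n in range(3, n_max + 1, 2):
        r[n] = q * sum(r[k] * r[n - 1 - k] for k in range(1, n - 1, 2))
    return r

# p < q: the probability that Peter is ever in the lead should be p/q = 2/3
prob_ever = sum(first_lead_probs(0.4, 2001).values())

# p > q: the expected waiting time E(T) should be 1/(p - q) = 5
expected_T = sum(n * rn for n, rn in first_lead_probs(0.6, 2001).items())
```

The truncated sums match [math]p/q[/math] and [math]1/(p - q)[/math] to many digits, since the tail terms decay geometrically.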

A related problem, known as the Gambler's Ruin problem, is studied in Exercise and in Gambler's Ruin.

General references

Doyle, Peter G. (2006). "Grinstead and Snell's Introduction to Probability" (PDF). Retrieved June 6, 2024.