Bayesian credibility attempts to estimate a relevant quantity by taking a weighted average of estimates, with the weights directly influenced by the individual experience.
==Statistical Models==
Let <math>X</math> denote a random variable. A statistical model for <math>X</math> is a family of probability distributions that is hypothesized to contain the true distribution of <math>X</math>. We are primarily interested in families of distributions that depend on parameters, i.e., families that admit a parametrization. The only situation relevant for the exam is when the statistical model is finite dimensional: it can be described by a finite number of real-valued parameters,
<math display="block">
\Theta \subset \mathbb{R}^d \,.
</math>
For instance, the family of all normal distributions can be parametrized by a two-dimensional space since every normal distribution is characterized by its mean <math>\mu</math> and its variance <math>\sigma^2</math>. In what follows, the parameter space <math>\Theta</math> represents the parametrization of the statistical model; consequently, every element of the space represents a possible probability distribution for <math>X</math>:
<math display="block">
\operatorname{P}(X \in A \,;\, \theta) = p(A \, ; \, \theta)\,, \quad \theta \in \Theta \,.
</math>
If the distributions in the statistical model admit a density, it is denoted by <math>f(x\,;\,\theta)</math>:
<math display="block">
p(A \, ; \, \theta) = \int_{A}f(x\,;\,\theta) \,dx\,.
</math>
===The Bayesian Model ===
The Bayesian approach to statistical modelling is to consider <math>X</math> as being generated through a two-step process:
#Generate an unobservable random variable <math>Y</math> taking values in <math>\Theta</math>
#Generate <math>X</math> with distribution corresponding to the parameter <math>Y</math>.
The distribution of <math>Y</math> is called the ''prior distribution'', which we denote by <math>G(\theta)</math> (with density <math>g(\theta)</math> if it exists):
<math display="block">
\operatorname{P}(Y\leq \theta) = G(\theta).
</math>
In the Bayesian context, the parametrized distributions <math>p(x ; \theta)</math> are written as <math>p(x \mid \theta)</math> to emphasize that we're dealing with conditional probability distributions (conditional on <math>Y</math> taking on the value <math>\theta</math>).
==The Posterior Distribution ==
In the context of [[#The Bayesian Model|the Bayesian model]], the posterior probability is the probability distribution of <math>Y</math> given <math>X</math>:
<math display="block">
\operatorname{P}(Y\leq \theta \mid X) = G(\theta \mid X).
</math>
It contrasts with the likelihood function, which is the probability of the evidence given the parameters.
=== Calculation ===
The posterior probability distribution of one random variable given the value of another can be calculated with [[wikipedia:Bayes' theorem|Bayes' theorem]] by multiplying the prior probability distribution by the likelihood function and then dividing by the normalizing constant, as follows:
<math display="block">
G(y \mid X = x) = {\int_{\theta \leq y} \mathcal{L}(x \mid \theta)\, dG(\theta) \over{\int_{\Theta} \mathcal{L}(x \mid \theta) \, dG(\theta)}}\,.
</math>
When both the prior and each distribution in the statistical model admit a probability density function, the posterior distribution also admits a density given by
<math display="block">
f(\theta \mid X = x) = {f(x \mid \theta) g(\theta) \over {\int_{\Theta} f(x \mid \theta) g(\theta) \,d\theta}}
</math>
where <math>g(\theta)</math> is the prior density of <math>Y</math>, <math>f(x\mid \theta)</math> is the likelihood function as a function of <math>\theta</math>, and <math>\int_{\Theta} f(x \mid \theta) g(\theta) \,d\theta</math> is the normalizing constant.
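The following is a minimal numerical sketch of this calculation, assuming an illustrative normal likelihood with known variance and a grid approximation of the parameter space (the distributions, grid, and observed value are assumptions made only for illustration):
<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

# Grid approximation of the parameter space Theta.
theta_grid = np.linspace(-5.0, 5.0, 2001)

prior = stats.norm(loc=0.0, scale=2.0).pdf(theta_grid)        # g(theta)
x_obs = 1.3                                                   # one observation
likelihood = stats.norm(loc=theta_grid, scale=1.0).pdf(x_obs) # f(x | theta) as a function of theta

# Posterior density: prior times likelihood, divided by the normalizing constant.
unnormalized = likelihood * prior
posterior = unnormalized / np.trapz(unnormalized, theta_grid)
</syntaxhighlight>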
=== Updating the Prior ===
Consider the following situation: let
<math display = "block">
X = [X_1,\ldots,X_n]
</math>
denote a sequence of ''n'' random variables sharing an (unknown) common distribution belonging to some statistical model that has been conveniently parametrized by <math>\Theta \subset \mathbb{R}^d</math> as above. Following [[#The Bayesian Model|The Bayesian Model]], we assume that the data has been generated as follows:
#Generate an unobservable random variable <math>Y</math> taking values in <math>\Theta</math> with prior distribution <math>G(\theta)</math>
#Generate <math>X_1,\ldots,X_n</math> with common distribution corresponding to parameter <math>Y</math> and conditionally mutually independent given <math>Y</math>.
{{alert-warning | The random variables <math>X_1,\ldots,X_n</math> are not mutually independent but conditionally independent. As we will see below, dependence can exist because information from one subset of the data affects the posterior distribution, which in turn affects the distribution of the other random variables (see also [[#Predictive Posterior|the predictive posterior]]).}}
The [[#The Posterior Distribution|posterior distribution]] equals
<math display="block">
G_n(y \mid X = x) = {\int_{\theta \leq y} \mathcal{L}_n(x \mid \theta)\, dG(\theta) \over{\int_{\Theta} \mathcal{L}_n(x \mid \theta) \, dG(\theta)}}\,
</math>
with
<math display="block">
\mathcal{L}_n(x\mid \theta) = \prod_{i=1}^n \mathcal{L}(x_i \mid \theta) \, , \quad x = [x_1,\ldots,x_n] \in \mathbb{R}^n
</math>
and <math>\mathcal{L}(x_i\mid \theta)</math> denoting the likelihood function associated with the random variable <math>X_i</math> (these likelihood functions are identical since the random variables are assumed to have a common distribution).
{{alert-info|You can think of the posterior distribution as an updated prior distribution: we start with a prior distribution <math>G(\theta)</math> which is updated to <math>G_n(\theta)</math> based on the information contained in the data.}}
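A short sketch of this update on a grid, assuming the same illustrative normal model as above (the prior, data values, and grid are made up for illustration only):
<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

# Same grid approximation, now updating the prior on n observations at once.
theta_grid = np.linspace(-5.0, 5.0, 2001)
prior = stats.norm(loc=0.0, scale=2.0).pdf(theta_grid)           # g(theta)
data = np.array([1.3, 0.7, 2.1, 1.0])                            # illustrative observations

# L_n(x | theta): product of the individual likelihoods (conditional independence),
# computed on the log scale for numerical stability.
log_lik = stats.norm(loc=theta_grid[:, None], scale=1.0).logpdf(data).sum(axis=1)
unnormalized = np.exp(log_lik) * prior
posterior_n = unnormalized / np.trapz(unnormalized, theta_grid)  # density of G_n
</syntaxhighlight>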
==Bayesian Prediction ==
In this section, we assume the generic setup (see [[#The Posterior Distribution|The Posterior Distribution]]):
#a random variable <math>Y</math> taking values in <math>\Theta</math> with prior distribution <math>G(\theta)</math>
#<math>X_1,\ldots,X_n</math> are random variables with common distribution corresponding to parameter <math>Y</math> and conditionally mutually independent given <math>Y</math>.
===Predictive Prior ===
What is the joint distribution of the data when no data has yet been observed? Since <math>Y</math> is unobserved, the joint distribution is the ''expected'' joint distribution, averaged over the prior, and is called the ''predictive prior'':
<math display="block">
\begin{align*}
\operatorname{P}(X_i\in A_i\,, i=1,\ldots,n) &= \operatorname{E}\left[\operatorname{E}\left[\prod_{i=1}^n 1_{X_i \in A_i} \mid Y\right] \right] \\
&= \operatorname{E}\left[\prod_{i=1}^n p(A_i \mid Y) \right] \\
&= \int_{\Theta} \prod_{i=1}^n p(A_i \mid \theta) \, dG(\theta).
\end{align*}
</math>
===Predictive Posterior ===
What is the joint distribution of new data when old data is present? Or, how can we quantify the dependence between the data? If <math>X = [X_1,\ldots,X_n]</math>, then the ''predictive posterior'' for <math>k</math> future observations is given by
<math display="block">
\begin{align*}
\operatorname{P}(X_{n+i}\in A_i \,,\, i = 1,\ldots,k \mid X) &= \int_{\Theta} \prod_{i=1}^k p(A_i \mid \theta) \, dG_n(\theta).
\end{align*}
</math>
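A brief sketch of a predictive posterior computation, again under the illustrative normal-model and grid assumptions used above (the event <math>A = (-\infty, 1]</math> is an arbitrary choice):
<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

# Predictive posterior probability P(X_{n+1} <= 1 | X): integrate p(A | theta)
# against the posterior G_n built on a grid (same illustrative normal model).
theta_grid = np.linspace(-5.0, 5.0, 2001)
prior = stats.norm(0.0, 2.0).pdf(theta_grid)
data = np.array([1.3, 0.7, 2.1, 1.0])
log_lik = stats.norm(theta_grid[:, None], 1.0).logpdf(data).sum(axis=1)
posterior_n = np.exp(log_lik) * prior
posterior_n /= np.trapz(posterior_n, theta_grid)

p_A_given_theta = stats.norm(theta_grid, 1.0).cdf(1.0)        # p(A | theta), A = (-inf, 1]
predictive_prob = np.trapz(p_A_given_theta * posterior_n, theta_grid)
</syntaxhighlight>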
=== Bayesian Credibility Estimator ===
Given the <math>n</math> data points <math>X_1,\ldots,X_n</math>, which can be thought of as your ''information'', how can we estimate an unobservable random variable <math>Y</math>? One way to proceed is to find an estimator that minimizes the mean square error:
<math display="block">
\begin{equation}
\label{mse}
\min \operatorname{E}[(Z - Y)^2] \quad Z \,\,\text{depending on }\,X_1,\ldots,X_n \, .
\end{equation}
</math>
To be mathematically precise, the minimizer must be [[wikipedia:measurable_function|measurable]] with respect to the [[wikipedia:sigma-field|sigma-field]] (information) generated by <math>X_1,\ldots,X_n</math>. It is well-known that the solution to \ref{mse}, often called the '''minimum mean square estimator''', is unique (in a well-defined sense) and equals the conditional expectation <math>\operatorname{E}[Y \mid X_1,\ldots,X_n].</math>
Now suppose that the data <math>X_1,\ldots,X_n</math> correspond to quantities of interest to insurers such as pure premiums, claim severities, claim frequencies, etc., associated with an insured belonging to a risk class indexed by the unobservable parameter <math>Y</math> (the data represents the ''experience'' of the insured). The minimum mean square estimator for the expected value <math>\mu(Y) = \operatorname{E}[X_i \mid Y] </math> is called the '''Bayesian credibility estimator''' and equals
<math display="block">
\operatorname{E}[\mu(Y) \mid X_1,\ldots,X_n] = \int_{\Theta}\mu(\theta) \, dG_n(\theta).
</math>
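A minimal sketch of the credibility estimator under the same illustrative grid setup (the model, prior, and data are assumptions; in this toy normal model <math>\mu(\theta) = \theta</math>, so the estimator reduces to the posterior mean):
<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

# Bayesian credibility estimator on a grid: E[mu(Y) | X_1,...,X_n] is the integral
# of mu(theta) against the posterior G_n.
theta_grid = np.linspace(-5.0, 5.0, 2001)
prior = stats.norm(0.0, 2.0).pdf(theta_grid)
data = np.array([1.3, 0.7, 2.1, 1.0])
log_lik = stats.norm(theta_grid[:, None], 1.0).logpdf(data).sum(axis=1)
posterior_n = np.exp(log_lik) * prior
posterior_n /= np.trapz(posterior_n, theta_grid)

mu = theta_grid  # mu(theta) = E[X_i | theta] = theta in this illustrative model
credibility_estimate = np.trapz(mu * posterior_n, theta_grid)
</syntaxhighlight>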
==Conjugate Families ==
If the [[#The Posterior Distribution|posterior distributions]] are in the same family as the prior, the prior and posterior are called '''conjugate distributions,''' and the prior is called a '''conjugate prior''' for the likelihood function. A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior. Further, conjugate priors may give intuition, by more transparently showing how a likelihood function updates a prior distribution. All members of the exponential family have conjugate priors. The exponential families include many of the most common distributions, such as the normal, exponential, gamma, beta and Poisson distributions.
=== Exponential family: scalar parameter ===
A single-parameter exponential family is a set of probability distributions whose probability density function (or probability mass function, in the case of a [[wikipedia:discrete distribution|discrete distribution]]) can be expressed in the form<math display="block"> f(x\mid\theta) = h(x) \exp \left (\eta(\theta) \cdot T(x) -A(\theta)\right )</math>
where <math>T(x), h(x),\eta(\theta)</math> and <math>A(\theta)</math> are known functions. An alternative, equivalent form often given is<math display="block"> f(x\mid\theta) = h(x) g(\theta) \exp \left ( \eta(\theta) \cdot T(x) \right )</math>
or equivalently<math display="block"> f(x\mid\theta) = \exp \left (\eta(\theta) \cdot T(x) - A(\theta) + B(x) \right )</math>
The value <math>\theta</math> is called the parameter of the family.
Note that <math>x</math> is often a vector of measurements, in which case <math>T(x)</math> may be a function from the space of possible values of <math>x</math> to the real numbers. More generally, <math>\eta(\theta)</math> and <math>T(x)</math> can each be vector-valued such that <math>\eta(\theta) \cdot T(x)</math> is real-valued.
If <math>\eta(\theta) = \theta </math>, then the exponential family is said to be in ''[[wikipedia:canonical form|canonical form]]''. By defining a transformed parameter <math>\eta = \eta(\theta)</math>, it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since <math>\eta(\theta)</math> can be multiplied by any nonzero constant, provided that <math>T(x)</math> is multiplied by that constant's reciprocal, or a constant <math>c</math> can be added to <math>\eta(\theta)</math> and <math>h(x)</math> multiplied by <math>\exp (-c \cdot T(x)) </math> to offset it.
Note also that the function <math>A(\theta)</math>, or equivalently <math>g(\theta)</math>, is automatically determined once the other functions have been chosen, and assumes a form that causes the distribution to be [[wikipedia:normalizing constant|normalized]] (sum or integrate to one over the entire domain). Furthermore, both of these functions can always be written as functions of <math>\eta</math>, even when <math>\eta(\theta)</math> is not a [[wikipedia:bijection|one-to-one]] function, i.e. two or more different values of <math>\theta</math> map to the same value of <math>\eta(\theta)</math>, and hence <math>\eta(\theta)</math> cannot be inverted. In such a case, all values of <math>\theta</math> mapping to the same <math>\eta(\theta)</math> will also have the same value for <math>A(\theta)</math> and <math>g(\theta)</math>.
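As a concrete illustration (a standard computation, included here for convenience), the Poisson distribution with mean <math>\lambda</math> can be written in the scalar form above:
<math display="block">
f(k \mid \lambda) = \frac{\lambda^k e^{-\lambda}}{k!} = \frac{1}{k!}\exp\left(k\log\lambda - \lambda\right)\,,\quad h(k) = \frac{1}{k!}\,,\; T(k) = k\,,\; \eta(\lambda) = \log\lambda\,,\; A(\lambda) = \lambda\,.
</math>
In canonical form, <math>\eta = \log \lambda</math> and <math>A(\eta) = e^{\eta}</math>; this is the form used in the [[#Poisson-Gamma Model|Poisson-Gamma Model]] below.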
=== Exponential family: vector parameter ===
The definition in terms of one ''real-number'' parameter can be extended to one ''real-vector'' parameter<math display="block">{\boldsymbol \theta} = \left(\theta_1, \theta_2, \cdots, \theta_s \right )^T.</math>
A family of distributions is said to belong to a vector exponential family if the probability density function (or probability mass function, for discrete distributions) can be written as<math display="block"> f(x\mid\boldsymbol \theta) = h(x) \exp\left(\sum_{i=1}^s \eta_i({\boldsymbol \theta}) T_i(x) - A({\boldsymbol \theta}) \right)</math>
Or in a more compact form,<math display="block"> f(x\mid\boldsymbol \theta) = h(x) \exp\Big(\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(x) - A({\boldsymbol \theta}) \Big) </math>
This form writes the sum as a [[wikipedia:dot product|dot product]] of vector-valued functions <math>\boldsymbol\eta({\boldsymbol \theta})</math> and <math>\mathbf{T}(x)</math>. An alternative, equivalent form often seen is
<math display="block"> f(x\mid\boldsymbol \theta) = h(x) g(\boldsymbol \theta) \exp\Big(\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(x)\Big)</math>
As in the scalar-valued case, the exponential family is said to be in canonical form if<math display="block">\forall i: \quad \eta_i({\boldsymbol \theta}) = \theta_i.</math>
A vector exponential family is said to be ''curved'' if the dimension of<math display="block">{\boldsymbol \theta} = \left (\theta_1, \theta_2, \ldots, \theta_d \right )^T</math>
is less than the dimension of the vector<math display="block">{\boldsymbol \eta}(\boldsymbol \theta) = \left (\eta_1(\boldsymbol \theta), \eta_2(\boldsymbol \theta), \ldots, \eta_s(\boldsymbol \theta) \right )^T.</math>
That is, if the ''dimension'' of the parameter vector is less than the ''number of functions'' of the parameter vector in the above representation of the probability density function. Note that most common distributions in the exponential family are ''not'' curved, and many algorithms designed to work with any member of the exponential family implicitly or explicitly assume that the distribution is not curved.
Note that, as in the above case of a scalar-valued parameter, the function <math>A(\boldsymbol \theta)</math> or equivalently <math>g(\boldsymbol \theta)</math> is automatically determined once the other functions have been chosen, so that the entire distribution is normalized. In addition, as above, both of these functions can always be written as functions of <math>\boldsymbol\eta</math>, regardless of the form of the transformation that generates <math>\boldsymbol\eta</math> from <math>\boldsymbol\theta</math>. Hence an exponential family in its "natural form" (parametrized by its natural parameter) looks like<math display="block"> f(x\mid\boldsymbol \eta) = h(x) \exp\Big(\boldsymbol\eta \cdot \mathbf{T}(x) - A({\boldsymbol \eta})\Big)</math>
or equivalently<math display="block"> f(x\mid\boldsymbol \eta) = h(x) g(\boldsymbol \eta) \exp\Big(\boldsymbol\eta \cdot \mathbf{T}(x)\Big)</math>
Note that the above forms may sometimes be seen with <math>\boldsymbol\eta^T \mathbf{T}(x)</math> in place of <math>\boldsymbol\eta \cdot \mathbf{T}(x)</math>. These are exactly equivalent formulations, merely using different notation for the [[wikipedia:dot product|dot product]].
=== Interpretation ===
In the definitions above, the functions <math>T(x)</math>, <math>\eta(\theta)</math> and <math>A(\eta)</math> were apparently arbitrarily defined. However, these functions play a significant role in the resulting probability distribution.
* <math>T(x)</math> is a sufficient statistic of the distribution. For exponential families, the sufficient statistic is a function of the data that fully summarizes the data <math>x</math> within the density function. This means that, for any data sets <math>x</math> and <math>y</math>, the density value is the same if <math>T(x) = T(y)</math>. This is true even if <math>x</math> and <math>y</math> are quite different, that is, <math> d(x,y)>0 </math>. The dimension of <math>T(x)</math> equals the number of parameters of <math>\theta</math> and encompasses all of the information regarding the data related to the parameter <math>\theta</math>. The sufficient statistic of a set of [[wikipedia:independent identically distributed|independent identically distributed]] data observations is simply the sum of individual sufficient statistics, and encapsulates all the information needed to describe the posterior distribution of the parameters, given the data (and hence to derive any desired estimate of the parameters).
* <math>\eta</math> is called the ''natural parameter''. The set of values of <math>\eta</math> for which the function <math>f(x;\eta)</math> is finite is called the ''natural parameter space''.
* <math>A(\eta)</math> is called the ''log-partition function'' because it is the logarithm of a normalization factor, without which <math>f(x;\theta)</math> would not be a probability distribution ("partition function" is often used in statistics as a synonym of "normalization factor"):
<math display="block"> A(\eta) = \ln\left ( \int_x h(x) \exp (\eta \cdot T(x)) \operatorname{d}x \right )</math>
===Conjugate Priors for Exponential Family ===
A conjugate prior <math>\pi</math> for the parameter <math>\boldsymbol\eta</math> of an exponential family<math display="block"> f(x|\boldsymbol\eta) = h(x) \exp \left ( {\boldsymbol\eta}^{\rm T}\mathbf{T}(x) -A(\boldsymbol\eta)\right )</math>
is given by<math display="block">p_\pi(\boldsymbol\eta\mid\boldsymbol\chi,\nu) = b(\boldsymbol\chi,\nu) \exp \left (\boldsymbol\eta \cdot \boldsymbol\chi - \nu A(\boldsymbol\eta) \right ),</math>
or equivalently<math display="block">p_\pi(\boldsymbol\eta\mid\boldsymbol\chi,\nu) = b(\boldsymbol\chi,\nu) g(\boldsymbol\eta)^\nu \exp \left (\boldsymbol\eta^{\rm T} \boldsymbol\chi \right ), \qquad \boldsymbol\chi \in \mathbb{R}^s</math>
where
*<math>s</math> is the dimension of <math>\boldsymbol\eta</math>, and <math>\nu > 0 </math> and <math>\boldsymbol\chi</math> are hyperparameters (parameters controlling parameters)
*<math>\nu</math> corresponds to the effective number of observations that the prior distribution contributes
*<math>\boldsymbol\chi</math> corresponds to the total amount that these pseudo-observations contribute to the [[wikipedia:sufficient statistic|sufficient statistic]] over all observations and pseudo-observations
*<math>b(\boldsymbol\chi,\nu)</math> is a [[wikipedia:normalization constant|normalization constant]] that is automatically determined by the remaining functions and serves to ensure that the given function is a probability density function (i.e. it is normalized)
*<math>A(\boldsymbol\eta)</math> and equivalently <math>g(\boldsymbol\eta)</math> are the same functions as in the definition of the distribution over which <math>\pi</math> is the conjugate prior.
The posterior distribution after observing <math>n</math> observations has the same form as the prior with updated parameters <math display="block">\begin{align*}
\boldsymbol\chi' &= \boldsymbol\chi + \mathbf{T}(\mathbf{X}) \\
&= \boldsymbol\chi + \sum_{i=1}^n \mathbf{T}(x_i) \\
\nu' &= \nu + n.
\end{align*} </math>
<div>
<proofs page="guide_proofs:3f98833561" section="updeqns" label="Update Equations" />
</div>
Note in particular that the data <math>\mathbf{X}</math> enters into this equation only in the expression<math display="block">\mathbf{T}(\mathbf{X}) = \sum_{i=1}^n \mathbf{T}(x_i),</math>
which is termed the [[wikipedia:sufficient statistic|sufficient statistic]] of the data. That is, the value of the sufficient statistic is sufficient to completely determine the posterior distribution. The actual data points themselves are not needed, and all sets of data points with the same sufficient statistic will have the same distribution. This is important because the dimension of the sufficient statistic does not grow with the data size: it has only as many components as the components of <math>\boldsymbol\eta</math> (equivalently, the number of parameters of the distribution of a single data point).
Note also that, because of the way the sufficient statistic is computed, it necessarily involves sums of components of the data (in some cases disguised as products or other forms; a product can be written in terms of a sum of logarithms). The cases where the update equations for particular distributions don't exactly match the above forms are cases where the conjugate prior has been expressed using a different parameterization than the one that produces a conjugate prior of the above form, often specifically because the above form is defined over the natural parameter <math>\boldsymbol\eta</math> while conjugate priors are usually defined over the actual parameter <math>\boldsymbol\theta .</math>
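A minimal sketch of these update equations in code, assuming the family is in canonical form and the per-observation sufficient statistic is supplied as a function (the function and parameter names are illustrative, not a standard API):
<syntaxhighlight lang="python">
def update_conjugate(chi, nu, data, suff_stat):
    """Generic conjugate update: chi' = chi + sum_i T(x_i), nu' = nu + n."""
    chi_new = chi + sum(suff_stat(x) for x in data)
    nu_new = nu + len(data)
    return chi_new, nu_new

# Example: a Bernoulli likelihood in canonical form has T(x) = x, so the update adds
# the number of successes to chi and the number of observations to nu.
chi_post, nu_post = update_conjugate(chi=2.0, nu=5.0, data=[1, 0, 1, 1], suff_stat=lambda x: x)
</syntaxhighlight>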
===Beta-Binomial Model ===
The family of binomial distributions is a subfamily of the exponential family. For the conjugacy computation it is convenient to work with the underlying Bernoulli trials: a binomial random variable with size <math>m</math> and success probability <math>q</math> is the sum of <math>m</math> conditionally independent Bernoulli(<math>q</math>) variables, each with probability mass function
<math display="block">
\begin{align*}
p(k \mid \eta) &= q^k \cdot (1-q)^{1-k} \\
&= e^{k\eta - A(\eta)}\,,\quad k \in \{0,1\}\,,\, \eta = \log(q/(1-q))\,,\, A(\eta) = \log(1 + e^{\eta}) = -\log(1-q) \,.
\end{align*}
</math>
From the discussion above, the conjugate family is
<math display="block">
p_\pi(\eta \mid \chi,\nu) = b(\chi,\nu) \exp \left (\eta \cdot \chi - \nu A(\eta) \right ).
</math>
To identify the family, transform back to <math>q</math> via the change of variable <math>\eta = \log(q/(1-q)) = \eta(q) </math> (with Jacobian <math>d\eta/dq = [q(1-q)]^{-1}</math>):
<math display="block">
\begin{equation}
\label{cjfam-bb}
p_\pi(\eta(q) \mid \chi, \nu) \frac{d\eta}{dq} \propto q^{\chi}(1-q)^{\nu - \chi} [q(1-q)]^{-1} = q^{\chi -1 }(1-q)^{\nu - \chi - 1}.
\end{equation}
</math>
We recognize \ref{cjfam-bb} as the family of beta distributions with parameters <math>\alpha = \chi</math> and <math>\beta = \nu - \chi</math>.
Let <math>N_1,\ldots,N_n</math> denote the number of claims related to a portfolio of insurance contracts at the time periods <math>i=1,\ldots,n</math>. We assume the following:
*The claim frequency <math>N_i</math> is assumed to have a binomial distribution with size parameter <math>m = w_i</math> (known) and success parameter <math>q</math> (unknown).
*The claims are conditionally independent (conditional on <math>q</math>).
To compute the posterior for <math>q</math> given the claim frequency data, we use the update equations (see the discussion above): each <math>N_i</math> corresponds to <math>w_i</math> Bernoulli trials with <math>N_i</math> successes, so the data contribute <math>\sum_{i=1}^n w_i</math> observations with total sufficient statistic <math>\sum_{i=1}^n N_i</math>. We conclude that the posterior is a beta distribution with parameters
<math display="block">
\begin{align*}
\alpha^{\prime} = \chi^{\prime} &= \chi + \sum_{i=1}^n N_i =\alpha + \sum_{i=1}^n N_i \\
\beta^{\prime} = \nu^{\prime} - \chi ^{\prime} &= \nu + \sum_{i=1}^n w_i - \chi^{\prime} = \beta + w - \sum_{i=1}^n N_i
\end{align*}
</math>
with <math>w = \sum_{i=1}^n w_i </math>. If we interpret the <math>w_i</math> as exposure units, the Bayesian credibility estimator for the expected claim frequency associated with <math>m</math> exposure units equals
<math display = "block">
\operatorname{E}[mq \mid N_1,\ldots,N_n] = m \frac{\alpha^{\prime}}{\alpha^{\prime} + \beta^{\prime}} = \frac{\alpha^{\prime} m}{\alpha + \beta + w}.
</math>
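A short numerical sketch of the beta-binomial update and the resulting credibility estimate (the prior parameters, exposures, and claim counts are made-up illustrative values):
<syntaxhighlight lang="python">
import numpy as np

# Beta prior on q and binomially distributed claim counts with known exposures w_i.
alpha, beta = 2.0, 8.0                 # prior Beta(alpha, beta) on q
w = np.array([10, 12, 9, 15])          # exposures w_i (made-up values)
N = np.array([1, 2, 0, 3])             # observed claim counts N_i

# Update equations: alpha' = alpha + sum(N_i), beta' = beta + sum(w_i) - sum(N_i).
alpha_post = alpha + N.sum()
beta_post = beta + w.sum() - N.sum()

# Bayesian credibility estimate of the expected claim frequency for m exposure units.
m = 10
estimate = m * alpha_post / (alpha_post + beta_post)
</syntaxhighlight>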
===Poisson-Gamma Model ===
We proceed as in the previous example, but with a different assumption: the claim frequencies <math>N_i</math> have a Poisson distribution with mean <math>\lambda w_i </math> (with only <math>\lambda</math> unknown). If we set
<math display="block">
\boldsymbol{N} = [N_1,\ldots,N_n]\,,\, \boldsymbol{k} = [k_1,\ldots,k_n] \, ,
</math>
then
<math display="block">
\begin{align*}
\operatorname{P}(\boldsymbol{N} = \boldsymbol{k} \mid \eta) = p(\boldsymbol{k} \mid \eta) &=\prod_{i=1}^n \frac{(\lambda w_i)^{k_i} e^{-\lambda w_i}}{k_i!} \\
&\propto
\exp\left(\eta \cdot T(\boldsymbol{k}) - A(\eta) \right) \quad (\text{change of variable}\,\, \eta = \log(\lambda))
\end{align*}
</math>
with
<math display="block">
A(\eta) = we^{\eta},\,w=\sum_{i=1}^nw_i,\, T(\boldsymbol{k}) = \sum_{i=1}^n k_i \,.
</math>
The conjugate prior family is then given by
<math display="block">
\begin{align*}
p_\pi(\eta \mid \chi,\nu) &= b(\chi,\nu) \exp \left(\eta \chi - \nu A(\eta) \right) \\
& \propto \exp \left(\eta \chi - \nu w e^{\eta} \right) \\
& \propto \widetilde{\eta}^{\chi -1} e^{-\nu w \widetilde{\eta}} \quad (\text{change of variable}\,\, \widetilde{\eta} = e^{\eta}\,\,\text{with Jacobian}\,\, d\eta/d\widetilde{\eta} = \widetilde{\eta}^{-1}).
\end{align*}
</math>
Hence the conjugate prior family is the family of gamma distributions with shape parameter <math>\alpha = \chi</math> and scale parameter <math>\beta = (\nu w)^{-1}</math>, with corresponding gamma posterior parameters (use the update equations derived in [[#Conjugate Priors for Exponential Family|Conjugate Priors for Exponential Family]] with the single data point <math>\boldsymbol{N}</math>):
<math display="block">
\begin{align*}
\alpha^{\prime} &= \chi^{\prime} = \chi + T(\boldsymbol{N}) = \alpha + \sum_{i=1}^n N_i \\
\beta^{\prime} &= (\nu^{\prime} w)^{-1} = [(\nu + 1)w]^{-1} = (w + \beta^{-1})^{-1}.
\end{align*}
</math>
If we interpret the <math>w_i</math> as exposure units, the Bayesian credibility estimator for the claim frequency per unit of exposure equals
<math display = "block">
\operatorname{E}[\lambda \mid N_1,\ldots,N_n] = \alpha^{\prime}\beta^{\prime}.
</math>
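A short numerical sketch of the gamma update and the resulting credibility estimate (again with made-up prior parameters, exposures, and claim counts):
<syntaxhighlight lang="python">
import numpy as np

# Gamma prior on lambda (shape alpha, scale beta) and Poisson claim counts with exposures w_i.
alpha, beta = 3.0, 0.05                # prior Gamma(shape = 3, scale = 0.05) on lambda
w = np.array([10, 12, 9, 15])          # exposures w_i (made-up values)
N = np.array([1, 2, 0, 3])             # observed claim counts N_i

# Posterior parameters: alpha' = alpha + sum(N_i), beta' = 1 / (sum(w_i) + 1 / beta).
alpha_post = alpha + N.sum()
beta_post = 1.0 / (w.sum() + 1.0 / beta)

# Bayesian credibility estimate of the claim frequency per exposure unit: E[lambda | data].
estimate = alpha_post * beta_post
</syntaxhighlight>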
==Wikipedia References==
*{{cite web |url = https://en.wikipedia.org/w/index.php?title=Bayesian_inference&oldid=900333115 | title= Bayesian inference | author = Wikipedia contributors | website= Wikipedia |publisher= Wikipedia |access-date = 21 June 2019 }}
*{{cite web |url = https://en.wikipedia.org/w/index.php?title=Exponential_family&oldid=897643759 | title= Exponential family | author = Wikipedia contributors | website= Wikipedia |publisher= Wikipedia |access-date = 21 June 2019 }}
*{{cite web |url = https://en.wikipedia.org/w/index.php?title=Likelihood_function&oldid=902423819 | title= Likelihood function | author = Wikipedia contributors | website= Wikipedia |publisher= Wikipedia |access-date = 21 June 2019 }} |