Of the countless possible mechanisms and processes that could have produced the data, how can one even begin to choose the best model? Two of the most commonly used criteria are (i) the '''Akaike information criterion''' and (ii) the '''Bayesian information criterion'''.
==Akaike Information Criterion (AIC) ==
The '''Akaike information criterion''' ('''AIC''') is an estimator of [[wikipedia:in-sample|in-sample]] prediction error and thereby of the relative quality of statistical models for a given set of data.<ref>{{cite book |first=Trevor |last=Hastie |authorlink=wikipedia:Trevor Hastie |title=The Elements of Statistical Learning |location= |publisher=Springer |year=2009 |isbn=978-0-387-84857-0 |page=203 |quote=The Akaike information criterion is a[n] [...] estimate of Err_{in} when the log-likelihood loss function is used. |url=https://books.google.com/books?id=yPfZBwAAQBAJ&pg=PA203 }}</ref> In-sample prediction error is the expected error made in predicting a resampled response at the training sample points. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models. Thus, AIC provides a means for model selection.
AIC is founded on [[wikipedia:information theory|information theory]]. In estimating the amount of information lost by a model, AIC deals with the trade-off between the goodness of fit of the model and the simplicity of the model. In other words, AIC deals with both the risk of overfitting and the risk of underfitting.
The Akaike information criterion is named after the Japanese statistician [[wikipedia:Hirotugu Akaike|Hirotugu Akaike]].
=== Definition ===
Let <math>d</math> equal the number of estimated parameters in the model and let <math>\widehat L</math> be the maximum value of the likelihood function for the model. Then the AIC value of the model is the following:<ref>{{Harvnb|Burnham|Anderson|2002|loc=§2.2}}</ref>
<math display="block">\mathrm{AIC} \, = \, 2d - 2\ln(\widehat L)</math>
Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. Thus, AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters. The penalty discourages [[wikipedia:overfitting|overfitting]], which is desired because increasing the number of parameters in the model almost always improves the goodness of the fit.
Note that AIC tells nothing about the absolute quality of a model, only the quality relative to other models. Thus, if all the candidate models fit poorly, AIC will not give any warning of that. Hence, after selecting a model via AIC, it is usually good practice to validate the absolute quality of the model.
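As a concrete illustration, the following minimal sketch (hypothetical data and candidate models, chosen for this example rather than taken from the references) computes AIC for polynomial regression fits with Gaussian errors and selects the degree with the smallest value.
<syntaxhighlight lang="python">
# A minimal sketch (hypothetical data and models): AIC-based selection among
# polynomial regression models with Gaussian errors.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a quadratic trend plus noise.
n = 100
x = np.linspace(-2, 2, n)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(scale=0.8, size=n)

def aic_polynomial(x, y, degree):
    """AIC = 2d - 2 ln(L_hat) for a polynomial least-squares fit with Gaussian errors."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2_hat = np.mean(residuals**2)                 # MLE of the error variance
    # Maximized Gaussian log-likelihood, expressed through the residual variance.
    log_lik_hat = -0.5 * len(y) * (np.log(2 * np.pi * sigma2_hat) + 1)
    d = (degree + 1) + 1                               # coefficients plus the error variance
    return 2 * d - 2 * log_lik_hat

# Candidate models: degrees 0 through 5; the preferred model minimizes AIC.
aic_values = {deg: aic_polynomial(x, y, deg) for deg in range(6)}
print(aic_values, "-> selected degree:", min(aic_values, key=aic_values.get))
</syntaxhighlight>
Here the fitted error variance is counted among the <math>d</math> estimated parameters, which is the usual convention for Gaussian regression models.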
=== Kullback–Leibler divergence ===
Consider a family of parametrized probability distributions <math>\operatorname{Q}(\theta)</math> that constitutes a hypothesized model for the true probability distribution <math>\operatorname{P}</math>. For simplicity, we will assume that all distributions admit a density function: <math>d\operatorname{Q}(\theta) = f_{\theta}(x) \, dx </math> and <math>d\operatorname{P} = g(x) \, dx </math>. A measure of how <math>\operatorname{Q}(\theta)</math> differs from <math>\operatorname{P}</math> is given by the '''Kullback–Leibler divergence''' (also called relative entropy):
<math display="block">D_\text{KL}(\operatorname{P} \parallel \operatorname{Q}(\theta)) = \int \log\left(\frac{f_{\theta}(x)}{g(x)}\right)\, g(x) \, dx. </math> | |||
Let <math>\mathcal{L}(\theta \, | x) </math> denote the likelihood function for the distribution <math>\operatorname{Q}(\theta)</math> and let <math>\hat{\theta}_n</math> denote the maximum likelihood estimator (MLE) based on a sample of size <math>n</math>. Assuming that the model contains the true probability distribution <math>\operatorname{P}</math>, the following approximation holds as <math>n </math> tends to infinity (here <math>x</math> denotes an independent replicate sample, also of size <math>n</math>, drawn from <math>\operatorname{P}</math>, and <math>g</math> its density):
<math display = "block"> | |||
\operatorname{E}[\int \log \mathcal{L}(\hat{\theta}_n \, ; x )\, g(x) \, dx] \approx \operatorname{E}[\frac{\operatorname{-AIC}}{2}]. | |||
</math> | |||
The approximation above shows that, for large <math>n</math>, the AIC selects the model that minimizes the expected Kullback–Leibler divergence between the true probability distribution (assuming that the true distribution belongs to that particular model) and the parametric distribution corresponding to the maximum likelihood estimator for that particular model. | |||
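A small simulation can make this relation concrete. The sketch below (a hypothetical check under an assumed normal model with unknown mean and variance, so <math>d = 2</math>; all constants are illustrative) compares the average of <math>-\operatorname{AIC}/2</math> across simulated training samples with the average log-likelihood that the fitted parameters assign to an independent replicate sample of the same size.
<syntaxhighlight lang="python">
# Monte Carlo sketch: the average of -AIC/2 over training samples is compared with
# the average log-likelihood of an independent replicate sample at the fitted MLEs.
import numpy as np

rng = np.random.default_rng(1)
n, n_reps = 50, 20000
mu_true, sigma_true = 0.0, 1.0

def gauss_log_lik(sample, mu, sigma2):
    """Log-likelihood of `sample` under N(mu, sigma2)."""
    return -0.5 * len(sample) * np.log(2 * np.pi * sigma2) \
           - np.sum((sample - mu) ** 2) / (2 * sigma2)

neg_half_aic, replicate_log_lik = [], []
for _ in range(n_reps):
    y = rng.normal(mu_true, sigma_true, size=n)            # training sample
    mu_hat, sigma2_hat = y.mean(), y.var()                  # maximum likelihood estimates
    neg_half_aic.append(gauss_log_lik(y, mu_hat, sigma2_hat) - 2)   # ln(L_hat) - d
    x_rep = rng.normal(mu_true, sigma_true, size=n)         # independent replicate sample
    replicate_log_lik.append(gauss_log_lik(x_rep, mu_hat, sigma2_hat))

print(np.mean(neg_half_aic), np.mean(replicate_log_lik))    # should be close
</syntaxhighlight>
The two averages should be close for moderate sample sizes, whereas the uncorrected maximized log-likelihood <math>\ln(\widehat L)</math> overstates the replicate-sample log-likelihood by roughly <math>d</math>.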
== Bayesian Information Criterion (BIC) ==
The '''Bayesian information criterion''' ('''BIC''') or '''Schwarz information criterion''' (also '''SIC''', '''SBC''', '''SBIC''') is a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred. It is based, in part, on the likelihood function and it is closely related to the [[#Akaike Information Criterion (AIC)|Akaike information criterion]] (AIC).
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in [[wikipedia:overfitting|overfitting]]. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.
The BIC was developed by Gideon E. Schwarz and published in a 1978 paper,<ref>{{citation | last=Schwarz |first=Gideon E. |title=Estimating the dimension of a model |journal=[[wikipedia:Annals of Statistics|Annals of Statistics]] |year=1978 |volume=6 |issue=2 |pages=461–464 |doi=10.1214/aos/1176344136 |mr=468014 |doi-access=free }}.</ref> where he gave a Bayesian argument for adopting it.
=== Definition ===
The BIC is formally defined as<ref>{{Cite journal
| doi = 10.1111/j.1467-9574.2012.00530.x
| volume = 66 | issue = 3 | pages = 217–236
| last = Wit | first = Ernst |author2=Edwin van den Heuvel |author3=Jan-Willem Romeyn
| title = 'All models are wrong...': an introduction to model uncertainty
| journal = Statistica Neerlandica
| year = 2012
| url = https://pure.rug.nl/ws/files/13270992/2012StatistNeerlWit.pdf }}</ref>{{efn|The AIC, AICc and BIC defined by Claeskens and Hjort<ref>{{Citation |last1=Claeskens |first1=Gerda |first2=Nils Lid |last2=Hjort |year=2008 |title=Model Selection and Model Averaging |publisher=[[wikipedia:Cambridge University Press|Cambridge University Press]] }}</ref> are the negatives of those defined in this article and in most other standard references.}}
<math display = "block"> \mathrm{BIC} = d\ln(n) - 2\ln(\widehat L) </math>
where <math>\widehat L</math> equals the maximized value of the likelihood function of the model; <math>n</math> is the sample size; and <math>d</math> is the number of parameters estimated by the model.
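Continuing the hypothetical polynomial example from the AIC sketch above, the following computes both criteria for the same Gaussian least-squares fits; note that the BIC penalty <math>d\ln(n)</math> exceeds the AIC penalty <math>2d</math> whenever <math>n > e^2 \approx 7.4</math>, so BIC tends to favor more parsimonious models.
<syntaxhighlight lang="python">
# A minimal sketch (same hypothetical polynomial setup as the AIC example above):
# compute BIC = d*ln(n) - 2*ln(L_hat) next to AIC for Gaussian least-squares fits.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = np.linspace(-2, 2, n)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(scale=0.8, size=n)

def aic_bic_polynomial(x, y, degree):
    """Return (AIC, BIC) of a polynomial least-squares fit with Gaussian errors."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2_hat = np.mean(residuals**2)
    log_lik_hat = -0.5 * len(y) * (np.log(2 * np.pi * sigma2_hat) + 1)
    d = (degree + 1) + 1                   # coefficients plus the error variance
    return 2 * d - 2 * log_lik_hat, d * np.log(len(y)) - 2 * log_lik_hat

# The BIC penalty d*ln(n) grows with n, while the AIC penalty 2d does not,
# so for n = 100 the BIC penalizes extra parameters more heavily.
for degree in range(6):
    print(degree, aic_bic_polynomial(x, y, degree))
</syntaxhighlight>
With <math>n = 100</math>, each additional parameter costs about <math>\ln(100) \approx 4.6</math> BIC units versus exactly 2 AIC units.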
=== Approximating the Bayes Factor ===
Suppose we are considering the <math>k</math> models <math>\mathcal{M}_1, \ldots, \mathcal{M}_k</math>. If <math>p_j</math> is the prior probability that the <math>j^{\textrm{th}} </math> model is correct, then the ''posterior'' probability that the <math>j^{\textrm{th}} </math> model is correct equals
<math display = "block"> | |||
\operatorname{P}(\mathcal{M}_j | X_1, \ldots, X_n ) = \frac{p_j \int \mathcal{L}(\theta_j \, ; X_1, \ldots, X_n ) g_j(\theta_j) \, d\theta_j}{\sum_i p_i \int\mathcal{L}(\theta_i \, ; X_1, \ldots, X_n ) g_i(\theta_i) \, d\theta_i} | |||
</math> | |||
with <math>\mathcal{L}(\theta_j \, ; X_1, \ldots, X_n ) </math> denoting the likelihood function for the <math>j^{\textrm{th}} </math> model and <math>g_j</math> the prior density on its parameters. Under very restrictive conditions, we have the following approximation as <math>n </math> tends to infinity:
<math display = "block" > | |||
\int \mathcal{L}(\theta_j) g_j(\theta_j) \, d\theta_j \approx \exp(\ln(\widehat L) - d_j\ln(n)) = \exp(-\operatorname{BIC}_j / 2). | |||
</math> | |||
Using the approximation above, we can approximate the posterior odds, and hence the [[wikipedia:Bayes factor|Bayes factor]], for two competing models <math>\mathcal{M}_i </math> and <math>\mathcal{M}_j </math>:
<math display = "block"> | |||
\frac{\operatorname{P}(\mathcal{M}_i | X_1,\ldots, X_n)}{\operatorname{P}(\mathcal{M}_j | X_1,\ldots, X_n)} \approx \frac{p_i}{p_j} \exp[(\operatorname{BIC}_j - \operatorname{BIC}_i)/2]. | |||
</math> | |||
In other words, given certain conditions on the models, the BIC is ''asymptotically'' equivalent to the [[wikipedia:Bayes factor | Bayesian model comparison method]] for model selection. | |||
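The following sketch illustrates this asymptotic equivalence on a deliberately simple, hypothetical pair of models (the sample size, true mean, and prior scale are all illustrative assumptions): the exact log Bayes factor is obtained by numerically integrating the likelihood against the prior, and is compared with the BIC-based approximation <math>(\operatorname{BIC}_1 - \operatorname{BIC}_2)/2</math>.
<syntaxhighlight lang="python">
# A sketch of the asymptotic equivalence above, for n i.i.d. observations with
# known unit variance:
#   M1: X_i ~ N(0, 1)                                 (no free parameters, d_1 = 0)
#   M2: X_i ~ N(theta, 1), prior theta ~ N(0, tau^2)  (d_2 = 1)
import numpy as np
from scipy import integrate
from scipy.stats import norm

rng = np.random.default_rng(2)
n, theta_true, tau = 200, 0.3, 1.0        # illustrative values
x = rng.normal(theta_true, 1.0, size=n)

def log_lik(theta):
    """Log-likelihood of the sample under N(theta, 1)."""
    return np.sum(norm.logpdf(x, loc=theta, scale=1.0))

theta_hat = x.mean()                                    # MLE under M2
log_lik_hat = log_lik(theta_hat)

# Marginal likelihood of M2: integrate the likelihood against the prior density,
# scaling by the maximized likelihood to keep the integrand well within range.
scaled, _ = integrate.quad(
    lambda t: np.exp(log_lik(t) - log_lik_hat) * norm.pdf(t, scale=tau),
    -5.0, 5.0, points=[theta_hat])
log_m2 = np.log(scaled) + log_lik_hat
log_m1 = log_lik(0.0)                                   # M1 has nothing to integrate over

# BIC for each model (the variance is known, so it is not a counted parameter).
bic1 = -2.0 * log_lik(0.0)                              # d_1 = 0
bic2 = 1.0 * np.log(n) - 2.0 * log_lik_hat              # d_2 = 1

log_bayes_factor_exact = log_m2 - log_m1                # log of m_2 / m_1
log_bayes_factor_bic = (bic1 - bic2) / 2.0              # BIC-based approximation
print(log_bayes_factor_exact, log_bayes_factor_bic)     # should be of similar size
</syntaxhighlight>
Both quantities should point to the same model; because the BIC approximation discards terms of order one (including any dependence on the prior), exact agreement is not expected even for large <math>n</math>.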
==References==
<references/>
==Notes==
{{notelist}}
==Wikipedia References==
*{{cite web |url=https://en.wikipedia.org/w/index.php?title=Model_selection&oldid=916835318 |title=Model selection |author=Wikipedia contributors |website=Wikipedia |publisher=Wikipedia |access-date=30 September 2020 }}
*{{cite web |url=https://en.wikipedia.org/w/index.php?title=Akaike_information_criterion&oldid=916478815 |title=Akaike information criterion |author=Wikipedia contributors |website=Wikipedia |publisher=Wikipedia |access-date=30 September 2020 }}
*{{cite web |url=https://en.wikipedia.org/w/index.php?title=Bayesian_information_criterion&oldid=959629062 |title=Bayesian information criterion |author=Wikipedia contributors |website=Wikipedia |publisher=Wikipedia |access-date=30 September 2020 }}