'''Bootstrap aggregating''', also called '''bagging''' (from '''b'''ootstrap '''agg'''regat'''ing'''), is a [[Ensemble learning|machine learning ensemble]] [[meta-algorithm|meta-algorithm]] designed to improve the [[Stability (learning theory)|stability]] and accuracy of machine learning algorithms used in [[guide:811da45443#Classification|statistical classification]] and [[guide:811da45443#Regression|regression]]. It also reduces [[variance|variance]] and helps to avoid [[overfitting|overfitting]]. Although it is usually applied to [[guide:67caf22221|decision tree]] methods, it can be used with any type of method. Bagging is a special case of the [[Ensemble learning|model averaging]] approach.


==Description of the technique==
Given a standard [[training set|training set]] <math>D</math> of size <math>n</math>, bagging generates <math>m</math> new training sets <math>D_i</math>, each of size <math>n'</math>, by [[Sampling (statistics)|sampling]] from <math>D</math> [[Probability distribution#With finite support|uniformly]] and [[Sampling (statistics)#Replacement of selected units|with replacement]]. By sampling with replacement, some observations may be repeated in each <math>D_i</math>. If <math>n'=n</math>, then for large <math>n</math> the set <math>D_i</math> is expected to have the fraction (1 - 1/''[[e (mathematical constant)|e]]'') (≈63.2%) of the unique examples of <math>D</math>, the rest being duplicates.<ref>Aslam, Javed A.; Popa, Raluca A.; and Rivest, Ronald L. (2007); [http://people.csail.mit.edu/rivest/pubs/APR07.pdf ''On Estimating the Size and Confidence of a Statistical Audit''], Proceedings of the Electronic Voting Technology Workshop (EVT '07), Boston, MA, August 6, 2007. More generally, when drawing with replacement <math>n'</math> values out of a set of <math>n</math> (different and equally likely), the expected number of unique draws is <math>n(1 - e^{-n'/n})</math>.</ref> This kind of sample is known as a [[bootstrap (statistics)|bootstrap]] sample. Sampling with replacement ensures that each bootstrap sample is independent of the others, since a draw does not depend on any previously drawn sample. Then, <math>m</math> models are fitted using the above <math>m</math> bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
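For illustration, the following minimal R sketch (the values of <math>n</math> and <math>m</math> are arbitrary choices, not from any particular study) draws <math>m</math> bootstrap index sets from a training set of size <math>n</math> and checks the ≈63.2% figure empirically:

<syntaxhighlight lang="r">
set.seed(1)
n <- 10000                                   # size of the original training set D
m <- 25                                      # number of bootstrap training sets D_i

# Each column holds the indices of one D_i, drawn from D uniformly with replacement
D_idx <- replicate(m, sample(n, n, replace = TRUE))

# Fraction of the unique examples of D appearing in each D_i;
# for large n this approaches 1 - 1/e ≈ 0.632
mean(apply(D_idx, 2, function(idx) length(unique(idx)) / n))
</syntaxhighlight>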


[[File:Ensemble Bagging.svg|thumb|right|upright=2.0|302px|An illustration for the concept of bootstrap aggregating]]


Bagging leads to "improvements for unstable procedures",<ref name=":0">{{cite journal|last=Breiman|first=Leo|author-link=Leo Breiman|year=1996|title=Bagging predictors|journal=[[Machine Learning (journal)|Machine Learning]]|volume=24|issue=2|pages=123–140|citeseerx=10.1.1.32.9399|doi=10.1007/BF00058655|s2cid=47328136}}</ref> which include, for example, [[artificial neural networks|artificial neural networks]], [[classification and regression tree|classification and regression tree]]s, and subset selection in [[guide:F7d7868547|linear regression]].<ref name = ":1">{{cite journal|last=Breiman |first=Leo |date=September 1994 |title=Bagging Predictors |url=https://www.stat.berkeley.edu/~breiman/bagging.pdf |journal=Technical Report |publisher=Department of Statistics, University of California Berkeley |number=421 |access-date=2019-07-28}}</ref> Bagging was shown to improve preimage learning.<ref>Sahu, A., Runger, G., Apley, D., [https://www.researchgate.net/profile/Anshuman_Sahu/publication/254023773_Image_denoising_with_a_multi-phase_kernel_principal_component_approach_and_an_ensemble_version/links/5427b5e40cf2e4ce940a4410/Image-denoising-with-a-multi-phase-kernel-principal-component-approach-and-an-ensemble-version.pdf Image denoising with a multi-phase kernel principal component approach and an ensemble version], IEEE Applied Imagery Pattern Recognition Workshop, pp.1-7, 2011.</ref><ref>Shinde, Amit, Anshuman Sahu, Daniel Apley, and George Runger. "[https://www.researchgate.net/profile/Anshuman_Sahu/publication/263388433_Preimages_for_variation_patterns_from_kernel_PCA_and_bagging/links/5427b3930cf26120b7b35ebd/Preimages-for-variation-patterns-from-kernel-PCA-and-bagging.pdf Preimages for Variation Patterns from Kernel PCA and Bagging]." IIE Transactions, Vol.46, Iss.5, 2014</ref> On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors.<ref name=":0" />


=== Creating the bootstrap dataset ===
The bootstrap dataset is made by randomly picking objects from the original dataset. It must also be the same size as the original dataset; the difference is that the bootstrap dataset can contain duplicate objects. Here is a simple example to demonstrate how it works:

Suppose the original dataset is a group of 12 people: Emily, Jessie, George, Constantine, Lexi, Theodore, John, James, Rachel, Anthony, Ellie, and Jamal.

By randomly picking names with replacement, a bootstrap dataset might contain James, Ellie, Constantine, Lexi, John, Constantine, Theodore, Constantine, Anthony, Lexi, Constantine, and Theodore. In this case, the bootstrap sample contains Constantine four times, and Lexi and Theodore twice each.

=== Creating the out-of-bag dataset ===

The out-of-bag dataset consists of the remaining people who were not selected for the bootstrap dataset. It can be calculated by taking the difference between the original and the bootstrap datasets. In this case, the remaining samples who were not selected are Emily, Jessie, George, Rachel, and Jamal. Keep in mind that since both datasets are treated as sets, the duplicate names in the bootstrap dataset are ignored when taking the difference.
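A small R sketch of this construction is shown below. The particular draw depends on the random seed, so it will generally differ from the names listed above, but the out-of-bag set is always the set difference between the original dataset and the bootstrap dataset:

<syntaxhighlight lang="r">
people <- c("Emily", "Jessie", "George", "Constantine", "Lexi", "Theodore",
            "John", "James", "Rachel", "Anthony", "Ellie", "Jamal")

set.seed(3)
bootstrap  <- sample(people, length(people), replace = TRUE)  # same size, duplicates allowed
out_of_bag <- setdiff(people, bootstrap)                      # everyone who was never drawn

table(bootstrap)   # how often each selected person appears
out_of_bag
</syntaxhighlight>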


== Example: ozone data ==
To illustrate the basic principles of bagging, below is an analysis of the relationship between [[ozone|ozone]] and temperature (data from [[Peter Rousseeuw|Rousseeuw]] and Leroy (1986), analysis done in [[R (programming language)|R]]).


The relationship between temperature and ozone appears to be nonlinear in this data set, based on the scatter plot. To describe this relationship mathematically, [[local regression|LOESS]] smoothers (with bandwidth 0.5) are used. Rather than building a single smoother for the complete data set, 100 [[bootstrap (statistics)|bootstrap]] samples were drawn. Each sample is drawn with replacement from the original data and maintains a semblance of the master set’s distribution and variability. For each bootstrap sample, a LOESS smoother was fit. Predictions from these 100 smoothers were then made across the range of the data. The black lines represent these initial predictions. The lines lack agreement in their predictions and tend to overfit their data points, as is evident from their wobbly flow.


[[image:Ozone.png|center]]
By taking the average of 100 smoothers, each corresponding to a subset of the original data set, we arrive at one bagged predictor (red line). The red line's flow is stable and does not overly conform to any data point(s).
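A minimal R sketch of this analysis is given below. It uses R's built-in <code>airquality</code> data set as a stand-in for the Rousseeuw and Leroy ozone data (the original data are not reproduced here) and <code>loess()</code> smoothers with span 0.5:

<syntaxhighlight lang="r">
set.seed(42)
d    <- na.omit(airquality[, c("Temp", "Ozone")])        # stand-in for the ozone data
grid <- data.frame(Temp = seq(min(d$Temp), max(d$Temp), length.out = 200))

# Fit a LOESS smoother to each of 100 bootstrap samples and predict on a common grid
preds <- replicate(100, {
  boot <- d[sample(nrow(d), replace = TRUE), ]
  fit  <- loess(Ozone ~ Temp, data = boot, span = 0.5)
  predict(fit, newdata = grid)
})

bagged <- rowMeans(preds, na.rm = TRUE)                  # the bagged predictor (red line)

plot(Ozone ~ Temp, data = d, pch = 16, col = "grey40")
matlines(grid$Temp, preds, lty = 1, col = "black")       # the 100 individual smoothers
lines(grid$Temp, bagged, col = "red", lwd = 2)           # their average
</syntaxhighlight>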
==Random Forests==
[[File:Random forest diagram complete.png|thumb|Diagram of a random decision forest]]


'''Random forests''' or '''random decision forests''' is an [[ensemble learning|ensemble learning]] method for [[guide:811da45443#Classification|statistical classification]], [[guide:811da45443#Regression|regression]] and other tasks that operates by constructing a multitude of [[guide:67caf22221|decision trees]] at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned.<ref name="ho1995"/><ref name="ho1998"/> Random decision forests correct for decision trees' habit of [[overfitting|overfitting]] to their [[Test set|training set]].{{r|elemstatlearn}}{{rp|587–588}} Random forests generally outperform [[guide:67caf22221|decision trees]], but their accuracy is lower than gradient boosted trees. However, data characteristics can affect their performance.<ref name=":02">{{Cite journal|last1=Piryonesi S. Madeh|last2=El-Diraby Tamer E.|date=2020-06-01|title=Role of Data Analytics in Infrastructure Asset Management: Overcoming Data Size and Quality Problems|journal=Journal of Transportation Engineering, Part B: Pavements|volume=146|issue=2|pages=04020022|doi=10.1061/JPEODX.0000175|s2cid=216485629}}</ref><ref name=":00">{{Cite journal|last1=Piryonesi|first1=S. Madeh|last2=El-Diraby|first2=Tamer E.|date=2021-02-01|title=Using Machine Learning to Examine Impact of Type of Performance Indicator on Flexible Pavement Deterioration Modeling|url=http://ascelibrary.org/doi/10.1061/%28ASCE%29IS.1943-555X.0000602|journal=Journal of Infrastructure Systems|language=en|volume=27|issue=2|pages=04021005|doi=10.1061/(ASCE)IS.1943-555X.0000602|s2cid=233550030|issn=1076-0342|via=}}</ref>
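As a brief usage sketch (with R's built-in data sets standing in for a real application), the ''randomForest'' R package exposes both modes directly:

<syntaxhighlight lang="r">
library(randomForest)

# Classification: each tree votes and the forest returns the majority class
rf_cls <- randomForest(Species ~ ., data = iris, ntree = 500)
predict(rf_cls, head(iris))

# Regression: the forest returns the average of the individual tree predictions
d      <- na.omit(airquality)
rf_reg <- randomForest(Ozone ~ ., data = d, ntree = 500)
predict(rf_reg, head(d))
</syntaxhighlight>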


The first algorithm for random decision forests was created in 1995 by [[Tin Kam Ho|Tin Kam Ho]]<ref name="ho1995">{{cite conference
  |first        = Tin Kam
  |last         = Ho
  |title        = Random Decision Forests
  |conference   = Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995
  |year         = 1995
  |pages        = 278–282
}}</ref> using the [[random subspace method|random subspace method]],<ref name="ho1998">{{cite journal | first = Tin Kam | last = Ho | name-list-style = vanc | title = The Random Subspace Method for Constructing Decision Forests | journal = IEEE Transactions on Pattern Analysis and Machine Intelligence | year = 1998 | volume = 20 | issue = 8 | pages = 832–844 | doi = 10.1109/34.709601 | url = http://ect.bell-labs.com/who/tkh/publications/papers/df.pdf }}</ref> which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.<ref name="kleinberg1990">{{cite journal |first=Eugene |last=Kleinberg | name-list-style = vanc |title=Stochastic Discrimination |journal=[[Annals of Mathematics and Artificial Intelligence|Annals of Mathematics and Artificial Intelligence]] |year=1990 |volume=1 |issue=1–4 |pages=207–239 |url=https://pdfs.semanticscholar.org/faa4/c502a824a9d64bf3dc26eb90a2c32367921f.pdf |archive-url=https://web.archive.org/web/20180118124007/https://pdfs.semanticscholar.org/faa4/c502a824a9d64bf3dc26eb90a2c32367921f.pdf |url-status=dead |archive-date=2018-01-18 |doi=10.1007/BF01531079|citeseerx=10.1.1.25.6750 |s2cid=206795835 }}</ref><ref name="kleinberg1996">{{cite journal |first=Eugene |last=Kleinberg | name-list-style = vanc |title=An Overtraining-Resistant Stochastic Modeling Method for Pattern Recognition |journal=[[Annals of Statistics|Annals of Statistics]] |year=1996 |volume=24 |issue=6 |pages=2319–2349 |doi=10.1214/aos/1032181157 |mr=1425956|doi-access=free }}</ref><ref name="kleinberg2000">{{cite journal|first=Eugene|last=Kleinberg| name-list-style = vanc |title=On the Algorithmic Implementation of Stochastic Discrimination|journal=IEEE Transactions on PAMI|year=2000|volume=22|issue=5|pages=473–490|url=https://pdfs.semanticscholar.org/8956/845b0701ec57094c7a8b4ab1f41386899aea.pdf|archive-url=https://web.archive.org/web/20180118124006/https://pdfs.semanticscholar.org/8956/845b0701ec57094c7a8b4ab1f41386899aea.pdf|url-status=dead|archive-date=2018-01-18|doi=10.1109/34.857004|citeseerx=10.1.1.33.4131|s2cid=3563126}}</ref>


An extension of the algorithm was developed by [[Leo Breiman|Leo Breiman]]<ref name="breiman2001">{{cite journal | first = Leo | last = Breiman | author-link = Leo Breiman | name-list-style = vanc | title = Random Forests | journal = [[Machine Learning (journal)|Machine Learning]] | year = 2001 | volume = 45 | issue = 1 | pages = 5–32 | doi = 10.1023/A:1010933404324 | bibcode = 2001MachL..45....5B | doi-access = free }}</ref> and [[Adele Cutler|Adele Cutler]],<ref name="rpackage"/> who registered<ref>U.S. trademark registration number 3185828, registered 2006/12/19.</ref> "Random Forests" as a [[trademark|trademark]] in 2006.<ref>{{cite web|url=https://trademarks.justia.com/786/42/random-78642027.html|title=RANDOM FORESTS Trademark of Health Care Productivity, Inc. - Registration Number 3185828 - Serial Number 78642027 :: Justia Trademarks}}</ref> The extension combines Breiman's "[[Bootstrap aggregating|bagging]]" idea and random selection of features, introduced first by Ho<ref name="ho1995"/> and later independently by Amit and [[Donald Geman|Geman]]<ref name="amitgeman1997">{{cite journal | last1 = Amit | first1 = Yali | last2 = Geman | first2 = Donald | author-link2 = Donald Geman | name-list-style = vanc | title = Shape quantization and recognition with randomized trees | journal = [[Neural Computation (journal)|Neural Computation]] | year = 1997 | volume = 9 | issue = 7 | pages = 1545–1588 | doi = 10.1162/neco.1997.9.7.1545 | url = http://www.cis.jhu.edu/publications/papers_in_database/GEMAN/shape.pdf | citeseerx = 10.1.1.57.6069 | s2cid = 12470146 }}</ref> in order to construct a collection of decision trees with controlled variance.


Random forests are frequently used as "blackbox" models in businesses, as they generate reasonable predictions across a wide range of data while requiring little configuration.
==Algorithm==


===Preliminaries: decision tree learning===


Decision trees are a popular method for various machine learning tasks. Tree learning "come[s] closest to meeting the requirements for serving as an off-the-shelf procedure for data mining", say [[Trevor Hastie|Hastie]] ''et al.'', "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate".<ref name="elemstatlearn">{{ElemStatLearn}}</ref>{{rp|352}}


In particular, trees that are grown very deep tend to learn highly irregular patterns: they [[overfitting|overfit]] their training sets, i.e. have [[guide:E0f9e256bf#Motivation|low bias, but very high variance]]. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.<ref name="elemstatlearn"/>{{rp|587–588}} This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.


Forests can be thought of as pooling the efforts of many decision trees: combining the predictions of many trees improves on the performance of any single random tree.


===Bagging===


The training algorithm for random forests applies the general technique of [[#Description of the technique|bootstrap aggregating]], or bagging, to tree learners. Given a training set <math>X = x_1,\ldots,x_n</math> with responses <math>Y=y_1,\ldots,y_n</math>, bagging repeatedly (<math>B</math> times) selects a random sample with replacement of the training set and fits trees to these samples:


<proc label = "Bagging">
For <math>b = 1, \ldots, B</math>:
# Sample, with replacement, <math>n</math> training examples from <math>X</math>, <math>Y</math>; call these <math>X_b</math>, <math>Y_b</math>.
# Train a classification or regression tree <math>f_b</math> on <math>X_b</math>, <math>Y_b</math>.
</proc>

After training, predictions for an unseen sample <math>x'</math> can be made by averaging the predictions from all the individual regression trees on <math>x'</math>:

<math display = "block">\hat{f} = \frac{1}{B} \sum_{b=1}^B f_b(x'),</math>

or by taking the majority vote in the case of classification trees.




This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.


Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on <math>x'</math>:
<math display = "block">\sigma = \sqrt{\frac{\sum_{b=1}^B (f_b(x')  - \hat{f})^2}{B-1} }.</math>
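A compact R sketch of this scheme, using the ''rpart'' package for the individual regression trees and illustrative helper functions (not from any particular library), returns both the bagged prediction and its standard deviation:

<syntaxhighlight lang="r">
library(rpart)

# Fit B trees, each to a bootstrap sample of the training data
bagged_trees <- function(data, formula, B = 100) {
  lapply(seq_len(B), function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    rpart(formula, data = boot)
  })
}

# Average the B tree predictions f_b(x') and report their standard deviation
predict_bagged <- function(trees, newdata) {
  preds <- sapply(trees, predict, newdata = newdata)   # one column per tree
  list(mean = rowMeans(preds), sd = apply(preds, 1, sd))
}

d     <- na.omit(airquality[, c("Temp", "Wind", "Ozone")])
trees <- bagged_trees(d, Ozone ~ Temp + Wind, B = 100)
predict_bagged(trees, data.frame(Temp = c(60, 75, 90), Wind = c(12, 10, 8)))
</syntaxhighlight>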


The number of samples/trees, <math>B</math>, is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees <math>B</math> can be found using [[Cross-validation (statistics)|cross-validation]], or by observing the ''[[out-of-bag error|out-of-bag error]]'': the mean prediction error on each training sample <math>x_i</math>, using only the trees that did not have <math>x_i</math> in their bootstrap sample.<ref name="islr">{{cite book |author1=Gareth James |author2=Daniela Witten |author3=Trevor Hastie |author4=Robert Tibshirani |title=An Introduction to Statistical Learning |publisher=Springer |year=2013 |url=http://www-bcf.usc.edu/~gareth/ISL/ |pages=316–321}}</ref>
The training and test error tend to level off after some number of trees have been fit.
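The ''randomForest'' R package records the out-of-bag error as trees are added, which makes this leveling off easy to inspect; a short sketch with illustrative data:

<syntaxhighlight lang="r">
library(randomForest)

d  <- na.omit(airquality)
rf <- randomForest(Ozone ~ ., data = d, ntree = 1000)

plot(rf)                  # out-of-bag MSE as a function of the number of trees
rf$mse[c(10, 100, 1000)]  # the error typically levels off well before the maximum
</syntaxhighlight>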


===From bagging to random forests===


The above procedure describes the original bagging algorithm for trees. Random forests also include another type of bagging scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a [[Random subspace method|random subset of the features]]. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the <math>B</math> trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho.<ref name="ho2002">
{{cite journal | first = Tin Kam | last = Ho | title = A Data Complexity Analysis of Comparative Advantages of Decision Forest Constructors | journal = Pattern Analysis and Applications | volume = 5 | issue = 2 | year = 2002 | pages = 102–112 | url = http://ect.bell-labs.com/who/tkh/publications/papers/compare.pdf | doi = 10.1007/s100440200009 | s2cid = 7415435 }}</ref>
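In the ''randomForest'' R package, this feature bagging is controlled by the <code>mtry</code> argument, the number of randomly chosen features considered at each split; setting it to the total number of features recovers ordinary bagging of trees. A short comparison sketch with illustrative data and settings:

<syntaxhighlight lang="r">
library(randomForest)

d <- na.omit(airquality)   # five predictors of Ozone: Solar.R, Wind, Temp, Month, Day

set.seed(7)
rf_bagging <- randomForest(Ozone ~ ., data = d, mtry = 5, ntree = 500)  # all features at each split
rf_forest  <- randomForest(Ozone ~ ., data = d, mtry = 2, ntree = 500)  # random subset at each split

c(bagging = tail(rf_bagging$mse, 1), random_forest = tail(rf_forest$mse, 1))  # final OOB errors
</syntaxhighlight>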


=== Variable importance ===


Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way.  The following technique was described in Breiman's original paper<ref name=breiman2001/> and is implemented in the [[R (programming language)|R]] package ''randomForest''.<ref name="rpackage">{{cite web |url=https://cran.r-project.org/web/packages/randomForest/randomForest.pdf |title=Documentation for R package randomForest |first1=Andy |last1=Liaw | name-list-style = vanc | date=16 October 2012 |access-date=15 March 2013}}
</ref>


The first step in measuring the variable importance in a data set <math>\mathcal{D}_n =\{(X_i, Y_i)\}_{i=1}^n</math> is to fit a random forest to the data.  During the fitting process the [[out-of-bag error|out-of-bag error]] for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training).


To measure the importance of the <math>j</math>-th feature after training, the values of the <math>j</math>-th feature are permuted among the training data and the out-of-bag error is again computed on this perturbed data set.  The importance score for the <math>j</math>-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees.  The score is normalized by the standard deviation of these differences.
Features which produce large values for this score are ranked as more important than features which produce small values. The statistical definition of the variable importance measure was given and analyzed by Zhu ''et al.''<ref>{{cite journal | vauthors = Zhu R, Zeng D, Kosorok MR | title = Reinforcement Learning Trees | journal = Journal of the American Statistical Association | volume = 110 | issue = 512 | pages = 1770–1784 | date = 2015 | pmid = 26903687 | pmc = 4760114 | doi = 10.1080/01621459.2015.1036994 }}</ref>
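This permutation-based score is what the ''randomForest'' R package reports when the model is fitted with <code>importance = TRUE</code>; a brief sketch with illustrative data:

<syntaxhighlight lang="r">
library(randomForest)

d  <- na.omit(airquality)
rf <- randomForest(Ozone ~ ., data = d, ntree = 500, importance = TRUE)

importance(rf, type = 1)   # permutation importance (%IncMSE), scaled by its standard deviation
varImpPlot(rf, type = 1)   # rank the features by this score
</syntaxhighlight>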


This method of determining variable importance has some drawbacks. For data including categorical variables with different numbers of levels, random forests are biased in favor of those attributes with more levels. Methods such as [[partial permutation|partial permutation]]s<ref>{{cite conference
|author=Deng, H.|author2=Runger, G. |author3=Tuv, E.
  |title=Bias of importance measures for multi-valued attributes and solutions
==Boosting==


In machine learning, '''boosting''' is an [[Ensemble learning|ensemble]] [[meta-algorithm|meta-algorithm]] for primarily reducing [[guide:811da45443#Bias-variance tradeoff|bias]], and also variance<ref>{{cite web|url=http://oz.berkeley.edu/~breiman/arcall96.pdf|archive-url=https://web.archive.org/web/20150119081741/http://oz.berkeley.edu/~breiman/arcall96.pdf|url-status=dead|archive-date=2015-01-19|title=BIAS, VARIANCE, AND ARCING CLASSIFIERS|last1=Leo Breiman|date=1996|publisher=TECHNICAL REPORT|quote=Arcing [Boosting] is more successful than bagging in variance reduction|access-date=19 January 2015}}</ref> in [[guide:811da45443#Supervised Learning|supervised learning]], and a family of machine learning algorithms that convert weak learners to strong ones.<ref>{{cite book |last=Zhou Zhi-Hua  |date=2012 |title=Ensemble Methods: Foundations and Algorithms |publisher= Chapman and Hall/CRC |page=23 |isbn=978-1439830031 |quote=The term boosting refers to a family of algorithms that are able to convert weak learners to strong learners }}</ref> Boosting is based on the question posed by [[Michael Kearns (computer scientist)|Kearns]] and [[Leslie Valiant|Valiant]] (1988, 1989)<!--Please do not cite only one, because "Kearns and Valiant" is used as a convention to denote this question.-->:<ref name="Kearns88">Michael Kearns(1988); [http://www.cis.upenn.edu/~mkearns/papers/boostnote.pdf ''Thoughts on Hypothesis Boosting''], Unpublished manuscript (Machine Learning class project, December 1988)</ref><ref>{{cite book |last1=Michael Kearns |last2=Leslie Valiant  |date=1989 |title={{sic|Crytographic}} limitations on learning Boolean formulae and finite automata |journal=Symposium on Theory of Computing |publisher=ACM |volume=21 |pages=433–444 |doi=10.1145/73007.73049 |isbn= 978-0897913072|s2cid=536357 }}</ref> "Can a set of weak learners create a single strong learner?" A weak learner is defined to be a [[guide:811da45443|classifier]] that is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.


[[Robert Schapire|Robert Schapire]]'s affirmative answer in a 1990 paper<ref name="Schapire90">{{cite journal | first = Robert E. | last = Schapire | year = 1990 | citeseerx = 10.1.1.20.723 | url = http://www.cs.princeton.edu/~schapire/papers/strengthofweak.pdf | title = The Strength of Weak Learnability | journal = Machine Learning | volume = 5 | issue = 2 | pages = 197–227 | doi = 10.1007/bf00116037 | s2cid = 53304535 | access-date = 2012-08-23 | archive-url = https://web.archive.org/web/20121010030839/http://www.cs.princeton.edu/~schapire/papers/strengthofweak.pdf | archive-date = 2012-10-10 | url-status = dead }}</ref> to the question of Kearns and Valiant<!--Please do not cite only one, because "Kearns and Valiant" is used as a convention to denote this question.--> has had significant ramifications in machine learning and statistics, most notably leading to the development of boosting.<ref>{{cite journal |last = Leo Breiman |date = 1998|title = Arcing classifier (with discussion and a rejoinder by the author)|journal = Ann. Stat.|volume = 26|issue = 3|pages = 801–849|doi = 10.1214/aos/1024691079|quote = Schapire (1990) proved that boosting is possible. (Page 823)|doi-access = free}}</ref>


When first introduced, the ''hypothesis boosting problem'' simply referred to the process of turning a weak learner into a strong learner. "Informally, [the hypothesis boosting] problem asks whether an efficient learning algorithm […] that outputs a hypothesis whose performance is only slightly better than random guessing [i.e. a weak learner] implies the existence of an efficient algorithm that outputs a hypothesis of arbitrary accuracy [i.e. a strong learner]."<ref name="Kearns88" /> Algorithms that achieve hypothesis boosting quickly became simply known as "boosting". [[Yoav Freund|Freund]] and Schapire's arcing (Adapt[at]ive Resampling and Combining),<ref>Yoav Freund and Robert E. Schapire (1997); [https://www.cis.upenn.edu/~mkearns/teaching/COLT/adaboost.pdf ''A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting''], Journal of Computer and System Sciences, 55(1):119-139</ref> as a general technique, is more or less synonymous with boosting.<ref>Leo Breiman (1998); [http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1024691079 ''Arcing Classifier (with Discussion and a Rejoinder by the Author)''], Annals of Statistics, vol. 26, no. 3, pp. 801-849: "The concept of weak learning was introduced by Kearns and Valiant (1988<!-- Michael Kearns, Leslie G. Valiant (1988); ''Learning Boolean Formulae or Finite Automata is as Hard as Factoring'', Technical Report TR-14-88, Harvard University Aiken Computation Laboratory, August 1988 -->, 1989<!-- Michael Kearns, Leslie G. Valiant (1989) ''Cryptographic Limitations on Learning Boolean Formulae and Finite Automata'', Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing (pp. 433-444). New York, NY: ACM Press, later republished in the Journal of the Association for Computing Machinery, 41(1):67–95, January 1994 -->), who left open the question of whether weak and strong learnability are equivalent. The question was termed the ''boosting problem'' since [a solution must] boost the low accuracy of a weak learner to the high accuracy of a strong learner. Schapire (1990) proved that boosting is possible. A ''boosting algorithm'' is a method that takes a weak learner and converts it into a strong learner. Freund and Schapire (1997) proved that an algorithm similar to arc-fs is boosting.</ref>


=== Boosting algorithms ===
[[File:Ensemble Boosting.svg|thumb|An illustration presenting the intuition behind the boosting algorithm, consisting of the parallel learners and weighted dataset.]]


There are many boosting algorithms.  The original ones, proposed by [[Robert Schapire|Robert Schapire]] (a [[Recursion (computer science)|recursive]] majority gate formulation)<ref name="Schapire90" /> and [[Yoav Freund|Yoav Freund]] (boost by majority),<ref name="Mason00">Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean (2000); ''Boosting Algorithms as Gradient Descent'', in [[Sara Solla|S. A. Solla]], T. K. Leen, and K.-R. Muller, editors, ''Advances in Neural Information Processing Systems'' 12, pp.&nbsp;512-518, MIT Press</ref> were not [[Adaptive behavior|adaptive]] and could not take full advantage of the weak learners.  Schapire and Freund then developed [[AdaBoost|AdaBoost]], an adaptive boosting algorithm that won the prestigious [[Gödel Prize|Gödel Prize]].
 


The main variation between many boosting algorithms is their method of [[weighting|weighting]] [[training data|training data]] points and [[Hypothesis|hypotheses]].  [[AdaBoost|AdaBoost]] is very popular and the most significant historically as it was the first algorithm that could adapt to the weak learners.  It is often the basis of introductory coverage of boosting in university machine learning courses.<ref>{{Cite web|url=http://math.mit.edu/~rothvoss/18.304.3PM/Presentations/1-Eric-Boosting304FinalRpdf.pdf|title=Boosting (AdaBoost algorithm)|last=Emer|first=Eric|website=MIT|access-date=2018-10-10}}</ref>  There are many more recent algorithms such as [[LPBoost|LPBoost]], TotalBoost, [[BrownBoost|BrownBoost]], [[xgboost|xgboost]], MadaBoost, [[LogitBoost|LogitBoost]], and others.  Many boosting algorithms fit into the AnyBoost framework,<ref name="Mason00"/> which shows that boosting performs [[gradient descent|gradient descent]] in a [[function space|function space]] using a [[Convex function|convex]] [[Loss functions for classification|cost function]].
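To illustrate the reweighting of training points and hypotheses, below is a minimal AdaBoost-style sketch in R with decision stumps (depth-one ''rpart'' trees) as the weak learners. It is a simplified illustration rather than the code of any particular library, and it assumes labels coded as −1/+1:

<syntaxhighlight lang="r">
library(rpart)

adaboost_fit <- function(X, y, M = 50) {        # y coded as -1 / +1
  n <- nrow(X)
  w <- rep(1 / n, n)                            # weights on the training data points
  stumps <- vector("list", M)
  alpha  <- numeric(M)                          # weights on the hypotheses
  df <- data.frame(X, y = factor(y))
  for (m in seq_len(M)) {
    stumps[[m]] <- rpart(y ~ ., data = df, weights = w,
                         control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
    pred <- ifelse(predict(stumps[[m]], df, type = "class") == "1", 1, -1)
    err  <- min(max(sum(w[pred != y]), 1e-10), 1 - 1e-10)  # weighted error of the stump
    alpha[m] <- 0.5 * log((1 - err) / err)
    w <- w * exp(-alpha[m] * y * pred)          # up-weight the misclassified points
    w <- w / sum(w)
  }
  list(stumps = stumps, alpha = alpha)
}

adaboost_predict <- function(model, X) {        # weighted vote of the weak learners
  df <- data.frame(X)
  votes <- sapply(seq_along(model$stumps), function(m) {
    model$alpha[m] * ifelse(predict(model$stumps[[m]], df, type = "class") == "1", 1, -1)
  })
  sign(rowSums(votes))
}
</syntaxhighlight>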


==References==

{{cite journal 
  | last = Breiman 
  | first = Leo 
  | year = 1996
  | title = Bagging predictors 
  | journal = [[Machine Learning (journal)|Machine Learning]] 
  | volume = 24 
  | issue = 2 
  | pages = 123–140
  | doi = 10.1007/BF00058655
}}

Revision as of 09:58, 14 April 2024

Bootstrap aggregating, also called bagging (from bootstrap aggregating), is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

Description of the technique

Given a standard training set [math]D[/math] of size [math]n[/math], bagging generates [math]m[/math] new training sets [math]D_i[/math], each of size [math]n[/math], by sampling from [math]D[/math] uniformly and with replacement. By sampling with replacement, some observations may be repeated in each [math]D_i[/math]. If [math]n=n'[/math], then for large [math]n[/math] the set [math]D_i[/math] is expected to have the fraction (1 - 1/e) (≈63.2%) of the unique examples of [math]D[/math], the rest being duplicates.[1] This kind of sample is known as a bootstrap sample. Sampling with replacement ensures each bootstrap is independent from its peers, as it does not depend on previous chosen samples when sampling. Then, [math]m[/math] models are fitted using the above [math]m[/math] bootstrap samples and combined by averaging the output (for regression) or voting (for classification).

An illustration for the concept of bootstrap aggregating

Bagging leads to "improvements for unstable procedures",[2] which include, for example, artificial neural networks, classification and regression trees, and subset selection in linear regression.[3] Bagging was shown to improve preimage learning.[4][5] On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors.[2]

Creating the bootstrap dataset

The bootstrap dataset is made by randomly picking objects from the original dataset with replacement, so it is the same size as the original dataset but may contain duplicate objects. Here is a simple example to demonstrate how it works, along with the illustration below:

Bootstrap Example


Suppose the original dataset is a group of 12 people: Emily, Jessie, George, Constantine, Lexi, Theodore, John, James, Rachel, Anthony, Ellie, and Jamal.

By randomly picking names with replacement, suppose the bootstrap dataset turns out to be James, Ellie, Constantine, Lexi, John, Constantine, Theodore, Constantine, Anthony, Lexi, Constantine, and Theodore. In this case, the bootstrap sample contains Constantine four times and Lexi and Theodore twice each.

Creating the out-of-bag dataset

The out-of-bag dataset consists of the remaining people who were not selected for the bootstrap dataset. It can be obtained by taking the set difference between the original and the bootstrap datasets: here the examples not selected are Emily, Jessie, George, Rachel, and Jamal. Keep in mind that since both collections are treated as sets, duplicate names in the bootstrap dataset are ignored when taking the difference. The illustration below shows how the math is done:

Complete Example
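
The same construction is easy to sketch in code. The following minimal Python sketch (the seed and the particular draw are illustrative only and will not reproduce the exact names listed above) draws a bootstrap sample with replacement and takes a set difference to obtain the out-of-bag dataset:

    import random

    original = ["Emily", "Jessie", "George", "Constantine", "Lexi", "Theodore",
                "John", "James", "Rachel", "Anthony", "Ellie", "Jamal"]

    random.seed(0)  # only to make this illustration reproducible

    # Bootstrap dataset: same size as the original, drawn with replacement,
    # so duplicate names are possible.
    bootstrap = [random.choice(original) for _ in range(len(original))]

    # Out-of-bag dataset: everyone in the original who never appears in the
    # bootstrap sample. Duplicates in the bootstrap sample do not matter here,
    # because the difference is taken between sets.
    out_of_bag = sorted(set(original) - set(bootstrap))

    print(bootstrap)
    print(out_of_bag)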

Example: ozone data

To illustrate the basic principles of bagging, below is an analysis of the relationship between ozone and temperature (data from Rousseeuw and Leroy (1986), analysis done in R).

The relationship between temperature and ozone appears to be nonlinear in this data set, based on the scatter plot. To describe this relationship mathematically, LOESS smoothers (with bandwidth 0.5) are used. Rather than building a single smoother for the complete data set, 100 bootstrap samples were drawn. Each bootstrap sample is drawn with replacement from the original data and maintains a semblance of its distribution and variability. A LOESS smoother was fit to each bootstrap sample, and predictions from these 100 smoothers were then made across the range of the data. The black lines represent these initial predictions. The lines lack agreement in their predictions and tend to overfit their data points, as is evident from their wobbly flow.

By taking the average of 100 smoothers, each corresponding to a subset of the original data set, we arrive at one bagged predictor (red line). The red line's flow is stable and does not overly conform to any data point(s).
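
The same bagged-smoother construction can be sketched in Python (the analysis above was done in R; here statsmodels' LOWESS smoother stands in for LOESS, the data are synthetic stand-ins for the ozone measurements, and frac=0.5 plays the role of the bandwidth):

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(0)

    # Synthetic stand-ins for the ozone data: a nonlinear trend plus noise.
    temperature = np.sort(rng.uniform(60, 100, size=111))
    ozone = 0.05 * (temperature - 75) ** 2 + rng.normal(0, 8, size=111)

    grid = np.linspace(temperature.min(), temperature.max(), 200)
    smoothers = []
    for _ in range(100):
        # Draw a bootstrap sample of (temperature, ozone) pairs.
        idx = rng.integers(0, len(temperature), size=len(temperature))
        # Fit a smoother to the bootstrap sample.
        fit = lowess(ozone[idx], temperature[idx], frac=0.5, return_sorted=True)
        # Evaluate the fitted curve on a common grid of temperatures.
        smoothers.append(np.interp(grid, fit[:, 0], fit[:, 1]))

    # The bagged predictor (the red line) is the pointwise average of the 100 smoothers.
    bagged = np.mean(smoothers, axis=0)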

Random Forests

Diagram of a random decision forest

Random forests or random decision forests are an ensemble learning method for statistical classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned.[6][7] Random decision forests correct for decision trees' habit of overfitting to their training set.[18]:587–588 Random forests generally outperform decision trees, but their accuracy is lower than that of gradient boosted trees. However, data characteristics can affect their performance.[8][9]
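
As an illustration of this aggregation rule (a hedged scikit-learn sketch, not the original authors' implementation), the classifier returns the class favoured by the most trees while the regressor returns the average of the trees' predictions:

    from sklearn.datasets import load_diabetes, load_iris
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    # Classification: the predicted label is the class selected by most trees.
    X, y = load_iris(return_X_y=True)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(clf.predict(X[:3]))

    # Regression: the prediction is the mean of the individual trees' outputs.
    Xr, yr = load_diabetes(return_X_y=True)
    reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xr, yr)
    print(reg.predict(Xr[:3]))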

The first algorithm for random decision forests was created in 1995 by Tin Kam Ho[6] using the random subspace method,[7] which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.[10][11][12]

An extension of the algorithm was developed by Leo Breiman[13] and Adele Cutler,[14] who registered[15] "Random Forests" as a trademark in 2006.[16] The extension combines Breiman's "bagging" idea and random selection of features, introduced first by Ho[6] and later independently by Amit and Geman[17] in order to construct a collection of decision trees with controlled variance.

Random forests are frequently used as "blackbox" models in businesses, as they generate reasonable predictions across a wide range of data while requiring little configuration.

Preliminaries: decision tree learning

Decision trees are a popular method for various machine learning tasks. Tree learning "come[s] closest to meeting the requirements for serving as an off-the-shelf procedure for data mining", say Hastie et al., "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate".[18]:352

In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.[18]:587–588 This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.

Forests pull together the efforts of many decision trees: the teamwork of many trees improves on the performance of any single random tree.

Bagging

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set [math]X = x_1,\ldots,x_n[/math] with responses [math]Y=y_1,\ldots,y_n[/math], bagging repeatedly ([math]B[/math] times) selects a random sample with replacement of the training set and fits trees to these samples:

For [math]b=1,\ldots,B[/math]:

  1. Sample, with replacement, [math]n[/math] training examples from [math]X[/math], [math]Y[/math]; call these [math]X_b[/math], [math]Y_b[/math].
  2. Train a classification or regression tree [math]f_b[/math] on [math]X_b[/math], [math]Y_b[/math].

After training, predictions for unseen samples [math]x'[/math] can be made by averaging the predictions from all the individual regression trees on [math]x'[/math]

[[math]]\hat{f} = \frac{1}{B} \sum_{b=1}^B f_b(x')[[/math]]

or by taking the majority vote in the case of classification trees.


This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.
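
A minimal Python sketch of the bagging procedure above (regression case; scikit-learn decision trees are used only as a convenient base learner, and names such as bagged_trees are illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bagged_trees(X, y, B=100, random_state=0):
        """Fit B regression trees, each on its own bootstrap sample of (X, y)."""
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(random_state)
        n = len(X)
        trees = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)          # sample with replacement
            trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
        return trees

    def bagged_predict(trees, X_new):
        """Average the individual trees' predictions on unseen samples."""
        return np.mean([tree.predict(X_new) for tree in trees], axis=0)

    # Toy usage: each tree overfits its bootstrap sample; the average is smoother.
    X = np.random.rand(200, 3)
    y = X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * np.random.randn(200)
    print(bagged_predict(bagged_trees(X, y), X[:5]))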

Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on [math]x'[/math]:

[[math]]\sigma = \sqrt{\frac{\sum_{b=1}^B (f_b(x') - \hat{f})^2}{B-1} }.[[/math]]

The number of samples/trees, [math]B[/math], is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees [math]B[/math] can be found using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample [math]x_i[/math], using only the trees that did not have [math]x_i[/math] in their bootstrap sample.[19] The training and test error tend to level off after some number of trees have been fit.
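
Both quantities can be read off a fitted forest. The sketch below (scikit-learn, with illustrative parameter values; note that scikit-learn reports the out-of-bag estimate as an R² score rather than as a mean prediction error) computes the per-tree spread around the bagged prediction and the out-of-bag estimate that can guide the choice of [math]B[/math]:

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor

    X, y = load_diabetes(return_X_y=True)
    forest = RandomForestRegressor(n_estimators=500, oob_score=True,
                                   random_state=0).fit(X, y)

    # Out-of-bag estimate: each training point is scored using only the trees
    # that did not see it in their bootstrap sample.
    print("OOB score (R^2):", forest.oob_score_)

    # Prediction uncertainty: standard deviation across the individual trees,
    # matching the formula above (denominator B - 1, hence ddof=1).
    x_new = X[:1]
    per_tree = np.array([tree.predict(x_new)[0] for tree in forest.estimators_])
    print("bagged prediction:", per_tree.mean())
    print("sigma:", per_tree.std(ddof=1))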

From bagging to random forests

The above procedure describes the original bagging algorithm for trees. Random forests also include another type of bagging scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the [math]B[/math] trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho.[20]

Typically, for a classification problem with [math]p[/math] features, [math]\sqrt{p}[/math] (rounded down) features are used in each split.[18]:592 For regression problems the inventors recommend [math]p/3[/math] (rounded down) with a minimum node size of 5 as the default.[18]:592 In practice the best values for these parameters will depend on the problem, and they should be treated as tuning parameters.[18]:592
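
In common implementations these recommendations correspond directly to a tuning parameter; for example, a hedged scikit-learn sketch (max_features controls the size of the random feature subset considered at each split):

    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    # Classification default: consider about sqrt(p) features at each split.
    clf = RandomForestClassifier(max_features="sqrt")

    # Regression: the inventors' suggestion of roughly p/3 features can be
    # expressed as a fraction of the total number of features.
    reg = RandomForestRegressor(max_features=1/3)

    # In practice max_features is a tuning parameter and is best chosen by
    # cross-validation rather than fixed at either default.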

Variable importance

Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper[13] and is implemented in the R package randomForest.[14]

The first step in measuring the variable importance in a data set [math]\mathcal{D}_n =\{(X_i, Y_i)\}_{i=1}^n[/math] is to fit a random forest to the data. During the fitting process the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training).

To measure the importance of the [math]j[/math]-th feature after training, the values of the [math]j[/math]-th feature are permuted among the training data and the out-of-bag error is again computed on this perturbed data set. The importance score for the [math]j[/math]-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.
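
The permute-and-compare recipe can be sketched as follows (a hedged illustration using scikit-learn's permutation_importance, which applies the same idea on a held-out validation set rather than on the out-of-bag samples described above):

    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)

    # Permute each feature in turn and measure the drop in validation score;
    # larger drops mean the forest relied more heavily on that feature.
    result = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=0)
    for j in result.importances_mean.argsort()[::-1]:
        print(f"feature {j}: {result.importances_mean[j]:.3f} "
              f"+/- {result.importances_std[j]:.3f}")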

Features which produce large values for this score are ranked as more important than features which produce small values. The statistical definition of the variable importance measure was given and analyzed by Zhu et al.[21]

This method of determining variable importance has some drawbacks. For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. Methods such as partial permutations[22][23][8] and growing unbiased trees[24][25] can be used to solve the problem. If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.[26]

Boosting

In machine learning, boosting is an ensemble meta-algorithm for primarily reducing bias, and also variance[27] in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones.[28] Boosting is based on the question posed by Kearns and Valiant (1988, 1989):[29][30] "Can a set of weak learners create a single strong learner?" A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.

Robert Schapire's affirmative answer in a 1990 paper[31] to the question of Kearns and Valiant has had significant ramifications in machine learning and statistics, most notably leading to the development of boosting.[32]

When first introduced, the hypothesis boosting problem simply referred to the process of turning a weak learner into a strong learner. "Informally, [the hypothesis boosting] problem asks whether an efficient learning algorithm […] that outputs a hypothesis whose performance is only slightly better than random guessing [i.e. a weak learner] implies the existence of an efficient algorithm that outputs a hypothesis of arbitrary accuracy [i.e. a strong learner]."[29] Algorithms that achieve hypothesis boosting quickly became simply known as "boosting". Freund and Schapire's arcing (Adapt[at]ive Resampling and Combining),[33] as a general technique, is more or less synonymous with boosting.[34]

Boosting algorithms

While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are weighted in a way that is related to the weak learners' accuracy. After a weak learner is added, the data weights are readjusted, known as "re-weighting". Misclassified input data gain a higher weight and examples that are classified correctly lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified.
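
The re-weighting loop can be made concrete with a minimal sketch of the classic binary AdaBoost update (labels in {-1, +1}, decision stumps as the weak learners; this is an illustration of the general scheme rather than a production implementation):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, rounds=50):
        """y must take values in {-1, +1}; returns the stumps and their weights."""
        X, y = np.asarray(X), np.asarray(y)
        n = len(X)
        w = np.full(n, 1.0 / n)                      # start with uniform weights
        stumps, alphas = [], []
        for _ in range(rounds):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = np.clip(w[pred != y].sum() / w.sum(), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)    # weight of this weak learner
            # Re-weighting: misclassified points gain weight, correct ones lose it.
            w *= np.exp(-alpha * y * pred)
            w /= w.sum()
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def boosted_predict(stumps, alphas, X_new):
        """Weighted vote of the weak learners."""
        return np.sign(sum(a * s.predict(X_new) for a, s in zip(alphas, stumps)))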

An illustration presenting the intuition behind the boosting algorithm, consisting of the parallel learners and weighted dataset.

There are many boosting algorithms. The original ones, proposed by Robert Schapire (a recursive majority gate formulation)[31] and Yoav Freund (boost by majority),[35] were not adaptive and could not take full advantage of the weak learners. Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious Gödel Prize.

The main variation between many boosting algorithms is their method of weighting training data points and hypotheses. AdaBoost is very popular and the most significant historically as it was the first algorithm that could adapt to the weak learners. It is often the basis of introductory coverage of boosting in university machine learning courses.[36] There are many more recent algorithms such as LPBoost, TotalBoost, BrownBoost, xgboost, MadaBoost, LogitBoost, and others. Many boosting algorithms fit into the AnyBoost framework,[35] which shows that boosting performs gradient descent in a function space using a convex cost function.

References

  1. Aslam, Javed A.; Popa, Raluca A.; and Rivest, Ronald L. (2007); On Estimating the Size and Confidence of a Statistical Audit, Proceedings of the Electronic Voting Technology Workshop (EVT '07), Boston, MA, August 6, 2007. More generally, when drawing with replacement [math]n[/math] values out of a set of [math]n[/math] (different and equally likely), the expected number of unique draws is [math]n(1 - e^{-n'/n})[/math].
  2. Breiman, Leo (1996). "Bagging predictors". Machine Learning 24 (2): 123–140. doi:10.1007/BF00058655.
  3. Breiman, Leo (September 1994). "Bagging Predictors". Technical Report. Department of Statistics, University of California Berkeley. 
  4. Sahu, A., Runger, G., Apley, D., Image denoising with a multi-phase kernel principal component approach and an ensemble version, IEEE Applied Imagery Pattern Recognition Workshop, pp.1-7, 2011.
  5. Shinde, Amit, Anshuman Sahu, Daniel Apley, and George Runger. "Preimages for Variation Patterns from Kernel PCA and Bagging." IIE Transactions, Vol.46, Iss.5, 2014
  6. Ho, Tin Kam (1995). Random Decision Forests (PDF). Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. pp. 278–282. Archived from the original (PDF) on 17 April 2016. Retrieved 5 June 2016.
  7. Ho, Tin Kam (1998). "The Random Subspace Method for Constructing Decision Forests". IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8): 832–844. doi:10.1109/34.709601.
  8. "Role of Data Analytics in Infrastructure Asset Management: Overcoming Data Size and Quality Problems" (2020-06-01). Journal of Transportation Engineering, Part B: Pavements 146 (2): 04020022. doi:10.1061/JPEODX.0000175.
  9. "Using Machine Learning to Examine Impact of Type of Performance Indicator on Flexible Pavement Deterioration Modeling" (in en) (2021-02-01). Journal of Infrastructure Systems 27 (2): 04021005. doi:10.1061/(ASCE)IS.1943-555X.0000602. ISSN 1076-0342. 
  10. Kleinberg, Eugene (1990). "Stochastic Discrimination". Annals of Mathematics and Artificial Intelligence 1 (1–4): 207–239. doi:10.1007/BF01531079. 
  11. Kleinberg, Eugene (1996). "An Overtraining-Resistant Stochastic Modeling Method for Pattern Recognition". Annals of Statistics 24 (6): 2319–2349. doi:10.1214/aos/1032181157. 
  12. Kleinberg, Eugene (2000). "On the Algorithmic Implementation of Stochastic Discrimination". IEEE Transactions on PAMI 22 (5): 473–490. doi:10.1109/34.857004. 
  13. Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1): 5–32. doi:10.1023/A:1010933404324. Bibcode: 2001MachL..45....5B.
  14. Liaw A (16 October 2012). "Documentation for R package randomForest" (PDF). Retrieved 15 March 2013.
  15. U.S. trademark registration number 3185828, registered 2006/12/19.
  16. "RANDOM FORESTS Trademark of Health Care Productivity, Inc. - Registration Number 3185828 - Serial Number 78642027 :: Justia Trademarks".
  17. "Shape quantization and recognition with randomized trees" (1997). Neural Computation 9 (7): 1545–1588. doi:10.1162/neco.1997.9.7.1545. 
  18. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2008). The Elements of Statistical Learning (2nd ed.). Springer. ISBN 0-387-95284-5.
  19. Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning. Springer. pp. 316–321.
  20. Ho, Tin Kam (2002). "A Data Complexity Analysis of Comparative Advantages of Decision Forest Constructors". Pattern Analysis and Applications 5 (2): 102–112. doi:10.1007/s100440200009. 
  21. "Reinforcement Learning Trees" (2015). Journal of the American Statistical Association 110 (512): 1770–1784. doi:10.1080/01621459.2015.1036994. PMID 26903687. 
  22. Deng, H.; Runger, G.; Tuv, E. (2011). Bias of importance measures for multi-valued attributes and solutions. Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN). pp. 293–300.
  23. "Permutation importance: a corrected feature importance measure" (May 2010). Bioinformatics 26 (10): 1340–7. doi:10.1093/bioinformatics/btq134. PMID 20385727. 
  24. "Unbiased split selection for classification trees based on the Gini index" (2007). Computational Statistics & Data Analysis 52: 483–501. doi:10.1016/j.csda.2006.12.030. 
  25. "Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance" (2017). IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (11): 2142–2153. doi:10.1109/tpami.2016.2636831. PMID 28114007. 
  26. "Classification with correlated features: unreliability of feature ranking and solutions" (July 2011). Bioinformatics 27 (14): 1986–94. doi:10.1093/bioinformatics/btr300. PMID 21576180. 
  27. Leo Breiman (1996). "BIAS, VARIANCE, AND ARCING CLASSIFIERS" (PDF). Technical Report. Archived from the original (PDF) on 2015-01-19. Retrieved 19 January 2015. "Arcing [Boosting] is more successful than bagging in variance reduction."
  28. Zhou Zhi-Hua (2012). Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC. p. 23. ISBN 978-1439830031. "The term boosting refers to a family of algorithms that are able to convert weak learners to strong learners."
  29. Michael Kearns (1988); Thoughts on Hypothesis Boosting, Unpublished manuscript (Machine Learning class project, December 1988)
  30. Michael Kearns; Leslie Valiant (1989). Cryptographic limitations on learning Boolean formulae and finite automata. Symposium on Theory of Computing. 21. ACM. pp. 433–444. doi:10.1145/73007.73049. ISBN 978-0897913072. S2CID 536357.
  31. Schapire, Robert E. (1990). "The Strength of Weak Learnability". Machine Learning 5 (2): 197–227. doi:10.1007/bf00116037.
  32. Leo Breiman (1998). "Arcing classifier (with discussion and a rejoinder by the author)". Ann. Stat. 26 (3): 801–849. doi:10.1214/aos/1024691079. “Schapire (1990) proved that boosting is possible. (Page 823)” 
  33. Yoav Freund and Robert E. Schapire (1997); A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, Journal of Computer and System Sciences, 55(1):119-139
  34. Leo Breiman (1998); Arcing Classifier (with Discussion and a Rejoinder by the Author), Annals of Statistics, vol. 26, no. 3, pp. 801-849: "The concept of weak learning was introduced by Kearns and Valiant (1988, 1989), who left open the question of whether weak and strong learnability are equivalent. The question was termed the boosting problem since [a solution must] boost the low accuracy of a weak learner to the high accuracy of a strong learner. Schapire (1990) proved that boosting is possible. A boosting algorithm is a method that takes a weak learner and converts it into a strong learner. Freund and Schapire (1997) proved that an algorithm similar to arc-fs is boosting."
  35. Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean (2000); Boosting Algorithms as Gradient Descent, in S. A. Solla, T. K. Leen, and K.-R. Muller, editors, Advances in Neural Information Processing Systems 12, pp. 512-518, MIT Press
  36. Emer, Eric. "Boosting (AdaBoost algorithm)" (PDF). MIT. Retrieved 2018-10-10.

Further Reading

Wikipedia References