
Chapter 12


Ensemble learning
- 12.1 Combining multiple models: the basic idea
- 12.2 Bagging: bias-variance decomposition, bagging with costs
- 12.4 Boosting: AdaBoost, the power of boosting
- 12.7 Stacking
2
Combining multiple models
- Basic idea: build different “experts”, let them vote
- Advantage: often improves predictive performance
- Disadvantage: usually produces output that is very hard to analyze
  - but: there are approaches that aim to produce a single comprehensible structure
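As a concrete illustration of the basic idea (not part of the original slides; the dataset and the three “experts” are arbitrary choices for this sketch), a majority-vote ensemble can be built in a few lines, e.g. with scikit-learn:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

X, y = load_breast_cancer(return_X_y=True)   # toy dataset, illustrative only

# Three different "experts"; with hard voting each model gets one equal vote.
ensemble = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("nb", GaussianNB()),
                ("knn", KNeighborsClassifier())],
    voting="hard",
)
print("ensemble accuracy:", cross_val_score(ensemble, X, y, cv=10).mean())
```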
3
Bagging
- Combining predictions by voting/averaging
- Each model receives equal weight
- “Idealized” version:
  - Sample several training sets of size n (instead of just having one training set of size n)
  - Build a classifier for each training set
  - Combine the classifiers’ predictions
- Learning scheme is unstable → almost always improves performance
  - Unstable learner: small change in training data can make a big change in the model (e.g., when learning decision trees)
4
Bias-variance decomposition
- The bias-variance decomposition is used to analyze how much restriction to a single training set affects performance
- Assume we have the idealized ensemble classifier discussed on the previous slide
- We can decompose the expected error of any individual ensemble member as follows:
  - Bias = expected error of the ensemble classifier on new data
  - Variance = component of the expected error due to the particular training set being used to build our classifier
  - Total expected error = bias + variance
- Note (A): we assume noise inherent in the data is part of the bias component, as it cannot normally be measured
- Note (B): multiple versions of this decomposition exist for zero-one loss, but the basic idea is always the same
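For reference, a minimal sketch of the decomposition in the squared-error case, with noise folded into the bias term as in Note (A); the notation (E_D for the expectation over training sets D, f_D for the model learned from D, f̄ for the idealized ensemble prediction) is chosen for this sketch:

```latex
% Bias-variance decomposition for squared error, noise folded into bias as in Note (A).
% \bar{f}(x) = \mathbb{E}_D[f_D(x)] is the prediction of the idealized ensemble.
\mathbb{E}_D\big[(y - f_D(x))^2\big]
  = \underbrace{\big(y - \bar{f}(x)\big)^2}_{\text{bias (incl. noise)}}
  + \underbrace{\mathbb{E}_D\big[\big(f_D(x) - \bar{f}(x)\big)^2\big]}_{\text{variance}}
```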
5
More on bagging
- The idealized version of bagging improves performance because it eliminates the variance component of the error
  - Note: in some pathological hypothetical situations the overall error may increase when zero-one loss is used (i.e., there is negative “variance”)
  - The bias-variance decomposition was originally only known for numeric prediction with squared error, where the error never increases
- Problem: we only have one dataset!
- Solution: generate new datasets of size n by sampling from the original dataset with replacement
- This is what bagging does, and even though the datasets are all dependent, bagging often reduces variance and, thus, error
- Can be applied to numeric prediction and classification
- Can help a lot if the data is noisy
- Usually, the more classifiers the better, with diminishing returns
6
Bagging classifiers
- Let n be the number of instances in the training data
- For each of t iterations:
  - Sample n instances from the training set (with replacement)
  - Apply the learning algorithm to the sample
  - Store the resulting model
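A minimal runnable sketch of this procedure (the dataset, base learner, and number of iterations t are placeholder choices for this sketch):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data and settings (placeholder choices).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
n, t = len(X), 25
rng = np.random.default_rng(0)

# Model generation: t bootstrap samples of size n, one (unpruned) tree per sample.
models = []
for _ in range(t):
    idx = rng.integers(0, n, size=n)   # sample n instances with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Classification: each model votes with equal weight; the majority class wins.
votes = np.array([m.predict(X) for m in models])              # shape (t, n)
majority = np.array([np.bincount(col).argmax() for col in votes.T])
print("training accuracy of bagged ensemble:", (majority == y).mean())
```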
7
Bagging with costs
- Bagging unpruned decision trees is known to produce good probability estimates
  - Here, instead of voting, the individual classifiers' probability estimates are averaged
  - Note: this can also improve the zero-one loss
- Can use this with the minimum-expected-cost approach for learning problems with costs
  - Note that the minimum-expected-cost approach requires accurate probabilities to work well
- Problem: the ensemble classifier is not interpretable
  - MetaCost re-labels the training data using bagging with costs and then builds a single tree from this data
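A minimal sketch of the minimum-expected-cost step, assuming class probabilities have already been averaged over the bagged ensemble; the probability values and the cost matrix are purely illustrative:

```python
import numpy as np

# Averaged class probabilities from the bagged ensemble, one row per instance
# (hand-made values for two instances and two classes, illustrative only).
avg_probs = np.array([[0.7, 0.3],
                      [0.4, 0.6]])

# cost[i, j] = cost of predicting class j when the true class is i
# (illustrative: missing true class 1 is ten times as expensive as the reverse).
cost = np.array([[0.0, 1.0],
                 [10.0, 0.0]])

# Expected cost of each possible prediction: sum over true classes of P(class) * cost.
expected_cost = avg_probs @ cost             # shape (n_instances, n_classes)
prediction = expected_cost.argmin(axis=1)    # choose the prediction with minimum expected cost
print(prediction)   # [1 1]: class 1 is chosen even where its probability is only 0.3,
                    # because errors on true class 1 are costly
```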
8
Boosting
- Bagging can easily be parallelized because ensemble members are created independently
- Boosting is an alternative approach
  - Also uses voting/averaging
  - But: weights models according to performance
- Iterative: new models are influenced by the performance of previously built ones
  - Encourage the new model to become an “expert” for instances misclassified by earlier models
  - Intuitive justification: models should be experts that complement each other
- Many variants of boosting exist; we cover a couple
9
Boosting using AdaBoost.M1
- Assign equal weight to each training instance
- For t iterations:
  - Apply the learning algorithm to the weighted dataset, store the resulting model
  - Compute the model’s error e on the weighted dataset
  - If e = 0 or e ≥ 0.5: terminate model generation
  - For each instance in the dataset:
    - If classified correctly by the model: multiply the instance’s weight by e/(1−e)
  - Normalize the weights of all instances
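A minimal runnable sketch of this model-generation loop (dataset, base learner, and number of iterations are placeholder choices; the vote weight log((1−e)/e) used at prediction time is the standard AdaBoost.M1 choice, not shown in the pseudocode above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)  # toy data
n, t = len(X), 20
w = np.full(n, 1.0 / n)                        # equal weight for each training instance
models, alphas = [], []

for _ in range(t):
    stump = DecisionTreeClassifier(max_depth=1)        # weak learner (decision stump)
    stump.fit(X, y, sample_weight=w)                   # apply learner to weighted dataset
    pred = stump.predict(X)
    e = w[pred != y].sum()                             # error on the weighted dataset
    if e == 0 or e >= 0.5:                             # terminate model generation
        break
    models.append(stump)
    alphas.append(np.log((1 - e) / e))                 # vote weight used at prediction time
    w[pred == y] *= e / (1 - e)                        # down-weight correctly classified instances
    w /= w.sum()                                       # normalize weights

# Classification: weighted vote of the stored models.
def classify(X_new):
    classes = models[0].classes_
    scores = np.zeros((len(X_new), len(classes)))
    for m, a in zip(models, alphas):
        scores[np.arange(len(X_new)), np.searchsorted(classes, m.predict(X_new))] += a
    return classes[scores.argmax(axis=1)]

print("training accuracy:", (classify(X) == y).mean())
```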
10
Comments on AdaBoost.M1
- Boosting needs weights … but
  - can adapt the learning algorithm … or
  - can apply boosting without weights:
    - Resample the data with probability determined by the weights
    - Disadvantage: not all instances are used
    - Advantage: if error > 0.5, can resample again
- The AdaBoost.M1 boosting algorithm stems from work in computational learning theory
- Theoretical result: training error decreases exponentially as iterations are performed
- Other theoretical results: works well if the base classifiers are not too complex and their error does not become too large too quickly as more iterations are performed
11
More comments on boosting
- Continue boosting after training error = 0?
- Puzzling fact: generalization error continues to decrease!
  - Seems to contradict Occam’s Razor
- Possible explanation: consider the margin (confidence), not just the error
  - A possible definition of margin: difference between the estimated probability for the true class and the nearest other class (between −1 and 1)
  - The margin continues to increase with more iterations
- AdaBoost.M1 works well with so-called weak learners; only condition: error does not exceed 0.5
  - Example of a weak learner: decision stump
- In practice, boosting sometimes overfits if too many iterations are performed (in contrast to bagging)
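A minimal sketch of the margin definition given above, computed from an ensemble’s estimated class probabilities (the probability values and class labels are illustrative only):

```python
import numpy as np

# Estimated class probabilities for three instances over three classes (illustrative).
probs = np.array([[0.6, 0.3, 0.1],
                  [0.4, 0.5, 0.1],
                  [0.2, 0.2, 0.6]])
true_class = np.array([0, 0, 2])

p_true = probs[np.arange(len(probs)), true_class]     # probability of the true class
others = probs.copy()
others[np.arange(len(probs)), true_class] = -np.inf   # mask out the true class
margin = p_true - others.max(axis=1)                  # margin in [-1, 1]
print(margin)   # [ 0.3 -0.1  0.4]; a negative margin means the instance is misclassified
```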
12
Stacking
- Question: how to build a heterogeneous ensemble consisting of different types of models (e.g., decision tree and neural network)?
- Problem: models can be vastly different in accuracy
- Idea: to combine the predictions of the base learners, do not just vote; instead, use a meta learner
  - In stacking, the base learners are also called level-0 models
  - The meta learner is called the level-1 model
  - Predictions of the base learners are input to the meta learner
- Base learners are usually different learning schemes
- Caveat: cannot use predictions on the training data to generate data for the level-1 model!
  - Instead use a scheme based on cross-validation
13
Generating the level-1 training data
- Training data for the level-1 model contains the predictions of the level-0 models as attributes; the class attribute remains the same
- Problem: we cannot use the level-0 models’ predictions on their training data to obtain attribute values for the level-1 data
  - Assume we have a perfect rote learner as one of the level-0 learners
  - Then, the level-1 learner will learn to simply predict this level-0 learner’s predictions, rendering the ensemble pointless
- To solve this, we generate the level-1 training data by running a cross-validation for each of the level-0 algorithms
  - Then, the predictions (and actual class values) obtained for the test instances encountered during the cross-validation are collected
  - This pooled data obtained from the cross-validation for each level-0 model is used to train the level-1 model
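A minimal runnable sketch of this scheme (base learners, meta learner, and dataset are illustrative choices; scikit-learn’s StackingClassifier packages the same idea):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
level0 = [DecisionTreeClassifier(random_state=0), GaussianNB(), KNeighborsClassifier()]

# Level-1 training data: cross-validated predictions of each level-0 model,
# so no model contributes predictions on instances it was trained on.
level1_X = np.column_stack([cross_val_predict(m, X, y, cv=10) for m in level0])

# Train the level-1 (meta) model on the pooled cross-validation predictions.
level1 = LogisticRegression().fit(level1_X, y)

# The level-0 models themselves are finally re-trained on the full training data.
level0 = [m.fit(X, y) for m in level0]

# Prediction: level-0 predictions become the attributes fed to the level-1 model.
def stacked_predict(X_new):
    attrs = np.column_stack([m.predict(X_new) for m in level0])
    return level1.predict(attrs)

print("training accuracy:", (stacked_predict(X) == y).mean())
```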
14
More on stacking
- Stacking is hard to analyze theoretically: “black magic”
- If the base learners can output class probabilities, use those as input to the meta learner instead of plain classifications
  - Makes more information available to the level-1 learner
- Important question: which algorithm to use as the meta learner (aka level-1 learner)?
  - In principle, any learning scheme
  - In practice, prefer “relatively global, smooth” models, because the base learners do most of the work and this reduces the risk of overfitting
- Note that stacking can be trivially applied to numeric prediction too
15
