12.1 Combining multiple models: The basic idea
12.2 Bagging: Bias-variance decomposition, bagging with costs
12.4 Boosting: AdaBoost, the power of boosting
12.7 Stacking
Combining multiple models
- Basic idea: build different "experts", let them vote
- Advantage: often improves predictive performance
- Disadvantage: usually produces output that is very hard to analyze
  - But: there are approaches that aim to produce a single comprehensible structure
Combining predictions by voting/averaging
- Each model receives equal weight
- "Idealized" version:
  - Sample several training sets of size n (instead of just having one training set of size n)
  - Build a classifier for each training set
  - Combine the classifiers' predictions
- If the learning scheme is unstable, this almost always improves performance
  - Unstable learner: a small change in the training data can make a big change in the model (e.g., when learning decision trees)
Bias-variance decomposition
- The bias-variance decomposition is used to analyze how much the restriction to a single training set affects performance
- Assume we have the idealized ensemble classifier discussed on the previous slide
- We can decompose the expected error of any individual ensemble member as follows:
  - Bias = expected error of the ensemble classifier on new data
  - Variance = component of the expected error due to the particular training set being used to build our classifier
  - Total expected error = bias + variance
- Note (A): we assume noise inherent in the data is part of the bias component, as it cannot normally be measured
- Note (B): multiple versions of this decomposition exist for zero-one loss, but the basic idea is always the same
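The squared-error version of the decomposition (the one mentioned in Note (B) as the original setting) can be checked numerically. A minimal sketch, assuming a deliberately simple hypothetical "learner" that predicts the mean of its training sample; the true value, noise level, and sample sizes are all illustrative:

```python
import random

random.seed(1)

def train(sample):
    # toy "model": predict the mean of the training sample
    return sum(sample) / len(sample)

true_value = 5.0
predictions = []
for _ in range(2000):
    # each run draws its own training set, as in the idealized setting
    sample = [true_value + random.gauss(0, 2) for _ in range(10)]
    predictions.append(train(sample))

mean_pred = sum(predictions) / len(predictions)
bias_sq = (mean_pred - true_value) ** 2
variance = sum((p - mean_pred) ** 2 for p in predictions) / len(predictions)
total = sum((p - true_value) ** 2 for p in predictions) / len(predictions)
# for squared error the identity total = bias^2 + variance holds exactly
print(bias_sq, variance, total)
```

The simulation shows the variance term dominating: the bias of the mean estimator is near zero, so almost all of the expected squared error comes from the particular training set drawn.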
More on bagging
- The idealized version of bagging improves performance because it eliminates the variance component of the error
  - Note: in some pathological hypothetical situations the overall error may increase when zero-one loss is used (i.e., there is negative "variance")
  - The bias-variance decomposition was originally only known for numeric prediction with squared error, where the error never increases
- Problem: we only have one dataset!
- Solution: generate new datasets of size n by sampling from the original dataset with replacement
- This is what bagging does, and even though the datasets are all dependent, bagging often reduces variance and, thus, error
- Can be applied to numeric prediction and classification
- Can help a lot if the data is noisy
- Usually, the more classifiers the better, with diminishing returns
Model generation:
- Let n be the number of instances in the training data
- For each of t iterations:
  - Sample n instances from the training set (with replacement)
  - Apply the learning algorithm to the sample
  - Store the resulting model
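The model-generation loop above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own implementation: the 1-D decision-stump base learner and the toy dataset are hypothetical stand-ins for a real learning scheme:

```python
import random

random.seed(42)

def train_stump(sample):
    # pick the threshold/orientation with the lowest error on this sample
    best = (None, 1.0, 0)  # (threshold, error rate, label for low side)
    for x, _ in sample:
        for low in (0, 1):
            errs = sum(y != (low if xi <= x else 1 - low) for xi, y in sample)
            rate = errs / len(sample)
            if rate < best[1]:
                best = (x, rate, low)
    t, _, low = best
    return lambda xi: low if xi <= t else 1 - low

def bagging(data, t=25):
    n = len(data)
    models = []
    for _ in range(t):
        # sample n instances with replacement, train, store the model
        sample = [random.choice(data) for _ in range(n)]
        models.append(train_stump(sample))
    def predict(x):
        # classification stage: equal-weight vote over all stored models
        votes = sum(m(x) for m in models)
        return 1 if votes * 2 >= t else 0
    return predict

data = [(x / 10, 0) for x in range(10)] + [(1 + x / 10, 1) for x in range(10)]
clf = bagging(data)
print(clf(0.3), clf(1.7))
```

Each bootstrap sample yields a slightly different stump; the equal-weight vote at prediction time is what smooths away the variance of any single stump.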
Bagging with costs
- Bagging unpruned decision trees is known to produce good probability estimates
  - Here, instead of voting, the individual classifiers' probability estimates are averaged
  - Note: this can also improve the zero-one loss
- Can use this with the minimum-expected-cost approach for learning problems with costs
  - Note that the minimum-expected-cost approach requires accurate probabilities to work well
- Problem: the ensemble classifier is not interpretable
  - MetaCost re-labels the training data using bagging with costs and then builds a single tree from this data
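The minimum-expected-cost decision itself is simple once averaged probability estimates are available. A sketch with hypothetical probabilities and a hypothetical cost matrix (the class names and cost values are made up for illustration):

```python
# averaged probability estimates from the bagged trees for one instance
avg_probs = {"pos": 0.3, "neg": 0.7}

# cost[predicted][actual]: missing a true "pos" is made 10x as costly
cost = {"pos": {"pos": 0.0, "neg": 1.0},
        "neg": {"pos": 10.0, "neg": 0.0}}

def min_expected_cost(probs, cost):
    # choose the prediction whose expected cost under probs is smallest
    def expected(pred):
        return sum(probs[actual] * cost[pred][actual] for actual in probs)
    return min(probs, key=expected)

print(min_expected_cost(avg_probs, cost))
```

Here plain voting would pick "neg" (probability 0.7), but the expected cost of predicting "neg" is 0.3 * 10 = 3.0 versus 0.7 * 1 = 0.7 for "pos", so the cost-sensitive decision flips — which is exactly why the probability estimates feeding this rule need to be accurate.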
Boosting
- Bagging can easily be parallelized because ensemble members are created independently
- Boosting is an alternative approach
  - Also uses voting/averaging
  - But: weights models according to performance
- Iterative: new models are influenced by the performance of previously built ones
  - Encourage the new model to become an "expert" for instances misclassified by earlier models
  - Intuitive justification: models should be experts that complement each other
- Many variants of boosting exist; we cover a couple
Boosting using AdaBoost.M1
Model generation:
- Assign equal weight to each training instance
- For t iterations:
  - Apply the learning algorithm to the weighted dataset, store the resulting model
  - Compute the model's error e on the weighted dataset
  - If e = 0 or e >= 0.5: terminate model generation
  - For each instance in the dataset:
    - If classified correctly by the model: multiply the instance's weight by e/(1 - e)
  - Normalize the weights of all instances
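One round of the weight update above can be sketched as follows. The toy dataset and the threshold "model" are hypothetical; the voting weight log((1 - e)/e) computed at the end is the one AdaBoost.M1 uses at classification time:

```python
import math

def adaboost_round(data, weights, model):
    # error of the model on the weighted dataset
    e = sum(w for (x, y), w in zip(data, weights) if model(x) != y) / sum(weights)
    if e == 0 or e >= 0.5:
        return None, None  # terminate model generation
    # correctly classified instances are down-weighted by e/(1 - e)
    new = [w * (e / (1 - e)) if model(x) == y else w
           for (x, y), w in zip(data, weights)]
    total = sum(new)
    new = [w / total for w in new]  # normalize all weights
    alpha = math.log((1 - e) / e)  # model's voting weight for prediction
    return new, alpha

data = [(0, 0), (1, 0), (2, 1), (3, 1)]
weights = [0.25] * 4
model = lambda x: 0 if x <= 0 else 1   # misclassifies only x = 1
new_weights, alpha = adaboost_round(data, weights, model)
print(new_weights, alpha)
```

With e = 0.25, the three correct instances shrink by a factor of 1/3 and, after normalization, the single misclassified instance carries half of the total weight — the next model is pushed to become an "expert" on exactly that instance.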
Comments on AdaBoost.M1
- Boosting needs weights ... but
  - Can adapt the learning algorithm, or
  - Can apply boosting without weights: resample the data with probability determined by the weights
    - Disadvantage: not all instances are used
    - Advantage: if error > 0.5, can resample again
- The AdaBoost.M1 boosting algorithm stems from work in computational learning theory
- Theoretical result: training error decreases exponentially as iterations are performed
- Other theoretical results: works well if the base classifiers are not too complex and their error does not become too large too quickly as more iterations are performed
More comments on boosting
- Continue boosting after training error = 0?
- Puzzling fact: generalization error continues to decrease!
  - Seems to contradict Occam's Razor
- Possible explanation: consider the margin (confidence), not just the error
  - A possible definition of margin: the difference between the estimated probability for the true class and that of the nearest other class (between -1 and 1)
  - The margin continues to increase with more iterations
- AdaBoost.M1 works well with so-called weak learners; the only condition: error does not exceed 0.5
  - Example of a weak learner: the decision stump
- In practice, boosting sometimes overfits if too many iterations are performed (in contrast to bagging)
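The margin definition above is a one-liner over the ensemble's probability estimates; the class names and probability values here are hypothetical:

```python
# hypothetical ensemble probability estimates for one instance
probs = {"a": 0.7, "b": 0.2, "c": 0.1}
true_class = "a"

# margin: estimated probability of the true class minus that of the
# nearest other class; ranges from -1 (confidently wrong) to 1
margin = probs[true_class] - max(p for c, p in probs.items() if c != true_class)
print(margin)
```

A negative margin means the instance is misclassified; further boosting iterations can keep pushing margins like this one toward 1 even after the zero-one training error has already reached 0.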
Stacking
- Question: how to build a heterogeneous ensemble consisting of different types of models (e.g., a decision tree and a neural network)?
- Problem: the models can be vastly different in accuracy
- Idea: to combine the predictions of the base learners, do not just vote; instead, use a metalearner
  - In stacking, the base learners are also called level-0 models
  - The metalearner is called the level-1 model
- Predictions of the base learners are the input to the metalearner
- Base learners are usually different learning schemes
- Caveat: cannot use predictions on the training data to generate data for the level-1 model!
  - Instead, use a scheme based on cross-validation
Generating the level-1 training data
- The training data for the level-1 model contains the predictions of the level-0 models as attributes; the class attribute remains the same
- Problem: we cannot use the level-0 models' predictions on their own training data to obtain attribute values for the level-1 data
  - Assume we have a perfect rote learner as one of the level-0 learners
  - Then the level-1 learner will learn to simply predict this level-0 learner's predictions, rendering the ensemble pointless
- To solve this, we generate the level-1 training data by running a cross-validation for each of the level-0 algorithms
  - Then the predictions (and actual class values) obtained for the test instances encountered during the cross-validation are collected
  - This pooled data obtained from the cross-validation for each level-0 model is used to train the level-1 model
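The cross-validation scheme above can be sketched as follows. The majority-class learner and the toy dataset are hypothetical stand-ins for real level-0 schemes; with several level-0 models, each would contribute one attribute column via the same routine:

```python
import random

random.seed(0)

def cross_val_predictions(train_fn, data, k=5):
    # prediction for each instance comes from a model that never saw it
    idx = list(range(len(data)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [None] * len(data)
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]
        model = train_fn(train)
        for i in fold:
            preds[i] = model(data[i][0])
    return preds

# hypothetical level-0 learner: always predicts its training majority class
def majority_learner(train):
    ones = sum(y for _, y in train)
    label = 1 if ones * 2 >= len(train) else 0
    return lambda x: label

data = [(i, i % 2) for i in range(10)]
level0_preds = cross_val_predictions(majority_learner, data)
# level-1 training data: one attribute per level-0 model, plus the true class
level1_data = [(p, y) for p, (_, y) in zip(level0_preds, data)]
print(level1_data)
```

Pooling the held-out predictions gives the level-1 learner an honest picture of how each level-0 model behaves on unseen data, which is exactly what the rote-learner argument shows is needed.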
More on stacking
- Stacking is hard to analyze theoretically: "black magic"
- If the base learners can output class probabilities, use those as input to the metalearner instead of plain classifications
  - Makes more information available to the level-1 learner
- Important question: which algorithm to use as the metalearner (aka level-1 learner)?
  - In principle, any learning scheme
  - In practice, prefer "relatively global, smooth" models, because the base learners do most of the work and this reduces the risk of overfitting
- Note that stacking can be trivially applied to numeric prediction too