Ensemble Learning Explained! Part 2

Vignesh Madanan
May 17, 2019

This is a two-part article on Ensemble Learning. The first part is linked here. This article focuses on the algorithms based on Bagging and Boosting.

Algorithms based on Bagging

Bagging, or Bootstrap Aggregation, is an ensemble technique whose goal is to reduce the variance of decision trees. The idea is to create multiple random subsets from the original data and train a decision tree on each of them. As a result, we get multiple decision models, and the average of their predictions is used, which is more robust than a single decision tree.
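To make the idea concrete, here is a minimal sketch of bagging done by hand; the toy dataset from make_classification and the numbers of trees and samples are illustrative assumptions, not part of the original article.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.RandomState(0)

trees = []
for _ in range(25):
    # Bootstrap: sample rows with replacement to build each random subset
    idx = rng.randint(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate: average the per-tree predictions and round to a class label
avg = np.mean([t.predict(X) for t in trees], axis=0)
print("training accuracy:", ((avg >= 0.5).astype(int) == y).mean())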

Under the bagging approach, we have two techniques:

  1. Bagging Meta-estimator
  2. Random Forest

Bagging Meta-Estimator

The bagging meta-estimator can be used for both classification and regression problems. A typical bagging workflow follows these steps:

  1. Bootstrapping (creating random subsets).
  2. Subsets include all the features.
  3. A custom base estimator is fitted on each of the smaller sets.
  4. Predictions from each model are combined to get the final result.

The scikit-learn signatures for the Bagging Classifier and Regressor are:

BaggingClassifier(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)

BaggingRegressor(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)

Here the important parameters are:

  1. base_estimator: It defines the base estimator to fit on random subsets of the dataset.
  2. n_estimators: The number of base estimators in the ensemble.
  3. max_samples: The number of samples to draw from X to train each base estimator.
  4. max_features: The number of features to draw from X to train each base estimator.
  5. n_jobs: The number of jobs to run in parallel.
  6. random_state: Seed for the random number generator, for reproducible results.
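A short usage sketch of BaggingClassifier; the toy dataset from make_classification and the parameter values are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 50 decision trees, each trained on an 80% bootstrap sample of the rows
bag = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                        n_estimators=50, max_samples=0.8,
                        bootstrap=True, n_jobs=-1, random_state=42)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))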

Random Forest

Random Forest is an extension of the bagging technique described above. The base estimators in a random forest are decision trees, and at each node the tree selects the best split from a random subset of the features.

  1. Random subsets are created from the original dataset (bootstrapping).
  2. At each node in the decision tree, only a random subset of features is considered to decide the best split.
  3. A decision tree model is fitted on each of the subsets.
  4. Predictions from each model are combined to get the final result.
RandomForestClassifier(n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)

Here the important parameters are:

  1. n_estimators: It defines the number of decision trees to be created in a random forest. Generally, a higher number makes the predictions stronger and more stable, but a very large number can result in higher training time.
  2. criterion: the function to be used for splitting (default=gini).
  3. max_features: The maximum number of features to consider when searching for the best split at each node.
  4. max_depth: The maximum depth of the decision trees.
  5. min_samples_split: The minimum number of samples a node must contain before a split is attempted. If the node has fewer samples, it is not split.
  6. min_samples_leaf: The minimum number of samples required to be at a leaf node. Smaller leaves make the model more prone to capturing noise in the training data.
  7. n_jobs: The number of jobs to run in parallel.
  8. random_state: Seed for the random number generator, for reproducible results.
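A short usage sketch of RandomForestClassifier; the dataset and hyperparameter values below are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 trees; each split considers a random subset of features ('sqrt')
rf = RandomForestClassifier(n_estimators=100, criterion='gini',
                            max_depth=10, max_features='sqrt',
                            min_samples_leaf=2, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))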

Algorithms based on Boosting

Boosting is an ensemble technique in which learners are trained sequentially: early learners fit simple models to the data, and each subsequent learner analyses the remaining errors. In other words, we fit consecutive trees (each on a random sample), and at every step the goal is to reduce the net error left by the prior trees.
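The "fit the next tree to the previous trees' error" idea can be sketched in a few lines. This residual-fitting loop is a simplified illustration under squared-error loss with a fixed learning rate, using a toy dataset, and is not any library's exact implementation.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

learning_rate = 0.1
prediction = np.zeros(len(y))   # start from a trivial model
trees = []

for _ in range(100):
    residual = y - prediction                       # error left so far
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    prediction += learning_rate * tree.predict(X)   # correct part of the error
    trees.append(tree)

print("training MSE after boosting:", np.mean((y - prediction) ** 2))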

Some algorithms are:

  1. AdaBoost
  2. GBM
  3. XGBoost

There are a few other boosting algorithms, such as CatBoost and LightGBM.

AdaBoost (Adaptive Boosting)

In AdaBoost, multiple sequential models are created, each correcting the errors of the last. AdaBoost assigns higher weights to the observations that were predicted incorrectly, and the subsequent model works to predict these values correctly. The steps in a typical AdaBoost algorithm are as follows:

  1. All observations are given equal weights at first.
  2. A model is built on the subset of data.
  3. Using this model, predictions are made on the complete dataset.
  4. Errors are calculated.
  5. When building the next model, higher weights are given to the data points that were predicted incorrectly.
  6. The higher the error, the larger the weight assigned.
  7. This process is repeated until the error function no longer changes, or the maximum number of estimators is reached.
AdaBoostRegressor(base_estimator=None, n_estimators=50, learning_rate=1.0, loss='linear', random_state=None)
  1. base_estimator: Specifies the base estimator; the default is a regression tree.
  2. n_estimators: The number of estimators; the default is 50, and a higher value often gives better performance.
  3. learning_rate: Shrinks the contribution of each estimator in the final combination.
  4. max_depth: The maximum depth of the individual estimators (set on the base estimator itself).
  5. n_jobs: Specifies the number of processors it is allowed to use.
  6. random_state: Seed for the random number generator, for reproducible results.
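A short usage sketch of AdaBoostRegressor; the toy regression dataset and the choice of a shallow tree as base estimator are illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Shallow regression trees as base estimators, re-weighted sequentially
ada = AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=4),
                        n_estimators=100, learning_rate=0.5,
                        loss='linear', random_state=42)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))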

Gradient Boosting (GBM)

Gradient boosting is an ensemble machine learning algorithm for both regression and classification. It combines a number of weak learners to form a strong learner; regression trees are used as the base learners.

GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001)
  1. min_samples_split — minimum number of samples required in a node for it to be considered for splitting. Used to control over-fitting.
  2. min_samples_leaf — minimum number of samples required at a terminal (leaf) node.
  3. min_weight_fraction_leaf — same as min_samples_leaf, but expressed as a fraction of the total number of observations instead of an integer.
  4. max_depth — maximum depth of each tree.
  5. max_leaf_nodes — maximum number of terminal nodes; a tree of depth n has at most 2^n leaves, so this can be used in place of max_depth.
  6. max_features — The number of features to consider while searching for the best split. These will be randomly selected.
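A short usage sketch of GradientBoostingRegressor; the toy dataset and parameter values are illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 200 shallow trees fitted sequentially on the residuals of the ensemble
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                max_depth=3, subsample=0.8,
                                min_samples_leaf=5, random_state=42)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))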

XGBoost

Extreme Gradient Boosting is an advanced implementation of gradient boosting with high predictive power, and it is almost 10x faster than other gradient boosting implementations. It also offers various regularisation techniques that reduce overfitting and improve overall performance, which is why it is also called a regularised boosting technique.

How it is better than other techniques:

  1. Regularization — Reduce overfitting
  2. Parallel processing — Faster than GBM
  3. High Flexibility — custom optimisation objectives and evaluation criteria.
  4. Handling Missing Values — built-in handling of missing (NaN) values.
  5. Tree Pruning — XGBoost makes splits up to the max_depth specified and then starts pruning the tree backwards and removes splits beyond which there is no positive gain.
  6. Built-in Cross-Validation — user can run a cross-validation at each iteration, hence better optimization
XGBoost's scikit-learn wrapper exposes a familiar interface:
XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, objective='binary:logistic', booster='gbtree', n_jobs=1, gamma=0, min_child_weight=1, subsample=1, colsample_bytree=1, random_state=0)
  1. nthread — number of parallel threads used for training.
  2. eta — learning rate (similar to learning_rate in GBM).
  3. min_child_weight — minimum sum of instance weights required in a child node.
  4. max_depth — maximum depth of a tree.
  5. max_leaf_nodes — maximum number of terminal nodes.
  6. gamma — specifies the minimum loss reduction required to make a split.
  7. subsample — same as the subsample of GBM; denotes the fraction of observations to be randomly sampled for each tree.
  8. colsample_bytree — fraction of features (columns) to be randomly sampled for each tree.
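A short usage sketch of the scikit-learn style API, assuming the xgboost package is installed; the toy dataset and parameter values are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Regularised boosting: shallow trees, row and column subsampling
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4,
                    min_child_weight=1, gamma=0, subsample=0.8,
                    colsample_bytree=0.8, n_jobs=-1, random_state=42)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))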

Sources:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble
