Ensemble Learning Explained! Part 2
This is a two-part article on Ensemble Learning. The first part is linked here. This article focuses on the algorithms based on Bagging and Boosting.
Algorithms based on Bagging
Bagging, or Bootstrap Aggregation, is an ensemble technique whose goal is to reduce the variance of a decision tree. The idea is to create multiple random subsets from the original data and train a decision tree on each of them. As a result, we get multiple decision tree models, and their average is used, which is more robust than a single decision tree.
Under the bagging algorithm, we have two techniques:
- Bagging Meta-estimator
- Random Forest
Bagging Meta-Estimator
The bagging meta-estimator is used for both classification and regression problems. The typical bagging technique follows these steps (a conceptual sketch follows the list):
- Bootstrapping (creating random subsets).
- Subsets include all the features.
- A custom base estimator is fitted on each of the smaller sets.
- Predictions from each model are combined to get the final result.
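To make these steps concrete, here is a hand-rolled sketch of bagging with scikit-learn decision trees (the dataset, the number of estimators, and the variable names are illustrative assumptions, not part of any library API):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)
trees = []
for _ in range(10):
    # bootstrapping: sample rows with replacement, keep all features
    idx = rng.randint(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
# combine: majority vote across the 10 trees
votes = np.array([tree.predict(X) for tree in trees])
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), axis=0, arr=votes)
print("training accuracy of the vote:", (majority == y).mean())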
The constructor signatures for BaggingClassifier and BaggingRegressor are shown below; a usage sketch follows the parameter list.
BaggingClassifier(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)
BaggingRegressor(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)
Here the important parameters are:
- base_estimator: It defines the base estimator to fit on random subsets of the dataset.
- n_estimators: The number of base estimators in the ensemble.
- max_samples: The number of samples to draw from X to train each base estimator.
- max_features: The number of features to draw from X to train each base estimator.
- n_jobs: The number of jobs to run in parallel.
- random_state: Seed that controls the random resampling, so results are reproducible.
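A minimal usage sketch for BaggingClassifier (assuming scikit-learn is installed; the dataset and parameter values are illustrative, and recent scikit-learn releases rename base_estimator to estimator):
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# 10 decision trees, each fitted on a bootstrap sample that keeps all features
model = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                          n_estimators=10, max_samples=1.0, max_features=1.0,
                          bootstrap=True, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))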
Random Forest
Random Forest is an extension of the bagging technique described above. The base estimators in a Random Forest are decision trees. In addition, rather than considering all features, a Random Forest looks for the best split among a random subset of features at each node of the decision tree.
- Random subsets are created from the original dataset (bootstrapping).
- At each node in the decision tree, only a random set of features is considered to decide the best split.
- A decision tree model is fitted on each of the subsets.
- Prediction from each model is combined to get the final result.
RandomForestClassifier(n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)
Here the important parameters are:
- n_estimators: It defines the number of decision trees to be created in a random forest. Generally, a higher number makes the predictions stronger and more stable, but a very large number can result in higher training time.
- criterion: the function to be used for splitting (default=gini).
- max_features: the number of features to consider when looking for the best split at each node; these are selected randomly.
- max_depth: maximum depth of the decision trees.
- min_samples_split: the minimum number of samples required in a node before a split is attempted. If the number of samples is less than this, the node is not split.
- min_samples_leaf: the minimum number of samples required to be at a leaf node. Smaller leaves make the model more likely to capture noise in the training data.
- n_jobs: number of jobs to run in parallel.
- random_state: seed for the random number generator, so results are reproducible.
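A minimal usage sketch (assuming scikit-learn; the dataset and hyperparameter values below are illustrative choices):
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# 100 trees; each split considers only a random subset of the features
forest = RandomForestClassifier(n_estimators=100, criterion='gini',
                                max_features='sqrt', min_samples_leaf=1,
                                n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))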
Algorithms based on Boosting
Boosting is an ensemble technique where learners are trained sequentially: early learners fit simple models to the data, and the data is then analyzed for errors. In other words, we fit consecutive trees (each on a random sample), and at every step the goal is to reduce the net error from the prior trees.
Some algorithms are:
- AdaBoost
- GBM
- XGBoost
There are a few other boosting algorithms such as CatBoost and LightGBM.
AdaBoost (Adaptive Boosting)
In AdaBoost, multiple sequential models are created, each correcting the errors from the previous model. AdaBoost assigns higher weights to the observations that are incorrectly predicted, and the subsequent model works to predict these values correctly. The steps in a typical AdaBoost algorithm are as follows (a simplified sketch of the reweighting idea follows the list):
- All observations are given equal weights at first.
- A model is built on the subset of data.
- Using this model, predictions are made on the complete dataset.
- Errors are calculated.
- When the next model is built, higher weights are given to the data points that were predicted incorrectly.
- The higher the error, the higher the weight.
- This process is repeated until the error function stops changing or the maximum number of estimators is reached.
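The reweighting loop can be sketched for binary labels in {-1, +1}; this is a simplified, illustrative version of discrete AdaBoost (the dataset, the number of rounds, and the use of decision stumps are assumptions), not scikit-learn's exact implementation:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y = np.where(y == 1, 1, -1)             # relabel classes to {-1, +1}
w = np.full(len(X), 1 / len(X))         # all observations get equal weights at first
stumps, alphas = [], []
for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum() / w.sum()  # weighted error of this round
    alpha = 0.5 * np.log((1 - err) / err)
    w *= np.exp(-alpha * y * pred)      # raise weights of misclassified points
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)
# final prediction: weighted vote of all the stumps
final = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", (final == y).mean())
In scikit-learn, the AdaBoostRegressor signature is: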
AdaBoostRegressor(base_estimator=None, n_estimators=50, learning_rate=1.0, loss='linear', random_state=None)
- base_estimator: specifies the base estimator; the default is a decision tree (a regression tree in the case of AdaBoostRegressor).
- n_estimators: the number of estimators; the default is 50, and a higher value often gives better performance.
- learning_rate: shrinks the contribution of each estimator in the final combination.
- max_depth: maximum depth of the individual estimators (set on the base estimator rather than on AdaBoost itself).
- n_jobs: the number of processors allowed; note that scikit-learn's AdaBoost classes themselves do not expose this parameter, since the estimators are trained sequentially.
- random_state: seed for the random number generator, so results are reproducible.
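A minimal usage sketch of the regressor (assuming scikit-learn; the dataset and parameter values are illustrative, and recent scikit-learn releases rename base_estimator to estimator):
from sklearn.datasets import load_diabetes
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# 50 shallow regression trees, each reweighted toward the previous model's errors
ada = AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=3),
                        n_estimators=50, learning_rate=1.0, random_state=0)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))   # R^2 on the held-out set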
Gradient Boosting (GBM)
Gradient boosting is an ensemble machine learning algorithm for both regression and classification. It combines a number of weak learners to form a strong learner, with regression trees used as the base learner.
GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001)
- min_samples_split — the minimum number of samples required in a node for it to be considered for splitting. Used to control over-fitting.
- min_samples_leaf — the minimum number of samples required at a terminal (leaf) node.
- min_weight_fraction_leaf — same as min_samples_leaf, but defined as a fraction of the total number of observations instead of an integer.
- max_depth — the maximum depth of each tree.
- max_leaf_nodes — the maximum number of terminal nodes; a tree of depth n has at most 2^n leaves.
- max_features — The number of features to consider while searching for the best split. These will be randomly selected.
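A minimal usage sketch (assuming scikit-learn; the dataset and parameter values are illustrative):
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# 100 shallow trees, each fitted to the residual errors of the ensemble so far
gbm = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=3, subsample=1.0, random_state=0)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))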
XGBoost
Extreme Gradient Boosting is an advanced implementation of GBM with high predictive power, and it is almost 10x faster than other gradient boosting techniques. It also has various regularisation techniques that reduce overfitting and improve overall performance, which is why it is also called a regularised boosting technique.
How it is better than other techniques:
- Regularization — Reduce overfitting
- Parallel processing — Faster than GBM
- High Flexibility — Custom optimization objectives and evaluation criteria
- Handling Missing Values — built-in handling of missing (NaN) values
- Tree Pruning — XGBoost makes splits up to the max_depth specified and then starts pruning the tree backwards and removes splits beyond which there is no positive gain.
- Built-in Cross-Validation — user can run a cross-validation at each iteration, hence better optimization
For reference, the classifier counterpart of scikit-learn's gradient boosting shown above is GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001).
XGBoost itself is distributed as the separate xgboost package (for example through xgboost.XGBClassifier and xgboost.XGBRegressor), and its important parameters are:
- nthread — number of parallel threads used for training (similar to n_jobs)
- eta — learning rate (similar to learning_rate in GBM)
- min_child_weight — minimum sum of instance weights required in a child node
- max_depth — maximum depth of a tree
- max_leaf_nodes — maximum number of terminal nodes
- gamma — specifies the minimum loss reduction required to make a split
- subsample — same as the subsample of GBM; denotes the fraction of observations to be randomly sampled for each tree
- colsample_bytree — the fraction of features (columns) randomly sampled for each tree
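A minimal usage sketch through XGBoost's scikit-learn-style wrapper (this assumes the xgboost package is installed; the dataset and parameter values are illustrative):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# 100 boosted trees with a modest learning rate, row and column subsampling
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                    subsample=0.8, colsample_bytree=0.8)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))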
Sources:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble