Detail on AdaBoost - Adaptive Boosting¶
Understanding concept of Boosting¶
Boosting is an ensemble learning method aimed at transforming a collection of weak learners into a single, strong predictive model. In boosting, models are trained sequentially rather than concurrently. Each model in the sequence seeks to correct the errors of its predecessor by emphasizing the previously misclassified examples. This iterative strategy reduces both bias and variance, often resulting in a classifier that performs significantly better than any single weak learner alone. Boosting focuses on prediction error, particularly emphasizing the hard-to-classify instances, thereby gradually creating a model that generalizes well.
AdaBoost: Adaptive Boosting¶
AdaBoost, short for Adaptive Boosting, was one of the first practical boosting algorithms and remains foundational in machine learning. It works by assigning equal weights to all training instances at the start and then iteratively increasing the weights of misclassified examples. Subsequent weak classifiers are trained on this reweighted dataset so they concentrate on harder cases. Each weak learner’s contribution to the final prediction is weighted according to its accuracy. Over a number of rounds, this process yields a strong classifier made up of weighted weak learners.
AdaBoost commonly uses decision stumps (single-split decision trees) as its weak learners, but in practice it can also combine stronger base learners, sometimes improving performance further on complex datasets. While it often resists overfitting on clean data, AdaBoost is known to be sensitive to noise and outliers: frequently misclassified points can receive disproportionately high weight and skew the model, though variants exist to mitigate that effect.
The core principle is to allocate more learning power to hard examples by reweighting the data. If a sample is hard to classify, subsequent learners are nudged to correct that mistake. Similarly, each learner’s influence on the final model is dictated by its success rate: better learners should count more. This idea is grounded in exponential loss minimization, where AdaBoost aims to minimize an upper bound of the training error by iteratively reweighting the samples to focus on mistakes. The update rules for weights stem from this formal objective. We will see this in detail with example shortly.
Theoretical Understanding of how AdaBoost works¶
Initial Setup¶
The AdaBoost algorithm starts by assigning equal weights to all training examples. Suppose you have a dataset with $n$ samples. Initially, each sample is given a weight of $1/n$. These weights indicate the importance of each sample during training. At this point, no classifier has been trained and all data points are treated equally.
Iterative Model Training¶
The algorithm proceeds in iterations. In each round, a weak learner (typically a shallow decision tree or a decision stump, which is a one-level decision tree) is trained on the weighted dataset. The model tries to classify the training data, and its performance is evaluated using the current weights. After training, the classifier's predictions are compared to the true labels. If a sample is misclassified, its weight is increased so that the next model in the sequence will pay more attention to this sample. On the other hand, if a sample is correctly classified, its weight is decreased.
This process ensures that the next classifier focuses more on the hard-to-classify data points. The logic is: if a model struggles to classify certain examples, the next model should try harder to get them right.
Classifier Weighting¶
Each weak learner is also assigned a weight based on its accuracy. If the classifier performs well, it receives a higher weight, meaning its influence in the final model will be greater. The accuracy is used to calculate an alpha value (usually denoted as α), which represents the confidence of the classifier. Poorer-performing classifiers receive smaller α values and have less say in the final prediction.
Mathematically, this alpha is calculated using:
$$ \alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) $$
where $\epsilon_t$ is the error rate of the weak classifier at iteration $t$. A lower error rate leads to a higher alpha, meaning the model is more trusted in the ensemble.
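To make the mapping from error rate to classifier weight concrete, here is a minimal sketch (the error values are arbitrary); a lower $\epsilon_t$ yields a larger $\alpha_t$:
import numpy as np

# Hypothetical weighted error rates of three weak learners
eps = np.array([0.3, 0.1, 0.45])

# alpha_t = 0.5 * ln((1 - eps_t) / eps_t)
alphas = 0.5 * np.log((1 - eps) / eps)
print(np.round(alphas, 4))  # [0.4236 1.0986 0.1003]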
Weight Update for Data Points¶
Once a classifier is trained and its alpha is computed, the weights of the data points are updated for the next iteration. Misclassified samples get their weights increased, making them more influential for the next classifier, while correctly classified ones have their weights decreased. With labels and predictions in $\{-1, +1\}$, the update can be written as:
$$ w_i^{(t+1)} = w_i^{(t)} \cdot \exp\left(-\alpha_t \, y_i \, h_t(x_i)\right) $$
Since $y_i h_t(x_i) = +1$ for a correct prediction and $-1$ for an incorrect one, correct samples are scaled by $e^{-\alpha_t}$ and incorrect ones by $e^{+\alpha_t}$. (An equivalent formulation multiplies only the misclassified samples by $\exp(\alpha_t \cdot \mathbb{I}[y_i \neq h_t(x_i)])$; after normalization both give the same distribution.) Finally, all weights are normalized so that they sum up to 1.
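A minimal sketch of one such update, assuming 10 samples of which samples 3, 7, and 10 are misclassified (the same setup used in the worked example later in this article):
import numpy as np

n = 10
w = np.full(n, 1 / n)                            # initial weights 1/n
miss = np.isin(np.arange(1, n + 1), [3, 7, 10])  # True where the stump is wrong

eps = w[miss].sum()                              # weighted error = 0.3
alpha = 0.5 * np.log((1 - eps) / eps)            # ~0.4236

w = w * np.exp(np.where(miss, alpha, -alpha))    # up-weight mistakes, down-weight the rest
w = w / w.sum()                                  # renormalize to sum to 1
print(np.round(w, 4))                            # correct ~0.0714, misclassified ~0.1667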
Repeating the Process¶
The training continues for a predefined number of rounds (or until the error drops to zero). In each iteration, a new weak learner is added, focusing more on the difficult examples as defined by the updated weights. As the iterations proceed, the ensemble becomes increasingly accurate.
Final Prediction¶
To make a prediction on a new sample, AdaBoost takes the predictions from all weak learners and performs a weighted vote, where the weight of each model is its alpha value. In binary classification, if the weighted sum of predictions is positive, the sample is classified into class 1; otherwise, class 0.
Mathematically, the final strong classifier $H(x)$ is:
$$ H(x) = \text{sign}\left( \sum_{t=1}^{T} \alpha_t \cdot h_t(x) \right) $$
Here, $h_t(x)$ is the prediction of the t-th weak learner, and $\alpha_t$ is its corresponding weight. The final result is the sign of the weighted sum of individual classifiers’ outputs.
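To tie the whole procedure together, here is a compact from-scratch sketch of training and prediction, using scikit-learn decision stumps as weak learners and labels in {-1, +1}. The function names (adaboost_fit, adaboost_predict) are illustrative, not library APIs:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

def adaboost_fit(X, y, T=20):
    """Train T decision stumps with AdaBoost-style reweighting (labels in {-1, +1})."""
    n = len(y)
    w = np.full(n, 1 / n)
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)                      # weak learner trained on weighted data
        pred = stump.predict(X)
        eps = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)   # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)                 # learner weight (a full version would stop if eps >= 0.5)
        w = w * np.exp(-alpha * y * pred)                     # up-weight mistakes, down-weight the rest
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote: sign of sum_t alpha_t * h_t(x)."""
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
y = np.where(y == 0, -1, 1)   # AdaBoost's formulation uses {-1, +1} labels
stumps, alphas = adaboost_fit(X, y)
print("Training accuracy:", (adaboost_predict(stumps, alphas, X) == y).mean())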
Importance of Features in AdaBoost¶
AdaBoost is often explained in terms of reweighting data points, but that leads to a common misconception - that it ignores the features. In reality, features are central to how weak learners are trained. Let’s break it down clearly using an example with multiple features and then extend it to multi-class classification.
In AdaBoost, a weak learner (commonly a decision stump) is trained to minimize the weighted classification error. A decision stump is simply a one-level decision tree: it picks one feature and a threshold, and splits the data into two groups based on that. But unlike in regular decision trees where all data points are treated equally, in AdaBoost, each data point has a weight. Initially, all samples have equal weights, but those weights will later be adjusted based on the errors made in this first round.
Suppose we have:
- 20 samples
- 6 features: F1, F2, F3, F4, F5, F6
- 1 binary target variable: Y (0 or 1)
- All sample weights: $w_i = \frac{1}{20} = 0.05$
The training goal is to build a decision stump that, given these weights, splits the data with lowest weighted classification error. For each of the 6 features, the algorithm tests various threshold values (based on unique values in that feature). For every potential split, the model checks how the data would be classified - whether each sample falls into the predicted class or not. It then computes the total error as the sum of weights of misclassified samples.
Suppose F3 < 5.2 is one candidate split. After applying this split:
- 14 samples are correctly classified
- 6 samples are misclassified
Since each sample has a weight of 0.05:
$$ \varepsilon = \sum_{\text{misclassified}} w_i = 6 \times 0.05 = 0.3 $$
So, the weighted classification error of this stump is 0.3.
The algorithm repeats this for all possible thresholds across all 6 features. Among all these candidates, the decision stump with the lowest weighted error is selected as the first weak learner. Let's say that F3 < 5.2 results in the lowest error (0.3), compared to other splits like F1 < 2.1 (0.35), F5 < 7.4 (0.4), etc. Therefore, the algorithm selects F3 < 5.2 as the first weak hypothesis $h_1(x)$.
The key property of a decision stump is that it only uses one feature to make a decision. The strength of AdaBoost lies in combining many such weak learners that are each simple, but together form a strong model.
What Happens to Features?¶
- No explicit reweighting of features happens.
- But over iterations, as sample weights change, the importance of different features may emerge.
- If a feature helps classify the newly emphasized (previously misclassified) samples, it is likely to be chosen in subsequent rounds.
So, AdaBoost builds a sequence of weak learners, each focusing on a different aspect of the data, possibly using different features at each round.
After training:
- You can evaluate feature importance by checking how often and how effectively each feature was used in decision stumps.
- Libraries like scikit-learn provide .feature_importances_ for this.
Thus, although AdaBoost reweights samples, it leads to implicit feature selection over time.
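For instance, a fitted AdaBoostClassifier exposes aggregated importances directly. A small sketch on synthetic data (the dataset and column names are made up for illustration):
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Synthetic data with 6 features, only some of them informative
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f'F{i+1}' for i in range(6)])

clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)
clf.fit(X, y)

# Importance of each feature aggregated across all stumps
print(pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False))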
Python Example explaining the scenario¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

# Simulate dataset with 20 samples, 6 features, and binary target
np.random.seed(42)
X = pd.DataFrame(np.random.randn(20, 6), columns=[f'F{i+1}' for i in range(6)])
y = np.random.choice([0, 1], size=20)
sample_weights = np.ones(len(y)) / len(y)  # Equal initial weights

# Train decision stumps on each feature and record weighted errors
errors = {}
thresholds = {}

for feature in X.columns:
    min_error = float('inf')
    best_thresh = None
    # Try different thresholds based on unique values in feature
    for thresh in np.unique(X[feature]):
        # Predict using threshold rule
        preds = (X[feature] < thresh).astype(int)  # classify 1 if value < threshold
        incorrect = (preds != y)
        weighted_error = np.sum(sample_weights * incorrect)
        if weighted_error < min_error:
            min_error = weighted_error
            best_thresh = thresh
    errors[feature] = min_error
    thresholds[feature] = best_thresh

# Find best feature-threshold combination
best_feature = min(errors, key=errors.get)
best_thresh = thresholds[best_feature]
best_error = errors[best_feature]

# Visualization
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(errors.keys(), errors.values(), color='skyblue')
ax.set_title("Weighted Classification Errors for Decision Stumps")
ax.set_ylabel("Weighted Error")
ax.axhline(best_error, color='red', linestyle='--', label=f"Best Split: {best_feature} < {best_thresh:.2f}")
ax.legend()

best_feature, best_thresh, best_error
('F3', -1.1063349740060282, 0.35)
What Changes in Multiclass Classification?¶
AdaBoost was originally designed for binary classification. However, it has been extended to handle multiclass problems in several ways:
1. One-vs-All AdaBoost (OVA)¶
You train one AdaBoost classifier per class. Each classifier predicts whether the sample belongs to that class or not.
- Features are treated the same.
- Each classifier may prioritize different features, based on what best distinguishes its class from the rest.
- At prediction time, you run all classifiers and choose the one with the highest confidence score.
2. SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function)¶
This is a native multiclass extension of AdaBoost proposed by Zhu et al.
Each weak learner outputs class predictions (not just binary).
The model weight $\alpha_t$ is adjusted for multiclass using:
$$ \alpha_t = \ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right) + \ln(K - 1) $$
where $K$ is the number of classes.
Misclassified samples get higher weights as usual.
The final prediction is based on a weighted vote over class predictions.
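The only change to the learner-weight formula is the $\ln(K-1)$ term, which keeps $\alpha_t$ positive as long as the weak learner beats random guessing (accuracy above $1/K$). A quick sketch with arbitrary error values:
import numpy as np

def samme_alpha(eps, K):
    # Learner weight with the multiclass correction term ln(K - 1)
    return np.log((1 - eps) / eps) + np.log(K - 1)

print(round(samme_alpha(0.3, 2), 4))  # 0.8473 -> exactly twice the binary AdaBoost alpha
print(round(samme_alpha(0.6, 3), 4))  # 0.2877 -> still positive, since random accuracy for K=3 is only 1/3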
3. SAMME.R (Real variant)¶
This is an improvement over SAMME. Instead of using class labels as weak outputs, it uses class probability estimates from weak learners.
- Final prediction combines class probability scores weighted by $\alpha_t$
- This leads to smoother decision boundaries and better accuracy.
Example: Multiclass with 3 Classes and 6 Features¶
Suppose you have a dataset:
- 6 features:
F1
toF6
- Target:
Y
with 3 classes —A
,B
, andC
- 20 data points
Training with SAMME:¶
- In each iteration, a weak learner is trained.
- The learner tries different splits using features.
- Based on weighted error over 3-class predictions, a feature is selected.
- Misclassified samples (in any class) get higher weights.
Here again, features influence the learning, but the error calculation and vote aggregation are adjusted for multiclass.
Over time, the model may find that F2 is useful for distinguishing A from B, while F5 is good at separating B and C. This emerges through the adaptive training process.
Hyperparameters of AdaBoost¶
AdaBoost has a few important hyperparameters that control its behavior. Tuning them can help improve performance on a given task.
1. n_estimators¶
This determines the number of weak learners (e.g., decision stumps) to be used in the ensemble. A higher value means more models will be trained, often leading to better performance, but also longer training time and risk of overfitting on noisy data.
2. learning_rate¶
This shrinks the contribution of each classifier by a constant factor. A lower learning rate means the model learns more slowly, but can achieve better generalization when used with more estimators. In scikit-learn, the default is 1.0. It acts as a regularization mechanism.
3. estimator¶
This is the weak learner used in the boosting process. Typically, a decision stump is used, but it can be replaced with more complex models like deeper decision trees. However, using complex models may reduce the benefit of boosting and increase the chance of overfitting. In older scikit-learn versions this parameter was called base_estimator.
4. algorithm¶
scikit-learn offers two options for this parameter:
- "SAMME": Uses class labels, and works for both binary and multi-class classification.
- "SAMME.R": Uses class probabilities and usually performs better, but requires base learners that provide probability estimates (like decision trees). This option is deprecated in recent scikit-learn versions and is expected to be removed.
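As a quick sketch, these options map directly onto the AdaBoostClassifier constructor, assuming a recent scikit-learn (≥ 1.2, where the base learner is passed as estimator); the values below are arbitrary starting points, not recommendations:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # the weak learner (a decision stump here)
    n_estimators=200,                               # number of boosting rounds
    learning_rate=0.5,                              # shrinks each learner's contribution
    algorithm='SAMME',                              # label-based updates; 'SAMME.R' is deprecated
    random_state=42,
)
# A common tuning pattern: increase n_estimators while lowering learning_rate,
# then keep the combination with the best cross-validated score.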
Pros and Cons of AdaBoost¶
Pros¶
- Simple and Effective: AdaBoost is easy to understand and implement. It works well out-of-the-box for many classification problems.
- Reduces Bias and Variance: Since it builds sequential models focused on correcting previous errors, it reduces both bias and variance under many scenarios.
- Less Prone to Overfitting (with proper tuning): Unlike decision trees that can overfit on training data, AdaBoost with simple weak learners often generalizes better.
- Versatile: Can be used with different base learners, though typically shallow trees work best.
Cons¶
- Sensitive to Noisy Data and Outliers: Since AdaBoost increases the weight of misclassified points, noisy data or mislabeled examples can dominate the learning process and degrade performance.
- Computationally Expensive: As the number of estimators increases, training time increases linearly. Moreover, predictions can be slower compared to single models.
- Binary Focused: While extensions exist for multi-class problems, AdaBoost was originally designed for binary classification, and multi-class tasks may require adaptations.
- Harder to Interpret: Unlike a single decision tree, AdaBoost’s ensemble structure is more difficult to interpret and explain.
When to Use AdaBoost¶
AdaBoost is most suitable in scenarios where:
- The data is relatively clean, with minimal outliers.
- The features are structured (e.g., tabular data).
- A fast and simple baseline model is needed before exploring more complex boosting algorithms.
- The problem is binary classification, although it can be extended to multi-class.
It is less ideal in the following cases:
- Datasets with significant noise or mislabeled points.
- Large-scale datasets where computational cost becomes a concern.
- Problems with highly complex feature interactions where deeper tree-based models are more appropriate.
Why Other Boosting Methods Exist¶
Even though AdaBoost was the first popular boosting algorithm and performs well in many cases, it has limitations that later algorithms sought to overcome.
Gradient Boosting¶
Unlike AdaBoost, which reweights samples based on misclassification, Gradient Boosting takes a different route. It fits each new learner to the residual errors of the previous model by minimizing a loss function using gradient descent. This allows more flexibility in handling different types of problems, including regression and classification, and in using custom loss functions.
XGBoost¶
XGBoost (Extreme Gradient Boosting) improves over classic Gradient Boosting by introducing:
- Regularization to avoid overfitting.
- Parallel computation for faster training.
- Tree pruning and histogram-based splits for performance.
- Better handling of sparse and missing data.
XGBoost is widely used in Kaggle competitions and real-world applications because of its robustness and performance.
LightGBM¶
LightGBM is designed for speed and scale:
- Uses a leaf-wise tree growth strategy, which tends to yield better accuracy.
- Efficient with large datasets and high-dimensional data.
- Supports GPU training and efficient handling of categorical variables.
CatBoost¶
CatBoost is specifically tailored for datasets with many categorical features:
- Automatically handles categorical variables without preprocessing.
- Uses ordered boosting to avoid overfitting.
- Produces symmetric trees for faster inference.
Intuition Behind the Formulation of AdaBoost¶
The key idea behind AdaBoost is to focus learning on the hard examples. In traditional supervised learning, every data point is treated equally, and the model tries to minimize overall error. But some examples are inherently more difficult to classify than others.
The Question AdaBoost Answers:¶
Can we build a strong learner by combining many weak learners, each of which only does slightly better than random guessing?
This idea was formalized by Freund and Schapire in 1996, where they showed that even weak hypotheses (slightly better than 50% accuracy) could be combined into a strong hypothesis with arbitrarily low error.
Here’s the intuition:
- Start with equal attention to all data.
- After training one weak model, examine its mistakes.
- Increase attention (weights) on the mistakes, so that the next model tries harder to fix them.
- Keep combining new learners, where each one is focusing more and more on what the previous ones got wrong.
- Use a weighted vote to combine all the learners, giving more importance to models that did well.
The formulation uses exponential loss to increase penalties on misclassified points, which allows the algorithm to tightly focus on hard examples and reduce training error quickly. This approach is both greedy and intuitive - fix what's wrong, give more credit to what's right, and repeat.
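A compact way to see this formally: AdaBoost can be read as greedily minimizing the exponential loss of the combined model $F_T(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$,
$$ L = \sum_{i=1}^{n} \exp\left(-y_i F_T(x_i)\right) $$
At round $t$, the loss accumulated by the ensemble built so far shows up as the sample weights, $w_i^{(t)} \propto \exp\left(-y_i F_{t-1}(x_i)\right)$: points the current ensemble gets badly wrong carry large weight. Choosing $h_t$ to minimize the weighted error and setting $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ is exactly the greedy step that reduces this loss the most at each round.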
Explaining AdaBoost with Example¶
We’ll work with a binary classification task having 10 samples and 3 rounds (T=3) of AdaBoost using decision stumps.
Step 0: Initial Setup¶
Each sample is assigned an equal weight initially:
$$ w_i^{(0)} = \frac{1}{10} = 0.1 \quad \text{for } i = 1 \text{ to } 10 $$
Round 1¶
Train First Weak Learner (h₁)¶
- Misclassified samples: 3, 7, 10
- Error:
$$ \varepsilon_1 = w_3 + w_7 + w_{10} = 0.1 + 0.1 + 0.1 = 0.3 $$
Compute Learner Weight (α₁)¶
$$ \alpha_1 = \frac{1}{2} \ln \left( \frac{1 - \varepsilon_1}{\varepsilon_1} \right) = \frac{1}{2} \ln \left( \frac{0.7}{0.3} \right) \approx 0.4236 $$
Update Weights¶
- Correct samples (weight down):
$$ w_i^{\text{new}} = 0.1 \cdot e^{-0.4236} \approx 0.0655 $$
- Incorrect samples (weight up):
$$ w_i^{\text{new}} = 0.1 \cdot e^{+0.4236} \approx 0.1527 $$
Normalize Weights¶
$$ Z_1 = 7 \cdot 0.0655 + 3 \cdot 0.1527 = 0.9166 $$
- Correct:
$$ w_i^{(1)} = \frac{0.0655}{0.9166} \approx 0.0715 $$
- Incorrect:
$$ w_i^{(1)} = \frac{0.1527}{0.9166} \approx 0.1666 $$
Updated Weights After Round 1¶
Sample | Misclassified | Weight w₁ |
---|---|---|
1 | No | 0.0715 |
2 | No | 0.0715 |
3 | Yes | 0.1666 |
4 | No | 0.0715 |
5 | No | 0.0715 |
6 | No | 0.0715 |
7 | Yes | 0.1666 |
8 | No | 0.0715 |
9 | No | 0.0715 |
10 | Yes | 0.1666 |
Round 2¶
Train Second Weak Learner (h₂)¶
Suppose this time the second stump misclassifies samples 2 and 5 (both were classified correctly in round 1). The weighted error is then:
$$ \varepsilon_2 = w_2 + w_5 = 0.0715 + 0.0715 = 0.1430 $$
Compute Learner Weight (α₂)¶
$$ \alpha_2 = \frac{1}{2} \ln \left( \frac{1 - 0.1430}{0.1430} \right) = \frac{1}{2} \ln(5.996) \approx 0.8959 $$
Update Weights¶
- Correct:
$$ w_i^{\text{new}} = w_i \cdot e^{-\alpha_2} \approx w_i \cdot 0.408 $$
- Incorrect (samples 2, 5):
$$ w_i^{\text{new}} = w_i \cdot e^{+\alpha_2} \approx w_i \cdot 2.45 $$
Calculate Unnormalized Weights¶
Sample | Correct? | w₁ | New w₂ (before normalization) |
---|---|---|---|
1 | Yes | 0.0715 | 0.0715 × 0.408 ≈ 0.0292 |
2 | No | 0.0715 | 0.0715 × 2.45 ≈ 0.1752 |
3 | Yes | 0.1666 | 0.1666 × 0.408 ≈ 0.0680 |
4 | Yes | 0.0715 | 0.0715 × 0.408 ≈ 0.0292 |
5 | No | 0.0715 | 0.0715 × 2.45 ≈ 0.1752 |
6 | Yes | 0.0715 | 0.0715 × 0.408 ≈ 0.0292 |
7 | Yes | 0.1666 | 0.1666 × 0.408 ≈ 0.0680 |
8 | Yes | 0.0715 | 0.0715 × 0.408 ≈ 0.0292 |
9 | Yes | 0.0715 | 0.0715 × 0.408 ≈ 0.0292 |
10 | Yes | 0.1666 | 0.1666 × 0.408 ≈ 0.0680 |
Total sum (Z₂) ≈ 0.7004. Normalized weights are:
- For samples 2 and 5:
$$ \frac{0.1752}{0.7004} ≈ 0.2501 $$
- For others:
$$ \frac{0.0292}{0.7004} ≈ 0.0417 \text{ (for 1, 4, 6, 8, 9)} $$
$$ \frac{0.0680}{0.7004} ≈ 0.0971 \text{ (for 3, 7, 10)} $$
Round 3¶
Let’s assume third stump misclassifies samples 3, 6
$$ \varepsilon_3 = w_3 + w_6 = 0.0971 + 0.0417 = 0.1388 $$
$$ \alpha_3 = \frac{1}{2} \ln \left( \frac{1 - 0.1388}{0.1388} \right) \approx 0.9127 $$
Weights are updated again in similar fashion.
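The rounds above can be reproduced with a few lines of NumPy, which is a convenient way to check the arithmetic (tiny last-digit differences versus the hand-rounded values above are rounding artifacts):
import numpy as np

n = 10
w = np.full(n, 1 / n)
alphas = []

# Samples misclassified by each weak learner (1-indexed, as in the walkthrough)
misclassified_per_round = [[3, 7, 10], [2, 5], [3, 6]]

for missed in misclassified_per_round:
    miss = np.isin(np.arange(1, n + 1), missed)
    eps = w[miss].sum()                            # weighted error
    alpha = 0.5 * np.log((1 - eps) / eps)          # learner weight
    w = w * np.exp(np.where(miss, alpha, -alpha))  # up-weight mistakes, down-weight the rest
    w = w / w.sum()                                # renormalize
    alphas.append(alpha)
    print(f"eps={eps:.4f}, alpha={alpha:.4f}")

# Final vote for a new point where h1 = +1, h2 = +1, h3 = -1
print(np.sign(alphas[0] + alphas[1] - alphas[2]))  # +1.0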
Final Prediction¶
Let’s say we want to predict a new point x. Each weak learner gives:
- h₁(x) = +1
- h₂(x) = +1
- h₃(x) = -1
In AdaBoost, each weak learner $h_t(x)$ makes a prediction for a given input $x$. These predictions are assumed to be in the form of +1 or -1, not 1 or 0, because AdaBoost is originally formulated for binary classification with labels in $\{-1, +1\}$, not $\{0, 1\}$.
So:
- $h_1(x) = +1$ means weak learner 1 predicts class +1 for input $x$
- $h_3(x) = -1$ means weak learner 3 predicts class -1 for input $x$
These signs come directly from how each weak learner classifies the input in AdaBoost's formulation. The final prediction is then:
$$ \hat{y}(x) = \text{sign} \left( \sum_{t=1}^{T} \alpha_t h_t(x) \right) $$
This means:
- Each learner votes with a weight $\alpha_t$
- The sign of the total weighted vote determines the final class (+1 or -1)
Final prediction is:
$$ \hat{y}(x) = \text{sign} \left( \alpha_1 h_1(x) + \alpha_2 h_2(x) + \alpha_3 h_3(x) \right) $$
$$ \hat{y}(x) = \text{sign} (0.4236 \cdot 1 + 0.8959 \cdot 1 + 0.9127 \cdot (-1)) = \text{sign}(0.4068) = +1 $$
So, the final predicted class is +1.
Python Example¶
import numpy as np
import matplotlib.pyplot as plt

# Define 10 data points with initial weights
n_samples = 10
initial_weight = 1 / n_samples
weights = np.full(n_samples, initial_weight)

# Assume the true labels and predictions of weak learners
true_labels = np.array([1, 1, -1, 1, -1, 1, -1, 1, 1, -1])  # Labels in {-1, +1}

# Predictions by each weak learner
h1_pred = np.array([1, 1, 1, 1, -1, 1, 1, 1, 1, -1])
h2_pred = np.array([1, 1, -1, 1, -1, 1, -1, 1, 1, -1])
h3_pred = np.array([-1, -1, -1, 1, -1, 1, -1, 1, 1, -1])

# Assume error rates for each learner
errors = np.array([0.3, 0.2, 0.25])
alphas = 0.5 * np.log((1 - errors) / errors)

# Compute final prediction
def final_prediction(x1, x2, x3, a1, a2, a3):
    vote = a1 * x1 + a2 * x2 + a3 * x3
    return np.sign(vote), vote

# Compute predictions and weighted votes
final_preds = []
weighted_votes = []
for i in range(n_samples):
    pred, vote = final_prediction(h1_pred[i], h2_pred[i], h3_pred[i], alphas[0], alphas[1], alphas[2])
    final_preds.append(pred)
    weighted_votes.append(vote)

# Visualize
plt.figure(figsize=(10, 6))
colors = ['green' if p == 1 else 'red' for p in final_preds]
plt.bar(range(1, n_samples+1), weighted_votes, color=colors)
plt.axhline(0, color='black', linestyle='--')
plt.xlabel('Sample Index')
plt.ylabel('Weighted Vote')
plt.title('Final AdaBoost Prediction (Green = +1, Red = -1)')
plt.show()

final_preds
[1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, 1.0, -1.0]
Variants of AdaBoost¶
Let's check some of the variants of AdaBoost now.
Discrete AdaBoost¶
This is the original version, where each weak learner outputs a class label (not a probability) and sample weights are updated based on misclassification. It is what scikit-learn's AdaBoostClassifier implements when algorithm='SAMME' is used on a binary problem.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    algorithm='SAMME',
    learning_rate=1.0
)
clf.fit(X_train, y_train)
Here each stump’s vote is weighted by its accuracy; sample weights are adjusted iteratively based on misclassifications.
LogitBoost¶
LogitBoost is derived by minimizing the logistic loss (log‑likelihood), aligning boosting with logistic regression. It chooses base learners to approximate a Newton step for logistic loss, making the model robust in probabilistic terms.
You can use a lightweight implementation available via pip:
pip install logitboost
from logitboost import LogitBoost
clf = LogitBoost(n_estimators=50)
clf.fit(X_train, y_train)
# Supports both binary and multiclass
In LogitBoost, the base learner acts as a regression stump that fits the pseudo-residuals (working responses) rather than the class labels.
SAMME / SAMME.R for Multiclass¶
While AdaBoost was designed for binary classification, SAMME (Stagewise Additive Modeling using a Multi-class Exponential loss) extends it to multiclass tasks. It adds a correction term log(K−1) to the estimator weight to ensure convergence as long as a weak learner exceeds random accuracy (>1/K), rather than the 50% required for two classes.
- SAMME uses class labels.
- SAMME.R uses probabilities and generalizes Real AdaBoost to multiclass.
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    algorithm='SAMME.R'  # deprecated in recent scikit-learn; kept here to illustrate the variant
)
clf.fit(X_train, y_train)  # where y_train has >2 classes
This yields smoother, probabilistic updates and often better multiclass performance.
Coding AdaBoost Classifier¶
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
# Create toy dataset
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)
y = np.where(y == 0, -1, 1)  # AdaBoost often assumes -1, 1

clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=10, learning_rate=1.0, algorithm='SAMME')
clf.fit(X, y)
AdaBoostClassifier(algorithm='SAMME', estimator=DecisionTreeClassifier(max_depth=1), n_estimators=10)
def plot_boundary(model, X, y):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                         np.linspace(y_min, y_max, 500))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k')
    plt.title("AdaBoost Decision Boundary")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.show()

plot_boundary(clf, X, y)
Coding AdaBoost with different Base Learner¶
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_wine
from sklearn import metrics
datasets = load_wine()
inputs, targets = datasets.data, datasets.target
targets
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
svc = SVC(probability=True, kernel='linear')
# Create an AdaBoost classifier object with SVC as the base learner
abc = AdaBoostClassifier(n_estimators=50, estimator=svc, learning_rate=1, algorithm='SAMME')
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(inputs, targets, test_size=0.1, random_state=42, stratify=targets)
model = abc.fit(x_train, y_train)
y_pred = model.predict(x_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.9444444444444444