In-depth Detail on Gradient Boosting Classifier¶

Knowledge of Loss Functions¶

We need to know a few things before we get into the Gradient Boosting Classifier. The first is the loss function.

A loss function measures how wrong a model's prediction is compared to the true value. It tells the model how far off it is so that it can adjust during training. In regression, we predict continuous values such as prices, temperature, or age. In classification, we predict categories or labels such as spam or not spam, cat or dog, etc.

A common loss function for regression is Mean Squared Error (MSE)

Formula:

$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

Where, $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples

Suppose you are predicting house prices.

House | Actual Price (y) | Predicted Price (ŷ) | Error (y - ŷ) | Squared Error
------|------------------|---------------------|---------------|--------------
1     | 100              | 90                  | 10            | 100
2     | 150              | 160                 | -10           | 100
3     | 200              | 210                 | -10           | 100

$$ \text{MSE} = \frac{100 + 100 + 100}{3} = 100 $$

In [1]:
import numpy as np
from sklearn.metrics import mean_squared_error

# Actual and predicted values
y_true = np.array([100, 150, 200])
y_pred = np.array([90, 160, 210])

# Calculate MSE manually
errors = y_true - y_pred
squared_errors = errors ** 2
mse_manual = np.mean(squared_errors)

# Or use sklearn
mse_sklearn = mean_squared_error(y_true, y_pred)

print("Manual MSE:", mse_manual)
print("Sklearn MSE:", mse_sklearn)
Manual MSE: 100.0
Sklearn MSE: 100.0

A common loss function for classification is Log Loss (Cross-Entropy Loss)

Binary classification formula:

$$ \text{Log Loss} = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \cdot \log(\hat{p}_i) + (1 - y_i) \cdot \log(1 - \hat{p}_i) \right] $$

Where, $y_i$ is the actual label (0 or 1) and $\hat{p}_i$ is the predicted probability for class 1

Example:

Sample | Actual Class (y) | Predicted Prob of class 1 (p̂) | Log Loss Component
-------|------------------|--------------------------------|--------------------
1      | 1                | 0.9                            | $-\log(0.9) = 0.105$
2      | 0                | 0.2                            | $-\log(1 - 0.2) = -\log(0.8) = 0.223$
3      | 1                | 0.6                            | $-\log(0.6) = 0.511$

$$ \text{Log Loss} = \frac{0.105 + 0.223 + 0.511}{3} = 0.28 $$

A lower log loss means the model's predicted probabilities are closer to the true labels, i.e. the predictions are both confident and correct. In short: lower log loss = better model performance in classification tasks.

In [2]:
from sklearn.metrics import log_loss

# Actual binary labels
y_true = [1, 0, 1]

# Predicted probabilities for class 1
y_prob = [0.9, 0.2, 0.6]

# Compute Log Loss
loss = log_loss(y_true, y_prob)
print("Log Loss:", loss)
Log Loss: 0.2797765635793423

For Multi-Class Classification

Suppose we have 3 classes, and the model predicts a probability distribution for each class.

Example:

Sample | True Class | Predicted Probs (class 0, 1, 2)
-------|------------|---------------------------------
1      | 2          | [0.1, 0.2, 0.7]
2      | 1          | [0.2, 0.6, 0.2]

By default, log_loss infers the set of classes from y_true. If y_true does not include every class present in y_pred_proba, you must specify the full list of classes using the labels argument.

In [6]:
import numpy as np
from sklearn.metrics import log_loss

# True labels (classes 0, 1, 2 exist, but only 1 and 2 appear in y_true)
y_true = [2, 1]

# Predicted probabilities for all 3 classes (0, 1, 2)
y_pred_proba = np.array([
    [0.1, 0.2, 0.7],  # Sample 1
    [0.2, 0.6, 0.2]   # Sample 2
])

# Specify all possible class labels
loss = log_loss(y_true, y_pred_proba, labels=[0, 1, 2])
print("Multi-class Log Loss:", loss)
Multi-class Log Loss: 0.4337502838523616

How Log Loss Differs in Binary vs Multi-Class Classification¶

Binary Classification¶

In binary classification, you predict a single probability for the positive class (class 1). The log loss formula is:

$$ \text{LogLoss} = - \frac{1}{n} \sum_{i=1}^n \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] $$

You only predict one probability per sample (for class 1). The other probability (for class 0) is simply $1 - p_i$

Multi-Class Classification¶

In multi-class classification (say for 3 classes), you predict a probability distribution across all classes:

$$ \text{LogLoss} = - \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^{k} y_{ij} \log(p_{ij}) $$

Where:

  • $y_{ij} = 1$ if sample $i$'s true class is $j$, else 0
  • $p_{ij}$ is the predicted probability of class $j$ for sample $i$
  • $k$ is the number of classes

So, for each sample, the model is penalized based on how far the probability assigned to the true class is from 1. The more confident and correct the prediction, the lower the log loss.
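
To make the double sum concrete, here is a minimal sketch (an illustrative addition, reusing the two-sample multi-class example from the earlier code cell) that one-hot encodes the true classes and evaluates the formula directly. Only the true-class term survives for each sample, and the result should match the value of about 0.434 that sklearn's log_loss returned above.

In [ ]:
import numpy as np
from sklearn.metrics import log_loss

# Same two samples as before: true classes 2 and 1, probabilities over classes 0, 1, 2
y_true = np.array([2, 1])
p = np.array([
    [0.1, 0.2, 0.7],
    [0.2, 0.6, 0.2],
])

# One-hot encode y_true so that y_ij = 1 only for sample i's true class j
y_onehot = np.eye(3)[y_true]

# Double sum from the formula: -(1/n) * sum_i sum_j y_ij * log(p_ij)
manual = -np.mean(np.sum(y_onehot * np.log(p), axis=1))

print("Manual multi-class log loss:", manual)
print("sklearn log_loss           :", log_loss(y_true, p, labels=[0, 1, 2]))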

Knowledge of Gradient Descent¶

Gradient Descent is an optimization algorithm used to minimize a loss function by updating model parameters in the opposite direction of the gradient of the loss with respect to those parameters.

The core idea: start with initial values for the model parameters, compute the gradient (the direction of steepest increase of the loss), and take a step in the opposite direction to reduce the loss. For regression, it minimizes MSE by updating weights based on the difference between predicted and actual values. For classification, it minimizes Log Loss by updating weights based on the difference between predicted probability and actual class.

Mathematical Update Rule

$$ \theta := \theta - \alpha \cdot \frac{\partial L}{\partial \theta} $$

Where:

  • $\theta$: model parameter(s)
  • $L$: loss function
  • $\alpha$: learning rate (step size)
  • $\frac{\partial L}{\partial \theta}$: gradient of the loss with respect to parameter
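
As a minimal illustration of this update rule (the loss, starting point, and learning rate below are arbitrary choices for demonstration, not taken from the examples that follow), consider minimizing $L(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$:

In [ ]:
# Minimal sketch of theta := theta - alpha * dL/dtheta
# on an illustrative loss L(theta) = (theta - 3)^2, which is minimized at theta = 3
theta = 0.0   # arbitrary starting value
alpha = 0.1   # learning rate

for step in range(10):
    grad = 2 * (theta - 3)        # dL/dtheta
    theta = theta - alpha * grad  # step against the gradient
    print(f"step {step + 1}: theta = {theta:.4f}, loss = {(theta - 3) ** 2:.4f}")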

Gradient Descent in Regression¶

Loss Function: For regression, we often use Mean Squared Error (MSE):

$$ L = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$

If we use Linear Regression:

$$ \hat{y}_i = w x_i + b $$

Parameters: $w$ (weight), $b$ (bias)

Compute Gradients

$$ \frac{\partial L}{\partial w} = -\frac{2}{n} \sum x_i (y_i - (w x_i + b)) $$

$$ \frac{\partial L}{\partial b} = -\frac{2}{n} \sum (y_i - (w x_i + b)) $$

Weight Update

$w := w - \alpha \cdot \frac{\partial L}{\partial w}$

Numerical Example for Regression¶

Suppose:

  • Data: $X = [1, 2], Y = [2, 4]$
  • Initial guess: $w = 0, b = 0$
  • Learning rate: $\alpha = 0.1$
Step 1: Predictions¶

$$ \hat{y} = w x + b = 0 $$

Step 2: Gradients¶

$$ \frac{\partial L}{\partial w} = -\frac{2}{2} [1(2-0)+2(4-0)] = -(2+8)= -10 $$

$$ \frac{\partial L}{\partial b} = -\frac{2}{2} [(2-0)+(4-0)] = -(6)= -6 $$

Step 3: Update Parameters¶

$$ w = 0 - 0.1 * (-10) = 1 $$

$$ b = 0 - 0.1 * (-6) = 0.6 $$

Next iteration will use updated $w=1, b=0.6$ and repeat until convergence.

In [7]:
import numpy as np

# Data
X = np.array([1, 2])
Y = np.array([2, 4])

# Initialize parameters
w, b = 0.0, 0.0
alpha = 0.1
epochs = 10

for i in range(epochs):
    # Predictions
    Y_pred = w * X + b
    
    # Compute gradients
    dw = (-2 / len(X)) * np.sum(X * (Y - Y_pred))
    db = (-2 / len(X)) * np.sum(Y - Y_pred)
    
    # Update parameters
    w -= alpha * dw
    b -= alpha * db
    
    loss = np.mean((Y - Y_pred)**2)
    print(f"Epoch {i+1}: w={w:.3f}, b={b:.3f}, Loss={loss:.3f}")
Epoch 1: w=1.000, b=0.600, Loss=10.000
Epoch 2: w=1.320, b=0.780, Loss=1.060
Epoch 3: w=1.426, b=0.828, Loss=0.173
Epoch 4: w=1.465, b=0.835, Loss=0.083
Epoch 5: w=1.482, b=0.828, Loss=0.073
Epoch 6: w=1.492, b=0.818, Loss=0.070
Epoch 7: w=1.501, b=0.807, Loss=0.068
Epoch 8: w=1.508, b=0.795, Loss=0.066
Epoch 9: w=1.516, b=0.784, Loss=0.064
Epoch 10: w=1.523, b=0.772, Loss=0.062

Gradient Descent in Classification¶

For classification, the most common method is Logistic Regression using Log Loss as the loss function.

Model

$$ \hat{p} = \sigma(w x + b) = \frac{1}{1 + e^{-(w x + b)}} $$

Loss Function

Log Loss for binary classification:

$$ L = -\frac{1}{n} \sum [y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)] $$

Compute Gradients

$$ \frac{\partial L}{\partial w} = \frac{1}{n} \sum ( \hat{p}_i - y_i ) x_i $$

$$ \frac{\partial L}{\partial b} = \frac{1}{n} \sum ( \hat{p}_i - y_i ) $$

Weight Update

$w := w - \alpha \cdot \frac{\partial L}{\partial w}$

Numerical Example for Classification¶

Data:

  • $X = [1, 2], Y = [0, 1]$
  • Initial: $w=0, b=0$
  • Learning rate: $\alpha = 0.1$
Step 1: Predictions¶

$$ \hat{p}_1 = 0.5, \hat{p}_2 = 0.5 $$

Step 2: Gradients¶

$$ \frac{\partial L}{\partial w} = \frac{1}{2}\left[(0.5-0) * 1 + (0.5-1) * 2\right] = \frac{0.5 - 1}{2} = -0.25 $$

$$ \frac{\partial L}{\partial b} = \frac{1}{2}\left[(0.5-0) + (0.5-1)\right] = \frac{0.5 - 0.5}{2} = 0 $$

Update:

$$ w = 0 - 0.1 * (-0.25) = 0.025 $$

$$ b = 0 - 0.1 * 0 = 0 $$

In [9]:
import numpy as np

# Data
X = np.array([1, 2])
Y = np.array([0, 1])

# Initialize
w, b = 0.0, 0.0
alpha = 0.1
epochs = 10

for i in range(epochs):
    # Predictions using sigmoid
    z = w * X + b
    Y_pred = 1 / (1 + np.exp(-z))
    
    # Compute gradients
    dw = np.mean((Y_pred - Y) * X)
    db = np.mean(Y_pred - Y)
    
    # Update
    w -= alpha * dw
    b -= alpha * db
    
    # Compute log loss
    loss = -np.mean(Y * np.log(Y_pred) + (1 - Y) * np.log(1 - Y_pred))
    print(f"Epoch {i+1}: w={w:.3f}, b={b:.3f}, Loss={loss:.4f}")
Epoch 1: w=0.025, b=0.000, Loss=0.6931
Epoch 2: w=0.048, b=-0.001, Loss=0.6871
Epoch 3: w=0.070, b=-0.003, Loss=0.6818
Epoch 4: w=0.091, b=-0.005, Loss=0.6770
Epoch 5: w=0.111, b=-0.009, Loss=0.6728
Epoch 6: w=0.129, b=-0.013, Loss=0.6690
Epoch 7: w=0.147, b=-0.017, Loss=0.6655
Epoch 8: w=0.163, b=-0.022, Loss=0.6623
Epoch 9: w=0.179, b=-0.028, Loss=0.6594
Epoch 10: w=0.194, b=-0.034, Loss=0.6567

Getting into Gradient Boosting¶

Gradient Boosting is a machine learning technique for building predictive models. It works by combining many simple models (often decision trees) to create a more accurate and powerful model. Gradient Boosting builds models in sequence. Each new model is trained to correct the errors (residuals) made by the previous ones. It does this by minimizing a loss function, using a method inspired by gradient descent.

Let’s say you want to predict the selling price of houses based on square footage. Your first model might predict that every house costs 50 lakh, which is the average of all prices in your dataset. For a house that actually costs 60 lakh, your model is off by 10 lakh. For another that costs 40 lakh, it is off by -10 lakh. Now, build a small decision tree that takes square footage into account and tries to predict these errors. Maybe it learns that houses larger than 1000 sq. ft tend to be underpriced and need their predictions raised. Add the corrections from the new tree to the original predictions. So, the house that was predicted at 50 lakh may now be predicted at 54 lakh if the tree says to add 4 lakh. This process is repeated with new trees that try to fix the remaining errors until the model is accurate enough or starts to overfit.

Gradient Boosting is powerful because:

  • It can handle many types of problems, such as predicting numbers (regression), classifying things (classification), or ranking results (used in search engines).
  • It works well with many kinds of data and often gives better results than other models.
  • You can use any differentiable loss function, which means it's flexible for custom needs.

Step-by-Step Explanation of Gradient Boosting¶

Step 1: Define a Loss Function¶
  • In regression, a common choice is the Mean Squared Error

    L(y, F(x)) = (y - F(x))^2

  • In binary classification, the Cross-Entropy or Log Loss is often used

    L(y, F(x)) = -[y log(p) + (1 - y) log(1 - p)], where p = sigmoid(F(x))

Different loss functions can be chosen for different tasks, provided they are differentiable.

Step 2: Initialize the Model¶

The first model is initialized with a constant value that minimizes the loss function over all training data.

  • For regression (mean squared error), the best constant is the average of target values

    F0(x) = mean(y)

  • For classification (log loss), the model starts with the log-odds of the target class distribution

    F0(x) = log(p / (1 - p)), where p is the proportion of class 1

This initial model serves as a base prediction.
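
As a small sketch of this initialization step (the toy targets below are the ones used in the numerical examples later in this notebook), the two constants can be computed directly:

In [ ]:
import numpy as np

# Regression: the constant that minimizes MSE is the mean of the targets
y_reg = np.array([4, 5, 6, 8])           # toy targets from the regression example below
F0_reg = y_reg.mean()

# Classification: the constant that minimizes log loss is the log-odds of class 1
y_clf = np.array([0, 0, 1, 1])           # toy labels from the classification example below
p = y_clf.mean()
F0_clf = np.log(p / (1 - p))

print("F0 for regression (mean of y):", F0_reg)       # 5.75
print("F0 for classification (log-odds):", F0_clf)    # 0.0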

Step 3: Compute the Pseudo-Residuals¶

At every iteration, compute the pseudo-residuals, which are the negative gradients of the loss function with respect to the current model’s predictions.

  • For regression with MSE, residuals are simply

    r_i = y_i - Fm(x_i)

  • For classification with log loss, residuals involve gradients of log loss with respect to prediction scores

The residuals show how much and in what direction the current model is wrong.

Step 4: Fit a New Model to the Residuals¶

Train a new weak learner (usually a shallow decision tree) to predict the pseudo-residuals. This learner identifies the patterns in the remaining errors that the ensemble needs to fix. This is equivalent to learning a function h(x) that approximates the gradient of the loss function.

Step 5: Update the Model¶

Update the model by adding the new learner’s predictions to the current ensemble. The contribution of the new model is scaled by a learning rate to prevent overfitting.

Fm(x) = Fm-1(x) + η * h(x)

where η is the learning rate, typically a small value like 0.1.

Step 6: Repeat the Process¶

Repeat steps 3 to 5 for a fixed number of iterations or until the model performance no longer improves. Each iteration brings the prediction closer to the true value.

Numerical Example (Regression)¶

Assume we are solving a regression problem. The dataset is as follows:

x | y
--|--
1 | 4
2 | 5
3 | 6
4 | 8

Step 1: Initialize the prediction. The initial prediction is the mean of y:

F0 = (4 + 5 + 6 + 8) / 4 = 5.75

Step 2: Compute residuals

x | y | F0   | Residual = y - F0
--|---|------|-------------------
1 | 4 | 5.75 | -1.75
2 | 5 | 5.75 | -0.75
3 | 6 | 5.75 | 0.25
4 | 8 | 5.75 | 2.25

Step 3: Train a decision tree on the residuals. Suppose the tree learns a function h1(x) that approximately predicts these residuals.

Step 4: Update the prediction. Let η = 0.1. The new model becomes

F1(x) = F0(x) + 0.1 * h1(x)

Repeat the process to get better and better predictions.
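
To make these steps concrete before moving to scikit-learn, here is a minimal from-scratch sketch of the loop described above, using depth-1 DecisionTreeRegressor stumps as the weak learners and η = 0.1. It follows the simplified procedure in this section rather than scikit-learn's exact internals, so treat it as illustrative.

In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data from the numerical example
X = np.array([[1], [2], [3], [4]])
y = np.array([4.0, 5.0, 6.0, 8.0])

eta = 0.1        # learning rate
n_rounds = 3     # number of boosting rounds

# Step 2: initialize with the constant that minimizes MSE (the mean of y)
F = np.full_like(y, y.mean())
print("F0:", F)

for m in range(1, n_rounds + 1):
    # Step 3: pseudo-residuals = negative gradient of MSE = y - F(x)
    residuals = y - F

    # Step 4: fit a shallow tree (weak learner) to the residuals
    tree = DecisionTreeRegressor(max_depth=1)
    tree.fit(X, residuals)

    # Step 5: add the scaled correction to the ensemble
    F = F + eta * tree.predict(X)
    print(f"F{m}: {np.round(F, 4)}, MSE = {np.mean((y - F) ** 2):.4f}")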

Python Code Example with Visualization (Regression)¶

In [10]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
In [11]:
# Sample dataset
X = np.array([[1], [2], [3], [4]])
y = np.array([4, 5, 6, 8])

# Fit Gradient Boosting model
model = GradientBoostingRegressor(n_estimators=3, learning_rate=0.1, max_depth=1)
model.fit(X, y)
Out[11]:
GradientBoostingRegressor(max_depth=1, n_estimators=3)
In [12]:
# Predict
pred = model.predict(X)

# Plot predictions
plt.figure(figsize=(6,4))
plt.scatter(X, y, color='black', label='Actual')
plt.plot(X, pred, color='red', marker='o', label='Prediction')
plt.title('Gradient Boosting Regression (3 estimators)')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()
[Figure: Gradient Boosting Regression (3 estimators), actual points vs. model predictions]

This visualizes how the model fits the data through a series of weak learners.

Numerical Example (Binary Classification)¶

Assume we are solving a binary classification problem. The dataset is as follows:

x | y
--|--
1 | 0
2 | 0
3 | 1
4 | 1

Here y = 0 represents class 0 and y = 1 represents class 1.

We aim to learn a function F(x) such that:

  • P(y=1 | x) = sigmoid(F(x))
  • P(y=0 | x) = 1 - sigmoid(F(x))

We use log loss (cross-entropy) as the loss function:

$$ L(y, F(x)) = -[y \cdot \log(p) + (1 - y) \cdot \log(1 - p)] $$

where $p = \text{sigmoid}(F(x)) = \frac{1}{1 + e^{-F(x)}}$

Step 1: Initialize Predictions (F₀)¶

We initialize predictions with the log-odds of the positive class.

$$ p = \frac{\sum y_i}{n} = \frac{1 + 1}{4} = 0.5 $$

$$ F_0 = \log\left(\frac{p}{1 - p}\right) = \log\left(\frac{0.5}{0.5}\right) = \log(1) = 0 $$

So the initial prediction for all samples is:

$$ F_0(x) = 0 \quad \text{for all } x $$

Step 2: Compute Probabilities and Pseudo-Residuals¶

We calculate the predicted probabilities:

$$ \hat{p}_i = \text{sigmoid}(F_0(x_i)) = \frac{1}{1 + e^{-0}} = 0.5 $$

Now compute the pseudo-residuals, which are the gradients of the log loss with respect to F(x):

$$ r_i = y_i - \hat{p}_i $$

x | y | F₀(x) | p̂ = sigmoid(F₀) | Residual r = y - p̂
--|---|-------|------------------|--------------------
1 | 0 | 0     | 0.5              | -0.5
2 | 0 | 0     | 0.5              | -0.5
3 | 1 | 0     | 0.5              | 0.5
4 | 1 | 0     | 0.5              | 0.5

Step 3: Fit a Decision Tree on Residuals¶

We now train a regression tree to predict these residuals using x as the input.

Suppose the tree learns a function h₁(x) that maps:

  • h₁(x) = -0.5 when x in {1, 2}
  • h₁(x) = 0.5 when x in {3, 4}

This tree has learned the pattern in the gradients.

Step 4: Update Model with Learning Rate η¶

Let us use a learning rate η = 0.1. Update the model:

$$ F_1(x) = F_0(x) + \eta \cdot h_1(x) $$

So:

  • For x = 1 or 2: F₁(x) = 0 + 0.1 × (–0.5) = –0.05
  • For x = 3 or 4: F₁(x) = 0 + 0.1 × (0.5) = 0.05

Now convert these updated scores to predicted probabilities:

$$ \hat{p} = \frac{1}{1 + e^{-F(x)}} $$

x | F₁(x) | p̂ = sigmoid(F₁(x))
--|-------|---------------------
1 | -0.05 | 0.4875
2 | -0.05 | 0.4875
3 | 0.05  | 0.5125
4 | 0.05  | 0.5125

These probabilities are slightly better than the initial 0.5 guess. Now, we compute residuals again using the new predictions and repeat the steps.

Step 5: Iterate to Improve the Model¶

At each new iteration:

  • Compute updated residuals based on current predictions
  • Fit a tree to residuals
  • Update prediction using learning rate and new tree output

This process continues for a fixed number of rounds or until the loss stabilizes.
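
The same loop can be written out for this classification example. The sketch below follows the simplified recipe above (each tree is fit directly to the residuals y - p̂ and added with learning rate η = 0.1); scikit-learn's GradientBoostingClassifier additionally rescales the tree's leaf values, so its probabilities will differ slightly from these hand-computed ones.

In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data from the numerical example
X = np.array([[1], [2], [3], [4]])
y = np.array([0.0, 0.0, 1.0, 1.0])

eta = 0.1
n_rounds = 3

# Step 1: initialize with the log-odds of the positive class (0 here, since p = 0.5)
p = y.mean()
F = np.full_like(y, np.log(p / (1 - p)))

for m in range(1, n_rounds + 1):
    # Step 2: probabilities and pseudo-residuals (negative gradient of log loss)
    p_hat = sigmoid(F)
    residuals = y - p_hat

    # Step 3: fit a shallow regression tree to the residuals
    tree = DecisionTreeRegressor(max_depth=1)
    tree.fit(X, residuals)

    # Step 4: update the raw scores with the scaled tree output
    F = F + eta * tree.predict(X)
    print(f"Round {m}: F = {np.round(F, 5)}, p_hat = {np.round(sigmoid(F), 4)}")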

Python Code for Classification Version¶

In [13]:
from sklearn.ensemble import GradientBoostingClassifier

# Data
X = np.array([[1], [2], [3], [4]])
y = np.array([0, 0, 1, 1])

# Fit Gradient Boosting Classifier
model = GradientBoostingClassifier(n_estimators=3, learning_rate=0.1, max_depth=1)
model.fit(X, y)

# Predict probabilities
probs = model.predict_proba(X)[:, 1]
preds = model.predict(X)

# Visualize
plt.figure(figsize=(6,4))
plt.scatter(X, y, color='black', label='Actual')
plt.plot(X, probs, color='blue', marker='o', label='Predicted Probability')
plt.title('Gradient Boosting Classification (3 estimators)')
plt.xlabel('X')
plt.ylabel('Predicted probability of class 1')
plt.legend()
plt.grid(True)
plt.show()
[Figure: Gradient Boosting Classification (3 estimators), actual labels and predicted probability of class 1]
In [14]:
# Show results
for i in range(len(X)):
    print(f"x = {X[i][0]}, Actual = {y[i]}, Predicted Prob = {probs[i]:.4f}, Predicted Class = {preds[i]}")
x = 1, Actual = 0, Predicted Prob = 0.3658, Predicted Class = 0
x = 2, Actual = 0, Predicted Prob = 0.3658, Predicted Class = 0
x = 3, Actual = 1, Predicted Prob = 0.6342, Predicted Class = 1
x = 4, Actual = 1, Predicted Prob = 0.6342, Predicted Class = 1

In-depth understanding of sequential model structure¶

Let's go deep into how Gradient Boosting works in a sequential manner taking classification problem, focusing on how each new model learns from the residuals (errors) of the previous model.

In gradient boosting, we build models one after another, not all at once. Each model learns to correct the mistakes made by the previous model. Instead of directly predicting the target values, every new model tries to predict the residuals (gradients): how far off the previous model was, and in which direction. The prediction keeps getting updated at every stage using:

$$ F_{m}(x) = F_{m-1}(x) + \eta \cdot h_m(x) $$

Here,

  • $F_{m}(x)$: prediction after mth model
  • $F_{m-1}(x)$: prediction after previous model
  • $h_m(x)$: prediction from the mth model trained on residuals
  • $\eta$: learning rate

Let’s use a small dataset with four samples:

x | y
--|--
1 | 0
2 | 0
3 | 1
4 | 1

Let’s go through 3 iterations (stages) of gradient boosting with this dataset.

Step 1: Initial Prediction (F₀)¶

We start with a constant model that predicts the same value for all samples. Since the loss is log loss, the best constant value is the log odds of the positive class.

Total positives = 2 and Total samples = 4

So, probability of class 1: $p = 2 / 4 = 0.5$

$$ F_0(x) = \log\left(\frac{p}{1 - p}\right) = \log\left(\frac{0.5}{0.5}\right) = 0 $$

This means, initially the model predicts log-odds = 0, so probability of class 1 is:

$$ p = \frac{1}{1 + e^{-0}} = 0.5 $$

Step 2: Compute Residuals (Gradients of Log Loss)¶

Residuals are calculated using:

$$ r_i = y_i - \hat{p}_i $$

Since all predicted probabilities are 0.5:

x | y | p̂ = sigmoid(F₀) | Residual (y - p̂)
--|---|------------------|-------------------
1 | 0 | 0.5              | -0.5
2 | 0 | 0.5              | -0.5
3 | 1 | 0.5              | 0.5
4 | 1 | 0.5              | 0.5

Step 3: Train First Model h₁(x) on Residuals¶

We now fit a regression tree to predict the residuals.

Let’s say the tree splits the data like this:

  • For x = 1, 2 → predict -0.5
  • For x = 3, 4 → predict +0.5

This means the first tree has learned the residuals from step 2.

Step 4: Update Prediction¶

Let’s use a learning rate $\eta = 0.1$

Now, we update the predictions:

$$ F_1(x) = F_0(x) + 0.1 \cdot h_1(x) $$

x | F₁(x) | p̂ = sigmoid(F₁(x))
--|-------|------------------------------
1 | -0.05 | 1 / (1 + e^{0.05}) ≈ 0.4875
2 | -0.05 | 0.4875
3 | 0.05  | 0.5125
4 | 0.05  | 0.5125

The probabilities have shifted slightly in the correct direction.

Step 5: Compute New Residuals (Stage 2)¶

Again we compute residuals:

$$ r_i = y_i - \hat{p}_i $$

x | y | p̂ (from F₁) | Residual (y - p̂)
--|---|--------------|-------------------
1 | 0 | 0.4875       | -0.4875
2 | 0 | 0.4875       | -0.4875
3 | 1 | 0.5125       | 0.4875
4 | 1 | 0.5125       | 0.4875

Step 6: Fit Second Model h₂(x) on New Residuals¶

Again, we train a regression tree on these new residuals. It may learn:

  • x = 1, 2 → predict -0.4875
  • x = 3, 4 → predict +0.4875
Step 7: Update Predictions Again¶

$$ F_2(x) = F_1(x) + 0.1 \cdot h_2(x) $$

x | F₁(x) | h₂(x)   | F₂(x)    | p̂ = sigmoid(F₂(x))
--|-------|---------|----------|---------------------
1 | -0.05 | -0.4875 | -0.09875 | ≈ 0.4753
2 | -0.05 | -0.4875 | -0.09875 | ≈ 0.4753
3 | 0.05  | 0.4875  | 0.09875  | ≈ 0.5247
4 | 0.05  | 0.4875  | 0.09875  | ≈ 0.5247

We now see the probabilities getting more confident.

Step 8: Repeat Further¶

At each stage:

  1. Compute residuals
  2. Fit a new tree
  3. Update the prediction
  4. Convert to probabilities

Each tree gradually reduces the error and pushes the predicted probability closer to the correct class.

Final Model¶

After many such steps, the final model is:

$$ F_M(x) = F_0(x) + \eta \cdot h_1(x) + \eta \cdot h_2(x) + \dots + \eta \cdot h_M(x) $$

And the final prediction is:

$$ P(y=1|x) = \frac{1}{1 + e^{-F_M(x)}} $$

In [15]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier

# Data
X = np.array([[1], [2], [3], [4]])
y = np.array([0, 0, 1, 1])

# Gradient Boosting Classifier with staged prediction
model = GradientBoostingClassifier(n_estimators=3, learning_rate=0.1, max_depth=1)
model.fit(X, y)

# Staged probability predictions
probs = list(model.staged_predict_proba(X))

# Plot how probabilities evolve for sample x = 1 to x = 4
plt.figure(figsize=(8,5))
for i in range(4):
    plt.plot(range(1, 4), [p[i][1] for p in probs], marker='o', label=f'Sample x={X[i][0]}')

plt.title("How predictions evolve with each boosting stage")
plt.xlabel("Stage")
plt.ylabel("Predicted probability for class 1")
plt.grid(True)
plt.legend()
plt.show()
[Figure: How predictions evolve with each boosting stage, predicted probability of class 1 per sample]

Understanding Hyperparameters in Gradient Boosting¶

To effectively use gradient boosting, it is important to understand the key hyperparameters that influence its performance. One of the most central hyperparameters is the number of estimators. This refers to the total number of weak learners or decision trees that are trained one after the other. A small number may not capture the complexity of the data and result in underfitting, while too many trees can overfit the training data, making the model less generalizable.

The learning rate controls how much each additional tree contributes to the overall prediction. A smaller learning rate means each tree has a smaller influence, which generally leads to better generalization, though it requires a higher number of trees to reach optimal performance. In practice, a learning rate between 0.01 and 0.2 is often chosen, but the right value depends on the complexity of the dataset and should be tuned using cross-validation.

Another critical parameter is the depth of each individual tree. Shallow trees with fewer levels are less likely to overfit and are typically used as weak learners in boosting. Deeper trees can model more complex relationships but risk fitting to noise in the data. Usually, depths of three to five are a good starting point.

The subsample parameter introduces randomness by using only a fraction of the data to train each tree. This is similar in spirit to bagging, where only a subset of samples is used for learning. Setting the subsample value below one helps reduce variance and can make the model more robust, though if set too low it may hurt performance.

In addition to the above, other parameters such as minimum samples required to split a node or to be in a leaf node help control the tree’s complexity. These are especially useful in noisy datasets, where deeper splits might capture random fluctuations rather than true patterns.

Gradient boosting also allows for the selection of different loss functions depending on the type of problem. For classification problems, common loss functions include log loss and exponential loss. For regression tasks, squared error, absolute error, Huber loss, and quantile loss are often used. Huber and quantile loss are particularly useful when the data contain outliers or have asymmetric error distributions.

Regularization parameters also play a vital role in controlling overfitting. These include things like the minimum reduction in impurity required for a node to be split and limiting the number of features considered when splitting. When these are set appropriately, they ensure that the trees do not become overly complex and memorize training data.
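
A common way to choose these values in practice is cross-validated search. Below is a minimal sketch using GridSearchCV over a few of the hyperparameters discussed above, on the breast cancer dataset used later in this notebook; the grid values are illustrative starting points, not recommendations.

In [ ]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Illustrative grid over the hyperparameters discussed above
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 4))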

The Role and Importance of Boosting¶

Boosting techniques emerged as a solution to the problem where individual models are too weak to capture complex relationships. Instead of relying on a single strong model, boosting builds an ensemble of weak learners, where each model focuses on correcting the mistakes made by the previous ones. This method ensures that areas of the data space that were previously misclassified or poorly predicted are given more focus in the following iterations.

Gradient Boosting takes this a step further by framing the boosting process as an optimization problem. Instead of simply reweighting samples like in AdaBoost, gradient boosting trains each model to approximate the gradient of the loss function with respect to the prediction. This allows it to handle a wide variety of problems by simply defining the appropriate loss function and minimizing it iteratively.

Advantages of Gradient Boosting¶

One of the primary strengths of gradient boosting is its high prediction accuracy. When well-tuned, gradient boosting often outperforms other algorithms like logistic regression, support vector machines, and even random forests. The model is flexible enough to work with various types of data and loss functions, which makes it suitable for both classification and regression problems. Another important benefit is its ability to capture complex interactions between features, something many linear models struggle with. Gradient boosting also provides insight into feature importance, which can be helpful for model interpretation and variable selection.

Moreover, gradient boosting can handle missing values to some extent, and with techniques like early stopping and shrinkage, it is possible to make the model robust and prevent overfitting.
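
As a small illustration, the sketch below enables early stopping in scikit-learn (via validation_fraction and n_iter_no_change) and then lists the top impurity-based feature importances mentioned earlier; the specific settings are illustrative choices, not tuned values.

In [ ]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()

# Early stopping: hold out a validation fraction and stop once the score
# has not improved for n_iter_no_change consecutive iterations
gb = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.1,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=42,
)
gb.fit(data.data, data.target)
print("Trees actually built:", gb.n_estimators_)

# Impurity-based feature importances, largest first
importances = sorted(
    zip(data.feature_names, gb.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in importances[:5]:
    print(f"{name}: {score:.3f}")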

Limitations of Gradient Boosting¶

Despite its many strengths, gradient boosting is not without drawbacks. One of its major limitations is its relatively slow training time. Because the trees are built sequentially and each tree depends on the predictions from the previous one, it is difficult to parallelize the training process, unlike in random forests where trees are trained independently.

Gradient boosting models are also very sensitive to hyperparameters. A small change in learning rate or tree depth can significantly impact performance. Therefore, careful tuning and validation are essential to get the best out of this method. In addition, while the method performs well with clean and structured data, it may overfit noisy datasets unless proper regularization is applied.

Interpretability is another area where gradient boosting may fall short. While it is possible to compute feature importance, the internal structure of hundreds of trees is complex, making it difficult to understand how the final decision is made. Techniques like SHAP values and LIME can help provide some post-hoc explanations, but the model itself remains a black box to a large extent.

Use Cases Where Gradient Boosting Excels¶

Gradient boosting has found applications in a wide range of real-world problems. In the financial sector, it is commonly used for credit scoring, fraud detection, and loan default prediction. The ability to capture nonlinear interactions and focus on hard-to-predict examples makes it highly effective in these areas. In marketing and customer analytics, gradient boosting is used to predict customer churn, lifetime value, and segmentation. Because of its ability to rank features, it also helps businesses identify key drivers behind customer behavior. In healthcare, gradient boosting is applied in disease prediction, risk modeling, and even in genomic data analysis, where it deals with highly structured and imbalanced datasets. Its precision makes it suitable for applications where false positives and false negatives carry different costs.

The model is also popular in competitions such as Kaggle, where structured data dominates. Its performance has consistently made it the preferred choice among data scientists for winning solutions.

Practical Implementation of Gradient Boosting¶

In [16]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
In [17]:
# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the model
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    random_state=42
)
gb.fit(X_train, y_train)
Out[17]:
GradientBoostingClassifier(random_state=42, subsample=0.8)
In [18]:
# Make predictions
y_pred = gb.predict(X_test)

# Evaluate accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 0.9590643274853801

Types of Gradient Boosting¶

There are several variants of Gradient Boosting that have emerged to improve its speed, scalability, and robustness. While they all build upon the core idea of sequentially adding weak learners to minimize a loss function, each variant has specific design improvements that address limitations of the original method.

1. Traditional Gradient Boosting (GBM)¶

This is the original gradient boosting algorithm, often referred to as Gradient Boosting Machines (GBMs). It was introduced to boost model accuracy through additive modeling and gradient descent optimization. It trains one weak learner (usually a decision tree) at a time and tries to correct the mistakes made by previous models.

Key Characteristics¶
  • Uses CART (Classification and Regression Trees) as base learners
  • Sensitive to overfitting unless parameters are tuned well
  • Training is sequential, hence slower
  • Cannot handle sparse data or missing values natively
In [19]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
Accuracy: 0.951048951048951

2. XGBoost (Extreme Gradient Boosting)¶

XGBoost was designed to optimize speed and performance, especially for large datasets and machine learning competitions like Kaggle. It improves over traditional GBM by offering regularization, parallel computation, and handling of missing data.

Key Characteristics¶

  • Includes L1 and L2 regularization to prevent overfitting
  • Supports parallelized tree construction
  • Efficient for sparse input and missing values
  • Provides built-in cross-validation, early stopping, and GPU acceleration
In [25]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'])

model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, 
                          max_depth=3, eval_metric='logloss')
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
Accuracy: 0.9790209790209791
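
The sketch below illustrates two of the features listed above: L1/L2 regularization (reg_alpha, reg_lambda) and early stopping on a held-out evaluation set. Note that the early-stopping API has moved between versions of xgboost: in recent releases early_stopping_rounds is a constructor argument, while older releases expect it in fit(); adjust for your installed version.

In [ ]:
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42
)

# L1/L2 regularization plus early stopping on a held-out evaluation set
model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.1,
    max_depth=3,
    reg_alpha=0.1,       # L1 regularization
    reg_lambda=1.0,      # L2 regularization
    eval_metric="logloss",
    early_stopping_rounds=10,   # constructor argument in recent xgboost versions
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

print("Best iteration:", model.best_iteration)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))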

3. LightGBM (Light Gradient Boosting Machine)¶

LightGBM was developed by Microsoft to further optimize gradient boosting for large-scale and high-dimensional data. It introduces novel techniques to drastically reduce training time and memory usage.

Key Characteristics¶
  • Uses histogram-based algorithms to bucket continuous features
  • Grows trees leaf-wise rather than level-wise
  • Can handle categorical features directly
  • Excellent for large datasets with many features
In [34]:
import lightgbm as lgb

# Load dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

model = lgb.LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    force_col_wise=True,
    verbosity=-1
)

model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
Accuracy: 0.972027972027972
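
To illustrate the native categorical-feature support mentioned above, here is a small sketch on made-up data: when the input is a pandas DataFrame, columns with the 'category' dtype are treated as categorical by LightGBM without any manual encoding. The data and column names are invented purely for illustration.

In [ ]:
import numpy as np
import pandas as pd
import lightgbm as lgb

# Made-up data: one categorical and one numeric feature (illustrative only)
rng = np.random.default_rng(0)
n = 200
city = rng.choice(["north", "south", "east"], size=n)
income = rng.normal(50, 10, size=n)
target = ((city == "north") & (income > 50)).astype(int)

df = pd.DataFrame({"city": pd.Categorical(city), "income": income})

# Columns with pandas 'category' dtype are picked up as categorical features
clf = lgb.LGBMClassifier(n_estimators=50, learning_rate=0.1, verbosity=-1)
clf.fit(df, target)

print("Training accuracy:", (clf.predict(df) == target).mean())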

4. CatBoost (Categorical Boosting)¶

CatBoost was developed by Yandex and is specifically designed to handle categorical variables without preprocessing. It also addresses the prediction shift problem caused by conventional target encoding of categories.

Key Characteristics¶
  • Handles categorical variables natively
  • Uses ordered boosting to reduce overfitting
  • No need for extensive preprocessing
  • Often robust on imbalanced datasets
In [37]:
from catboost import CatBoostClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=3, verbose=0)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
Accuracy: 0.9370629370629371
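
To show CatBoost's native handling of categorical variables, here is a small sketch on made-up data: the raw string column is passed as-is and identified through cat_features, so no manual encoding is needed. The data and column names are invented for illustration.

In [ ]:
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier

# Made-up data with a raw string column (illustrative only)
rng = np.random.default_rng(0)
n = 200
color = rng.choice(["red", "green", "blue"], size=n)
size = rng.normal(10, 2, size=n)
target = ((color == "red") & (size > 10)).astype(int)

df = pd.DataFrame({"color": color, "size": size})

# cat_features names the columns CatBoost should treat as categorical
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=3, verbose=0)
model.fit(df, target, cat_features=["color"])

print("Training accuracy:", (model.predict(df) == target).mean())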

Coding in Gradient Boosting¶

Classifying Digits Dataset¶

In [38]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_digits


X, y = load_digits(return_X_y=True)

train_X, test_X, train_y, test_y = train_test_split(X, y, 
                                                    test_size = 0.25, 
                                                    random_state = 42)

gbc = GradientBoostingClassifier(n_estimators=300,
                                 learning_rate=0.05,
                                 random_state=100,
                                 max_features=5)

gbc.fit(train_X, train_y)

pred_y = gbc.predict(test_X)

acc = accuracy_score(test_y, pred_y)
print("Gradient Boosting Classifier accuracy is : {:.2f}".format(acc))
Gradient Boosting Classifier accuracy is : 0.98

Predicting on Diabetes Dataset¶

In [47]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_diabetes


X, y = load_diabetes(return_X_y=True)

train_X, test_X, train_y, test_y = train_test_split(X, y, 
                                                    test_size = 0.20, 
                                                    random_state = 42)

gbr = GradientBoostingRegressor(loss='absolute_error',
                                learning_rate=0.1,
                                n_estimators=500,
                                max_depth = 4,
                                max_features = 4)

gbr.fit(train_X, train_y)

pred_y = gbr.predict(test_X)

test_rmse = mean_squared_error(test_y, pred_y) ** (1 / 2)

print('Root mean Square error: {:.2f}'.format(test_rmse))
Root mean Square error: 56.55