Random Forest: A Detailed Explanation¶

Introduction to Ensemble Methods and Random Forest¶

In machine learning, an ensemble method refers to a technique that combines the predictions of multiple base models to produce a single, stronger prediction. The core idea is simple: many weak learners can come together to form a strong learner, much like how multiple experts can make better decisions than a single one. Instead of relying on one model (which may have high bias or variance), ensemble methods aim to reduce variance, reduce bias, or improve predictions, depending on how they are constructed.

Random Forest is one of the most powerful and versatile machine learning algorithms that falls under the category of non-neural network methods. Despite not being a deep learning model, it often delivers competitive performance for both classification and regression problems. Its ability to produce high accuracy, along with its robustness to overfitting and minimal need for data preprocessing, makes it a favorite choice across many domains.

At its core, Random Forest is an ensemble method, which means it combines the predictive power of multiple base models - in this case, decision trees - to form a strong overall model. The foundation of Random Forest lies in the concept of "wisdom of the crowd", where the collective opinion of a group is often more accurate than that of any single member. Each decision tree in the forest is built independently, and each tree is trained on a different random subset of the data. Additionally, at every split in the tree, a random subset of features is selected. This introduces diversity among the trees and prevents them from becoming too similar or overfitting on the same patterns.

Because a Random Forest can consist of hundreds or even thousands of trees, the algorithm is often regarded as a black box - we may not be able to easily understand the internal logic or the specific path taken by every individual tree. This complexity naturally leads to the question: how do we determine the final prediction when each tree gives its own answer? The solution is quite elegant and simple. In classification tasks, Random Forest uses a technique known as majority voting. Each tree in the forest makes a prediction for the input data, and the class that receives the most votes across all the trees becomes the final output. For instance, if 100 trees are used, and 65 of them predict class A while 35 predict class B, the model will classify the input as class A. This majority voting ensures that the decision reflects the most consistent pattern seen across the diverse trees.

For regression problems, where the outputs are continuous values rather than categories, the model uses averaging instead. Each tree predicts a numerical value, and the final prediction is simply the mean of all individual predictions. This helps to reduce variance and smooth out extreme values contributed by any one tree. Through majority voting and averaging, Random Forest effectively consolidates many different viewpoints into a single, robust prediction, making it one of the most reliable models in both supervised learning tasks.

Why Ensembles Work¶

The success of ensemble methods stems from the idea of diversity and aggregation:

  • Diversity ensures that individual models make different kinds of errors.
  • Aggregation (like averaging or voting) helps in smoothing out those errors, resulting in better overall performance.

If models are both accurate (better than random guessing) and uncorrelated, their combination improves accuracy significantly.

Common Types of Ensemble Methods¶

There are three primary types of ensemble methods:

a. Bagging (Bootstrap Aggregating)¶
  • Reduces variance by training multiple models on different random samples with replacement (bootstrap samples).
  • Each model is trained independently.
  • Final prediction:
    • Classification: majority voting
    • Regression: averaging
b. Boosting¶
  • Reduces bias by training models sequentially, where each model tries to correct the errors of the previous one.
  • Final prediction: weighted sum of individual learners.
c. Stacking¶
  • Combines predictions of several models (base learners) using a meta-model that learns to combine them optimally.
  • Can mix models of different types.

Now that we understand ensemble methods, let’s dive into Random Forest, which is one of the most successful and widely used ensemble algorithms.


What is Random Forest?¶

Random Forest is an ensemble method based on Bagging, where the base learners are Decision Trees. It constructs a multitude of decision trees during training and outputs the mode (for classification) or mean (for regression) of the predictions of individual trees.

Key Components of Random Forest¶

  1. Bootstrapped Sampling:

    • Each tree is trained on a random subset of the data sampled with replacement (bootstrap).
    • This introduces diversity among the trees.
  2. Random Feature Selection:

    • At each split in a tree, a random subset of features is selected, and the best split is found only within that subset.
    • This prevents trees from being correlated (e.g., always splitting on the same strong feature).
  3. Aggregation:

    • For classification: majority vote of all trees.
    • For regression: average of all tree predictions.

Relationship Between Decision Trees and Random Forest¶

Random Forest builds on the idea of decision trees but addresses their key limitation - overfitting.

Aspect Decision Tree Random Forest
Model type Single estimator Ensemble of decision trees
Variance High Low (due to bagging)
Bias Can be low Slightly higher than a fully grown tree
Overfitting Prone to overfitting Less prone (aggregates predictions)
Feature usage All features Random subset of features per split

Algorithm: Building a Random Forest¶

Step-by-Step Random Forest Training:

  1. Let’s say we want to build a forest with $B$ trees.
  2. For each tree:
    • Sample (with replacement) a bootstrap dataset from the original training set.
    • Train a decision tree on this sample.
    • At each split in the tree, choose a random subset of $m$ features (usually $m = \sqrt{d}$ for classification, where $d$ is total features).
  3. Aggregate predictions from all trees:
    • For classification: use majority vote.
    • For regression: take the average of outputs.
Example¶

Suppose you have a dataset with 1,000 samples and 10 features.

  • You decide to build 100 decision trees.
  • For each tree:
    • You sample 1,000 data points with replacement (bootstrap).
    • You randomly select $\sqrt{10} \approx 3$ features at each split.
    • Each tree grows fully or up to a maximum depth (to prevent overfitting).
  • When making a prediction on a new sample:
    • Each of the 100 trees votes for a class.
    • The class with the most votes becomes the final prediction.

Random Forest in Regression¶

In regression, Random Forest works similarly, but instead of voting, it averages the predicted numerical values from all the trees.

Formula for Random Forest Regression Prediction:¶

Let the prediction from the $i^{th}$ tree be $\hat{f}_i(x)$. Then the forest prediction is:

$$ \hat{f}(x) = \frac{1}{B} \sum_{i=1}^{B} \hat{f}_i(x) $$

Where:

  • $B$ = number of trees
  • $\hat{f}_i(x)$ = prediction from the $i^{th}$ decision tree

Feature Importance in Random Forest¶

Random Forest offers a built-in way to assess feature importance, which is useful for feature selection.

There are two common measures:

  • Mean Decrease in Impurity (MDI): Average reduction in impurity (e.g., Gini) by that feature over all trees.
  • Permutation Importance: Measures drop in accuracy when feature values are randomly shuffled.

Advantages of Random Forest¶

  1. Robust to overfitting: Especially compared to individual decision trees.
  2. Works with both classification and regression tasks.
  3. Handles large feature spaces well.
  4. Performs implicit feature selection.
  5. Handles missing values and maintains good performance with imbalanced datasets.

Limitations of Random Forest¶

  1. Less interpretable than a single decision tree.
  2. Slower to predict than small models (many trees involved).
  3. Not ideal for high-dimensional sparse data, like text classification.
  4. May require hyperparameter tuning for optimal performance.

Understanding Bootstrapping in Machine Learning¶

Bootstrapping is a statistical resampling technique used in machine learning, particularly in ensemble methods like Random Forest, to create multiple training datasets from a single original dataset. The aim of bootstrapping is to generate a diverse set of data subsets that still resemble the original data distribution, but differ enough to introduce variation and reduce model variance when combined.

Key Concepts Behind Bootstrapping:¶

  1. Sampling with Replacement:

    • Data points are drawn one at a time, randomly, from the original dataset.
    • After each draw, the selected point is put back into the pool, making it eligible to be drawn again.
    • This means that the same data point can appear multiple times in a resampled dataset, while some points may not appear at all.
  2. Uniform Probability:

    • Every data point in the original dataset has an equal chance of being selected at each draw, regardless of how often it has been selected before.
  3. Same Size as Original:

    • Each bootstrapped dataset is typically the same size as the original training set. This ensures comparability in training and helps preserve the data distribution.
  4. Why It Works:

    • Although the resampled datasets are not completely different, the small variations introduced through resampling cause different models (e.g., decision trees) to learn different patterns, which adds diversity to the ensemble.
    • When aggregated (e.g., in Random Forest), this leads to better generalization.
  5. Unique Sample Expectation:

    • In each bootstrap sample of size $n$ taken from an original dataset of size $n$, only about $1 - \frac{1}{e} \approx 63.2\%$ of the original points are expected to be unique.
    • The remaining $\approx 36.8\%$ are duplicate entries, selected more than once.

Bootstrapping Example: House Dataset¶

Let’s say we have a simple dataset of 5 houses, and each house has 4 features:

House ID Size (sqft) Bedrooms Price (in $1000s) Age (years)
H1 1500 3 250 10
H2 1800 4 300 8
H3 1200 2 200 15
H4 1700 3 270 5
H5 1600 3 260 7

Now, we’ll use bootstrapping** to generate 3 new datasets, each of size 5 (same as original). These new datasets are sampled with replacement.

Bootstrap Sample 1¶
Sample # House ID Selected
1 H2
2 H4
3 H2
4 H5
5 H3

Resulting dataset:

  • H2 appears twice.
  • H1 is not selected.
Bootstrap Sample 2¶
Sample # House ID Selected
1 H3
2 H3
3 H1
4 H4
5 H5

Resulting dataset:

  • H3 appears twice
  • H2 is not selected
Bootstrap Sample 3¶
Sample # House ID Selected
1 H5
2 H5
3 H1
4 H2
5 H1

Resulting dataset:

  • H1 and H5 appear twice
  • H3 and H4 are not selected
Observations¶

Each dataset is slightly different from the original and from each other. Some data points repeat within the same sample. Some original points are left out completely. This variation is what gives Random Forest its resilience and diversity.

Why 63.2% Unique Data Points?¶

Let’s prove this using probability:

For a dataset of size $n$, the probability that a data point is not selected in one draw is:

$$ P(\text{not selected}) = 1 - \frac{1}{n} $$

Since sampling is with replacement, the probability that the data point is never selected in $n$ draws is:

$$ P(\text{never selected in } n \text{ draws}) = \left(1 - \frac{1}{n}\right)^n $$

As $n \to \infty$, this approaches:

$$ \lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = \frac{1}{e} \approx 0.368 $$

So, the expected fraction of unique data points in a bootstrap sample is:

$$ 1 - \frac{1}{e} \approx 0.632 \text{ or } 63.2\% $$


How Bootstrapping is Applied in Random Forests¶

To understand Random Forests, it’s crucial to grasp how the concept of bootstrapping evolves into Bootstrap Aggregating, also known as Bagging. Bootstrapping, as discussed earlier, involves generating new datasets by randomly sampling with replacement from the original dataset. These bootstrap samples mirror the distribution of the original data, but they are slightly varied due to the randomness of sampling. This variation creates the foundation for training multiple models that are similar but not identical.

Now, this idea is extended through a powerful concept called Bagging, short for Bootstrap Aggregating. In Bagging, we take each of these bootstrapped datasets and train a separate machine learning model on each one. The models could theoretically be of any type, but in the context of Random Forests, these models are always decision trees.

The Bagging Process with Decision Trees¶

Let’s say we decide to generate 50 bootstrap samples from the training dataset. For each of these 50 datasets:

  1. We sample with replacement from the original dataset to create a new dataset of the same size.
  2. A decision tree is trained on each bootstrap sample.
  3. As a result, we end up with 50 decision trees, each slightly different from the others.

Each decision tree has seen a different subset of the data and learned slightly different patterns. This means each tree may give a different prediction for the same input, depending on how it was trained. Now comes the question: what do we do with these 50 decision trees? We aggregate their predictions to make a final decision. This is typically done using:

  • Majority Voting (for classification): The most frequently predicted class becomes the final prediction.
  • Averaging (for regression): The average of all tree predictions is taken as the final output.

How Random Forest Improves Bagged Decision Trees¶

Random Forest is essentially an enhanced version of Bagging. While Bagging already helps by introducing variability in training data, Random Forest introduces one more level of randomness - and that’s in the features used for splitting nodes. In a regular decision tree (even in bagging), the algorithm looks at all features when deciding how to split a node. But Random Forest does something smarter.

Random Feature Subset at Each Split:¶
  • Instead of considering all features at every split, Random Forest picks a random subset of features.
  • For example, if you have 3 features: A, B, and C:
    • Tree 1 might choose to consider A and B.
    • Tree 2 might consider only B and C.
    • Tree 3 might consider just A.
    • And so on.

This extra layer of randomness leads to greater diversity between the individual decision trees, even more than Bagging alone provides. As a result, it significantly reduces the correlation between trees, which is essential in reducing overfitting and improving generalization.

Individual decision trees tend to have low bias but high variance. They fit the training data well but perform poorly on unseen data if grown deeply. Bagging reduces variance by averaging predictions from many trees trained on different data samples. Random Forest reduces variance even further by ensuring that the trees are decorrelated, thanks to the random feature selection step.

Code Section¶

Importing Libraries¶

In [7]:
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

Loading Dataset¶

In [8]:
df = pd.read_csv('Glass_dataset/glass_data.csv')

df.shape
Out[8]:
(214, 11)
In [9]:
df.head()
Out[9]:
Id RI Na Mg Al Si K Ca Ba Fe Type
0 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0.0 0.0 1
1 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0.0 0.0 1
2 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0.0 0.0 1
3 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0.0 0.0 1
4 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0.0 0.0 1
In [11]:
# Id is irrelevant for us and Type is the target variable, with all others being useful features for us

df = df.drop(['Id'], axis = 1)

Separating features and target¶

In [13]:
inputs = df.iloc[:, :-1]
targets = df.iloc[:, -1]

inputs.head()
Out[13]:
RI Na Mg Al Si K Ca Ba Fe
0 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0.0 0.0
1 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0.0 0.0
2 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0.0 0.0
3 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0.0 0.0
4 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0.0 0.0
In [14]:
targets.head()
Out[14]:
0    1
1    1
2    1
3    1
4    1
Name: Type, dtype: int64

Splitting the data into training and testing dataset -

In [15]:
# train test split will randomly shuffle the data before seggregating

xtrain, xtest, ytrain, ytest = train_test_split(inputs, targets, test_size=0.2, random_state=42)

Creating the model¶

In [17]:
clf = RandomForestClassifier(random_state=42)
clf.fit(xtrain, ytrain)
Out[17]:
RandomForestClassifier(random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(random_state=42)
In [19]:
clf.predict([xtest.iloc[32]])
E:\7. Deep Learning\venv\lib\site-packages\sklearn\base.py:493: UserWarning: X does not have valid feature names, but RandomForestClassifier was fitted with feature names
  warnings.warn(
Out[19]:
array([3], dtype=int64)
In [20]:
ytest.iloc[32]
Out[20]:
3

They are the same! Thus the prediction is right. However, there are ways to test the performance of the model.

Testing model performance¶

In [21]:
test_pred = clf.predict(xtest)

print(classification_report(test_pred, ytest))
              precision    recall  f1-score   support

           1       1.00      0.69      0.81        16
           2       0.64      0.90      0.75        10
           3       0.67      1.00      0.80         2
           5       0.75      1.00      0.86         3
           6       1.00      1.00      1.00         3
           7       1.00      0.89      0.94         9

    accuracy                           0.84        43
   macro avg       0.84      0.91      0.86        43
weighted avg       0.88      0.84      0.84        43

Let’s walk through each term and how to interpret it:

  1. Precision - Of all the predicted samples for a class, how many were actually correct. High precision means few false positives. For class 1, precision = 1.00 (perfect).

  2. Recall - Of all actual samples for a class, how many were correctly predicted. High recall means few false negatives. For class 2, recall = 0.90, meaning 90% of actual class 2 instances were correctly detected.

  3. F1-Score - Harmonic mean of precision and recall. Provides a balanced measure when there’s a trade-off between precision and recall. For class 5, F1-score = 0.86, which is strong.

  4. Support - Number of actual occurrences of each class in the dataset. Indicates how many samples belong to each class. Support for class 1 = 16, class 3 = 2 and so on.

Support Concept in detail¶

Support tells us about the distribution of the dataset across different classes. This is crucial because the size of support affects how well the model can learn patterns for each class. A. Classes with High Support (Large Dataset)

  • Class 1 with support = 16, class 2 with support = 10.
  • The model has more data to learn from.
  • Therefore, it can generalize better for these classes, although that’s not always guaranteed (as we see class 1 has a recall of just 0.69). B. Classes with Low Support (Small Dataset)
  • Class 3 and 6 with support = 2 and 3, respectively.
  • The model has very little data to learn.
  • Results can be:
    • Overfitting: Model may memorize the few examples.
    • Unstable metrics: A small change in prediction can greatly impact precision/recall.
  • Despite this, class 6 shows perfect precision and recall (1.00), which likely means the model guessed correctly on all 3 samples, but this can be misleading if tested on new data.
Accuracy and Averages:¶
  • Accuracy: Overall, the model is correct on 84% of all samples (36 out of 43).
  • Macro Average:
    • Arithmetic mean of the precision/recall/F1 across all classes.
    • Does not consider class imbalance.
    • Here: F1 macro avg = 0.86
  • Weighted Average:
    • Average of metrics weighted by support (class frequency).
    • Considers imbalance, gives more weight to common classes.
    • Here: F1 weighted avg = 0.84
Impact of Uneven Support:¶
  • If some classes are underrepresented (low support), the model can become biased toward the more frequent classes.
  • Classes with few samples can:
    • Be misclassified often (low recall)
    • Or give inflated performance if guessed correctly by chance
Modeling Differences on Small vs Large Data:¶
  • Small Datasets:
    • Can lead to overfitting
    • Metrics like precision/recall become volatile
    • One misprediction greatly changes performance
  • Large Datasets:
    • Model learns better general patterns
    • Performance metrics become more stable and trustworthy
In our case:¶

The model performs well overall (accuracy = 0.84). It does best on classes 6 and 7, but class 6 has only 3 samples, so the perfect score may be misleading. Class 1 has high precision but relatively low recall, meaning the model is conservative in predicting this class. Class 2, despite low precision (many false positives), has high recall, so the model catches most true class 2 samples but misclassifies other classes as 2. The mix of high support classes and low support classes shows how support influences stability and reliability of performance.

In [ ]: