Introduction to KNN¶

K-Nearest Neighbors (KNN) is a simple yet powerful algorithm that excels at identifying patterns within data, particularly when it comes to detecting anomalies or outliers. Because KNN compares new data points to their closest neighbors in the feature space, it is especially effective at flagging samples that deviate from typical patterns, making it a common choice for outlier detection.

Beyond anomaly detection, KNN is widely used for both classification and regression tasks. In classification, KNN can group items based on similarity in their features, which makes it particularly useful in applications such as text mining, where documents with similar word patterns can be clustered together. For regression, KNN is well-suited to problems where the relationship between features and the target variable is non-linear, as it makes no assumptions about the underlying data distribution.

One of the biggest advantages of KNN is its intuitiveness. It is easy to understand and interpret, which makes it a strong candidate for initial modeling and exploratory analysis. In fact, in certain scenarios, especially with smaller datasets and well-separated classes, KNN can outperform more complex machine learning models.

The K in KNN refers to the number of nearest neighbors the algorithm considers when making a prediction. To understand this better, imagine trying to decide which movie to watch next. You might ask a few of your friends - say, 5 of them - for recommendations. If 3 out of those 5 suggest a comedy, you’re likely to choose a comedy yourself. Here, K = 5. If you had asked only 1 friend (K = 1), your decision would rely entirely on their preference. Similarly, increasing the value of K makes the model more generalized by considering more neighbors, while a smaller K makes it more sensitive to noise in the data.

Overall, KNN remains a go-to algorithm due to its simplicity, versatility, and effectiveness in a range of real-world applications. However, it can be computationally expensive for large datasets, as it requires calculating distances to all training points for each prediction.


Distance Metric used in KNN¶

At the heart of the KNN algorithm is distance measurement, a way to evaluate how "close" two data points are. This concept is essential because KNN uses these distances to decide which neighbors influence a prediction.

Let's first understand how distances are calculated in 2D space using Minkowski distance, then extend it to other metrics.

Let us consider two points in 2D space:

  • Point A = $(x_1, y_1)$
  • Point B = $(x_2, y_2)$
  1. Minkowski Distance: A Generalized Metric

The Minkowski distance is a generalized metric defined as:

$$ d_p(A, B) = \left( |x_1 - x_2|^p + |y_1 - y_2|^p \right)^{1/p} $$

This formula allows you to vary the value of $p$, which changes the behavior of the distance metric. Different values of $p$ yield different types of distances, each with its own geometric and practical implications.

  2. When $p = 1$: Manhattan (Taxicab) Distance

For $p = 1$, the Minkowski formula becomes:

$$ d_1(A, B) = |x_1 - x_2| + |y_1 - y_2| $$

This is called Manhattan or Taxicab distance because it simulates travel along a grid—like navigating streets in a city. You can only move horizontally or vertically.

Example: If A = (2, 3) and B = (5, 7):

$$ d_1 = |2 - 5| + |3 - 7| = 3 + 4 = 7 $$

This distance would follow the orange path shown in the accompanying diagram, forming a right-angled route.

  3. When $p = 2$: Euclidean Distance

The most commonly used metric is when $p = 2$:

$$ d_2(A, B) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} $$

This gives the straight-line distance, like how a bird would fly directly from one point to another. The formula is simply the Pythagorean theorem.

Example: Using A = (2, 3) and B = (5, 7):

$$ d_2 = \sqrt{(2 - 5)^2 + (3 - 7)^2} = \sqrt{9 + 16} = 5 $$

  4. When $p \to \infty$: Chebyshev Distance

As $p$ approaches infinity, the Minkowski distance becomes:

$$ d_\infty(A, B) = \max\left(|x_1 - x_2|,\; |y_1 - y_2|\right) $$

This is the Chebyshev distance, often used in games like chess, where a king can move one square horizontally, vertically, or diagonally, so the number of moves it needs equals the Chebyshev distance.

Example: A = (2, 3), B = (5, 7):

$$ d_\infty = \max(3, 4) = 4 $$

It captures the largest directional change between two points.

[Diagram illustrating the different distance paths between points A and B]
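
To make these formulas concrete, here is a minimal NumPy sketch using the points A = (2, 3) and B = (5, 7) from the examples above; the helper name minkowski_distance is just for illustration.

import numpy as np

A = np.array([2, 3])
B = np.array([5, 7])

def minkowski_distance(a, b, p):
    # (|x1 - x2|^p + |y1 - y2|^p)^(1/p)
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

print(minkowski_distance(A, B, p=1))   # 7.0 -> Manhattan
print(minkowski_distance(A, B, p=2))   # 5.0 -> Euclidean
print(np.max(np.abs(A - B)))           # 4   -> Chebyshev (limit as p -> infinity)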


Why Does the Value of $p$ Matter in KNN?¶

The choice of $p$ influences how distances are interpreted:

  • $p = 1$: Treats all directional changes equally. Useful when data follows grid patterns.
  • $p = 2$: Captures geometric distance. Ideal when spatial or physical distance matters.
  • $p \to \infty$: Focuses only on the largest change. Helpful in some pathfinding or board game logic.

KNN performance can vary significantly based on the distance metric. Cross-validation can help you select the best $p$ for your dataset.


Other Important Distance Metrics in KNN¶

While Minkowski metrics are common, other specialized distances are also widely used in machine learning:

  1. Mahalanobis Distance

Measures the distance between points while considering the variance and correlation between features. It is given by -

$$ D_M(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)} $$

Where:

  • $x$ and $y$ are vectors (points)
  • $S$ is the covariance matrix of the dataset
  • $S^{-1}$ is its inverse

If features are correlated or have different scales, the Mahalanobis distance adjusts for that: it rescales the space so that the features become uncorrelated with equal variance, giving a truer sense of distance. It is more computationally intensive because the covariance matrix must be inverted. It is ideal when features are interrelated, and it is often used in outlier detection, multivariate analysis, and data with correlated features.
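
A minimal sketch of the formula above, using a small synthetic dataset (so the numbers are illustrative only); it computes the distance directly with NumPy and checks it against scipy.spatial.distance.mahalanobis:

import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(42)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 1], [1, 2]], size=200)

S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # S^{-1}, the inverse covariance matrix
x, y = X[0], X[1]

diff = x - y
print(np.sqrt(diff @ S_inv @ diff))    # sqrt((x - y)^T S^{-1} (x - y))
print(mahalanobis(x, y, S_inv))        # same value via scipy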

  2. Cosine Similarity and Distance

Measures the angle between two vectors rather than their magnitude, so it focuses on direction only.

$$ \text{Cosine Similarity} = \frac{x \cdot y}{\|x\| \|y\|} $$

Cosine distance is simply:

$$ \text{Cosine Distance} = 1 - \text{Cosine Similarity} $$

Two vectors pointing in the same direction have cosine similarity 1, and those at 90° have similarity 0. It works well for high-dimensional sparse data such as text, but it is not well suited to purely numeric regression tasks. It is best for text mining, document clustering, and sparse high-dimensional data (e.g., TF-IDF vectors).
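
A minimal sketch of cosine similarity and distance with NumPy (the vectors here are made up purely for illustration):

import numpy as np

def cosine_distance(x, y):
    similarity = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1 - similarity

print(cosine_distance(np.array([1, 2, 0]), np.array([2, 4, 0])))  # ~0.0 -> same direction
print(cosine_distance(np.array([1, 0, 0]), np.array([0, 1, 0])))  # 1.0  -> orthogonal (90 degrees)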

  3. Hamming Distance

Hamming distance counts the number of positions at which the corresponding symbols differ between two strings of equal length.

$$ \text{Hamming Distance} = \sum_{i=1}^n \mathbb{1}(x_i \ne y_i) $$

Where $\mathbb{1}$ is an indicator function.

It is a simple count of mismatches, which makes it ideal for binary or categorical data; it only works for discrete-valued vectors of equal length. It is fast and efficient for binary data but not suited to continuous variables. It is generally used in DNA sequence comparison, error detection in transmission (parity checks), and KNN on categorical attributes (e.g., gender, color, type).
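
A small sketch of the Hamming distance on two equal-length sequences (a made-up example):

import numpy as np

x = np.array(list("ACGTA"))
y = np.array(list("ACCTG"))

# Count the positions where the symbols differ
print(np.sum(x != y))   # 2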

  4. Jaccard Distance

Used to compare set similarity based on intersection and union.

$$ \text{Jaccard Similarity} = \frac{|A \cap B|}{|A \cup B|}, \quad \text{Jaccard Distance} = 1 - \text{Jaccard Similarity} $$

It measures how similar two sets are based on shared versus total unique elements. It applies only to sets or binary-encoded features (e.g., tag vectors) and is very effective for sparse binary data. It is usually used in recommender systems, set-based feature encoding, and tag or attribute matching.
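
A minimal sketch of Jaccard similarity and distance on two hypothetical tag sets:

A = {"action", "comedy", "thriller"}
B = {"comedy", "drama", "thriller"}

jaccard_similarity = len(A & B) / len(A | B)   # shared / total unique elements
jaccard_distance = 1 - jaccard_similarity

print(jaccard_similarity, jaccard_distance)    # 0.5 0.5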


Selecting the right distance metric is crucial in KNN. It determines which neighbors influence the model, which in turn impacts predictions. The more aligned the metric is with the nature of the data, the better the KNN algorithm will perform.


KNN as a Classifier demo¶

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Let us use randomly generated data for this purpose. We will use make_blobs from sklearn.datasets, which generates normally distributed points around specified centers. This function returns 2 arrays -

  • First array stores the coordinates
  • Second array stores the corresponding labels
In [2]:
from sklearn.datasets import make_blobs
In [3]:
inputs, targets = make_blobs(n_samples=1000, centers=[(-6,6), (-3,2), (-6,4)] , random_state=42)

# centers defines the centers around which points are generated - here we gave 3, so the target will have 3 labels
# You can also pass an integer, e.g. centers = 5 will give you 5 different labels
# You can control the number of features with the n_features parameter
In [4]:
inputs[:2]
Out[4]:
array([[-4.97593747,  4.59252695],
       [-4.33874079,  3.54290374]])
In [5]:
targets[:2]
Out[5]:
array([2, 2])
In [6]:
inputs.shape   # 1000 rows and 2 coordinates x and y which is equivalent to 2 features
Out[6]:
(1000, 2)
In [7]:
# Let's create a pandas dataframe with the newly created data points

data = pd.DataFrame(data = inputs, columns = ['feature1', 'feature2'])
data['target'] = targets
In [8]:
data.head()
Out[8]:
feature1 feature2 target
0 -4.975937 4.592527 2
1 -4.338741 3.542904 2
2 -7.607483 6.184634 0
3 -4.923993 4.021312 2
4 -6.756351 4.577746 0
In [9]:
data['target'].value_counts()
Out[9]:
target
0    334
2    333
1    333
Name: count, dtype: int64

Visualize the dataset¶

In [10]:
plt.figure(figsize = (14,5))
sns.set()
sns.scatterplot(x = 'feature1', y = 'feature2',
                data = data,
                hue = 'target',
                palette = ['#000172', '#287281', '#986732'],
                style = 'target',
                s = 60
               )
plt.show()

To look at the normal distribution of data, we can use the jointplot from sns -

In [11]:
sns.set()
sns.jointplot(x = 'feature1', y = 'feature2',
              data = data,
              hue = 'target',
              palette = ['#000172', '#287281', '#986732'],
              height=6,
              s = 60
             )
plt.show()

So, we generated random points distributed amongst three classes. Now we will use these points to train the KNN algorithm. Before that, we need to split our data.

Splitting data into train and test¶

In [12]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data.iloc[:, :-1], 
                                                    data.iloc[:, -1], 
                                                    test_size = 0.2, 
                                                    random_state=42)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
Out[12]:
((800, 2), (200, 2), (800,), (200,))
In [13]:
x_train.head(2)
Out[13]:
feature1 feature2
29 -1.469249 3.218762
535 -6.883857 6.153725

Building the model¶

In [14]:
from sklearn.neighbors import KNeighborsClassifier

# Let's instantiate the model with 1 neighbor at first
clf = KNeighborsClassifier(n_neighbors = 1)  # By default, the value of this parameter is 5
In [15]:
clf.fit(x_train, y_train)
Out[15]:
KNeighborsClassifier(n_neighbors=1)
In [16]:
samples = [[-3.6, 3], [-7.6, 1], [-2.15, 2.7]]
sample_set = pd.DataFrame(data = samples, columns=['feature1', 'feature2'])
sample_set
Out[16]:
feature1 feature2
0 -3.60 3.0
1 -7.60 1.0
2 -2.15 2.7
In [17]:
prediction = clf.predict(sample_set)
prediction
Out[17]:
array([1, 2, 1])

The predictions show that the three sample points belong to labels 1, 2, and 1 respectively. Let's retrieve the nearest neighbor of each point.

In [18]:
neighbors = clf.kneighbors(sample_set)
neighbors
Out[18]:
(array([[0.06855388],
        [0.44865184],
        [0.19497333]]),
 array([[750],
        [777],
        [509]], dtype=int64))

The result is a tuple of arrays -

  • The first array stores the Euclidean distances from the new points in the sample set to their nearest neighbors in the training set.
  • The second array stores the indices of those neighbor points in the training dataset.

Let's visualize this.

In [19]:
plt.figure(figsize = (14,5))
sns.set()
sns.scatterplot(x = x_train.iloc[:, 0], y = x_train.iloc[:, 1],
                hue = y_train,
                palette = ['#000172', '#287281', '#986732'],
                s = 60,
                legend = True
               )

sns.scatterplot(x = sample_set.iloc[:, 0], y = sample_set.iloc[:, 1],
                markers = ['o'],
                color = 'red',
                s = 30,
                legend = False
               )
    
plt.show()

The red points are the new samples plotted on top of the training data. Sometimes a new point may get assigned to a label whose points are not the closest to it. For example, a new point may be closer to class 2, but the prediction says it belongs to class 1. Why is that the case? Let's check the weights setting of the model first -

In [20]:
clf.get_params()
Out[20]:
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 1,
 'p': 2,
 'weights': 'uniform'}

By default, the weights parameter in KNN is set to uniform. This means all neighbors contribute equally to the classification, regardless of how close or far they are from the point being predicted. As a result, when there is a tie between classes - for example, if one neighbor belongs to class 1 and another to class 2 - the classifier is equally likely to choose either class.

However, in practice, the classifier doesn't make a random choice in the event of a tie. It consistently picks the class with the smaller label, no matter how many times you run the classifier. Why does this happen?

To understand this, we need to look at how the predict method is implemented. Internally, it uses the mode function from the scipy.stats library to determine the most frequent class among the nearest neighbors. When there's a tie (i.e., multiple classes appear the same number of times), the mode function returns the smallest value among them.

So if both class 1 and class 2 occur once, the classifier will always return class 1, simply because it has a lower numerical label. This happens even if class 2 is represented by a neighbor that is closer to the input point than the one from class 1.

This behavior is a bit counterintuitive, especially for an algorithm like KNN, which is fundamentally based on distance. In such cases, it makes more sense to give more weight to closer neighbors.

To improve this, we can change the weights parameter from uniform to distance. When this setting is used, each neighbor's influence is determined by how close it is to the input point. Specifically, each neighbor gets a weight equal to the inverse of its distance:

$$ \text{weight}_i = \frac{1}{\text{distance}_i} $$

This means that neighbors closer to the query point have a greater impact on the prediction, while those farther away contribute less. Using distance weighting helps the model make decisions that better reflect the true structure of the data, especially in cases of ties or when neighbor distances vary significantly.
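
As a rough illustration (with made-up distances and labels, not taken from the dataset above), here is how a two-neighbor tie would be resolved once votes are weighted by the inverse of distance:

import numpy as np

# Hypothetical tie: one neighbor from class 1, one from class 2,
# but the class-2 neighbor is much closer to the query point
distances = np.array([0.9, 0.2])
labels = np.array([1, 2])

weights = 1 / distances            # weight_i = 1 / distance_i
for cls in np.unique(labels):
    print(cls, weights[labels == cls].sum())

# Class 2 gets the larger weighted vote (5.0 vs. ~1.11), so it wins,
# whereas with uniform weights the tie would fall back to the smaller label.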

In [21]:
# Let's run the model code again

clf2 = KNeighborsClassifier(n_neighbors=1, weights='distance')
clf2.fit(x_train, y_train)

prediction2 = clf2.predict(sample_set)
In [22]:
clf2.get_params()
Out[22]:
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 1,
 'p': 2,
 'weights': 'distance'}
In [23]:
prediction2
Out[23]:
array([1, 2, 1])
In [24]:
prediction
Out[24]:
array([1, 2, 1])

In many situations, using uniform weights in KNN works well and is often preferred, especially when the data is evenly distributed and relatively clean. With uniform weights, every neighbor contributes equally to the classification, which simplifies the model and can prevent overfitting to local noise or outliers. However, this approach might not always capture the influence of proximity, especially in more complex or noisy datasets.

Understanding decision boundaries¶

To better understand the impact of KNN parameters - particularly the choice of k, the number of neighbors - it helps to visualize how the algorithm behaves in feature space. One effective way to do this is by constructing decision boundaries or decision regions.

Decision boundaries show the areas in the feature space where the classifier assigns different class labels. Each region corresponds to one class, and any new point that falls within a region will be classified accordingly. These boundaries shift depending on how k is set and whether uniform or distance-based weighting is used.

When k is small (such as 1 or 3), the decision boundaries are more sensitive to individual data points. This can lead to overfitting, where the model captures noise and makes overly complex boundaries. As k increases, the model becomes more stable and generalizes better, resulting in smoother and more averaged decision boundaries. However, if k is too large, the model may become too simple, potentially misclassifying points near the edge of a cluster.

By plotting decision regions for different values of k, we can observe how the model behaves and choose a value that best balances bias and variance. These visualizations help identify how robust the classifier is to variations in the data and how much influence local neighbors have on predictions.

Visual tools like decision boundary plots allow us to better understand and fine-tune the classifier for optimal performance on different datasets.

In [25]:
from mlxtend.plotting import plot_decision_regions
import time
In [26]:
plot_decision_regions(X = np.array(x_train), 
                      y = np.array(y_train),
                      X_highlight = np.array(sample_set),
                      clf = clf2, 
                      scatter_kwargs= {'s':60, 'edgecolor':'white', 'alpha':0.5},
                      legend = True)  
# This function requires 2D array - (samples, features) and labels

plt.show()
E:\7. Deep Learning\venv\lib\site-packages\sklearn\base.py:493: UserWarning: X does not have valid feature names, but KNeighborsClassifier was fitted with feature names
  warnings.warn(

The coloured sections are the decision regions, and the boundaries that separate them are the decision boundaries. This plot visually explains how KNN partitions the feature space into different class regions based on proximity to labeled training samples. The decision boundaries curve around clusters and adapt to the distribution of data points, making KNN a flexible, non-parametric classifier.

A K value of 1 is not really a good choice. The model is very flexible and creates decision boundaries that are unique to each training dataset, so adding even a single new data point can change the boundaries a lot. The model will likely overfit during training and perform poorly on new data. This kind of model has low bias and high variance.

Let's try a higher value of k and see what happens. It will take a few seconds -

In [27]:
clf3 = KNeighborsClassifier(n_neighbors=30, weights='uniform')
clf3.fit(x_train, y_train)

prediction3 = clf3.predict(sample_set)

plot_decision_regions(X = np.array(x_train), 
                      y = np.array(y_train),
                      X_highlight = np.array(sample_set),
                      clf = clf3, 
                      scatter_kwargs= {'s':60, 'edgecolor':'white', 'alpha':0.5},
                      legend = True) 
E:\7. Deep Learning\venv\lib\site-packages\sklearn\base.py:493: UserWarning: X does not have valid feature names, but KNeighborsClassifier was fitted with feature names
  warnings.warn(
Out[27]:
<Axes: >

The decision boundaries have changed quite a lot, and some green points now get misclassified. This model may suffer from underfitting, where we get low variance but high bias. So we need to find the most suitable value of K.

One way to analyze the value of k is to track the number of misclassified test data points as K increases. This metric is called the misclassification rate, which is defined as: misclassification rate = 1 - accuracy. We can also refer to it as the error rate.

Getting error rates from a set of models¶

In [28]:
from sklearn.metrics import accuracy_score

error_uniform = []  # Stores error rates from models with uniform weights
error_distance = []   # Stores error rates from models with distance-based weights

for k in range(1,31):
    model = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    error_uniform.append(1 - accuracy_score(y_test, predictions))

    model = KNeighborsClassifier(n_neighbors=k, weights='distance')
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    error_distance.append(1 - accuracy_score(y_test, predictions))
In [31]:
# Let's plot now

plt.plot(range(1,31), error_uniform, c = 'red', linestyle = 'solid',
         marker = 'o', markerfacecolor = 'black', label = 'Error Uniform')
plt.plot(range(1,31), error_distance, c = 'purple', linestyle = '--',
         marker = 'o', markerfacecolor = 'white', label = 'Error Distance')
plt.legend()
plt.xlabel('K values')
plt.ylabel('Error rate')
plt.show()

From the plot above, we want a value of k that is neither very small (which introduces high variance) nor very large (which introduces high bias), and that also gives a small error rate. On average, we can see that uniform weights work better here than distance-based weights. There are multiple values of k with a small error rate, so which one should we choose? We will use GridSearchCV for that.

Using GridSearchCV¶

In [32]:
from sklearn.model_selection import GridSearchCV

grid_params = {
    'n_neighbors' : range(1,31),
    'weights' : ['uniform', 'distance']
    }

gridcv = GridSearchCV(estimator = KNeighborsClassifier(),
                      param_grid=grid_params,
                      scoring='accuracy')

gridcv.fit(x_train, y_train)
Out[32]:
GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': range(1, 31),
                         'weights': ['uniform', 'distance']},
             scoring='accuracy')
In [33]:
gridcv.best_params_
Out[33]:
{'n_neighbors': 8, 'weights': 'uniform'}

GridSearchCV has returned uniform as the best weighting and 8 neighbors as optimal in our case.

In [34]:
gridcv.best_estimator_
Out[34]:
KNeighborsClassifier(n_neighbors=8)
In [35]:
gridcv.best_score_   
# This number is the mean of the accuracies obtained during cross-validation
Out[35]:
0.8625
In [36]:
clf = gridcv.best_estimator_
clf.fit(x_train, y_train)
Out[36]:
KNeighborsClassifier(n_neighbors=8)
In [37]:
predictions = clf.predict(x_test)

Evaluating Model performance¶

In [38]:
# Checking the score
accuracy_score(predictions, y_test)
Out[38]:
0.86

Accuracy alone is not always the best metric for classification tasks, especially when classes are imbalanced, so let's look at some other classification metrics -

In [39]:
from sklearn.metrics import classification_report, confusion_matrix

cnf_matrix = confusion_matrix(predictions, y_test)
cnf_matrix
Out[39]:
array([[65,  0, 15],
       [ 1, 52,  3],
       [ 8,  1, 55]], dtype=int64)
In [40]:
plt.figure(figsize=(6, 5))
sns.heatmap(
    cnf_matrix,
    annot=True,               # Show numbers in cells
    fmt='d',                  # Format numbers as integers
    cmap='magma',
    xticklabels=clf.classes_,
    yticklabels=clf.classes_
)

plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
In [42]:
print(classification_report(predictions, y_test))
              precision    recall  f1-score   support

           0       0.88      0.81      0.84        80
           1       0.98      0.93      0.95        56
           2       0.75      0.86      0.80        64

    accuracy                           0.86       200
   macro avg       0.87      0.87      0.87       200
weighted avg       0.87      0.86      0.86       200

The classification report shows an overall accuracy of 86%, indicating that the model correctly classified 86% of all test instances.

  • Class 1 achieved the best performance with a precision of 0.98, recall of 0.93, and F1-score of 0.95, suggesting it is predicted with high confidence and completeness.

  • Class 0 also performed well with a precision of 0.88 and F1-score of 0.84, though its recall (0.81) is slightly lower.

  • Class 2 has the weakest metrics, particularly in precision (0.75), meaning the model often misclassifies other classes as class 2. However, it has a decent recall of 0.86.

The macro and weighted averages are balanced at around 0.87, confirming that the model maintains fairly consistent performance across classes, though there is room for improvement in class 2's precision.


KNN as a Regressor demo¶

The mechanics of this algorithm are the same as those of the KNN classifier, just applied to a regression task. The aim here is not to create an accurate model but to show how the algorithm works.

In [43]:
from sklearn.datasets import make_regression

Here we will draw random data points from a linear relationship with a single feature and a single output, which is easier to visualize.

In [49]:
inputs, targets = make_regression(n_samples = 10, n_features= 1, noise = 6,
                                 random_state = 42)

# noise deviates the data points, i.e. introduces a bit of noise into the otherwise linear data
In [50]:
plt.scatter(inputs, targets);

This is the linear dataset with some noise. The target values on the y axis vary over a wide range, so let's divide them by the maximum value.

In [51]:
targets = targets/targets.max()

plt.scatter(inputs, targets);

Great! Now we will create the model.

In [52]:
from sklearn.neighbors import KNeighborsRegressor

reg_knn = KNeighborsRegressor(n_neighbors=1)
reg_knn
Out[52]:
KNeighborsRegressor(n_neighbors=1)
In [53]:
reg_knn.fit(inputs, targets)
Out[53]:
KNeighborsRegressor(n_neighbors=1)
In [70]:
x_pred = 0.53
y_pred = reg_knn.predict([[x_pred]])
y_pred
Out[70]:
array([-0.08901495])
In [71]:
neighbors_reg = reg_knn.kneighbors([[x_pred]])
neighbors_reg
Out[71]:
(array([[0.01256004]]), array([[5]], dtype=int64))

This returns the same kind of tuple as in classification; the only difference is that the distance in the first array is measured along the single feature (the x axis). Let's plot to see -

In [72]:
plt.scatter(inputs, targets)
plt.scatter(x_pred, y_pred);

As shown above, the nearest neighbor is at index 5. Let's check the target value at index 5 and compare it with our predicted value -

In [75]:
y_pred, targets[5]  # They are almost the same
Out[75]:
(array([-0.08901495]), -0.08901494592896088)

In KNN regression, when k = 1, the predicted value of the target variable $\hat{y}$ is simply the target value of the nearest neighbor. However, when the value of k is increased to a number greater than 1, the prediction is no longer based on a single point but instead on the average of the target values of the k nearest neighbors.

For example, if k = 3, the algorithm identifies the 3 closest data points to the query instance, retrieves their corresponding target values, and computes their average. This average becomes the predicted output for the query point.

General Formula for KNN Regression:

Let the k nearest neighbors to a point $x$ be $x_1, x_2, \dots, x_k$, with corresponding target values $y_1, y_2, \dots, y_k$. Then, the predicted value $\hat{y}$ is given by:

$$ \hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_i $$

This approach allows the prediction to be more stable and less sensitive to noise, especially when compared to using only the single nearest neighbor. If we use distance-based weighting, the prediction formula adjusts to give closer neighbors more influence.
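
For reference, a distance-weighted version of the prediction (assuming inverse-distance weights, as used by weights='distance') would be:

$$ \hat{y} = \frac{\sum_{i=1}^{k} w_i \, y_i}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{d_i} $$

where $d_i$ is the distance of the $i$-th neighbor from the query point.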

In [82]:
# Let's change k to 4 and see what we get

reg_knn = KNeighborsRegressor(n_neighbors=4)
reg_knn.fit(inputs, targets)

x_pred = 0.53
y_pred = reg_knn.predict([[x_pred]])

plt.scatter(inputs, targets)
plt.scatter(x_pred, y_pred);
In [86]:
neighbors = reg_knn.kneighbors([[x_pred]])
neighbors
Out[86]:
(array([[0.01256004, 0.03328585, 0.11768854, 0.23743473]]),
 array([[5, 6, 9, 2]], dtype=int64))
In [92]:
(targets[5] + targets[6] + targets[9] + targets[2])/4
Out[92]:
0.06162155187861659
In [90]:
y_pred
Out[90]:
array([0.06162155])

Exactly the same, just as the formula says.

Now let's compare the parametric and non-parametric approaches.


Comparison of KNN Regressor with Linear Regression¶

Parametric approaches refer to algorithms that make strong assumptions about the form or structure of the data. A common example is linear regression, which assumes that the relationship between the independent and dependent variables is linear. In such a model, increasing the input by one unit results in a fixed, proportional change in the output. While this makes the model simple, interpretable, and computationally efficient, it is also quite restrictive and may not work well when the true relationship is more complex or non-linear.

On the other hand, non-parametric approaches like KNN make no assumptions about the underlying data distribution. KNN adapts to the shape and structure of the data by using the target values of nearby points to make predictions. This flexibility makes it a good choice when the form of the relationship between variables is not known in advance or is highly irregular.

Given this, one might wonder why linear regression is still widely used. The answer lies in the trade-off between simplicity and flexibility. Linear regression tends to perform well when the true relationship is approximately linear or when interpretability is a priority. It also requires less data to train effectively and can extrapolate to unseen regions of the feature space. KNN, while more flexible, may require a large amount of data, can be sensitive to noise, and struggles with extrapolation. In the end, the choice depends on the nature of the dataset and the specific goals of the analysis.

Let's see here how it works -

Let's define data first.

In [96]:
inputs, targets = make_regression(n_samples = 500,
                                  n_features= 1,
                                  noise = 28,
                                  random_state=42)

targets = targets/targets.max()

plt.scatter(inputs, targets, color = '#934421')
plt.xlabel('Features')
plt.ylabel('Targets');
In [130]:
# Splitting the dataset

x_train, x_test, y_train, y_test = train_test_split(inputs, targets,
                                                    test_size=0.2, random_state=42)
In [131]:
fig = plt.figure(figsize=(14, 4))

ax1 = fig.add_subplot(1, 2, 1)
ax1.scatter(x_train, y_train, color='#934421')
ax1.set_title('Training Data')
ax1.set_xlabel('Features')
ax1.set_ylabel('Targets')

ax2 = fig.add_subplot(1, 2, 2)
ax2.scatter(x_test, y_test, color='#934421')
ax2.set_title('Test Data')
ax2.set_xlabel('Features')
ax2.set_ylabel('Targets')

plt.show()

Now, let's perform linear and KNN regression

In [132]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

grid_knn = GridSearchCV(estimator=KNeighborsRegressor(),
                       param_grid= {
                           'n_neighbors' : range(1,31),
                           'weights' : ['distance', 'uniform']
                       },
                       scoring='neg_mean_squared_error')

grid_knn.fit(x_train, y_train)
Out[132]:
GridSearchCV(estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': range(1, 31),
                         'weights': ['distance', 'uniform']},
             scoring='neg_mean_squared_error')
In [133]:
grid_knn.best_estimator_, grid_knn.best_score_
Out[133]:
(KNeighborsRegressor(n_neighbors=15), -0.013154379819456791)

The best score is the negative mean squared error, so on average our model makes a squared error of about 0.013 between predicted and actual target values.

In [134]:
# Let's fit KNN Regressor model with this best estimator and initiate Linear Regression

reg_linear = LinearRegression()
reg_linear.fit(x_train, y_train)
linear_prediction = reg_linear.predict(x_test)

reg_knn = grid_knn.best_estimator_
reg_knn.fit(x_train, y_train)
knn_prediction = reg_knn.predict(x_test)
In [135]:
fig = plt.figure(figsize=(14, 4))

ax1 = fig.add_subplot(1, 2, 1)
ax1.scatter(x_test, y_test, color='#934421')
ax1.plot(x_test, linear_prediction)
ax1.set_title('Fit with Linear Regression')
ax1.set_xlabel('Features')
ax1.set_ylabel('Targets')

ax2 = fig.add_subplot(1, 2, 2)
ax2.scatter(x_test, y_test, color='#934421')
ax2.plot(x_test, knn_prediction)
ax2.set_title('Fit with KNN Regressor')
ax2.set_xlabel('Features')
ax2.set_ylabel('Targets')

plt.show()

For KNN, we see a jagged or uneven line instead of a smooth straight line because:

The KNN regressor is non-parametric and non-linear. It doesn't learn a global equation (like linear regression does). Instead, it:

  • Predicts the target for each test point based on the local average of its k nearest neighbors
  • This causes the prediction line to zigzag or appear step-like, especially when k is small or the test points are not sorted before plotting

Let's now see how it looks with sorted values, and we will also try 3 different k values.

In [138]:
y_pred_knn = []
for i in [1,10,40]:
    model = KNeighborsRegressor(n_neighbors=i)
    model.fit(x_train, y_train)
    y_pred_knn.append(model.predict(x_test))
In [140]:
df = pd.DataFrame(data = {
                        'x_test' : list(x_test.flatten()),
                        'y_test' : list(y_test.flatten()),
                        'y_pred_linear' : list(linear_prediction.flatten()),
                        'y_pred_knn_cv' : list(knn_prediction.flatten()),
                        'y_pred_knn_1' : list(y_pred_knn[0].flatten()),
                        'y_pred_knn_10' : list(y_pred_knn[1].flatten()),
                        'y_pred_knn_40' : list(y_pred_knn[2].flatten()),
                    }
                 )

df = df.sort_values(by = ['x_test'])
x_test_sorted = df['x_test'].to_list()
y_test_sorted = df['y_test'].to_list()
y_pred_linear_sorted = df['y_pred_linear'].to_list()
y_pred_knn_cv_sorted = df['y_pred_knn_cv'].to_list()
y_pred_knn_1_sorted = df['y_pred_knn_1'].to_list()
y_pred_knn_10_sorted = df['y_pred_knn_10'].to_list()
y_pred_knn_40_sorted = df['y_pred_knn_40'].to_list()
In [143]:
fig = plt.figure(figsize=(14, 4))

ax1 = fig.add_subplot(1, 2, 1)
ax1.scatter(x_test_sorted, y_test_sorted, color='#934421')
ax1.plot(x_test_sorted, y_pred_linear_sorted)
ax1.set_title('Fit with Linear Regression')
ax1.set_xlabel('Features')
ax1.set_ylabel('Targets');
In [151]:
fig = plt.figure(figsize=(21, 6))  

ax1 = fig.add_subplot(1, 3, 1)
ax1.scatter(x_test_sorted, y_test_sorted, color='#934421')
ax1.plot(x_test_sorted, y_pred_knn_1_sorted, color='blue')
ax1.set_title('KNN Regression (k=1)')
ax1.set_xlabel('Features')
ax1.set_ylabel('Targets')

ax2 = fig.add_subplot(1, 3, 2)
ax2.scatter(x_test_sorted, y_test_sorted, color='#934421')
ax2.plot(x_test_sorted, y_pred_knn_10_sorted, color='blue')
ax2.set_title('KNN Regression (k=10)')
ax2.set_xlabel('Features')
ax2.set_ylabel('Targets')

ax3 = fig.add_subplot(1, 3, 3)
ax3.scatter(x_test_sorted, y_test_sorted, color='#934421')
ax3.plot(x_test_sorted, y_pred_knn_40_sorted, color='blue')
ax3.set_title('KNN Regression (k=40)')
ax3.set_xlabel('Features')
ax3.set_ylabel('Targets')

plt.tight_layout()
plt.show()
In [149]:
fig = plt.figure(figsize=(14, 4))

ax1 = fig.add_subplot(1, 2, 1)
ax1.scatter(x_test_sorted, y_test_sorted, color='#934421')
ax1.plot(x_test_sorted, y_pred_knn_cv_sorted)
ax1.set_title('Fit with KNN Regression')
ax1.set_xlabel('Features')
ax1.set_ylabel('Targets');

This graph shows the best fit with our data. Let's now calculate the error.

In [152]:
from sklearn.metrics import mean_squared_error

linear_reg_error = mean_squared_error(y_test, linear_prediction)
knn_reg_error = mean_squared_error(y_test, knn_prediction)
In [153]:
linear_reg_error
Out[153]:
0.010253765593013877
In [154]:
knn_reg_error
Out[154]:
0.010670577903965932

Error is almost the same for our data. For datasets that exhibit a clear linear relationship between features and target values, both linear regression and KNN regression can provide similar levels of accuracy. However, in such cases, linear regression is often a better choice. This is because linear regression is a parametric model that assumes a specific form for the relationship between variables, allowing it to capture the underlying trend efficiently with fewer data points. In contrast, KNN regression is non-parametric and relies on local patterns in the data. It may not effectively capture the overall trend, especially when predicting values outside the range of the training data. KNN also requires a larger dataset to perform well and struggles with extrapolation, where linear regression can still produce reasonable predictions based on the learned coefficients. Therefore, when the true relationship is known to be linear, linear regression often provides more reliable and interpretable results.


Cases when KNN Regressor can work better than Linear Regression¶

Here, we will write a non_linear_regression function that generates a synthetic non-linear regression dataset. It creates input features x and corresponding target values y based on a non-linear relationship that includes a quadratic and a sinusoidal component, plus optional Gaussian noise.

  • uni = lambda n : np.random.uniform(-2, 2, n) - Generates n uniformly distributed random numbers between -2 and 2.
  • add_noise = lambda n : np.random.normal(0, 1, n) - Generates n values from a standard normal distribution (mean=0, std=1) to be used as noise.
  • y = y_raw + noise * np.std(y_raw) * add_noise(n_samples) - Adds Gaussian noise scaled by:
    • noise factor
    • Standard deviation of y_raw, so that the noise is relative to the signal's variation.
In [157]:
from math import sin

# Create a function that generates a random non-linear dataset
def non_linear_regression(n_samples, noise = 0, random_state = None):
    
    if random_state:
        np.random.seed(random_state)

    uni = lambda n : np.random.uniform(-2, 2, n)
    add_noise =  lambda  n : np.random.normal(0, 1, n)
    
    x = uni(n_samples)
    x.sort()
    
    y_raw = [i**2 + sin(5*i) for i in x]
    y = y_raw + noise * np.std(y_raw) * add_noise(n_samples)
        
    return x, y
In [161]:
# Generate data without noise
inputs_no_noise, target_no_noise = non_linear_regression(300, 0, 42)

# Using the same random state as above, generate the data with some noise
inputs, target = non_linear_regression(300, 0.5, 42)
In [162]:
sns.set()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13,4))

ax1.scatter(inputs_no_noise, target_no_noise, color = '#000C1F')
ax1.set_title('Data without noise')
ax1.set_xlabel('Feature')
ax1.set_ylabel('Target')

ax2.scatter(inputs, target, color = '#000C1F')
ax2.set_title('Data with noise')
ax2.set_xlabel('Feature')
ax2.set_ylabel('Target');

Noise is introduced to make it a bit complicated.

In [163]:
x_train, x_test, y_train, y_test = train_test_split(inputs, 
                                                    target, 
                                                    test_size = 0.2, 
                                                    random_state = 42)
In [165]:
sns.set()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13,4))

ax1.scatter(x_train, y_train, color = '#000C1F')
ax1.set_title('Training data')
ax1.set_xlabel('Feature')
ax1.set_ylabel('Target')

ax2.scatter(x_test, y_test, color = '#000C1F')
ax2.set_title('Test data')
ax2.set_xlabel('Feature')
ax2.set_ylabel('Target');

Now let's fit linear regression and multiple KNN regression

In [166]:
reg_lin = LinearRegression()


# In sklearn, when fitting data with only 1 feature, the following reshaping should be applied.
reg_lin.fit(x_train.reshape(-1, 1), y_train)
y_pred_lin = reg_lin.predict(x_test.reshape(-1, 1))
In [167]:
k = 81
mse_lin = []

# Calculate the MSE value for the linear regression
mse_lin = mean_squared_error(y_test, y_pred_lin)

# The MSE value calculated above is the same for all values of K. 
# Therefore, we create an array storing that MSE value (k-1) many times.
# This will later be used to plot the MSE value versus the number of nearest neighbors.
mse_lin = [mse_lin]*(k-1)


mse_knn = []
for i in range(1, k):
    reg_knn = KNeighborsRegressor(n_neighbors = i)
    reg_knn.fit(x_train.reshape(-1, 1), y_train)
    y_pred_knn = reg_knn.predict(x_test.reshape(-1, 1))
    mse_knn.append(mean_squared_error(y_test, y_pred_knn))

Now let's plot MSE vs Number of Neighbors

In [168]:
sns.set()

fig, ax = plt.subplots()

# Since the linear regression is not affected by the value of K, the output is a straight line.
plt.plot(list(range(1, k)), 
         mse_lin, 
         color = 'orange', 
         label = 'linear')

# Plot the MSE of the KNN regressions versus the value of K.
plt.plot(list(range(1, k)), 
         mse_knn, 
         color = 'red', 
         marker = 'o', 
         markerfacecolor = '#000C1F',
         label = 'KNN')

ax.legend(loc='lower right')
ax.set_title('Mean-Squared Error (MSE)')
ax.set_xlabel('K')
ax.set_ylabel('MSE')
plt.ylim(0);

KNN gives a much better result because there is a non-linear relationship in the data. We could guess that a linear model would not be a good fit here, and that is indeed the case.

In [169]:
# Create a list to store the predictions from 3 KNN regressions
y_pred_knn = []

# Create 3 KNN regressions with K = 1, 7, and 80.
for i in [1, 7, 80]:
    reg_knn = KNeighborsRegressor(n_neighbors = i)
    reg_knn.fit(x_train.reshape(-1, 1), y_train)
    y_pred_knn.append(reg_knn.predict(x_test.reshape(-1, 1)))
In [170]:
df = pd.DataFrame(data = {'x_test':list(x_test.flatten()), 
                          'y_test':list(y_test.flatten()), 
                          'y_pred_lin':list(y_pred_lin.flatten()), 
                          'y_pred_knn-1':list(y_pred_knn[0].flatten()), 
                          'y_pred_knn-7':list(y_pred_knn[1].flatten()), 
                          'y_pred_knn-80':list(y_pred_knn[2].flatten())})


df = df.sort_values(by = ['x_test'])

x_test_sorted = df['x_test'].tolist()
y_test_sorted = df['y_test'].tolist()
y_pred_lin_sorted = df['y_pred_lin'].tolist()
y_pred_knn1_sorted = df['y_pred_knn-1'].tolist()
y_pred_knn7_sorted = df['y_pred_knn-7'].tolist()
y_pred_knn80_sorted = df['y_pred_knn-80'].tolist()

Plotting regression on top of test data

In [171]:
sns.set()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,4))

ax1.scatter(x_test_sorted, 
            y_test_sorted, 
            color = '#000C1F')

ax1.plot(x_test_sorted, 
         y_pred_lin_sorted, 
         color = 'orange')

ax1.set_title('Linear fit on top of the test data')
ax1.set_xlabel('Feature')
ax1.set_ylabel('Target')

ax2.scatter(x_test_sorted, 
            y_test_sorted, 
            color = '#000C1F')

ax2.plot(x_test_sorted, 
         y_pred_knn7_sorted, 
         color = 'red', 
         marker = 'o', 
         markerfacecolor = 'yellow')

ax2.set_title('KNN fit on top of the test data (K = 7)')
ax2.set_xlabel('Feature')
ax2.set_ylabel('Target');

Plotting regression on top of noiseless data

In [172]:
fig, (ax1,ax2, ax3) = plt.subplots(1,3, figsize=(16, 5))

# Plot the noiseless data on all 3 figures
ax1.scatter(inputs_no_noise, target_no_noise, color = '#000C1F')
ax2.scatter(inputs_no_noise, target_no_noise, color = '#000C1F')
ax3.scatter(inputs_no_noise, target_no_noise, color = '#000C1F')

# Plot the fit from a KNN regression (K = 1)
ax1.plot(x_test_sorted, 
         y_pred_knn1_sorted, 
         color = 'red',
         marker = 'o', 
         markerfacecolor = 'yellow')
ax1.set_title('K = 1')
ax1.set_xlabel('Feature')
ax1.set_ylabel('Target')

# Plot the fit from a KNN regression (K = 7)
ax2.plot(x_test_sorted, 
         y_pred_knn7_sorted, 
         color = 'red',
         marker = 'o', 
         markerfacecolor = 'yellow')
ax2.set_title('K = 7')
ax2.set_xlabel('Feature')
ax2.set_ylabel('Target')

# Plot the fit from a KNN regression (K = 80)
ax3.plot(x_test_sorted, 
         y_pred_knn80_sorted, 
         color = 'red',
         marker = 'o', 
         markerfacecolor = 'yellow')
ax3.set_title('K = 80')
ax3.set_xlabel('Feature')
ax3.set_ylabel('Target');

Thus, KNN is the better choice in scenarios where no linear relationship is observed.


Deciding on using Distance Metric¶

Let's revisit the Role of the p Parameter -

The distance metric used in KNN plays a critical role in how neighbors are selected. The general formula for distance is defined by the Minkowski metric, which can be written as:

$$ \left( |x_1 - x_2|^p + |y_1 - y_2|^p \right)^{1/p} $$

By varying the value of p, we get different types of distances:

  • When p = 1, the formula becomes the Manhattan (or taxicab) distance.
  • When p = 2, it becomes the familiar Euclidean distance.
  • Higher values of p result in other forms of the Minkowski distance, and as p → ∞, it approaches the Chebyshev distance (which only considers the largest coordinate difference).

While Euclidean distance (p = 2) is common, it is not always the best option for all datasets. Depending on the feature distribution, scale, or clustering of data, other values of p might offer better classification performance.

Instead of guessing which distance is optimal, we can automate this choice.

Using GridSearchCV to select the best p¶

The process involves:

  1. Setting up a parameter grid that includes multiple values for p (e.g., from 1 to 5 using range(1,6)).
  2. Creating an instance of GridSearchCV with:
    • The KNeighborsClassifier as the model (estimator).
    • The defined parameter grid.
    • A scoring method such as accuracy.
  3. Training multiple models, each with a different p value, using cross-validation.
  4. Choosing the best-performing model automatically by checking which parameter configuration gave the highest cross-validation accuracy.

So, instead of assuming a fixed distance metric, it's better to tune p as a hyperparameter. GridSearchCV makes this easy by systematically evaluating performance across a range of values. Using a higher-order Minkowski distance might capture subtle relationships in the data that simpler metrics miss. This approach ensures that the distance function is tailored to the data, potentially leading to more accurate and robust predictions. A sketch of such a search is shown below.
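
A hedged sketch of this search, reusing the x_train and y_train from the classifier demo above (the exact grid values are only an example):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid_p = GridSearchCV(estimator=KNeighborsClassifier(),
                      param_grid={'p': range(1, 6),              # Minkowski orders to try
                                  'n_neighbors': range(1, 31),
                                  'weights': ['uniform', 'distance']},
                      scoring='accuracy')

grid_p.fit(x_train, y_train)
print(grid_p.best_params_, grid_p.best_score_)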


KNN Pros and Cons¶

KNN Classifier¶

Pros:

  • Simple and easy to understand.
  • No training phase; makes predictions directly using the training data.
  • Naturally handles multi-class classification.
  • Effective when decision boundaries are irregular.
  • Flexible to feature types with appropriate distance metrics (e.g., Hamming for categorical).

Cons:

  • Computationally expensive at prediction time (especially with large datasets).
  • Sensitive to the choice of distance metric and the value of k.
  • Poor performance on imbalanced datasets.
  • Affected by noisy data and irrelevant features.
  • Requires proper feature scaling (e.g., normalization or standardization).

KNN Regressor¶

Pros:

  • Makes no assumption about the underlying data distribution.
  • Can model complex, non-linear relationships.
  • Easy to implement and intuitive.
  • Adapts well to local data patterns.
  • Performs well when the target function is smooth.

Cons:

  • Struggles with extrapolation outside the training data range.
  • Sensitive to outliers in target values.
  • Requires large datasets to work effectively.
  • Slower predictions, because there is no training phase and the full dataset must be scanned at prediction time.
  • Choice of k and feature scaling strongly impact performance.

KNN Classifier and Regressor – Drawbacks¶

  1. Computational cost during prediction: Since KNN stores the entire training dataset and performs calculations for each prediction, it can be slow and inefficient, especially with large datasets.

  2. Curse of dimensionality: As the number of features increases, the distance between data points becomes less meaningful. This reduces the effectiveness of the algorithm and may lead to poor performance in high-dimensional spaces.

  3. Limited suitability for categorical data: KNN struggles with purely categorical features unless special distance metrics are applied. Standard Euclidean distance does not work well for non-numeric data.

  4. Highly sensitive to irrelevant or noisy features: If the dataset contains unimportant or noisy features, they can distort distance calculations and reduce prediction accuracy. Proper feature selection or scaling is essential.

  5. Memory intensive: Because KNN stores the entire dataset, it can consume a lot of memory, especially for large training sets.

  6. No model interpretation: KNN does not provide an explicit model or formula, so it lacks interpretability compared to parametric methods like linear regression.

  7. Poor at extrapolation: In regression tasks, KNN cannot predict values outside the range of the training data, unlike linear regression which can extend the fitted line.