Details on SVM Classification¶
Support Vector Machines (SVMs) are supervised machine learning algorithms used for both classification and regression tasks. Although they are more commonly associated with classification, SVMs are versatile and can be adapted to continuous output prediction as well. The core idea behind SVMs is to find the optimal separating boundary between data points belonging to different classes or to fit a function that approximates the underlying pattern in regression.
Classification with SVMs¶
In a classification setting, the goal of an SVM is to separate objects into distinct categories. The algorithm does this by identifying a decision boundary, known as a hyperplane, which maximizes the margin between different class labels. The margin is the distance between the hyperplane and the closest data points from each class. These closest points are referred to as support vectors and play a key role in defining the position of the hyperplane. A larger margin generally contributes to better generalization and robustness of the model.
Regression using SVMs¶
For regression tasks, the SVM approach changes slightly. Instead of classifying points, the algorithm predicts continuous numerical values. Here, the model does not seek an exact fit to the training data; instead, it introduces a margin of tolerance (often denoted epsilon) around the predicted function, within which errors are considered acceptable. This flexibility allows the model to balance bias and variance, helping avoid overfitting while still capturing the data's underlying trend.
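To make this concrete, the sketch below uses scikit-learn's SVR on a small synthetic sample (made up purely for illustration); the epsilon parameter is the tolerance margin described above, and C penalizes points that fall outside it.
import numpy as np
from sklearn.svm import SVR
# Synthetic 1-D regression data (illustrative only)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(40)
# epsilon sets the width of the tolerance tube: errors smaller than epsilon
# are ignored; C penalizes points that fall outside the tube
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X, y)
print(svr.predict([[2.5]]))  # predicted value near sin(2.5)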
Linear and Non-Linear Decision Boundaries¶
While SVMs are inherently linear classifiers, they can also be extended to solve non-linear problems through a technique known as the kernel trick. Kernels are mathematical functions that transform the input data into higher-dimensional spaces, where a linear separator might exist even if the data is not linearly separable in the original space. Commonly used kernels include radial basis function (RBF), polynomial, and sigmoid. This adaptability to non-linear relationships is a key strength of SVMs, making them suitable for a wide variety of classification and regression problems.
Performance and Practical Considerations¶
SVMs are known to perform well in high-dimensional spaces, especially when the number of features exceeds the number of samples. They are also efficient at predicting once trained, which is beneficial in applications where prediction speed matters. However, the training phase can be computationally intensive and time-consuming, especially with large datasets. Furthermore, one of the limitations of SVMs is their lack of interpretability. Due to the mathematical complexity involved in kernel transformations and the model’s internal operations, it is often difficult to explain how a particular prediction was made or what influenced it most.
Understanding the Decision Boundary in SVMs through a Fraud Detection Example¶
Imagine you're analyzing transactional data from an online payment system. Each transaction includes information like the amount of money transferred and the time of day. You want to predict whether a new transaction is fraudulent or not based on historical data.
Let’s assume you plot this historical data on a 2D graph where the x-axis is the amount and the y-axis is the time. Transactions are labeled either as "fraud" or "not fraud". The goal is to classify new transactions based on this plot.
Finding a Separating Hyperplane¶
In two dimensions, the separating hyperplane is a line that divides the plane into two parts - one for each class. The central question is: where should this line be placed? There are infinitely many ways to draw a line that separates the two classes, but not all of them are optimal.
SVMs tackle this by choosing the maximum margin hyperplane. Instead of just separating the classes, SVM looks for the line (or hyperplane in higher dimensions) that maximizes the distance to the closest data points from both classes. These closest points are called support vectors, and they essentially "support" or define the boundary.
What is a Hyperplane?¶
In a general sense, a hyperplane is a subspace whose dimension is one less than that of its ambient space. If you have data with n features (or dimensions), the separating hyperplane will be (n-1)-dimensional. For example:
- In 2D (two features), the hyperplane is a line.
- In 3D, it’s a plane.
- In 4D or higher, it becomes more abstract but still behaves similarly in principle.
General form of a hyperplane in an n-dimensional space:
$$ w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = 0 $$
Where:
- $w$ is the vector of weights,
- $x$ is the input feature vector,
- $b$ is the bias term.
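In scikit-learn, a fitted linear SVC exposes these quantities directly: coef_ holds the weight vector w and intercept_ holds the bias b. A minimal sketch, using two tiny made-up clusters, shows that the decision function is exactly w·x + b:
import numpy as np
from sklearn.svm import SVC
# Two tiny, made-up clusters (illustrative only)
X = np.array([[1, 2], [2, 3], [6, 6], [7, 7]])
y = np.array([0, 0, 1, 1])
clf = SVC(kernel='linear', C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]   # hyperplane: w.x + b = 0
# decision_function returns w.x + b; its sign gives the predicted class
x_new = np.array([[4, 4]])
print(np.dot(x_new, w) + b, clf.decision_function(x_new))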
Hard Margin SVM: Perfect Separation¶
A hard margin SVM tries to find a separating hyperplane without any misclassification. It assumes the data is linearly separable - which means you can draw a clean boundary between classes without any overlap.
Going back to our transaction example, imagine you plotted all transactions and there’s a clear linear boundary - maybe all low-value, late-night transactions are frauds, and high-value, business-hour transactions are not. If that’s the case, a hard margin SVM can perfectly draw a boundary without errors.
However, hard margin SVMs are sensitive to outliers. If just one transaction is mislabeled or is an exception, it can force the model to draw a poor boundary.
Soft Margin SVM: Handling Real-World Imperfections¶
Real-world data is rarely clean. Some genuine transactions might look suspicious (low-value at odd hours) and vice versa. This is where soft margin SVMs come in. They allow some misclassifications while still trying to keep the margin as wide as possible.
In this case, we don’t insist that every transaction be on the correct side of the boundary. Instead, we allow a few points to fall on the wrong side of the margin or even the wrong side of the hyperplane. The SVM balances two objectives:
- Maximize the margin.
- Minimize the number and extent of violations (misclassifications).
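In its standard textbook form, this balance is written as an optimization problem in which slack variables $\xi_i$ measure how far each point violates the margin, and a parameter $C$ (discussed next) weights those violations against the margin width:
$$ \min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 $$
Here $y_i \in \{-1, +1\}$ are the class labels; forcing all $\xi_i = 0$ recovers the hard margin case.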
The Role of C: Regularization in SVM¶
The trade-off between maximizing the margin and allowing violations is controlled by a regularization parameter called C.
- A small C value gives you a wider margin but allows more misclassifications. This is useful when your data is noisy, and you don’t want the model to overfit.
- A large C value gives you a narrower margin with fewer misclassifications, effectively behaving like a hard margin. This works well when your data is clean and separable.
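One way to see this in code is to fit the same linear SVC with a small and a large C and compare the resulting margin width, which for a linear kernel equals $2/\|w\|$. The data below is made up and slightly overlapping, purely for illustration:
import numpy as np
from sklearn.svm import SVC
# Made-up, slightly overlapping 2-D data (illustrative only)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 1, rng.randn(20, 2) + 1])
y = np.array([0] * 20 + [1] * 20)
for C in (0.01, 100):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])   # geometric margin width
    print(f"C={C}: margin width = {margin:.2f}, "
          f"support vectors = {len(clf.support_vectors_)}")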
Bias-Variance Trade-off in SVMs¶
The choice of C is directly related to the bias-variance trade-off:
- Low C (wider margin): High bias, low variance. The model is simpler and less sensitive to small fluctuations in training data.
- High C (narrow margin): Low bias, high variance. The model tries to fit training data tightly and may not generalize well.
Let’s now visualize this relationship with a graph of model complexity vs error.
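The snippet below is one way to produce such a graph; the curves are illustrative shapes (not measured from a trained model), chosen only to show how the bias and variance components typically trade off as C grows:
import numpy as np
import matplotlib.pyplot as plt
# Illustrative curves only: bias falls and variance rises with complexity
complexity = np.linspace(0.1, 10, 200)     # stands in for increasing C
bias_sq = 1.0 / complexity                 # decreasing bias component
variance = 0.1 * complexity                # increasing variance component
total_error = bias_sq + variance           # their sum
plt.figure(figsize=(8, 5))
plt.plot(complexity, bias_sq, label='Bias²')
plt.plot(complexity, variance, label='Variance')
plt.plot(complexity, total_error, 'g--', label='Total error')
plt.xlabel('Model complexity (increasing C)')
plt.ylabel('Error')
plt.title('Bias-Variance Trade-off in SVMs')
plt.legend()
plt.grid(True)
plt.show()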
The graph above shows how model complexity (controlled by the SVM regularization parameter C) influences bias and variance:
- Left side (low C, less complex model): High bias and low variance. The model makes simplifying assumptions and may underfit the data.
- Right side (high C, more complex model): Low bias and high variance. The model tries to fit training data tightly, increasing risk of overfitting.
- The green dashed curve represents the total error, which is minimized at a balance between bias and variance — this is where cross-validation helps identify the best C value.
This reflects the practical goal in SVM: not just separating data, but doing so in a way that generalizes well to new, unseen data.
Example of Hard Margin in Python¶
Let's consider the following input data:
- Class 0 (non-fraudulent): [1, 2], [2, 3], [3, 3], [4, 5]
- Class 1 (fraudulent): [6, 6], [7, 7], [8, 8], [9, 10]
Each point has two features:
- Amount of money transferred (x-axis)
- Time of transaction (y-axis)
This dataset is clearly linearly separable - you can draw a straight line that cleanly separates all class 0 from class 1 points.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
# Create simple linearly separable data
X_hard = np.array([
[1, 2], [2, 3], [3, 3], [4, 5], # Class 0
[6, 6], [7, 7], [8, 8], [9, 10] # Class 1
])
y_hard = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# Fit hard-margin SVM (C is very large to prevent misclassification)
clf_hard = SVC(kernel='linear', C=100)
clf_hard.fit(X_hard, y_hard)
# Plotting
plt.figure(figsize=(8, 6))
plt.title("Hard Margin SVM: Linearly Separable Data")
# Plot points
plt.scatter(X_hard[:, 0], X_hard[:, 1], c=y_hard, cmap=plt.cm.coolwarm, s=60)
# Plot support vectors
plt.scatter(clf_hard.support_vectors_[:, 0], clf_hard.support_vectors_[:, 1],
s=100, facecolors='none', edgecolors='k', label='Support Vectors')
# Plot decision boundary
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 200)
yy = np.linspace(ylim[0], ylim[1], 200)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf_hard.decision_function(xy).reshape(XX.shape)
# Plot decision boundary and margins
plt.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], linestyles=['--', '-', '--'])
plt.xlabel("Amount Transferred")
plt.ylabel("Time of Transaction")
plt.legend()
plt.grid(True)
plt.show()
- The black solid line is the decision boundary (hyperplane).
- The dashed lines on either side represent the margins.
- The support vectors (highlighted with black circles) lie exactly on these margin lines.
- No points are misclassified or fall inside the margin, making this a perfect case for a hard margin SVM.
Because the regularization parameter C is set to a very large value (100), the model does not tolerate any misclassification. This forces a hard boundary that separates all points correctly. This setup mirrors an ideal, noiseless situation - useful for understanding the pure concept of margin maximization. In real-world scenarios, though, soft margin SVMs are typically more appropriate due to noise and overlapping class distributions.
Example of Soft Margin in Python¶
# Introduce an overlapping point to create a non-linearly separable scenario
X_soft = np.array([
[1, 2], [2, 3], [3, 3], [4, 5], # Class 0
[6, 6], [7, 7], [8, 8], [4.5, 4.5] # Class 1 (last point added closer to class 0)
])
y_soft = np.array([0, 0, 0, 0, 1, 1, 1, 1]) # Notice one class 1 point overlaps with class 0
# Fit soft-margin SVM (with lower C to allow for margin violations)
clf_soft = SVC(kernel='linear', C=1)
clf_soft.fit(X_soft, y_soft)
# Plotting
plt.figure(figsize=(8, 6))
plt.title("Soft Margin SVM: Non-linearly Separable Data with Slight Overlap")
# Plot points
plt.scatter(X_soft[:, 0], X_soft[:, 1], c=y_soft, cmap=plt.cm.coolwarm, s=60)
# Plot support vectors
plt.scatter(clf_soft.support_vectors_[:, 0], clf_soft.support_vectors_[:, 1],
s=100, facecolors='none', edgecolors='k', label='Support Vectors')
# Plot decision boundary
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 200)
yy = np.linspace(ylim[0], ylim[1], 200)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf_soft.decision_function(xy).reshape(XX.shape)
# Plot decision boundary and margins
plt.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], linestyles=['--', '-', '--'])
plt.xlabel("Amount Transferred")
plt.ylabel("Time of Transaction")
plt.legend()
plt.grid(True)
plt.show()
Understanding Kernels in detail in Support Vector Machines¶
Kernels are a fundamental component of Support Vector Machines, especially when dealing with data that is not linearly separable in its original feature space. In many real-world applications, such as image recognition, bioinformatics, or text classification, the relationship between input features and their corresponding classes is often complex and non-linear. A simple straight line or hyperplane in the original feature space may not suffice to separate the classes effectively.
To address this, kernels enable the SVM to implicitly map the data into a higher-dimensional feature space where linear separation may become possible. This transformation helps uncover hidden structures or patterns in the data that are not evident in the original dimensions. For instance, data points that are entangled in a spiral or circular pattern in two dimensions might become linearly separable when lifted into a third or higher dimension. However, directly computing this transformation for each data point can be computationally expensive and inefficient, particularly when the target space is very high-dimensional or even infinite-dimensional. This is where the kernel trick becomes invaluable.
The kernel trick allows the SVM to perform the required computations in the transformed feature space without ever explicitly computing the transformation itself. Instead of transforming each data point, the SVM relies on a kernel function, which computes the inner product between pairs of points as if they had been mapped to the higher-dimensional space. This makes the entire process significantly more efficient while still retaining the power to model complex, non-linear decision boundaries. Through kernels, SVMs become not just linear classifiers, but highly flexible and powerful tools for classification and regression in non-linear settings. The choice of kernel function—whether linear, polynomial, radial basis function (RBF), sigmoid, or a custom kernel—directly affects the type of decision boundary the model can learn, making it a critical aspect of SVM performance.
Linear Mapping and the Kernel Trick¶
Mathematically, let us say we have two data points $x_1$ and $x_2$. A kernel function $K(x_1, x_2)$ computes the dot product of these two points after a transformation $\phi$, such that:
$$ K(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle $$
Here, $\phi$ is a mapping function that projects the original points into a higher-dimensional space. However, we never need to compute $\phi(x)$ explicitly. The kernel function allows us to compute the dot product in that high-dimensional space directly, which is computationally efficient and memory-friendly. This is called implicit transformation, and it is the core of the kernel trick.
What is a Kernel?¶
A kernel is a function that computes a similarity measure between two data points. Formally, for inputs $x, x' \in \mathbb{R}^n$, a kernel function is defined as:
$$ K(x, x') = \langle \phi(x), \phi(x') \rangle $$
Here, $\phi$ is the feature map that transforms data into a high-dimensional (possibly infinite-dimensional) space. In SVMs, this allows the decision function to operate in a new space where linear separability is possible, even if it isn't in the original space.
Types of Kernel Functions¶
There are several types of kernel functions, each with distinct characteristics. The choice of kernel depends on the data distribution and the problem being solved.
1. Linear Kernel¶
This is the simplest kernel function and corresponds to the standard dot product in the input space:
$$ K(x, x') = x^\top x' $$
Use case: When the data is linearly separable in the original feature space. This kernel is computationally efficient and often used in text classification problems with high-dimensional sparse data.
Example: Classifying emails as spam or not based on term frequency vectors.
2. Polynomial Kernel¶
This kernel represents the similarity of vectors in a feature space over polynomials of the original variables.
$$ K(x, x') = (\gamma x^\top x' + r)^d $$
Where $\gamma$, $r$, and $d$ are kernel parameters:
- $\gamma$: scale of the dot product,
- $r$: a constant coefficient,
- $d$: degree of the polynomial.
Use case: When the relationship between class labels and features is polynomial.
Example: Image data with interactions between pixel intensities that cannot be modeled linearly.
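In scikit-learn's SVC these parameters are exposed as gamma, coef0 ($r$), and degree ($d$), so the kernel above can be requested directly. A minimal sketch on made-up points:
import numpy as np
from sklearn.svm import SVC
X = np.array([[1, 2], [2, 3], [6, 6], [7, 7]])
y = np.array([0, 0, 1, 1])
# K(x, x') = (gamma * x.x' + coef0)^degree  ->  gamma, coef0 = r, degree = d
clf_poly = SVC(kernel='poly', degree=3, gamma=0.5, coef0=1.0, C=1.0)
clf_poly.fit(X, y)
print(clf_poly.predict([[2, 2], [6, 7]]))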
3. Radial Basis Function (RBF) or Gaussian Kernel¶
One of the most popular and powerful kernels. It measures the similarity based on the Euclidean distance between feature vectors:
$$ K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) $$
Where $\sigma$ controls the spread or smoothness of the decision boundary.
Use case: When data has non-linear patterns and complex boundaries.
Example: Detecting fraudulent transactions where normal and fraudulent behavior are not linearly separable in the feature space.
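Note that scikit-learn parameterizes this kernel as $K(x, x') = \exp(-\gamma \|x - x'\|^2)$, so $\gamma = 1/(2\sigma^2)$: a small gamma (large $\sigma$) yields a smoother boundary, a large gamma a more wiggly one. A minimal sketch of passing an equivalent gamma, on made-up points:
import numpy as np
from sklearn.svm import SVC
# scikit-learn's RBF kernel is exp(-gamma * ||x - x'||^2), so gamma = 1 / (2 * sigma^2)
sigma = 1.0
X = np.array([[0, 0], [1, 1], [4, 4], [5, 5]])   # tiny made-up sample
y = np.array([0, 0, 1, 1])
clf_rbf = SVC(kernel='rbf', gamma=1.0 / (2 * sigma ** 2), C=1.0).fit(X, y)
print(clf_rbf.predict([[0.5, 0.5], [4.5, 4.5]]))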
4. Sigmoid Kernel¶
Based on the hyperbolic tangent function:
$$ K(x, x') = \tanh(\gamma x^\top x' + r) $$
This kernel behaves like a neural network activation function and mimics a two-layer perceptron.
Use case: Less commonly used but interesting for drawing analogies between SVMs and neural networks.
Example: Biometric authentication problems where the data has saturation effects.
5. Custom or Precomputed Kernels¶
Users can define their own kernels or precompute similarity matrices for special applications where domain knowledge allows better feature design than standard kernels.
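As a sketch of how this looks in scikit-learn, SVC accepts either a Python callable that computes the kernel matrix between two sets of samples, or kernel='precomputed' together with an explicit Gram matrix; the quadratic kernel below is just an illustrative choice:
import numpy as np
from sklearn.svm import SVC
X_train = np.array([[1, 2], [2, 3], [6, 6], [7, 7]], dtype=float)
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[2, 2], [6, 7]], dtype=float)
def quadratic_kernel(A, B):
    """Custom kernel: (A . B^T)^2, computed between two sample sets."""
    return np.dot(A, B.T) ** 2
# Option 1: pass the callable directly
clf_callable = SVC(kernel=quadratic_kernel).fit(X_train, y_train)
print(clf_callable.predict(X_test))
# Option 2: precompute the Gram matrices yourself
clf_pre = SVC(kernel='precomputed').fit(quadratic_kernel(X_train, X_train), y_train)
print(clf_pre.predict(quadratic_kernel(X_test, X_train)))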
Implicit Transformation with Example¶
Let’s consider a simple example to illustrate implicit transformation.
Suppose we have two-dimensional data:
- Class A: [1, 1], [2, 2]
- Class B: [-1, -1], [-2, -2]
Clearly, this data is linearly separable. But what if the data was in the form:
- Class A: [1, 1], [-1, -1]
- Class B: [1, -1], [-1, 1]
In 2D, this data forms the classic XOR-like problem, which is not linearly separable.
Let's apply a mapping:
$$ \phi(x_1, x_2) = (x_1^2, \sqrt{2}x_1x_2, x_2^2) $$
Under this transformation:
- [1, 1] becomes [1, √2, 1]
- [-1, -1] becomes [1, √2, 1]
- [1, -1] becomes [1, -√2, 1]
- [-1, 1] becomes [1, -√2, 1]
Now the data is linearly separable in 3D: the middle coordinate is positive for Class A and negative for Class B, so a plane through the middle coordinate's zero level separates them. Instead of performing the mapping explicitly, a kernel function - here the degree-2 polynomial kernel $K(x, x') = (x^\top x')^2$, whose feature map is exactly this $\phi$ - can do this implicitly. This is computationally advantageous because it avoids the need to compute and store high-dimensional feature vectors.
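A quick numerical check (a sketch, not part of the derivation above) confirms that the degree-2 polynomial kernel reproduces the dot product in the mapped space without ever forming $\phi(x)$:
import numpy as np
def phi(x):
    """Explicit feature map: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])
def poly_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel: (x . z)^2."""
    return np.dot(x, z) ** 2
a, b = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(np.dot(phi(a), phi(b)))   # dot product in the 3-D mapped space: 121.0
print(poly_kernel(a, b))        # same value, computed in the original 2-D space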
Visual Intuition Behind Kernels¶
Consider again the fraud detection scenario. If transaction amount and time cannot clearly separate fraud and non-fraud in 2D, you might use a Gaussian kernel to project data into a higher-dimensional space where the radial behavior becomes a feature.
In that space:
- Points that are close in time and amount but belong to different classes may now be separated.
- The kernel uses the "distance" or similarity between all pairs of points to help define boundaries.
The decision boundary in the original space will appear non-linear, but it’s actually a linear separation in the transformed space.
Choosing the Right Kernel¶
The selection of a kernel function is not always straightforward. It often depends on:
- The structure of your data
- The expected decision boundary complexity
- Computational efficiency requirements
Linear kernels are best for linearly separable and sparse high-dimensional data. RBF kernels are suitable for general-purpose non-linear decision boundaries. Polynomial kernels may work better when interaction between features is known to be polynomial.
Kernel Trick in Detail¶
The kernel trick is a computational shortcut that allows an algorithm to learn a non-linear decision boundary by implicitly mapping data into a higher-dimensional space, without ever performing the transformation explicitly. To understand it properly, let’s start with a real-world-like scenario - classifying data that forms two nested circles.
Imagine plotting two classes of data on a 2D plane:
- Class A (label = 0): Forms a small circle centered at the origin.
- Class B (label = 1): Forms a larger circle surrounding the small circle.
In this 2D space, it’s impossible to draw a straight line that separates the two classes. No linear boundary can cleanly divide them. If you try to draw one, it will either cut through both circles or misclassify many points. This type of data distribution is non-linearly separable in 2D.
Now imagine you add a third dimension - say, the radial distance of each point from the origin, computed as:
$$ z = x^2 + y^2 $$
This transformation lifts each point into 3D space. Points that were closer to the center (the smaller circle) will now lie on a lower z-level, and points from the larger circle (which are farther from the origin) will rise to higher z-values. What we now have is a 3D "bowl" shape, with the inner circle at the bottom and the outer circle at the top.
In this transformed space, a flat plane (hyperplane in 3D) can now separate the two classes. This plane would be parallel to the x-y plane and lie somewhere between the two z-levels corresponding to the two classes. So, even though the classes weren’t separable in 2D, by lifting them into 3D using a transformation, we made linear separation possible.
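This explicit lift is easy to sketch in code: generate two noisy concentric circles with scikit-learn's make_circles (which, unlike the description above, labels the outer circle 0 and the inner one 1; the argument is unchanged), append the feature $z = x^2 + y^2$, and a plain linear SVC separates the lifted data:
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC
# Two noisy concentric circles; factor sets the radius ratio
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=42)
# Explicitly add the third feature z = x^2 + y^2
z = (X ** 2).sum(axis=1).reshape(-1, 1)
X_lifted = np.hstack([X, z])
# A linear separator now works in the lifted 3-D space
clf = SVC(kernel='linear', C=1.0).fit(X_lifted, y)
print("Training accuracy in lifted space:", clf.score(X_lifted, y))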
Transforming each data point explicitly into a higher dimension (like computing $x^2 + y^2$) might be feasible for simple 2D to 3D projections, but this quickly becomes computationally expensive in real-world problems:
- Real datasets often have hundreds or thousands of features.
- Higher-dimensional transformations lead to exponential growth in memory and computation.
- Managing the transformed vectors explicitly in high-dimensional space is inefficient or even infeasible.
This is where the kernel trick comes in.
The Kernel Trick: Intuition and Mechanism¶
The kernel trick lets us benefit from working in a higher-dimensional space without actually going there. Instead of transforming data explicitly and then calculating dot products between transformed vectors, we use a kernel function that calculates the dot product directly in the higher-dimensional space.
Formally, for a transformation function $\phi$, and input vectors $x$ and $x'$, the kernel trick says:
$$ K(x, x') = \langle \phi(x), \phi(x') \rangle $$
This means we can compute the inner product in the transformed space without ever computing $\phi(x)$ or $\phi(x')$.
In the nested circles example, if we use a Radial Basis Function (RBF) kernel:
$$ K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) $$
This kernel implicitly maps each data point into an infinite-dimensional space and calculates the similarity between points based on their Euclidean distance. Points that are close together (e.g., within the same circle) will have high similarity values. Points that are far apart (e.g., across circles) will have low similarity.
As a result, in this implicit feature space:
- The inner circle points cluster together,
- The outer circle points form another cluster,
- And a linear separator (hyperplane) can be found.
This is done without ever computing the actual transformation from 2D to a higher-dimensional feature space.
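The same outcome can be sketched by handing the raw 2D circles directly to an RBF-kernel SVC, with no manual lifting step; the linear kernel is shown alongside for contrast:
from sklearn.datasets import make_circles
from sklearn.svm import SVC
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=42)
# Linear kernel in the original 2-D space: accuracy stays near chance
print("Linear:", SVC(kernel='linear', C=1.0).fit(X, y).score(X, y))
# RBF kernel: the lifting happens implicitly inside the kernel
print("RBF:   ", SVC(kernel='rbf', gamma=1.0, C=1.0).fit(X, y).score(X, y))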
Key Advantages of the Kernel Trick¶
- Efficiency: Avoids the high cost of explicitly transforming data.
- Flexibility: Allows SVM to model very complex relationships between features.
- Scalability: Kernel computations depend only on pairwise similarities, not on the size of the transformed feature space.
Code Section - Classification of Mushroom dataset - Edible vs Poisonous¶
Importing Libraries¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, \
recall_score, f1_score, classification_report, ConfusionMatrixDisplay
Loading dataset¶
df = pd.read_csv('mushrooms-full-dataset.csv')
df.shape
(8124, 22)
df.head()
| poisonous | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |
5 rows × 22 columns
# Lets see if there is any null
df.isna().sum()
poisonous                   0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64
# Now let's check the target class counts
df['poisonous'].value_counts(normalize=True)
poisonous
e    0.517971
p    0.482029
Name: proportion, dtype: float64
Classes are roughly balanced, which is good for our classifier.
Data Preprocessing¶
Split the data into target and feature set
inputs = df.iloc[:,1:]
targets = df.iloc[:, 0]
inputs.head(2)
| cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | x | s | n | t | p | f | c | n | k | e | ... | s | w | w | p | w | o | p | k | s | u |
1 | x | s | y | t | a | f | c | b | k | e | ... | s | w | w | p | w | o | p | n | n | g |
2 rows × 21 columns
Splitting into training and testing dataset
xtrain, xtest, ytrain, ytest = train_test_split(inputs, targets,
test_size=0.2, random_state=42,
stratify=targets)
ytrain.value_counts(normalize=True)
poisonous
e    0.517926
p    0.482074
Name: proportion, dtype: float64
ytest.value_counts(normalize=True)
poisonous
e    0.518154
p    0.481846
Name: proportion, dtype: float64
This confirms that stratification preserved the class distribution in both splits.
Now we will define encoders for the features and the target, since SVC requires numeric inputs.
enc_features = OrdinalEncoder()
enc_label = LabelEncoder()
xtrain_transf = enc_features.fit_transform(xtrain)
xtest_transf = enc_features.transform(xtest)
ytrain_transf = enc_label.fit_transform(ytrain)
ytest_transf = enc_label.transform(ytest)
xtrain_transf[:2]
array([[2., 3., 9., 0., 2., 1., 0., 0., 2., 0., 1., 1., 6., 0., 0., 2., 1., 2., 1., 5., 1.],
       [5., 2., 5., 1., 5., 1., 0., 0., 1., 0., 2., 2., 2., 7., 0., 2., 2., 0., 7., 1., 6.]])
ytrain_transf[:2]
array([1, 0])
Rescaling the data: SVC is sensitive to feature scales, so we rescale the inputs to the range of -1 to 1
scaling = MinMaxScaler(feature_range= (-1,1)).fit(xtrain_transf)
xtrain_scaled = scaling.transform(xtrain_transf)
xtest_scaled = scaling.transform(xtest_transf)
Creating Classification models¶
# Starting with linear kernel first
C = 1.0 # regularization parameter controlling the trade-off between margin width and misclassification
svc_linear = svm.SVC(kernel='linear', C=C).fit(xtrain_scaled, ytrain_transf)
ypred_test = svc_linear.predict(xtest_scaled)
print(confusion_matrix(ytest_transf, ypred_test))
[[815  27]
 [ 28 755]]
fig, axes = plt.subplots(figsize=(10,4))
cmd = ConfusionMatrixDisplay(
confusion_matrix(ytest_transf, ypred_test),
display_labels=['Edible', 'Poisonous']
)
cmd.plot(ax=axes);
print(classification_report(
ytest_transf, ypred_test, target_names = ['Edible', 'Poisonous'])
)
              precision    recall  f1-score   support

      Edible       0.97      0.97      0.97       842
   Poisonous       0.97      0.96      0.96       783

    accuracy                           0.97      1625
   macro avg       0.97      0.97      0.97      1625
weighted avg       0.97      0.97      0.97      1625
Let's see if we can improve the classifier
Using Cross Validation approach¶
hyperparameters = [
{'kernel' : ['linear'], 'C' : [1,10]},
{'kernel' : ['poly'], 'C':[0.1,1,10]},
{'kernel' : ['rbf'], 'gamma':[1e-3, 1e-4], 'C':[1,10]}
]
# gamma is used with the rbf kernel; it controls how far the influence of a single training example reaches
scores = ['precision', 'recall']
for score in scores:
    print('Tuning hyperparameter for ', score)
    print()
    clf = GridSearchCV(svm.SVC(), hyperparameters, scoring=score) # cv=5 by default
    clf.fit(xtrain_scaled, ytrain_transf)
    print('Best parameters found : \n', clf.best_params_)
    print()
    print('Grid score on development set:\n')
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print(mean, std * 2, params)
    print()
    print('Detailed Classification Report : \n')
    y_true, y_pred = ytest_transf, clf.predict(xtest_scaled)
    print(classification_report(y_true, y_pred))
    print('\n')
Tuning hyperparameter for  precision

Best parameters found : 
 {'C': 0.1, 'kernel': 'poly'}

Grid score on development set:

0.9581685809539241 0.018149967606886655 {'C': 1, 'kernel': 'linear'}
0.9609007016248526 0.014967629211791312 {'C': 10, 'kernel': 'linear'}
1.0 0.0 {'C': 0.1, 'kernel': 'poly'}
0.9996810207336523 0.001275917065390786 {'C': 1, 'kernel': 'poly'}
1.0 0.0 {'C': 10, 'kernel': 'poly'}
0.9472367421129387 0.017886730548394 {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.9425038002986422 0.019197193605451017 {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.9550908071345099 0.019754676048719444 {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.9472367421129387 0.017886730548394 {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}

Detailed Classification Report : 

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       842
           1       1.00      0.99      0.99       783

    accuracy                           0.99      1625
   macro avg       0.99      0.99      0.99      1625
weighted avg       0.99      0.99      0.99      1625


Tuning hyperparameter for  recall

Best parameters found : 
 {'C': 1, 'kernel': 'poly'}

Grid score on development set:

0.9387172549439239 0.035570029622332376 {'C': 1, 'kernel': 'linear'}
0.9543584491289216 0.02122281349979636 {'C': 10, 'kernel': 'linear'}
0.9757417796597215 0.009336022016715193 {'C': 0.1, 'kernel': 'poly'}
1.0 0.0 {'C': 1, 'kernel': 'poly'}
1.0 0.0 {'C': 10, 'kernel': 'poly'}
0.892436216885519 0.006513038319161858 {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.8145599258092953 0.01683093150004418 {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.9336120580277297 0.010930774298062718 {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.892436216885519 0.006513038319161858 {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}

Detailed Classification Report : 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       842
           1       1.00      1.00      1.00       783

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625