What are Outliers?¶
Outliers are data points that differ significantly from other observations in a dataset. These deviations may occur due to variability in the data, measurement error, or rare occurrences.
Statistically, outliers distort summary statistics (such as the mean, variance, and correlation) and can severely degrade model performance, particularly for models sensitive to distances or distributional assumptions.
Causes of Outliers¶
Data Entry Errors - Manual input mistakes or system glitches can create unrealistic values. Example: Typing 10000 instead of 1000, or misplacing a decimal (e.g., 1.2 vs. 12).
Measurement Errors - Sensors or instruments may malfunction or lose calibration. Example: A faulty temperature sensor recording 300°C in a normal room.
Data Processing Errors - Errors during merging, encoding, or scaling may introduce unexpected values. Example: Duplicated rows or improper imputation.
Sampling Issues - Non-representative or biased sampling may include rare but valid extremes. Example: Surveying only high-income individuals in a consumer study.
Natural Variation - Some outliers are genuine and reflect real-world extremes. Example: Olympic athletes or billionaires.
Environmental or Contextual Events - Temporary external factors can cause anomalies. Example: Sudden spikes in web traffic during a product launch.
Fraud or Adversarial Behavior - Intentional manipulation of data may produce abnormal patterns. Example: Credit card fraud, fake reviews.
Multivariate Anomalies - A value may be normal by itself but unusual in combination with others. Example: A 10-year-old earning $100,000/year.
Concept Drift or Source Changes - Data collected from different systems, times, or populations may not align. Example: Combining datasets from different regions or years.
Understanding the cause helps decide whether to remove, transform, or retain outliers, ensuring better model accuracy and interpretability.
Consequences of Not Treating Outliers¶
- Model Bias: Predictive models may generalize poorly.
- Incorrect Coefficients: Particularly in regression.
- Misleading Insights: In analytics, outliers might mask or exaggerate trends.
- Clustering Failures: K-means will cluster poorly due to centroid shifts.
- Poor Feature Scaling: Standardization and normalization become ineffective.
Implications of Outliers in Machine Learning Models¶
Outliers can skew the data distribution, bias parameter estimation, and reduce model accuracy. Different models react differently:
Model-wise Explanation¶
| Model | Affected? | Explanation |
| --- | --- | --- |
| Linear Regression | Yes | Linear regression minimizes squared errors: $\min \sum (y_i - \hat{y}_i)^2$. Outliers have large residuals, leading to disproportionately high squared errors. |
| Logistic Regression | Yes | Logistic models use log-odds: $\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 x$. Outliers can mislead the estimation of coefficients. |
| Decision Tree | No | Trees split data by feature thresholds using criteria like Gini or Entropy, unaffected by extreme values. |
| Random Forest | No | Ensemble of decision trees, hence robust to outliers. |
| Gradient Boosting | No | Also based on trees; outliers have little effect unless overfitting occurs. |
| K-Nearest Neighbors (KNN) | Yes | KNN uses distance metrics (e.g., Euclidean): $d = \sqrt{\sum(x_i - x_j)^2}$. Outliers drastically affect neighbor computation. |
| Support Vector Machines (SVM) | Yes | SVM tries to maximize the margin. Outliers may change margin placement and support vectors. |
| Naive Bayes | Yes | Based on probability density estimation; outliers shift the mean and variance in Gaussian NB. |
| K-Means Clustering | Yes | Uses centroid-based distance minimization. Outliers can pull centroids away: $\min \sum \|x_i - \mu_k\|^2$. |
| Hierarchical Clustering | Depends | Sensitive if using single linkage; less so with complete linkage or Ward's method. |
| Non-Negative Matrix Factorization | Yes | Matrix factorization assumes parts-based reconstruction. Outliers disturb convergence due to reconstruction error. |
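To make this contrast concrete, here is a minimal sketch (synthetic data and arbitrary parameter choices, purely for illustration) that fits a linear regression and a decision tree on the same points, with and without a single injected extreme target value; the tree's prediction barely moves, while the linear fit shifts noticeably.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3 * X.ravel() + rng.normal(0, 1, 50)          # clean, roughly linear data

X_out = np.vstack([X, [[10.0]]])                  # duplicate x at the edge...
y_out = np.append(y, 500.0)                       # ...paired with an extreme target value

for name, model in [("LinearRegression", LinearRegression()),
                    ("DecisionTree", DecisionTreeRegressor(max_depth=3, random_state=0))]:
    clean_pred = model.fit(X, y).predict([[9.0]])[0]
    dirty_pred = model.fit(X_out, y_out).predict([[9.0]])[0]
    print(f"{name}: prediction at x=9 -> clean {clean_pred:.1f}, with outlier {dirty_pred:.1f}")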
Outlier Detection Techniques¶
1. Box Plot¶
A box plot is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
Theoretical Foundation:¶
Key Elements:
- Q1 (25th percentile): Lower quartile
- Q3 (75th percentile): Upper quartile
- IQR: $Q3 - Q1$
- Lower Fence: $Q1 - 1.5 \times IQR$ (Lower Bound)
- Upper Fence: $Q3 + 1.5 \times IQR$ (Upper Bound)
The Interquartile Range (IQR) is the range between the 75th percentile (Q3) and the 25th percentile (Q1) of the data:
$$ \text{IQR} = Q3 - Q1 $$
Using the IQR, we define the "whiskers" of the box plot. Any data point outside these bounds or fences is considered an outlier.
Why It Works: IQR is robust and not affected by extreme values, unlike mean/standard deviation.
Interpretation:¶
- Box plots are particularly useful for univariate detection.
- They allow visual comparison across different groups.
- Outliers are shown as individual points beyond the whiskers.
It helps visualize:
- Central tendency
- Spread
- Skewness
- Potential outliers (left/right extremes)
Example:¶
Imagine a dataset of house prices in a city. Most prices range between ₹30 lakhs and ₹1 crore. A few properties priced at ₹20 crores would appear as outliers in a box plot.
2. Violin Plot¶
A violin plot is an advanced version of the box plot that combines the box plot with a kernel density plot.
Theoretical Foundation:¶
The violin plot shows the probability density of the data at different values, revealing the shape of the distribution.
It includes:
- A central box plot (showing quartiles and median)
- A mirrored KDE (kernel density estimate) that looks like a violin

Together these show the distribution shape, central value, and outliers:
- Peaks indicate concentration; tails indicate sparsity.
- Unlike box plots, violin plots reveal bimodal or skewed patterns.
Useful for comparing distributions and spotting outliers within multimodal data.
Usefulness in Outlier Detection:¶
- Highlights multi-modal distributions (multiple peaks)
- Indicates density and spread
- Helps identify not just the extremities but also skewness, clumping, and low-density regions, where outliers may be present
Example:¶
In biological data such as blood pressure readings, a violin plot might show dense clusters at typical values but reveal outliers as long tails on either side.
3. Z-Score Method¶
The Z-score measures how many standard deviations a data point is from the mean.
Mathematical Formula:¶
$$ Z = \frac{X - \mu}{\sigma} $$
Where:
- $X$ is a data point
- $\mu$ is the mean of the dataset
- $\sigma$ is the standard deviation
Interpretation:¶
- If $|Z| > 3$, typically treated as an outlier (assuming normality).
- The threshold can vary depending on the domain (e.g., 2.5, 3, 4).
Assumptions:¶
- The data should follow a normal (Gaussian) distribution.
- Z-score is sensitive to skewed data, so it may not perform well with non-normal data.
- Z-score is influenced by mean and standard deviation, hence sensitive to other outliers.
Practical Scenario:¶
In credit score analysis, a Z-score of +4 for an individual’s debt-to-income ratio might indicate abnormal behavior compared to the population average.
4. Percentile Method¶
A non-parametric method that uses percentiles to trim extremes; it generalizes the IQR method by allowing custom percentiles as the thresholds.
Theoretical Description:¶
- Rather than relying strictly on Q1 and Q3, one may choose other percentiles such as 5th and 95th.
- Data outside these percentile boundaries are considered outliers.
Formula (generalized):¶
$$ \text{Outlier bounds} = [P_{\text{low}}, P_{\text{high}}] $$
When to Use:¶
- Works well when the distribution is non-normal or skewed.
- Offers flexibility in defining what constitutes an outlier.
Example:¶
In sales data, you may want to remove the bottom 1% and top 1% values, which might correspond to one-off errors or rare, non-representative sales.
5. Isolation Forest¶
An ensemble-based algorithm designed specifically for unsupervised anomaly detection.
Theory Behind It:¶
- The algorithm works by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values.
- The process continues recursively until the data point is isolated.
- Outliers are isolated faster, meaning they have shorter path lengths in the tree.
Mathematical Insight: Expected path length $E(h(x))$ is shorter for anomalies. Outlier score:
$$ s(x, n) = 2^{-\frac{E(h(x))}{c(n)}} $$
Where $c(n)$ is the average path length of an unsuccessful search in a binary search tree, used to normalize $h(x)$.
Score:¶
Each data point is assigned an anomaly score. Points with a high anomaly score are considered outliers.
Advantages:¶
- Scalable to large datasets
- Works well with high-dimensional data
- No assumptions about the distribution of data
Example:¶
Used for fraud detection in transactional data, where outliers are unusual spending patterns.
6. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)¶
A clustering algorithm that groups together densely packed points and labels points in low-density regions as outliers.
Theory:¶
Key Concepts:
- ε (epsilon): Radius for neighborhood
- MinPts: Minimum number of points to form a dense region
- Points not belonging to any cluster are labeled as noise, i.e., outliers.
Best for spatial data or nonlinear clusters.
DBSCAN classifies points as:
- Core points: have at least `min_samples` points within an `eps` radius
- Border points: within `eps` of a core point but don't satisfy `min_samples`
- Noise points: not within `eps` of any core point

Noise points are considered outliers.
Use Cases:¶
Ideal for geospatial, market basket, or spatial clustering problems where density defines normality.
Limitation:¶
Sensitive to `eps` and `min_samples`. Performance can degrade if parameter tuning is not done carefully.
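As a concrete illustration, here is a minimal sketch on synthetic 2-D data (the `eps` and `min_samples` values are arbitrary choices for this toy example) that treats DBSCAN's noise label (-1) as the outlier flag.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.5, size=(200, 2))          # one dense cluster
stragglers = np.array([[5.0, 5.0], [-4.0, 6.0], [6.0, -5.0]])    # isolated points
X = np.vstack([cluster, stragglers])

# eps and min_samples are the two parameters discussed above
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
outliers = X[labels == -1]
print(f"{len(outliers)} points labeled as noise (outliers)")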
Outlier Treatment Methods¶
1. Removal¶
The simplest treatment is to remove outliers from the dataset.
Justification:¶
- Appropriate when outliers are due to errors, noise, or irrelevance.
- Especially helpful in linear models where extreme values have disproportionate impact.
However, not advised for small datasets or if outliers carry signal (e.g., fraud detection).
Caveat:¶
- Can lead to information loss, especially if outliers represent rare but important phenomena.
- Always assess the context before dropping data.
Example:¶
A medical device records a heart rate of 750 bpm. Clearly, it's a measurement error and can be dropped.
2. Winsorization¶
Winsorization is a transformation technique that caps extreme values at specified percentile thresholds, limiting the effect of possibly spurious outliers.
How It Works:¶
- Replace the smallest and largest values with values at a specified percentile.
- For example, cap values below the 5th percentile and above the 95th percentile.
Effect:¶
- Preserves the dataset size (unlike removal).
- Reduces skewness and the influence of extreme values while keeping the overall shape of the distribution intact.
Example:¶
In survey income data, values above the 99th percentile can be winsorized to avoid income skew dominating mean-based analysis.
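For reference, SciPy also ships a ready-made routine; the snippet below is a small sketch on a made-up income array, capping the lowest and highest 10% of values (the array and limits are illustrative, and a hand-rolled version appears in the coding section later).

import numpy as np
from scipy.stats.mstats import winsorize

income = np.array([28, 31, 35, 36, 40, 42, 45, 48, 52, 900])   # 900 is an extreme value
capped = winsorize(income, limits=[0.1, 0.1])   # cap the bottom and top 10% at the nearest retained value
print(capped)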
Additional Outlier Detection Techniques¶
1. Mahalanobis Distance¶
A multivariate distance metric that considers correlation between variables.
Formula:¶
$$ D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)} $$
Where:
- $x$ is the vector of values
- $\mu$ is the mean vector
- $\Sigma$ is the covariance matrix
Points with $D_M^2 > \chi^2_{p,\,\alpha}$ (the chi-squared critical value with $p$ degrees of freedom at significance level $\alpha$) are flagged as multivariate outliers.
Use:¶
Useful in multivariate outlier detection where individual variables may not appear abnormal but combinations are. Best for detecting outliers in multivariate Gaussian data.
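A minimal sketch of the formula above on synthetic correlated data (the sample size, covariance, and the 97.5% chi-squared cutoff are illustrative assumptions):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.8], [0.8, 1]], size=300)
X = np.vstack([X, [[3.0, -3.0]]])          # each value is plausible alone, unusual in combination

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d_squared = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # squared Mahalanobis distances

threshold = chi2.ppf(0.975, df=X.shape[1])   # chi-squared critical value, p = 2 features
print(np.where(d_squared > threshold)[0])    # indices flagged as multivariate outliers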
2. Robust Mahalanobis Distance¶
This improves on traditional Mahalanobis by using robust estimators for the mean and covariance.
Why Needed:¶
- Classical Mahalanobis is sensitive to outliers (ironically).
- Robust methods like Minimum Covariance Determinant (MCD) ensure more stable results.
Prevents distortion by existing outliers while computing $\mu$ and $\Sigma$.
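A hedged sketch of the same idea using scikit-learn's `MinCovDet` (MCD) estimator; the synthetic data and the chi-squared cutoff mirror the previous example.

import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.8], [0.8, 1]], size=300)
X = np.vstack([X, [[3.0, -3.0]], [[4.0, -4.0]]])   # a few contaminating points

mcd = MinCovDet(random_state=0).fit(X)     # robust location and covariance via MCD
robust_d_squared = mcd.mahalanobis(X)      # squared distances based on the robust estimates

threshold = chi2.ppf(0.975, df=X.shape[1])
print(np.where(robust_d_squared > threshold)[0])   # flagged multivariate outliers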
3. Algorithm-Based Detection¶
1. K-Means Clustering¶
Outliers are identified based on their distance from cluster centroids.
Method:
- After clustering, calculate distance of each point to its cluster centroid.
- Points with exceptionally high distances are labeled outliers.
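The sketch below applies this recipe with scikit-learn on synthetic data; the two-cluster setup and the 99th-percentile distance cutoff are illustrative choices, not fixed rules.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, size=(150, 2)),
               rng.normal(5, 0.5, size=(150, 2)),
               [[10.0, -10.0]]])                    # one far-away point

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Distance of each point to the centroid of its assigned cluster
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

cutoff = np.quantile(dist, 0.99)           # flag the top 1% largest distances
print(np.where(dist > cutoff)[0])          # candidate outliers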
2. Hierarchical Clustering¶
In dendrograms, outliers may:
- Appear as singleton branches.
- Be merged last into clusters.
These can be visually detected or cut off based on distance thresholds.
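A small sketch with SciPy's hierarchical clustering utilities; the Ward linkage, the cut height, and the synthetic data are assumptions chosen only to make a singleton branch visible.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)), [[8.0, 8.0]]])   # one isolated point

Z = linkage(X, method='ward')                        # build the dendrogram
labels = fcluster(Z, t=5.0, criterion='distance')    # cut it at an illustrative height

# Clusters containing a single point are candidate outliers
unique, counts = np.unique(labels, return_counts=True)
singletons = unique[counts == 1]
print(np.where(np.isin(labels, singletons))[0])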
Robust Models (Outlier-Insensitive)¶
- Decision Trees: Split by condition, not affected by magnitude.
- Random Forests: Ensemble reduces variance.
- Gradient Boosting: Also tree-based; splits depend on feature thresholds rather than magnitudes.
- Ensemble Methods: Aggregation lowers sensitivity.
Relationship Between Loss Functions and Outliers¶
What is a Loss Function?¶
A loss function quantifies the difference between the predicted value ($\hat{y}$) and the actual/true value ($y$). It plays a critical role in training supervised machine learning models by guiding the optimization process (e.g., via gradient descent) to minimize prediction errors.
Outliers, being extreme data points, produce larger errors. Since most loss functions are sensitive to the magnitude of these errors, they can distort model training, especially when the loss function gives higher weight to large deviations. The sensitivity varies depending on the mathematical formulation of the loss function.
Common Loss Functions and Their Sensitivity to Outliers¶
Mean Squared Error (MSE)¶
$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
- Behavior: Squaring error magnifies the effect of large residuals.
- Sensitivity: Very sensitive to outliers.
- Use case: When outliers are minimal and data is normally distributed.
Mean Absolute Error (MAE)¶
$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| $$
- Behavior: Penalizes all deviations linearly.
- Sensitivity: Less sensitive to outliers than MSE.
- Use case: Preferred when outliers exist, and equal penalty is desired.
Huber Loss¶
$$ \text{Huber}(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{for } |y - \hat{y}| \leq \delta \\ \delta \cdot (|y - \hat{y}| - \frac{1}{2}\delta) & \text{otherwise} \end{cases} $$
- Behavior: Quadratic for small errors, linear for large ones.
- Sensitivity: Robust, combines benefits of MSE and MAE.
- Use case: Useful when you want to tolerate small errors but reduce the influence of large outliers.
Quantile Loss (Pinball Loss)¶
$$ L(y, \hat{y}) = \begin{cases} \tau (y - \hat{y}) & \text{if } y \geq \hat{y} \\ (1 - \tau)(\hat{y} - y) & \text{if } y < \hat{y} \end{cases} $$
- Behavior: Focuses on the distribution’s percentiles.
- Sensitivity: Robust to outliers depending on the quantile.
- Use case: When modeling asymmetric or extreme behavior (e.g., forecasting high-risk events).
Log-Cosh Loss¶
$$ L(y, \hat{y}) = \sum \log(\cosh(\hat{y} - y)) $$
- Behavior: Similar to MSE for small errors but less affected by large errors.
- Sensitivity: Lower than MSE; smooth and differentiable everywhere.
- Use case: A smooth alternative to Huber for robust regression.
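To make these sensitivity differences tangible, here is a small sketch that evaluates each loss, written as a direct NumPy translation of the formulas above (with $\delta = 1$ for Huber), on a typical residual and an outlier-sized one.

import numpy as np

def mse(r):
    return r ** 2

def mae(r):
    return abs(r)

def huber(r, delta=1.0):
    # Quadratic inside the delta band, linear outside
    return 0.5 * r ** 2 if abs(r) <= delta else delta * (abs(r) - 0.5 * delta)

def log_cosh(r):
    return float(np.log(np.cosh(r)))

for r in (1.0, 10.0):
    print(f"residual {r:>5}: MSE={mse(r):7.2f}  MAE={mae(r):5.2f}  "
          f"Huber={huber(r):5.2f}  log-cosh={log_cosh(r):5.2f}")

For a residual of 10, the squared penalty is 100, while MAE, Huber, and log-cosh all stay near 10, which is why a single large outlier can dominate MSE-driven training but has a bounded influence on the robust losses.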
Why This Matters¶
When outliers are not addressed, highly sensitive loss functions (like MSE) will lead to models that overfit the outliers and perform poorly on general data. This results in:
- Skewed model coefficients (especially in linear models and distance-based models),
- Reduced generalization ability,
- Poor performance on unseen or clean data.
Let's do some Coding¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("clv_data.csv")
Outlier Detection - Box Plot
sns.boxplot(df['purchases'])
def extract_outliers_from_boxplot(array):
    # First and third quartiles
    iqr_q1 = np.quantile(array, 0.25)
    iqr_q3 = np.quantile(array, 0.75)
    # Interquartile range
    iqr = iqr_q3 - iqr_q1
    # Upper and lower whiskers (fences)
    upper_bound = iqr_q3 + (1.5 * iqr)
    lower_bound = iqr_q1 - (1.5 * iqr)
    # Points beyond the fences are flagged as outliers
    outliers = array[(array <= lower_bound) | (array >= upper_bound)]
    return outliers
print('Outliers within the box plot are :')
extract_outliers_from_boxplot(df['purchases'])
Outliers within the box plot are :
47 5 104 5 142 5 301 5 323 5 485 6 486 5 1026 5 1104 6 1112 5 1120 6 1125 5 1374 5 1504 5 1623 5 1669 6 1670 5 1809 6 1818 5 1836 5 1870 5 2180 5 2463 6 2548 5 2572 5 2605 5 2717 5 2901 5 3032 6 3080 5 3105 5 3162 5 3170 5 3291 5 3298 5 3321 5 3361 5 3380 5 3410 5 3566 5 3603 6 3631 6 3835 5 3848 5 4003 6 4141 5 4334 5 4346 5 4545 5 4597 5 4611 5 4620 5 4662 5 4691 5 4728 5 4751 5 4761 5 4895 5 4958 5 Name: purchases, dtype: int64
Outlier Detection - Violin Plot
plt.violinplot(df['purchases'])
plt.show()
Outlier Detection - Percentile and Z-Score
purchases = df['purchases']
def percentile_outliers(array, lower_bound_perc, upper_bound_perc):
    # Percentile fences computed on the series passed in (not a hard-coded column)
    upper_bound = np.percentile(array, upper_bound_perc)
    lower_bound = np.percentile(array, lower_bound_perc)
    outliers = array[(array <= lower_bound) | (array >= upper_bound)]
    return outliers
def z_score_outliers(array, z_score_lower, z_score_upper):
    # Standardize the series and flag points beyond the supplied z-score thresholds
    z_scores = scipy.stats.zscore(array)
    outliers = (z_scores < z_score_lower) | (z_scores > z_score_upper)
    return array[outliers]
outliers = percentile_outliers(df['purchases'],
upper_bound_perc = 99,
lower_bound_perc = 1)
z_score_outliers(df['purchases'],
z_score_lower = -1.96,
z_score_upper = 1.96)[:10]
28 4 47 5 51 4 67 4 74 4 96 4 104 5 117 4 142 5 147 4 Name: purchases, dtype: int64
Outlier Detection - Isolation Forest
from sklearn.ensemble import IsolationForest
features = ['age','income','days_on_platform','purchases']
## We'll do a simple drop null for now
df = df.dropna()
## Create a training-test set
X = df[features]
X_train = X[:4000]
X_test = X[4000:]
## Fit Model
clf = IsolationForest(n_estimators=50, max_samples=100)
clf.fit(X_train)
## Get Scores
df['scores'] = clf.decision_function(X)
df['anomaly'] = clf.predict(X)
## Get Anomalies
outliers=df.loc[df['anomaly']==-1]
outliers[:10]
index | Unnamed: 0 | id | age | gender | income | days_on_platform | city | purchases | scores | anomaly |
---|---|---|---|---|---|---|---|---|---|---|
9 | 9 | 9 | 49.0 | Female | 76842 | 19.0 | Tokyo | 2 | -0.028500 | -1 |
15 | 15 | 15 | 31.0 | Female | 226249 | 20.0 | Miami | 0 | -0.041933 | -1 |
17 | 17 | 17 | 27.0 | Female | 177582 | 2.0 | London | 0 | -0.025880 | -1 |
18 | 18 | 18 | 10.0 | Female | 260 | 32.0 | San Francisco | 0 | -0.055640 | -1 |
23 | 23 | 23 | 10.0 | Female | 108804 | 5.0 | Tokyo | 2 | -0.018705 | -1 |
25 | 25 | 25 | 46.0 | Female | 112992 | 9.0 | London | 3 | -0.054900 | -1 |
40 | 40 | 40 | 31.0 | Male | 138533 | 20.0 | New York City | 3 | -0.032015 | -1 |
44 | 44 | 44 | 36.0 | Male | 1062 | 28.0 | Tokyo | 2 | -0.006741 | -1 |
47 | 47 | 47 | 34.0 | Male | 9866 | 33.0 | London | 5 | -0.116212 | -1 |
50 | 50 | 50 | 36.0 | Male | 255965 | 22.0 | Tokyo | 1 | -0.065819 | -1 |
Outlier Treatment - Removal method
def z_score_removal(df, column, lower_z_score, upper_z_score):
    # Compute z-scores for the chosen column and drop rows outside the thresholds
    z_scores = scipy.stats.zscore(df[column])
    outliers = (z_scores > upper_z_score) | (z_scores < lower_z_score)
    return df[~outliers]
def percentile_removal(df, column, lower_bound_perc, upper_bound_perc):
    # Compute percentile fences on the chosen column and drop rows outside them
    col_df = df[column]
    upper_bound = np.percentile(col_df, upper_bound_perc)
    lower_bound = np.percentile(col_df, lower_bound_perc)
    outliers = (col_df > upper_bound) | (col_df < lower_bound)
    return df[~outliers]
filtered_df = z_score_removal(df, 'purchases', -1.96, 1.96)
percentile_removal(df, 'purchases', lower_bound_perc = 1, upper_bound_perc = 99)[:10]
index | Unnamed: 0 | id | age | gender | income | days_on_platform | city | purchases | scores | anomaly |
---|---|---|---|---|---|---|---|---|---|---|
3 | 3 | 3 | 29.0 | Male | 43791 | 28.0 | London | 2 | 0.034956 | 1 |
4 | 4 | 4 | 18.0 | Female | 132181 | 26.0 | London | 2 | 0.002514 | 1 |
9 | 9 | 9 | 49.0 | Female | 76842 | 19.0 | Tokyo | 2 | -0.028500 | -1 |
23 | 23 | 23 | 10.0 | Female | 108804 | 5.0 | Tokyo | 2 | -0.018705 | -1 |
25 | 25 | 25 | 46.0 | Female | 112992 | 9.0 | London | 3 | -0.054900 | -1 |
29 | 29 | 29 | 43.0 | Male | 70598 | 15.0 | London | 2 | 0.021325 | 1 |
38 | 38 | 38 | 27.0 | Female | 19003 | 25.0 | San Francisco | 2 | 0.009190 | 1 |
40 | 40 | 40 | 31.0 | Male | 138533 | 20.0 | New York City | 3 | -0.032015 | -1 |
44 | 44 | 44 | 36.0 | Male | 1062 | 28.0 | Tokyo | 2 | -0.006741 | -1 |
47 | 47 | 47 | 34.0 | Male | 9866 | 33.0 | London | 5 | -0.116212 | -1 |
Outlier Treatment - Winsorize
def winsorize(df, column, upper, lower):
    # Percentile caps for the chosen column (note: modifies df in place)
    perc_upper = np.percentile(df[column], upper)
    perc_lower = np.percentile(df[column], lower)
    # Cap values above the upper percentile
    df[column] = np.where(df[column] >= perc_upper,
                          perc_upper,
                          df[column])
    # Floor values below the lower percentile
    df[column] = np.where(df[column] <= perc_lower,
                          perc_lower,
                          df[column])
    return df
winsorize(df, 'purchases', 97.5, 0.025)[:10]
index | Unnamed: 0 | id | age | gender | income | days_on_platform | city | purchases | scores | anomaly |
---|---|---|---|---|---|---|---|---|---|---|
2 | 2 | 2 | 24.0 | Male | 104723 | 34.0 | London | 1.0 | 0.036205 | 1 |
3 | 3 | 3 | 29.0 | Male | 43791 | 28.0 | London | 2.0 | 0.034956 | 1 |
4 | 4 | 4 | 18.0 | Female | 132181 | 26.0 | London | 2.0 | 0.002514 | 1 |
5 | 5 | 5 | 23.0 | Male | 12315 | 14.0 | New York City | 0.0 | 0.030462 | 1 |
8 | 8 | 8 | 46.0 | Male | 129157 | 23.0 | New York City | 0.0 | 0.030737 | 1 |
9 | 9 | 9 | 49.0 | Female | 76842 | 19.0 | Tokyo | 2.0 | -0.028500 | -1 |
12 | 12 | 12 | 12.0 | Male | 130521 | 12.0 | London | 1.0 | 0.022177 | 1 |
15 | 15 | 15 | 31.0 | Female | 226249 | 20.0 | Miami | 0.0 | -0.041933 | -1 |
16 | 16 | 16 | 19.0 | Female | 51434 | 18.0 | New York City | 0.0 | 0.049889 | 1 |
17 | 17 | 17 | 27.0 | Female | 177582 | 2.0 | London | 0.0 | -0.025880 | -1 |