From Noise to Knowledge: Exploring the Magic of DBSCAN, Beyond Traditional Clustering

Chandra Prakash Bathula
16 min read · Jun 29, 2023

DBSCAN, a density-based algorithm: Density-Based Spatial Clustering of Applications with Noise.

Earlier Topics:

So far, we have seen centroid-based clustering algorithms such as K-Means, K-Means++, and K-Medoids.

We have also seen a hierarchy-based clustering scheme, namely "Agglomerative Hierarchical Clustering".

The third popular type of clustering technique is called "Density-Based Clustering", and one of the many density-based algorithms is "DBSCAN".

Theme:

  • The core philosophy of density-based clustering is that, given a dataset D = {xi}, we want to partition the points into "dense regions", which are called "clusters", and "sparse regions", which may contain "noise points".

=> Here, the dense regions, which are the clusters, are separated by sparse regions.

Dense Regions (Clusters) separated by Sparse Regions (Noise).

Now, a question naturally strikes our mind.

What do 'dense' and 'sparse' mean here? And how do we mathematically quantify these terms?

Intuitively, sparse means that there are very few points in a given area, while dense means that there are a lot of points packed into a small area.

=> Let’s see how to mathematically quantify them …!

*** NOTE: "Density-Based Clustering" is a popular clustering scheme, and one of the most important and popular density-based clustering algorithms is "DBSCAN".

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): the name itself says that it is based on density and works in the presence of noise.

If we jump into the concept, a lot of questions start to arise, and one of them is:

How to measure Density? (The Big Question we need to deal with…!)

There are several key terms, concepts, and ideas that we will learn along the way, and all of them are crucial to learning the "DBSCAN" algorithm itself.

In the quest of defining "density", we will pick up all of these concepts, which are very simple.

The first task we are dealing with is: how do we measure density? Let's see how to do that.

Measuring Density : Min Pts & Epsilon (“ε”) :-

There are two hyper-parameters in DBSCAN used to measure density: i) Min Pts and ii) Epsilon (ε).

i) Min Pts:-

First, let us understand what density at a point "P" is. How do we define its density? It can be defined as follows.

  • *=> Density at a point P = the number of points within a given radius around P: a circle in 2D, a sphere in 3D, a hypersphere in higher dimensions, and so on. This radius is called "Epsilon (ε)".
  • So, take a given point "P" and assume a circle of some radius, say 2.0, is drawn around it. If there are 6 other points within that radius, then the density at this point "P" is said to be 7. Density = 7.
P has 7 points within its epsilon (ε) radius, counting itself as a point.
  • That is because, starting from the point "P", in a region (or sphere) of radius eps ("ε" = epsilon) there are a total of 7 points, including the point "P" itself.
  • So, the density at a point is the total number of points within a hypersphere of radius "eps" (ε) around the point "P". Here, 'eps' (ε) is the hyper-parameter we have to provide.

The minimum number of points required to form a dense region is called "Min Pts".
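
To make this concrete, here is a minimal sketch in plain NumPy (the helper name and toy data are just for illustration) that counts the points within an epsilon (ε) radius of a point:

import numpy as np

def density_at(p, X, eps):
    # Density at p: the number of points of X within an eps radius of p
    # (p itself is counted when it is part of X).
    dists = np.linalg.norm(X - p, axis=1)  # Euclidean distance to every point
    return int(np.sum(dists <= eps))

# Toy example: 7 points near the origin plus one far-away point
X = np.array([[0, 0], [0.5, 0.2], [-0.3, 0.4], [0.2, -0.5],
              [0.8, 0.1], [-0.6, -0.2], [0.1, 0.7], [5.0, 5.0]])
print(density_at(X[0], X, eps=1.0))  # -> 7, counting X[0] itself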

ii) Epsilon (“ε”):

Epsilon ("ε") is the maximum distance at which two points are considered neighbors in this context.

The second most important concept is the "dense region".

Dense region:-

  • A circle (or a sphere) with a radius of "eps" ("ε") that contains at least "Min Pts" points is called a "dense region".
  • Example: let's assume eps = 1.2 and Min Pts = 12.

This is how it looks:

Left picture: Dense Region. Right picture: Sparse Region.

This is how we define dense and sparse regions, using the two hyper-parameters "Min Pts" and "Epsilon (ε)".

Together, these parameters say how many points are required within a given radius to form a dense region.

Core Points, Border Points & Noise Points:-

So, in a given dataset D = {xi}, with Min Pts and eps as hyper-parameters, we can categorize each and every point in the dataset as a "core point", a "border point", or a "noise point". Each kind of point is defined as follows:

i) Core Points:

  • A point "P" is said to be a "core point" if "P" has greater than or equal to "Min Pts" points in an "epsilon (ε)" radius around it.
  • |Nε(P)| ≥ Min Pts, where Nε(P) is the set of points within an epsilon (ε) radius of P.
P is the Core Point.
  • A core point always belongs to a dense region. Recall the "dense region" definition: it always has at least "Min Pts" points in its "epsilon (ε)" radius, and because a core point has at least Min Pts points around it, it belongs to a "dense region".
  • => Every core point belongs to a dense region, and that's the core idea of what a "core point" is: the "core" in "core point" means it sits somewhere inside a dense region.

ii) Border Points:-

  • A point "Q" is said to be a border point if "Q" is not a core point and it satisfies the following conditions:
  • a) "Q" has fewer than Min Pts points in its "eps" (ε) radius.
  • b) But "Q" belongs to the neighborhood of another point "P" which is a core point.

So, what do we mean here by "neighborhood"?

The neighborhood here means that the distance between P and Q is less than or equal to Epsilon ("ε").

*** => dist(P, Q) ≤ eps.

  • So, the point "Q" itself is not a core point but lies in the neighborhood of one (this should be remembered), and such a point is called a "border point".
Q is the Border Point.

iii) Noise Points:-

  • Any point which is neither a core point nor a border point is a "noise point".
R is the Noise Point.
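
Putting the three definitions together, here is a brute-force labelling sketch (plain NumPy with O(n²) pairwise distances; a simplified illustration, not scikit-learn's internal implementation):

import numpy as np

def label_points(X, eps, min_pts):
    # Label every point as 'core', 'border', or 'noise'.
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighborhoods = [np.where(dists[i] <= eps)[0] for i in range(n)]  # includes the point itself
    is_core = np.array([len(nb) >= min_pts for nb in neighborhoods])

    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append('core')
        elif is_core[neighborhoods[i]].any():  # within eps of some core point
            labels.append('border')
        else:
            labels.append('noise')
    return labels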

Density Edge & Density Connected Points:-

These two concepts are crucial ideas that we need to understand before we dive deep into "DBSCAN". They are:-

i) Density Edge:-

So, what is meant by a density edge?

From graph theory, we know that an "edge" is nothing but a connection; the term "edge" comes from graph theory.

Edge in Graph Theory.

Remember, whenever we represent things in a graph, we call each of them a "vertex", and if we connect two vertices, that connection is called an "edge".

  • Now, a density edge is defined as follows: if "P" and "Q" are two core points and the distance between "P" and "Q" is less than or equal to "Epsilon (ε)", then the edge connecting them is called a "density edge".
Density Edge of P & Q
  • ** Note: two conditions must be satisfied: i) both points must be core points, and ii) the distance between them must be less than or equal to "Epsilon (ε)". That's what a "density edge" is!

ii) Density Connected Points:-

  • Two points are said to be density connected if they are both core points and they are connected through core points along a path, as in the sketch after this list.

A connection formed by "edges" is called a "path" in graph theory, and the same holds here!

  • If there exists a path between "P" and "Q" in which every edge is a density edge, that path density-connects "P" and "Q".

P, Q: Core points & dist(P, Q) ≤ eps

  • Simply put, two points are called "density connected points" if there exists a path formed by "density edges" connecting them.
Density Connected Points P, Q, R, S & T.
  • In even simpler words: if you can hop from one core point to another core point that is within an eps distance, and in that way reach point "Q" from point "P", then P and Q are said to be "density connected points".
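
Here is a sketch of that hopping idea (again brute-force NumPy; the helper is hypothetical, not part of any library): a breadth-first search over density edges decides whether two core points are density connected.

from collections import deque
import numpy as np

def density_connected(X, p, q, eps, min_pts):
    # True if core points p and q (indices into X) are joined by a path of
    # density edges, i.e. core-to-core hops of length <= eps.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    is_core = (dists <= eps).sum(axis=1) >= min_pts
    if not (is_core[p] and is_core[q]):
        return False

    visited, queue = {p}, deque([p])
    while queue:
        i = queue.popleft()
        if i == q:
            return True
        for j in np.where((dists[i] <= eps) & is_core)[0]:
            if j not in visited:
                visited.add(j)
                queue.append(j)
    return False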

Now, before we jump into the DBSCAN algorithm, let's review what we have covered so far. The terminology includes "Min Pts, Epsilon (ε), core points, border points, noise points, density edges, and density connected points".

Using these concepts, let us dive into how the "DBSCAN" algorithm itself works. The two hyper-parameters and the remaining key concepts combine to build a very noise-resistant and robust clustering algorithm.

DBSCAN Algorithm:

Given all the terms we have learnt so far, let's get to the core of the DBSCAN algorithm and leverage everything we have covered.

Step 1:

  • For each point of the dataset D = {xi}, the first step is to label it as a "core point", "border point", or "noise point". For all xi in D => label whether it is a "core point, border point, or noise point".
  • Of course, when we run the DBSCAN algorithm we are given the two hyper-parameters "Min Pts" and "Epsilon (ε)".
  • We will learn how to determine the right values for these hyper-parameters later.
  • But given these hyper-parameters, for every data point we can decide what kind of point it is and label it.
  • What we do is very simple: given a data point xi, we see how many points lie within its "epsilon (ε)" radius.
Labelling.
  • We then label each point according to its definition, and everything we do here is made possible by something called a "Range Query".

Range Query:

A range query does exactly this: given a point, it returns all the points within an epsilon (ε) radius of it.

  • Given all of this data, we need to be able to issue a range query (xi, D) that returns the set of points Si which lie within an epsilon (ε) radius of xi.

*** Si = rangeQuery(xi, D, eps).

  • So, range queries are very, very important: given a point xi, the dataset D, and epsilon (ε) "eps", the range query should return all the points within the epsilon (ε) radius of xi. Range queries are typically implemented with data structures like a "k-d tree" (also used to speed up k-NN search) or an "R* tree", which make them very efficient.

We can use brute force, but for the sake of efficiency we use k-d-tree-like optimizations.

Basically, DBSCAN was created by researchers in the data mining and database community. In databases, there are specialized tools that people have built to run range queries very efficiently on data points like these. Internally, they might use data structures like a "k-d tree", an "R* tree", or other optimized techniques.
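
As a small illustration of a range query backed by a k-d tree (using scikit-learn's KDTree on hypothetical toy data):

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.rand(100, 2)                     # toy data; in practice, the scaled dataset

eps = 0.1
tree = KDTree(X)                         # build the index once
S_i = tree.query_radius(X[:1], r=eps)    # all points within eps of X[0]
print(S_i[0])                            # indices of the neighbors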

Step 2:

  • Next, in this step we need to remove all the noise points. So far, for every point we have said whether it is a core point, border point, or noise point; now we remove all the noise points from the data.
  • Noise points belong to sparse regions, which directly means they do not belong to any cluster, and hence we need to remove them.
Removing Noise Points.

*** Core Step is Step 3:

  • For each core point "P" that is not yet assigned to a cluster, we perform the following iterative loop:

a) Create a new cluster with "P".

b) Add all the points that are density connected to "P" into this new cluster.

  • If there are more unassigned core points left, we keep iterating.
  • In the next iteration we start a new cluster from another unassigned core point, and the loop continues.
  • The loop terminates when every core point has been assigned to a cluster.
DBSCAN Algorithm Iterative Loop.

Now that we have removed the noise points and resolved all the core points, what happens with the border points?

Step 4:

  • In this step, we take each "border point" and assign it to the cluster of its nearest "core point".
T is assigned to the Core Point P's Cluster.
  • So, that's how we handle the border points.

This is how we handled all the points in a step-by-step process:

i) First, we labelled core points, border points, and noise points.

ii) Then we removed all the noise points.

iii) For each core point, we find its density connected points, and to do that we again have to perform range querying, because range queries reveal the points within an eps radius.

iv) So, to identify the points density connected to "P", we need to perform range querying.

v) The core operation (or computation) in all of this, in the first step and again in step 3 b), is "range querying"; every other computation is mostly trivial.

vi) A range query takes roughly "O(log n)" per query when implemented with an optimized index, where n is the size of the dataset.
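
Tying the four steps together, here is a compact from-scratch sketch of the whole algorithm (brute-force neighborhoods for clarity, where a k-d tree would be used in practice; border points here join the first cluster that reaches them, a common simplification of "nearest core point"):

import numpy as np
from collections import deque

def dbscan_sketch(X, eps, min_pts):
    # Returns a cluster label per point; -1 means noise.
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighborhoods = [np.where(dists[i] <= eps)[0] for i in range(n)]

    # Step 1: label the core points
    is_core = np.array([len(nb) >= min_pts for nb in neighborhoods])

    labels = np.full(n, -1)   # Step 2: every point starts out as noise (-1)
    cluster_id = 0
    for p in range(n):
        if not is_core[p] or labels[p] != -1:
            continue
        # Step 3: grow a new cluster from core point p via density connections
        labels[p] = cluster_id
        queue = deque([p])
        while queue:
            i = queue.popleft()
            for j in neighborhoods[i]:
                if labels[j] == -1:
                    labels[j] = cluster_id  # Step 4: border points join as well
                    if is_core[j]:
                        queue.append(j)     # only core points keep expanding
        cluster_id += 1
    return labels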

Sample Code and Implementation:-

This is a sample implementation of the DBSCAN algorithm on the well-known Mall Customer Segmentation dataset from Kaggle.

# Runs the algorithm and prints the cluster labels.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Load the dataset
data = pd.read_csv('Mall_Customers.csv')

# Select features for clustering
features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
X = data[features]

# Preprocess the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform DBSCAN clustering
dbscan = DBSCAN(eps=0.4, min_samples=5, metric='euclidean')
labels = dbscan.fit_predict(X_scaled)

# Print cluster labels
print("Cluster labels:")
print(labels)
Cluster Labels

2D & 3D Visualizations of the DBSCAN Clusters.

# Visualize the clusters in 2D (axes show standardized values)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels)
plt.xlabel('Age (scaled)')
plt.ylabel('Annual Income (scaled)')
plt.title('Mall Customer Segmentation (2D)')
plt.show()

# Visualize the clusters in 3D (axes show standardized values)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_scaled[:, 0], X_scaled[:, 1], X_scaled[:, 2], c=labels)
ax.set_xlabel('Age (scaled)')
ax.set_ylabel('Annual Income (scaled)')
ax.set_zlabel('Spending Score (scaled)')
ax.set_title('Mall Customer Segmentation (3D)')
plt.show()
Mall Cluster Segmentation : 2D.
Mall Cluster Segmentation with 3D Visualization.

And after removing the noise points, we can see the clusters clearly, well separated.

# Identify and remove noise points
core_samples_mask = np.zeros_like(dbscan.labels_, dtype=bool)
core_samples_mask[dbscan.core_sample_indices_] = True  # marks core points (unused below, handy for styling)
noise_mask = labels == -1
cleaned_labels = labels[~noise_mask]

# Visualize the clusters after removing noise points (standardized values)
plt.scatter(X_scaled[~noise_mask, 0], X_scaled[~noise_mask, 1], c=cleaned_labels)
plt.xlabel('Age (scaled)')
plt.ylabel('Annual Income (scaled)')
plt.title('Mall Customer Segmentation without Noise Points')
plt.show()
Ideal Cluster: Mall Clustering Segmentation without Noise Points.

This is just a sample code implementation, without any EDA, feature-importance analysis, or data engineering.

Hyper-parameters: Min Pts & Eps

So, the two hyper-parameters we have in DBSCAN are Min Pts and eps.

Now, let us see how to determine them.

1. Min Pts:

Typically, Min Pts helps us do one of the most important things: remove outliers, in this case noise points.

i) So, the rule of thumb is that we should always have Min Pts ≥ d + 1, where "d" is the dimensionality of the space each xi belongs to.

Typically, a good rule of thumb is to have Min Pts ≈ 2 * d, roughly twice the dimensionality of the data.

ii) More often, we tend to choose a larger value of Min Pts if the dataset is noisier. At a given point "P", if the neighborhood within epsilon (ε) distance of "P" contains fewer than Min Pts points and "P" is not a border point, we call it a noise point. This is good logic because it lets the noisy points be removed; choosing a larger Min Pts therefore removes more noisy points.

*** Larger Min Pts if we know the dataset is more noisy. ***

iii) Consulting a domain expert is typically how we determine the hyper-parameter Min Pts.

  • Let's take a look at how that works: for a given point "P" with 12 points within its epsilon (ε) radius, the expert might say that at least 10 points are required for "P" not to be a noise point. This is problem-specific; Min Pts may be 5, 10, 15, or 100 depending on the domain.
  • Only an expert who knows the domain can say that a point "P" with 10 points within a radius of epsilon (ε) "eps" is not a noise point. In such a case, we choose the parameter to be 10. If the expert says we need at least 80 points in the neighborhood to be very sure, then we choose Min Pts to be 80.

So, the first task is a rough approximation: the rule of thumb is to choose Min Pts roughly equal to twice the dimensionality of the data.

And larger Min Pts if we know that the dataset contains too much noise.

And finally, we can consult a domain expert and ask how many points we should expect in the neighborhood of a given point "P" before declaring it a noise point or not.
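
As a quick sketch of these rules of thumb (reusing X_scaled from the sample code above; the exact heuristic is a starting assumption, not a universal rule):

# Rule-of-thumb starting point: Min Pts >= d + 1, commonly ~ 2 * d
d = X_scaled.shape[1]            # dimensionality of the feature matrix
min_samples = max(2 * d, d + 1)  # -> 6 for the 3 features used above
print(min_samples)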

2. Epsilon (ε):

  • Given the Min Pts hyper-parameter, our second task is to determine eps (ε).
  • Let's assume we have Min Pts = 7. Given this, how can we determine the right epsilon (ε) "eps"?
Computing di for every point.

Step 1: For each point xi, we compute something called di.

  • di => the distance from the point xi to its k-th nearest point (or neighbor), where k is the Min Pts value.
  • xi -> di: it is the distance from the point xi to its 7th nearest point, and this 7 comes from the given Min Pts value. So, if the Min Pts value were 10, we would compute the distance from xi to its 10th nearest neighbor.

Step 2: Now, we will have something as shown here.

We sort the di's in increasing order and plot them, recalling the "Elbow/Knee Method" that we discussed in "K-Means Clustering". We typically use this method for hyper-parameter tuning; see the sketch after this list.

Elbow / Knee Method for determining the right Epsilon ("ε") value.
  • The first thing to notice is that a high "di" suggests a noise point. For example, if Min Pts = 4 and a point's "di" is high, the chance of that point being a noise point is also high.
  • In simple words, if a point has a high "di", then the chance (or probability) that xi is noisy is also high.
  • After the inflection point there is a sudden increase in the "di" values, and the probability of those points being noise is also high. The value of "di" at the inflection (knee) point is what we should take as "epsilon (ε)".
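
Here is a minimal k-distance-plot sketch (assuming X_scaled from the sample code above and Min Pts = 7 from the running example):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

min_pts = 7
# n_neighbors = min_pts + 1 because the query set equals the fit set,
# so each point's nearest neighbor is itself at distance 0
nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
d_i = np.sort(distances[:, -1])        # distance to the 7th nearest neighbor

plt.plot(d_i)
plt.xlabel('Points sorted by d_i')
plt.ylabel('Distance to 7th nearest neighbor')
plt.title('Elbow/Knee plot for choosing eps')
plt.show()                             # read eps off the knee of the curve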

Advantages & Limitations Of DBSCAN:

Advantages:

  • It is highly resistant to noise in the data; it is robust and can handle outliers, which is a great advantage of the algorithm.
  • It can take care of clusters with different shapes and sizes.
  • In DBSCAN, there is no need to declare the number of clusters in advance.

Limitations:

  • DBSCAN is highly sensitive to its hyper-parameters: a slight change in the hyper-parameter "eps" can lead to completely different clusters, as the comparison below illustrates.
Eps (ε) = 0.4 & Min Pts = 6 vs. Eps (ε) = 0.5 & Min Pts = 6.
  • High-dimensional data runs into the curse of dimensionality. If Euclidean distance is used on high-dimensional data, it can sometimes lead to disasters. So, DBSCAN is not well suited for high-dimensional data like text data.
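
To see this sensitivity concretely (reusing X_scaled from the sample code; the eps values follow the caption above):

import numpy as np
from sklearn.cluster import DBSCAN

for eps in (0.4, 0.5):
    labels = DBSCAN(eps=eps, min_samples=6).fit_predict(X_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # ignore the noise label
    n_noise = int(np.sum(labels == -1))
    print(f'eps={eps}: {n_clusters} clusters, {n_noise} noise points')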

Time & Space Complexity:

The average time and space complexities for DBSCAN are as follows:

Time Complexity:

  • The average time complexity of DBSCAN is roughly "O(n log n)".
  • That's because a "range query" (or "region query") typically takes "log n" time with optimized implementations such as a "k-d tree" or an "R* tree", and we issue one per point, where "n" is the total number of data points.

Time Complexity ≈ O(n log n).

This is fairly decent and far better than hierarchical clustering, which is O(n²).

Space Complexity:

  • It is still "O(n)".
  • Assuming the dimensionality "d" is small (d << n), far less than the number of data points, we simplify "O(n d)" to "O(n)". Besides, we don't recommend using the DBSCAN algorithm on high-dimensional data anyway.

So, the time and space complexity are very healthy and better than those of "Hierarchical Clustering". The only problem is that the algorithm is extremely sensitive to its hyper-parameters.

Interpretability:-

  • DBSCAN provides interpretable cluster results by identifying density and connectivity, and it can also identify "noise points".

Typically, what happens with the DBSCAN algorithm in the real world is that, since it was designed by the database and data mining community, it works especially well when the data is stored in traditional databases such as Oracle, MySQL, etc., where range (or region) queries are highly optimized.
