Ever Wondered How Similar Patterns Are Identified? A Complete Guide to K-Means, K-Means++, K-Medoids & PAM Clustering.

Chandra Prakash Bathula
17 min read · Jun 19, 2023


Have you ever wondered how internet companies identify similar customer behaviors and offer them personalized discounts based on their purchasing power and preferences? Take Amazon or any other retail product seller application, or any tech giant with a large user base.

Let’s imagine we have a task at one of these companies to identify groups of people who exhibit similar behavior and uncover patterns. How would we tackle this challenge?

To address such tasks and uncover behavioral patterns, we turn to a powerful technique in Machine Learning called Clustering. Originally used in Data Mining, clustering can also serve as a crucial preprocessing step in various Machine Learning algorithms.

Now, let’s dive into the world of Clustering:

Clustering is the process of grouping similar data points based on their inherent similarities, such as customer preferences, purchase history, or browsing habits. By applying clustering algorithms, distinct clusters or groups can be automatically identified within a dataset. Each cluster represents individuals who share common characteristics and behaviors, empowering companies to gain insights into their customer base and tailor strategies accordingly.

Clustering uncovers hidden patterns, allowing businesses to understand preferences, target specific demographics, and offer personalized recommendations. It enables companies to streamline marketing efforts, enhance customer satisfaction, and drive revenue growth by segmenting their customer base and catering to unique needs and preferences.

Through clustering, companies can harness the power of data to better understand customers and create personalized experiences. Behind the scenes, clustering algorithms work diligently, analyzing behavior and identifying patterns to provide an enhanced shopping experience. By leveraging clustering, businesses unlock the potential of data, enabling them to deliver tailored strategies that keep customers engaged and coming back for more.

Identifying Similar Patterns Among People

Among clustering algorithms, K-Means, Hierarchical Clustering, and DBSCAN are the most widely used.

K-Means Clustering:

Now, let’s delve into the K-Means algorithm, which is known for its simplicity and efficiency. In K-Means clustering, the parameter K represents the number of clusters; it is a hyperparameter that needs to be determined. The optimal value for K can be found using ideas like cross-validation (CV) or the elbow method discussed later.

K = 3 ; 3 Clusters. K = No of clusters.

Geometrically,

K-Means clustering involves assigning a centroid to each group of points. Assuming we have a dataset D and K clusters, we will have K centroids, one for each cluster.

Centroids: (C1, C2, C3, …, Ck).

Based on these centroids, we assign points to their corresponding sets

Sets: (S1, S2, S3, …, Sk).

For instance, S1 contains points based on the centroid C1, S2 contains points based on C2, and so on.

The dataset D is the union of all these sets, and the intersection between any two distinct sets is empty:

D = S1 ∪ S2 ∪ … ∪ Sk

Si ∩ Sj = ∅ for all i ≠ j

Centroid Equation ~ Mean Point.
Clusters; Cluster Centroids Ck’s & corresponding sets Sk’s.

Mathematically,

This implies that every point belongs to exactly one set, and no two sets share a point. Consequently, when we refer to K clusters, we are referring to K centroids, each associated with a set of points. The centroid of a cluster is the central, or mean, point of its corresponding set, calculated as the mean of all the points in that set:

Ci = (1/|Si|) * Σ (Xj ∈ Si) Xj
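
As a quick sanity check of this formula, here is a minimal NumPy sketch; the points and their membership in the set Si are made up purely for illustration.

import numpy as np

# Hypothetical points assumed to belong to one set Si
Si = np.array([[1.0, 2.0],
               [3.0, 4.0],
               [5.0, 0.0]])

# Centroid Ci = mean of all points in Si, computed feature-wise
Ci = Si.mean(axis=0)
print(Ci)  # [3. 2.]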

In K-Means clustering, the objective is to find K-centroids and their corresponding sets of points. The clustering process is based on assigning each point to the cluster whose centroid is closest to it. Ultimately, the goal is to identify K-centroids and their corresponding sets, enabling us to analyze the data based on these clusters.

Cluster Centroids and their intra-cluster distances.

So the core idea is to find K centroids and then the corresponding K sets of points based on these centroids.

But now comes the big problem: how can we identify the K centroids? The centroid plays a crucial role in clustering, as it determines the corresponding sets of points.

The mathematical formulation of K-means involves finding K-centroids:

K-Centroids = C1, C2, C3, …, Ck

and their corresponding sets of points

K-Sets = S1, S2, S3, …, Sk.

It is important to note that each Xi belongs to exactly one set Sj, and the intersection of any two distinct sets is empty.

The objective function of K-Means is to minimize the sum of squared distances between each point Xp and the centroid Ck of the cluster it is assigned to. Mathematically, it can be represented as:

argmin over {C1, …, Ck} of Σ (k = 1,…,K) Σ (Xp ∈ Sk) ||Xp − Ck||²

This objective function aims to find the K-centroids that minimize the distance between each point and its assigned centroid within the cluster. In other words, it seeks to minimize the intra-cluster distances.

The goal is to minimize all intra-cluster distances, reflecting the intuitive idea that points within each cluster should be close to their respective centroid. The focus in this phase of k-means is solely on minimizing the intra-cluster distances. Once we obtain the K-centroids Ck, we can determine the corresponding sets Sk based on the notion of proximity.

Since the above objective function poses a complex computational problem (known as an NP-hard problem), we rely on approximation algorithms to solve it. One such optimization algorithm widely used is Lloyd’s algorithm, which involves a combination of mathematical techniques and clever strategies.

By employing these mathematical formulations and optimization algorithms, K-means clustering enables us to identify the optimal K-centroids, group similar data points, and uncover valuable insights from the data.
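
To make the objective concrete, here is a small sketch that computes the intra-cluster sum of squared distances for a given set of points, centroids, and assignments; all values are illustrative.

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])  # data points
C = np.array([[1.25, 1.5], [8.5, 8.75]])                        # two centroids
assign = np.array([0, 0, 1, 1])                                  # cluster index of each point

# K-Means objective: total squared distance of every point to its assigned centroid
loss = sum(np.sum((X[assign == k] - C[k]) ** 2) for k in range(len(C)))
print(loss)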

Optimal Clustering:

An optimal clustering is one with small intra-cluster distances and large inter-cluster distances.

Picture Reference : https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf

K-Means: Lloyd’s Algorithm:

Lloyd’s algorithm is remarkably simple and intuitive, providing a good approximation to the exact solution while remaining straightforward to implement.

Let’s delve into Lloyd’s algorithm and explore how it tackles the challenge of finding centroids for K-means clustering.

Step 1: Initialization:
In this step, we randomly select “K” points from the dataset “D” as the initial K-centroids for clustering. This random initialization sets the foundation for subsequent steps.

Step 2: Assignment:
For each point ‘Xi’ in the dataset ‘D,’ we identify the nearest centroid “Cj” and assign ‘Xi’ to the corresponding set ‘Sj.’ This assignment is determined by evaluating the distances between ‘Xi’ and all the centroids ‘Cj.’

dist(Xi, Cj) for all j = 1, 2, 3, …, k.
Add ‘Xi’ to the set ‘Sj’ associated with the centroid ‘Cj.’
Upon completion of this stage, every point is assigned to its respective set ‘Sj.’

Step 3: Re-compute centroid stage/ Update stage:
In this step, we recalculate the centroids using the following formula:
Cj = (1/|Sj|) * Σ (Xi ∈ Sj) Xi
This formula computes the mean (average) point of each set ‘Sj.’

As a result, new centroids are obtained after each iteration.

Step 4: Termination:
To achieve convergence, we repeat steps 2 and 3 until little or no change is observed in the centroids. This iterative process is crucial for reaching a stable solution.

Visual representation of Steps in K-Means.

Convergence occurs when the differences between the old centroids (C1, C2, C3, …, Ck) and the new centroids (C1', C2', C3', …, Ck’) are sufficiently small. At this point, we can consider the new centroids as the final centroids. Ultimately, we obtain K centroids and K corresponding sets at the conclusion of this stage.

Old centroids (C1, C2, C3, …, Ck)

New centroids (C1', C2', C3', …, Ck’)

Differences = [(C1' − C1), (C2' − C2), (C3' − C3), …, (Ck' − Ck)]

Lloyd’s algorithm follows a straightforward four-step process: Initialization, Assignment, Update (or Re-compute), and Repeat and Terminate. By using this approximation algorithm, we sidestep the computational complexity of the NP-hard K-Means objective.

Assigning point to nearest centroid set.
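
To tie the four steps together, here is a minimal from-scratch sketch of Lloyd’s algorithm in NumPy (random initialization, assignment, centroid update, and a convergence check). It is illustrative only, not an optimized library implementation.

import numpy as np

def lloyd_kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)

    # Step 1: Initialization - pick k random points from the dataset as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # Step 2: Assignment - each point joins the set of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Step 3: Update - recompute each centroid as the mean of its set
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 4: Terminate when the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids

    return centroids, labels

# Tiny illustrative run
X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9.5], [0.5, 1.2], [8.5, 9]], dtype=float)
centroids, labels = lloyd_kmeans(X, k=2)
print(centroids, labels)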

Is that all? Will Lloyd’s algorithm solve all the issues?

The answer is no! While Lloyd’s algorithm offers a solution for the K-Means clustering problem, it does not address all the challenges associated with initialization. One significant concern is the problem of “Initialization Sensitivity.”

  • Initialization sensitivity refers to the fact that the final clusters and centroids obtained from K-Means are dependent on how we randomly select the initial K points during initialization. This means that the resulting clusters can vary significantly based on the chosen initialization, making the random initialization approach problematic.
  • To mitigate the challenges posed by random initialization, we need to adopt strategies to improve the selection of initial centroids. One approach is to perform multiple runs of K-Means with different initializations and choose the clustering with the smallest intra-cluster distances and the largest inter-cluster distances, based on visual inspection. This repeated execution helps reduce the impact of a single random initialization on the final clustering result.
Appropriate & Inappropriate Initializations. Picture Reference : https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf
  • Another method that addresses initialization issues is K-Means++. Unlike random initialization, K-Means++ selects initial centroids by considering the distance between points. It chooses the initial centroids to be far apart from each other, leading to more robust and stable clusters.

K-Means++:

K-Means++ is an advanced algorithm that improves upon the random initialization step in the traditional K-Means clustering method, offering better cluster initialization and mitigating the impact of outliers. While the overall framework of K-Means remains the same, the key difference lies in the initialization process.

In K-Means++, instead of randomly selecting all the initial centroids, a smarter initialization scheme is used. The algorithm starts by randomly choosing one centroid, C1, from the dataset D. For each data point Xi in D, its distance to C1 is calculated, giving a distribution over the points. The next centroid, C2, is then selected from the remaining data points in D, excluding C1, with a probability proportional to the squared distance, di, from the point to the previously chosen centroid, C1.

Mathematically, the distance, d, between a data point Xi and centroid C1 is calculated using the squared Euclidean distance formula:

d = ||Xi − C1||²

  • > The probability of selecting a point as the next centroid is determined by how far it is from C1: points that are farther away have a higher chance of being chosen. This probabilistic approach spreads the centroids far apart from those already selected, without guaranteeing that the single farthest point (often an outlier) is always picked, which reduces the impact of outliers.
Assigning the Farthest point as the centroid from the prior centroid.

-> The process continues iteratively, with subsequent centroids chosen in a similar manner until the desired number of clusters is obtained. By initializing centroids strategically, K-Means++ addresses the limitations of the classic K-Means and Lloyd’s algorithm. It provides more robust and accurate clustering results, particularly in scenarios with potential outliers.

  • > K-Means++ is a highly regarded and modern algorithm, offering significant improvements over its predecessors. Its intelligent initialization scheme enhances the clustering process, leading to more reliable and meaningful cluster assignments. By leveraging K-Means++, data analysts and researchers can extract valuable insights and patterns from their datasets, enabling better decision-making and optimization in various fields such as customer segmentation, image recognition, and anomaly detection.
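
To make the initialization scheme concrete, here is a minimal sketch of K-Means++ seeding: the first centroid is picked uniformly at random, and each subsequent centroid is drawn with probability proportional to the squared distance to the nearest centroid already chosen. This is illustrative code, not the scikit-learn implementation.

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)

    # Choose the first centroid C1 uniformly at random from the dataset
    centroids = [X[rng.integers(n)]]

    for _ in range(1, k):
        # Squared distance from every point to its nearest already-chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Probability of selection is proportional to the squared distance
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])

    return np.array(centroids)

X = np.array([[1, 1], [1.2, 0.8], [8, 8], [8.5, 9], [0.9, 1.1]], dtype=float)
print(kmeans_pp_init(X, k=2))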

The question may arise as to why we adopt a probabilistic approach instead of a deterministic one in the initialization step of K-Means++.

  • The probabilistic strategy in K-Means++ reduces the chance of picking outliers as centroids. A deterministic rule that always selects the farthest point would be guaranteed to pick extreme outliers, which can significantly disrupt the clusters. With probabilistic selection, a far-away point is more likely to be chosen than a nearby one, but no single extreme point is certain to be picked, which effectively softens the influence of outliers.
  • While the probabilistic approach helps mitigate the impact of outliers, outliers can still present challenges in clustering. However, K-Means++ surpasses traditional algorithms like Lloyd’s and classical K-Means by providing an exceptional and more recent solution. Its sophisticated initialization scheme greatly improves clustering performance and overcomes the limitations of older algorithms.
  • K-Means++ excels in delivering robust and accurate clustering results by intelligently selecting initial centroids. Its applications span various domains, including customer segmentation, image recognition, and anomaly detection. By effectively handling outliers and enhancing cluster quality, K-Means++ becomes the preferred choice for many data analysis tasks, enabling improved decision-making and actionable insights.

Limitations of K-Means:

-> K-Means has the following problems or limitations.

-> K-Means clustering is affected, and may fail to attain optimal results, when clusters differ in:

  • Sizes of the clusters: K-Means focuses primarily on intra-cluster distances, so clusters of very different sizes tend to be split or merged incorrectly.
Picture Reference: https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf
  • Densities of the clusters: K-Means implicitly tries to allocate clusters of the same size and density, so clusters with very different densities tend to get stretched or broken apart.
Picture Reference : https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf
  • Non-convex or non-globular shapes: a set is convex when the line segment between any two of its points lies entirely inside the set; K-Means struggles with clusters that are not convex or globular.
Picture Reference: https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf

And the most common problem is that K-Means can be strongly affected by outliers.

Overcoming K-Means limitations:

One way to deal with the above limitations is to increase the number of clusters and then recombine similar clusters. Even though this may not attain the optimal solution, it avoids mixing points from two natural groups into a single cluster.

K-Means is not a perfect solution but is a good approximation.

Clustering quality depends on intra-cluster and inter-cluster distances because we don’t have ground-truth labels (yi’s), which also makes clustering hard to evaluate.

After all these computations and variations can we interpret the results?

Yes, but not easily with classical K-Means (or K-Means++), because all we get at the end are centroids. Originally, we are given the dataset D = {X1, X2, …, Xn}.

At the end we will only have the centroid points, and centroids are hard to interpret.

Example: in the case of text reviews, if we use bag-of-words (BoW) or any other such technique, the centroid we get is just a geometric point; after removing stop words and so on, most of the word dimensions will have a value of 0, and the centroid does not correspond to any actual review.

Words and their geometric values.

Big Idea:

In this case, instead of being given some arbitrary centroid point, if we are given an actual point Xi from the dataset D as the cluster center, we can read that particular review, see whether it is positive or negative, and understand which group the cluster represents. This is much more interpretable.

This lack of interpretability is the drawback of K-Means & K-Means++.

So the big idea now is: what if, instead of a computed centroid, each cluster center is a data point in D?

=> Ci = Xj ∈ D.

Now the clustering is no longer based on mean centroids, and this variant is referred to as “K-Medoids”.

Because we no longer use the mean point as the center but an actual data point, this is called K-Medoids. It is a very popular algorithm, especially when we have to (or want to) interpret or read the cluster centers and make sense of what they are.

One of the most popular algorithms for K-Medoids is called “Partitioning Around Medoids” (PAM).

Just as Lloyd’s is an algorithm for K-Means, PAM is the algorithm for K-Medoids.

It is very similar to K-Means, with subtle differences.

  • Initialization: we initialize using the same idea as in K-Means++. Using the probabilistic method, we pick K points from D as our initial medoids.
  • Assignment: using the same proximity idea as in K-Means (Xi belongs to Sj), we assign each point to its closest medoid.
  • Update/Recompute: this phase is different. We swap each medoid with a non-medoid point and compute the loss; if the loss is reduced by the swap, we keep that medoid, and we repeat until no swap gives a lower loss.

In the case of two clusters:

Cluster 1: X1, X2, X3, X4, X5 with medoid M1 = X1;

Cluster 2: X6, X7, X8, X9, X10 with medoid M2 = X6.

M1 = X1 & M2 = X6 | Loss = L1

In every iteration, swap the medoids with the non-medoid points and compute the loss: swap X1 with X2, X3, X4 and X5 while keeping M2 as is, then do the same for M2, trying all possible swaps.

And we need to find the minimal loss.

Loss = Σ ||Xi − Mj||², the sum of squared distances from each point to its medoid; this is also the intra-cluster distance.

  • Once we swap, we need to re-do the assignment. A swap is successful when it gives a lower loss value.

=> The key difference is that instead of computing an average as in K-Means, we update the centers by swapping medoids.

This algorithm is called Partitioning Around Medoids (PAM).

The only mathematical quantity we need is ||Xi − Mj||².
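
Here is a rough from-scratch sketch of the PAM swap idea under this loss: assign each point to its nearest medoid, then greedily try swapping a medoid with a non-medoid point and keep the swap only when it lowers the total loss. It is for illustration only, not an optimized implementation.

import numpy as np

def pam(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise squared distances; a precomputed distance/similarity matrix could be used instead
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)

    def total_loss(medoid_idx):
        # Each point contributes its squared distance to the nearest medoid
        return np.sum(np.min(D[:, medoid_idx], axis=1))

    # Initialization: pick k distinct points as the starting medoids
    medoids = rng.choice(n, size=k, replace=False)
    best = total_loss(medoids)

    for _ in range(max_iter):
        improved = False
        for m_pos in range(k):                  # each current medoid
            for cand in range(n):               # each non-medoid candidate
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[m_pos] = cand             # swap medoid with candidate
                loss = total_loss(trial)
                if loss < best:                 # keep the swap only if the loss drops
                    medoids, best = trial, loss
                    improved = True
        if not improved:
            break

    labels = np.argmin(D[:, medoids], axis=1)   # re-do the assignment
    return X[medoids], labels, best

X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9.5], [0.5, 1.2], [8.5, 9]], dtype=float)
medoids, labels, loss = pam(X, k=2)
print(medoids, labels, loss)

Notice that the loss only ever reads pairwise distances D[i, j], which is exactly why a distance or similarity matrix can be dropped in directly, as described next.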

=> Since this is a squared distance between pairs of points, it can also be expressed through a kernel matrix. If we are given a kernel matrix or similarity values, we can simply replace the squared-distance computation and substitute the values provided.

=> The massive advantage of K-Medoids is that it makes clustering simpler to interpret, and it additionally allows kernelization: whenever someone gives us a similarity or distance matrix, we just replace

||Xi − Mj||² => distance(Xi, Mj), where Mj belongs to the dataset D.

Standard K-Means is not easy to kernelize.
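
As a small illustration of that substitution: given a kernel matrix K, the squared distance can be recovered purely from kernel entries as ||Xi − Mj||² = K(Xi, Xi) − 2·K(Xi, Mj) + K(Mj, Mj), so PAM can run on kernel values alone. The linear kernel below is made up for the example.

import numpy as np

X = np.array([[1.0, 2.0], [3.0, 0.0], [0.0, 1.0]])
K = X @ X.T                                    # linear kernel matrix K[i, j] = <Xi, Xj>

i, j = 0, 1
# Squared distance recovered purely from kernel entries
d2_from_kernel = K[i, i] - 2 * K[i, j] + K[j, j]
d2_direct = np.sum((X[i] - X[j]) ** 2)
print(d2_from_kernel, d2_direct)               # both print 8.0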

Determining the right K:

1. One way of determining the right K is to use domain knowledge.

2. If we can’t determine the right K that way, we can use the elbow method (or knee method).

In line with K-Means, the elbow method determines the right “K” by looking at the rate of change, or steepness, of the loss curve at a particular point, since we know that increasing “K” eventually drives the loss to zero as every point becomes its own cluster.

Elbow Method:

The Elbow or Knee method is a popular technique used in K-means clustering to determine the optimal number of clusters, denoted as K, for a given dataset. The method involves plotting the within-cluster sum of squares (WCSS) against the number of clusters, and identifying the point where adding more clusters does not significantly reduce the WCSS.

To find the optimal K using the Elbow method, the following steps can be followed:

1. Compute the K-means clustering algorithm for different values of K, ranging from 1 to a predefined maximum number of clusters.
2. For each K value, calculate the WCSS, which is the sum of the squared distances between each data point and its corresponding centroid within the cluster.
3. Plot the K values on the x-axis and the corresponding WCSS values on the y-axis.
4. Examine the plot and look for the “elbow” or “knee” point, where the reduction in WCSS begins to level off significantly.
5. The K value at the elbow or knee point is considered the optimal number of clusters for the given dataset.

The Elbow or Knee method does not have a specific mathematical equation. However, the WCSS calculation used in this method can be expressed as follows:

WCSS = Σ (k = 1,…,K) Σ (x ∈ Sk) ||x − ck||²

where:
- > x represents a data point in the dataset.
- > ck represents the centroid of the cluster Sk to which the data point belongs.

By applying the Elbow or Knee method, data analysts can make an informed decision about the appropriate number of clusters for their specific dataset, balancing the trade-off between model complexity and the level of within-cluster variation.

Elbow Method (or) Knee Method.
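
The following short sketch plots the elbow curve with scikit-learn's KMeans, whose inertia_ attribute is exactly the WCSS defined above; the blob dataset is synthetic and used only for illustration.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 underlying groups, purely for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)   # inertia_ is the within-cluster sum of squares (WCSS)

plt.plot(list(ks), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()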

Time complexity & Space complexity:

Time complexity:

In K-Means, as there is no separate train and test phase,

=> we have a running time of the order O(n * k * d * i), where:

n = total number of points;

k = number of clusters;

d = number of dimensions; i = number of iterations.

Typically we will have around 9–10 clusters (K ≤ 10) and iterations i ≤ 300.

Since K and i are very small compared to n and d,

the time complexity is approximately O(n * d).

Space complexity:

O(n*d + k*d), assuming K is much smaller than n.

Since we need to store all of the points, this can be viewed as

≈ O(n*d) -> linear.

K-Means is actually quite fast for real-world applications, which is why it is so popular and widely used.

Sample Code For K-Means:

You can read about the parameters here:

K-Means with the example dataset for customer behavior clustering with five named features:

Feature 1: Age (continuous)

Feature 2: Income (continuous)

Feature 3: Purchase Frequency (continuous)

Feature 4: Website Visits (continuous)

Feature 5: Customer Satisfaction Score (continuous)

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

from sklearn.cluster import KMeans
import numpy as np

data = [
[35, 50000, 4, 10, 7.5],
[42, 75000, 8, 15, 8.2],
[28, 40000, 3, 5, 6.7],
[55, 100000, 2, 3, 9.0],
[38, 60000, 5, 12, 7.8],
[46, 85000, 9, 20, 8.9],
[31, 45000, 2, 6, 6.5],
[39, 70000, 7, 18, 8.4],
[27, 35000, 4, 8, 7.2],
[48, 90000, 6, 14, 8.7],
[33, 55000, 3, 7, 6.8],
[50, 95000, 5, 16, 8.1],
[25, 30000, 1, 4, 6.0],
[41, 80000, 8, 17, 8.5],
[29, 38000, 2, 9, 6.3],
[37, 65000, 6, 13, 7.9],
[43, 82000, 7, 19, 8.6],
[30, 42000, 3, 11, 6.9],
[52, 98000, 4, 15, 8.3],
[36, 58000, 5, 10, 7.6],
[44, 90000, 9, 21, 9.2],
[32, 50000, 2, 5, 6.6],
[49, 88000, 7, 16, 8.0],
[26, 32000, 4, 7, 6.4],
[40, 75000, 6, 15, 7.7],
[34, 52000, 3, 9, 6.7],
[51, 93000, 5, 17, 8.2],
[24, 28000, 1, 3, 5.8],
[45, 89000, 8, 20, 8.8],
[33, 48000, 3, 8, 6.7],
[47, 92000, 7, 18, 8.3]
]

X = np.array(data)

# algorithm='lloyd' selects the classic Lloyd's algorithm (named 'full' in older scikit-learn versions)
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, copy_x=True, algorithm='lloyd')
kmeans.fit(X)
labels = kmeans.labels_
print(labels) # Prints the cluster labels assigned to each data point

predictions = kmeans.predict([[0, 0, 0, 0, 0], [12, 3, 0, 0, 0]])
print(predictions) # Prints the predicted clusters for the given data points

centers = kmeans.cluster_centers_
print(centers) # Prints the coordinates of the cluster centers
Predictions.

In order to Visualize:

#In 2D

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Perform PCA to reduce dimensionality to 2
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Fit KMeans on the reduced data
kmeans.fit(X_pca)

# Get cluster labels
labels = kmeans.labels_

# Get cluster centers
centers = kmeans.cluster_centers_

# Visualize the clusters
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('KMeans Clustering')
plt.show()
2D Visualization.
#In 3D

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA

# Perform PCA to reduce dimensionality to 3
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

# Fit KMeans on the reduced data
kmeans.fit(X_pca)

# Get cluster labels
labels = kmeans.labels_

# Get cluster centers
centers = kmeans.cluster_centers_

# Visualize the clusters
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=labels)
ax.scatter(centers[:, 0], centers[:, 1], centers[:, 2], c='red', marker='X', s=200)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
ax.set_title('KMeans Clustering')
plt.show()
3D Visualization.

Enjoy the read, and I hope you now have a solid understanding of K-Means, K-Means++, K-Medoids, and PAM, and how they are useful.
