Clustering — Beyond KMeans+PCA…

Shahar Gino
9 min read · Jul 14, 2023

Perhaps the most popular clustering method is K-Means. It natively supports only numerical data, so an encoding step is typically applied first to convert categorical data into numerical form. It is also very common to combine K-Means with PCA for visualizing the clustering results, and many clustering applications follow that path (e.g. this link). The purpose of this article is to introduce powerful alternatives to that common path, mainly suitable for mixed data.

Introduction

Clustering

Clustering is a fundamental technique in the field of machine learning that aims to group similar data points together based on their inherent characteristics or properties. It is a form of unsupervised learning, which means it does not require labeled training data or predefined target variables. Instead, clustering algorithms analyze the structure and patterns within the data to identify natural groupings or clusters.

The objective of clustering is to discover hidden relationships, similarities, or patterns in the data without any prior knowledge or guidance. It can be applied to a wide range of domains and has numerous practical applications, such as customer segmentation, image and document categorization, anomaly detection, and social network analysis.

Shallow and Deep Clustering

Clustering algorithms can be categorized into two types based on their approach to learning: shallow learning and deep learning. Common shallow-learning algorithms are K-Means, Hierarchical Clustering, Expectation-Maximization (EM) and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Common deep-learning algorithms are Self-Organizing Maps (SOM), Deep Embedded Clustering (DEC), Variational AutoEncoders (VAEs) and Deep K-Means.

Shallow learning clustering algorithms are often used when the data has well-defined, separable clusters, interpretability is important, or the dataset is small. Deep learning clustering algorithms are preferred when the data is complex, high-dimensional, or lacks clear separability, and when more powerful feature representation and pattern-extraction capabilities are required. The choice between shallow-learning and deep-learning clustering techniques depends on the specific characteristics of the dataset, its size, the desired level of interpretability, and the complexity of the underlying patterns.

Numerical, Categorical and Mixed data

Any dataset typically comprises Numerical and/or Categorical data, and that distinction is important for planning the respective clustering strategy.

Numerical data consists of quantitative values that can be measured or expressed as numbers, representing quantities or measurements on a continuous or discrete scale. Examples include age, height, temperature, weight, and income. Numerical data can be further classified into Continuous and Discrete sub-types.

Categorical data represents characteristics or qualities and consists of discrete, non-numerical values that belong to specific categories or classes; it captures qualitative variables rather than quantitative measurements. Categorical data can be further divided into Nominal and Ordinal sub-types.

Finally, Mixed data refers to datasets that contain a combination of numerical and categorical variables. In real-world datasets, it is common to encounter both numerical measurements and categorical attributes.
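
As a quick illustration, here is a minimal pandas sketch of such a mixed dataset (the column names and values are made up for demonstration):

import pandas as pd

# A small mixed dataset: numerical (age, income) and categorical (gender, city) columns
df = pd.DataFrame({
    'age':    [25, 30, 35, 20],          # numerical, discrete
    'income': [52.3, 61.0, 75.5, 48.2],  # numerical, continuous
    'gender': ['M', 'F', 'M', 'F'],      # categorical, nominal
    'city':   ['New York', 'Los Angeles', 'Chicago', 'Chicago'],  # categorical, nominal
})

print(df.dtypes)  # numerical columns show as int64/float64, categorical ones as object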

Data Encoding

Encoding is a technique used to convert categorical data into numerical form so that it can be processed by machine learning algorithms that expect numerical input only. Some of the common encoding methods are Label Encoding, One-Hot Encoding, Binary Encoding, Ordinal Encoding, Frequency Encoding, Target Encoding, etc. The choice of encoding method depends on the nature of the categorical variable, the relationship between categories, the size of the dataset, and the specific machine learning algorithm being used. Encoding should be performed carefully to avoid introducing bias or misinterpretation of the data.
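
For example, a minimal sketch of three of these encodings on a toy column (the data and the S/M/L ordering are made up for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'city': ['New York', 'Los Angeles', 'Chicago', 'Chicago'],
                   'size': ['S', 'M', 'L', 'M']})

# Label Encoding: each category is mapped to an arbitrary integer (may imply a spurious order)
df['city_label'] = LabelEncoder().fit_transform(df['city'])

# One-Hot Encoding: each category becomes its own binary column (no implied order)
df_onehot = pd.get_dummies(df, columns=['city'])

# Ordinal Encoding: categories mapped to integers that respect a known order
df['size_ordinal'] = df['size'].map({'S': 0, 'M': 1, 'L': 2})

print(df_onehot)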

Visualization and Dimensionality Reduction

The clustering result maps each element in the data to its corresponding cluster. However, the elements typically contain many features, i.e. they belong to a high-dimensional space. High-dimensional data can be challenging to visualize directly, as human perception is limited to three dimensions. By reducing the dimensionality of the data to two or three dimensions, dimensionality-reduction algorithms enable the visualization of complex datasets and aid in understanding the underlying structure, relationships, and patterns within the data. They help identify clusters, outliers, trends, and other important features, making them indispensable for exploratory data analysis and pattern recognition. Some of the common dimensionality-reduction algorithms are Principal Component Analysis (PCA), t-SNE (t-Distributed Stochastic Neighbor Embedding), Linear Discriminant Analysis (LDA), Independent Component Analysis (ICA), etc.
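
For instance, a minimal PCA sketch that projects synthetic 10-dimensional numerical data down to two dimensions for plotting (the data here is generated only for illustration):

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic high-dimensional numerical data (10 features, 3 latent clusters)
X, _ = make_blobs(n_samples=200, n_features=10, centers=3, random_state=42)

# Project down to 2 principal components for visualization
X_2d = PCA(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1])
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('PCA projection to 2D')
plt.show()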

K-Prototypes

K-Prototypes offers an advantage over K-Means in that it works with mixed data types. It measures the distance between numerical features using Euclidean distance (like K-Means), and measures the distance between categorical features using the number of matching categories. It was first published by Huang (1998) and is implemented in Python by the kmodes package.

Coding example, based on the kmodes library:

import numpy as np
from kmodes.kprototypes import KPrototypes

# Example data with numeric (age) and categorical (gender, city) features
data = np.array([
    [25, 'M', 'New York'],
    [30, 'F', 'Los Angeles'],
    [35, 'M', 'Chicago'],
    [20, 'F', 'Chicago'],
    [40, 'M', 'New York'],
    [32, 'F', 'Los Angeles'],
], dtype=object)

# Specify the indices of the categorical columns
categorical_cols = [1, 2]

# Initialize the K-Prototypes algorithm
kproto = KPrototypes(n_clusters=2, init='Cao', verbose=2)

# Fit the model to the data and obtain a cluster label per row
clusters = kproto.fit_predict(data, categorical=categorical_cols)

# Print the cluster centroids (numeric means and categorical modes)
print(kproto.cluster_centroids_)

In this example, we have a dataset with three columns: age (numeric), gender (categorical), and city (categorical). We specify the indices of the categorical columns in the categorical_cols list. The KPrototypes class is then used to initialize the K-Prototypes algorithm, where n_clusters specifies the number of desired clusters, and init determines the initialization method. The fit_predict method is called to fit the model to the data and obtain the cluster assignments for each data point. The categorical parameter is set to indicate the indices of the categorical columns. Finally, the cluster centroids can be accessed via the cluster_centroids_ attribute of the KPrototypes object, which provides the centroid values for both numeric and categorical features.
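
One practical follow-up is choosing the number of clusters. A minimal elbow-style sketch, reusing data and categorical_cols from the example above and relying on the cost_ attribute exposed by kmodes (the cluster counts scanned here are arbitrary for this toy data):

# Elbow-style scan: fit K-Prototypes for several cluster counts and compare the clustering cost
# ('data' and 'categorical_cols' are reused from the example above)
costs = {}
for k in range(2, 5):
    model = KPrototypes(n_clusters=k, init='Cao', verbose=0)
    model.fit_predict(data, categorical=categorical_cols)
    costs[k] = model.cost_  # sum of distances of all points to their cluster centroids

print(costs)  # look for the 'elbow' where adding more clusters stops paying off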

Gower + DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups together data points based on their proximity in the feature space. It does not require the number of clusters to be predefined and can discover clusters of arbitrary shapes and sizes. It defines clusters as regions of high density separated by regions of low density, and it can identify noise points as well. DBSCAN requires two parameters: epsilon (eps), which determines the maximum distance between two points to consider them as neighbors, and min_samples, which specifies the minimum number of points required to form a dense region (core point). Points that are within the epsilon distance of a core point or are directly or indirectly connected to a core point are considered part of the same cluster.

Gower distance is a metric used to calculate the similarity between pairs of objects or observations in a dataset. It can handle mixed data types and takes into account both the categorical and numerical values to compute the dissimilarity. Gower distance scales the numerical features and treats the categorical features as binary variables, considering the presence or absence of a category. By using Gower distance, it is possible to measure the dissimilarity between observations with mixed feature types in a consistent manner. Gower distance is often used as a distance metric in clustering algorithms, such as DBSCAN, when dealing with datasets containing both categorical and numerical features.
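
To make the metric concrete, below is a hand-rolled sketch of an unweighted Gower dissimilarity between two mixed-type records (purely for illustration; the gower package used in the next example does this, and more, for you). The feature ranges passed in are assumed to come from the full dataset:

# Simplified, unweighted Gower dissimilarity between two mixed-type records:
# numerical features contribute |a - b| / range, categorical features contribute a 0/1 mismatch
def gower_pair(a, b, num_ranges, cat_idx):
    parts = []
    for i, (x, y) in enumerate(zip(a, b)):
        if i in cat_idx:
            parts.append(0.0 if x == y else 1.0)
        else:
            parts.append(abs(x - y) / num_ranges[i])
    return sum(parts) / len(parts)

# Example records: (age, height, city), with assumed dataset-wide ranges of 20 years and 0.5 m
print(gower_pair((25, 1.70, 'NY'), (35, 1.80, 'LA'), num_ranges={0: 20, 1: 0.5}, cat_idx={2}))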

Coding example, based on the gower package and scikit-learn's DBSCAN:

import gower
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Example data with two numerical features and one categorical feature
data = pd.DataFrame({
    'feat1': [1.2, 2.3, 0.9, 1.5, 2.0],
    'feat2': [0.5, 0.7, 1.5, 2.0, 1.2],
    'cat':   ['A', 'B', 'A', 'C', 'B'],
})

# Calculate the Gower distance matrix over the mixed (numerical + categorical) features;
# the gower package treats non-numeric columns as categorical and range-scales the numeric ones
gower_dist = gower.gower_matrix(data)

# Initialize DBSCAN with desired parameters, using the precomputed Gower distances
dbscan = DBSCAN(eps=0.3, min_samples=2, metric='precomputed')

# Perform clustering (label -1 marks noise points)
labels = dbscan.fit_predict(gower_dist)

# Plot the clusters over the two numerical features
for label in np.unique(labels):
    mask = labels == label
    plt.scatter(data.loc[mask, 'feat1'], data.loc[mask, 'feat2'],
                label='Noise' if label == -1 else f'Cluster {label}')

plt.title('DBSCAN Clustering with Gower Distance')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

In this example, we have a dataset with two numerical features and one categorical feature, held in a pandas DataFrame. The Gower distance matrix is calculated with the gower_matrix function, which range-scales the numerical columns, treats the non-numeric column as categorical, and computes the Gower distance for each pair of data points. Next, the DBSCAN algorithm is initialized with the desired parameters, namely epsilon (eps) and the minimum number of samples (min_samples). The metric parameter is set to 'precomputed' to indicate that a precomputed distance matrix is supplied. Clustering is performed with the fit_predict method of the DBSCAN object, which assigns a cluster label to each data point (with -1 marking noise). Finally, a scatter plot of the two numerical features is created to visualize the clusters, with each cluster plotted in a different color and a legend added to identify them.
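
As an optional follow-up, the clustering quality can be assessed with the silhouette score, which in scikit-learn also accepts a precomputed distance matrix. A minimal sketch, reusing gower_dist and labels from the example above, and only meaningful when at least two non-noise clusters are found:

from sklearn.metrics import silhouette_score

# Exclude DBSCAN noise points (label -1) before scoring
mask = labels != -1
if len(np.unique(labels[mask])) >= 2:
    score = silhouette_score(gower_dist[np.ix_(mask, mask)], labels[mask], metric='precomputed')
    print(f'Silhouette score: {score:.3f}')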

UMAP Embeddings

Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) is a dimension-reduction technique that can be used for visualization, similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

  1. The data is uniformly distributed on a Riemannian manifold
  2. The Riemannian metric is locally constant (or approximately so)
  3. The manifold is locally connected

Unlike PCA, UMAP is a nonlinear technique that preserves both local and global structure, making it suitable for capturing complex, nonlinear patterns. UMAP is based on manifold learning, which assumes that the data lies on a lower-dimensional manifold embedded in a higher-dimensional space. It constructs a graph representation of the data, preserves the local distances in that graph, and then optimizes a low-dimensional embedding that best preserves these distances. UMAP is particularly useful for visualizing high-dimensional data, and it can handle both numerical and categorical data, making it more versatile than PCA. It should be noted that UMAP may be more computationally expensive than PCA and could require tuning of its hyper-parameters.

Coding example, based on the umap-learn library:

import umap
import matplotlib.pyplot as plt

# Example data: five points in a 3-dimensional space
data = [
    [1.2, 0.5, 0.8],
    [2.3, 0.7, 0.6],
    [0.9, 1.5, 1.0],
    [1.5, 2.0, 1.8],
    [2.0, 1.2, 1.5],
]

# Initialize UMAP with desired parameters
# (n_neighbors should be smaller than the number of samples, hence 2 for this tiny example)
umap_model = umap.UMAP(n_neighbors=2, min_dist=0.3, n_components=2)

# Fit the model to the data and obtain the low-dimensional representation
umap_embedding = umap_model.fit_transform(data)

# Plot the UMAP embeddings
plt.scatter(umap_embedding[:, 0], umap_embedding[:, 1])
plt.title('UMAP Embeddings')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
plt.show()

In this example, data is a list of data points in a higher-dimensional space. UMAP is initialized with the desired parameters, such as the number of neighbors (n_neighbors), the minimum distance between points (min_dist), and the number of desired dimensions in the low-dimensional representation (n_components). The fit_transform method is then called to fit the UMAP model to the data and obtain the low-dimensional representation. The resulting umap_embedding contains the UMAP embeddings of the data. Finally, a 2D scatter plot is created by providing the UMAP embeddings to the scatter-plot function, using umap_embedding[:, 0] for the x-axis values and umap_embedding[:, 1] for the y-axis values.

The scatter plot can be colored with a pre-calculated color vector, e.g. according to the K-Prototypes clustering, the Gower+DBSCAN clustering, etc.; a minimal sketch is shown below. An example of the former combination is provided in this article.
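
For instance, assuming clusters holds one cluster label per embedded row (e.g. the K-Prototypes assignments computed earlier for the same records), the embedding can be colored directly via the c argument of plt.scatter:

# Color the UMAP scatter plot by a pre-computed cluster assignment
# ('umap_embedding' comes from the snippet above; 'clusters' is assumed to hold
# one cluster label per embedded row, e.g. from K-Prototypes or Gower+DBSCAN)
plt.scatter(umap_embedding[:, 0], umap_embedding[:, 1], c=clusters, cmap='viridis')
plt.colorbar(label='Cluster')
plt.title('UMAP Embeddings Colored by Cluster')
plt.show()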

Summary

In the realm of data analysis and machine learning, traditional methods like K-means clustering and Principal Component Analysis (PCA) have long been go-to choices for clustering and dimensionality reduction tasks. However, emerging alternatives such as K-Prototypes, Gower distance, DBSCAN, and UMAP embeddings are gaining recognition for their ability to tackle specific challenges and offer more flexible solutions.

This article explores these novel techniques and highlights their advantages over K-means and PCA. K-Prototypes extends K-means to handle mixed data types, Gower distance measures dissimilarity for diverse data, DBSCAN excels at detecting clusters of arbitrary shapes, and UMAP embeddings preserve the global structure of high-dimensional data.

By understanding and leveraging these alternatives, data scientists and analysts can enhance their clustering and dimensionality reduction capabilities for a wide range of applications.
