The effectiveness of clustering in IIoT

6 min read · Apr 10, 2023

How this machine learning model has become a sustainable and reliable solution for edge devices in an industrial network

An Introduction

Clustering (cluster analysis, or CA) and classification are two important tasks that occur in our daily lives. As human beings, we tend to cluster and classify objects or ideas hundreds of times a day. With the emergence of data science and AI, clustering has allowed us to uncover patterns in data sets that are not easily detectable by the human eye, which makes this type of task very important for exploratory data analysis. CA is widely used in various fields and industries such as marketing (customer segmentation), biology/genetics (grouping genes or proteins by disease subtype), image/video analysis (grouping similar images or videos based on their visual features), natural language processing, and anomaly detection (detecting anomalous behavior in areas such as cybersecurity, fraud detection, and industrial control systems).

Three-feature visual representation of the K-means algorithm. Source: Marubon-DS

Unsupervised Learning

In the data science context, clustering is an unsupervised machine learning technique, which means that it does not require predefined labeled inputs or outcomes to learn from. Instead, the goal of clustering is to identify groups, or clusters, in the data based on distance metrics or similarities. Essentially, a clustering algorithm groups data points together without any prior knowledge or guidance, discovering hidden patterns or unusual groupings without the need for human intervention.

An unsupervised machine learning model clustering data into groups
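As a minimal sketch of what "no labels" means in practice, the snippet below groups points using only their coordinates. The data is made up for illustration, the helper name `cluster_unlabeled` is hypothetical, and K-means is just one possible choice of algorithm here.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_unlabeled(points, n_clusters, seed=0):
    """Group points into n_clusters using only their coordinates (no labels)."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(points)
    return labels, model.cluster_centers_


# Made-up data: two well-separated groups of 2-D points.
# The algorithm is never told which group a point came from.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
group_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
points = np.vstack([group_a, group_b])

labels, centers = cluster_unlabeled(points, n_clusters=2)
print("points per cluster:", np.bincount(labels))
print("cluster centers:\n", centers)
```

The cluster ids themselves (0 or 1) are arbitrary; only the grouping carries meaning.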

Components

An effective clustering technique depends on the following key components:

Addressing the curse of dimensionality: techniques such as feature selection, feature engineering, and dimensionality reduction can be used to remove redundant components from the dataset

Visual representation of the PCA algorithm. In this instance, there is a positive relationship between components 1 and 2

Choosing an appropriate distance metric for the clustering algorithm (this can be Euclidean, Manhattan, cosine, etc.)

X: array or vector X; Y: array or vector Y; x: values on the horizontal axis of the coordinate plane; y: values on the vertical axis of the coordinate plane; n: number of observations

Choosing a clustering algorithm such as K-means, hierarchical clustering, density-based clustering, or spectral clustering
Determining the optimal number of clusters using the elbow method, the gap statistic, or the silhouette score
Evaluating the clustering technique with metrics such as the silhouette score to assess the quality and validity of the clustering solution
Ensuring the results are easily interpretable, through visualizations or domain knowledge in the field, so that they provide meaningful context for the data and its application
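The components listed above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_blobs`), PCA stands in for the dimensionality reduction step, K-means is the chosen algorithm, and the silhouette score is used both to pick the number of clusters and to evaluate the result. The helper names and the candidate range of 2 to 6 clusters are hypothetical choices, not part of any particular deployment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


# Three common distance metrics between two vectors a and b.
def euclidean(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))


def manhattan(a, b):
    return float(np.sum(np.abs(a - b)))


def cosine_distance(a, b):
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def pick_k_by_silhouette(data, candidates, seed=0):
    """Fit K-means for each candidate k and keep the k with the best silhouette."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)
        scores[k] = float(silhouette_score(data, labels))
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for sensor data: 300 samples with 10 features each.
raw, _ = make_blobs(n_samples=300, n_features=10, centers=3,
                    cluster_std=1.0, random_state=42)

# Dimensionality reduction: keep the 2 directions with the most variance.
reduced = PCA(n_components=2, random_state=0).fit_transform(raw)

# Choose the number of clusters from a small candidate range.
best_k, scores = pick_k_by_silhouette(reduced, candidates=range(2, 7))

print("euclidean:", euclidean(reduced[0], reduced[1]))
print("manhattan:", manhattan(reduced[0], reduced[1]))
print("cosine distance:", cosine_distance(reduced[0], reduced[1]))
print("silhouette per k:", scores)
print("selected k:", best_k)
```

K-means itself always uses Euclidean distance; the other two metrics are shown only to make the definitions concrete, and would matter for algorithms that accept a configurable metric.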

Industrial Internet of Things (IIoT)

The Constraints

Within the area of Industry 4.0, many industrial companies face various technical constraints that can affect their operations and revenue. These include network connectivity (particularly in areas where networks are limited or unreliable), bandwidth (due to the very large amount of data generated by IoT devices), security (vulnerability to cyberattacks that can lead to unauthorized access to private data), energy (ensuring that edge devices have an energy-efficient design that minimizes consumption and prolongs their lifespan), and data management (handling information well enough to support effective analysis and data-driven decision making).

IT security Photo by Pixabay from Pexels

The Solutions

Clustering can address all of the constraints mentioned above to a large extent, and it is a suitable technology for deployment on hundreds to thousands of edge devices. Supervised machine learning models (such as SVM or gradient boosting) and deep learning models (such as CNNs or RNNs) can promise far superior performance compared with clustering models; however, this can come at a greater cost with only marginal rewards for the environment, the end user, and the product owner of such technology. For each constraint described above:

Connectivity: Clustering enables local data processing and analysis on edge devices, which reduces both the amount of data transmitted over the network and the reliance on a central server for data processing (similar to a federated learning approach)
Bandwidth: To reduce bandwidth requirements in IIoT systems, clustering can compress data and transmit only specific clusters of interest. This makes data transmission more efficient and reduces the risk of network latency. Clustering can also be combined with local data caching, which reduces the need for continuous data transmission, improving network efficiency and reducing energy consumption (Zhao, et al., 2016)
Data Management: By running clustering locally, edge devices in the network can perform near-real-time data analysis to support data-driven decisions
Energy: Clustering methods have been shown to be more energy efficient when it comes to data transmission and processing (Loganathan & Arumugam, 2021). Edge devices that cluster locally also avoid constant data transmission, which contributes to their energy efficiency. In contrast, deep learning models with complex architectures (many parameters and lengthy training processes) typically require more computational power to run. Clustering, on the other hand, does not require a supervised training phase and, most importantly, meets the resource-constrained demands of many industrial ecosystems
Security: Clustering can improve the privacy of data in IIoT systems by allowing local data processing. Recent research has shown that devices running hybrid clustering algorithms can facilitate the broader deployment of trustworthy, smart nodes at the network edge without the need for central servers (Lapegna, et al., 2023). Clustering locally avoids transmitting sensitive data over the network
Other benefits: Clustering can also be deployed as a machine learning model for anomaly detection and predictive analytics, where the purpose is to predict the cluster assignment of new data points based on the patterns learned from the training data. In the context of anomaly detection, clustering can group edge devices with similar behavior based on features such as CPU usage, memory usage, or network traffic. Once clusters are established, any device that does not belong to them can be labeled an anomaly, indicating that the device may be malfunctioning or undergoing a cybersecurity breach
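To make the bandwidth point concrete, here is a rough sketch of how an edge device might send a cluster summary instead of every raw reading. It is an illustration only: the sensor values are randomly generated, the three features (temperature, vibration, pressure) and the helper `summarize_window` are hypothetical, and summarizing a window as K-means centroids plus counts is one assumed form of lossy compression rather than a method taken from the cited papers.

```python
import numpy as np
from sklearn.cluster import KMeans


def summarize_window(readings, n_clusters=3, seed=0):
    """Compress a window of readings into centroids plus per-cluster counts."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(readings)
    counts = np.bincount(model.labels_, minlength=n_clusters)
    return {
        "centroids": model.cluster_centers_.astype(np.float32),
        "counts": counts.astype(np.int32),
    }


# Made-up window of 1,000 readings: (temperature, vibration, pressure).
rng = np.random.default_rng(1)
readings = rng.normal(loc=(70.0, 0.5, 101.0), scale=(2.0, 0.1, 1.5),
                      size=(1000, 3)).astype(np.float32)

summary = summarize_window(readings)

# Compare the payload an edge device would send in each case.
raw_bytes = readings.nbytes
summary_bytes = summary["centroids"].nbytes + summary["counts"].nbytes
print(f"raw window: {raw_bytes} bytes, cluster summary: {summary_bytes} bytes")
```

The summary discards individual readings, so it only suits cases where the central system needs the shape of the data rather than every sample.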

In the code below, I provide an example of OPTICS, a well-known clustering method that can be used for anomaly detection, applied to synthetic blob data in Python

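from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt


# generate a single synthetic blob of 350 two-dimensional points
random.seed(10)
x, _ = make_blobs(n_samples=350, centers=1, cluster_std=.4, center_box=(20, 5))
model = OPTICS().fit(x)

# assumption: flag the 2% of points with the largest core distances
# as anomalies (the 0.98 cutoff is illustrative)
scores = model.core_distances_
thresh = quantile(scores, .98)
index = where(scores >= thresh)
values = x[index]

# visualize the results in a plot by highlighting anomalies in red
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.legend(("normal", "anomaly"), loc="best", fancybox=True, shadow=True)
plt.grid(True)
plt.show()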

Anomaly detection with the OPTICS clustering algorithm
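The device-level idea described earlier (grouping edge devices by CPU, memory, or network behavior, then flagging any device that fits no established cluster) could look roughly like the sketch below. Everything in it is assumed for illustration: the feature values are made up, the helpers `fit_baseline` and `flag_anomalies` are hypothetical, K-means is used in place of OPTICS because it can assign new points to existing clusters, and the 0.99 quantile threshold is an arbitrary starting point.

```python
import numpy as np
from sklearn.cluster import KMeans


def fit_baseline(features, n_clusters=2, quantile=0.99, seed=0):
    """Cluster normal device behavior and derive a distance threshold."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(features)
    # Distance from each training point to its nearest cluster center.
    nearest = model.transform(features).min(axis=1)
    threshold = float(np.quantile(nearest, quantile))
    return model, threshold


def flag_anomalies(model, threshold, new_features):
    """Assign new devices to clusters and flag those far from every center."""
    labels = model.predict(new_features)
    nearest = model.transform(new_features).min(axis=1)
    return labels, nearest > threshold


# Made-up features per device: CPU %, memory %, network traffic (MB/s).
rng = np.random.default_rng(7)
idle = rng.normal(loc=(10.0, 30.0, 1.0), scale=(2.0, 3.0, 0.2), size=(200, 3))
busy = rng.normal(loc=(60.0, 70.0, 8.0), scale=(4.0, 4.0, 0.8), size=(200, 3))
model, threshold = fit_baseline(np.vstack([idle, busy]))

new_devices = np.array([
    [11.0, 29.0, 1.1],   # resembles the idle group
    [99.0, 5.0, 40.0],   # unlike either group
])
labels, is_anomaly = flag_anomalies(model, threshold, new_devices)
print("cluster assignment:", labels)
print("flagged as anomaly:", is_anomaly)
```

The features are left unscaled to keep the sketch short; with real telemetry, features on very different scales would normally be standardized before computing distances.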

References

Onu Peter, Anup Pradhan, & Charles Mbohwa (2023). Industrial internet of things (IIoT): opportunities, challenges, and requirements in manufacturing businesses in emerging economies.
Introduction to K-means: Algorithm and Visualization with Julia from scratch. (n.d.). Retrieved April 9, 2023, from http://marubon-ds.blogspot.com/2018/04/introduction-to-k-means-algorithm-and.html
Lapegna M, Mele V, Romano D. Clustering Algorithms for Enhanced Trustworthiness on High-Performance Edge-Computing Devices. Electronics. 2023; 12(7):1689. https://doi.org/10.3390/electronics12071689
Leilei, S., Guoqing, C., Hui, X., & Chonghui, G. (2017). Cluster Analysis in Data-Driven Management and Decisions. Journal of Management Science and Engineering, 2, 227.
Loganathan, S., Arumugam, J. Energy Efficient Clustering Algorithm Based on Particle Swarm Optimization Technique for Wireless Sensor Networks. Wireless Pers Commun 119, 815–843 (2021). https://doi.org/10.1007/s11277-021-08239-z
Prospero, L., Costa, R., & Badia, L. (2021). Resource Sharing in the Internet of Things and Selfish Behaviors of the Agents. IEEE Transactions on Circuits and Systems II: Express Briefs, 68(12), 3488–3492. doi: 10.1109/TCSII.2021.3121560
Z. Zhao, M. Peng, Z. Ding, W. Wang and H. V. Poor, “Cluster Content Caching: An Energy-Efficient Approach to Improve Quality of Service in Cloud Radio Access Networks,” in IEEE Journal on Selected Areas in Communications, vol. 34, no. 5, pp. 1207–1221, May 2016, doi: 10.1109/JSAC.2016.2545384.
