KMeans and Decision Tree Simplified

Ashmal Vayani
6 min read · May 2, 2023


K-Means Clustering

What is K-Means Clustering in Machine Learning?

K-Means Clustering is an unsupervised machine learning algorithm used for clustering data points into groups or clusters based on their similarity. K-Means aims to partition a set of data points into K clusters, where each data point belongs to the cluster whose centroid is closest. The algorithm tries to minimize the sum of squared distances between each data point and its assigned centroid, known as the Within-Cluster Sum of Squares (WCSS).
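
Written out, the objective K-Means minimizes is the following, where C_k denotes the set of points assigned to cluster k and mu_k is that cluster's centroid:

```latex
\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```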

How Does K-Means Clustering Work?

The K-Means Clustering algorithm works by iteratively assigning each data point to its nearest centroid and then updating the centroid based on the mean of the data points assigned to it. The steps are as follows:

  1. Initialize K centroids randomly from the data points.
  2. Assign each data point to its nearest centroid based on the Euclidean distance.
  3. Calculate the mean of the data points assigned to each centroid and update the centroid location.
  4. Repeat steps 2–3 until convergence, i.e., no more changes occur in the assignments.

The algorithm outputs the final K clusters with their respective centroid locations.
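
A minimal NumPy sketch of these four steps (this iterative scheme is known as Lloyd's algorithm); the synthetic data, the number of clusters, and the iteration cap are illustrative assumptions, not a production implementation:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-Means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize K centroids by sampling data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        # (keeping the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids (and hence assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```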

How is K Determined in K-Means Clustering?

K, the number of clusters, is a hyperparameter that needs to be chosen before applying the K-Means algorithm. There are different methods to determine the optimal value of K:

  1. Elbow method: Plot the WCSS against the number of clusters, and choose the value of K where the curve starts to flatten out, resembling an elbow.

To find the optimal value of clusters, the elbow method follows the below steps:

  • Run K-Means on the given dataset for a range of K values (for example, K = 1 to 10).
  • For each value of K, compute the WCSS.
  • Plot the computed WCSS values against the number of clusters K.
  • The point where the curve bends sharply, like the elbow of an arm, is taken as the best value of K.

  2. Silhouette method: Compute the silhouette score for different values of K and choose the one with the highest score. The silhouette score measures how well-separated the clusters are and ranges from -1 to 1, with higher values indicating better clustering. (Both this heuristic and the elbow method are sketched in code after this list.)

  3. Domain knowledge: If prior knowledge about the data is available, it can help in determining a reasonable value of K.
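
Here is a sketch of the elbow and silhouette heuristics using scikit-learn; the synthetic data from make_blobs and the K range of 2 to 10 are assumptions for illustration (the silhouette score requires at least two clusters):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

wcss, sil = [], []
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # inertia_ is exactly the WCSS
    sil.append(silhouette_score(X, km.labels_))

# Elbow: look for the bend in the WCSS curve.
# Silhouette: pick the K with the highest score.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(ks, wcss, marker="o"); ax1.set(xlabel="K", ylabel="WCSS (inertia)")
ax2.plot(ks, sil, marker="o"); ax2.set(xlabel="K", ylabel="Silhouette score")
plt.show()

print("Best K by silhouette:", ks[sil.index(max(sil))])
```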

How is K-Means Clustering Better Than Other Unsupervised Machine Learning Models?

K-Means Clustering has several advantages over other unsupervised machine learning models:

  1. Fast and scalable: K-Means is a relatively simple algorithm that scales well to large datasets, making it computationally efficient.
  2. Easy to implement: K-Means is easy to understand and implement, even for non-experts in machine learning.
  3. Versatile: K-Means can be applied to a wide range of data types and is effective in finding clusters of different shapes and sizes.
  4. Can isolate unusual points: outliers can sometimes be separated out by allocating an extra cluster for them, although the cluster means themselves are easily pulled by extreme values.

What Are the Limitations or Drawbacks of K-Means Clustering?

K-Means Clustering has a few limitations and drawbacks:

  1. Sensitive to initial centroid locations: K-Means is sensitive to the initial centroid locations and can converge to a suboptimal solution, depending on the initialization (a common mitigation is sketched after this list).
  2. Requires a predetermined number of clusters: K-Means requires the number of clusters to be specified beforehand, which can be challenging in practice.
  3. Assumes spherical clusters: K-Means assumes that clusters are spherical and of equal size, which may not be accurate in some datasets.
  4. May not work well with high-dimensional data: K-Means may not work well with high-dimensional data, as the distance metric becomes less meaningful in higher dimensions.
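
The initialization sensitivity in limitation 1 is commonly mitigated by smarter seeding and multiple restarts; scikit-learn supports both via k-means++ initialization and the n_init parameter. A minimal sketch, where the make_blobs data is an illustrative assumption:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# k-means++ spreads the initial centroids apart, and n_init restarts the
# algorithm several times, keeping the run with the lowest WCSS. Together
# they reduce the chance of converging to a poor local optimum.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
print("WCSS, best of 10 k-means++ runs:", km.inertia_)

# A single run from one purely random initialization can land on a worse solution.
km_single = KMeans(n_clusters=4, init="random", n_init=1, random_state=0).fit(X)
print("WCSS, single random-init run:", km_single.inertia_)
```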

In What Situations is K-Means Clustering a Suitable Algorithm to Use?

K-Means Clustering is suitable to use in various situations, including:

  1. Customer Segmentation: K-Means can be used to group customers based on their behaviour or preferences, allowing companies to tailor their marketing strategies.
  2. Image Segmentation: K-Means can be used to segment an image into different regions based on the colour or intensity of pixels.
  3. Anomaly Detection: K-Means can be used to identify outliers or anomalies in a dataset by assigning them to a separate cluster.
  4. Document Clustering: K-Means can be used to cluster similar documents based on their content, allowing for easier organization and retrieval.
  5. Recommender Systems: K-Means can be used to group similar items or products based on their features, allowing for personalized recommendations.

Decision Tree Classifier

A Decision Tree is a supervised learning technique that can be used for both classification and regression problems. (Unlike linear regression models, which estimate coefficients for the predictors, tree-based regression models estimate the relative importance of the predictors.)

What is Decision Tree in Machine Learning?

A decision tree is a supervised machine-learning algorithm for regression and classification problems. It models the relationship between a target variable and its predictors in a tree-like structure, where each internal node represents a decision based on one of the predictor variables, and each leaf node represents a value of the target variable. A decision tree aims to create a model that can accurately predict the target variable.

How Does Decision Tree Work?

The decision tree algorithm works by recursively partitioning the data based on the values of the predictor variables, creating a tree-like structure of decisions. The steps are as follows:

  1. Choose the best predictor variable to split the data. The split should maximize the difference between the target variable’s values in the resulting subsets.
  2. Split the data into subsets based on the chosen predictor variable.
  3. Repeat steps 1–2 recursively for each subset until a stopping criterion is met, such as a maximum depth or a minimum number of samples in a leaf node.
  4. Assign the target variable’s value to each leaf node by taking the average value (for regression) or choosing the majority class (for classification).

The resulting decision tree can be used to make predictions for new data points by following the path from the root node to a leaf node that matches the values of the predictor variables.
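
A minimal scikit-learn sketch of fitting and querying a decision tree classifier; the Iris dataset and the max_depth value are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a tree: each internal node tests one predictor, each leaf holds a class.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Prediction walks each sample from the root to a leaf.
print("Test accuracy:", clf.score(X_test, y_test))

# The learned decision rules, printed as text.
print(export_text(clf, feature_names=load_iris().feature_names))
```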

How is Decision Tree Better Than Other Supervised Machine Learning Models?

Decision trees have several advantages over other supervised machine learning models:

  1. Easy to understand and interpret: Decision trees provide a graphical representation of the decision-making process, making them easy to understand and interpret, even for non-experts in machine learning.
  2. Non-parametric: Decision trees do not make any assumptions about the underlying distribution of the data, making them flexible and adaptable to various data types.
  3. Handles non-linear relationships: Decision trees can model non-linear relationships between the predictor and target variables, allowing for more accurate predictions.
  4. Robust to outliers: Decision trees are less sensitive to outliers and noise in the data, as they use a recursive partitioning approach.

What are the Limitations or Drawbacks of Decision Tree?

Decision trees have a few limitations and drawbacks:

  1. Overfitting: Decision trees can be prone to overfitting, especially if the tree is too deep or the data has high noise or outliers (a common mitigation is sketched after this list).
  2. Instability: Small changes in the data can result in a different decision tree, making them unstable.
  3. Bias: Decision trees can be biased towards variables with more levels or categories, resulting in an uneven data split.
  4. Difficulty handling continuous variables: Decision trees can struggle with continuous variables or variables with many unique values, as they tend to split the data into too many small partitions.
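
Overfitting (limitation 1) is usually controlled by constraining the tree's growth or by pruning it. A hedged sketch with scikit-learn, where the dataset and the hyperparameter values are assumptions chosen for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree grows until every leaf is pure and tends to overfit.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Limiting depth and leaf size, or applying cost-complexity pruning (ccp_alpha),
# trades a little training accuracy for better generalization.
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                                ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, model in [("deep", deep), ("pruned", pruned)]:
    print(name, "train:", model.score(X_train, y_train),
          "test:", model.score(X_test, y_test))
```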

In What Situations is Decision Tree Classifier Suitable to Use?

The decision tree classifier is suitable to use in various situations, including:

  1. Medical diagnosis: Decision trees can diagnose medical conditions based on symptoms, lab tests, or other patient data.
  2. Customer segmentation: Decision trees can segment customers based on their demographics, behaviour, or preferences, allowing companies to tailor their marketing strategies.
  3. Credit risk assessment: Decision trees can be used to assess the creditworthiness of loan applicants based on their income, credit history, and other financial attributes.
