K-Means Clustering: An Introduction and Implementation in Python

Jan 23, 2023

Image segmentation (source: developer.nvidia.com)

Introduction

K-Means Clustering is an unsupervised machine learning algorithm that divides a dataset into ‘k’ clusters. Its objective is to minimize the squared distance between each data point and the centroid of its cluster, so that points within a cluster lie close together. It does not explicitly maximize the distance between clusters, although compact clusters tend to be well separated as a side effect. Each cluster is represented by its centroid, which is the mean of all the data points in the cluster. The algorithm is easy to implement, computationally efficient, and widely used in applications such as image segmentation, customer segmentation, and anomaly detection.

K-Means Clustering is widely used in various industries such as marketing, finance, and healthcare. For example, in marketing, it can be used for customer segmentation to group similar customers based on their demographics, purchase history, and other characteristics. In finance, it can be used for anomaly detection to identify fraudulent transactions. In healthcare, it can be used for image segmentation to analyze medical images and detect abnormalities.

The article will provide a comprehensive introduction to K-Means Clustering. We will start with a mathematical formulation of the algorithm and its steps. We will then move on to implementation in Python, where we will cover the required libraries, data preparation, and evaluation metrics. We will also look at some real-world examples of K-Means Clustering and its applications. Finally, we will conclude the article with a summary of the algorithm, its limitations, and future work.

Understanding K-Means Clustering

The K-Means algorithm aims to partition a dataset into ‘k’ clusters, where each cluster is defined by its centroid. The centroid is the mean of all the data points in the cluster. The algorithm minimizes the sum of the squared distances between each data point and its closest centroid. This can be mathematically represented as:

$$ J = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 $$

Where J is the cost function, S_i is the set of data points belonging to the i-th cluster, and μ_i is the centroid of the i-th cluster.
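To make the cost function concrete, here is a minimal NumPy sketch that evaluates J for a tiny hand-made dataset. The function name `kmeans_cost` and the four sample points are illustrative assumptions, not part of any library.

```python
import numpy as np

def kmeans_cost(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]          # row-wise offset to the assigned centroid
    return float(np.sum(diffs ** 2))

# Hypothetical toy data: two points per cluster.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])

# Each centroid is the mean of the points assigned to it.
centroids = np.array([X[labels == i].mean(axis=0) for i in range(2)])

J = kmeans_cost(X, labels, centroids)
print(J)
```

Here the centroids are (0, 1) and (10, 2), so the squared distances are 1, 1, 4, and 4.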

The algorithm starts by selecting k initial centroids at random. It then assigns each data point to its closest centroid and recomputes each centroid as the mean of the data points assigned to it. These two steps are repeated until the centroids stop changing or a maximum number of iterations is reached. Neither step can increase the cost function J, so J decreases from one iteration to the next until the algorithm converges.

The K-Means algorithm proceeds as follows:

1. Initialize ‘k’ centroids randomly.
2. Assign each data point to the closest centroid.
3. Recompute the centroid of each cluster as the mean of its assigned points.
4. Repeat steps 2 and 3 until the algorithm converges or the maximum number of iterations is reached. Convergence is achieved when the centroids no longer change.
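The steps above can be sketched from scratch in NumPy. This is a minimal illustration under stated assumptions, not a production implementation: the function names, the tolerance-based stopping rule, the choice to initialize from randomly chosen data points, and the synthetic two-blob data are all choices made for this example.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Return, for each row of X, the index of the closest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: use k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        labels = nearest_centroid(X, centroids)
        # Step 3: recompute centroids; an empty cluster keeps its old centroid.
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop once the centroids have (almost) stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, nearest_centroid(X, centroids)

# Synthetic data: two well-separated blobs around (0, 0) and (10, 10).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(10.0, 1.0, size=(50, 2)),
])
centroids, labels = kmeans(X, k=2)
print(centroids)
```

On data this cleanly separated, the recovered centroids should land close to the two blob centers.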

Because each run only reaches a local minimum of J, the result depends on the initial centroid locations, and a poor initialization can leave the algorithm stuck in a solution that is far from the best one. It is therefore recommended to run the algorithm several times with different initial centroids and keep the run with the lowest J.
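The restart strategy can be sketched with scikit-learn by running single-initialization fits under different seeds and keeping the one with the lowest `inertia_` (scikit-learn's name for J). The synthetic data from `make_blobs` and the choice of five seeds are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four blobs (illustrative only).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

runs = []
for seed in range(5):
    # n_init=1 means exactly one random initialization per fit.
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    runs.append((km.inertia_, seed))

# Keep the run with the lowest cost J.
best_inertia, best_seed = min(runs)
print(best_seed, best_inertia)
```

In practice the `n_init` parameter of `KMeans` performs this loop internally and returns the best run, so the explicit loop is mainly useful for seeing how much the runs differ.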

It’s also worth noting that k-means works best when the clusters are roughly spherical, similar in size, and similar in density. When these assumptions are not met, the k-means algorithm might not work well.
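One way to see these assumptions at work is to compare k-means on compact blobs with k-means on two interleaved half-moons. The sketch below uses synthetic data from scikit-learn and scores agreement with the true groups using the adjusted Rand index, a metric chosen here for illustration (1.0 means perfect agreement); the helper name `ari_for` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Two compact, well-separated blobs: close to the k-means assumptions.
X_blobs, y_blobs = make_blobs(
    n_samples=300, centers=[[-5, 0], [5, 0]], cluster_std=1.0, random_state=0
)
# Two interleaved half-moons: clearly non-spherical clusters.
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

def ari_for(X, y_true):
    """Cluster X into two groups and compare with the known grouping."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    return adjusted_rand_score(y_true, labels)

ari_blobs = ari_for(X_blobs, y_blobs)
ari_moons = ari_for(X_moons, y_moons)
print("blobs:", ari_blobs, "moons:", ari_moons)
```

The blob score should be near 1.0, while the moon score should be noticeably lower because k-means cuts the moons with a straight boundary.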

Implementation in Python

To implement K-Means Clustering in Python, we will need to use some popular libraries such as NumPy, Pandas, and Scikit-learn. NumPy is used for mathematical operations, Pandas for data manipulation, and Scikit-learn for machine learning.

Before implementing K-Means Clustering, we will need to prepare our data. This includes loading the data, cleaning it, and pre-processing it. The data should be in the form of a NumPy array or a Pandas dataframe.
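A minimal sketch of such a preparation step is shown below. The column names and values are invented for illustration, and the standardization step is an assumption added here: k-means relies on Euclidean distance, so features on very different scales are commonly rescaled before clustering.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with a missing value and a non-numeric column.
raw = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46],
    "annual_spend": [1200.0, 560.0, 800.0, 4300.0, 3900.0],
    "segment_note": ["a", "b", "c", "d", "e"],
})

# Keep numeric columns only, then drop rows with missing values.
numeric = raw.select_dtypes(include="number").dropna()

# Rescale each feature to zero mean and unit variance.
X = StandardScaler().fit_transform(numeric)
print(X.shape)
```

The result `X` is a NumPy array with one row per complete record and one column per numeric feature.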

Once the data is prepared, we can implement the K-Means Clustering algorithm using the KMeans class from Scikit-learn. We will need to specify the number of clusters ‘k’ and the initialization method. We can also specify other parameters such as the maximum number of iterations, the tolerance for convergence, and the random state. The fit() method is used to fit the model to the data and the predict() method is used to predict the cluster for new data points.
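As a small illustration of `fit()` and `predict()`, the sketch below fits a model on six hand-made points and then assigns two new points to the learned clusters. The data and the parameter values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical training data: two obvious groups.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])

kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10,
                max_iter=300, tol=1e-4, random_state=0)
kmeans.fit(X)

# Assign previously unseen points to the nearest learned centroid.
new_points = np.array([[0.0, 0.0], [10.0, 10.0]])
new_labels = kmeans.predict(new_points)

print(kmeans.cluster_centers_)
print(new_labels)
```

The integer label given to each cluster is arbitrary, so it is safer to compare labels with each other than to rely on a particular cluster being numbered 0 or 1.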

To evaluate the performance of the K-Means Clustering model, we can use evaluation metrics such as the silhouette score (ranges from -1 to 1, higher is better), the Calinski-Harabasz Index (higher is better), and the Davies-Bouldin Index (lower is better, with 0 as the minimum). These metrics help us judge the quality of the clusters and compare different values of k when choosing the number of clusters.
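One common way to use such a metric for choosing k is to fit the model for a range of values and keep the one with the best score. The sketch below does this with the silhouette score on synthetic data with four well-separated groups; the data, the blob centers, and the range of k are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: four well-separated blobs.
centers = [[0, 0], [10, 10], [-10, 10], [10, -10]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The silhouette score is higher for better-defined clusters.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real data the peak is often less pronounced, so the scores are best read alongside domain knowledge rather than as a final answer.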

To visualize the clusters, we can use various plotting libraries such as Matplotlib and Seaborn. We can create scatter plots of the data points colored according to their cluster. This can help us understand the distribution of the data and the clusters.

It’s worth noting that the k-means algorithm might have problems when the data is not well-separated, or there are clusters with different densities or different sizes. In these cases, another clustering algorithm might perform better.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
import matplotlib.pyplot as plt
import seaborn as sns

# load data
data = pd.read_csv("data.csv")

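# pre-processing: keep numeric columns only (KMeans needs numeric input)
# and drop rows with missing values
data = data.select_dtypes(include="number").dropna()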

# k-means clustering
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(data)

# evaluation metrics
print("Silhouette score:", silhouette_score(data, pred_y))
print("Calinski-Harabasz Index:", calinski_harabasz_score(data, pred_y))
print("Davies-Bouldin Index:", davies_bouldin_score(data, pred_y))

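# visualizing the clusters using the first two columns, selected by position
# (data[0] would look for a column literally named 0 and usually fail)
sns.scatterplot(x=data.iloc[:, 0], y=data.iloc[:, 1], hue=pred_y)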
plt.show()

This code snippet shows how to run K-Means Clustering in Python, evaluate the result with metrics, and visualize the clusters using Matplotlib and Seaborn. It loads the data with pandas and pre-processes it. It then creates an instance of the KMeans class and calls fit_predict(), which fits the model and returns the cluster label of each data point in one step. After that, the code computes three evaluation metrics: the silhouette score, the Calinski-Harabasz Index, and the Davies-Bouldin Index. Finally, it visualizes the clusters with a scatter plot.

Please note that this is just an example, and you might need to adjust it to the structure and type of your data. The code will not run as is, because it expects a data.csv file that is not provided here.

Real-world examples

Image Segmentation is the process of dividing an image into multiple segments or regions, each corresponding to a different object or background. K-Means Clustering can be used for image segmentation by grouping similar pixels based on their color values. This can be used for applications such as object recognition, image compression, and medical image analysis.
Customer Segmentation is the process of dividing customers into different groups based on their characteristics and behavior. K-Means Clustering can be used for customer segmentation by grouping similar customers based on their demographics, purchase history, and other characteristics. This can be used for applications such as targeted marketing, personalization, and customer retention.
Anomaly Detection is the process of identifying abnormal or unusual data points. K-Means Clustering can be used for anomaly detection by identifying data points that are far from the centroid of their cluster. This can be used for applications such as fraud detection, network intrusion detection, and medical diagnosis.
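The anomaly detection idea can be sketched as follows: fit k-means, measure each point's distance to the centroid of its own cluster, and flag the points with the largest distances. The synthetic data, the single injected outlier, and the 99th-percentile cutoff are all assumptions made for this example; a real application would need a threshold chosen for its own data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic "normal" data: two tight groups of 100 points each.
normal = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(100, 2)),
])
# One injected outlier, stored last (index 200).
outlier = np.array([[2.5, 12.0]])
X = np.vstack([normal, outlier])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from each point to the centroid of its assigned cluster.
dist = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag points beyond the 99th percentile of distances (illustrative cutoff).
threshold = np.percentile(dist, 99)
anomalies = np.where(dist > threshold)[0]
print(anomalies)
```

A percentile cutoff always flags a fixed share of points, so some ordinary points near the edge of a cluster may be flagged together with the true outlier.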

These are just a few examples of how K-Means Clustering can be used in real-world applications. The algorithm is widely used in various industries such as marketing, finance, healthcare, and more. The key is to understand the data and the problem and to choose the appropriate number of clusters and evaluation metrics.
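Returning to the image segmentation example, the grouping of pixels by color can be sketched with a small synthetic image. The 20x30 two-color image below is generated in code purely for illustration; with a real image, the pixel array would come from an image-loading library instead.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic RGB image: left half reddish, right half bluish, plus slight noise.
image = np.zeros((20, 30, 3), dtype=float)
image[:, :15] = [0.9, 0.1, 0.1]
image[:, 15:] = [0.1, 0.1, 0.9]
image += rng.normal(scale=0.03, size=image.shape)
image = np.clip(image, 0.0, 1.0)

# Treat every pixel as a 3-dimensional point (R, G, B) and cluster the colors.
pixels = image.reshape(-1, 3)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)

# Reshape the labels back into the image grid to get the segment map.
segment_map = kmeans.labels_.reshape(image.shape[:2])

# Replace each pixel by the mean color of its segment.
segmented = kmeans.cluster_centers_[segment_map]
print(segment_map.shape, segmented.shape)
```

Because only color is clustered, pixels with similar colors end up in the same segment even if they are far apart in the image.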

Conclusion

K-Means Clustering is an unsupervised machine learning algorithm that divides a dataset into ‘k’ clusters by minimizing the squared distance between each data point and the centroid of its cluster. The algorithm is easy to implement, computationally efficient, and widely used in applications such as image segmentation, customer segmentation, and anomaly detection.

While K-Means Clustering is widely used and efficient, it does have some limitations. The algorithm assumes that the clusters are spherical, equally sized and have similar densities. When these assumptions are not met, the algorithm might not work well. Also, the algorithm is sensitive to the initial centroid locations and can get stuck in a local minimum. Moreover, the number of clusters k should be defined prior to the clustering process, which might not be easy in practice.

Despite its limitations, K-Means Clustering is still a widely used and powerful algorithm. Future work can focus on developing more robust and flexible versions of the algorithm that can handle non-spherical, unevenly sized, and non-uniformly distributed clusters. Researchers also continue to develop new techniques to determine the optimal number of clusters k, and to improve the initialization process to avoid getting stuck in local minima. Additionally, there are many other clustering algorithms that can be used in different scenarios, and it’s important to understand when and how to use them.

In conclusion, K-Means Clustering is a powerful unsupervised learning algorithm that can be used for various applications, but it’s important to keep in mind its assumptions and limitations, and to choose the appropriate evaluation metrics and visualization techniques.
