K-means clustering is one of the most widely used and taught clustering methods, thanks to its simplicity and fast execution. However, users face one major challenge when applying it: how to choose the number of clusters (k).
The most common heuristic for choosing k is the "elbow method": plot the approximation error (the sum of squared errors, SSE) on the y-axis against a range of k values on the x-axis. The elbow criterion rests on the idea of diminishing returns: as the number of clusters grows, the SSE keeps decreasing, but at some point adding further clusters yields only a marginal improvement. That point is called the "elbow," and it is assumed to mark the optimal number of clusters.
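To make this concrete, here is a minimal sketch of an elbow plot with scikit-learn on synthetic data (the dataset, the range of k, and all parameter values are illustrative assumptions, not part of the original post):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with a few planted clusters (purely illustrative)
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# SSE for k = 1..10; scikit-learn exposes it as `inertia_`
ks = list(range(1, 11))
sse = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("SSE (inertia)")
plt.title("Elbow plot: look for the bend, if there is one")
plt.show()
```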
However, the elbow method has several problems, and better alternatives have been known in the literature for a long time. This blog post will explain why you should stop using the elbow criterion for k-means clustering and introduce alternative methods for choosing the number of clusters.
The elbow method has several problems that make it unreliable for choosing the number of clusters: the SSE decreases monotonically as k grows, so there is no natural stopping point; on real data the curve often shows no clear bend at all, or several; and locating the elbow is ultimately a subjective visual judgment that different people resolve differently.
Fortunately, there are several better alternatives to the elbow method that have been proposed in the literature. We will introduce some of these methods below:
The gap statistic estimates the optimal number of clusters by comparing the SSE obtained on the data with the SSE obtained on reference data that has no cluster structure (for example, points drawn uniformly over the range of the data). The gap is defined as the expected log(SSE) of the reference data minus the log(SSE) of the actual data, and the optimal number of clusters is the value of k that maximizes this gap (the original formulation adds a small standard-error correction). Because it is grounded in a statistical comparison rather than a visual judgment, the gap statistic is less subjective and more reliable than the elbow method.
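As a rough illustration, here is a simplified sketch of the gap statistic, assuming a uniform reference distribution over the data's bounding box and the simple "maximize the gap" rule (the original formulation by Tibshirani et al. uses a standard-error correction, omitted here; the function name and defaults are my own):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=10, n_refs=10, random_state=42):
    """Simplified gap statistic: compare log(SSE) on the data with the
    expected log(SSE) on uniform reference data with no cluster structure."""
    rng = np.random.default_rng(random_state)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        # log(SSE) of the real data for this k
        log_w = np.log(KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X).inertia_)
        # log(SSE) of n_refs reference datasets drawn uniformly from the bounding box
        ref_log_w = [
            np.log(KMeans(n_clusters=k, n_init=10).fit(
                rng.uniform(mins, maxs, size=X.shape)).inertia_)
            for _ in range(n_refs)
        ]
        gaps.append(np.mean(ref_log_w) - log_w)  # Gap(k) = E[log W_ref] - log W_data
    # simple rule: pick the k that maximizes the gap
    return int(np.argmax(gaps)) + 1, gaps
```

Because the reference SSEs come from Monte Carlo sampling, the estimate is a bit noisy; increasing `n_refs` trades runtime for stability.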
My personal favorite, the silhouette method, estimates the optimal number of clusters from a quantitative measure of how well each data point fits its assigned cluster. For every point, a silhouette width is computed that compares the point's distance to its own cluster with its distance to the nearest neighboring cluster; it ranges from -1 to 1. The optimal number of clusters is the value of k that maximizes the average silhouette width, i.e., the k that gives the best separation between clusters and cohesion within them. It is easy to understand and interpret, and it provides more reliable results than the elbow method.
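In scikit-learn this amounts to a few lines; the sketch below assumes k-means as the clusterer and simply returns the k with the highest mean silhouette width (the helper name and parameter defaults are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_max=10, random_state=42):
    """Return the k (>= 2) with the highest average silhouette width,
    together with the score for each candidate k."""
    scores = {}
    for k in range(2, k_max + 1):  # the silhouette needs at least 2 clusters
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # mean silhouette width, in [-1, 1]
    return max(scores, key=scores.get), scores
```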
Model-based clustering groups data points into clusters using a probabilistic model, typically a Gaussian mixture model. The algorithm assumes a certain number of clusters and estimates the model parameters with the Expectation-Maximization (EM) algorithm: the E-step assigns data points to clusters based on the current parameters, and the M-step updates the parameters to maximize the likelihood of the observed data. To estimate the optimal number of clusters, models with different numbers of components are fitted and compared using a criterion such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC), which balance goodness of fit against model complexity (the raw likelihood alone is not enough, since it always improves as components are added). The number of components of the model with the lowest BIC or AIC is chosen as the optimal number of clusters.
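Here is a minimal sketch of this selection procedure with scikit-learn's GaussianMixture, using BIC as the criterion (AIC would work the same way via `gmm.aic(X)`; the helper name and defaults are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def best_k_by_bic(X, k_max=10, random_state=42):
    """Fit Gaussian mixture models with 1..k_max components via EM and
    return the number of components with the lowest BIC."""
    bics = []
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, random_state=random_state).fit(X)
        bics.append(gmm.bic(X))  # lower BIC = better fit/complexity trade-off
    return int(np.argmin(bics)) + 1, bics
```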
The elbow method is a common heuristic for choosing the number of clusters in k-means clustering, but it has several problems, and better alternatives have long been available in the literature. The gap statistic, the silhouette method, and model-based clustering all provide more reliable results. As a data scientist, keep in mind that the optimal number of clusters is not always definitive: it should be interpreted in the context of the specific problem and the characteristics of the data, and in some cases domain knowledge or additional validation techniques may be needed to confirm the chosen number of clusters.
Feel free to contact me with any questions :)
I'll get back to you as soon as possible.
🟥⬜🟥