AI/ML News & Innovations Hub

The problem of deciding the number of clusters is a difficult one and arguably not well defined. Personally I don't believe any formal method blindly and will always take data visualisation and background information into accound (meaning of the data, what will the clustering be used for).

The "elbow method" isn't a well defined method, and I'm rather sceptical about it. Furthermore, as far as I can see, the documentation of fviz_nbclust doesn't precisely explain what method="wss" actually does, i.e., how exactly the number 4 is determined from the values you see in the plot. For me this is reason enough not to use it.

The gap statistic approach is a way to formalise and objectify the elbow method (it's actually a way to relate the WSS to what would be expected for the given data if there were no clustering). Note however that the clusGap function together with maxSE implements a bewildering number of possible versions of the original gap statistic, some of which are backed up by literature, and my personal taste is somewhat different from the clusGap-default, which already is different from what is recommended in the original paper by Tibshirani, Walther and Hastie; I use d.power=2 and method="globalSEmax" (there are reasons for this but not sure how precisely you want to know;-). That said, your second plot suggests to me that in your situation the result 2 could be stable over most if not all different versions.

Furthermore note that the outcome of kmeans depends on random initialisation, and many random starts should be used in order to have a stable result. For this reason it makes sense that you provide nstart=25 to clusGap. I don't actually know how big your dataset is, but as long as computation time is not an issue, I would use an even higher number of nstart.

Now fviz_nbclust apparently uses the kmeans default which stupidly is nstart=1. This means that the solution on which the fviz_nbclust result is based on a single random initialisation and very unstable. I suspect that for this reason the kmeans-solutions on which the first plot is based may not be the same as (worse than) those on which the second plot is based, and this may be another explanation why numbers of clusters are estimated differently. It may be (but I'm not sure) that this could be repaired by providing fviz_nbclust with nstart=25 (or bigger) as well. In any case I'm not a fan of running kmeans through fviz_nbclust, so I'm not the best source for recommendations of how exactly to use it, but the fact that not everything is precisely explained certainly doesn't help. (The fviz_nbclust examples on its help page do not set nstart when using fviz_nbclust but set it when using clusGap, which would be bad practice if it were possible to set it with fviz_nbclust, so it may well not be possible, but I haven't tried.)

Gap Statistic Versus Total Sum of Squares For Cluster Centres (r language)