In my previous post, I showed some interactive charts related to the Human Development Index (HDI). Following the UNDP (the original source for the data), the countries were grouped into four categories: Very High, High, Medium and Low.
This raises two questions: why four groups, and how does one decide which group a given country belongs to, based only on its HDI?
In 'machine learning' jargon, this is a clustering problem.
So I decided to apply the kmeans algorithm to find the clusters in this dataset.
First step: find the number of clusters. I tried values of k from 1 to 15. A good diagnostic is to calculate, for each value of k, the associated 'within-cluster sum of squares' (WSS), i.e. the sum of squared distances between each country's HDI and the mean of its cluster.
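For reference, the quantity being computed can be written as follows. The notation is my own assumption, not taken from the original analysis: C_j is the set of countries assigned to cluster j, x_i is the HDI of country i, and mu_j is the mean HDI of cluster j.

```latex
\mathrm{WSS}(k) = \sum_{j=1}^{k} \sum_{i \in C_j} (x_i - \mu_j)^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i
```

For k = 1 this is simply the total sum of squares around the overall mean, and it can only decrease as k grows, which is why one looks for the point where the decrease levels off rather than for a minimum.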
The plot below shows the graph of WSS for k from 1 to 15:
We see that, after k=4, the reduction in WSS becomes more or less linear. So k=4 seems a reasonable choice for the number of clusters.
 k   WSS
 1   4.53075954
 2   1.21305450
 3   0.55921525
 4   0.31337541
 5   0.19527297
 6   0.13297448
 7   0.09393269
 8   0.06275523
 9   0.05142409
10   0.04458469
11   0.03726298
12   0.03321763
13   0.02687930
14   0.02388013
15   0.01934118

The R code for the calculation is:
wss <- numeric(15)  # one WSS value per k
for (k in 1:15) wss[k] <- sum(kmeans(HDI, centers = k, nstart = 25)$withinss)
Notice the 'nstart=25' argument: kmeans is run from 25 random sets of initial cluster means, and the best result (lowest WSS) is kept.
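The loop above can be fleshed out into a self-contained sketch that also draws the elbow plot. Since the real data is not reproduced here, the HDI vector below is a synthetic stand-in that I generate for illustration only; its numbers are not the UNDP values, and k_max is a name I introduce.

```r
# Synthetic stand-in for the HDI column (NOT the real data): 187 made-up values.
set.seed(1)
HDI <- c(rnorm(49, 0.87, 0.03), rnorm(63, 0.74, 0.03),
         rnorm(38, 0.60, 0.03), rnorm(37, 0.44, 0.04))

k_max <- 15
wss <- numeric(k_max)          # pre-allocate the result vector
for (k in 1:k_max) {
  fit <- kmeans(HDI, centers = k, nstart = 25)
  wss[k] <- fit$tot.withinss   # same value as sum(fit$withinss)
}

# Elbow plot: look for the k after which the curve flattens out.
plot(1:k_max, wss, type = "b", xlab = "k", ylab = "WSS")
```

Using tot.withinss instead of summing withinss is only a convenience; both give the same number.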
Second step: identify the clusters. The 4 clusters have means:
0.8665102
0.7378413
0.6020000
0.4430811
and sizes: 49, 63, 38, 37, respectively.
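These means also suggest an answer to the second question, how to decide a country's group from its HDI alone. In one dimension, a converged k-means solution puts each point in the cluster with the nearest mean, so the implied cut points are the midpoints between neighbouring means. The sketch below works this out from the four means quoted above; the function nearest_cluster is a hypothetical helper of mine, not part of the original analysis.

```r
# Cluster means reported above, in decreasing order.
centers <- c(0.8665102, 0.7378413, 0.6020000, 0.4430811)

# Midpoints between neighbouring means: the implied HDI cut points.
cuts <- (head(centers, -1) + tail(centers, -1)) / 2
round(cuts, 3)   # 0.802 0.670 0.523

# Hypothetical helper: label an HDI value by its nearest cluster mean.
nearest_cluster <- function(hdi) {
  vapply(hdi, function(x) which.min(abs(x - centers)), integer(1))
}
nearest_cluster(c(0.90, 0.75, 0.50))   # 1 2 4
```

So, under this clustering, a country with an HDI above roughly 0.80 lands in the first cluster, one between roughly 0.67 and 0.80 in the second, and so on.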
Plotting the HDI against the rank for each country, and assigning colours according to these clusters, gives the next plot:
This plot was obtained with the following R command:
g <- ggplot(data = data, aes(x = Rank, y = HDI, color = cluster)) + geom_point() + guides(fill = FALSE)
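Note that the first cluster contains 49 countries, the same number as the 'Very High' group in the UNDP classification. Since both groupings rank countries by HDI alone, and a k-means cluster in one dimension is a contiguous range of values, the first cluster found by the algorithm coincides with the UNDP group.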
However, the next three clusters do not coincide with those chosen by the UNDP.
For instance, the second group in the UNDP has size = 53, while the algorithm gave a size of 63.
It would be interesting to know where these differences come from.
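One way to start looking into this is to cross-tabulate the k-means clusters against the reference categories. The sketch below does so on a tiny made-up data frame: the twelve HDI values, the column names HDI and undp, and the category labels attached to each row are all assumptions for illustration, not the real UNDP data.

```r
# Toy data (made up for illustration): 12 HDI values with hypothetical labels.
toy <- data.frame(
  HDI  = c(0.91, 0.89, 0.86, 0.78, 0.76, 0.73, 0.70, 0.63, 0.60, 0.47, 0.44, 0.40),
  undp = rep(c("Very High", "High", "Medium", "Low"), each = 3)
)
toy$undp <- factor(toy$undp, levels = c("Very High", "High", "Medium", "Low"))

set.seed(1)
fit <- kmeans(toy$HDI, centers = 4, nstart = 25)

# kmeans numbers its clusters arbitrarily; relabel them by decreasing mean
# so that cluster 1 is the highest-HDI group.
toy$cluster <- match(fit$cluster, order(fit$centers, decreasing = TRUE))

# Rows: reference category; columns: k-means cluster.
# Off-diagonal counts are disagreements between the two groupings.
print(table(toy$undp, toy$cluster))

# Rows whose cluster differs from their reference category.
print(toy[as.integer(toy$undp) != toy$cluster, ])
```

On the real data, the rows listed by the last line would be exactly the countries that the algorithm and the UNDP classify differently.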
By visual inspection, the plot above shows that the first ('Very High') and the second group ('High') are well separated, whereas the frontier between the second and the third is much less evident.
Conclusion: the choice of 4 groups, or clusters, seems appropriate, and the 'very high' group is indeed quite distinctly identified by the kmeans algorithm. On the other hand, the separation between the three other groups is more uncertain.