Chapter 10 Clustering - Hierarchical and k-Means

10.1 Hierarchical Clustering

Clustering organizes observations into groups so that observations within a group are close together and observations in different groups are far apart. Closeness is measured by a distance between observations, the most common measures being Euclidean, Manhattan, and cosine. Pick the one that makes sense for your situation: for most uses, Euclidean distance (often the default) does a great job, but cosine distance is occasionally more useful when doing natural language analysis.

The dist function in R calculates distances between the rows of any numeric data frame or matrix; the default method is Euclidean. If the data frame contains a non-numeric column, dist issues a warning and the results are not meaningful. The calculation is easy to replicate by hand: the Euclidean distance between two rows a and b is sqrt((a1 - b1)^2 + (a2 - b2)^2 + ...), summing the squared differences across all columns.
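
As a quick check, we can reproduce a dist value by hand on the built-in mtcars data (the method argument on the last line selects an alternative measure):

sqrt(sum((mtcars[1, ] - mtcars[2, ])^2)) # Euclidean distance between rows 1 and 2, by hand
as.matrix(dist(mtcars))[1, 2] # the same value taken from dist
dist(mtcars[1:3, ], method = "manhattan") # Manhattan distances via the method argument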

Hierarchical clustering builds on this distance matrix. We pass it to the hclust function to grow the cluster tree, which can then be plotted as a dendrogram using plot.

data(mtcars) # load the built-in motor cars data set
plot(hclust(dist(mtcars))) # distances -> cluster tree -> dendrogram plot

Let us now do this slowly, one step at a time.

mydist = dist(mtcars) # calculate the distance matrix
mycluster = hclust(mydist) # create the cluster tree
mydendro = as.dendrogram(mycluster) # convert to a dendrogram
plot(mydendro) # plot the dendrogram

Now let us look at mydendro.

mydendro
## 'dendrogram' with 2 branches and 32 members total, at height 425.3447
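
The height reported here is the distance at which the last two branches merge. As a quick check, assuming the mycluster and mydendro objects from above, the same value can be read directly:

max(mycluster$height) # height of the final (topmost) merge
attr(mydendro, "height") # the same value, stored on the dendrogram object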

The question is, what can you do with this? The answer is that by ‘cutting’ the tree at the right height, you can get any number of clusters or groups that you desire. The cutree function does the cutting: give it either a number of clusters k (or a vector of several values of k) or a height h, and it returns the cluster membership of each observation.

cutree(mycluster, k = 3)
##           Mazda RX4       Mazda RX4 Wag          Datsun 710 
##                   1                   1                   1 
##      Hornet 4 Drive   Hornet Sportabout             Valiant 
##                   2                   3                   2 
##          Duster 360           Merc 240D            Merc 230 
##                   3                   1                   1 
##            Merc 280           Merc 280C          Merc 450SE 
##                   1                   1                   2 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
##                   2                   2                   3 
## Lincoln Continental   Chrysler Imperial            Fiat 128 
##                   3                   3                   1 
##         Honda Civic      Toyota Corolla       Toyota Corona 
##                   1                   1                   1 
##    Dodge Challenger         AMC Javelin          Camaro Z28 
##                   2                   2                   3 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2 
##                   3                   1                   1 
##        Lotus Europa      Ford Pantera L        Ferrari Dino 
##                   1                   3                   1 
##       Maserati Bora          Volvo 142E 
##                   3                   1
cutree(mycluster, k = 2:5)
##                     2 3 4 5
## Mazda RX4           1 1 1 1
## Mazda RX4 Wag       1 1 1 1
## Datsun 710          1 1 1 1
## Hornet 4 Drive      1 2 2 2
## Hornet Sportabout   2 3 3 3
## Valiant             1 2 2 2
## Duster 360          2 3 3 3
## Merc 240D           1 1 1 1
## Merc 230            1 1 1 1
## Merc 280            1 1 1 1
## Merc 280C           1 1 1 1
## Merc 450SE          1 2 2 2
## Merc 450SL          1 2 2 2
## Merc 450SLC         1 2 2 2
## Cadillac Fleetwood  2 3 3 3
## Lincoln Continental 2 3 3 3
## Chrysler Imperial   2 3 3 3
## Fiat 128            1 1 1 4
## Honda Civic         1 1 1 4
## Toyota Corolla      1 1 1 4
## Toyota Corona       1 1 1 1
## Dodge Challenger    1 2 2 2
## AMC Javelin         1 2 2 2
## Camaro Z28          2 3 3 3
## Pontiac Firebird    2 3 3 3
## Fiat X1-9           1 1 1 4
## Porsche 914-2       1 1 1 1
## Lotus Europa        1 1 1 1
## Ford Pantera L      2 3 3 3
## Ferrari Dino        1 1 1 1
## Maserati Bora       2 3 4 5
## Volvo 142E          1 1 1 1
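
Instead of a number of clusters, cutree also accepts a height h at which to cut the tree; the 200 below is just an illustrative height read off the dendrogram:

cutree(mycluster, h = 200) # cluster memberships from cutting at height 200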

Simply displaying cluster numbers isn’t very helpful; we might also want to see how many items are in each cluster.

mytree = cutree(mycluster, k = 3)
table(mytree)
## mytree
##  1  2  3 
## 16  7  9
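
To see which cars landed in a particular cluster, we can index the row names by the cluster vector; a minimal sketch:

rownames(mtcars)[mytree == 2] # the cars assigned to cluster 2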

10.2 k-Means Clustering

k-means partitions the data into a given number of clusters using an iterative algorithm: each observation is assigned to the nearest cluster center, the centers are recomputed, and the two steps repeat until the assignments stop changing. Unlike hclust, R’s kmeans function works on the raw data using Euclidean distances rather than on a precomputed distance matrix, and we must always tell it how many clusters we want.

kmeansfit = kmeans(mtcars, centers = 3) # three is the number of clusters
kmeansfit
## K-means clustering with 3 clusters of sizes 16, 7, 9
## 
## Cluster means:
##        mpg      cyl     disp       hp     drat       wt     qsec        vs
## 1 24.50000 4.625000 122.2937  96.8750 4.002500 2.518000 18.54312 0.7500000
## 2 17.01429 7.428571 276.0571 150.7143 2.994286 3.601429 18.11857 0.2857143
## 3 14.64444 8.000000 388.2222 232.1111 3.343333 4.161556 16.40444 0.0000000
##          am     gear     carb
## 1 0.6875000 4.125000 2.437500
## 2 0.0000000 3.000000 2.142857
## 3 0.2222222 3.444444 4.000000
## 
## Clustering vector:
##           Mazda RX4       Mazda RX4 Wag          Datsun 710 
##                   1                   1                   1 
##      Hornet 4 Drive   Hornet Sportabout             Valiant 
##                   2                   3                   2 
##          Duster 360           Merc 240D            Merc 230 
##                   3                   1                   1 
##            Merc 280           Merc 280C          Merc 450SE 
##                   1                   1                   2 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
##                   2                   2                   3 
## Lincoln Continental   Chrysler Imperial            Fiat 128 
##                   3                   3                   1 
##         Honda Civic      Toyota Corolla       Toyota Corona 
##                   1                   1                   1 
##    Dodge Challenger         AMC Javelin          Camaro Z28 
##                   2                   2                   3 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2 
##                   3                   1                   1 
##        Lotus Europa      Ford Pantera L        Ferrari Dino 
##                   1                   3                   1 
##       Maserati Bora          Volvo 142E 
##                   3                   1 
## 
## Within cluster sum of squares by cluster:
## [1] 32838.00 11846.09 46659.32
##  (between_SS / total_SS =  85.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
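
The printed ratio between_SS / total_SS tells us how much of the total variation the clustering accounts for; it can be recomputed from the returned components:

kmeansfit$betweenss / kmeansfit$totss # roughly 0.853, matching the printed 85.3 %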

It is as simple as that!
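
One caveat: kmeans picks its starting centers at random, so the labels (and occasionally the clusters themselves) can change between runs. A common remedy, sketched below with an arbitrary seed and the hypothetical name kmeansfit2, is to fix the seed and request several random starts via nstart, keeping the best result:

set.seed(42) # any fixed seed makes the run reproducible
kmeansfit2 = kmeans(mtcars, centers = 3, nstart = 25) # keep the best of 25 random starts
kmeansfit2$size # observations per cluster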

You can add the cluster assignments as a column to the data frame.

mtcars$kmeanscluster = kmeansfit$cluster # attach each car's cluster label
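
With the labels attached, we can summarize the data by cluster, for example comparing average fuel economy:

aggregate(mpg ~ kmeanscluster, data = mtcars, FUN = mean) # mean mpg for each cluster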