import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.datasets as sk_data
from sklearn.cluster import KMeans

#import matplotlib as mpl
import seaborn as sns
%matplotlib inline

Hierarchical Clustering#

Today we will look at a fairly different approach to clustering.

So far, we have been thinking of clustering as finding a partition of our dataset.

That is, a set of non-overlapping clusters, in which each data item belongs to exactly one cluster.

However, in many cases, the notion of a strict partition is not as useful.

How Many Clusters?#

How many clusters would you say there are here?

X_rand, y_rand = sk_data.make_blobs(n_samples=[100, 100, 250, 70, 75, 80], centers = [[1, 2], [1.5, 1], [3, 2], [1.75, 3.25], [2, 4], [2.25, 3.25]], n_features = 2,
                          center_box = (-10.0, 10.0), cluster_std = [.2, .2, .3, .1, .15, .15], random_state = 0)
df_rand = pd.DataFrame(np.column_stack([X_rand[:, 0], X_rand[:, 1], y_rand]), columns = ['X', 'Y', 'label'])
df_rand = df_rand.astype({'label': 'int'})
df_rand['label2'] = [{0: 0, 1: 1, 2: 2, 3: 3, 4: 3, 5: 3}[x] for x in df_rand['label']]
df_rand['label3'] = [{0: 0, 1: 0, 2: 1, 3: 2, 4: 2, 5: 2}[x] for x in df_rand['label']]
# kmeans = KMeans(init = 'k-means++', n_clusters = 3, n_init = 100)
# df_rand['label'] = kmeans.fit_predict(df_rand[['X', 'Y']])
df_rand.plot('X', 'Y', kind = 'scatter', figsize = (6, 6))
plt.axis('square')
plt.axis('off');
_images/9c0398b63f0f12aafa7a38cbbc1fcd981ba6591960d407e5b7b814773b2c4291.png

Three clusters?

df_rand.plot('X', 'Y', kind = 'scatter', c = 'label3', colormap='viridis', 
                   colorbar = False, figsize = (6, 6))
plt.axis('square')
plt.axis('off');
_images/222591a9a6af11c3786ba6af466d40db815c4c7404f71ce6d84b8be52abd5025.png

Four clusters?

df_rand.plot('X', 'Y', kind = 'scatter', c = 'label2', colormap='viridis', 
                   colorbar = False, figsize = (6, 6))
plt.axis('square')
plt.axis('off');
_images/df1e6f2da488e3732fd66e9f705f618b0247dcb848d21332b713b56db9c623ea.png

Six clusters?

df_rand.plot('X', 'Y', kind = 'scatter', c = 'label', colormap='viridis', 
                   colorbar = False, figsize = (6, 6))
plt.axis('square')
plt.axis('off');
_images/93e6489b8eba9233e43184fdda1b1fb8c47cf524fa792ae0a7c92a7785f6808b.png

This dataset shows clustering on multiple scales.

To fully capture the structure in this dataset, two things are needed:

  1. Capturing the different clusterings that appear at different scales

  2. Capturing the containment relations, that is, which clusters lie within other clusters

These observations motivate the notion of hierarchical clustering.

In hierarchical clustering, we move away from the partition notion of \(k\)-means,

and instead capture a more complex arrangement that includes containment of one cluster within another.

Hierarchical Clustering#

A hierarchical clustering produces a set of nested clusters organized into a tree.

A hierarchical clustering is visualized using a dendrogram

  • A tree-like diagram that records the sequence of merges (or splits), and hence the containment relations among clusters.

_images/L08-dendrogram.png

Strengths of Hierarchical Clustering#

Hierarchical clustering has a number of advantages:

First, a hierarchical clustering encodes many different clusterings. That is, it does not itself decide on the correct number of clusters.

A clustering is obtained by “cutting” the dendrogram at some level.

This means that you can make this crucial decision yourself, by inspecting the dendrogram.

Put another way, you can obtain any desired number of clusters.

_images/L08-dendrogram-cut.png
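In scipy (which we use later in this notebook), a cut is obtained with fcluster. Here is a minimal sketch; it assumes Z is a linkage matrix as returned by scipy.cluster.hierarchy.linkage, and the threshold value 1.0 is just a placeholder.

import scipy.cluster.hierarchy as hierarchy

# Cut the dendrogram at a chosen height: merges above this distance are
# undone, and the connected pieces that remain become the clusters.
labels_by_height = hierarchy.fcluster(Z, t=1.0, criterion='distance')

# Alternatively, cut so as to obtain (at most) a desired number of clusters.
labels_by_count = hierarchy.fcluster(Z, t=4, criterion='maxclust')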

The second advantage is that the dendrogram may itself correspond to a meaningful structure, for example, a taxonomy.

_images/L08-animal-taxonomy.jpg

The third advantage is that many hierarchical clustering methods can be performed using either similarity (proximity) or dissimilarity (distance) metrics.

This can be very helpful!

(Note that techniques like \(k\)-means cannot be used with unmodified similarity metrics.)
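In practice with scipy, which expects distances, a similarity matrix can still be used by converting it to a dissimilarity first. A minimal sketch, assuming cosine similarity as the similarity and 1 - similarity as the conversion (neither is part of the original notebook):

import numpy as np
import scipy.cluster.hierarchy as hierarchy
from scipy.spatial.distance import squareform
from sklearn.metrics.pairwise import cosine_similarity

X_demo = np.random.default_rng(0).normal(size=(10, 5))   # hypothetical data
S = cosine_similarity(X_demo)            # similarity matrix, entries in [-1, 1]
D = 1.0 - S                              # convert similarity to dissimilarity
D = np.clip((D + D.T) / 2, 0, None)      # enforce symmetry and non-negativity
np.fill_diagonal(D, 0.0)                 # zero self-dissimilarity
Z_demo = hierarchy.linkage(squareform(D), method='average')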

Compared to \(k\)-means#

Another aspect of hierarchical clustering is that it can handle certain cases better than \(k\)-means.

Because of the nature of the \(k\)-means algorithm, \(k\)-means tends to produce:

  • Roughly spherical clusters

  • Clusters of approximately equal size

  • Non-overlapping clusters

In many real-world situations, clusters may not be round, they may be of unequal size, and they may overlap.

Hence we would like clustering algorithms that can work in those cases also.

Hierarchical Clustering Algorithms#

There are two main approaches to hierarchical clustering: “bottom-up” and “top-down.”

Agglomerative Clustering (“bottom-up”):

  • Start by defining each point as its own cluster

  • At each successive step, merge the two clusters that are closest to each other

  • Repeat until only one cluster is left.

Divisive Clustering (“top-down”):

  • Start with one, all-inclusive cluster

  • At each step, find the cluster split that creates the largest distance between resulting clusters

  • Repeat until each point is in its own cluster.

Agglomerative techniques are by far the more common.

The key to both of these methods is defining the distance between two clusters.

Different definitions for the inter-cluster distance yield different clusterings.

To illustrate the impact of the choice of cluster distances, we’ll focus on agglomerative clustering.
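To make the agglomerative procedure concrete, here is a toy sketch (not the scipy implementation, and far less efficient), in which the inter-cluster distance is passed in as a function so that different definitions can be plugged in:

import numpy as np
from itertools import combinations

def agglomerate(points, cluster_dist, n_clusters=1):
    # start with every point in its own cluster
    clusters = [[i] for i in range(len(points))]
    # repeatedly merge the two closest clusters until n_clusters remain
    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: cluster_dist(points[clusters[pair[0]]],
                                                 points[clusters[pair[1]]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Example cluster distance: single linkage (closest pair of points)
def single_link(A, B):
    return min(np.linalg.norm(x - y) for x in A for y in B)

# e.g., agglomerate(X, single_link, n_clusters=2) for a 2-D array X

Different choices of cluster_dist correspond to the linkage criteria defined next.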

Defining Cluster Proximity#

Given two clusters, how do we define the distance between them?

Here are three natural ways to do it:

  • Single-Linkage: the distance between two clusters is the distance between the closest two points that are in different clusters.

\[ D_\text{single}(i,j) = \min_{x, y}\{d(x, y) \,|\, x \in C_i, y \in C_j\}\]
  • Complete-Linkage: the distance between two clusters is the distance between the farthest two points that are in different clusters.

\[ D_\text{complete}(i,j) = \max_{x, y}\{d(x, y) \,|\, x \in C_i, y \in C_j\}\]
  • Average-Linkage: the distance between two clusters is the average distance between all pairs of points from different clusters.

\[ D_\text{average}(i,j) = \frac{1}{|C_i|\cdot|C_j|}\sum_{x \in C_i,\, y \in C_j}d(x, y)\]
_images/L08-hierarchical-criteria.png
Single-Linkage
Complete-Linkage
Average-Linkage

Notice that it is easy to express the definitions above in terms of similarity instead of distance.
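As a quick sketch of these three definitions in code (using two small hypothetical clusters, not the points below), all three linkage distances can be read off the matrix of pairwise distances between the clusters:

import numpy as np
from scipy.spatial.distance import cdist

C_i = np.array([[0.0, 0.0], [0.1, 0.2]])
C_j = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])

d = cdist(C_i, C_j)          # d[k, l] = distance between the k-th point of C_i
                             # and the l-th point of C_j

D_single   = d.min()         # closest pair across the two clusters
D_complete = d.max()         # farthest pair across the two clusters
D_average  = d.mean()        # average over all |C_i| x |C_j| pairs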

Here is a set of 6 points that we will cluster to show the differences between these linkage criteria.

pt_x = [0.4, 0.22, 0.35, 0.26, 0.08, 0.45]
pt_y = [0.53, 0.38, 0.32, 0.19, 0.41, 0.30]
plt.plot(pt_x, pt_y, 'o', markersize = 10, color = 'k')
plt.ylim([.15, .60])
plt.xlim([0.05, 0.70])
for i in range(6):
    plt.annotate(f'{i}', (pt_x[i]+0.02, pt_y[i]-0.01), fontsize = 12)
plt.axis('off')
plt.savefig('figs/L08-basic-pointset.png');
_images/42885deea5f9f3558da50d9e50b97eeb43f3deb0ac8bd3af8f042980a0fdef97.png
X = np.array([pt_x, pt_y]).T
from scipy.spatial import distance_matrix
labels = ['p0', 'p1', 'p2', 'p3', 'p4', 'p5']
D = pd.DataFrame(distance_matrix(X, X), index = labels, columns = labels)
D.style.format('{:.2f}')
  p0 p1 p2 p3 p4 p5
p0 0.00 0.23 0.22 0.37 0.34 0.24
p1 0.23 0.00 0.14 0.19 0.14 0.24
p2 0.22 0.14 0.00 0.16 0.28 0.10
p3 0.37 0.19 0.16 0.00 0.28 0.22
p4 0.34 0.14 0.28 0.28 0.00 0.39
p5 0.24 0.24 0.10 0.22 0.39 0.00

Single-Linkage Clustering#

_images/L08-singlelink-pointset.png
import scipy.cluster
import scipy.cluster.hierarchy as hierarchy
Z = hierarchy.linkage(X, method='single')
hierarchy.dendrogram(Z);
_images/f73b9ec14a9d29e52859e1ee4b27e63be0b34b46fceeff6906bc8867bd13f751.png
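(One detail worth noting: each row of the linkage matrix Z computed above records a single merge: the indices of the two clusters joined, the linkage distance at which they merge, and the number of points in the resulting cluster. The original points are clusters 0 through 5; newly merged clusters receive indices 6, 7, and so on.)

# Display the merges recorded in Z (indices, merge distance, cluster size)
pd.DataFrame(Z, columns=['cluster 1', 'cluster 2', 'distance', 'size'])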

Advantages:

  • Single-linkage clustering can handle non-elliptical shapes.

In fact it can produce long, elongated clusters:

X_moon_05, y_moon_05 = sk_data.make_moons(random_state = 0, noise = 0.05)
Z = hierarchy.linkage(X_moon_05, method='single')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_05[:,0], X_moon_05[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Single-Linkage Can Find Irregularly Shaped Clusters')
plt.axis('off');
_images/5c54a72b8028bfd950ab1b7aec9bd6c336b2abaecab9269d6c18a25ecc0c4824.png
X_rand_lo, y_rand_lo = sk_data.make_blobs(n_samples=[20, 200], centers = [[1, 1], [3, 1]], n_features = 2,
                          center_box = (-10.0, 10.0), cluster_std = [.1, .5], random_state = 0)
Z = hierarchy.linkage(X_rand_lo, method='single')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_rand_lo[:,0], X_rand_lo[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Single-Linkage Can Find Different-Sized Clusters')
plt.axis('off');
_images/dbfbe3787f66b1d39ecbbfce49861b09c58553efefb3defcbf25f5e1f07cd7e9.png

Disadvantages:

  • Single-linkage clustering can be sensitive to noise and outliers.

X_moon_10, y_moon_10 = sk_data.make_moons(random_state = 0, noise = 0.1)
Z = hierarchy.linkage(X_moon_10, method='single')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_10[:,0], X_moon_10[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Single-Linkage Clustering Changes Drastically on Slightly More Noisy Data')
plt.axis('off');
_images/8177d4d014be8d71e34fa97cea7edcde4d88eb2ab70feb895575d1e0f1d4ef31.png
X_rand_hi, y_rand_hi = sk_data.make_blobs(n_samples=[20, 200], centers = [[1, 1], [3, 1]], n_features = 2,
                          center_box = (-10.0, 10.0), cluster_std = [.15, .6], random_state = 0)
Z = hierarchy.linkage(X_rand_hi, method='single')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.title('Single-Linkage Clustering Changes Drastically on Slightly More Noisy Data')
plt.scatter(X_rand_hi[:,0], X_rand_hi[:,1], c = [['b','g'][i-1] for i in labels])
plt.axis('off');
_images/97b1bda4937a3620bf7566d45fba93264a00bf27f59f8d5d698509de2b9d6c43.png

Complete-Linkage Clustering#

_images/L08-completelink-pointset.png
Z = hierarchy.linkage(X, method='complete')
hierarchy.dendrogram(Z);
_images/2a413637e199d96fc98661172f01c31871c8201651f6b5657c2fd62690d8306a.png

Advantages:

  • Produces more balanced clusters, with more nearly equal diameters

X_moon_05, y_moon_05 = sk_data.make_moons(random_state = 0, noise = 0.05)
Z = hierarchy.linkage(X_moon_05, method='complete')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_05[:,0], X_moon_05[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Complete-Linkage Seeks Globular Clusters of Similar Size')
plt.axis('off');
_images/0c7d96a7a27d13462d09bc5a2048ed1bf1132965711a512565d7d724f640a392.png
Z = hierarchy.linkage(X_rand_hi, method='complete')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_rand_hi[:,0], X_rand_hi[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Complete-Linkage Seeks Globular Clusters of Similar Size')
plt.axis('off');
_images/aebc41732ee6a0793c9fa9e97411288859d50d05eda4c6b0bdc093b2d2a58cac.png

Less susceptible to noise:

Z = hierarchy.linkage(X_moon_10, method='complete')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_10[:,0], X_moon_10[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Complete-Linkage Clustering of Noisy Data similar to Less Noisy')
plt.axis('off');
_images/ca025323947578a439702724d255aefc1f7339645f733a43cae6ae0dde92af3c.png

Average-Linkage Clustering#

_images/L08-averagelink-pointset.png
Z = hierarchy.linkage(X, method='average')
hierarchy.dendrogram(Z);
_images/5b864045cfd4623bf8a8895b6fe8e04827de2eee383556b94a4b2ffae78a68a2.png

Average-linkage clustering is in some sense a compromise between single-linkage and complete-linkage clustering.

Strengths:

  • Less susceptible to noise and outliers

Limitations:

  • Biased toward elliptical clusters

Z = hierarchy.linkage(X_moon_10, method='average')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_moon_10[:,0], X_moon_10[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Average-Linkage Similar to Complete - Globular Clusters')
plt.axis('off');
_images/64c8594c039b5c8d4305db07c9aceb04de45b2625d4a1c9531f961c69539f281.png
Z = hierarchy.linkage(X_rand_hi, method='average')
labels = hierarchy.fcluster(Z, 2, criterion = 'maxclust')
plt.scatter(X_rand_hi[:,0], X_rand_hi[:,1], c = [['b','g'][i-1] for i in labels])
plt.title('Average-Linkage More resistant to noise than Single-Linkage')
plt.axis('off');
_images/ad1adae9f68c4a48e27f1f93f133057441d9615e01eb2474c8253dbfca45d8c3.png

All Three Compared#

(Side-by-side comparison: Single-Linkage, Complete-Linkage, Average-Linkage.)

Ward’s Distance#

Finally, we consider one more cluster distance.

Ward’s distance asks “what if”.

That is: “What if we merged these two clusters; how much would the quality of the clustering change?”

To measure the quality of a clustering, we appeal to the \(k\)-means criterion: the within-cluster sum of squares.

So:

Ward’s Distance between clusters \(C_i\) and \(C_j\) is the increase in total within-cluster sum of squares that results from merging the two clusters into a new cluster \(C_{i+j}\), compared to keeping them separate:

\[D_\text{Ward}(i, j) = \sum_{x \in C_{i+j}} (x - r_{i+j})^2 - \left( \sum_{x \in C_i} (x - r_i)^2 + \sum_{x \in C_j} (x - r_j)^2 \right) \]

where \(r_i, r_j, r_{i+j}\) are the corresponding cluster centroids.

In a sense, this cluster distance results in a hierarchical analog of \(k\)-means.

As a result, it has properties similar to \(k\)-means:

  • Less susceptible to noise and outliers

  • Biased toward elliptical clusters

In practice, it therefore tends to behave similarly to group-average (average-linkage) hierarchical clustering.
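As a small numerical check of this definition (a sketch using two hypothetical clusters, not data from this notebook):

import numpy as np

def sse(C):
    # within-cluster sum of squared distances to the centroid
    return ((C - C.mean(axis=0)) ** 2).sum()

C_i = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]])
C_j = np.array([[1.0, 1.0], [1.1, 0.8]])
C_ij = np.vstack([C_i, C_j])

# Ward's distance: the increase in within-cluster sum of squares from merging
D_ward = sse(C_ij) - (sse(C_i) + sse(C_j))

# A known equivalent closed form: scaled squared distance between centroids
n_i, n_j = len(C_i), len(C_j)
D_check = (n_i * n_j / (n_i + n_j)) * ((C_i.mean(axis=0) - C_j.mean(axis=0)) ** 2).sum()

print(D_ward, D_check)   # both are approximately 1.788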

Hierarchical Clustering In Practice#

Now we’ll look at doing hierarchical clustering in practice, using Python.

We’ll use the same synthetic data as we did in the \(k\)-means case, i.e., three “blobs” living in 30 dimensions.

X, y = sk_data.make_blobs(n_samples=100, centers=3, n_features=30,
                          center_box=(-10.0, 10.0),random_state=0)

As a reminder of the raw data here is the visualization: first the raw data, then an embedding into 2-D (using MDS).

sns.heatmap(X, xticklabels=False, yticklabels=False, linewidths=0,cbar=False);
_images/be370f3f53ae5c2cd0f7ea39f0cec21642cd737681128baa36de0cd4265b6412.png
import sklearn.manifold
import sklearn.metrics as metrics
euclidean_dists = metrics.euclidean_distances(X)
mds = sklearn.manifold.MDS(n_components = 2, max_iter = 3000, eps = 1e-9, random_state = 0,
                   dissimilarity = "precomputed", n_jobs = 1)
fit = mds.fit(euclidean_dists)
pos = fit.embedding_
plt.axis('equal')
plt.scatter(pos[:, 0], pos[:, 1], s = 8);
_images/c26283a1c394d2f7a1080cf9355988a69656e815e38853b0ee89f19426622d63.png

Hierarchical clustering is available in sklearn, but the scipy package provides a much more fully developed set of tools, so that is the one we will use.

import scipy.cluster
import scipy.cluster.hierarchy as hierarchy
import scipy.spatial.distance

# linkages = ['single','complete','average','weighted','ward']
Z = hierarchy.linkage(X, method = 'single')
R = hierarchy.dendrogram(Z)
_images/ee148f74dc55564a047e727e369eb92b0470655b1dffde64214e372f420ae2be.png
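The commented-out list above hints that linkage accepts several methods. As a small sketch (not part of the original notebook), the methods can be compared side by side on the same data:

# Compare dendrograms produced by different linkage methods on the same data X
methods = ['single', 'complete', 'average', 'ward']
fig, axes = plt.subplots(1, len(methods), figsize=(16, 3))
for ax, method in zip(axes, methods):
    Z_m = hierarchy.linkage(X, method=method)
    hierarchy.dendrogram(Z_m, ax=ax, no_labels=True)
    ax.set_title(method)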

Hierarchical Clustering Real Data#

Once again we’ll use the “20 Newsgroup” data provided as example data in sklearn.

(http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).

from sklearn.datasets import fetch_20newsgroups
categories = ['comp.os.ms-windows.misc', 'sci.space','rec.sport.baseball']
news_data = fetch_20newsgroups(subset = 'train', categories = categories)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', min_df = 4, max_df = 0.8)
data = vectorizer.fit_transform(news_data.data).todense()
data.shape
(1781, 9409)
# metrics can be ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, 
# ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, 
# ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, 
# ‘sqeuclidean’, ‘yule’.
Z_20ng = hierarchy.linkage(data, method = 'ward', metric = 'euclidean')
plt.figure(figsize=(14,4))
R_20ng = hierarchy.dendrogram(Z_20ng, p=4, truncate_mode = 'level', show_leaf_counts=True)
_images/ddb6121dfe41e06780c2995041630532e43da9055fb7c24b50e1f37537b8cbfb.png

Selecting the Number of Clusters#

clusters = hierarchy.fcluster(Z_20ng, 3, criterion = 'maxclust')
print(clusters.shape)
clusters
(1781,)
array([3, 3, 3, ..., 1, 3, 1], dtype=int32)
max_clusters = 20
s = np.zeros(max_clusters+1)
for k in range(2, max_clusters+1):
    clusters = hierarchy.fcluster(Z_20ng, k, criterion = 'maxclust')
    s[k] = metrics.silhouette_score(np.asarray(data), clusters, metric = 'euclidean')
plt.plot(range(2, len(s)), s[2:], '.-')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score');
_images/35eb011731afbc6ffb4b898bb70a2c093593485fd769c2f1edc179cb3b8e180b.png
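To read off the best-scoring number of clusters programmatically (a small addition to the code above):

# s[2:] holds the silhouette scores for k = 2 .. max_clusters
best_k = 2 + int(np.argmax(s[2:]))
print(f'Best k by silhouette score: {best_k}')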
print('Top Terms Per Cluster:')
k = 5
clusters = hierarchy.fcluster(Z_20ng, k, criterion = 'maxclust')
for i in range(1,k+1):
    items = np.array([item for item,clust in zip(data, clusters) if clust == i])
    centroids = np.squeeze(items).mean(axis = 0)
    asc_order_centroids = centroids.argsort()#[:, ::-1]
    order_centroids = asc_order_centroids[::-1]
    terms = vectorizer.get_feature_names_out()
    print(f'Cluster {i}:')
    for ind in order_centroids[:10]:
        print(f' {terms[ind]}')
    print('')
Top Terms Per Cluster:
Cluster 1:
 space
 nasa
 edu
 henry
 gov
 alaska
 access
 com
 moon
 digex

Cluster 2:
 ax
 max
 b8f
 g9v
 a86
 145
 1d9
 pl
 2di
 0t

Cluster 3:
 edu
 com
 year
 baseball
 article
 writes
 cs
 team
 game
 university

Cluster 4:
 risc
 instruction
 ghhwang
 csie
 set
 nctu
 cisc
 tw
 reduced
 mq

Cluster 5:
 windows
 edu
 file
 dos
 com
 files
 card
 drivers
 driver
 use