Hide code cell source
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import matplotlib as mp
import sklearn
import networkx as nx
from IPython.display import Image, HTML

import laUtilities as ut

%matplotlib inline

Network Centrality and Clustering#

Now we turn to two important concepts in the analysis of networks:

  • Important nodes, and

  • Important groups of nodes.

The question of important nodes leads to the notion of centrality,

and the question of important groups of nodes leads to the notion of clustering.

In both cases, we will draw connections between graphs and linear algebra.

Centrality#

A common goal in the analysis of networks is to understand the relative “importance” of the nodes in the network.

For example:

  • in a social network, who are the most influential individuals?

  • in the Web, which pages are more informative?

  • in a road network, which intersections are most heavily used?

The key idea is that the structure of the network should give us some information about the relative importance of the nodes in the network.

To introduce concepts here, we’ll look at a very famous dataset in the history of network analysis: Zachary’s karate club.

The back story: from 1970 to 1972 the anthropologist Wayne Zachary studied the social relationships inside a university karate club.

While he was studying the club, a factional dispute led to the club splitting in two.

The club became split between those who rallied around the club president, and those who rallied around the karate instructor.

Note

You can read the story of the Karate club here. This dataset has become so famous that it has spawned its own academic traditions.

Here’s a view of the social network of the karate club.

Hide code cell source
Gk = nx.karate_club_graph()
np.random.seed(9)
fig = plt.figure(figsize = (12, 6))
ax1 = fig.add_subplot(121)
nx.draw_networkx(Gk, ax = ax1, pos = nx.circular_layout(Gk), 
                 with_labels = False, node_color='skyblue')
plt.title('Circular Layout')
plt.axis('off')
ax2 = fig.add_subplot(122)
nx.draw_networkx(Gk, ax = ax2, pos = nx.spring_layout(Gk), 
                 with_labels = False, node_color='skyblue')
plt.title('Spring Layout')
plt.axis('off');
_images/e0cb95d71b72f574ca4c253ca2200db05991cfe34c304e889fbc34b7bc2e128d.png

Do some nodes in the network have a special role?

Are some nodes more “important” than others?

These are often termed questions of centrality (or prestige).

Today we’ll study metrics for these notions. We’ll work with a graph \(G\).

We will study three basic notions of centrality:

  1. Closeness Centrality: A central node is close to all others.

  2. Betweenness Centrality: A central node is on many paths through the network.

  3. Status Centrality: A central node is connected to other central nodes.

Closeness Centrality#

The closeness centrality of a node \(i\) is an indicator of the proximity between \(i\) and all the other nodes in the graph.

We assume that \(G\) is connected; let \(d(i,j)\) be the shortest path distance between \(i\) and \(j\) in \(G\).

(Although other dissimilarity measures could be used.)

Then the standard way of formulating closeness centrality is the reciprocal of the total distance to all other nodes.

\[ \text{closeness}(i) = \frac{1}{\sum_{j \in V} d(i,j)}.\]
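To make the formula concrete, here is a minimal from-scratch sketch of closeness centrality. It assumes the graph is connected, and it includes the factor of \(n-1\) that networkx's closeness_centrality applies, so the two computations can be compared directly.

import numpy as np
import networkx as nx

def closeness(G):
    # assumes G is connected
    n = G.number_of_nodes()
    cent = {}
    for i in G.nodes():
        dist = nx.single_source_shortest_path_length(G, i)   # BFS distances d(i, j)
        cent[i] = (n - 1) / sum(dist.values())                # (n-1) times the reciprocal of total distance
    return cent

Gk = nx.karate_club_graph()
print(np.allclose(list(closeness(Gk).values()),
                  list(nx.closeness_centrality(Gk).values())))   # True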
Hide code cell source
Gk = nx.karate_club_graph()
cent = list(nx.closeness_centrality(Gk).values())
np.random.seed(9)
fig = plt.figure(figsize = (12, 6))
ax1 = fig.add_subplot(121)
nx.draw_networkx(Gk, ax = ax1, pos = nx.circular_layout(Gk), 
                 node_color = cent,
                 cmap = plt.cm.plasma,
                 with_labels = False)
plt.title('Closeness Centrality')
plt.axis('off')
ax2 = fig.add_subplot(122)
nx.draw_networkx(Gk, ax = ax2, pos = nx.spring_layout(Gk), 
                 node_color = cent,
                 cmap = plt.cm.plasma,
                 with_labels = False)
plt.title('Closeness Centrality')
plt.axis('off');
_images/a07fc8a58b82ff402cc8abf7ff5999996240bf5097f16f5792cb7539a3057af1.png

In this graph, most nodes are close to most other nodes.

However, we can see that some nodes are slightly more central than others.

Hide code cell source
plt.figure(figsize = (6, 4))
plt.hist(cent, bins=np.linspace(0, 1, 30))
plt.xlabel('Closeness Centrality', size = 14)
plt.ylabel('Number of Nodes', size = 14)
plt.title('Distribution of Closeness Centrality', size = 16);
_images/0193aa55e64961095d44e5a73161a5069b811c6fa48c49d6071bc95d8817933a.png

Betweenness Centrality#

Another way to think about centrality is “is the node on many paths?”

If we picture the network as a conduit for information, then betweenness captures how important a node is to the communication process (or “how much” information passes through the node).

First, let’s consider the case in which there is only one shortest path between any pair of nodes.

Then, the betweenness centrality of node \(i\) is the number of shortest paths that pass through \(i\).

Mathematically:

\[\begin{split} \text{betweenness}(i) = \sum_{i \neq j \neq k \in V} \left\{\begin{array}{ll}1&\text{if path from }j\text{ to }k\text{ goes through }i\\0&\text{otherwise}\end{array}\right. \end{split}\]

We can convert this to a value between 0 and 1 by dividing by the number of pairs of other nodes, \({n-1 \choose 2} = (n-1)(n-2)/2\) (this is the normalization used by networkx).

Now, in a general graph, there may be multiple shortest paths between \(j\) and \(k\).

To handle this, we define:

  • \(\sigma(i \mid j,k)\) is the number of shortest paths between \(j\) and \(k\) that pass through \(i\), and

  • \(\sigma(j,k)\) is the total number of shortest paths between \(j\) and \(k\).

Then we define the dependency of \(i\) on the paths between \(j\) and \(k\):

\[ \text{dependency}(i \mid j,k) = \frac{\sigma(i \mid j,k )}{\sigma(j,k)} \]

You can think of this as “the probability that a shortest path between \(j\) and \(k\) goes through \(i\).”

And finally:

\[ \text{betweenness}(i) = \sum_{i \neq j \neq k \in V} \text{dependency}(i \mid j, k) \]
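Here is a brute-force sketch that follows this definition literally, enumerating all shortest paths between every pair of nodes. It is far too slow for large graphs (networkx uses Brandes' algorithm instead), but on the 34-node karate club graph it should reproduce networkx's normalized betweenness values.

import itertools
import numpy as np
import networkx as nx

Gk = nx.karate_club_graph()
n = Gk.number_of_nodes()
bet = dict.fromkeys(Gk.nodes(), 0.0)

for j, k in itertools.combinations(Gk.nodes(), 2):
    paths = list(nx.all_shortest_paths(Gk, j, k))
    sigma_jk = len(paths)                      # total number of shortest paths between j and k
    for path in paths:
        for i in path[1:-1]:                   # interior nodes only, so i != j and i != k
            bet[i] += 1.0 / sigma_jk           # accumulate dependency(i | j, k)

# normalize by the number of pairs not including i
bet = {i: b / ((n - 1) * (n - 2) / 2) for i, b in bet.items()}

print(np.allclose(list(bet.values()),
                  list(nx.betweenness_centrality(Gk).values())))   # True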
Hide code cell source
Gk = nx.karate_club_graph()
cent = list(nx.betweenness_centrality(Gk).values())
np.random.seed(9)
fig = plt.figure(figsize = (12, 6))
ax1 = fig.add_subplot(121)
nx.draw_networkx(Gk, ax = ax1, pos = nx.circular_layout(Gk), 
                 node_color = cent,
                 cmap = plt.cm.plasma,
                 with_labels = False)
plt.title('Betweenness Centrality')
plt.axis('off')
ax2 = fig.add_subplot(122)
nx.draw_networkx(Gk, ax = ax2, pos = nx.spring_layout(Gk), 
                 node_color = cent,
                 cmap = plt.cm.plasma,
                 with_labels = False)
plt.title('Betweenness Centrality')
plt.axis('off');
_images/16dac42ecd5b3f280dbdee7047e83046a1ab317195f9544f63211f4969525c34.png

We start to see with this metric the “importance” of two or three key members of the karate club.

Note that many nodes will have a betweenness centrality of zero – no shortest paths go through them.

Hide code cell source
plt.figure(figsize = (6, 4))
plt.hist(cent, bins=np.linspace(0, 1, 30))
plt.xlabel('Betweenness Centrality', size = 14)
plt.ylabel('Number of Nodes', size = 14)
plt.title('Distribution of Betweenness Centrality', size = 16);
_images/b0dd09718df4d74c9bdf8adc27524c342ab698223c81b49031a79005bff655c4.png

Adjacency Matrices#

To define the next centrality, we need to start thinking about graphs as matrices.

Given an \(n\)-node undirected graph \(G = (V, E)\), its adjacency matrix \(A\) is defined as:

\[\begin{split} A_{ij} = \left\{ \begin{array}{ll}1 & \text{if }(i, j) \in E\\ 0 & \text{otherwise} \\ \end{array}\right. \end{split}\]

Here is what the adjacency matrix of the karate club graph looks like:

Hide code cell source
from PIL import Image, ImageFont, ImageDraw
from contextlib import contextmanager

@contextmanager
def show_complete_array():
    oldoptions = np.get_printoptions()
    np.set_printoptions(threshold = np.inf)
    np.set_printoptions(linewidth = 200)
    try:
        yield
    finally:
        np.set_printoptions(**oldoptions)
        
A = nx.adjacency_matrix(Gk).astype('int').todense()
with show_complete_array():
    img = Image.new('RGB', (440, 530), color = (255,255,255))
    #fnt = ImageFont.truetype("Pillow/Tests/fonts/FreeMono.ttf", 30)
    ImageDraw.Draw(img).text((0,0), str(A), fill=(0,0,0))
img
_images/07c47ebd264d9707491763a9f13a21d3430a090e24c54c9e451b9cf29b191d9d.png

An important way to think about adjacency matrices: column \(j\) holds \(j\)’s neighbors.

Note that the adjacency matrix has some important algebraic properties:

  • It is nonnegative, and

  • It is symmetric.
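Here is a quick sketch that checks these properties, and the “column \(j\) holds \(j\)’s neighbors” view, on the karate club graph:

import numpy as np
import networkx as nx

Gk = nx.karate_club_graph()
A = nx.to_numpy_array(Gk)                # dense adjacency matrix

print((A >= 0).all())                    # nonnegative: True
print(np.array_equal(A, A.T))            # symmetric:   True

# column 0 holds node 0's neighbors
print(np.flatnonzero(A[:, 0]))
print(sorted(Gk.neighbors(0)))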

Status Centrality#

The third notion of centrality is more subtle.

In this context, one often talks of “prestige” rather than “centrality.” But the concepts are related.

The idea of status centrality is that “high status” nodes are those that are connected to “high status” nodes.

If you think this definition is circular, you are right!

Nonetheless, it leads to some metrics that are quite well defined and not hard to compute.

Let’s make this definition more precise: the centrality of a node is proportional to the sum of the centrality of its neighbors.

Then we would say that:

\[ \mathbf{e}_i = 1/\lambda \sum_{(i, j) \in E} \mathbf{e}_j \]

This encodes our recursive definition for \(\mathbf{e}_i\), based on a calculation that may or may not converge, depending on \(\lambda.\)

Now, we can write this equation in terms of the adjacency matrix \(A\):

\[ \mathbf{e}_i = 1/\lambda \sum_j A_{ij} \mathbf{e}_j \]

Notice that this is just the definition of a matrix-vector multiplication.

So we can write even more concisely:

\[ A\mathbf{e} = \lambda\mathbf{e} \]

… which means that \(\mathbf{e}\) must be an eigenvector of \(A\).

For this reason, status centrality is often called eigenvector centrality.

Which eigenvector of \(A\) should we choose?

Let’s assume that the graph \(G\) is connected.

Then there is a good argument for choosing the eigenvector corresponding to the largest eigenvalue of \(A\): in that case, all of the entries in \(\mathbf{e}\) will be nonnegative.

(This fact comes from the Perron-Frobenius theorem, which tells us that if \(G\) is connected, then the largest eigenvalue of \(A\) is positive and the corresponding eigenvector is nonnegative.)
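As a sketch, we can compute this leading eigenvector directly from the adjacency matrix and compare it with networkx's eigenvector_centrality (both are scaled to unit Euclidean norm, so they should agree up to the solver's tolerance):

import numpy as np
import networkx as nx

Gk = nx.karate_club_graph()
A = nx.to_numpy_array(Gk)

w, V = np.linalg.eigh(A)                 # A is symmetric, so eigh applies
e = np.abs(V[:, np.argmax(w)])           # eigenvector of the largest eigenvalue, taken nonnegative

nx_e = np.array(list(nx.eigenvector_centrality(Gk, max_iter=1000).values()))
print(np.allclose(e, nx_e, atol=1e-3))   # True, up to iteration tolerance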

Hide code cell source
Gk = nx.karate_club_graph()
cent = list(nx.eigenvector_centrality(Gk).values())
np.random.seed(9)
fig = plt.figure(figsize = (12, 6))
ax1 = fig.add_subplot(121)
nx.draw_networkx(Gk, ax = ax1, pos = nx.circular_layout(Gk), 
                 node_color = cent,
                 cmap = plt.cm.plasma,
                 with_labels = False)
plt.title('Eigenvector Centrality')
plt.axis('off')
ax2 = fig.add_subplot(122)
nx.draw_networkx(Gk, ax = ax2, pos = nx.spring_layout(Gk), 
                 node_color = cent,
                 cmap = plt.cm.plasma,
                 with_labels = False)
plt.title('Eigenvector Centrality')
plt.axis('off');
_images/3dae0afaa79d8b8563ae8039d08beb7815277b5b439da0efea7bec17a640cb54.png
Hide code cell source
plt.figure(figsize = (6, 4))
plt.hist(cent, bins=np.linspace(0, 1, 30))
plt.xlabel('Eigenvector Centrality', size = 14)
plt.ylabel('Number of Nodes', size = 14)
plt.title('Distribution of Eigenvector Centrality', size = 16);
_images/e1ee294d2374d18e095606cebf773c8a1fd956622979c50cf68bac4cb753120b.png

As a more detailed example, we can ask whether this definition of “prestige” applies to NCAA Division 1A football teams, based on who they play against.

(Remember this data is from 2000!)

Hide code cell source
# data from http://www-personal.umich.edu/~mejn/netdata/
football = nx.readwrite.gml.read_gml('data/football.gml')
ec = nx.eigenvector_centrality(football)
cent = [ec[i] for i in football.nodes()]
np.random.seed(1)
fig = plt.figure(figsize = (14, 14))
nx.draw_networkx(football, 
                 pos = nx.spring_layout(football), 
                 node_color = cent,
                 cmap = plt.cm.plasma)
plt.title('Eigenvector Centrality on Football Network')
plt.axis('off');
_images/05075d849e497f1d96c9e5da4f15a9d62219f4407c8589be6c676a4dff8e3d62.png

So let’s compare the three versions of centrality we’ve looked at:

Hide code cell source
Gk = nx.karate_club_graph()
fn = [nx.closeness_centrality, nx.betweenness_centrality, nx.eigenvector_centrality]
title = ['Closeness Centrality', 'Betweenness Centrality', 'Eigenvector Centrality']
#
fig, axs = plt.subplots(2, 3, figsize = (14, 8))
for i in range(3):
    cent = list(fn[i](Gk).values())
    np.random.seed(9)
    nx.draw_networkx(Gk, ax = axs[0, i], 
                 pos = nx.spring_layout(Gk), 
                 node_color = cent,
                 cmap = plt.cm.plasma,
                 with_labels = False)
    axs[0, i].set_title(title[i], size = 14)
    axs[0, i].axis('off')
    #
    axs[1, i].hist(cent, bins=np.linspace(0, 1, 30))
    axs[1, i].set_ylim([0, 27])
    axs[1, i].set_xlabel(title[i], size = 14)
axs[1, 0].set_ylabel('Number of Nodes', size = 14);
_images/7e4709fe02292f4136605f1302a280388754e956adad6b35862c202ee4d9930f.png
Hide code cell source
# consider discussing node ranking
# this is a fun paper: http://stat.wharton.upenn.edu/~steele/Courses/956/Ranking/RankingFootballSIAM93.pdf
# cite: https://epubs.siam.org/doi/10.1137/1035004
# The Perron–Frobenius Theorem and the Ranking of Football Teams
# and there is useful data here: https://www.sports-reference.com/cfb/years/2020-schedule.html

Clustering and Partitioning#

We now turn to the question of finding important groups of nodes.

Why might we want to cluster graph nodes?

  • Assigning computations to processors in a parallel computer

  • Segmenting images (finding boundaries between objects)

  • Clustering words found together in documents, or documents with similar words

  • Divide and conquer algorithms

  • Circuit layout in VLSI

  • Community detection in social networks

Min \(s\)-\(t\) cut#

We’ll start with a problem that is fundamental to many other problems in graph analysis.

Let’s say we have a graph \(G\) and two nodes in mind, \(s\) and \(t\).

We would like to isolate \(s\) from \(t\). What is the cheapest way to do it?

_images/L23-max-flow-soviet-rail-1955.png

Note

For an interesting historical perspective on the min-cut problem and its relation to the Cold War, see “On the history of the transportation and maximum flow problems,” by Alexander Schrijver, in Mathematical Programming 91.3 (2002): 437-445.

The min \(s\)-\(t\) cut problem is as follows.

We are given a weighted graph \(G = (V,E)\).

An \(s\)-\(t\) cut \(C\) of \(G\) is a partition of \(V\) into \((U, V-U)\) such that \(s \in U\) and \(t \in V-U\).

The cost of a cut is the total weight of the edges that go between the two parts:

\[ \text{Cost}(C) = \sum_{e = (u,v) \in E,\; u\in U,\; v\in V-U} w(e)\]

This is a very famous problem that can be solved in time that is polynomial in \(|V|\) and \(|E|\).

Increasingly better solutions have been found over the past 60+ years.
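In networkx, a weighted min \(s\)-\(t\) cut can be obtained from the max-flow machinery. Here is a small sketch on a made-up weighted graph (the node names and capacities are illustrative only):

import networkx as nx

G = nx.Graph()
G.add_edge('s', 'a', capacity=3)
G.add_edge('s', 'b', capacity=2)
G.add_edge('a', 'b', capacity=1)
G.add_edge('a', 't', capacity=2)
G.add_edge('b', 't', capacity=3)

# the cut value is the total weight of the cheapest edge set separating s from t
cut_value, (U, W) = nx.minimum_cut(G, 's', 't')
print(cut_value)
print(U, W)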

What can a min \(s\)-\(t\) cut tell us about a graph?

Let’s look at the karate club, in which I’ve highlighted the president and the instructor:

Hide code cell source
G=nx.karate_club_graph()
np.random.seed(9)
pos = nx.spring_layout(G)
cut_edges = nx.minimum_edge_cut(G, s=0, t=33)
#
fig = plt.figure(figsize=(12,6))
node_color = 34 * ['skyblue']
node_color[0] = 'tomato'
node_color[33] = 'dodgerblue'
nx.draw_networkx(G, pos=pos, 
                 with_labels=True, node_size=1000,
                 node_color = node_color,
                 font_size=16)
plt.axis('off');
_images/c49b9a881ff86152f10572bdc9141955cea8aa3a093dde27ec162306bd2bbe51.png

As mentioned, when Wayne Zachary studied the club, a conflict arose between the instructor and the president (nodes 0 and 33).

Zachary predicted the way the club would split based on an \(s\)-\(t\) min cut.

In fact, he correctly predicted every single member’s eventual association except for node 8!

Hide code cell source
Gcopy = G.copy()
Gcopy.remove_edges_from(cut_edges)
cc = nx.connected_components(Gcopy)
node_set = {node: i for i, s in enumerate(cc) for node in s}
colors = ['dodgerblue', 'tomato']
node_colors = [colors[node_set[v]-1] for v in G.nodes()]
fig = plt.figure(figsize=(12,6))
nx.draw_networkx(G, node_color=node_colors, pos=pos, 
                 with_labels='True', node_size=1000, font_size=16)
plt.axis('off');
_images/cfe6d332b343fff00db925f30708a1c54670a19a44f3937ca21782acd20c4051.png

Minimum Cuts#

Now, in partitioning a graph, we may not have any particular \(s\) and \(t\) in mind.

Rather, we may want to simply find the “cheapest” way to disconnect the graph.

Clearly, we can do this using \(s\)-\(t\) min cut, by simply trying all \(s\) and \(t\) pairs.

_images/L23-min-cut.png

Let’s try this approach of finding the minimum \(s\)-\(t\) cut over all possibilities in the karate club graph:

Hide code cell source
Gcopy = G.copy()
Gcopy.remove_edges_from(nx.minimum_edge_cut(G))
cc = nx.connected_components(Gcopy)
node_set = {node: i for i, s in enumerate(cc) for node in s}
#
colors = ['tomato', 'dodgerblue']
node_colors = [colors[node_set[v]] for v in G]
fig = plt.figure(figsize=(12,6))
nx.draw_networkx(G, node_color=node_colors, pos=pos, with_labels='True', 
                 node_size=1000, font_size=16)
plt.axis('off');
_images/60b29f04d820997fccdb1f2388b8a78802d8e986c798833b909c79f58d7e4eac.png

This is in fact the minimum cut: node 11 only has one edge to the rest of the graph, so the min cut is 1.

As this example shows, minimum cut is not, in general, a good approach for clustering or partitioning.

To get a more useful partition, we need to define a new goal: finding a balanced cut.

Balanced Cuts#

To avoid the problem above, the idea is to normalize the cut cost by the size of the smaller of the two components.

The problem above would then be avoided, because the smaller component would contain just a single node, making the normalized cost high.

This leads us to define the isoperimetric ratio:

\[ \alpha = \frac{E(U, V\setminus U)}{\min(|U|, |V\setminus U|)} \]

And the isoperimetric number of G:

\[ \alpha(G) = \min_U \frac{E(U, V\setminus U)}{\min(|U|, |V\setminus U|)} \]

The idea is that finding \(\alpha(G)\) gives a balanced cut – one that maximizes the number of disconnected nodes per edge removed.
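To make the definition concrete, here is a brute-force sketch that computes \(\alpha(G)\) by checking every subset \(U\) of a small made-up graph (two 4-cliques joined by a single edge). The enumeration is exponential in the number of nodes, which is exactly why this approach does not scale.

import itertools
import networkx as nx

G = nx.barbell_graph(4, 0)               # two 4-cliques joined by a single edge

def isoperimetric_number(G):
    nodes = list(G.nodes())
    n = len(nodes)
    best = float('inf')
    for r in range(1, n // 2 + 1):
        for U in itertools.combinations(nodes, r):
            U = set(U)
            cut = sum(1 for u, v in G.edges() if (u in U) != (v in U))
            best = min(best, cut / min(len(U), n - len(U)))
    return best

print(isoperimetric_number(G))           # 0.25: cutting the bridge separates 4 nodes with 1 edge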

How easily can we compute this?

Unfortunately, it’s not computable in polynomial time.

However, we can make good approximations, which we’ll look at now.

To do so, we’ll return to using linear algebra for graphs, and introduce spectral graph theory.

Spectral Graph Theory#

Note

If you want to study this in more detail, an excellent reference is Spectral and Algebraic Graph Theory by Daniel Spielman.

Spectral graph theory is the use of linear algebra to study the properties of graphs.

To introduce spectral graph theory, we define some terms.

For an undirected graph \(G\) having \(n\) nodes, we define the \(n\times n\) matrix \(D\) as a diagonal matrix of node degrees.

I.e., \(D = \text{diag}(d_1, d_2, d_3, \dots)\) where \(d_i\) is the degree of node \(i\).

Then assuming \(G\) has adjacency matrix \(A\), we define the Laplacian of \(G\) as:

\[ L = D - A \]
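As a quick sketch, we can build \(D\) and \(L\) by hand and check that the result matches nx.laplacian_matrix:

import numpy as np
import networkx as nx

Gk = nx.karate_club_graph()
A = nx.to_numpy_array(Gk)
D = np.diag(A.sum(axis=1))               # diagonal matrix of node degrees
L = D - A

print(np.array_equal(L, nx.laplacian_matrix(Gk).toarray()))   # True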

Here is the Laplacian matrix \(L\) for the karate club network, shown as a heatmap:

Hide code cell source
L = nx.laplacian_matrix(nx.karate_club_graph()).todense()
plt.figure(figsize = (7, 7))
sns.heatmap(L, cmap = plt.cm.tab20)
plt.axis('equal')
plt.axis('off');
_images/395a7e40eadcdf7c88a55ebe7ca0c7ac8f45fc33133a5326f6da1090dbecd91f.png

Now let us think about an \(n\)-component vector \(\mathbf{x} \in \mathbb{R}^n\) as an assignment of values to nodes in the graph \(G\).

For example, \(\mathbf{x}\) could encode node “importance” or “strength” or even a more concrete notion like “temperature” or “altitude.”

Then here is an amazing fact about the Laplacian of \(G\).

(For a proof you can see the notes.)

The quadratic form

\[ \mathbf{x}^TL\mathbf{x}\]

is exactly the same as

\[ \sum_{(i,j)\in E} (x_i - x_j)^2 \]

Note

To see that

\[ \mathbf{x}^TL\mathbf{x} = \sum_{(i,j)\in E} (x_i - x_j)^2, \]

first consider \(\mathbf{x}^TL\mathbf{x}\). Writing out the quadratic form explicitly, we have that

\[ \mathbf{x}^TL\mathbf{x} = \sum_{i, j} L_{ij}x_i x_j. \]

Now, taking into account the values in \(L\), we see that in the sum we will have the term \(d_i x_i^2\) for each \(i\), and also 2 terms of \(-x_ix_j\) whenever \((i,j)\in E\).

Turning to

\[\sum_{(i,j)\in E} (x_i - x_j)^2 = \sum_{(i,j)\in E} \left( x_i^2 - 2x_ix_j + x_j^2 \right), \]

we have the same set of terms in the sum.

That is, the quadratic form \( \mathbf{x}^TL\mathbf{x}\) is the sum of squared differences of \(\mathbf{x}\) over the edges in \(G\).

In other words, “When nodes have the values in \(\mathbf{x}\), how much do adjacent nodes vary?”
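Here is a quick numerical check of this identity on the karate club graph, using a random vector \(\mathbf{x}\) (a sketch):

import numpy as np
import networkx as nx

Gk = nx.karate_club_graph()
L = nx.laplacian_matrix(Gk).toarray()

rng = np.random.default_rng(0)
x = rng.normal(size=Gk.number_of_nodes())

quad_form = x @ L @ x
edge_sum = sum((x[i] - x[j]) ** 2 for i, j in Gk.edges())
print(np.isclose(quad_form, edge_sum))   # True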

Now, let’s think about vectors \(\mathbf{x}\) that minimize the differences over the edges in the graph.

We can think of these as “smooth” functions on the graph – neighboring nodes don’t differ too much.

To find such “smooth” vectors, we would solve this optimization:

\[ \min_{\Vert \mathbf{x}\Vert = 1}\sum_{(i,j)\in E} (x_i - x_j)^2 \]

We constrain \(\mathbf{x}\) to have unit norm; otherwise \(\mathbf{x} = \mathbf{0}\) would be a trivial solution.

But we can express this in terms of the graph Laplacian:

\[ \min_{\Vert \mathbf{x}\Vert = 1}\sum_{(i,j)\in E} (x_i - x_j)^2 = \min_{\Vert \mathbf{x}\Vert = 1} \mathbf{x}^TL\mathbf{x} \]

Now, we know how to solve this constrained minimization!

From linear algebra, we know that when

\[ \lambda = \min_{\Vert \mathbf{x}\Vert = 1} \mathbf{x}^TL\mathbf{x} \]

then \(\lambda\) is the smallest eigenvalue of \(L\), and the minimizing \(\mathbf{x}\) is the corresponding eigenvector.

So … we are connecting functions on the graph \(G\) with eigenvectors of the matrix \(L\).

Quite remarkable!

Well, what do we know about \(L\)?

  1. \(L\) is symmetric. Therefore the eigenvectors of \(L\) are orthogonal and its eigenvalues are real.

  2. \(L\) is positive semidefinite. Therefore the eigenvalues of \(L\) are all positive or zero. (For a proof see the notes.)

We can order the eigenvalues from largest to smallest \(\lambda_n \geq \dots \geq \lambda_2 \geq \lambda_1 \geq 0.\)

Note

How do we know that \(L\) is positive semidefinite?

Consider \( \sum_{(i,j)\in E} (x_i - x_j)^2.\)

This is always a nonnegative quantity.

So \(\mathbf{x}^T L\mathbf{x} \geq 0\) for all \(\mathbf{x}\), which is in fact the definition of positive-semidefiniteness.

Next, let’s assume that \(G\) is connected.

Then \(L\) has a single eigenvalue of value \(\lambda_1 = 0\). The corresponding eigenvector is \(\mathbf{w}_1 = \mathbf{1} = [1, 1, 1, \dots]^T\).

This is easily seen:

\[L{\mathbf 1}={\mathbf 0}.\]

Recall that row \(i\) of \(L\) has \(d_i\) on the diagonal and \(-1\) in the \(d_i\) positions corresponding to the neighbors of node \(i\), so each row sums to zero.

The second-smallest eigenvalue of \(L\), \(\lambda_2\), is called the Fiedler value.

We know that all of the other eigenvectors of \(L\) are orthogonal to \(\mathbf 1\), because \(L\) is symmetric.

Because of that, a definition of the second smallest eigenvalue is:

\[\lambda_2 = \min_{\Vert \mathbf{x}\Vert = 1, \;\mathbf{x}\perp {\mathbf 1}} \mathbf{x}^TL\mathbf{x}\]

Note that another way of saying that \(\mathbf{x} \perp {\mathbf 1}\) is that the entries of \(\mathbf{x}\) sum to 0.

In other words, \(\mathbf{x}\) is mean-centered or zero-mean.

The corresponding eigenvector is called the Fiedler vector.

It minimizes:

\[\mathbf{w}_2 = \arg \min_{\Vert \mathbf{x}\Vert=1,\;\mathbf{x}\perp {\mathbf 1}} \sum_{(i,j)\in E} (x_i - x_j)^2\]
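Before interpreting \(\mathbf{w}_2\), here is a quick numerical check of these facts on the karate club graph. (This is a sketch; nx.fiedler_vector computes the same vector up to sign and scaling.)

import numpy as np
import networkx as nx

Gk = nx.karate_club_graph()
L = nx.laplacian_matrix(Gk).toarray()

w, V = np.linalg.eigh(L)                 # eigenvalues in increasing order
lam2, w2 = w[1], V[:, 1]                 # Fiedler value and Fiedler vector

print(np.isclose(w[0], 0))               # smallest eigenvalue is zero
print(lam2 > 0)                          # Fiedler value is positive: the graph is connected
print(np.isclose(w2.sum(), 0))           # Fiedler vector is orthogonal to the ones vector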

Let’s look at this closely:

if we think of \(x_i\) as a 1-D “coordinate” for node \(i\) in the graph,

then choosing \(\mathbf{x} = \mathbf{w}_2\) (the eigenvector corresponding to \(\lambda_2\)) puts each node in a position that minimizes the sum of the “squared stretching” of each edge.

Now, perhaps you recall that the energy in a stretched spring is proportional to the square of its stretched length.

So imagine that we use the entries in \(\mathbf{w}_2\) to position the nodes of the graph along a single dimension.

Then using the Fiedler vector \(\mathbf{w}_2\) for node coordinates is exactly the spring layout of nodes that we discussed in the last lecture – except that it is in one dimension only.

This is the basis for the spectral layout that we showed in the last lecture.

In spectral layout, we use \(\mathbf{w}_2\) for the first dimension, and \(\mathbf{w}_3\) for the second dimension.

\(\mathbf{w}_3\) is the eigenvector corresponding to

\[\lambda_3 = \min_{\Vert \mathbf{x}\Vert = 1, \;\mathbf{x}\perp \{\mathbf{1}, \mathbf{w}_2\}} \mathbf{x}^TL\mathbf{x}\]

Let’s look again at layouts for the football network:

Hide code cell source
plt.figure(figsize = (12, 6))
ax1 = plt.subplot(121)
nx.draw_networkx(football, ax = ax1,
                 node_size=35, 
                 edge_color='gray', 
                 pos = nx.spectral_layout(football),
                 with_labels=False, alpha=.8, linewidths=2)
plt.axis('off')
plt.title('Title 1 Football -- Spectral Layout', size = 16)
ax2 = plt.subplot(122)
nx.draw_networkx(football, ax = ax2,
                 node_size=35, 
                 edge_color='gray', 
                 pos = nx.spring_layout(football, seed = 0),
                 with_labels=False, alpha=.8, linewidths=2)
plt.axis('off')
plt.title('Title 1 Football -- Spring Layout', size = 16);
_images/40124378630124ea14ab5bba9b435867ed554a57c2df265f238783f534eeb6f7.png

So what is the difference between the spectral layout and the spring layout?

In one dimension, they are the same, but in multiple dimensions, spectral layout optimizes each dimension separately.

Spectral Partitioning#

This leads to key ideas in node partitioning.

The basic idea is to partition nodes according to the Fiedler vector \(\mathbf{w}_2\).

This can be shown to have provably good performance for the balanced cut problem.

Note

See Spectral and Algebraic Graph Theory by Daniel Spielman, Chapter 20, where it is proved that for every \(U \subset V\) with \(|U| \leq |V|/2\), the isoperimetric ratio of \(U\) satisfies \(\alpha(U) \geq \lambda_2 (1-s)\), where \(s = |U|/|V|\). In particular, \(\alpha(G) \geq \lambda_2/2.\)

There are a number of options for how to split based on the Fiedler vector.

If \(\mathbf{w}_2\) is the Fiedler vector, then split nodes according to a value \(s\):

  • bisection: \(s\) is the median value in \(\mathbf{w}_2\)

  • ratio cut: \(s\) is the value that minimizes \(\alpha\)

  • sign: separate positive and negative values (\(s = 0\))

  • gap: separate according to the largest gap in the values of \(\mathbf{w}_2\)
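Here is a small sketch comparing these splitting rules on the karate club graph. The “ratio cut” rule is implemented as a simple scan over candidate thresholds; all names here are illustrative.

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
w2 = np.asarray(nx.fiedler_vector(G))
nodes = np.array(list(G.nodes()))
n = G.number_of_nodes()

def alpha(G, U):
    # isoperimetric ratio of the cut (U, V - U)
    U = set(U)
    cut = sum(1 for u, v in G.edges() if (u in U) != (v in U))
    m = min(len(U), n - len(U))
    return float('inf') if m == 0 else cut / m

order = np.sort(w2)
splits = {
    'bisection': np.median(w2),
    'sign': 0.0,
    'gap': order[np.argmax(np.diff(order))],       # lower edge of the largest gap
}
# ratio cut: scan thresholds between consecutive sorted values, keep the best alpha
candidates = (order[:-1] + order[1:]) / 2
splits['ratio cut'] = min(candidates, key=lambda s: alpha(G, nodes[w2 <= s]))

for name, s in splits.items():
    print(f'{name:10s} s = {s:+.3f}  alpha = {alpha(G, nodes[w2 <= s]):.3f}')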

Here is a spectral partitioning for the karate club graph:

Hide code cell source
G = nx.karate_club_graph()
f = nx.fiedler_vector(G)
s = np.zeros(len(f), dtype='int')
s[f > 0] = 1
#
fig = plt.figure(figsize=(12,6))
colors = ['tomato', 'dodgerblue']
np.random.seed(9)
pos = nx.spring_layout(G)
node_colors = [colors[s[v]] for v in G]
nx.draw_networkx(G, pos=pos, node_color=node_colors, with_labels='True',
        node_size=1000, font_size=16)
plt.axis('off');
_images/fe0b51c351d022f0409673f4d8c69d460ecc4b463b1957469627b7e63b0fd61a.png

Interestingly, this is almost the same as the \(s\)-\(t\) min cut based on the president and instructor!

Spectral Clustering#

In many cases we would like to move beyond graph partitioning, to allow for clustering nodes into, say, \(k\) clusters.

The idea of spectral clustering takes the observations about the Fiedler vector and extends them to more than one dimension.

Let’s look again at the spectral layout of the football dataset.

Here we’ve labelled the nodes according to their conference, which we will think of as ground-truth labels.

Hide code cell source
import matplotlib.patches as mpatches
import re
cmap = plt.cm.tab20
#
# data from http://www-personal.umich.edu/~mejn/netdata/
football = nx.readwrite.gml.read_gml('data/football.gml')
conf_name = {}
with open('data/football.txt', 'r') as fp:
    for line in fp:
        m = re.match(r'\s*(\d+)\s+=\s*([\w\s-]+)\s*\n', line)
        if m:
            conf_name[int(m.group(1))] = m.group(2)
conf = [d['value'] for i, d in football.nodes.data()]
#
#
plt.figure(figsize = (12, 12))
nx.draw_networkx(football,
                 pos = nx.spectral_layout(football), 
                 node_color = conf,
                 with_labels = False,
                 cmap = cmap)
plt.title('Conference Membership in Football Network')
patches = [mpatches.Patch(color = cmap(i/11), label = conf_name[i]) for i in range(12)]
plt.legend(handles = patches)
plt.axis('off');
_images/583fc6b7de9026b42282919f85f481c8352b6f60178d5d0352f5621b43449fc5.png

Now, the key idea is that using spectral layout, we have placed nodes into a Euclidean space.

So … we could use a standard clustering algorithm in that space.

Hide code cell source
plt.figure(figsize = (12, 12))
nx.draw_networkx(football,
                 pos = nx.spectral_layout(football), 
                 node_color = conf,
                 with_labels = False,
                 edgelist = [],
                 cmap = cmap)
plt.title('Conference Membership in Football Network')
patches = [mpatches.Patch(color = cmap(i/11), label = conf_name[i]) for i in range(12)]
plt.tick_params(left=True, bottom=True, labelleft=True, labelbottom=True)
plt.legend(handles = patches);
_images/23ec8180200edfa35db53d0efef3efc7294fc3526e5a8c4d7fbebc4171f31f78.png

Now the above plot shows that many clusters are well-separated in this space, but some still overlap.

To address this, we can use additional eigenvectors of the Laplacian, i.e., \(\mathbf{w}_4, \mathbf{w}_5, \dots\).

So: the idea of spectral clustering is:

  • use enough of the smallest eigenvectors of \(L\) to sufficiently “spread out” the nodes

  • cluster the nodes in the Euclidean space created by this embedding.

More specifically: Given a graph \(G\):

  • Compute \(L\), the Laplacian of \(G\)

  • Compute the smallest \(d\) eigenvectors of \(L\), excluding the smallest eigenvector (the ones vector)

  • Let \(U \in \mathbb{R}^{n\times d}\) be the matrix containing the eigenvectors \(\mathbf{w}_2, \mathbf{w}_3, \dots, \mathbf{w}_{d+1}\) as columns

  • Let the position of each node \(i\) be the point in \(\mathbb{R}^d\) given by row \(i\) of \(U\)

  • Cluster the points into \(k\) clusters using \(k\)-means
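For comparison, scikit-learn packages a version of this recipe as SpectralClustering. It differs in details (for instance, it uses a normalized Laplacian for the embedding), so the clusters will be similar in spirit but not identical to the recipe above; this is a sketch:

import networkx as nx
from sklearn.cluster import SpectralClustering

football = nx.readwrite.gml.read_gml('data/football.gml')
A = nx.to_numpy_array(football)

sc = SpectralClustering(n_clusters=11, affinity='precomputed', random_state=0)
labels = sc.fit_predict(A)               # cluster labels, one per node
print(labels[:10])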

Let’s explore the results of spectral clustering using \(d = 2\) dimensions.

Hide code cell source
# Here is a complete example of spectral clustering
#
# The number of dimensions of spectral layout
k = 2
#
# Obtain the graph
football
#
# Compute the eigenvectors of its Laplacian
L = nx.laplacian_matrix(football).todense()
w, v = np.linalg.eig(L)
v = np.array(v)
# 
# scale each eigenvector by its eigenvalue
X = v @ np.diag(w)
#
# consider the eigenvectors in increasing order of their eigenvalues
w_order = np.argsort(w)
X = X[:, w_order]
#
# run kmeans using k top eigenvectors as coordinates
from sklearn.cluster import KMeans
kmeans = KMeans(init='k-means++', n_clusters=11 , n_init=10)
kmeans.fit_predict(X[:, 1:(k+1)])
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
error = kmeans.inertia_
#
# visualize the result
plt.figure(figsize = (14, 7))
ax1 = plt.subplot(121)
nx.draw_networkx(football,
                 ax = ax1,
                 pos = nx.spectral_layout(football), 
                 node_color = labels,
                 with_labels = False,
                 edgelist = [],
                 cmap = cmap,
                 node_size = 100)
plt.tick_params(left=True, bottom=True, labelleft=True, labelbottom=True)
plt.title(f'Spectral Clustering, 11 Clusters, dimension = {k}')
ax2 = plt.subplot(122)
nx.draw_networkx(football,
                 ax = ax2,
                 pos = nx.spectral_layout(football), 
                 node_color = conf,
                 with_labels = False,
                 edgelist = [],
                 cmap = cmap,
                 node_size = 100)
plt.title('Conference Membership in Football Network')
patches = [mpatches.Patch(color = cmap(i/11), label = conf_name[i]) for i in range(12)]
plt.tick_params(left=True, bottom=True, labelleft=True, labelbottom=True)
plt.legend(handles = patches, loc='center left', bbox_to_anchor=(1, 0.5));
_images/626322f913fa24bf6fae8e47ba37d5e8b8563fdd586173e0026ea3a3afc7487f.png

This is pretty good, but we can see that in some cases the clustering is not able to separate clusters that overlap in the visualization.

Which makes sense, as for the case \(d = 2\), we are running \(k\)-means on the points just as we see them in the visualization.

Let’s try \(d = 3\). Now there will be another dimension available to the clustering, which we can’t see in the visualization.

Hide code cell source
# Here is a complete example of spectral clustering
#
# The number of dimensions of spectral layout
k = 3
#
#
# Compute the eigenvectors of its Laplacian
L = nx.laplacian_matrix(football).todense()
w, v = np.linalg.eig(L)
v = np.array(v)
# 
# scale each eigenvector by its eigenvalue
X = v @ np.diag(w)
#
# consider the eigenvectors in increasing order of their eigenvalues
w_order = np.argsort(w)
X = X[:, w_order]
#
# run kmeans using k top eigenvectors as coordinates
from sklearn.cluster import KMeans
kmeans = KMeans(init='k-means++', n_clusters=11 , n_init=10)
kmeans.fit_predict(X[:, 1:(k+1)])
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
error = kmeans.inertia_
#
# visualize the result
plt.figure(figsize = (14, 7))
ax1 = plt.subplot(121)
nx.draw_networkx(football,
                 ax = ax1,
                 pos = nx.spectral_layout(football), 
                 node_color = labels,
                 with_labels = False,
                 edgelist = [],
                 cmap = cmap,
                 node_size = 100)
plt.tick_params(left=True, bottom=True, labelleft=True, labelbottom=True)
plt.title(f'Spectral Clustering, 11 Clusters, dimension = {k}')
ax2 = plt.subplot(122)
nx.draw_networkx(football,
                 ax = ax2,
                 pos = nx.spectral_layout(football), 
                 node_color = conf,
                 with_labels = False,
                 edgelist = [],
                 cmap = cmap,
                 node_size = 100)
plt.title('Conference Membership in Football Network')
patches = [mpatches.Patch(color = cmap(i/11), label = conf_name[i]) for i in range(12)]
plt.tick_params(left=True, bottom=True, labelleft=True, labelbottom=True)
plt.legend(handles = patches, loc='center left', bbox_to_anchor=(1, 0.5));
_images/35410a70f02120c1928977c5551539bdef37c92a91c3b4a5701ba6b0110524e5.png

We can see visually that using 3 dimensions is giving us a better clustering than 2 dimensions.

What happens as we increase the dimension further?

To answer this question, we can use the Adjusted Rand Index:

Hide code cell source
import sklearn.metrics as metrics
#
# Compute the eigenvectors of its Laplacian
L = nx.laplacian_matrix(football).todense()
w, v = np.linalg.eig(L)
v = np.array(v)
# 
# scale each eigenvector by its eigenvalue
X = v @ np.diag(w)
#
# consider the eigenvectors in increasing order of their eigenvalues
w_order = np.argsort(w)
X = X[:, w_order]
#
max_dimension = 15
ri = np.zeros(max_dimension - 1)
for k in range(1, max_dimension):
    # run kmeans using k top eigenvectors as coordinates
    kmeans = KMeans(init='k-means++', n_clusters = 11, n_init = 10)
    kmeans.fit_predict(X[:, 1:(k+1)])
    ri[k - 1] = metrics.adjusted_rand_score(kmeans.labels_, conf)
#
plt.figure(figsize = (8, 6))
plt.plot(range(1, max_dimension), ri, 'o-')
plt.xlabel('Number of Dimensions (Eigenvectors)', size = 14)
plt.title('Spectral Clustering of Football Network Compared to Known Labels', size = 16)
plt.ylabel('Adjusted Rand Index', size = 14);
_images/2fa99e17cb7ab4204df434d5d0d2be282a29a858f233fd85b0b4063809bb52e2.png

Based on this plot, it looks like the football graph is best described as about six-dimensional.

When we embed it in six dimensions and cluster there we get an extremely high Adjusted Rand Index.

Hide code cell source
print(f'Maximum ARI is {np.max(ri):0.3f}, using {1 + np.argmax(ri)} dimensions for spectral embedding.')
Maximum ARI is 0.868, using 6 dimensions for spectral embedding.