Hide code cell content
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import matplotlib as mp
import sklearn
from IPython.display import Image

%matplotlib inline

Dimensionality Reduction and PCA – SVD II#

In the last lecture we learned about the SVD as a tool for constructing low-rank matrices.

Today we’ll look at it as a way to transform our data objects.

As a reminder, here is what the SVD looks like:

\[\begin{split} \mbox{objects}\left\{\begin{array}{c}\;\\\;\\\;\\\;\\\;\end{array}\right.\;\;\overbrace{\left[\begin{array}{cccc}\begin{array}{c}\vdots\\\vdots\\{\bf a_1}\\\vdots\\\vdots\end{array}&\begin{array}{c}\vdots\\\vdots\\{\bf a_2}\\\vdots\\\vdots\end{array}&\dots&\begin{array}{c}\vdots\\\vdots\\{\bf a_n}\\\vdots\\\vdots\end{array}\\\end{array}\right]}^{\mbox{features}} = \overbrace{\left[\begin{array}{cc}\vdots&\vdots\\\vdots&\vdots\\\sigma_1\mathbf{u}_1&\sigma_k\mathbf{u}_k\\\vdots&\vdots\\\vdots&\vdots\end{array}\right]}^{\large k} \times \left[\begin{array}{ccccc}\dots&\dots&\mathbf{v}_1&\dots&\dots\\\dots&\dots&\mathbf{v}_k&\dots&\dots\end{array}\right]\end{split}\]
\[ A = U\Sigma V^T\]

Notice that \(U\) contains a row for each object.

In a sense we have transformed objects from an \(n\) dimensional space to a \(k\) dimensional space, where \(k\) is (probably much) smaller than \(n\).

This is an example of dimensionality reduction.

When we take our data to be the rows of \(U\Sigma\) instead of rows of \(A\), we are reducing the dimension of our data from \(n\) dimensions to \(k\) dimensions.
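As a small illustrative sketch (the data matrix and the choice \(k=2\) here are made up for illustration), this is what "keeping the rows of \(U\Sigma\)" looks like in code:

A = np.random.default_rng(0).normal(size=(100, 20))   # hypothetical data: 100 objects, 20 features
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                      # illustrative target dimension
A_reduced = U[:, :k] @ np.diag(S[:k])      # rows of U_k Sigma_k: one k-dimensional vector per object
print(A_reduced.shape)                     # (100, 2)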

This suggests an idea: is there an optimal transformation of the data into \(k\) dimensions?

What would that mean?

Note

An excellent reference for the following section is https://liorpachter.wordpress.com/2014/05/26/what-is-principal-component-analysis/. Some figures and discussion are taken from there.

Here is a natural criterion for the “best” \(k\)-dimensional transformation:

Find the \(k\)-dimensional hyperplane that is “closest” to the points.

More precisely:

Given \(n\) points in \(\mathbb{R}^d\), find the hyperplane (affine space) of dimension \(k\) with the property that the squared distance of the points to their orthogonal projection onto the hyperplane is minimized.
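One way to write this criterion symbolically (this formalization is ours; \(\mathbf{x}_j\) denotes point \(j\) and \(\operatorname{proj}_W\) denotes orthogonal projection onto the affine subspace \(W\)):

\[ \min_{W\,:\;\dim W = k} \;\sum_{j=1}^n \Vert \mathbf{x}_j - \operatorname{proj}_W(\mathbf{x}_j)\Vert^2. \]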

Figure

This sounds like an appealing criterion.

But it also turns out to have a strong statistical guarantee.

In fact, the transformation defined by this criterion captures the maximum variance in the data.

That is, among all \(k\)-dimensional projections of the data, the resulting dataset is the one with maximum variance.

Let’s see why this is the case.

First, let’s recall the idea of a dataset centroid.

Given an \(n\times d\) data matrix \(X\) with observations on the rows (as always), the centroid is defined by

\[\overline{\mathbf{x}}^T = \frac{1}{n}\mathbf{1}^TX.\]

In words: the centroid – i.e., the mean vector – is the average over the rows.

It is the “center of mass” of the dataset.

Next, recall the sample variance of a dataset is:

\[ \operatorname{Var}(X) = \frac{1}{n}\sum_j\Vert \mathbf{x}_j^T - \overline{\mathbf{x}}^T\Vert^2 \]

where \(\mathbf{x}_j^T\) is row \(j\) of \(X\).

In other words, the sample variance of the set of points is the average squared distance from each point to the centroid.
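Here is a minimal numeric sketch of these two formulas (the small matrix X_toy below is made up for illustration):

X_toy = np.array([[1., 2.], [3., 0.], [5., 4.]])    # three illustrative points in R^2
n = X_toy.shape[0]

xbar = (1 / n) * np.ones(n) @ X_toy                 # centroid: the average over the rows
print(np.allclose(xbar, X_toy.mean(axis=0)))        # same as the column-wise mean

var = np.mean(np.sum((X_toy - xbar)**2, axis=1))    # average squared distance to the centroid
print(var)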

Consider what happens if we translate the points, shifting each point by the same constant vector. Clearly, the sample variance does not change.

So, let’s move the points to be centered on the origin.

\[ \tilde{X} = X - \mathbf{1}\overline{\mathbf{x}}^T \]

The sample variance of the new points \(\tilde{X}\) is the same as the old points \(X\), but the centroid of the new point set is the origin.
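We can check both facts numerically (continuing with the made-up X_toy from the sketch above):

X_tilde = X_toy - xbar                              # X - 1 xbar^T, via broadcasting
print(np.allclose(X_tilde.mean(axis=0), 0))         # new centroid is (numerically) the zero vector

var_tilde = np.mean(np.sum(X_tilde**2, axis=1))     # centroid is now 0, so no subtraction needed
print(np.isclose(var_tilde, var))                   # sample variance is unchanged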

Now that the mean of the points is the zero vector, we can reason geometrically.

Here is a picture to show why the distance-minimizing subspace is variance-maximizing.

In this figure,

  • the red point is one example point from \(\tilde{X}\),

  • the green point is the origin / centroid, and

  • the blue point is the \(k\)-dimensional projection of the red point.

Figure

The length of the black line is fixed – it is the distance of the original point from the origin.

So the squared length of the black line is this point’s contribution to the sample variance.

Now, regardless of how we orient the subspace (the green line), a right triangle is formed, because the projection is orthogonal.

So – by virtue of the Pythagorean Theorem, the blue line squared plus the red line squared equals the black line squared.

Which means that when we orient the subspace (green line) so as to minimize the squared distances from the points of \(\tilde{X}\) to the subspace (blue lines), we automatically maximize the squared distances of all the resulting blue points from the origin (red lines).

And the squared distance of the blue point from the origin (red dashed line) is its contribution to the new \(k\)-dimensional sample variance.

In other words, the distance-minimizing projection is the variance-maximizing projection!
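Here is a quick numeric check of that Pythagorean argument, for one made-up centered point and one candidate 1-D subspace:

x = np.array([3., 1.])                    # a centered data point (illustrative)
v = np.array([1., 1.]) / np.sqrt(2)       # unit vector spanning a candidate 1-D subspace (green line)

p = (x @ v) * v                           # orthogonal projection of x onto the subspace (blue point)
r = x - p                                 # residual: the part of x not captured by the subspace

# ||x||^2 (black) = ||p||^2 (red) + ||r||^2 (blue), so minimizing ||r||^2 maximizes ||p||^2
print(np.isclose(x @ x, p @ p + r @ r))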

Hide code cell content
def centerAxes(ax):
    ax.spines['left'].set_position('zero')
    ax.spines['right'].set_color('none')
    ax.spines['bottom'].set_position('zero')
    ax.spines['top'].set_color('none')
    ax.xaxis.set_ticks_position('bottom')
    ax.yaxis.set_ticks_position('left')
    bounds = np.array([ax.axes.get_xlim(), ax.axes.get_ylim()])
    ax.plot(bounds[0][0],bounds[1][0],'')
    ax.plot(bounds[0][1],bounds[1][1],'')
Hide code cell source
n_samples = 500
C = np.array([[0.1, 0.6], [2., .6]])
X = np.random.randn(n_samples, 2) @ C + np.array([-6, 3])
ax = plt.figure(figsize = (7, 7)).add_subplot()
plt.xlim([-12, 12])
plt.ylim([-7, 7])
centerAxes(ax)
plt.axis('equal')
plt.scatter(X[:, 0], X[:, 1], s=10, alpha=1);
_images/6febf5d4ccf36443f99485cfe28b54bed42fddfd3e434ec9ec0f68aa4e06394e.png

For example, here is a dataset \(X\) in \(\mathbb{R}^2\).

Hide code cell source
Xc = X - np.mean(X,axis=0)
ax = plt.figure(figsize = (7, 7)).add_subplot()
plt.xlim([-12, 12])
plt.ylim([-7, 7])
centerAxes(ax)
plt.axis('equal')
plt.scatter(X[:, 0], X[:, 1], s=10, alpha=0.8)
plt.scatter(Xc[:, 0], Xc[:, 1], s=10, alpha=0.8, color='r');
_images/899ee69b7b991f3cfe028c97450e2436cb9d203a608d37434d4ee9061e0b6b35.png

By taking

\[ \tilde{X} = X - \mathbf{1}\overline{\mathbf{x}}^T \]

we translate each point so that the new mean is the origin.

Now, the last step is to find the \(k\)-dimensional subspace that minimizes the distance between the data (red points) and their projection on the subspace.

Remember that the \(\ell_2\) norm of a vector difference is the Euclidean distance between the corresponding points.

In other words, what rank \(k\) matrix \(X^{(k)} \in \mathbb{R}^{n\times d}\) is closest to \(\tilde{X}\)?

We seek

\[X^{(k)} =\arg \min_{\{B\;|\;\operatorname{Rank} B = k\}} \Vert \tilde{X}-B\Vert_F.\]

We know how to find this matrix –

as we showed in the last lecture, we obtain it via the SVD!

So for this case, let’s construct the best 1-D approximation of the mean-centered data:

Xc = X - np.mean(X, axis = 0)
u, s, vt = np.linalg.svd(Xc, full_matrices=False)
scopy = s.copy()
scopy[1] = 0.
reducedX = u @ np.diag(scopy) @ vt
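As a quick sanity check (this check is our addition): reducedX should have rank 1, and the Frobenius norm of the residual should equal the discarded singular value \(\sigma_2\), exactly as the low-rank approximation result from last lecture promises.

print(np.linalg.matrix_rank(reducedX))                # 1
print(np.linalg.norm(Xc - reducedX, 'fro'), s[1])     # residual norm equals the discarded sigma_2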
Hide code cell source
ax = plt.figure(figsize = (7, 7)).add_subplot()
centerAxes(ax)
plt.axis('equal')
plt.scatter(Xc[:,0],Xc[:,1], color='r')
plt.scatter(reducedX[:,0], reducedX[:,1])
endpoints = np.array([[-10],[10]]) @ vt[[0],:]
plt.plot(endpoints[:,0], endpoints[:,1], 'g-');
_images/061c3ac09b6b005714abe74f8bc3cb7110634c52bf719196558bae0a2d4f885c.png

This method is called Principal Component Analysis.

In summary, PCA consists of:

  1. Mean center the data, and

  2. Reduce the dimension of the mean-centered data via SVD.

This is equivalent to projecting the data onto the hyperplane that captures the maximum variance in the data.

It winds up constructing the best low dimensional linear approximation of the data.
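Here is a minimal sketch of those two steps packaged as a function (the helper name pca_via_svd is ours, not a library routine); as a usage example it reuses the 2-D dataset X generated above:

def pca_via_svd(X, k):
    # 1. mean center the data
    Xc = X - X.mean(axis=0)
    # 2. reduce the dimension of the mean-centered data via the SVD
    u, s, vt = np.linalg.svd(Xc, full_matrices=False)
    return u[:, :k] * s[:k], vt[:k]       # k-dimensional coordinates, principal directions

scores, directions = pca_via_svd(X, 1)    # same result as the rank-1 construction above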

What are “principal components”?

These are the rows of \(V^T\) (equivalently, the columns of \(V\)) – the orthogonal directions onto which we project the data. Because they capture the directions of maximum variation, they are called “principal” components.
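As a cross-check (this comparison is our addition), scikit-learn's PCA applied to the 2-D dataset X above recovers essentially the same directions as the rows of vt computed from the SVD of the mean-centered data, up to sign:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X)                    # PCA mean-centers internally

print(pca.components_)        # principal directions found by sklearn
print(vt)                     # rows of V^T from our SVD of Xc; should agree up to sign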

Uses of PCA/SVD#

There are many uses of PCA (and SVD).

We’ll cover three of the main uses:

  1. Visualization

  2. Denoising

  3. Anomaly Detection

As already mentioned, SVD is also useful for data compression – we won’t discuss it in detail, but the same idea of discarding the least significant components underlies lossy compression of audio and video (MP3s, HDTV, etc.).

Visualization and Denoising – Extended Example.#

We will study both visualization and denoising in the context of text processing.

As we have seen, a common way to work with documents is using the bag-of-words model (perhaps considering n-grams), which results in a term-document matrix.

Entries in the matrix are generally TF-IDF scores.

Often, terms are correlated – they appear together in combinations that suggest a certain “concept”.

That is, term-document matrices often show low effective rank – many columns can be approximated as combinations of other columns.

When PCA is used for dimensionality reduction of documents, it tends to extract these “concept” vectors.

The application of PCA to term-document matrices is called Latent Semantic Analysis (LSA).

Among other benefits, LSA can improve the performance of clustering of documents.

This happens because the important concepts are captured in the most significant principal components.

Data: 20 Newsgroups#

from sklearn.datasets import fetch_20newsgroups

categories = ['comp.os.ms-windows.misc', 'sci.space','rec.sport.baseball']
news_data = fetch_20newsgroups(subset='train', categories=categories)
print(news_data.target_names)
print(news_data.target)
['comp.os.ms-windows.misc', 'rec.sport.baseball', 'sci.space']
[2 0 0 ... 2 1 2]

Basic Clustering#

To get started, let’s compute tf-idf scores.

Notice that we could also let the tokenizer compute \(n\)-grams for \(n=1\) and 2 (via the vectorizer’s ngram_range parameter); here we stick with single terms.

An \(n\)-gram is a sequence of \(n\) consecutive terms.

We’ll compute a document-term matrix dtm.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', min_df=4,max_df=0.8)
dtm = vectorizer.fit_transform(news_data.data)
print(type(dtm), dtm.shape)
terms = vectorizer.get_feature_names_out()
<class 'scipy.sparse._csr.csr_matrix'> (1781, 9409)

As a comparison case, let’s first cluster the documents using the raw tf-idf scores.

(This is without any use of PCA, and so includes lots of noisy or meaningless terms.)

from sklearn.cluster import KMeans
k = 3
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=10, random_state = 0)
kmeans.fit_predict(dtm)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
error = kmeans.inertia_

Let’s evaluate the clusters. We’ll assume that the newsgroup the article came from is the ‘ground truth.’

Hide code cell source
import sklearn.metrics as metrics
ri = metrics.adjusted_rand_score(labels,news_data.target)
ss = metrics.silhouette_score(dtm,kmeans.labels_,metric='euclidean')
print('Rand Index is {}'.format(ri))
print('Silhouette Score is {}'.format(ss))
Rand Index is 0.7203279994023624
Silhouette Score is 0.009377353619567392

Improvement: Stemming#

One source of noise that we can eliminate (before we use LSA) comes from word endings.

For example: a Google search on ‘run’ will return web pages on ‘running.’

This is useful, because the difference between ‘run’ and ‘running’ in practice is not enough to matter.

The usual solution taken is to simply ‘chop off’ the part of the word that indicates a variation from the base word.

(For those of you who studied Latin or Greek, this will sound familiar – we are removing the ‘inflection.’)

The process is called ‘stemming.’

A very good stemmer is the “Snowball” stemmer.

You can read more at http://www.nltk.org and http://www.nltk.org/howto/stem.html.

Installation Note: From a cell you need to call nltk.download() and select the appropriate packages from the interface that appears. In particular you need to download: stopwords from corpora and punkt and snowball_data from models.
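For a quick feel for what the stemmer does, here is a small illustrative check of our own:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
print([stemmer.stem(w) for w in ['run', 'runs', 'running']])   # all three should map to the stem 'run'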

Let’s stem the data using the Snowball stemmer:

from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize, sent_tokenize


stemmer = SnowballStemmer("english", ignore_stopwords=True)
stemmed_data = [" ".join(stemmer.stem(word)
                         for sent in sent_tokenize(message)
                         for word in word_tokenize(sent))
                for message in news_data.data]

dtm = vectorizer.fit_transform(stemmed_data)
terms = vectorizer.get_feature_names_out()

And now let’s see how well we can cluster on the stemmed data.

from sklearn.cluster import KMeans
k = 3
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=10,random_state=0)
kmeans.fit_predict(dtm)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
error = kmeans.inertia_
Hide code cell source
import sklearn.metrics as metrics
ri = metrics.adjusted_rand_score(labels,news_data.target)
ss = metrics.silhouette_score(dtm,kmeans.labels_,metric='euclidean')
print('Rand Index is {}'.format(ri))
print('Silhouette Score is {}'.format(ss))
Rand Index is 0.844864300823809
Silhouette Score is 0.01080754741232728

So the Rand Index improved from about 0.72 to about 0.84 as a result of stemming.

Demonstrating PCA#

OK. Now, let’s apply PCA.

Our data matrix is in sparse form.

First, we mean center the data. Note that dtm is a sparse matrix, but once it is mean centered it is no longer sparse.

Then we use PCA to reduce the dimension of the mean-centered data.

dtm_dense = dtm.todense()
centered_dtm = dtm_dense - np.mean(dtm_dense, axis=0)
u, s, vt = np.linalg.svd(centered_dtm)

Note that if you have sparse data, you may want to use scipy.sparse.linalg.svds() and for large data it may be advantageous to use sklearn.decomposition.TruncatedSVD().
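For example, here is a minimal sketch of both sparse-friendly routes (the choice of 50 components is arbitrary, for illustration):

from scipy.sparse.linalg import svds
from sklearn.decomposition import TruncatedSVD

# truncated SVD of the sparse matrix directly; note that svds returns the
# singular values in ascending order
u50, s50, vt50 = svds(dtm, k=50)

# TruncatedSVD also accepts sparse input; it does not mean-center the data,
# so this is LSA on the raw tf-idf matrix rather than exact PCA
lsa = TruncatedSVD(n_components=50, random_state=0)
dtm_lsa = lsa.fit_transform(dtm)
print(dtm_lsa.shape)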

\[\begin{split} \mbox{objects}\left\{\begin{array}{c}\;\\\;\\\;\\\;\\\;\end{array}\right.\;\;\overbrace{\left[\begin{array}{cccc}\begin{array}{c}\vdots\\\vdots\\{\bf a_1}\\\vdots\\\vdots\end{array}&\begin{array}{c}\vdots\\\vdots\\{\bf a_2}\\\vdots\\\vdots\end{array}&\dots&\begin{array}{c}\vdots\\\vdots\\{\bf a_n}\\\vdots\\\vdots\end{array}\\\end{array}\right]}^{\mbox{features}} = \overbrace{\left[\begin{array}{cc}\vdots&\vdots\\\vdots&\vdots\\\sigma_1\mathbf{u}_1&\sigma_k\mathbf{u}_k\\\vdots&\vdots\\\vdots&\vdots\end{array}\right]}^{\large k} \times \left[\begin{array}{ccccc}\dots&\dots&\mathbf{v}_1&\dots&\dots\\\dots&\dots&\mathbf{v}_k&\dots&\dots\end{array}\right]\end{split}\]
\[ A = U\Sigma V^T\]

The principal components (rows of \(V^T\)) encode the extracted concepts.

Each LSA concept is a linear combination of words.

pd.DataFrame(vt, columns=vectorizer.get_feature_names_out())
00 000 0005 0062 0096b0f0 00bjgood 00mbstultz 01 0114 01wb ... zri zrlk zs zt zu zv zw zx zy zz
0 0.007831 0.012323 0.000581 0.005558 0.001032 0.002075 0.002008 0.005575 0.001247 0.000813 ... -0.000028 -0.000025 -0.000200 -0.000025 -0.000128 -0.000207 -0.000087 -0.000150 -0.000113 0.000534
1 -0.005990 0.009540 0.002089 -0.010679 -0.001646 -0.003477 -0.002687 0.002143 -0.003394 0.002458 ... -0.000015 -0.000013 -0.000054 -0.000013 -0.000042 -0.000100 -0.000026 -0.000064 -0.000040 -0.001041
2 -0.012630 -0.011904 -0.002443 0.001438 0.000439 0.000044 0.000349 -0.006817 0.000692 -0.001124 ... -0.000095 -0.000086 -0.000289 -0.000087 -0.000252 -0.000576 -0.000134 -0.000293 -0.000204 -0.000013
3 0.013576 0.017639 0.003552 0.001148 0.003354 -0.000410 0.000622 0.011649 0.002237 0.001969 ... 0.000205 0.000186 0.000486 0.000172 0.000464 0.001142 0.000220 0.000508 0.000352 0.000200
4 -0.002254 -0.004619 -0.005458 -0.001938 -0.000251 0.000689 0.000043 -0.002620 -0.000533 0.001434 ... -0.000310 -0.000283 -0.000775 -0.000252 -0.000698 -0.001714 -0.000331 -0.000728 -0.000529 -0.000961
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8048 0.000043 -0.000048 -0.000207 0.000041 -0.002082 -0.000734 -0.000556 0.000508 -0.000423 -0.000179 ... -0.001449 -0.002007 0.000133 -0.000929 0.000642 0.976204 -0.000030 0.000284 -0.000042 0.000657
8049 0.000137 0.000255 -0.000053 -0.000149 0.001295 -0.000098 -0.000018 -0.000708 -0.000041 -0.000346 ... 0.000114 0.000102 -0.000522 -0.000170 -0.001086 -0.000132 0.999364 -0.000520 -0.000747 -0.000892
8050 -0.000186 0.000999 -0.000025 -0.000613 -0.000284 -0.000495 -0.000221 0.000041 0.000112 0.000328 ... 0.000246 0.000210 -0.001592 -0.000179 -0.000888 0.000155 -0.000515 0.996858 -0.001115 0.000816
8051 0.000019 0.000606 -0.000161 -0.000327 0.000523 -0.000124 -0.000112 -0.000705 -0.000055 0.000097 ... 0.000202 0.000184 -0.000606 -0.000288 -0.001271 -0.000157 -0.000769 -0.001136 0.998706 -0.000867
8052 0.000098 0.000264 0.000639 -0.000181 -0.003336 0.002154 0.001583 -0.001134 0.000224 0.000040 ... -0.000003 0.000017 0.000071 -0.001074 -0.002254 0.000697 -0.000814 0.000834 -0.000757 0.978108

8053 rows × 8053 columns

Hide code cell source
names = np.array(vectorizer.get_feature_names_out())
for cl in range(3):
    print(f'\nPrincipal Component {cl}:')
    idx = np.array(np.argsort(vt[cl]))[0][-10:]
    for i in idx[::-1]:
        print(f'{names[i]:12s} {vt[cl, i]:0.3f}')
Principal Component 0:
year         0.140
game         0.111
henri        0.108
team         0.107
space        0.106
nasa         0.091
toronto      0.086
alaska       0.083
player       0.079
hit          0.077

Principal Component 1:
space        0.260
nasa         0.218
henri        0.184
gov          0.135
orbit        0.134
alaska       0.129
access       0.129
toronto      0.118
launch       0.109
digex        0.102

Principal Component 2:
henri        0.458
toronto      0.364
zoo          0.228
edu          0.201
spencer      0.194
zoolog       0.184
alaska       0.123
work         0.112
umd          0.096
utzoo        0.092

The rows of \(U\) correspond to documents, which are linear combinations of concepts.

Denoising#

In order to improve our clustering accuracy, we will exclude the less significant concepts from the documents’ feature vectors.

That is, we will choose the leftmost \(k\) columns of \(U\) and the topmost \(k\) rows of \(V^T\).

These reduced document encodings – the first \(k\) columns of \(U\), scaled by the corresponding singular values – are what we will cluster.

Hide code cell source
plt.xlim([0,50])
plt.xlabel('Number of Principal Components (Rank $k$)')
plt.ylabel('Singular Values')
plt.plot(range(1,len(s)+1), s);
_images/ea329b24218b5c361b8e1df7a36469d14604de72f0d496fb9289766efdd6d0a8.png

It looks like 2 is a reasonable number of principal components.
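We can quantify this (this check is our addition): the variance captured by each principal component is proportional to the square of its singular value, so the cumulative fraction of variance explained by the first few components is:

frac = np.cumsum(s**2) / np.sum(s**2)
print(frac[:5])        # cumulative fraction of total variance captured by the first 5 components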

Hide code cell source
ri = []
ss = []
for k in range(1,50):
    vectorsk = np.asarray(u[:,:k] @ np.diag(s[:k]))
    kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=10, random_state=0)
    kmeans.fit_predict(vectorsk)
    labelsk = kmeans.labels_
    ri.append(metrics.adjusted_rand_score(labelsk,news_data.target))
    ss.append(metrics.silhouette_score(vectorsk,kmeans.labels_,metric='euclidean'))
Hide code cell source
plt.plot(range(1,50),ri)
plt.ylabel('Rand Score',size=20)
plt.xlabel('No of Prin Comps',size=20);
_images/4bdc385f77d0441268f3216fcdd756449917ed8d96f8f7716ba0c8314bf93aaf.png
news_data.target_names
['comp.os.ms-windows.misc', 'rec.sport.baseball', 'sci.space']
Hide code cell source
plt.plot(range(1,50),ss)
plt.ylabel('Silhouette Score', size=20)
plt.xlabel('No of Prin Comps', size=20);
_images/1dead680e9fde07194e52df615b1a31a8168f8c45f40cb9a56efb174834e1702.png

Note that we can get good accuracy and coherent clusters with just two principal components.

Visualization#

That’s a good thing, because it means that we can also visualize the data well with the help of PCA.

Recall that the challenge of visualization is that the data live in a high dimensional space.

We can only look at 2 (or maybe 3) dimensions at a time, so it’s not clear which dimensions to look at.

The idea behind using PCA for visualization is that since low-numbered principal components capture most of the variance in the data, these are the “directions” from which it is most useful to inspect the data.

We saw that the first two singular values were particularly large – let’s start by using the corresponding principal components for visualization.

Hide code cell source
import seaborn as sns
Xk = u @ np.diag(s)
with sns.axes_style("white"):
    fig, ax = plt.subplots(1,1,figsize=(7,7))
    cmap = sns.hls_palette(n_colors=3, h=0.35, l=0.4, s=0.9)
    for i, label in enumerate(set(news_data.target)):
        point_indices = np.where(news_data.target == label)[0]
        point_indices = point_indices.tolist()
        plt.scatter(np.ravel(Xk[point_indices, 0]), np.ravel(Xk[point_indices, 1]), s=20, alpha=0.5, color=cmap[i], marker='D',
label=news_data.target_names[i])
        plt.legend(loc = 'best')
    sns.despine()
plt.title('Ground Truth Labels', size=20);
_images/f39cbe0d944d564b289c858e6850e2b2966ebf6b084608a7a5564b42c40d5d18.png

Points in this plot have been labelled with their “true” (aka “ground truth”) cluster labels.

Notice how clearly the clusters separate and how coherently they present themselves. PCA clearly provides an excellent visualization here.

Since this visualization is so clear, we can use it to examine the results of our various clustering methods and get some insight into how they differ.

Hide code cell source
k = 3
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=10,random_state=0)
kmeans.fit_predict(dtm)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
error = kmeans.inertia_

with sns.axes_style("white"):
    fig, ax = plt.subplots(1,1,figsize=(7,7))
    cmap = sns.hls_palette(n_colors=3, h=0.35, l=0.4, s=0.9)
    for i in range(k):
        point_indices = np.where(labels == i)[0]
        point_indices = point_indices.tolist()
        plt.scatter(np.ravel(Xk[point_indices, 0]), np.ravel(Xk[point_indices, 1]), s=20, alpha=0.5, color=cmap[i], marker='D',
label=news_data.target_names[i])
    sns.despine()
plt.title('Clusters On Full Dataset, Dimension = {}\nRand Score = {:0.3f}'.format(
    dtm.shape[1], metrics.adjusted_rand_score(labels, news_data.target)), size=20);
_images/5fa417bb56557f1f2592cc4a9d1afe15770f21b9ae8b85020e0b32e71f48b37c.png
Hide code cell source
k = 3
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=10,random_state=0)
kmeans.fit_predict(np.asarray(Xk[:,:2]))
centroids = kmeans.cluster_centers_
Xklabels = kmeans.labels_
error = kmeans.inertia_

with sns.axes_style("white"):
    fig, ax = plt.subplots(1,1,figsize=(7,7))
    cmap = sns.hls_palette(n_colors=3, h=0.35, l=0.4, s=0.9)
    for i in range(k):
        point_indices = np.where(Xklabels == i)[0]
        point_indices = point_indices.tolist()
        plt.scatter(np.ravel(Xk[point_indices,0]), np.ravel(Xk[point_indices,1]), s=20, alpha=0.5, color=cmap[i], marker='D')
    sns.despine()
plt.title('Clusters On PCA-reduced Dataset, Dimension = 2\nRand Score = {:0.3f}'.format(
    metrics.adjusted_rand_score(Xklabels, news_data.target)), size=20);
_images/56d4f24952a436118c7b96c9f107290c5fca5b8e68af6824e0fd974d96b0cb57.png
Hide code cell source
plt.figure(figsize=(8,4))
plt.subplot(121)
cmap = sns.hls_palette(n_colors=3, h=0.35, l=0.4, s=0.9)
for i in range(k):
        point_indices = np.where(labels == i)[0]
        point_indices = point_indices.tolist()
        plt.scatter(np.ravel(Xk[point_indices,0]), np.ravel(Xk[point_indices,1]), s=20, alpha=0.5, color=cmap[i], marker='D')
sns.despine()
plt.title('Dimension = {}\nRand Score = {:0.3f}'.format(
    dtm.shape[1], metrics.adjusted_rand_score(labels, news_data.target)), size=14)
plt.subplot(122)
cmap = sns.hls_palette(n_colors=3, h=0.35, l=0.4, s=0.9)
for i in range(k):
        point_indices = np.where(Xklabels == i)[0]
        point_indices = point_indices.tolist()
        plt.scatter(np.ravel(Xk[point_indices,0]), np.ravel(Xk[point_indices,1]), s=20, alpha=0.5, color=cmap[i], marker='D')
sns.despine()
plt.title('Dimension = 2\nRand Score = {:0.3f}'.format(
    metrics.adjusted_rand_score(Xklabels, news_data.target)), size=14);
_images/4131543b68e57c2af62938e2718c7635fb7a00b05ffba89338c430a6c577355f.png

What happens if we misjudge the number of clusters? Let’s form 6 clusters.

Hide code cell source
k = 6
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=10,random_state=0)
kmeans.fit_predict(np.asarray(Xk[:,:6]))
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
error = kmeans.inertia_

with sns.axes_style("white"):
    fig, ax = plt.subplots(1,1,figsize=(10,10))
    cmap = sns.hls_palette(n_colors=k, h=0.35, l=0.4, s=0.9)
    for i in range(k):
        point_indices = np.where(labels == i)[0]
        point_indices = point_indices.tolist()
        plt.scatter(np.ravel(Xk[point_indices,0]), np.ravel(Xk[point_indices,1]), s=20, alpha=0.5, color=cmap[i], marker='D')
    sns.despine()

plt.title(f'K means with six clusters; Rand Score {metrics.adjusted_rand_score(labels,news_data.target):0.2f}');
_images/9d8084432a95ad76a4997d2035fa2d6e58f1607df8cfbefde9783d4513762333.png
Hide code cell source
k = 6
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=10,random_state=0)
kmeans.fit_predict(dtm)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
error = kmeans.inertia_

with sns.axes_style("white"):
    fig, ax = plt.subplots(1,1,figsize=(10,10))
    cmap = sns.hls_palette(n_colors=k, h=0.35, l=0.4, s=0.9)
    for i in range(k):
        point_indices = np.where(labels == i)[0]
        point_indices = point_indices.tolist()
        plt.scatter(np.ravel(Xk[point_indices,0]), np.ravel(Xk[point_indices,1]), s=20, alpha=0.5, color=cmap[i], marker='D')
    sns.despine()
plt.title(f'K means with six clusters on full dataset; Rand Score {metrics.adjusted_rand_score(labels,news_data.target):0.2f}');
_images/6f64649f01860a2e7444028a185db19dc0664439a384d0da3d6928b0b89e0bb7.png

What about the other principal components? Are they useful for visualization?

A common approach is to look at all pairs of (low-numbered) principal components, to look for additional structure in the data.

Hide code cell source
k = 5
Xk = u[:,:k] @ np.diag(s[:k])
X_df = pd.DataFrame(Xk)
g = sns.PairGrid(X_df)
def pltColor(x,y,color):
    cmap = sns.hls_palette(n_colors=3, h=0.35, l=0.4, s=0.9)
    for i in range(3):
        point_indices = np.where(news_data.target == i)[0]
        point_indices = point_indices.tolist()
        plt.scatter(x[point_indices], y[point_indices], color=cmap[i], s = 3)
    sns.despine()
g.map(pltColor);
_images/251cdf6567fc472f93e2a3b1e28f4e92f20e45ef29bb38159dbcbee604c4ecd9.png