Linear Algebra Refresher#

# This cell (like the others below) assumes the notebook's standard setup has
# already been run, e.g.:
#   import numpy as np
#   import matplotlib.pyplot as plt
# Here `ut` refers to the course's plotting-utilities module.
ax = ut.plotSetup3d(-15,15,-15,15,-15,15,(12,8))
a2 = np.array([5.0,-13.0,-3.0])
a1 = np.array([1,-2.0,3])
b = np.array([6.0, 8.0, -5.0])
A = np.array([a1, a2]).T
bhat = A.dot(np.linalg.inv(A.T.dot(A))).dot(A.T).dot(b)
ax.text(a1[0],a1[1],a1[2],r'$\bf a_1$',size=20)
ax.text(a2[0],a2[1],a2[2],r'$\bf a_2$',size=20)
ax.text(b[0],b[1],b[2],r'$\bf b$',size=20)
ax.text(bhat[0],bhat[1],bhat[2],r'$A\mathbf{\hat{x}} = \mathbf{\hat{b}}$',size=20)
#ax.text(1,-4,-10,r'Span{$\bf a,b$}',size=16)
#ax.text(0.2,0.2,-4,r'$\bf 0$',size=20)
# plotting the span of v
ut.plotSpan3d(ax,a1,a2,'Green')
ut.plotPoint3d(ax,a1[0],a1[1],a1[2],'r')
ut.plotPoint3d(ax,a2[0],a2[1],a2[2],'r')
ut.plotPoint3d(ax,b[0],b[1],b[2],'b')
ut.plotPoint3d(ax,bhat[0],bhat[1],bhat[2],'b')
ax.plot([b[0],bhat[0]],[b[1],bhat[1]],'b--',zs=[b[2],bhat[2]],lw=2)
#ut.plotPoint3d(ax,0,0,0,'b')
ax.view_init(azim=26.0,elev=-77.0);

This lecture is a review of some aspects of linear algebra that are important for data science.

Given the prerequisites for this course, I assume that you learned all of this once.

What I want to do today is bring the material back into your mind fresh.

Vectors and Matrices#

A matrix is a rectangular array of numbers, for example:

\[\begin{split}X = \left[\begin{array}{rrr} 1 & -2 & 1 \\ 0 & 2 & - 8 \\ 6 & 5 &9 \end{array} \right]\end{split}\]

A matrix with only one column is called a column vector, or simply a vector.

Here are some examples.

These are vectors in \(\mathbb{R}^2\):

\[\begin{split} \begin{array}{ccc} {\bf u} = \left[\begin{array}{r}3\\-1\end{array}\right] & {\bf v} = \left[\begin{array}{c}.2\\.3\end{array}\right] & {\bf w} = \left[\begin{array}{c}w_1\\w_2\end{array}\right] \end{array} \end{split}\]

and these are vectors in \(\mathbb{R}^3\):

\[\begin{split} \begin{array}{ccc} {\bf u} = \left[\begin{array}{c}2\\3\\4\end{array}\right] & {\bf v} = \left[\begin{array}{r}-1\\0\\2\end{array}\right] & {\bf w} = \left[\begin{array}{c}w_1\\w_2\\w_3\end{array}\right] \end{array} \end{split}\]

We will use uppercase letters (\(X\)) for matrices and lowercase bold letters (\(\mathbf{u}\)) for vectors.
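As a minimal sketch, here is how these objects look in numpy, using the example values above (numpy arrays are the same representation used in the code cells of these notes):

import numpy as np

# the 3 x 3 matrix X from above (uppercase name for a matrix)
X = np.array([[1, -2,  1],
              [0,  2, -8],
              [6,  5,  9]])

# vectors in R^2 and R^3 (lowercase names)
u = np.array([3, -1])
v = np.array([-1, 0, 2])

print(X.shape)   # (3, 3)
print(u.shape)   # (2,)
print(v.shape)   # (3,)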

A vector like \(\left[\begin{array}{c}-2\\-1\end{array}\right]\) (also denoted \((-2,-1)\)) can be thought of as a point on the plane.

# No need to study this code unless you want to.
ax = ut.plotSetup(size=(12,6))
ut.centerAxes(ax)
ut.plotPoint(ax, -2, -1)
ut.plotPoint(ax, 2, 2)
ut.plotPoint(ax, 3, -1)
ax.plot(0, -2, '')
ax.plot(-4, 0, '')
ax.text(3.5, -1.1, '$(3,-1)$', size=20)
ax.text(2.5, 1.9, '$(2,2)$', size=20)
ax.text(-3.7, -1.1, '$(-2,-1)$', size=20);

Sometimes we draw an arrow from the origin to the point. This comes from physics, but can be a helpful visualization in any case.

ax = ut.plotSetup(size=(12,6))
ut.centerAxes(ax)
ax.arrow(0, 0, -2, -1, head_width=0.2, head_length=0.2, length_includes_head = True)
ax.arrow(0, 0, 2, 2, head_width=0.2, head_length=0.2, length_includes_head = True)
ax.arrow(0, 0, 3, -1, head_width=0.2, head_length=0.2, length_includes_head = True)
ax.plot(0, -2, '')
ax.plot(-4, 0, '')
ax.plot(0, 2, '')
ax.plot(4 ,0, '')
ax.text(3.5, -1.1, '$(3,-1)$', size=20)
ax.text(2.5, 1.9, '$(2,2)$', size=20)
ax.text(-3.7, -1.1, '$(-2,-1)$', size=20);

Vector Addition, Geometrically#

A geometric interpretation of vector sum is as a parallelogram. If \({\bf u}\) and \({\bf v}\) in \(\mathbb{R}^2\) are represented as points in the plane, then \({\bf u} + {\bf v}\) corresponds to the fourth vertex of the parallelogram whose other vertices are \({\bf u}, 0,\) and \({\bf v}\).

ax = ut.plotSetup(size=(12,6))
ut.centerAxes(ax)
ax.arrow(0, 0, 1, 2, head_width=0.2, head_length=0.2, length_includes_head = True)
ax.arrow(0, 0, 4, 1, head_width=0.2, head_length=0.2, length_includes_head = True)
ax.plot([4,5],[1,3],'--')
ax.plot([1,5],[2,3],'--')
ax.text(5.25,3.25,r'${\bf u}$+${\bf v}$',size=20)
ax.text(1.5,2.5,r'${\bf u}$',size=20)
ax.text(4.75,1,r'${\bf v}$',size=20)
ut.plotPoint(ax,5,3)
ax.plot(0,0,'');
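Numerically, the vector sum is just elementwise addition. A quick numpy check, using the same \(\mathbf{u}\) and \(\mathbf{v}\) as in the figure:

import numpy as np

u = np.array([1, 2])
v = np.array([4, 1])

# the fourth vertex of the parallelogram with vertices u, 0, and v
print(u + v)   # [5 3]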

Matrix Multiplication#

Using vector addition and multiplication by scalars, we can form equations involving vectors.

We then make the following equivalence:

A vector equation

\[ x_1{\bf a_1} + x_2{\bf a_2} + \dots + x_n{\bf a_n} = {\bf b} \]

can also be written as

\[ A\mathbf{x} = \mathbf{b}\]

where

\[ A = [{\bf a_1} \; {\bf a_2} \; \dots \; {\bf a_n}] \]

and

\[\begin{split}\mathbf{x} = \left[\begin{array}{c}x_1\\x_2\\\vdots\\x_n\end{array}\right]\end{split}\]

We then extend this to define matrix multiplication as

Definition. If \(A\) is an \(m \times n\) matrix and \(B\) is \(n \times p\) matrix with columns \({\bf b_1},\dots,{\bf b_p},\) then the product \(AB\) is defined as the \(m \times p\) matrix whose columns are \(A{\bf b_1}, \dots, A{\bf b_p}.\) That is,

\[ AB = A[{\bf b_1} \; \dots \; {\bf b_p}] = [A{\bf b_1} \; \dots \; A{\bf b_p}]. \]
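A small numpy sketch of this column-by-column view of the product (the matrices here are arbitrary examples):

import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])        # 3 x 2
B = np.array([[ 1, 0, 2],
              [-1, 3, 1]])    # 2 x 3

AB = A @ B                    # 3 x 3

# column j of AB is A times column j of B
for j in range(B.shape[1]):
    assert np.allclose(AB[:, j], A @ B[:, j])

print(AB)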

Continuing on, we can define the inverse of a matrix.

We have to recognize that this inverse does not exist for all matrices.

  • It only exists for square matrices

  • And not even for all square matrices – only those that are “invertible.”

A matrix that is not invertible is called a singular matrix.

Definition. A matrix \(A\) is called invertible if there exists a matrix \(C\) such that

\[ AC = I \mbox{ and } CA = I. \]

In that case \(C\) is called the inverse of \(A\). Clearly, \(C\) must also be square and the same size as \(A\).

The inverse of \(A\) is denoted \(A^{-1}.\)
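In numpy, np.linalg.inv computes the inverse of an invertible square matrix and raises an error for a singular one. A minimal sketch with arbitrary example matrices:

import numpy as np

A = np.array([[2., 1.],
              [5., 3.]])

C = np.linalg.inv(A)

# C is the inverse of A:  AC = I and CA = I
print(np.allclose(A @ C, np.eye(2)))   # True
print(np.allclose(C @ A, np.eye(2)))   # True

# a singular (non-invertible) matrix has no inverse
S = np.array([[1., 2.],
              [2., 4.]])
try:
    np.linalg.inv(S)
except np.linalg.LinAlgError as err:
    print('singular matrix:', err)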

Inner Product#

Now we introduce the inner product, also called the dot product.

The inner product is defined for two vectors of the same length. To compute it, we transpose one of the vectors into a single row; the result is a single number.

For example, the expression

\[\begin{split}[3\;5\;1] \left[\begin{array}{c}2\\-1\\4\end{array}\right]\end{split}\]

is the same as

\[(3\cdot 2) + (5\cdot -1) + (1 \cdot 4)\]

which is the sum of the products of the corresponding entries, and yields the scalar value 5.

This is the inner product of the two vectors.

Thus the inner product of \(\mathbf{x}\) and \(\mathbf{y}\) is \(\mathbf{x}^T \mathbf{y} = \mathbf{y}^T \mathbf{x}\).

The general definition of the inner product is:

\[\begin{split}[x_1\;x_2\;\dots\;x_n] \left[\begin{array}{c}y_1\\y_2\\\vdots\\y_n\end{array}\right] = \sum_{i=1}^n x_i y_i\end{split}\]
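In numpy, the inner product can be computed with np.dot or the @ operator. Checking the example above:

import numpy as np

x = np.array([3, 5, 1])
y = np.array([2, -1, 4])

print(np.dot(x, y))   # 5
print(x @ y)          # 5, the same thing
print(y @ x)          # 5, since the inner product is symmetric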

Vector Norm#

If \(\mathbf{v}\) is in \(\mathbb{R}^n\), with entries \(v_1,\dots,v_n,\) then the square root of \(\mathbf{v}^T\mathbf{v}\) is defined because \(\mathbf{v}^T\mathbf{v}\) is nonnegative.

Definition. The \(\ell_2\) norm of \(\mathbf{v}\) is the nonnegative scalar \(\Vert\mathbf{v}\Vert_2\) defined by

\[\Vert\mathbf{v}\Vert_2 = \sqrt{\mathbf{v}^T\mathbf{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} = \sqrt{\sum_{i=1}^n{v_i}^2}.\]

If we leave the \(_2\) subscript off, it is implied.

The \(\ell_2\) norm of \(\mathbf{v}\) is its length in the usual sense.

This follows directly from the Pythagorean theorem:

ax = ut.plotSetup(-1,6,-1,4,(12,8))
ut.centerAxes(ax)
pt = [4., 3.]
plt.plot([0,pt[0]],[0,pt[1]],'b-',lw=2)
plt.plot([pt[0],pt[0]],[0,pt[1]],'b-',lw=2)
plt.plot([0,pt[0]],[0,0],'b-',lw=2)
ut.plotVec(ax,pt)
ax.text(2,-0.75,r'$|v_1|$',size=16)
ax.text(4.25,1.4,r'$|v_2|$',size=16)
ax.text(0.5,1.75,r'$\sqrt{v_1^2+v_2^2}$',size=16)
ax.text(pt[0]+0.1,pt[1]+0.1,r'$\mathbf{v}$',size=20);
ax = ut.plotSetup3d(0,5,0,7,0,9,(12,8))
f = 1.25
v = [4*f,4*f,6*f]
ut.plotCube(ax,v)
ax.text(v[0]+.1,v[1]+.1,v[2]+.1,r'$\bf v$',size=20)
ax.plot([0,v[0]],[0,v[1]],'r--',zs=[0,v[2]])
ax.plot([v[0]],[v[1]],'ro',zs=[v[2]])
ax.text(0.5,2,-0.5,r'$\sqrt{v_1^2+v_2^2+v_3^2}$',size=16);

A vector of length 1 is called a unit vector.

If we divide a nonzero vector \(\mathbf{v}\) by its length – that is, multiply by \(1/\Vert\mathbf{v}\Vert\) – we obtain a unit vector \(\mathbf{u}\).

We say that we have normalized \(\mathbf{v}\), and that \(\mathbf{u}\) is in the same direction as \(\mathbf{v}.\)
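A quick numpy sketch of computing a norm and normalizing, using the same vector as in the 2D figure above:

import numpy as np

v = np.array([4., 3.])

# the l2 norm: sqrt(v^T v)
norm_v = np.linalg.norm(v)
print(norm_v)                # 5.0
print(np.sqrt(v @ v))        # the same value

# normalizing v gives a unit vector in the same direction
u = v / norm_v
print(u)                     # [0.8 0.6]
print(np.linalg.norm(u))     # 1.0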

Distance in \(\mathbb{R}^n\)#

It’s very useful to be able to talk about the distance between two points (or vectors) in \(\mathbb{R}^n\).

Definition. For \(\mathbf{u}\) and \(\mathbf{v}\) in \(\mathbb{R}^n,\) the distance between \(\mathbf{u}\) and \(\mathbf{v}\), written as \(\mbox{dist}(\mathbf{u},\mathbf{v}),\) is the length of the vector \(\mathbf{u}-\mathbf{v}\). That is,

\[\mbox{dist}(\mathbf{u},\mathbf{v}) = \Vert\mathbf{u}-\mathbf{v}\Vert.\]

This definition agrees with the usual formulas for the Euclidean distance between two points. The usual formula is

\[\mbox{dist}(\mathbf{u},\mathbf{v}) = \sqrt{(v_1-u_1)^2 + (v_2-u_2)^2 + \dots + (v_n-u_n)^2}.\]

You can see that this is equal to

\[\begin{split}\Vert\mathbf{u}-\mathbf{v}\Vert = \sqrt{(\mathbf{u}-\mathbf{v})^T(\mathbf{u}-\mathbf{v})} = \sqrt{\left[\begin{array}{cccc}u_1-v_1&u_2-v_2&\dots&u_n-v_n\end{array}\right]\left[\begin{array}{c}u_1-v_1\\u_2-v_2\\\vdots\\u_n-v_n\end{array}\right]}\end{split}\]

There is another important reformulation of distance. Consider the squared distance:

\[\Vert\mathbf{u}-\mathbf{v}\Vert^2 = (\mathbf{u}-\mathbf{v})^T(\mathbf{u}-\mathbf{v})\]

Expanding this out, we get:

\[\Vert\mathbf{u}-\mathbf{v}\Vert^2 = \mathbf{u}^T\mathbf{u} + \mathbf{v}^T\mathbf{v} - 2\mathbf{u}^T\mathbf{v}.\]
\[= \Vert\mathbf{u}\Vert^2 + \Vert\mathbf{v}\Vert^2 - 2\mathbf{u}^T\mathbf{v}.\]

This is an important formula because it relates distance between two points to their lengths and inner product.
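A minimal numpy check of both formulas, on arbitrary example vectors:

import numpy as np

u = np.array([7., 1., 2.])
v = np.array([3., 2., 2.])

# distance as the length of u - v
dist = np.linalg.norm(u - v)
print(dist)                        # 4.1231... (the square root of 17)

# squared distance via lengths and the inner product
d2 = u @ u + v @ v - 2 * (u @ v)
print(np.isclose(dist**2, d2))     # True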

Orthogonal Vectors#

Now we turn to another familiar notion from 2D geometry, which we’ll generalize to \(\mathbb{R}^n\): the notion of being perpendicular.

We will generally use the fancier term orthogonal – which means the same thing as perpendicular.

We will say that two vectors are orthogonal if they form a right angle at the origin.

ax = ut.plotSetup(-1,4,-2,3)
ax.set_aspect('equal')
# ut.noAxes(ax)
u = np.array([2, 2])
v = np.array([1, -1])
# v
ut.plotLinEqn(u[0], u[1], 0, '-', 'b')
ut.plotLinEqn(v[0], v[1], 0, '-', 'g')
ax.text(v[0]-0.5,v[1]-0.5,r'$\bf v$',size=16)
ut.plotVec(ax,v,'b')
# u
ut.plotVec(ax,u,'g')
ax.text(u[0]+0.25,u[1],r'$\bf u$',size=16)
# origin
ut.plotVec(ax, np.array([0, 0]), 'k')
ax.text(0-0.1, 0-0.5, r'$\bf 0$', size=16);
# symbol
perpline1, perpline2 = ut.perp_sym(np.array([0, 0]), u, v, 0.4)
plt.plot(perpline1[0], perpline1[1], 'k', lw = 1)
plt.plot(perpline2[0], perpline2[1], 'k', lw = 1);

Draw the line connecting \(\mathbf{u}\) and \(\mathbf{v}\).

Then \(\mathbf{u}\) and \(\mathbf{v}\) are orthogonal if and only if they make a right triangle with the origin.

So \(\mathbf{u}\) and \(\mathbf{v}\) are orthogonal if and only if the Pythagorean Theorem is satisfied for this triangle.

ax = ut.plotSetup(-1,4,-2,3)
ax.set_aspect('equal')
# ut.noAxes(ax)
u = np.array([2, 2])
v = np.array([1, -1])
plt.plot([v[0],u[0]],[v[1],u[1]],'r-',lw=2)
# v
ut.plotLinEqn(u[0], u[1], 0, '-', 'b')
ut.plotLinEqn(v[0], v[1], 0, '-', 'g')
ax.text(v[0]-0.5,v[1]-0.5,r'$\bf v$',size=16)
ut.plotVec(ax,v,'b')
# u
ut.plotVec(ax,u,'g')
ax.text(u[0]+0.25,u[1],r'$\bf u$',size=16)
# origin
ut.plotVec(ax, np.array([0, 0]), 'k')
ax.text(0-0.1, 0-0.5, r'$\bf 0$', size=16);

What is the length of the red side of the triangle?

According to the definition of length based on norm, it is \(\Vert \mathbf{u} - \mathbf{v} \Vert\).

ax = ut.plotSetup(-1,4,-2,3)
ax.set_aspect('equal')
# ut.noAxes(ax)
u = np.array([2, 2])
v = np.array([1, -1])
plt.plot([v[0],u[0]],[v[1],u[1]],'r-',lw=2)
# v
ut.plotLinEqn(u[0], u[1], 0, '-', 'b')
ut.plotLinEqn(v[0], v[1], 0, '-', 'g')
ax.text(v[0]/2-0.5,v[1]/2-0.5,r'$\Vert\bf v\Vert$',size=16)
ut.plotVec(ax,v,'b')
# u
ut.plotVec(ax,u,'g')
ax.text(u[0]/2-.75,u[1]/2,r'$\Vert\bf u\Vert$',size=16)
# origin
ut.plotVec(ax, np.array([0, 0]), 'k')
#
mm = (u + v) / 2
ax.text(mm[0]+0.3,mm[1],r'$\Vert\bf u - v\Vert$',size=16);

So the blue and green lines are perpendicular if and only if:

\[ \Vert \mathbf{u} - \mathbf{v} \Vert^2 = \Vert \mathbf{u} \Vert^2 + \Vert \mathbf{v} \Vert^2\]

First let’s simplify the expression for squared distance from \(\mathbf{u}\) to \(\mathbf{v}\):

\[ \Vert\mathbf{u}-\mathbf{v}\Vert^2 = (\mathbf{u}-\mathbf{v})^T(\mathbf{u}-\mathbf{v}) \]
\[ = \mathbf{u}^T\mathbf{u} - \mathbf{u}^T\mathbf{v} - \mathbf{v}^T\mathbf{u} + \mathbf{v}^T\mathbf{v}\]
\[ = \Vert\mathbf{u}\Vert^2 + \Vert\mathbf{v}\Vert^2 - 2\mathbf{u}^T\mathbf{v}\]

Now, let’s go back to the Pythagorean Theorem.

\( \mathbf{u}\) and \( \mathbf{v} \) are perpendicular if and only if:

\[ \Vert \mathbf{u} - \mathbf{v} \Vert^2 = \Vert \mathbf{u} \Vert^2 + \Vert \mathbf{v} \Vert^2\]

But we’ve seen that this means:

\[ \Vert\mathbf{u}\Vert^2 + \Vert\mathbf{v}\Vert^2 - 2\mathbf{u}^T\mathbf{v}= \Vert \mathbf{u} \Vert^2 + \Vert \mathbf{v} \Vert^2\]

Subtracting \(\Vert \mathbf{u} \Vert^2 + \Vert \mathbf{v} \Vert^2\) from both sides, this holds if and only if \(\mathbf{u}^T\mathbf{v} = 0.\)

So now we can define perpendicularity in \(\mathbb{R}^n\):

Definition. Two vectors \(\mathbf{u}\) and \(\mathbf{v}\) in \(\mathbb{R}^n\) are orthogonal to each other if \(\mathbf{u}^T\mathbf{v} = 0.\)

This gives us a very simple rule for determining whether two vectors are orthogonal: they are orthogonal if and only if their inner product is zero.
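In code, this test is a one-liner: compute the inner product and compare it to zero (allowing for floating-point error). Using the same \(\mathbf{u}\) and \(\mathbf{v}\) as in the figures above:

import numpy as np

u = np.array([2., 2.])
v = np.array([1., -1.])

print(u @ v)                    # 0.0, so u and v are orthogonal
print(np.isclose(u @ v, 0.0))   # True

# the Pythagorean check gives the same answer
lhs = np.linalg.norm(u - v)**2
rhs = np.linalg.norm(u)**2 + np.linalg.norm(v)**2
print(np.isclose(lhs, rhs))     # True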

The Angle Between Two Vectors#

There is a further connection between the inner product of two vectors and the angle between them.

This connection is very useful (e.g., in visualizing data mining operations).

We start from the law of cosines:

\[ c^2 = a^2 + b^2 - 2ab\cos\theta\]
ax = ut.plotSetup(-6,6,-1,4,(12,6))
u = np.array([4.,1])
v = np.array([-4.,3])
pt = u + v
plt.plot([u[0],v[0]],[u[1],v[1]],'b-',lw=2)
plt.plot([0,u[0]],[0,u[1]],'b-',lw=2)
plt.plot([0,v[0]],[0,v[1]],'b-',lw=2)
m = (u-v)/2.0
mm = v + m
ax.text(mm[0]+0.25,mm[1]+0.25,r'${\bf c}$',size=16)
ax.text(2,0.15,r'${\bf a}$',size=16)
ax.text(-3,1,r'${\bf b}$',size=16)
ax.annotate('', xy=(0.2*v[0], 0.2*v[1]),  xycoords='data',
                xytext=(0.3*u[0], 0.3*u[1]), textcoords='data',
                size=15,
                #bbox=dict(boxstyle="round", fc="0.8"),
                arrowprops={'arrowstyle': 'simple',
                                'fc': '0', 
                                'ec': 'none',
                                'connectionstyle' : 'arc3,rad=0.5'},
                )
ax.text(0.9,0.8,r'$\theta$',size=20)
plt.axis('off');

Now let’s interpret this law in terms of vectors \(\mathbf{u}\) and \(\mathbf{v}\).

Once again, it is the angle that these vectors make at the origin that we are concerned with:

ax = ut.plotSetup(-6,6,-1,4,(12,6))
ut.centerAxes(ax)
u = np.array([4.,1])
v = np.array([-4.,3])
pt = u + v
plt.plot([u[0],v[0]],[u[1],v[1]],'b-',lw=2)
plt.plot([0,u[0]],[0,u[1]],'b-',lw=2)
plt.plot([0,v[0]],[0,v[1]],'b-',lw=2)
ut.plotVec(ax,u)
ut.plotVec(ax,v)
m = (u-v)/2.0
mm = v + m
ax.text(mm[0]+0.25,mm[1]+0.25,r'$\Vert{\bf u-v}\Vert$',size=16)
ax.text(2,0.15,r'$\Vert{\bf u}\Vert$',size=16)
ax.text(-3,1,r'$\Vert{\bf v}\Vert$',size=16)
ax.annotate('', xy=(0.2*v[0], 0.2*v[1]),  xycoords='data',
                xytext=(0.3*u[0], 0.3*u[1]), textcoords='data',
                size=15,
                #bbox=dict(boxstyle="round", fc="0.8"),
                arrowprops={'arrowstyle': 'simple',
                                'fc': '0', 
                                'ec': 'none',
                                'connectionstyle' : 'arc3,rad=0.5'},
                )
ax.text(0.9,0.8,r'$\theta$',size=20);

Applying the law of cosines we get:

\[\Vert\mathbf{u}-\mathbf{v}\Vert^2 = \Vert\mathbf{u}\Vert^2 + \Vert\mathbf{v}\Vert^2 - 2\Vert\mathbf{u}\Vert\Vert\mathbf{v}\Vert\cos\theta\]

And by definition:

\[ \Vert\mathbf{u}-\mathbf{v}\Vert^2 = \Vert\mathbf{u}\Vert^2 + \Vert\mathbf{v}\Vert^2 - 2\mathbf{u}^T\mathbf{v}\]

So

\[ \mathbf{u}^T\mathbf{v} = \Vert\mathbf{u}\Vert\Vert\mathbf{v}\Vert\cos\theta\]

This is a very important connection between the notion of inner product and trigonometry.

One implication in particular concerns unit vectors.

\[ \mathbf{u}^T\mathbf{v} = \Vert\mathbf{u}\Vert\Vert\mathbf{v}\Vert\cos\theta\]

So

\[ \frac{\mathbf{u}^T\mathbf{v}}{\Vert\mathbf{u}\Vert\Vert\mathbf{v}\Vert} = \cos\theta\]
\[ \frac{\mathbf{u}^T}{\Vert\mathbf{u}\Vert}\frac{\mathbf{v}}{\Vert\mathbf{v}\Vert} = \cos\theta\]

So we have a very simple rule: for two unit vectors, the inner product is the cosine of the angle between them!

u = np.array([0.95, np.sin(np.arccos(0.95))])
theta = (np.pi/3)+np.arccos(0.95)
v = np.array([np.cos(theta), np.sin(theta)])
ax = ut.plotSetup(-1.3,1.3,-1.3,1.3,(12,6))
ut.centerAxes(ax)
plt.axis('equal')
angles = 2.0*np.pi * np.array(range(101))/100.0
plt.plot(np.cos(angles),np.sin(angles),'r-')
ax.arrow(0,0,u[0],u[1],head_width=0.07, head_length=0.1,length_includes_head = True)
ax.arrow(0,0,v[0],v[1],head_width=0.07, head_length=0.1,length_includes_head = True)
ax.annotate('', xy=(0.2*v[0], 0.2*v[1]),  xycoords='data',
                xytext=(0.3*u[0], 0.3*u[1]), textcoords='data',
                size=15,
                #bbox=dict(boxstyle="round", fc="0.8"),
                arrowprops={'arrowstyle': 'simple',
                                'fc': '0', 
                                'ec': 'none',
                                'connectionstyle' : 'arc3,rad=0.5'})
mid = 0.4*(u+v)/2.0
ax.text(mid[0],mid[1],r'$\theta$',size=20)
ax.text(u[0]+.1,u[1]+.1,r'$(%0.2f,%0.2f)$'% (u[0],u[1]),size=16)
ax.text(v[0]+.1,v[1]+.1,r'$(%0.2f,%0.2f)$'% (v[0],v[1]),size=16);

Here \(\mathbf{u} = \left[\begin{array}{c}0.95\\0.31\end{array}\right],\) and \(\mathbf{v} = \left[\begin{array}{c}0.20\\0.98\end{array}\right].\)

So \(\mathbf{u}^T\mathbf{v} = (0.95\cdot 0.20) + (0.31 \cdot 0.98) \approx 0.5.\) (The coordinates shown are rounded to two decimals; with the exact coordinates the inner product is exactly 0.5.)

So \(\cos\theta = 0.5.\)

So \(\theta = 60\) degrees.
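We can verify this numerically; here is a minimal sketch that reconstructs the two unit vectors the same way the plotting code above does:

import numpy as np

# two unit vectors that are 60 degrees apart
u = np.array([0.95, np.sin(np.arccos(0.95))])
theta = np.pi / 3 + np.arccos(0.95)
v = np.array([np.cos(theta), np.sin(theta)])

cos_angle = u @ v                          # inner product of unit vectors
print(cos_angle)                           # 0.5 (up to floating point)
print(np.degrees(np.arccos(cos_angle)))    # 60.0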

Least Squares#

In many cases we have a linear system

\[ A \mathbf{x} = \mathbf{b}\]

that has no solution – perhaps due to noise or measurement error.

In such a case, we generally look for an \(\mathbf{x}\) such that \(A\mathbf{x}\) is a good approximation to \(\mathbf{b}.\)

We can think of the quality of the approximation of \(A\mathbf{x}\) to \(\mathbf{b}\) as the distance from \(A\mathbf{x}\) to \(\mathbf{b},\) which is

\[\Vert A\mathbf{x} - \mathbf{b}\Vert.\]

The general least-squares problem is to find an \(\mathbf{x}\) that makes \(\Vert A\mathbf{x}-\mathbf{b}\Vert\) as small as possible.

Just to make this explicit: say that we denote \(A\mathbf{x}\) by \(\mathbf{y}\). Then

\[\Vert A\mathbf{x}-\mathbf{b}\Vert^2 = \sum_i (y_i-b_i)^2\]

Where we interpret \(y_i\) as the estimated value and \(b_i\) as the measured value.

So this expression is the sum of squared error. This is the most common measure of error used in statistics. Note that we could also say we are minimizing the \(\ell_2\) norm of the error.

This is a key principle!

Minimizing the length of \(A\mathbf{x} - \mathbf{b}\) is the same as minimizing the sum of squared error.

An equivalent (and more common) way to express this is:

\[\hat{\mathbf{x}} = \arg\min_\mathbf{x} \Vert A\mathbf{x} - \mathbf{b}\Vert.\]

which emphasizes that this is a minimization problem, also called an optimization problem.

We can find \(\hat{\mathbf{x}}\) using either

  • geometric arguments (based on projecting \(\mathbf{b}\) on the column space of \(A\)), or

  • by calculus (taking the derivative of \(\Vert A\mathbf{x} - \mathbf{b}\Vert^2\) with respect to \(\mathbf{x}\) and setting it equal to zero).

Either way, we get the result that:

\(\hat{\mathbf{x}}\) is the solution of:

\[A^TA\mathbf{x} = A^T\mathbf{b}\]

These equations are called the normal equations, and we can prove that they always have at least one solution.

When \(A^TA\) is invertible (which happens exactly when the columns of \(A\) are linearly independent), the solution is unique, and we can solve \(A^TA\mathbf{x} = A^T\mathbf{b}\) as

\[\hat{\mathbf{x}} = (A^TA)^{-1}A^T\mathbf{b}\]
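A minimal numpy sketch of solving a least-squares problem, using the same \(A\) and \(\mathbf{b}\) as in the figure below. In practice np.linalg.lstsq is preferred over forming \((A^TA)^{-1}\) explicitly, but both routes are shown for comparison:

import numpy as np

a1 = np.array([1., -2., 3.])
a2 = np.array([5., -13., -3.])
b = np.array([6., 8., -5.])
A = np.array([a1, a2]).T     # 3 x 2 matrix with columns a1 and a2

# solve the normal equations directly
x_hat = np.linalg.inv(A.T @ A) @ A.T @ b

# the numerically preferred route
x_lstsq, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_hat, x_lstsq))   # True
print(A @ x_hat)                     # the projection of b onto the column space of A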
ax = ut.plotSetup3d(-15,15,-15,15,-15,15,(12,8))
a2 = np.array([5.0,-13.0,-3.0])
a1 = np.array([1,-2.0,3])
b = np.array([6.0, 8.0, -5.0])
A = np.array([a1, a2]).T
bhat = A.dot(np.linalg.inv(A.T.dot(A))).dot(A.T).dot(b)
ax.text(a1[0],a1[1],a1[2],r'$\bf a_1$',size=20)
ax.text(a2[0],a2[1],a2[2],r'$\bf a_2$',size=20)
ax.text(b[0],b[1],b[2],r'$\bf b$',size=20)
ax.text(bhat[0],bhat[1],bhat[2],r'$A\mathbf{\hat{x}} = \mathbf{\hat{b}}$',size=20)
#ax.text(1,-4,-10,r'Span{$\bf a,b$}',size=16)
#ax.text(0.2,0.2,-4,r'$\bf 0$',size=20)
# plotting the span of v
ut.plotSpan3d(ax,a1,a2,'Green')
ut.plotPoint3d(ax,a1[0],a1[1],a1[2],'r')
ut.plotPoint3d(ax,a2[0],a2[1],a2[2],'r')
ut.plotPoint3d(ax,b[0],b[1],b[2],'b')
ut.plotPoint3d(ax,bhat[0],bhat[1],bhat[2],'b')
ax.plot([b[0],bhat[0]],[b[1],bhat[1]],'b--',zs=[b[2],bhat[2]],lw=2)
#ut.plotPoint3d(ax,0,0,0,'b')
ax.view_init(azim=26.0,elev=-77.0);

Eigendecomposition#

Consider a square matrix \(A\). An eigenvector of \(A\) is a special vector which does not change its direction when multiplied by \(A\).

Example.

Let \(A = \left[\begin{array}{rr}3&-2\\1&0\end{array}\right], \mathbf{u} = \left[\begin{array}{r}-1\\1\end{array}\right], \mathbf{v}=\left[\begin{array}{r}2\\1\end{array}\right].\) The images of \(\mathbf{u}\) and \(\mathbf{v}\) under multiplication by \(A\) are shown here:

ax = ut.plotSetup(size=(12,8))
ut.centerAxes(ax)
A = np.array([[3,-2],[1,0]])
u = np.array([-1,1])
v = np.array([2,1])
#
ut.plotArrowVec(ax, v, [0,0], color='Blue')
ut.plotArrowVec(ax, A.dot(v), [0,0], color='Blue')
ax.text(v[0],v[1]+0.2,r'${\bf v}$',size=20)
ax.text(A.dot(v)[0],A.dot(v)[1]+0.2,r'$A{\bf v}$',size=20)
#
ut.plotArrowVec(ax, u, [0,0], color='Red')
ut.plotArrowVec(ax, A.dot(u), [0,0], color='Red')
ax.text(u[0]-0.5,u[1]+0.1,r'${\bf u}$',size=20)
ax.text(A.dot(u)[0]-0.7,A.dot(u)[1]+0.3,r'$A{\bf u}$',size=20);

Definition.

An eigenvector of an \(n\times n\) matrix \(A\) is a nonzero vector \(\mathbf{x}\) such that \(A\mathbf{x} = \lambda\mathbf{x}\) for some scalar \(\lambda.\)

A scalar \(\lambda\) is called an eigenvalue of \(A\) if there is a nontrivial solution \(\mathbf{x}\) of \(A\mathbf{x} = \lambda\mathbf{x}.\)

Such an \(\mathbf{x}\) is called an eigenvector corresponding to \(\lambda.\)

An \(n \times n\) matrix has at most \(n\) distinct eigenvalues, and at most \(n\) linearly independent eigenvectors.
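We can check the example above numerically with np.linalg.eig. (A minimal sketch; note that numpy scales the eigenvectors it returns to unit norm, so they may differ from hand-computed eigenvectors by a scalar factor.)

import numpy as np

A = np.array([[3., -2.],
              [1., 0.]])

lam, P = np.linalg.eig(A)   # eigenvalues and unit-norm eigenvectors
print(lam)                  # the eigenvalues 1 and 2 (order may vary)

# v = (2, 1) is an eigenvector with eigenvalue 2:  A v = 2 v
v = np.array([2., 1.])
print(A @ v)                # [4. 2.]

# u = (-1, 1) is not an eigenvector:  A u is not a multiple of u
u = np.array([-1., 1.])
print(A @ u)                # [-5. -1.]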

In some cases, the matrix can be factored using its eigenvectors and eigenvalues.

This is called diagonalization.

A square matrix \(A\) is said to be diagonalizable if it can be expressed as:

\[A = PDP^{-1}\]

for some invertible matrix \(P\) and some diagonal matrix \(D\).

In fact, \(A = PDP^{-1},\) with \(D\) a diagonal matrix, if and only if the columns of \(P\) are \(n\) linearly independent eigenvectors of \(A.\)

In that case, \(D\) holds the corresponding eigenvalues of \(A\).

That is,

\[\begin{split}D = \left[\begin{array}{cccc}\lambda_1&0&\cdots&0\\0&\lambda_2&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&\lambda_n\end{array}\right]\end{split}\]

and

\[ P = [\mathbf{v}_1\;\mathbf{v}_2\;\cdots\;\mathbf{v}_n].\]
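A small numpy sketch of diagonalization, reusing the example matrix from above: np.linalg.eig returns the eigenvalues (which form \(D\)) and a matrix whose columns are eigenvectors (which plays the role of \(P\)):

import numpy as np

A = np.array([[3., -2.],
              [1., 0.]])

lam, P = np.linalg.eig(A)
D = np.diag(lam)

# reconstruct A from its eigendecomposition: A = P D P^{-1}
print(np.allclose(P @ D @ np.linalg.inv(P), A))   # True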

A special case occurs when \(A\) is symmetric, i.e., \(A = A^T.\)

In that case, eigenvectors of \(A\) corresponding to distinct eigenvalues are orthogonal, and in fact we can always choose a full set of \(n\) mutually orthogonal eigenvectors.

That is, any two eigenvectors in this set have inner product zero.

Furthermore, we can scale each eigenvector to have unit norm, in which case we say that the eigenvectors are orthonormal.

So, let’s put this all together.

  • The diagonal entries of \(D\) are the eigenvalues of \(A\).

  • The columns of \(P\) are the eigenvectors of \(A\).

For a symmetric matrix \(A\):

  • distinct columns of \(P\) are orthogonal

  • each column of \(P\) has unit norm

Hence, we find that the diagonal elements of \(P^TP\) are 1, and the off-diagonal elements are 0.

Hence \(P^TP = I\).

Since \(P\) is square, this means \(P^{-1} = P^T\) (and so \(PP^T = I\) as well).

So, in the special case of a symmetric matrix, we can decompose \(A\) as:

\[A = PDP^T.\]

This amazing fact is known as the spectral theorem.

This decomposition of \(A\) is its spectral decomposition.

(The eigenvalues of a matrix are also called its spectrum.)
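For symmetric matrices, numpy provides np.linalg.eigh, which returns real eigenvalues and orthonormal eigenvectors. A minimal check of the spectral decomposition on an arbitrary symmetric example:

import numpy as np

# a symmetric matrix: A equals its transpose
A = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])

lam, P = np.linalg.eigh(A)   # eigh is specialized for symmetric matrices
D = np.diag(lam)

print(np.allclose(P.T @ P, np.eye(3)))   # True: the columns of P are orthonormal
print(np.allclose(P @ D @ P.T, A))       # True: A = P D P^T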

These facts will be key to understanding the Singular Value Decomposition, which we will study later on.