Neural Networks I – Gradient Descent#

The “Unreasonable” Effectiveness of Deep Neural Networks#

Deep Neural Networks have been effective in many applications.

_images/IntroModels.svg
_images/IntroModels2a.svg

Understanding Deep Learning, Simon J.D. Prince, MIT Press, 2023

Emergent Behavior in Pre-Trained Large Language Models#

Emergence


Emergent Abilities of Large Language Models. J. Wei et al., Oct. 26, 2022.

Theory Sometimes Follows Invention#

| Invention | Theory |
|---|---|
| Telescope (1608) | Optics (1650-1700) |
| Steam Engine (1695-1715) | Thermodynamics (1824…) |
| Electromagnetism (1820) | Electrodynamics (1821) |
| Sailboat (??) | Aerodynamics (1757), Hydrodynamics (1738) |
| Airplane (1885-1905) | Wing Theory (1907-1918) |
| Computer (1941-1945) | Computer Science (1950-1960) |
| Teletype (1906) | Information Theory (1948) |

  • But once theory is developed, it can more quickly improve the invention

  • The same can be said for neural networks: the theory of how to make them work is well understood, while the theory of why they work is still developing.

  • We’ll balance theory and application


The Power and Limits of Deep Learning, Yann LeCun, March 2019.

Underlying all these techniques is the idea of applying optimization techniques to minimize some kind of “loss” function.

Loss Functions for Model Fitting#

Most of the machine learning we have studied this semester is based on the idea that we have a model that is parameterized, and our goal is to find good settings for the parameters.

We have seen example after example of this problem.

  • In \(k\)-means, our goal was to find \(k\) cluster centroids, so that the \(k\)-means objective was minimized.

  • In linear regression, our goal was to find a parameter vector \(\beta\) so that the sum of squared error \(\Vert \mathbf{y} - \hat{\mathbf{y}}\Vert_2^2\) was minimized.

  • In the support vector machine, our goal was to find a parameter vector \(\theta\) so that classification error was minimized.

And similarly we’ll want to find good parameter settings in neural networks.

It’s time now to talk about how, in general, one can find “good settings” for the parameters in problems like these.

What allows us to unify our approach to many such problems is the following:

First, we start by defining an error function, generally called a loss function, to describe how well our method is doing.

And second, we choose loss functions that are differentiable with respect to the parameters.

These two requirements mean that we can think of the parameter tuning problem using surfaces like these:

_images/L23-convex_cost_function.jpeg

Imagine that the \(x\) and \(y\) axes in these pictures represent parameter settings. That is, we have two parameters to set, corresponding to the values of \(x\) and \(y\).

For each \((x, y)\) setting, the \(z\)-axis shows the value of the loss function.

What we want to do is find the minimum of a surface, corresponding to the parameter settings that minimize loss.

Notice the difference between the two kinds of surfaces.

The surface on the left corresponds to a strictly convex loss function. If we find a local minimum of this function, it is a global minimum.

The surface on the right corresponds to a non-convex loss function. There are local minima that are not globally minimal.

Both kinds of loss functions arise in machine learning.

For example, convex loss functions arise in

  • Linear regression

  • Logistic regression

While non-convex loss functions arise in

  • \(k\)-means

  • Gaussian Mixture Modeling

  • and of course neural networks

Gradient Descent Intuitively#

The intuition of gradient descent is the following.

Imagine you are lost in the mountains, and it is foggy out. You want to find a valley. But since it is foggy, you can only see the local area around you.

_images/L23-fog-in-the-mountains.jpeg

The natural thing to do is:

  1. Look around you 360 degrees.

  2. Observe in which direction the ground is sloping downward most steeply.

  3. Take a few steps in that direction.

  4. Repeat the process
    … until the ground seems to be level.

The key to this intuitive idea is formalizing the idea of “direction of steepest descent.”

This is where the differentiability of the loss function comes into play.

As long as the loss function is locally differentiable, we can define the direction of steepest descent (really, ascent).

That direction is called the gradient.

Derivatives on Single Variable Functions#

We’ll build up to the concept of the gradient by starting with derivatives of single-variable functions.

Let’s start with a simple quadratic function.

\[ f(x) = 3x^2 - 4x +5\]

Which we can write in Python as well.

def f(x):
  return 3*x**2 - 4*x + 5

And we can plot it.

_images/23-NN-I-Gradient-Descent_29_0.png

Let’s assume for a minute that this is our loss function that we are minimizing.

Question

What do we know about where the minimum is in terms of the slope of the curve?

Answer

It is necessary but not sufficient that the slope be zero.

Question

How do we calculate the slope?

We take the derivative, denoted

\[ \frac{d f(x)}{dx} \hspace{10pt} \textrm{Leibniz' notation} \]

or

\[ f'(x) \hspace{10pt} \textrm{Lagrange's notation} \]

You may see both notations. The nice thing about Leibniz’ notation is that it makes partial derivatives easy to express when we get to multivariate differentiation shortly.

We can take the derivative of \(f(x)\):

\[ f(x) = 3x^2 - 4x +5\]

By definition of the derivative, the function \(f(x)\) is differentiable at \(x\) if

\[ \lim_{h\to 0} \frac{f(x+h)-f(x)}{h} \]

exists. When it exists, this limit is the value of the derivative \(f'(x)\).
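As a quick numerical sanity check, we can approximate this limit with a small but nonzero \(h\). (A sketch: the helper numerical_derivative and the choice of h are our own, not part of the lecture code.)

def numerical_derivative(f, x, h=1e-6):
    # finite-difference approximation of f'(x) for small h
    return (f(x + h) - f(x)) / h

numerical_derivative(f, 3.0)   # ~14.0, matching f'(3) = 6(3) - 4 = 14 from the rules below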

We use the rules of derivatives. See for example the derivative rules for basic functions, e.g.

\[ \frac{d}{dx} x^a = ax^{a-1}, \quad \textrm{e.g.} \quad \frac{d}{dx} 3x^2 = 6x \quad \textrm{,} \quad \frac{d}{dx} 6x = 6 \quad \textrm{,} \quad \frac{d}{dx} 6 = 0 \]

so

\[ \frac{d f(x)}{dx} = 6x - 4 \]
# define the derivative of f as df

def df(x):
    return 6*x - 4

We can solve for where \(\frac{d}{dx} f(x) = 0\):

\[ 6x - 4 = 0 \quad \Rightarrow \quad x = \frac{2}{3} \]

# Evaluate df and f at the x where df = 0
x_zero = 2/3

# The derivative is zero here
df(x_zero)
0.0

# And f at that value is
f(x_zero)
3.666666666666667

Which we can add to the plot of \(f(x)\) to see if it indeed is at the minimum.

_images/23-NN-I-Gradient-Descent_46_0.png

Now as Wikipedia states,

The derivative of a function of a single variable at a chosen input value, when it exists, is the slope of the tangent line to the graph of the function at that point.

Slope of a Function#

We can explore the tangent at different x-values.
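Here is a minimal sketch of how such a tangent-line plot can be generated, using the f and df defined above (the plotting choices are arbitrary):

import numpy as np
import matplotlib.pyplot as plt

a = -2                                   # point of tangency
xs = np.linspace(-4, 4, 100)
plt.plot(xs, f(xs), label='f(x)')
# tangent line at x = a: y = f(a) + f'(a) * (x - a)
plt.plot(xs, f(a) + df(a) * (xs - a), '--', label=f'tangent at x = {a}')
plt.scatter([a], [f(a)])
plt.legend()
plt.show()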

Slope Shows Influence of \(x\) on \(f\)#

Important Note:

  • if the slope is negative, then by increasing \(x\), we will decrease \(f(x)\).

  • And if the slope is positive, then decreasing \(x\) will decrease \(f(x)\).

Interpretation of Slope#

Let’s illustrate with this function \(f(x)\) a useful way to interpret the slope.

In the graph above, with \(x=-2\), we see the slope, call it \(m\), is \(-16\). What that means is that when we change the value of \(x\), the impact on the output will roughly be amplified by \(m\), or \(-16\) when \(x=-2\).

Put another way, the slope (equivalently the derivative) of a function \(f(x)\) at an input \(x\) indicates how sensitive the output is to changes in the input.
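We can check this sensitivity numerically, with dx as an arbitrary small change:

dx = 0.01
m = df(-2)                    # slope at x = -2, i.e. -16
m * dx                        # predicted change in output: -0.16
f(-2 + dx) - f(-2)            # actual change: about -0.1597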

This will be key to understanding how we have to tweak the weights of our model to minimize our loss function.

Gradient Descent on a Linear Regression Model#

Now, in two or more dimensions there are many directions that descend, but we want to pick the direction of steepest descent. We’ll formalize that idea.

As long as the loss function is locally differentiable, we can define the direction of steepest descent.

That direction is given by the negative of the gradient.

The gradient is a generalization of the slope of a line.

Let’s say we have a loss function \(\mathcal{L}(\mathbf{w})\).

The components of \(\mathbf{w}\in\mathbb{R}^n\) are the parameters we want to optimize.

Just a reminder that \( \mathbf{w} \in \mathbb{R}^n \) denotes an n-dimensional vector.

For linear regression, the loss function could be squared loss:

\[ \mathcal{L}(\mathbf{w}) = \Vert\mathbf{y} - \hat{\mathbf{y}}\Vert^2 \]

where \(\hat{\mathbf{y}}\) is our estimate, i.e., \(\hat{\mathbf{y}} = X\mathbf{w}\), so that

\[ \mathcal{L}(\mathbf{w}) = \Vert\mathbf{y} - X\mathbf{w}\Vert^2 \]

To find the gradient, we take the partial derivative of our loss function with respect to each parameter:

\[ \frac{\partial \mathcal{L}}{\partial w_i} \]

and collect all the partial derivatives into a vector of the same shape as \(\mathbf{w}\):

\[\begin{split} \nabla_\mathbf{w}\mathcal{L} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial w_1}\\ \frac{\partial \mathcal{L}}{\partial w_2}\\ \vdots \\ \frac{\partial \mathcal{L}}{\partial w_n} \end{bmatrix} \end{split}\]

When you see the notation \(\nabla_\mathbf{w}\mathcal{L},\) think of it as the derivative with respect to the vector \(\mathbf{w}\).

The nabla symbol, \(\nabla\), denotes the vector differentiator operator called del.

It turns out that if we are going to take a small step of unit length, then the gradient is the direction that maximizes the increase in the loss function.

_images/L23-gradient-of-convex.png

As you can see from the above figure, in general the gradient varies depending on where you are in the parameter space.

So we write:

\[\begin{split} \nabla_\mathbf{w}\mathcal{L}(\mathbf{w}) = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial w_1}(\mathbf{w})\\ \frac{\partial \mathcal{L}}{\partial w_2}(\mathbf{w})\\ \vdots \\ \frac{\partial \mathcal{L}}{\partial w_n}(\mathbf{w}) \end{bmatrix} \end{split}\]

Each time we seek to improve our parameter estimates \(\mathbf{w}\), we will take a step in the negative direction of the gradient.

… “negative direction” because the gradient specifies the direction of maximum increase – and we want to decrease the loss function.

How big a step should we take?

For the step size, we will use a scalar value, here denoted by the Greek letter “eta”, \(\eta\), which we call the learning rate.

The learning rate is a hyperparameter that needs to be tuned for a given problem, or it can even be adapted as the algorithm progresses, as we will see later.

Now we can write the gradient descent algorithm formally:

  1. Start with an initial parameter estimate \(\mathbf{w}^0\).

  2. Update: \(\mathbf{w}^{n+1} = \mathbf{w}^n - \eta \nabla_\mathbf{w}\mathcal{L}(\mathbf{w}^n)\)

  3. If not converged, go to step 2.

How do we know if we are “converged”?

Typically we stop

  • after a certain number of iterations, or

  • when the loss has not improved by more than a fixed amount for some number of steps – early stopping
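As a minimal sketch, the second criterion might be coded like this (the helper converged and the tolerance tol are our own choices, not part of the lecture code):

def converged(losses, tol=1e-8):
    # stop when the most recent improvement in loss falls below tol
    return len(losses) > 1 and losses[-2] - losses[-1] < tol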

Example: Linear Regression#

Let’s say we have this dataset.

_images/23-NN-I-Gradient-Descent_70_0.png

Let’s fit a least-squares line to this data.

The loss function for this problem is the least-squares error:

\[\mathcal{L}(\mathbf{\beta}) = \Vert\mathbf{y} - X\mathbf{\beta}\Vert^2\]

Of course, we know how to solve this problem using the normal equations, but let’s do it using gradient descent instead.

Here is the line we’d like to find:

_images/23-NN-I-Gradient-Descent_73_0.png

There are \(n = 10\) data points, whose \(x\) and \(y\) values are stored in xlin and y.

First, let’s create our \(X\) (design) matrix, and include a column of ones to model the intercept:

X = np.column_stack([np.ones((n, 1)), xlin])

Now, let’s visualize the loss function \(\mathcal{L}(\mathbf{\beta}) = \Vert \mathbf{y}-X\mathbf{\beta}\Vert^2.\)

_images/23-NN-I-Gradient-Descent_77_0.png

We won’t take you through computing the gradient for this problem (you can find it in the online text).

We’ll just tell you that the gradient for a least squares problem (up to a constant factor of 2, which can be absorbed into the learning rate) is:

\[\nabla_\beta \mathcal{L}(\mathbf{\beta}) = X^T X \beta - X^T\mathbf{y} \]

Note

For those interested in a little more insight into what these plots are showing, here is the derivation.

We start from the rule that \(\Vert \mathbf{v}\Vert = \sqrt{\mathbf{v}^T\mathbf{v}}\).

Applying this rule to our loss function:

\[ \mathcal{L}(\mathbf{\beta}) = \Vert \mathbf{y} - X\mathbf{\beta} \Vert^2 = \beta^T X^T X \beta - 2\mathbf{\beta}^TX^T\mathbf{y} + \mathbf{y}^T\mathbf{y} \]

The first term, \(\beta^T X^T X \beta\), is a quadratic form, and it is what makes this surface curved. As long as \(X\) has independent columns, \(X^TX\) is positive definite, so the overall shape is a paraboloid opening upward, and the surface has a unique minimum point.

To find the gradient, we can use standard calculus rules for derivatives involving vectors. The rules are not complicated, but the bottom line is that in this case, you can almost use the same rules you would if \(\beta\) were a scalar:

\[\nabla_\beta \mathcal{L}(\mathbf{\beta}) = 2X^T X \beta - 2X^T\mathbf{y} \]

And by the way – since we’ve computed the derivative as a function of \(\beta\), instead of using gradient descent, we could simply solve for the point where the gradient is zero. This is the optimal point which we know must exist:

\[ \nabla_\beta \mathcal{L}(\mathbf{\beta}) = 0 \]
\[ 2X^T X \beta - 2X^T\mathbf{y} = 0 \]
\[ X^T X \beta = X^T\mathbf{y} \]

Which, of course, are the normal equations for this linear system.

So here is our code for gradient descent:

import numpy as np

def loss(X, y, beta):
    # squared-error loss ||y - X beta||^2
    return np.linalg.norm(y - X @ beta) ** 2

def gradient(X, y, beta):
    # gradient of the loss (up to a constant factor of 2)
    return X.T @ X @ beta - X.T @ y

def gradient_descent(X, y, beta_hat, eta, nsteps = 1000):
    losses = [loss(X, y, beta_hat)]
    betas = [beta_hat]
    for step in range(nsteps):
        # the gradient step
        beta_hat = beta_hat - eta * gradient(X, y, beta_hat)
        # accumulate statistics
        losses.append(loss(X, y, beta_hat))
        betas.append(beta_hat)
    return np.array(betas), np.array(losses)

We’ll start at an arbitrary point, say, \((-8, -3.2)\).

That is, \(\beta_0 = -8\), and \(\beta_1 = -3.2\).

beta_start = np.array([-8, -3.2])
eta = 0.002
betas, losses = gradient_descent(X, y, beta_start, eta)
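As a sanity check – a sketch, assuming the X and y defined above – the final iterate should be close to the closed-form solution of the normal equations:

# closed-form solution of X^T X beta = X^T y
beta_exact = np.linalg.solve(X.T @ X, X.T @ y)
print(betas[-1], beta_exact)   # the two should be close if GD has converged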

What happens to our loss function per GD iteration?

_images/23-NN-I-Gradient-Descent_85_0.png

And how do the parameter values \(\beta\) evolve?

_images/23-NN-I-Gradient-Descent_87_0.png

Notice that the improvement in loss decreases over time. Initially the gradient is steep and loss improves fast, while later on the gradient is shallow and loss doesn’t improve much per step.

Now remember that in reality we are like the person who is trying to find their way down the mountain, in the fog.

In general we cannot “see” the entire loss function surface.

Nonetheless, since we know what the loss surface looks like in this case, we can visualize the algorithm “moving” on that surface.

This visualization combines the last two plots into a single view.

We can also see how the evolution of the parameters translates into the line fitting the data.

Challenges in Gradient Descent#

Gradient Descent is a very general algorithm, one that can be applied to a huge array of problem types.

However, there are a variety of issues that arise in using gradient descent in practice.

Learning Rate#

Setting the learning rate can be a challenge.

Previously we had set the learning rate \(\eta = 0.002\).

Let’s set it a little higher and see what happens: \(\eta = 0.0065\).

beta_start = np.array([-8, -2])
eta = 0.0065
betas, losses = gradient_descent(X, y, beta_start, eta, nsteps = 100)
_images/23-NN-I-Gradient-Descent_99_0.png
_images/23-NN-I-Gradient-Descent_100_0.png

This is a total disaster. What is going on?

It is helpful to look at the progress of the algorithm using the loss surface:

_images/23-NN-I-Gradient-Descent_103_0.png

We can see what is going on more clearly here.

What is happening is that because the steps are too large, each step overshoots the local minimum.

The next step then lands on a portion of the surface that is steeper … and in the opposite direction.

And so the process diverges.

For an interesting comparison, try setting \(\eta = 0.0055\) and observe the evolution of \(\beta\).
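Using the code above, that experiment is one line (reusing beta_start from before):

betas, losses = gradient_descent(X, y, beta_start, 0.0055, nsteps = 100)
# re-plot losses and betas to see whether the iterates oscillate, converge, or diverge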

Hence it is important to decrease the step size when divergence appears.

Unfortunately, on a complicated loss surface, a given step size may diverge in one location or starting point, but not in another.

Complex Loss Surfaces#

The loss surface for linear regression is the best possible kind: it is strictly convex, so it has a single global minimum.

For neural networks, the loss surface is more complex.

In general, the larger the neural network, the more complex the loss surface.

And deep neural networks, especially transformers, have billions of parameters.

Here’s a visualization of the loss surface for the 56-layer neural network VGG-56, from Visualizing the Loss Landscape of Neural Nets.

_images/L23-complex-landscape.png

For a fun exploration, see https://losslandscape.com/explorer.

Recap#

So far we have applied gradient descent to a simple linear regression model.

As we’ll soon see, deep neural networks are much more complicated multi-stage models, with millions or billions of parameters to differentiate.

Fortunately, the Chain Rule from calculus gives us a relatively simple and scalable algorithm, called Back Propagation, that solves this problem.

Neuron and Neural Networks#

Now let’s switch gears a bit to define an artificial neuron. For better or worse it is named after and loosely modeled on a biological neuron.

_images/neuron.png

From cs231n

  • The dendrites carry impulses from other neurons at different distances.

  • Once the collective firing rate of the impulses exceeds a certain threshold, the neuron fires its own pulse through the axon to other neurons.

There are companies trying to mimic this impulse-based (i.e., spiking) neuron in silicon – so-called neuromorphic computing.

See for example Neuromorphic Computing or Spiking Neural Network

Some examples of companies and projects are Intel’s Loihi and startups such as GrAI Matter Labs with its VIP processor.

Artificial Neuron#

_images/neuron_model.jpeg

From cs231n

The more common artificial neuron

  • collects one or more inputs,

  • multiplies each input by its own weight,

  • sums the weighted inputs,

  • adds a bias,

  • and then finally (usually) applies a nonlinear activation function, as sketched in code below.
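Here is a minimal sketch of one artificial neuron in NumPy (the names neuron and sigmoid are our own, not a standard API):

import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    # weighted sum of the inputs, plus a bias, through a nonlinearity
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # three inputs
w = np.array([0.1, 0.4, -0.2])   # one weight per input
b = 0.05                         # bias
neuron(x, w, b)                  # a single activation value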

Multi-Layer Perceptron (MLP) or Fully Connected Network (FCN)#

_images/neural_net2.jpeg

From cs231n

Multiple artificial neurons can act on the same inputs, in what we call a layer, and we can stack multiple layers until we produce one or more outputs.

The example above shows a network with 3 inputs, two layers of 4 neurons each, followed by one layer that produces a single output value – e.g., a binary classifier.
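A minimal NumPy sketch of a forward pass through such a network, with layer sizes chosen to match the figure (the random weight initialization here is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# layer sizes matching the figure: 3 inputs -> 4 -> 4 -> 1 output
sizes = [3, 4, 4, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    # each layer: weighted sums plus biases, then the nonlinearity
    for W, b in zip(weights, biases):
        x = sigmoid(W @ x + b)
    return x

forward(np.array([1.0, -0.5, 2.0]))   # a single value in (0, 1)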

The activation function is typically some nonlinear function that compresses its input in some way. Historically, it has been the sigmoid and \(\tanh()\) functions. See for example Hyperbolic Functions.

_images/23-NN-I-Gradient-Descent_124_0.png

A more common activation function these days, and one that is more efficient to implement, is the Rectified Linear Unit, or ReLU.

\[ \textrm{ReLU}(x) = \mathrm{max}(0, x) \]
_images/23-NN-I-Gradient-Descent_126_0.png
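In NumPy, ReLU is a one-liner (a sketch):

import numpy as np

def relu(x):
    # elementwise max(0, x)
    return np.maximum(0, x)

relu(np.array([-2.0, -0.5, 0.0, 1.5]))   # array([0. , 0. , 0. , 1.5])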

There are many other variations. See for example PyTorch Non-linear Activations

Next Lecture#

  • We’ll build out a Value class

  • Visualize our compute graph

  • Implement Backpropagation

  • Build out our neural network

  • Train and evaluate it

  • Recreate and match it in PyTorch