Linear Models
Image: Sir Francis Galton, by Charles Wellington Furse (died 1904). National Portrait Gallery: NPG 3916.
Many parts of this page are based on Linear Algebra and its Applications, by David C. Lay
In 1886 Francis Galton published his observations about how random factors affect outliers.
This notion has come to be called “regression to the mean” because unusually large or small phenomena, after the influence of random events, become closer to their mean values (less extreme).
One of the most fundamental kinds of machine learning is the construction of a model that can be used to summarize a set of data.
A model is a concise description of a dataset or a real-world phenomenon.
For example, an equation can be a model if we use the equation to describe something in the real world.
The most common form of modeling is regression, which means constructing an equation that describes the relationships among variables.
Regression Problems
For example, we may look at these points and observe that they approximately lie on a line.
So we could decide to model this data using a line.
We may look at these points and decide to model them using a quadratic function.
And we may look at these points and decide to model them using a logarithmic function.
Clearly, none of these datasets agrees perfectly with the proposed model. So the question arises:
How do we find the best linear function (or quadratic function, or logarithmic function) given the data?
The Framework of Linear Models
The regression problem has been studied extensively in the fields of statistics and machine learning.
Certain terminology is used:
- Some values are referred to as “independent,” and
- Some values are referred to as “dependent.”
The basic regression task is:
- given a set of independent variables
- and the associated dependent variables,
- estimate the parameters of a model (such as a line, parabola, etc) that describes how the dependent variables are related to the independent variables.
The independent variables are collected into a matrix $X$, called the design matrix.
The dependent variables are collected into an observation vector $\mathbf{y}$.
The parameters of the model (for any kind of model) are collected into a parameter vector $\beta$.
Fitting a Line to Data
The first kind of model we’ll study is a linear equation:
$$y = \beta_0 + \beta_1 x.$$
This is the most commonly used type of model, particularly in fields like economics, psychology, biology, etc.
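The reason it is so commonly used is that, like Galton’s data, experimental data often produce points $(x_1, y_1), \dots, (x_n, y_n)$ that seem to lie close to a line.
The question we must confront is: given a set of data, how should we “fit” the equation of the line to the data?
Our intuition is this: we want to determine the parameters $\beta_0, \beta_1$ that define a line that is as “close” to the points as possible.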
Let’s develop some terminology for evaluating a model.
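Suppose we have a line $y = \beta_0 + \beta_1 x$. For each data point $(x_j, y_j)$, there is a point $(x_j, \beta_0 + \beta_1 x_j)$ on the line with the same $x$-coordinate.
Image credit: Linear Algebra and its Applications, by David C. Lay, 4th ed.
We call $y_j$ the observed value of $y$,
and we call $\beta_0 + \beta_1 x_j$ the predicted value of $y$.
The difference between an observed $y$-value and a predicted $y$-value is called a residual.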
There are several ways to measure how “close” the line is to the data.
The usual choice is to sum the squares of the residuals.
(Note that the residuals themselves may be positive or negative – by squaring them, we ensure that our error measures don’t cancel out.)
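The least-squares line is the line $y = \beta_0 + \beta_1 x$ that minimizes the sum of squares of the residuals.
The coefficients $\beta_0, \beta_1$ of the line are called regression coefficients.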
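To make these definitions concrete, here is a minimal sketch that computes predicted values, residuals, and the sum of squared residuals for two candidate lines. The data points, the candidate lines, and the helper name `sum_squared_residuals` are all made up for illustration.

```python
def sum_squared_residuals(beta0, beta1, xs, ys):
    """Sum of squared residuals of the line y = beta0 + beta1 * x."""
    total = 0.0
    for x, y_observed in zip(xs, ys):
        y_predicted = beta0 + beta1 * x      # point on the line with the same x
        residual = y_observed - y_predicted  # may be positive or negative
        total += residual ** 2               # squaring prevents cancellation
    return total

# Hypothetical data, chosen only for illustration.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 4.0, 8.0]

sse_a = sum_squared_residuals(1.0, 2.0, xs, ys)  # candidate line y = 1 + 2x
sse_b = sum_squared_residuals(0.0, 3.0, xs, ys)  # candidate line y = 3x
print(sse_a, sse_b)  # 2.0 6.0
```

By this measure the first candidate line is closer to the data than the second, although neither is necessarily the least-squares line.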
A Least-Squares Problem
Let’s imagine for a moment that the data fit a line perfectly.
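Then, if each of the data points happened to fall exactly on the line, the parameters $\beta_0$ and $\beta_1$ would satisfy the equations
$$\begin{aligned}\beta_0 + \beta_1 x_1 &= y_1\\ \beta_0 + \beta_1 x_2 &= y_2\\ &\;\;\vdots\\ \beta_0 + \beta_1 x_n &= y_n\end{aligned}$$
We can write this system as
$$X\beta = \mathbf{y},$$
where
$$X = \begin{bmatrix}1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n\end{bmatrix},\quad \beta = \begin{bmatrix}\beta_0\\ \beta_1\end{bmatrix},\quad \mathbf{y} = \begin{bmatrix}y_1\\ y_2\\ \vdots\\ y_n\end{bmatrix}.$$
Of course, if the data points don’t actually lie exactly on a line,
… then there are no parameters $\beta_0, \beta_1$ for which the predicted $y$-values in $X\beta$ equal the observed $y$-values in $\mathbf{y}$,
… and $X\beta = \mathbf{y}$ has no solution.
Now, since the data doesn’t fall exactly on a line, we have decided to seek the $\beta$ that minimizes the sum of squared residuals:
$$\sum_{i} (\beta_0 + \beta_1 x_i - y_i)^2 = \Vert X\beta - \mathbf{y}\Vert^2.$$
This is key: the sum of squares of the residuals is exactly the square of the distance between the vectors $X\beta$ and $\mathbf{y}$.
This is a least-squares problem, $A\mathbf{x} = \mathbf{b}$, just with different notation.
Computing the least-squares solution of $X\beta = \mathbf{y}$ is equivalent to finding the $\beta$ that determines the least-squares line.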
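The key observation, that the sum of squared residuals equals the squared distance between two vectors, can be checked numerically. The sketch below uses made-up data and a made-up candidate parameter vector; only the identity itself comes from the discussion above.

```python
import numpy as np

# Hypothetical data and a hypothetical candidate parameter vector.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 8.0])
beta = np.array([1.0, 2.0])          # [intercept, slope]

# Design matrix: a column of ones and a column holding the x values.
X = np.column_stack([np.ones_like(x), x])

# Sum of squared residuals, computed point by point ...
residuals = y - X @ beta
sum_of_squares = np.sum(residuals ** 2)

# ... and the squared distance between the vectors X @ beta and y.
squared_distance = np.linalg.norm(X @ beta - y) ** 2

print(sum_of_squares, squared_distance)  # equal up to floating-point rounding
```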
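Example 1. Find the equation $y = \beta_0 + \beta_1 x$ of the least-squares line that best fits the data points in the following table.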
x | y |
---|---|
2 | 1 |
5 | 2 |
7 | 3 |
8 | 3 |
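Solution. Use the $x$-coordinates of the data to build the design matrix $X$, and the $y$-coordinates to build the observation vector $\mathbf{y}$:
$$X = \begin{bmatrix}1 & 2\\ 1 & 5\\ 1 & 7\\ 1 & 8\end{bmatrix},\quad \mathbf{y} = \begin{bmatrix}1\\ 2\\ 3\\ 3\end{bmatrix}.$$
Now, to obtain the least-squares line, find the least-squares solution to $X\beta = \mathbf{y}$.
We do this via the method we learned last lecture (just with new notation):
$$X^TX\beta = X^T\mathbf{y}.$$
So, we compute:
$$X^TX = \begin{bmatrix}1&1&1&1\\ 2&5&7&8\end{bmatrix}\begin{bmatrix}1 & 2\\ 1 & 5\\ 1 & 7\\ 1 & 8\end{bmatrix} = \begin{bmatrix}4 & 22\\ 22 & 142\end{bmatrix}$$
$$X^T\mathbf{y} = \begin{bmatrix}1&1&1&1\\ 2&5&7&8\end{bmatrix}\begin{bmatrix}1\\ 2\\ 3\\ 3\end{bmatrix} = \begin{bmatrix}9\\ 57\end{bmatrix}$$
So the normal equations are:
$$\begin{bmatrix}4 & 22\\ 22 & 142\end{bmatrix}\begin{bmatrix}\beta_0\\ \beta_1\end{bmatrix} = \begin{bmatrix}9\\ 57\end{bmatrix}$$
Solving, we get:
$$\begin{bmatrix}\beta_0\\ \beta_1\end{bmatrix} = \begin{bmatrix}4 & 22\\ 22 & 142\end{bmatrix}^{-1}\begin{bmatrix}9\\ 57\end{bmatrix} = \frac{1}{84}\begin{bmatrix}142 & -22\\ -22 & 4\end{bmatrix}\begin{bmatrix}9\\ 57\end{bmatrix} = \begin{bmatrix}2/7\\ 5/14\end{bmatrix}$$
So the least-squares line has the equation
$$y = \frac{2}{7} + \frac{5}{14}x.$$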
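The arithmetic for Example 1 can be checked in exact rational arithmetic. This is only a sketch: it assumes the design matrix has a column of ones and a column of $x$ values, and it solves the resulting 2×2 normal equations with Cramer's rule rather than with a general-purpose solver.

```python
from fractions import Fraction

# Data points from the table in Example 1.
xs = [2, 5, 7, 8]
ys = [1, 2, 3, 3]
n = len(xs)

# Entries of X^T X and X^T y when X has a column of ones and a column of x values.
sum_x = sum(xs)                                  # 22
sum_xx = sum(x * x for x in xs)                  # 142
sum_y = sum(ys)                                  # 9
sum_xy = sum(x * y for x, y in zip(xs, ys))      # 57

# Solve the 2x2 normal equations with Cramer's rule, in exact arithmetic.
det = n * sum_xx - sum_x * sum_x                 # 84
beta0 = Fraction(sum_xx * sum_y - sum_x * sum_xy, det)
beta1 = Fraction(n * sum_xy - sum_x * sum_y, det)

print(beta0, beta1)   # 2/7 5/14
```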
The General Linear Model
Another way that the inconsistent linear system is often written is to collect all the residuals into a residual vector $\epsilon$.
Then an exact equation is
$$\mathbf{y} = X\beta + \epsilon.$$
Any equation of this form is referred to as a linear model.
In this formulation, the goal is to minimize the length of $\epsilon$, that is, $\Vert\epsilon\Vert$.
In some cases, one would like to fit data points with something other than a straight line.
For example, think of Gauss trying to find the equation for the orbit of Ceres.
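In cases like this, the matrix equation is still $X\beta = \mathbf{y}$, but the specific form of $X$ changes from one problem to the next.
The least-squares solution $\hat{\beta}$ is a solution of the normal equations
$$X^TX\beta = X^T\mathbf{y}.$$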
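Here is a small numerical sketch of the linear model viewed through its residual vector. The data are made up. The sketch solves the normal equations, then checks two consequences: the residual vector of the least-squares solution is orthogonal to the columns of the design matrix, and perturbing the parameters can only make the residual longer.

```python
import numpy as np

# Hypothetical design matrix (intercept column plus one variable) and observations.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 4.0, 8.0])

# Solve the normal equations X^T X beta = X^T y without forming an inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residual vector of the fitted model.
epsilon = y - X @ beta_hat

# An arbitrary other parameter vector gives a residual at least as long.
other_beta = beta_hat + np.array([0.5, -0.25])
other_epsilon = y - X @ other_beta

print(beta_hat)
print(np.linalg.norm(epsilon), np.linalg.norm(other_epsilon))
print(X.T @ epsilon)   # numerically close to the zero vector
```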
Least-Squares Fitting of Other Models
Most models have parameters, and the object of model fitting is to fix those parameters. Let’s talk about model parameters.
In model fitting, the parameters are the unknown. A central question for us is whether the model is linear in its parameters.
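For example, the model $y = \beta_0 e^{-\beta_1 x}$ is not linear in its parameters, because $\beta_1$ appears inside the exponential.
The model $y = \beta_0 e^{-2x}$ is linear in its parameters, even though it is not linear in $x$.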
For a model that is linear in its parameters, an observation is a linear combination of (arbitrary) known functions.
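In other words, a model that is linear in its parameters is
$$y = \beta_0 f_0(x) + \beta_1 f_1(x) + \dots + \beta_k f_k(x),$$
where $f_0, \dots, f_k$ are known functions and $\beta_0, \dots, \beta_k$ are the parameters.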
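One way to see what "linear in its parameters" buys us: each known function contributes one column of the design matrix, and after that the fitting step is unchanged. The sketch below is illustrative only. The helper name `design_matrix`, the choice of functions (1, $x$, and $\sin x$), and the weights are all assumptions made for the example.

```python
import numpy as np

def design_matrix(x, functions):
    """Stack f_0(x), ..., f_k(x) as the columns of a design matrix."""
    return np.column_stack([f(x) for f in functions])

# Hypothetical choice of known functions: 1, x, and sin(x).
functions = [np.ones_like, lambda x: x, np.sin]

x = np.linspace(0.0, 3.0, 7)
X = design_matrix(x, functions)

# Observations generated from known weights, so the fit can be checked.
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta

# Least-squares solution of X beta = y.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X.shape)     # (7, 3)
print(beta_hat)    # recovers true_beta up to rounding, since y has no noise
```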
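Example. Suppose data points $(x_1, y_1), \dots, (x_n, y_n)$ appear to lie along some sort of parabola instead of a straight line, and suppose we wish to approximate the data by an equation of the form
$$y = \beta_0 + \beta_1 x + \beta_2 x^2.$$
Describe the linear model that produces a “least squares fit” of the data by the equation.
Solution. The ideal relationship is $y = \beta_0 + \beta_1 x + \beta_2 x^2$.
Suppose the actual values of the parameters are $\beta_0, \beta_1, \beta_2$. Then the coordinates of the first data point satisfy
$$y_1 = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon_1,$$
where $\epsilon_1$ is the residual error between the observed value $y_1$ and the predicted $y$-value.
Each data point determines a similar equation:
$$\begin{aligned}y_1 &= \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon_1\\ y_2 &= \beta_0 + \beta_1 x_2 + \beta_2 x_2^2 + \epsilon_2\\ &\;\;\vdots\\ y_n &= \beta_0 + \beta_1 x_n + \beta_2 x_n^2 + \epsilon_n\end{aligned}$$
Clearly, this system can be written as $\mathbf{y} = X\beta + \epsilon$:
$$\begin{bmatrix}y_1\\ y_2\\ \vdots\\ y_n\end{bmatrix} = \begin{bmatrix}1 & x_1 & x_1^2\\ 1 & x_2 & x_2^2\\ \vdots & \vdots & \vdots\\ 1 & x_n & x_n^2\end{bmatrix}\begin{bmatrix}\beta_0\\ \beta_1\\ \beta_2\end{bmatrix} + \begin{bmatrix}\epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n\end{bmatrix}$$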
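As a sketch of how a quadratic fit looks in code, the example below generates noisy points around a known parabola and fits a model with columns 1, $x$, $x^2$ by solving the normal equations. The data, the true coefficients, and the noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data scattered around the parabola y = 1 + 2x - 0.5x^2.
x = np.linspace(-2.0, 4.0, 25)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.1, size=x.size)

# Design matrix with columns 1, x, x^2: the model is linear in its parameters
# even though it is quadratic in x.
X = np.column_stack([np.ones_like(x), x, x**2])

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)    # close to [1, 2, -0.5], up to the added noise
```

The same coefficients (in reverse order) should come out of `np.polyfit(x, y, 2)`, which is a convenient cross-check.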
Multiple Regression
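Suppose an experiment involves two independent variables, say $u$ and $v$, and one dependent variable, $y$.
A linear equation for predicting $y$ from $u$ and $v$ has the form
$$y = \beta_0 + \beta_1 u + \beta_2 v.$$
Since there is more than one independent variable, this is called multiple regression.
A more general prediction equation might have the form
$$y = \beta_0 + \beta_1 u + \beta_2 v + \beta_3 u^2 + \beta_4 uv + \beta_5 v^2.$$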
A least squares fit to equations like this is called a trend surface.
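To illustrate fitting a surface in two independent variables, here is a sketch that builds a design matrix for a second-order surface and recovers known weights. The variable names `u` and `v`, the particular second-order form (columns 1, $u$, $v$, $u^2$, $uv$, $v^2$), the helper `trend_surface_design`, and the data are assumptions for the example.

```python
import numpy as np

def trend_surface_design(u, v):
    """Columns 1, u, v, u^2, u*v, v^2 for a second-order surface."""
    return np.column_stack([np.ones_like(u), u, v, u**2, u * v, v**2])

# Hypothetical observations of two independent variables on a 3x3 grid.
u = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
v = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0])

X = trend_surface_design(u, v)

# Generate y from known weights so the recovered estimate can be checked.
true_beta = np.array([1.0, 0.5, -2.0, 0.25, 3.0, -1.0])
y = X @ true_beta

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X.shape)    # (9, 6)
print(beta_hat)   # recovers true_beta up to rounding, since y has no noise
```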
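In general, a linear model will arise whenever $y$ is to be predicted by an equation of the form
$$y = \beta_0 f_0(u, v) + \beta_1 f_1(u, v) + \dots + \beta_k f_k(u, v),$$
with $f_0, \dots, f_k$ any sort of known functions and $\beta_0, \dots, \beta_k$ unknown weights.
Example. In geography, local models of terrain are constructed from data $(u_1, v_1, y_1), \dots, (u_n, v_n, y_n)$, where $u_j$, $v_j$, and $y_j$ are latitude, longitude, and altitude, respectively.
Let’s take an example. Here is a set of points in $\mathbb{R}^3$:
Let’s describe the linear model that gives a least-squares fit to such data. The solution is called the least-squares plane.
Solution. We expect the data to satisfy these equations:
$$\begin{aligned}y_1 &= \beta_0 + \beta_1 u_1 + \beta_2 v_1 + \epsilon_1\\ y_2 &= \beta_0 + \beta_1 u_2 + \beta_2 v_2 + \epsilon_2\\ &\;\;\vdots\\ y_n &= \beta_0 + \beta_1 u_n + \beta_2 v_n + \epsilon_n\end{aligned}$$
This system has the matrix form $\mathbf{y} = X\beta + \epsilon$, where
$$\mathbf{y} = \begin{bmatrix}y_1\\ y_2\\ \vdots\\ y_n\end{bmatrix},\quad X = \begin{bmatrix}1 & u_1 & v_1\\ 1 & u_2 & v_2\\ \vdots & \vdots & \vdots\\ 1 & u_n & v_n\end{bmatrix},\quad \beta = \begin{bmatrix}\beta_0\\ \beta_1\\ \beta_2\end{bmatrix},\quad \epsilon = \begin{bmatrix}\epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n\end{bmatrix}.$$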
This example shows that the linear model for multiple regression has the same abstract form as the model for the simple regression in the earlier examples.
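We can see that the general principle is the same across all the different kinds of linear models.
Once $X$ is defined properly, the normal equations for $\beta$ have the same matrix form, no matter how many variables are involved.
Thus, for any linear model where $X^TX$ is invertible, the least-squares estimate $\hat{\beta}$ is given by
$$\hat{\beta} = (X^TX)^{-1}X^T\mathbf{y}.$$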
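Because the fitting step does not depend on what the columns of the design matrix mean, a single helper can serve every linear model. The sketch below fits a least-squares plane to synthetic terrain-like data. The helper name `least_squares_estimate`, the data, and the true coefficients are assumptions for illustration, and the helper solves the normal equations directly instead of forming an inverse.

```python
import numpy as np

def least_squares_estimate(X, y):
    """Solve the normal equations X^T X beta = X^T y (assumes X^T X is invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(1)

# Hypothetical terrain samples: two coordinates and a noisy altitude.
u = rng.uniform(0.0, 10.0, size=50)
v = rng.uniform(0.0, 10.0, size=50)
y = 100.0 + 3.0 * u - 2.0 * v + rng.normal(scale=0.5, size=50)

# Design matrix for a plane: columns 1, u, v.
X = np.column_stack([np.ones_like(u), u, v])

beta_hat = least_squares_estimate(X, y)
print(beta_hat)   # close to [100, 3, -2], up to the added noise
```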
Multiple Regression in Practice
Let’s see how powerful multiple regression can be on a real-world example.
A typical application of linear models is predicting house prices. Linear models have been used for this problem for decades, and when a municipality does a value assessment on your house, they typically use a linear model.
We can consider various measurable attributes of a house (its “features”) as the independent variables, and the most recent sale price of the house as the dependent variable.
For our case study, we will use the features:
- Lot Area (sq ft),
- Gross Living Area (sq ft),
- Number of Fireplaces,
- Number of Full Baths,
- Number of Half Baths,
- Garage Area (sq ft),
- Basement Area (sq ft)
So our design matrix $X$ will have 8 columns (including the constant for the intercept):
$$X\beta = \mathbf{y}$$
and it will have one row for each house in the data set, with $\mathbf{y}$ holding the corresponding sale prices.
We will use data from housing sales in Ames, Iowa from 2006 to 2009:
import numpy as np
import pandas as pd
df = pd.read_csv('data/ames-housing-data/train.csv')
df[['LotArea', 'GrLivArea', 'Fireplaces', 'FullBath', 'HalfBath', 'GarageArea', 'TotalBsmtSF', 'SalePrice']].head()
Index | LotArea | GrLivArea | Fireplaces | FullBath | HalfBath | GarageArea | TotalBsmtSF | SalePrice |
---|---|---|---|---|---|---|---|---|
0 | 8450 | 1710 | 0 | 2 | 1 | 548 | 856 | 208500 |
1 | 9600 | 1262 | 1 | 2 | 0 | 460 | 1262 | 181500 |
2 | 11250 | 1786 | 1 | 2 | 1 | 608 | 920 | 223500 |
3 | 9550 | 1717 | 1 | 1 | 0 | 642 | 756 | 140000 |
4 | 14260 | 2198 | 1 | 2 | 1 | 836 | 1145 | 250000 |
X_no_intercept = df[['LotArea', 'GrLivArea', 'Fireplaces', 'FullBath', 'HalfBath', 'GarageArea', 'TotalBsmtSF']].values
y = df['SalePrice'].values
Next we add a column of 1s to the design matrix, which adds a constant intercept to the model:
X = np.column_stack([np.ones(X_no_intercept.shape[0], dtype = 'int'), X_no_intercept])
X
array([[ 1, 8450, 1710, ..., 1, 548, 856],
[ 1, 9600, 1262, ..., 0, 460, 1262],
[ 1, 11250, 1786, ..., 1, 608, 920],
...,
[ 1, 9042, 2340, ..., 0, 252, 1152],
[ 1, 9717, 1078, ..., 0, 240, 1078],
[ 1, 9937, 1256, ..., 1, 276, 1256]])
Now let’s perform the least-squares regression:
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
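Forming the inverse of $X^TX$ explicitly mirrors the formula, but in practice one usually lets a solver do the work, which tends to be more robust when features live on very different scales. The sketch below compares three routes to the same estimate. It runs on a synthetic stand-in with invented features and coefficients, not on the Ames data, so the printed numbers have no connection to the housing results.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for a housing design matrix (NOT the Ames data):
# an intercept column plus three made-up features on very different scales.
n = 200
features = np.column_stack([
    rng.uniform(5_000, 15_000, size=n),   # e.g. a lot area in sq ft
    rng.uniform(800, 3_000, size=n),      # e.g. a living area in sq ft
    rng.integers(0, 3, size=n),           # e.g. a small count
])
X = np.column_stack([np.ones(n), features])
y = X @ np.array([10_000.0, 0.5, 40.0, 15_000.0]) + rng.normal(scale=5_000, size=n)

# Same estimate, three ways.
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y          # explicit inverse, as in the text
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)       # solve the normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # SVD-based least-squares solver

print(beta_inv)
print(beta_solve)
print(beta_lstsq)
```

On well-conditioned problems like this one the three results agree to many digits; the differences only start to matter when columns of the design matrix are nearly linearly dependent.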
What does our model tell us?
beta_hat
array([-2.92338280e+04, 1.87444579e-01, 3.94185205e+01, 1.45698657e+04,
2.29695596e+04, 1.62834807e+04, 9.14770980e+01, 5.11282216e+01])
We see that we have:
- beta_hat[0]: Intercept of -$29,234
- beta_hat[1]: Marginal value of one square foot of Lot Area: $0.19
- beta_hat[2]: Marginal value of one square foot of Gross Living Area: $39
- beta_hat[3]: Marginal value of one additional fireplace: $14,570
- beta_hat[4]: Marginal value of one additional full bath: $22,970
- beta_hat[5]: Marginal value of one additional half bath: $16,283
- beta_hat[6]: Marginal value of one square foot of Garage Area: $91
- beta_hat[7]: Marginal value of one square foot of Basement Area: $51
Is our model doing a good job?
There are many statistics for answering this question, but we’ll just look at the predictions versus the ground truth.
For each house we compute its predicted sale value according to our model:
y_hat = X @ beta_hat
And for each house, we’ll plot its predicted versus actual sale value:
We see that the model does a reasonable job for house values less than about $250,000.
For a better model, we’d want to consider more features of each house, and perhaps some additional functions such as polynomials as components of our model.
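One of the simplest statistics for judging a fit is the coefficient of determination, $R^2$, which compares the sum of squared residuals of the model to that of a baseline that always predicts the mean. The sketch below is run on synthetic data, not on the Ames data, and the helper name `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ss_residual = np.sum((y - y_hat) ** 2)
    ss_total = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_residual / ss_total

rng = np.random.default_rng(3)

# Synthetic data (not the Ames data): one feature plus an intercept.
x = rng.uniform(0.0, 10.0, size=100)
y = 5.0 + 2.0 * x + rng.normal(scale=1.0, size=100)
X = np.column_stack([np.ones_like(x), x])

# Fit by solving the normal equations, then predict.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

r2_model = r_squared(y, y_hat)
r2_baseline = r_squared(y, np.full_like(y, y.mean()))
print(r2_model)      # near 1 for a good fit
print(r2_baseline)   # 0.0 for the constant-mean baseline
```

A value near 1 means the model accounts for most of the variation in the observations; a value near 0 means it does little better than predicting the mean.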