Maximum Likelihood Estimation#
Maximum likelihood estimation is one of the most useful tools for parameter estimation. The technique is intuitive and has nice theoretical properties that make it the most popular method for learning parametric models.
Today’s lecture builds upon our previous lecture on model fitting. We will briefly review it here, but refer to the previous section for the mathematical details.

The Maximum Likelihood Principle#
We have seen some examples of estimators for parameters in the first lecture on parameter estimation. But where do estimators come from exactly?
Rather than making guesses about what might be a good estimator, we’d like a principle that we can use to derive good estimators for different models.
The most common principle is the maximum likelihood principle.
The Maximum Likelihood Estimate (MLE) of a parameter $\theta$ is the value of $\theta$ under which the observed data are most likely.
Formally, we say that the maximum likelihood estimator for $\theta$ is
$$\hat{\theta}_{\mathrm{ML}} = \underset{\theta}{\operatorname{arg\,max}} \; L(\theta),$$
which for a dataset of $n$ independent observations $x_1, \dots, x_n$ becomes
$$\hat{\theta}_{\mathrm{ML}} = \underset{\theta}{\operatorname{arg\,max}} \; \prod_{i=1}^{n} p(x_i \mid \theta).$$
This is the same thing as maximizing the log-likelihood:
$$\hat{\theta}_{\mathrm{ML}} = \underset{\theta}{\operatorname{arg\,max}} \; \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
We will drop the $\mathrm{ML}$ subscript for the remainder of the lecture.
Since the logarithm is a monotonically increasing function, the likelihood and the log-likelihood attain their maximum at the same parameter value.
Illustration of this property for a general (i.e., not a likelihood) function:

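As a quick numerical sanity check of this property (a sketch using an arbitrary positive function, not an actual likelihood), we can verify that a function and its logarithm are maximized at the same point:

# a positive function and its logarithm attain their maximum at the same point
import numpy as np

x = np.linspace(0.1, 5, 1000)
f = np.exp(-(x - 2)**2) * x                       # arbitrary positive function
print(x[np.argmax(f)], x[np.argmax(np.log(f))])   # both print the same x value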
Example. Looking back at the deaths-by-horse-kick data, we can see that there is clearly a maximum in the plot of the log-likelihood function, somewhere between 0.5 and 1.
If we use the value of $\lambda$ at which the log-likelihood attains its maximum, we obtain the Poisson model that best fits the data.

Computation of MLE#
Often, the MLE is found using calculus by locating a critical point:
$$\frac{\partial}{\partial \theta} \log L(\theta) = 0 \quad \text{and} \quad \frac{\partial^2}{\partial \theta^2} \log L(\theta) < 0.$$

The computation of the second derivative becomes redundant when a graph of the log-likelihood function is available and confirms that the critical point is indeed a maximum.
The table below lists some useful differentiation rules.

Rule | Formula |
---|---|
Natural logarithm | $\dfrac{d}{dx}\ln x = \dfrac{1}{x}$ |
Product rule | $(fg)' = f'g + fg'$ |
Quotient rule | $\left(\dfrac{f}{g}\right)' = \dfrac{f'g - fg'}{g^2}$ |
Chain rule | $\dfrac{d}{dx} f(g(x)) = f'(g(x))\, g'(x)$ |
Summary of the computational steps#
The steps to find the Maximum Likelihood Estimate $\hat{\theta}$ are:

1. Compute the likelihood function $L(\theta)$.
2. Compute the corresponding log-likelihood function $\log L(\theta)$.
3. Take the derivative of the log-likelihood function with respect to $\theta$.
4. Set the derivative to zero and solve for $\theta$ to find the MLE.
5. (Optional) Confirm that the obtained extremum is indeed a maximum, either by taking the second derivative or from the plot of the log-likelihood function.

A small numerical sketch of these steps is given below.
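As an illustration of the steps carried out numerically (a minimal sketch with synthetic data from an exponential model chosen only for this example; it is not one of the lecture's datasets), we can minimize the negative log-likelihood with scipy and compare the result with the closed-form answer:

# numerical MLE for the rate of an exponential model, as a sketch of the steps above
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)          # synthetic sample (true rate = 0.5)

# steps 1-2: (negative) log-likelihood of an exponential model with the given rate
def negloglik(rate):
    return -(len(x)*np.log(rate) - rate*np.sum(x))

# steps 3-4, done numerically: minimize the negative log-likelihood
res = minimize_scalar(negloglik, bounds=(1e-6, 10), method='bounded')
print(res.x, 1/np.mean(x))                        # numerical MLE vs. closed form 1 / sample mean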
Example.
Previously we looked at the following problem:
Suppose that $x_1, \dots, x_n$ is an observed sample from a model with a single unknown parameter $\theta$.
Here $n$ denotes the sample size.
Before, we found the corresponding log-likelihood function. This time we want to go a step further and find the MLE for $\theta$.
We know that the likelihood function is equal to the product of the model's probability (density) evaluated at each observed value.
This function is not easy to maximize directly, so let us look at the log-likelihood function instead.
The log-likelihood function is then the sum of the logarithms of these factors.
We take the derivative of the log-likelihood function with respect to $\theta$.
Setting the derivative to zero leads to an equation in $\theta$ alone.
This suggests a candidate for the MLE of $\theta$: the solution of this equation.

The figure shows that this critical point is indeed a maximum of the log-likelihood.
Example.
Let’s revisit Bortkiewicz’s horse-kick data as well. Recall that Bortkiewicz had collected counts of deaths by horse kick in Prussian army corps, 200 yearly observations in total, and was curious whether they occurred at a constant, fixed rate.
Deaths Per Year | Observed Instances |
---|---|
0 | 109 |
1 | 65 |
2 | 22 |
3 | 3 |
4 | 1 |
5 | 0 |
6 | 0 |
Previously we started fitting the data to a Poisson distribution. We found that the likelihood function is given by
$$L(\lambda) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}.$$
From the likelihood function we obtained the log-likelihood function:
$$\log L(\lambda) = \sum_{i=1}^{n} x_i \log \lambda - n\lambda - \sum_{i=1}^{n} \log(x_i!).$$
To find the maximum of the log-likelihood of the data as a function of $\lambda$, we take the derivative with respect to $\lambda$:
$$\frac{d}{d\lambda} \log L(\lambda) = \frac{1}{\lambda}\sum_{i=1}^{n} x_i - n.$$
Setting the derivative to zero, we get the expression for the maximum likelihood estimator:
$$\hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
This is just the mean of the data, that is, the average number of deaths per year! From the available data we can compute that the mean is $(0\cdot 109 + 1\cdot 65 + 2\cdot 22 + 3\cdot 3 + 4\cdot 1)/200 = 0.61$.

The plot confirms that 0.61 is indeed a maximum. Therefore, the MLE for $\lambda$ is $\hat{\lambda} = 0.61$.
Using this estimate for $\lambda$, we can compare the counts expected under the Poisson model with the observed counts.

This comparison shows that the Poisson model is indeed a very good fit to the data!
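A short sketch of this comparison in Python, using the counts from the table above (the variable names are ours; scipy.stats.poisson supplies the Poisson probabilities):

# MLE of the Poisson rate from the horse-kick counts,
# and observed vs. expected instances under the fitted model
import numpy as np
from scipy.stats import poisson

deaths = np.array([0, 1, 2, 3, 4, 5, 6])
observed = np.array([109, 65, 22, 3, 1, 0, 0])
n = observed.sum()                                # 200 observations

lam_hat = np.sum(deaths*observed)/n               # sample mean = 0.61
expected = n*poisson.pmf(deaths, lam_hat)         # expected instances under the model

print(lam_hat)
print(np.round(expected, 1))                      # compare with the observed column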
From this, Bortkiewicz concluded that there was nothing particularly unusual about the years when there were many deaths by horse kick. They could be just what is expected if deaths occurred at a constant rate.
Example.
We also had looked at sample data $x_1, \dots, x_n$ from a normal distribution with unknown mean $\mu$ and standard deviation $\sigma$.
We found previously that the likelihood function is equal to
$$L(\mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right),$$
while the log-likelihood function is given by
$$\log L(\mu, \sigma) = -\frac{n}{2}\log(2\pi) - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
Replacing the constant term by $c = -\frac{n}{2}\log(2\pi)$, we can write
$$\log L(\mu, \sigma) = c - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
The partials with respect to $\mu$ and $\sigma$ are
$$\frac{\partial}{\partial \mu}\log L = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu), \qquad \frac{\partial}{\partial \sigma}\log L = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i - \mu)^2.$$
We assume that $\sigma > 0$, so setting the partial derivative with respect to $\mu$ to zero gives $\sum_{i=1}^{n}(x_i - \hat{\mu}) = 0$.
Thus,
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x},$$
the sample mean.
To find an estimate for the standard deviation we set the partial derivative with respect to $\sigma$ to zero,
$$-\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 = 0,$$
where $\hat{\mu} = \bar{x}$ is the estimate obtained above.
Thus, the potential MLE for $\sigma$ is
$$\hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$
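As a quick sanity check (a sketch with synthetic data; the true parameters chosen below are arbitrary), these closed-form expressions agree with the maximum likelihood fit returned by scipy.stats.norm.fit:

# compare the closed-form MLE with scipy's numerical normal fit
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)     # arbitrary true parameters

mu_hat = np.mean(x)
sigma_hat = np.sqrt(np.mean((x - mu_hat)**2))     # divides by n, not n-1

loc_fit, scale_fit = norm.fit(x)                  # scipy's MLE for loc and scale
print(mu_hat, loc_fit)
print(sigma_hat, scale_fit)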
The upcoming Python example will illustrate that the above pair $(\hat{\mu}, \hat{\sigma})$ indeed corresponds to a maximum of the log-likelihood.
Visualizing MLE#
We look at a sample of size 100 from a normal distribution. The log-likelihood function can be created in the following way.
# log-likelihood function for a normal distribution
# input: sample, parameter values (scalars or grids)
import numpy as np

# sample of size 100 from a normal distribution
# (the true mean and standard deviation below are placeholders for illustration)
rng = np.random.default_rng(0)
data = rng.normal(loc=5, scale=2, size=100)

def normloglik(X, muV, sigmaV):
    m = len(X)
    return -m/2*np.log(2*np.pi) - m*np.log(sigmaV)\
        - (np.sum(np.square(X)) - 2*muV*np.sum(X) + m*muV**2)/(2*sigmaV**2)
The MLE for $\mu$ and $\sigma$, together with the corresponding value of the log-likelihood, can then be computed as follows.
# MLE for the mean and the standard deviation
mu_hat = np.mean(data)
sig_hat = np.sqrt(np.var(data, axis=0, ddof=0))   # ddof=0: divide by n, as in the MLE
llik = normloglik(data, mu_hat, sig_hat)          # log-likelihood at the MLE
We can visualize the MLE on the surface plot.

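A sketch of how such a surface plot could be produced with matplotlib, reusing data, normloglik, mu_hat, sig_hat, and llik defined above (the grid ranges around the MLE are arbitrary choices):

# surface plot of the log-likelihood over a grid of (mu, sigma), with the MLE marked
import matplotlib.pyplot as plt

muV, sigmaV = np.meshgrid(np.linspace(mu_hat - 1, mu_hat + 1, 100),
                          np.linspace(sig_hat*0.5, sig_hat*1.5, 100))
Z = normloglik(data, muV, sigmaV)

ax = plt.figure().add_subplot(projection='3d')
ax.plot_surface(muV, sigmaV, Z, cmap='viridis', alpha=0.7)
ax.scatter(mu_hat, sig_hat, llik, color='red', s=50)   # the MLE
ax.set_xlabel('mu'); ax.set_ylabel('sigma'); ax.set_zlabel('log-likelihood')
plt.show()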
Alternatively, if one of the parameters is known, we can visualize the log-likelihood as a function of the unknown parameter and indicate the computed MLE.

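For instance, if $\sigma$ were known, a sketch of the one-dimensional plot (again reusing the objects defined above) could look as follows:

# log-likelihood as a function of mu only, with sigma treated as known,
# and the computed MLE for mu marked on the curve
import matplotlib.pyplot as plt

muGrid = np.linspace(mu_hat - 1, mu_hat + 1, 200)
ll = normloglik(data, muGrid, sig_hat)

plt.plot(muGrid, ll)
plt.axvline(mu_hat, color='red', linestyle='--', label='MLE of mu')
plt.xlabel('mu'); plt.ylabel('log-likelihood'); plt.legend()
plt.show()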
Consistency#
The maximum likelihood estimator has a number of properties that make it the preferred way to estimate parameters whenever possible.
In particular, under appropriate conditions, the MLE is consistent, meaning that as the number of data items grows large, the estimate converges to the true value of the parameter. We will introduce the concept of consistency more precisely in one of the future lectures.
For now, we will simply note that it can be shown that, for large sample sizes, no consistent estimator has a lower MSE than the maximum likelihood estimator.
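A small simulation sketch of the consistency property described above (the true rate is set to the horse-kick estimate 0.61 just for concreteness): as the sample size grows, the Poisson MLE, which is just the sample mean, concentrates around the true value.

# illustration of consistency: the Poisson MLE (sample mean) approaches
# the true rate as the sample size grows
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.61                                  # true value used in the simulation
for n in [10, 100, 1000, 10000, 100000]:
    sample = rng.poisson(true_rate, size=n)
    print(n, sample.mean())                       # MLE for each sample size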