Model Fitting#

Today we will talk about model fitting, likelihood, and log-likelihood functions. This lecture provides a foundation for our next lecture, where we will discuss maximum likelihood estimation.

Model Fitting#

The notion of parameter estimation is closely related to the concept of model fitting. We have actually been doing this quite a bit already, but now we want to treat the notion more directly.

Imagine that you know that data is drawn from a particular kind of distribution, but you don’t know the value(s) of the distribution’s parameter(s).

The following table shows a number of common distributions and their corresponding parameters:

| Distribution | Parameters $\theta$ |
| --- | --- |
| Bernoulli | $p$ |
| Binomial | $(N, p)$ |
| Poisson | $\lambda$ |
| Geometric | $p$ |
| Exponential | $\lambda$ |
| Uniform | $(a, b)$ |
| Normal | $(\mu, \sigma)$ |

We formalize the problem as follows. We say that data is drawn from a distribution

$$p(x; \theta).$$

The way to read this is: the probability of $x$ under a distribution having parameter(s) $\theta$.

We call $p(x;\theta)$ a family of distributions because there is a different distribution for each value of $\theta$.

The graph below illustrates the probability density functions of several normal distributions (from the same parametric family).
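
As a quick illustration (the parameter values below are chosen arbitrarily and are not necessarily the ones used in the figure), we can evaluate a few members of the normal family with scipy.stats:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-5, 5, 500)

# each (mu, sigma) pair selects one member of the normal family
for mu, sigma in [(0, 0.5), (0, 1.0), (0, 2.0), (-2, 0.7)]:
    plt.plot(x, norm.pdf(x, loc=mu, scale=sigma),
             label=rf'$\mu = {mu},\ \sigma = {sigma}$')

plt.xlabel('x')
plt.ylabel(r'$p(x;\theta)$')
plt.legend()
plt.show()
```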

![Probability density functions of several normal distributions from the same parametric family](_images/Probability_distribution_functions_for_normal_distribution.svg)

Model fitting is finding the parameters $\theta$ of the distribution, given that we know some data $x$ and assume that a certain distribution can describe the population.

Notice that in this context, it is the parameters $\theta$ that are varying (not the data $x$).

When we think of $p(x;\theta)$ as a function of $\theta$ (instead of $x$, say) we call it a likelihood.

This change in terminology is just to emphasize that we are thinking about varying $\theta$ when we look at the probability $p(x;\theta)$.

For example, consider the dataset below:

![Example dataset](_images/10-MLE-1_14_0.png)

Can you imagine that this dataset might be drawn from a normal distribution?

In that case,

$$p(x;\theta) = \mathcal{N}(x; \mu, \sigma^2).$$

Then model fitting would consist of finding the $\mu$ and $\sigma$ that best match the given data $x$ shown above.

In other words, the likelihood function gives the probability of observing the given data as a function of the parameter(s). Therefore, the parameters can be estimated by maximizing the likelihood function. We will do that in the next lecture. In this lecture we focus on computing the likelihood function.

Calculating Likelihood#

Let’s think about how to calculate likelihood. Consider a set of $m$ data items

$$X_s = (x_1, x_2, \ldots, x_m)$$

drawn independently from the true but unknown data-generating distribution $p_{\text{data}}(x)$.

Let $p_{\text{model}}(x;\theta)$ be a family of probability distributions over the same space.

Then, for any value of $\theta$, $p_{\text{model}}(x;\theta)$ maps a particular data item $x$ to a real number that we can use as an estimate of the true probability $p_{\text{data}}(x)$.

What is the probability of the entire dataset $X_s$ under the model?

We assume that the $x_i$ are independent, so the probabilities multiply.

Therefore, the joint probability is

$$p_{\text{model}}(X_s;\theta) = p_{\text{model}}(x_1;\theta)\,p_{\text{model}}(x_2;\theta)\cdots p_{\text{model}}(x_m;\theta).$$

We can use a special shorthand notation to represent products. Just as $\sum$ is shorthand for summing, $\prod$ is shorthand for taking the product.

For example, the product of two numbers $a_1$ and $a_2$ (i.e., $a_1 a_2$) can be written as $\prod_{i=1}^{2} a_i$.

So the joint probability can be written as:

$$p_{\text{model}}(X_s;\theta) = \prod_{i=1}^{m} p_{\text{model}}(x_i;\theta).$$

Now, each individual $p_{\text{model}}(x_i;\theta)$ is a value between 0 and 1.

And there are $m$ of these numbers being multiplied. So for any reasonably sized dataset, the joint probability is going to be very small.

For example, if a typical probability is $1/10$, and there are 500 data items, then the joint probability will be a number on the order of $10^{-500}$.

So the probability of a given dataset as a number will usually be too small to even represent in a computer using standard floating point!
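
We can verify this with a quick NumPy check, reusing the numbers from the example above (a probability of 1/10 for each of 500 items):

```python
import numpy as np

# 500 data items, each assigned probability 0.1 by the model
probs = np.full(500, 0.1)

# 0.1**500 = 1e-500 is far below the smallest representable
# float64 value (about 5e-324), so the product underflows to 0.0
print(np.prod(probs))   # 0.0
```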

Log-Likelihood#

Luckily, there is an excellent way to handle this problem. Instead of using likelihood, we will use the log of likelihood.

The table below shows some of the properties of the natural logarithm.

| Rule | Identity |
| --- | --- |
| Product rule | $\log ab = \log a + \log b$ |
| Quotient rule | $\log \frac{a}{b} = \log a - \log b$ |
| Power rule | $\log a^n = n \log a$ |
| Exponential/logarithmic | $\log e^x = e^{\log x} = x$ |

In the next lecture we will see that we are only interested in the maxima of the likelihood function. Since the log function does not change those points (the log is a monotonic function), using the log of the likelihood works for us.

![The natural log is monotonic, so it preserves the location of the maximum](_images/natural_log_max.png)

So we will work with the log-likelihood:

$$\log p_{\text{model}}(X_s;\theta).$$

Which becomes:

$$\log p_{\text{model}}(X_s;\theta) = \log \prod_{i=1}^{m} p_{\text{model}}(x_i;\theta) = \sum_{i=1}^{m} \log p_{\text{model}}(x_i;\theta).$$

This way we are no longer multiplying many small numbers, and we work with values that are easy to represent.
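
Continuing the toy example from above, the same 500 probabilities whose product underflowed are easy to handle as a sum of logs:

```python
import numpy as np

probs = np.full(500, 0.1)

log_lik = np.sum(np.log(probs))   # sum of logs instead of a product
print(log_lik)           # about -1151.29 -- easily representable
print(np.exp(log_lik))   # 0.0 -- the raw likelihood itself still underflows
```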

Note: The log of a number less than one is negative, so log-likelihoods are typically large negative values.

Example.

Suppose that X is a discrete random variable with the probability mass function shown below.

| $x$ | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| $p(x;\theta)$ | $\frac{2}{3}\theta$ | $\frac{1}{3}\theta$ | $\frac{2}{3}(1-\theta)$ | $\frac{1}{3}(1-\theta)$ |

Here $0 \leq \theta \leq 1$ is a parameter. The following 10 independent observations were taken from this distribution:

$$X_s = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1).$$

We want to find the corresponding log-likelihood function.

Based on the observed data sample, the (joint) likelihood function is equal to

$$p(X_s;\theta) = \prod_{i=1}^{10} p(x_i;\theta) = p(0;\theta)^2\,p(1;\theta)^3\,p(2;\theta)^3\,p(3;\theta)^2$$
$$= \left(\tfrac{2}{3}\theta\right)^2 \left(\tfrac{1}{3}\theta\right)^3 \left(\tfrac{2}{3}(1-\theta)\right)^3 \left(\tfrac{1}{3}(1-\theta)\right)^2.$$

Since the likelihood function is given by

$$p(X_s;\theta) = \left(\tfrac{2}{3}\theta\right)^2 \left(\tfrac{1}{3}\theta\right)^3 \left(\tfrac{2}{3}(1-\theta)\right)^3 \left(\tfrac{1}{3}(1-\theta)\right)^2,$$

the log-likelihood function can be written as

$$\log p(X_s;\theta) = \log\left(\left(\tfrac{2}{3}\theta\right)^2 \left(\tfrac{1}{3}\theta\right)^3 \left(\tfrac{2}{3}(1-\theta)\right)^3 \left(\tfrac{1}{3}(1-\theta)\right)^2\right)$$
$$= 2\left(\log\tfrac{2}{3} + \log\theta\right) + 3\left(\log\tfrac{1}{3} + \log\theta\right) + 3\left(\log\tfrac{2}{3} + \log(1-\theta)\right) + 2\left(\log\tfrac{1}{3} + \log(1-\theta)\right)$$
$$= 5\log\theta + 5\log(1-\theta) + 5\log\tfrac{2}{9}.$$

We can visualise the log-likelihood function by varying $\theta$.
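
Here is a minimal sketch of that computation (the grid of $\theta$ values is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt

# evaluate 5 log(theta) + 5 log(1 - theta) + 5 log(2/9) on a grid,
# staying strictly inside (0, 1) to avoid log(0)
theta = np.linspace(0.001, 0.999, 500)
loglik = 5 * np.log(theta) + 5 * np.log(1 - theta) + 5 * np.log(2 / 9)

plt.plot(theta, loglik)
plt.xlabel(r'$\theta$')
plt.ylabel('log-likelihood')
plt.show()
```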

![Log-likelihood as a function of θ](_images/10-MLE-1_46_0.png)

Example.

As an example, let’s return to Bortkiewicz’s horse-kick data, and see how the log-likelihood changes for different values of the Poisson parameter $\lambda$.

Recall that Bortkiewicz had collected counts of deaths by horse kick in the Prussian army, amounting to 200 observation-years, and was curious whether they occurred at a constant, fixed rate.

To do this he fitted the data to a Poisson distribution, which required estimating the parameter $\lambda$.

Let’s see how the log-likelihood of the data varies as a function of $\lambda$.

| Deaths Per Year | Observed Instances |
| --- | --- |
| 0 | 109 |
| 1 | 65 |
| 2 | 22 |
| 3 | 3 |
| 4 | 1 |
| 5 | 0 |
| 6 | 0 |

As a reminder, $\lambda$ is the rate of deaths per year, and $T = 1$ reflects that we are interested in one-year intervals.

The Poisson distribution predicts:

$$P[k \text{ deaths in time } T] = \frac{(\lambda T)^k e^{-\lambda T}}{k!}$$

Since $T = 1$ we can drop it, and so the likelihood of a particular number of deaths $x_i$ in year $i$ is:

$$p(x_i;\lambda) = \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}$$

The likelihood function then becomes

$$p(X_s;\lambda) = \prod_{i=1}^{m} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}.$$

The corresponding log-likelihood function can be written as

$$\log p(X_s;\lambda) = \log\left(\prod_{i=1}^{m} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}\right) = \sum_{i=1}^{m} \log\left(\frac{\lambda^{x_i} e^{-\lambda}}{x_i!}\right).$$

Note that the log-likelihood of a particular number of deaths $x_i$ in year $i$ is equal to

$$\log p(x_i;\lambda) = \log\left(\frac{\lambda^{x_i} e^{-\lambda}}{x_i!}\right) = x_i \log\lambda - \lambda - \log x_i!$$

Using this we can express the log-likelihood of the data as

$$\log p(X_s;\lambda) = \sum_{i=1}^{m} \left(x_i \log\lambda - \lambda - \log x_i!\right)$$
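
As a sketch, we can evaluate this log-likelihood for the horse-kick data by grouping the 200 observations according to the table above (the grid of $\lambda$ values is chosen just for illustration):

```python
import numpy as np
from scipy.special import gammaln   # gammaln(k + 1) equals log(k!)

deaths = np.arange(7)                         # 0, 1, ..., 6 deaths per year
counts = np.array([109, 65, 22, 3, 1, 0, 0])  # observed instances of each

def poisson_loglik(lam):
    # sum over all 200 observations of x_i log(lam) - lam - log(x_i!),
    # grouped by the distinct values of x_i
    return np.sum(counts * (deaths * np.log(lam) - lam - gammaln(deaths + 1)))

lambdas = np.linspace(0.1, 2.0, 200)
loglik = np.array([poisson_loglik(lam) for lam in lambdas])
```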

Which looks like this as we vary $\lambda$:

![Log-likelihood of the horse-kick data as a function of λ](_images/10-MLE-1_60_0.png)

Recall that for an arbitrary Normal distribution with mean $\mu$ and variance $\sigma^2$, the PDF is given by

$$p_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}.$$

![Francis Galton's "Bean Machine"](_images/Galton-Bean-Machine.png)

*Francis Galton's "Bean Machine"*

Example.

Suppose $X_s = (x_1, \ldots, x_m)$ represents the observed data from a normal distribution $\mathcal{N}(\mu, \sigma^2)$, where $\mu$ and $\sigma$ are the unknown parameters. To fit the data to a normal distribution we need to estimate both $\mu$ and $\sigma$.

The likelihood function is the product of the marginal densities:

$$p(X_s;\mu,\sigma) = p(x_1,\ldots,x_m;\mu,\sigma) = \prod_{i=1}^{m} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2}.$$

This can be rewritten as

$$p(X_s;\mu,\sigma) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^m \exp\left(-\frac{1}{2}\sum_{i=1}^{m} \left(\frac{x_i-\mu}{\sigma}\right)^2\right).$$

Since the likelihood function is equal to

$$p(X_s;\mu,\sigma) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^m \exp\left(-\frac{1}{2}\sum_{i=1}^{m} \left(\frac{x_i-\mu}{\sigma}\right)^2\right),$$

the log-likelihood becomes

$$\log p(X_s;\mu,\sigma) = \log\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^m + \sum_{i=1}^{m} \left(-\frac{1}{2}\right)\left(\frac{x_i-\mu}{\sigma}\right)^2$$
$$= -m\log\sigma - \frac{m}{2}\log(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^{m} (x_i-\mu)^2.$$

Visualizing Log-Likelihood#

We consider a sample of size 100 from a normal distribution. Our goal is to visualize the corresponding log-likelihood function. To do so, first we create the log-likelihood function.

```python
import numpy as np

# log-likelihood function for a normal distribution
# inputs: sample X, candidate parameter values muV and sigmaV (scalars or arrays)
def normloglik(X, muV, sigmaV):
    m = len(X)
    # -m/2 log(2 pi) - m log(sigma) - (1 / (2 sigma^2)) sum (x_i - mu)^2,
    # with the sum expanded as sum(x_i^2) - 2 mu sum(x_i) + m mu^2
    return (-m / 2 * np.log(2 * np.pi)
            - m * np.log(sigmaV)
            - (np.sum(np.square(X)) - 2 * muV * np.sum(X) + m * muV**2) / (2 * sigmaV**2))
```

Note: The log-likelihood function of the normal distribution can also be written as

$$\log p(X_s;\mu,\sigma) = -\frac{m}{2}\log(2\pi) - m\log\sigma - \frac{1}{2\sigma^2}\left(\sum_{i=1}^{m} x_i^2 - 2\mu\sum_{i=1}^{m} x_i + m\mu^2\right).$$

If neither of the parameters is known, we can use a surface plot to visualize the log-likelihood as a function of both parameters.

Alternatively, if one of the parameters is known, we visualize the log-likelihood as a function of the unknown parameter.
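
A minimal sketch of both views, using normloglik from above; the sample here is synthetic (the actual sample used for the figure is not shown), and the parameter grids are arbitrary choices:

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic sample of size 100 (true parameters chosen only for illustration)
rng = np.random.default_rng(0)
X = rng.normal(loc=5, scale=2, size=100)

# surface over a grid of (mu, sigma) values when neither parameter is known
muV, sigmaV = np.meshgrid(np.linspace(3, 7, 100), np.linspace(1, 4, 100))
LL = normloglik(X, muV, sigmaV)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.plot_surface(muV, sigmaV, LL)
ax.set_xlabel(r'$\mu$')
ax.set_ylabel(r'$\sigma$')
plt.show()

# if sigma is known (say sigma = 2), plot the log-likelihood versus mu only
mus = np.linspace(3, 7, 200)
plt.plot(mus, normloglik(X, mus, 2.0))
plt.xlabel(r'$\mu$')
plt.ylabel('log-likelihood')
plt.show()
```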

![Log-likelihood of the normal sample](_images/10-MLE-1_75_0.png)