Method of Moments Estimation#

Maximum likelihood estimation (MLE), which we discussed in the previous lecture, has a nice intuition but can be tedious to carry out.

Today we will learn a different technique for estimating parameters called the method of moments. It was introduced in 1894 by Karl Pearson and is the oldest method for deriving point estimators. This technique is consistent and often (but not always!) provides the same estimates as MLE but is significantly easier to use.

Moments#

Definition. Let \(X\) be a random variable. Then, the \(k\)th moment of \(X\) is defined as

\[\mu_k = E[X^k].\]

The \(k\)th moment of \(X\) about scalar \(c\) is equal to

\[E[(X-c)^k].\]

Usually, we are interested in the first moment of \(X\): \(\mu_1 = E[X]\), and the second moment of \(X\) about \(\mu_1\): \(\operatorname{Var}(X) = E[(X-\mu_1)^2]\).

Definition. Let \(X\) be a random variable and \(x_1, x_2, ..., x_m\) be i.i.d. realizations from \(X\). Then the
\(k\)th sample moment of \(X\) is

\[\hat{\mu}_k = \frac{1}{m} \sum_{i=1}^m x_i^k.\]

The \(k\)th sample moment of \(X\) about scalar \(c\) is equal to

\[\frac{1}{m} \sum_{i=1}^m (x_i-c)^k.\]

For example, the first sample moment is just the sample mean. The second sample moment about the sample mean is the sample variance.
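For concreteness, sample moments are easy to compute with NumPy. The helper below is a small illustrative sketch (the function name and the toy data are our own, not part of the lecture):

import numpy as np

def sample_moment(x, k, c=0.0):
    # k-th sample moment of the data x about the scalar c
    x = np.asarray(x, dtype=float)
    return np.mean((x - c) ** k)

x = [2.0, 3.0, 5.0, 7.0]                  # toy data for illustration
mu1_hat = sample_moment(x, 1)             # first sample moment = sample mean
var_hat = sample_moment(x, 2, c=mu1_hat)  # second sample moment about the mean = sample variance

print(mu1_hat, var_hat)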

Estimation Procedure#

One Parameter#

Suppose we only need to estimate one parameter \(\theta\).

The method of moments is based on the idea that a good estimator should make the true and sample moments agree as closely as possible. That is, we should choose the parameter \(\theta\) such that the first true moment \(\mu_1\) is equal to the first sample moment \(\hat{\mu}_1\).

Let us look next at some examples.

Example. Let \(x_1, x_2, ..., x_m\) be an i.i.d. sample from a continuous uniform distribution on \((0,\theta)\). What is the method of moments estimate of \(\theta\)?

Solution. Let \(X\) be a continuous random variable that is uniformly distributed on \((0,\theta)\). Then \(x_1, x_2, ..., x_m\) are i.i.d. realizations from \(X\).

We know that \(E[X] = \frac{0+\theta}{2} = \frac{\theta}{2}.\) Thus, \(\frac{\theta}{2}\) is the first true moment.

The method of moments approach requires us to match this value with the first sample moment:

\[\frac{\theta}{2} = \frac{1}{m} \sum_{i=1}^m x_i.\]

Solving for \(\theta\) we obtain

\[\hat{\theta} = \frac{2}{m} \sum_{i=1}^m x_i.\]
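As a quick numerical sanity check (this simulation is an added illustration, with a hypothetical true value of \(\theta\)), the estimator \(\hat{\theta} = \frac{2}{m} \sum_{i=1}^m x_i\) lands close to the truth for a reasonably large sample:

import numpy as np

rng = np.random.default_rng(0)
theta_true = 5.0                       # hypothetical true parameter
x = rng.uniform(0.0, theta_true, size=1000)

theta_hat = 2.0 * np.mean(x)           # method of moments estimate: (2/m) * sum(x_i)
print(f"theta_hat = {theta_hat:.3f}")  # should be close to 5.0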

Example. Suppose that \(x_1, x_2,...,x_m\) are i.i.d. realizations from \(X\), a random variable having an exponential distribution with parameter \(\theta\). Find the estimate of \(\theta\) using the method of moments.

Solution. Observe that

\[\mu_1 = E[X] = \frac{1}{\theta}\]

and

\[\hat{\mu}_1 = \frac{1}{m} \sum_{i=1}^m x_i.\]

To obtain the method of moments estimate we set \(\mu_1 = \hat{\mu}_1:\)

\[\frac{1}{\theta} = \frac{1}{m} \sum_{i=1}^m x_i.\]

Hence, we find \(\hat{\theta} = m \left(\sum_{i=1}^m x_i\right)^{-1}.\)
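Here \(\theta\) is the rate parameter, so that \(E[X] = 1/\theta\). A quick simulated check (an added sketch with a hypothetical true rate) looks as follows; note that NumPy parameterizes the exponential distribution by the scale \(1/\theta\):

import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0                                        # hypothetical true rate
x = rng.exponential(scale=1.0 / theta_true, size=1000)  # NumPy uses scale = 1/theta

theta_hat = len(x) / np.sum(x)                          # m / sum(x_i) = 1 / sample mean
print(f"theta_hat = {theta_hat:.3f}")                   # should be close to 2.0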

Intuitively, the above estimators make perfect sense. If we take the sample mean of a large number of identically distributed random variables, we expect to get close to the true mean. We know this from the law of large numbers.

Let us look next at a more practical example, where the realizations are actually given to us.

Example. Asbestos fibers on filters were counted as part of a project to develop measurement standards for asbestos concentration. Asbestos dissolved in water was spread on a filter, and 3-mm diameter punches were taken from the filter and mounted on a transmission electron microscope. An operator counted the number of fibers in each of 23 grid squares, yielding the following counts:

\[\begin{align*} & 31 \quad 29 \quad 19 \quad 18 \quad 31 \quad 28 \\ & 34 \quad 27 \quad 34 \quad 30 \quad 16 \quad 18 \\ & 26 \quad 27 \quad 27 \quad 18 \quad 24 \quad 22 \\ & 28 \quad 24 \quad 21 \quad 17 \quad 24 \end{align*}\]

The Poisson distribution would be a plausible model for describing the variability from grid square to grid square in this situation and could be used to characterize the inherent variability in future measurements.

Since for the Poisson distribution with parameter \(\theta\) the expected value is also \(\theta\), we find that

\[\mu_1 = E[X] = \theta.\]

The method of moments tells us that \(\theta = \mu_1 = \hat{\mu}_1\), where \(\hat{\mu}_1\) is the sample mean. That is, to find the method of moments estimate of \(\theta\) we simply need to find the mean of the counts.

import numpy as np

counts = [31, 29, 19, 18, 31, 28, 34, 27, 34, 30, 16, 18, 26, 27, 27, 18,
          24, 22, 28, 24, 21, 17, 24]
mean_counts = np.mean(counts)

print(f"The mean of the given counts is {mean_counts:.3g}.")
The mean of the given counts is 24.9.

Thus, the estimate of \(\theta\) for this particular dataset is simply 24.9.

Note that when we considered Bortkiewicz’s horse-kick problem with MLE, we found the same estimator for the parameter of a Poisson distribution, namely the sample mean.

The Poisson distribution is thus an example where the method of moments estimator agrees with the MLE. However, the computation this time is much easier.
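As an optional numerical check (this code is an added illustration, not part of the lecture), we can maximize the Poisson log-likelihood for the asbestos counts directly and confirm that the maximizer coincides with the sample mean:

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

counts = [31, 29, 19, 18, 31, 28, 34, 27, 34, 30, 16, 18, 26, 27, 27, 18,
          24, 22, 28, 24, 21, 17, 24]

# Negative Poisson log-likelihood as a function of theta.
def neg_log_lik(theta):
    return -np.sum(poisson.logpmf(counts, theta))

result = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print(f"Numerical MLE: {result.x:.3f}, sample mean: {np.mean(counts):.3f}")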

Multiple Parameters#

What if you had two parameters instead of just one?

In that case, you would set the first true moment equal to the first sample moment (as we just did), and then set the second true moment equal to the second sample moment.

This idea extends to more than two parameters. In general, if we have \(k\) parameters to estimate, we need \(k\) equations to solve for these \(k\) unknowns. That is, to estimate \(\theta_1, \theta_2, \dots, \theta_k\) we need to solve

\[\begin{align*} \mu_1 &= \hat{\mu}_1,\\ \mu_2 &= \hat{\mu}_2,\\ & \: \: \vdots \\ \mu_k &= \hat{\mu}_k.\\ \end{align*}\]

The estimation procedure follows 3 basic steps:

Step 1. Calculate low order true moments, finding the expressions for the moments in terms of the parameters:

\[\begin{align*} \mu_1 &= E[X] = g_1(\theta_1, \theta_2, \dots, \theta_k),\\ \mu_2 &= E[X^2] = g_2(\theta_1, \theta_2, \dots, \theta_k),\\ & \: \: \vdots \\ \mu_k &= E[X^k] = g_k(\theta_1, \theta_2, \dots, \theta_k).\\ \end{align*}\]

Here, \(g_i\) represents a function for the \(i\)th moment, \(1 \leq i \leq k\).

Step 2. Invert the expressions found in the preceding step, finding new expressions for the parameters in terms of moments:

\[\begin{align*} \theta_1 & = h_1(\mu_1, \mu_2, \dots, \mu_k),\\ \theta_2 & = h_2(\mu_1, \mu_2, \dots, \mu_k),\\ & \: \: \vdots \\ \theta_k & = h_k(\mu_1, \mu_2, \dots, \mu_k),\\ \end{align*}\]

where \(h_i\) represents a function for the \(i\)th parameter, \(1 \leq i \leq k\).

Step 3. Insert the sample moments into the expressions obtained in the second step, obtaining estimates of the parameters in terms of the sample moments.

We illustrate this procedure by means of an example.

Example. Let \(x_1, x_2, \dots , x_m\) be i.i.d. realizations from \(X\) that is normally distributed with parameters \(\theta_1\) and \(\theta_2\). \(\theta_1\) represents the mean, and \(\theta_2\) represents the variance. What is the method of moments estimate of \(\theta = (\theta_1, \theta_2)\)?

Solution.
Step 1. We have 2 parameters, so we need to consider the first two moments.

\[\begin{align*} \mu_1 &= E[X] = \theta_1, \\ \mu_2 &= E[X^2] = \operatorname{Var}(X) + E[X]^2 = \theta_2 + \theta_1^2. \end{align*}\]

Here, we used the fact that \(\operatorname{Var}(X) = E[X^2] - E[X]^2.\)

Step 2. Now we simply rewrite the equations putting the parameters to the left of the equal sign.

\[\begin{align*} \theta_1 & = \mu_1, \\ \theta_2 & = \mu_2 - \mu_1^2. \end{align*}\]

Step 3. It remains to replace the true moments by the sample moments.

\[\begin{align*} \theta_1 & = \frac{1}{m} \sum_{i=1}^m x_i, \\ \theta_2 & = \frac{1}{m} \sum_{i=1}^m x_i^2 - \left(\frac{1}{m} \sum_{i=1}^m x_i\right)^2. \end{align*}\]

The above expressions are our method of moments estimates.
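For concreteness, here is a small simulated check of these formulas (an added sketch with hypothetical true values); note that the estimate of \(\theta_2\) coincides with NumPy's np.var, which divides by \(m\) by default:

import numpy as np

rng = np.random.default_rng(0)
theta1_true, theta2_true = 2.0, 9.0           # hypothetical true mean and variance
x = rng.normal(theta1_true, np.sqrt(theta2_true), size=1000)

theta1_hat = np.mean(x)                       # first sample moment
theta2_hat = np.mean(x**2) - theta1_hat**2    # second sample moment minus squared first

print(f"theta1_hat = {theta1_hat:.3f}, theta2_hat = {theta2_hat:.3f}")
print(np.isclose(theta2_hat, np.var(x)))      # np.var divides by m (ddof=0), so this prints True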

If you were to use maximum likelihood to estimate the mean and variance of a normal distribution, you would get the same result. We are going to show this.

The two estimates of the mean \(\theta_1\) coincide, so there is nothing we need to do for \(\theta_1\).

The ML estimate for \(\theta_2\) can be written as follows (see the previous lecture for the derivation):

\[\theta_2 = \frac{1}{m} \sum_{i=1}^m (x_i - \bar{x})^2,\]

where \(\bar{x}\) represents the sample mean (i.e., \(\bar{x} = \theta_1 = \frac{1}{m} \sum_{i=1}^m x_i\)).

Using the same notation the method of moments estimate can be written as

\[\theta_2 = \frac{1}{m} \sum_{i=1}^m x_i^2 - \bar{x}^2.\]

That is, we need to prove that

\[ \frac{1}{m} \sum_{i=1}^m (x_i - \bar{x})^2 = \frac{1}{m} \sum_{i=1}^m x_i^2 - \bar{x}^2.\]

We can do this as follows:

\[\begin{align*} \frac{1}{m} \sum_{i=1}^m (x_i - \bar{x})^2 & = \frac{1}{m} \sum_{i=1}^m (x_i^2 - 2x_i \bar{x} + \bar{x}^2)\\ & = \frac{1}{m} \sum_{i=1}^m x_i^2 - \frac{2 \bar{x}}{m} \sum_{i=1}^m x_i + \frac{\bar{x}^2}{m} \sum_{i=1}^m 1\\ & = \frac{1}{m} \sum_{i=1}^m x_i^2 - 2\bar{x}^2 + \bar{x}^2\\ & = \frac{1}{m} \sum_{i=1}^m x_i^2 - \bar{x}^2. \end{align*}\]
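A quick numerical check of this identity (again an added illustration, not part of the derivation):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
x_bar = np.mean(x)

lhs = np.mean((x - x_bar) ** 2)     # (1/m) * sum (x_i - x_bar)^2
rhs = np.mean(x ** 2) - x_bar ** 2  # (1/m) * sum x_i^2 - x_bar^2
print(np.isclose(lhs, rhs))         # True: the two expressions coincide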