Sampling#

We will now turn to the study of sampling distributions.

Population and Sample#

A setting that comes up repeatedly is the following:

Imagine we have a very large population of “objects.” The objects could be people, things, or just about anything. The size of the population is so large that if we take one object out of the population, nothing about the population changes in any perceptible way.

Associated with each object is a measurement or quantity of some kind.

So for example, we could be talking about the heights of individuals, or the costs of houses, or the weights of molecules.

We further imagine that we can select members of the population uniformly at random. The selections form a sample, and the sample is small enough compared to the population that we do not worry about the difference between sampling with replacement and sampling without replacement.

Because sampling with replacement is easier to think about in general, we will treat all samples as if they were sampled with replacement.
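To make the distinction concrete, here is a minimal sketch in NumPy (the population values are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(1_000_000)  # a large population of "objects" (arbitrary values)

# Sampling with replacement: the same object may be drawn more than once.
with_replacement = rng.choice(population, size=15, replace=True)

# Sampling without replacement: each object is drawn at most once.
without_replacement = rng.choice(population, size=15, replace=False)
```

When the population is this large relative to the sample, the two procedures give essentially indistinguishable results.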

Example. Suppose you are assessing success on the SAT exam. You would like to know the mean SAT score for students in your state. You choose 15 students at random from those in your state who took the exam, and you determine their scores.

The \(n = 15\) students are your sample, and the mean of their scores is a sample statistic.
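As a sketch of this procedure in code (the population parameters here are hypothetical; in reality we would not know them):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: pretend scores are normal with these parameters.
# In practice mu and sigma are unknown -- that is the whole point.
mu, sigma = 1050, 200
population_scores = rng.normal(mu, sigma, size=1_000_000)

# Draw a random sample of n = 15 students and compute the sample statistic.
sample = rng.choice(population_scores, size=15, replace=True)
sample_mean = sample.mean()
print(sample_mean)  # generally close to, but not equal to, mu
```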

Question. Should you trust your sample statistic as a reflection of the “true” population mean?

There are many things that could go wrong with your strategy:

  • How do you actually choose students at random?

  • How do you find students that actually took the exam?

  • How do you find out the scores of the students who took the exam?

These are all valid concerns!

However, for now, we will focus our attention more narrowly on the relationship between the size of your sample and the population. So let’s assume that, somehow, your sample is truly random and the scores are accurately reported.

Nonetheless, the mean of your sample will probably not be equal to the mean of the population.

Let’s look at the problem visually.

[Figure: the population of SAT scores, with a random sample of \(n = 15\) scores drawn from it.]

The Distribution of the Sample Mean#

Now, let’s make the assumption that the population has a normal distribution.

The population’s mean is \(\mu\) and its variance is \(\sigma^2\).

Note: both \(\mu\) and \(\sigma^2\) are unknown.

When we take \(n\) items independently from this distribution, the sum of these items will be

  • normally distributed

  • with mean \(n\mu\)

  • and variance \(n\sigma^2\).

The variances sum in this case because the items are independent – recall that was an assumption (we chose students at random). Note also that because the population is itself normal, the sum is exactly normally distributed; by the Central Limit Theorem, the sum would be approximately normal for large \(n\) even if the population were not.

Let’s call the sum \(S\), so

\[ S = \sum_{i=1}^n x_i \]

where \(x_i\) is one data point from the population.

Keep in mind that \(S\) is a random variable: its value depends on which random sample we happen to draw.

Then \(S \sim \mathcal{N}(n\mu, n\sigma^2)\).
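We can check this claim by simulation. A quick sketch, reusing the hypothetical parameters from above:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 1050, 200, 15  # hypothetical values

# Draw 100,000 independent samples of size n and sum each one.
S = rng.normal(mu, sigma, size=(100_000, n)).sum(axis=1)

print(S.mean())  # close to n * mu       = 15,750
print(S.var())   # close to n * sigma**2 = 600,000
```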

Now to estimate the mean of the population, we will use the mean of our sample.

Let

\[ M = \frac{1}{n} S.\]

Now, what is the distribution of \(M\)?

We are simply scaling down \(S\) by a constant. So the mean of \(M\) is

\[ \overline{M} = \frac{1}{n} \overline{S} = \frac{1}{n} \cdot n\mu = \mu. \]

What about the variance of \(M\)?

The variance is the expectation of the square of \(M - \overline{M}\).

\[ \operatorname{Var}(M) = E[(M - \overline{M})^2] = E[(S/n - \overline{S/n})^2] = \frac{1}{n^2} E[(S - \overline{S})^2] = \frac{1}{n^2} \operatorname{Var}(S)\]

So the variance goes down by a factor of \(n^2\).

\(\frac{1}{n^2} \operatorname{Var}(S) = \frac{1}{n^2}n\sigma^2 = \frac{\sigma^2}{n}\), so we find that

\[ M \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right). \]

That is, the variance of the mean of the sample

  • is smaller than the variance of the population,

  • by a factor of \(n\), the number of data points in the sample.
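The same kind of simulation confirms this for \(M\) (hypothetical parameters as before):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 1050, 200, 15  # hypothetical values

# The mean of each of 100,000 independent samples of size n.
M = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

print(M.mean())  # close to mu = 1050
print(M.var())   # close to sigma**2 / n, about 2666.7
```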

Expressed in terms of standard deviations, the standard deviation of \(M\) is \(\sigma/\sqrt{n}\).

This is important enough that we give it a name: the standard deviation of \(M\) is called the standard error.

The standard error is the standard deviation of the population divided by \(\sqrt{n}\).
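In practice \(\sigma\) is unknown, so the standard error is typically estimated by substituting the sample standard deviation for \(\sigma\). A sketch, assuming NumPy and SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.normal(1050, 200, size=15)  # a hypothetical sample of n = 15 scores

n = len(sample)
se_by_hand = sample.std(ddof=1) / np.sqrt(n)  # sample std divided by sqrt(n)
se_scipy = stats.sem(sample)                  # SciPy's built-in estimate

print(se_by_hand, se_scipy)  # the two values agree
```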

Let’s see how this works. Here is the mean \(M\) of our first sample:

[Figure: the population distribution, with the mean \(M\) of the first sample marked.]

Now let’s repeat the procedure of drawing a sample of \(n = 15\) data points and computing the sample mean \(M\). We’ll do this many times and show the results as a histogram.
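A sketch of this experiment, assuming NumPy, SciPy, and Matplotlib (hypothetical parameters as before):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(5)
mu, sigma, n, trials = 1050, 200, 15, 10_000  # hypothetical values

# Each row is one sample of size n; each row mean is one realization of M.
M = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)

# Histogram of the sample means, with the predicted density overlaid.
plt.hist(M, bins=50, density=True, alpha=0.6)
x = np.linspace(M.min(), M.max(), 200)
plt.plot(x, stats.norm.pdf(x, loc=mu, scale=sigma / np.sqrt(n)))
plt.xlabel('sample mean $M$')
plt.show()
```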

[Figure: green histogram of the sample means \(M\) over many repetitions, alongside the population distribution.]

This figure shows how the sampling distribution of the mean differs from the population distribution.

The sampling distribution of the mean is more concentrated around the population mean than is the population distribution.

In fact, the green histogram above shows samples from a normal distribution with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\):

[Figure: the histogram of sample means overlaid with the \(\mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)\) density.]

Effect of Sample Size#

Now, the previous results showed how the sample mean \(M\) relates to the population, when \(M\) is the average of \(n = 15\) data points.

What would happen if we had taken more data points in computing \(M\)?

Intuitively, taking more data points would cause \(M\) to be a “better” estimate of the true mean \(\mu\).

We can see this in practice. Here is the histogram we get if we use 100 data points each time we compute \(M\):

[Figure: histogram of the sample means \(M\) when each sample contains \(n = 100\) data points.]

Now, the distribution of \(M\) is even more concentrated around the true mean \(\mu\).

We can be precise about this change: the standard error went from \(\frac{\sigma}{\sqrt{15}}\) to \(\frac{\sigma}{\sqrt{100}}\).
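With the hypothetical \(\sigma = 200\) used in the sketches above, the improvement is easy to quantify:

```python
import numpy as np

sigma = 200  # hypothetical population standard deviation
for n in (15, 100):
    print(n, sigma / np.sqrt(n))
# 15  -> about 51.6
# 100 -> 20.0
```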

We will return to this effect in the next lecture, where we will take up the important topic of confidence intervals.