Bayesian Statistics
On a visit to the doctor, we may ask, “What is the probability that I have disease X?” but what we really mean is “How certain should I be that I have disease X?”
Or, before digging a well, we may ask, “What is the probability that I will strike water?” but what we are really asking is “How certain should I be that I will strike water?”
The key insight is that we either do or do not have disease X, and we either will or will not strike water; what we can determine is our level of certainty.
Bayes’ Theorem will be the tool we use to answer questions like these, and it will form the foundation of the next 4 weeks of this course! It is the basis of the Bayesian view of probability and statistics.
In this view, we are using probability to encode a “degree of belief” or a “state of knowledge.”
Review of Probability
The Chain Rule
Recall from many weeks ago the definition of conditional probability:

$$
P(A \vert B) = \frac{P(A \cap B)}{P(B)}
$$
From there we derived the chain rule by multiplying both sides by \(P(B)\):

$$
P(A \cap B) = P(A \vert B)\,P(B)
$$
Bayes’ Theorem
Now recall the fact that the conjunction (joint probability) is commutative:

$$
P(A \cap B) = P(B \cap A)
$$
Let’s then use the chain rule to expand both sides:

$$
P(A \vert B)\,P(B) = P(B \vert A)\,P(A)
$$
And then divide through by \(P(B)\):

$$
P(A \vert B) = \frac{P(B \vert A)\,P(A)}{P(B)}
$$
This is Bayes’ Theorem (or Bayes’ Rule), and we’re about to find out why it’s so powerful.
Bayes’ Theorem to update probability (Diachronic Bayes)
While being able to solve questions like the ones raised above is useful, the main way Bayes’ Theorem is used in Bayesian probability and statistics is as a way to update the probability of a hypothesis \(H\), given some data \(D\). This is sometimes called “Diachronic Bayes,” which means “related to change over time.”
As a general framework, we rewrite Bayes’ Theorem as:

$$
P(H \vert D) = \frac{P(H)\,P(D \vert H)}{P(D)}
$$
In this diachronic framework, these terms are known by specific names (they will come up over and over again, so make sure to remember them; a short worked example follows the list):
\(P(H)\), the probability of the hypothesis before we see the data, called the prior probability
\(P(H \vert D)\), the probability of the hypothesis after we see the data, called the posterior probability
\(P(D \vert H)\), the probability of the data under the hypothesis, called the likelihood
\(P(D)\), the total probability of the data under any hypothesis
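As a quick worked example, recall the cookie problem from last class: each bowl was equally likely to be chosen, bowl 1 was 30/40 vanilla, and we observed a vanilla cookie (so, as we calculated before, \(P(D) = 5/8\)). The terms line up as:

$$
P(H_1 \vert D) = \frac{P(H_1)\,P(D \vert H_1)}{P(D)} = \frac{\frac{1}{2} \cdot \frac{3}{4}}{\frac{5}{8}} = \frac{3}{5}
$$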
The Prior
The prior is often the trickiest portion of this equation to pin down.
Sometimes it can be computed exactly as in the cookie bowl problem, where we were equally likely to pick each bowl.
But what if we chose the bowls in proportion to their sizes? We would need to guess at their sizes to establish the prior.
Other times, people might disagree about which background information is relevant to the problem at hand. In fact, in many cases people will simply use a uniform prior!
The Likelihood
The likelihood is usually well defined and can be computed directly. In the cookie problem we know the numbers of different cookies in each bowl, so we can compute the probabilities under each hypothesis.
The Total Probability
Finally, determining the total probability of the data is often a lot more difficult than you might expect because we often don’t know what every possible hypothesis is.
However, usually the goal is to pin down a set of mutually exclusive and collectively exhaustive hypotheses, meaning a set of hypotheses where only one of them can be true, and one of them must be true.
Then, over a set of hypotheses \(H_i\) (indexed by \(i\)), we say:

$$
P(D) = \sum_i P(H_i)\,P(D \vert H_i)
$$
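As a small sketch in code (using the cookie problem numbers again: each bowl equally likely, with vanilla fractions 30/40 and 20/40), the sum can be computed directly:

# Law of total probability: weight each likelihood by its prior and add them up
priors = [1/2, 1/2]             # P(H_i): each bowl equally likely
likelihoods = [30/40, 20/40]    # P(D | H_i): fraction of vanilla cookies in each bowl
prob_data = sum(p * l for p, l in zip(priors, likelihoods))
prob_data                       # 0.625, i.e. 5/8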
The Bayesian Update
Overall, the process of generating a posterior probability from a prior probability using data is called a Bayesian update.
Bayes Tables
Often when we work with Bayes’ theorem to update probabilities (i.e., when we do a Bayesian update), we want to keep track of the probabilities of hypotheses as we update them using our data.
The table that keeps track of the probabilities of all hypotheses as we update them using our data is called a Bayes table.
Let’s go step by step to create a Bayes table for the cookie problem.
First we need a table of all our hypotheses, one row for each hypothesis:
import pandas as pd

table = pd.DataFrame(index=['first bowl', 'second bowl'])
table
|             |
|-------------|
| first bowl  |
| second bowl |
Next we add the prior probability of each hypothesis. Our prior is that we were equally likely to have picked either bowl:
table['prior'] = 1/2, 1/2
table
|             | prior |
|-------------|-------|
| first bowl  | 0.5   |
| second bowl | 0.5   |
The likelihood of the data under each hypothesis is the fraction of cookies in that bowl that are vanilla:
table['likelihood'] = 30/40, 20/40
table
|             | prior | likelihood |
|-------------|-------|------------|
| first bowl  | 0.5   | 0.75       |
| second bowl | 0.5   | 0.50       |
Now let’s compute the “unnormalized posteriors.” This is just a term for the numerator of Bayes’ Theorem: the prior multiplied by the likelihood.
So the unnormalized posterior for each hypothesis is:
table['unnormalized'] = table['prior'] * table['likelihood']
table
|             | prior | likelihood | unnormalized |
|-------------|-------|------------|--------------|
| first bowl  | 0.5   | 0.75       | 0.375        |
| second bowl | 0.5   | 0.50       | 0.250        |
The final missing piece is to divide by the total probability of the data.
What we are doing is normalizing the posteriors so that they sum up to 1.
To find the total probability of the data we directly sum over the unnormalized posteriors:
prob_data = table['unnormalized'].sum()
prob_data
0.625
This gives us 5/8, just like we calculated before.
Finally, we can use the total probability of the data to get the posterior probability of each hypothesis.
table['posterior'] = table['unnormalized'] / prob_data
table
|             | prior | likelihood | unnormalized | posterior |
|-------------|-------|------------|--------------|-----------|
| first bowl  | 0.5   | 0.75       | 0.375        | 0.6       |
| second bowl | 0.5   | 0.50       | 0.250        | 0.4       |
And so we see, the posterior probability of the first bowl, given that we observed a vanilla cookie, is 0.6, the same as when we used Bayes’ theorem directly before.
Notice that the posteriors add up to 1 (as we should expect given mutually exclusive and collectively exhaustive hypotheses), which is why the total probability of the data is sometimes called the “normalizing constant.”
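As a quick check (a small sketch of my own, assuming the table built above is still in memory), we can confirm this directly:

table['posterior'].sum()   # 1.0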
Example Problems
Being able to apply a Bayesian update is incredibly important, so today we’ll work through two more example problems as a class.
Example #1: The Dice Problem
First, we’ll try something just a little bit harder than the cookie problem: an example with more than two hypotheses.
Suppose we have a box with a 6-sided die, an 8-sided die, and a 12-sided die. We choose one of the dice at random, roll it, and report that the outcome is a 1. What is the probability that we chose the 6-sided die?
Let’s start by thinking intuitively about this. Which die should be most likely?
The 6-sided die is most likely because it has the highest chance of producing a 1 (1/6).
Now let’s set up our Bayes table. What are the hypotheses and what are their prior probabilities?
table = pd.DataFrame(index=['6-sided', '8-sided', '12-sided'])
table['prior'] = 1/3, 1/3, 1/3
table
|          | prior    |
|----------|----------|
| 6-sided  | 0.333333 |
| 8-sided  | 0.333333 |
| 12-sided | 0.333333 |
Next we need to compute the likelihood of the data under each hypothesis.
In other words, what is the probability of rolling a one, given each die?
table['likelihood'] = 1/6, 1/8, 1/12
table
|          | prior    | likelihood |
|----------|----------|------------|
| 6-sided  | 0.333333 | 0.166667   |
| 8-sided  | 0.333333 | 0.125000   |
| 12-sided | 0.333333 | 0.083333   |
Now that we have the prior and likelihood for each hypothesis, what do we calculate next?
The “unnormalized posteriors”:
table['unnormalized'] = table['prior'] * table['likelihood']
table
|          | prior    | likelihood | unnormalized |
|----------|----------|------------|--------------|
| 6-sided  | 0.333333 | 0.166667   | 0.055556     |
| 8-sided  | 0.333333 | 0.125000   | 0.041667     |
| 12-sided | 0.333333 | 0.083333   | 0.027778     |
And now there is just one last step to calculate the posterior probabilities. Normalization!
prob_data = table['unnormalized'].sum()
table['posterior'] = table['unnormalized'] / prob_data
table
|          | prior    | likelihood | unnormalized | posterior |
|----------|----------|------------|--------------|-----------|
| 6-sided  | 0.333333 | 0.166667   | 0.055556     | 0.444444  |
| 8-sided  | 0.333333 | 0.125000   | 0.041667     | 0.333333  |
| 12-sided | 0.333333 | 0.083333   | 0.027778     | 0.222222  |
And there we have it! The probability that we chose the 6-sided die given that we rolled a one is 4/9.
You may have noticed by now that every time we calculate the posterior from the prior and the likelihood, we follow the exact same steps for the Bayesian update. We can simplify things going forward by introducing an update function to calculate those parts of the table:
def update(table):
    # Multiply each prior by its likelihood to get the unnormalized posteriors
    table['unnormalized'] = table['prior'] * table['likelihood']
    # The total probability of the data is the sum of the unnormalized posteriors
    prob_data = table['unnormalized'].sum()
    # Normalize so the posteriors sum to 1
    table['posterior'] = table['unnormalized'] / prob_data
    return table
table = pd.DataFrame(index=['6-sided', '8-sided', '12-sided'])
table['prior'] = 1/3, 1/3, 1/3
table['likelihood'] = 1/6, 1/8, 1/12
update(table)
|          | prior    | likelihood | unnormalized | posterior |
|----------|----------|------------|--------------|-----------|
| 6-sided  | 0.333333 | 0.166667   | 0.055556     | 0.444444  |
| 8-sided  | 0.333333 | 0.125000   | 0.041667     | 0.333333  |
| 12-sided | 0.333333 | 0.083333   | 0.027778     | 0.222222  |
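A nice consequence of this framework (shown here as a hypothetical follow-up, not part of the original problem): the posterior from one update can serve as the prior for the next. Suppose we roll the same die again and this time report a 7. A 7 is impossible on the 6-sided die, so its likelihood is 0:

# Hypothetical second roll: reuse the posterior above as the new prior
table['prior'] = table['posterior']    # the old posterior becomes the new prior
table['likelihood'] = 0, 1/8, 1/12     # probability of rolling a 7 on each die
update(table)

The 6-sided die is now ruled out entirely, and the posterior favors the 8-sided die (about 0.69) over the 12-sided die (about 0.31).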
Example #2: The Monty Hall Problem
Now let’s consider a famously unintuitive problem in probability, the Monty Hall Problem. Many of you may have heard of this one.
On his TV show “Let’s Make a Deal,” Monty Hall would present contestants with three doors. Behind one was a prize, and behind the other two were gag gifts such as goats. The goal is to pick the door with the prize. After you pick one of the three doors, Monty opens one of the other two doors to reveal a gag gift, and then asks whether you’d like to switch doors.
What do you think, should we switch doors or stick with our original choice? Or does it make no difference?
Most people will say each of the two remaining doors now has a 50/50 chance of hiding the prize, so it doesn’t matter.
But it turns out that’s wrong! You actually have a 2/3 chance of finding the prize if you switch doors.
Let’s see why using a Bayes table.
Each door starts with an equal prior probability of holding the prize:
table = pd.DataFrame(index=['Door 1', 'Door 2', 'Door 3'])
table['prior'] = 1/3, 1/3, 1/3
table
|        | prior    |
|--------|----------|
| Door 1 | 0.333333 |
| Door 2 | 0.333333 |
| Door 3 | 0.333333 |
What is our data in this scenario? Without loss of generality, suppose we originally picked door 1. Now Monty opens a door (let’s say door 3, again without loss of generality) to reveal a gag prize. So what is the likelihood of the data under each hypothesis?
Hypothesis 1: The prize is behind door 1
In this case Monty chose between door 2 and door 3 at random, so he was equally likely to open either one, and the observation that he opened door 3 had a 50/50 chance of occurring.
Hypothesis 2: The prize is behind door 2
In this case Monty must open door 3, so the observation that he opened door 3 was guaranteed to happen.
Hypothesis 3: The prize is behind door 3
Monty could not have opened a door with the prize behind it, so the probability of seeing him open door 3 under this hypothesis is 0.
table['likelihood'] = 1/2, 1, 0
table
|        | prior    | likelihood |
|--------|----------|------------|
| Door 1 | 0.333333 | 0.5        |
| Door 2 | 0.333333 | 1.0        |
| Door 3 | 0.333333 | 0.0        |
And now let’s run our update function and see what the posterior probabilities are:
update(table)
|        | prior    | likelihood | unnormalized | posterior |
|--------|----------|------------|--------------|-----------|
| Door 1 | 0.333333 | 0.5        | 0.166667     | 0.333333  |
| Door 2 | 0.333333 | 1.0        | 0.333333     | 0.666667  |
| Door 3 | 0.333333 | 0.0        | 0.000000     | 0.000000  |
Turns out there is a 2/3 probability the prize is behind door 2! We should switch doors.
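If the result still feels unintuitive, here is a small simulation sketch of my own (not part of the Bayes-table method above) that plays the game many times, always switching, and estimates the win rate directly:

import random

def simulate_switching(num_trials=100_000):
    # Estimate the probability of winning the prize when we always switch doors
    switch_wins = 0
    for _ in range(num_trials):
        prize = random.randint(1, 3)   # the door hiding the prize
        pick = 1                       # we always start by picking door 1
        if prize == pick:
            # Monty can open either of the two goat doors, chosen at random
            opened = random.choice([2, 3])
        else:
            # Monty must open the one remaining door without the prize
            opened = next(d for d in (2, 3) if d != prize)
        # Switching means taking the door that is neither our pick nor the opened one
        switched = ({1, 2, 3} - {pick, opened}).pop()
        if switched == prize:
            switch_wins += 1
    return switch_wins / num_trials

simulate_switching()   # should come out close to 2/3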