Bayesian Statistics#
On a visit to the doctor, we may ask, “What is the probability that I have disease X?” but what we really mean is “How certain should I be that I have disease X?”
Or, before digging a well, we may ask, “What is the probability that I will strike water?” but what we are really asking is “How certain should I be that I will strike water?”
The key insight is that we either do or do not have disease X, and we either will or will not strike water; what we can actually determine is our level of certainty about each outcome.
Bayes’ Theorem will be the tool we use to answer questions like these, and it will form the foundation of the next 6 weeks of this course! It is the basis of the Bayesian view of probability.
In this view, we are using probability to encode a “degree of belief” or a “state of knowledge.”
Review of Probability#
The Chain Rule#
Recall from many weeks ago the definition of conditional probability:

\[P(A \vert B) = \frac{P(A \cap B)}{P(B)}\]
From there we derived the chain rule by multiplying both sides by \(P(B)\):

\[P(A \cap B) = P(A \vert B) \, P(B)\]
Bayes’ Theorem#
Now recall the fact that the conjunction (joint probability) is commutative:

\[P(A \cap B) = P(B \cap A)\]
Let’s then use the chain rule to expand both sides:

\[P(A \vert B) \, P(B) = P(B \vert A) \, P(A)\]
And then divide through by \(P(B)\):

\[P(A \vert B) = \frac{P(B \vert A) \, P(A)}{P(B)}\]
This is Bayes’ Theorem (or Bayes’ Rule), and we’re about to find out why it’s so powerful.
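To see the identity in action, here is a tiny sanity check in Python (the numbers below are made up purely for illustration): we compute \(P(A \vert B)\) directly from the definition of conditional probability, and again via Bayes’ Theorem, and confirm the two agree.

```python
# Made-up probabilities for two events A and B (illustration only)
p_a_and_b = 0.125   # P(A and B)
p_a = 0.25          # P(A)
p_b = 0.5           # P(B)

# Conditional probabilities straight from the definition
p_a_given_b = p_a_and_b / p_b   # P(A|B) = P(A and B) / P(B)
p_b_given_a = p_a_and_b / p_a   # P(B|A) = P(A and B) / P(A)

# Bayes' Theorem recovers P(A|B) from P(B|A), P(A), and P(B)
p_a_given_b_bayes = p_b_given_a * p_a / p_b

print(p_a_given_b, p_a_given_b_bayes)  # 0.25 0.25
```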
Bayes’ Theorem to update probability (Diachronic Bayes)#
While being able to answer questions like the ones raised above is useful, the main way Bayes’ Theorem is used in Bayesian probability is as a way to update the probability of a hypothesis \(H\), given some data \(D\). This is sometimes called “Diachronic Bayes,” where “diachronic” means “related to change over time.”
As a general framework, we re-write Bayes’ Theorem as:

\[P(H \vert D) = \frac{P(H) \, P(D \vert H)}{P(D)}\]
In this diachronic framework, these terms are known by specific names (they will come up over and over again, so make sure to remember them):

- \(P(H)\), the probability of the hypothesis before we see the data, called the prior probability
- \(P(H \vert D)\), the probability of the hypothesis after we see the data, called the posterior probability
- \(P(D \vert H)\), the probability of the data under the hypothesis, called the likelihood
- \(P(D)\), the total probability of the data under any hypothesis
The Prior#
The prior is often the trickiest portion of this equation to pin down.
Sometimes it can be computed exactly as in the cookie bowl problem, where we were equally likely to pick each bowl.
But what if we chose bowls proportionally to their size? We would need to guess at their sizes to establish the prior.
Other times, people might disagree about which background information is relevant to the problem at hand. In fact, in many cases people will simply use a uniform prior!
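As a quick sketch of how a non-uniform prior might be constructed, here is one way to turn guessed bowl sizes into a prior; the sizes below are hypothetical guesses, not values from the cookie problem.

```python
# Hypothetical guessed bowl sizes (illustration only)
guessed_sizes = [60, 40]

# A size-proportional prior: normalize the guesses so they sum to 1
total = sum(guessed_sizes)
prior = [size / total for size in guessed_sizes]
print(prior)          # [0.6, 0.4]

# A uniform prior ignores the sizes and treats each bowl as equally likely
uniform_prior = [1 / len(guessed_sizes)] * len(guessed_sizes)
print(uniform_prior)  # [0.5, 0.5]
```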
The Likelihood#
The likelihood is usually well defined and can be computed directly. In the cookie problem we know the cookies in each bowl, so we can compute the probabilities under each hypothesis.
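For instance, here is a minimal sketch of computing the likelihood of drawing a vanilla cookie under each hypothesis, using the bowl contents from the cookie problem (30 of the 40 cookies in the first bowl are vanilla, and 20 of the 40 in the second):

```python
# Cookies in each bowl: (number of vanilla, total)
bowls = {'first bowl': (30, 40), 'second bowl': (20, 40)}

# Likelihood of drawing a vanilla cookie under each hypothesis
likelihood = {bowl: vanilla / total for bowl, (vanilla, total) in bowls.items()}
print(likelihood)  # {'first bowl': 0.75, 'second bowl': 0.5}
```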
The Total Probability#
Finally, determining the total probability of the data is often a lot more difficult than you might expect because we often don’t know what every possible hypothesis is.
However, usually the goal is to pin down a set of mutually exclusive and collectively exhaustive hypotheses, meaning a set of hypotheses where only one of them can be true, and one of them must be true.
Then, over a set of hypotheses \(H_i\), we can write the total probability of the data as:

\[P(D) = \sum_i P(H_i) \, P(D \vert H_i)\]
The Bayesian Update#
Overall, the process of generating a posterior probability from a prior probability using data is called a Bayesian update.
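To make the whole update concrete before we introduce Bayes tables, here is a minimal sketch of a Bayesian update written as a plain Python function (the function name and list-based interface are just for illustration):

```python
def bayesian_update(priors, likelihoods):
    """Return posteriors for a set of mutually exclusive,
    collectively exhaustive hypotheses."""
    # Numerator of Bayes' Theorem for each hypothesis: P(H_i) * P(D | H_i)
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    # Total probability of the data: P(D) = sum of the unnormalized terms
    prob_data = sum(unnormalized)
    # Divide through by P(D) so the posteriors sum to 1
    return [u / prob_data for u in unnormalized]

# The cookie problem: equal priors, vanilla likelihoods of 30/40 and 20/40
print(bayesian_update([1/2, 1/2], [30/40, 20/40]))  # [0.6, 0.4]
```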
Bayes Tables#
Often when we work with Bayes’ theorem to update probabilities (i.e. when we do a Bayesian update), we’ll want to keep track of the probabilities of hypotheses as we update them using our data.
This table, which keeps track of the probabilities of all hypotheses as we update them using our data, is called a Bayes table.
Let’s go step by step to create a Bayes table for the cookie problem.
First we need a table of all our hypotheses, one row for each hypothesis:
```python
import pandas as pd

# One row per hypothesis
table = pd.DataFrame(index=['first bowl', 'second bowl'])
table
```
|  |
|---|
| first bowl |
| second bowl |
Next we add the prior probabilities of each hypothesis. Our prior is that we were equally likely to pick either bowl:
```python
table['prior'] = 1/2, 1/2
table
```
|  | prior |
|---|---|
| first bowl | 0.5 |
| second bowl | 0.5 |
The likelihood of each hypothesis is the fraction of cookies in each bowl that is vanilla. For the first bowl that is 30/40, and for the second it is 20/40:
```python
table['likelihood'] = 30/40, 20/40
table
```
|  | prior | likelihood |
|---|---|---|
| first bowl | 0.5 | 0.75 |
| second bowl | 0.5 | 0.50 |
Now let’s compute the “unnormalized posteriors.” This is just a term for the numerator of the Bayes’ Theorem formula: the prior multiplied by the likelihood.
So the unnormalized posterior for each hypothesis is:
```python
table['unnormalized'] = table['prior'] * table['likelihood']
table
```
|  | prior | likelihood | unnormalized |
|---|---|---|---|
| first bowl | 0.5 | 0.75 | 0.375 |
| second bowl | 0.5 | 0.50 | 0.250 |
The final missing piece is to divide by the total probability of the data.
What we are doing is normalizing the posteriors so that they sum to 1.
To find the total probability of the data we directly sum over the unnormalized posteriors:
```python
prob_data = table['unnormalized'].sum()
prob_data
```
0.625
This gives us 5/8, just like we calculated before.
Finally, we can use the total probability of the data to get the posterior probability of each hypothesis.
```python
table['posterior'] = table['unnormalized'] / prob_data
table
```
|  | prior | likelihood | unnormalized | posterior |
|---|---|---|---|---|
| first bowl | 0.5 | 0.75 | 0.375 | 0.6 |
| second bowl | 0.5 | 0.50 | 0.250 | 0.4 |
And so we see that the posterior probability of the first bowl, given that we observed a vanilla cookie, is 0.6, the same as when we used Bayes’ Theorem directly before.
Notice that the posteriors add up to 1 (as we should expect given mutually exclusive and collectively exhaustive hypotheses), which is why the total probability of the data is sometimes called the “normalizing constant.”
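As a quick check of that last point, and assuming the `table` DataFrame built above is still defined, we can confirm that dividing by the normalizing constant leaves posteriors that sum to 1:

```python
# The posteriors sum to 1 because we divided by the normalizing constant
print(table['posterior'].sum())  # 1.0
```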