MCMC in Practice#
For most of our Bayesian statistics we’ve been using grid methods to approximate posterior distributions. For models with one or two parameters, grid algorithms are fast and the results are precise enough for most practical purposes. With three parameters, they start to be slow, and with more than three they are usually not practical.
In a previous lecture we saw that we can solve some problems using conjugate priors. But the problems we can solve this way tend to be the same ones we can solve with grid algorithms.
For problems with more than a few parameters, the most powerful tool we have is MCMC, as introduced in the last lecture. As a review, unlike grid methods, MCMC methods don’t try to compute the posterior distribution; they sample from it instead.
It might seem strange that you can generate a sample without ever computing the distribution, but that’s the magic of MCMC.
In the last lecture we introduced the theory behind MCMC and showed a simple implementation. In practice, however, there are enough technicalities involved in implementing MCMC well that you will generally want to use a library that does it for you.
In today’s lecture, we’ll solve MCMC problems using PyMC3,
which is a library that provides implementations of several MCMC methods.
High Dimensional Problems#
Today we'll look at several examples of Bayesian regression. We'll study both low-dimensional problems (which could be handled with grid methods) and high-dimensional problems (for which grid methods fail and MCMC is our only practical option).
Happiness#
The book “Happiness and Life Satisfaction” by Esteban Ortiz-Ospina and Max Roser discusses (among many other things) the relationship between income and happiness, across countries, within countries, and over time.
It cites the “World Happiness Report”, which includes results of a multiple regression analysis that explores the relationship between happiness and six potentially predictive factors:
Income as represented by per capita GDP
Social support
Healthy life expectancy at birth
Freedom to make life choices
Generosity
Perceptions of corruption
The dependent variable is the national average of responses to the “Cantril ladder question” used by the Gallup World Poll:
Please imagine a ladder with steps numbered from zero at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?
We’ll refer to the responses as “happiness”, but it might be more precise to think of them as a measure of satisfaction with quality of life.
In the next few sections we’ll replicate the analysis in this report using Bayesian regression.
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

# Get the data file
download('https://happiness-report.s3.amazonaws.com/2020/WHR20_DataForFigure2.1.xls')
We will use Pandas to read the data into a `DataFrame`.
import pandas as pd
filename = 'WHR20_DataForFigure2.1.xls'
df = pd.read_excel(filename)
df.head(3)
| | Country name | Regional indicator | Ladder score | Standard error of ladder score | upperwhisker | lowerwhisker | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Ladder score in Dystopia | Explained by: Log GDP per capita | Explained by: Social support | Explained by: Healthy life expectancy | Explained by: Freedom to make life choices | Explained by: Generosity | Explained by: Perceptions of corruption | Dystopia + residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Finland | Western Europe | 7.8087 | 0.031156 | 7.869766 | 7.747634 | 10.639267 | 0.954330 | 71.900826 | 0.949172 | -0.059482 | 0.195445 | 1.972317 | 1.285190 | 1.499526 | 0.961271 | 0.662317 | 0.159670 | 0.477857 | 2.762835 |
| 1 | Denmark | Western Europe | 7.6456 | 0.033492 | 7.711245 | 7.579955 | 10.774001 | 0.955991 | 72.402504 | 0.951444 | 0.066202 | 0.168489 | 1.972317 | 1.326949 | 1.503449 | 0.979333 | 0.665040 | 0.242793 | 0.495260 | 2.432741 |
| 2 | Switzerland | Western Europe | 7.5599 | 0.035014 | 7.628528 | 7.491272 | 10.979933 | 0.942847 | 74.102448 | 0.921337 | 0.105911 | 0.303728 | 1.972317 | 1.390774 | 1.472403 | 1.040533 | 0.628954 | 0.269056 | 0.407946 | 2.350267 |
df.shape
(153, 20)
The `DataFrame` has one row for each of 153 countries and one column for each of 20 variables.

The column called `'Ladder score'` contains the measurements of happiness we will try to predict.
score = df['Ladder score']
Simple Regression#
To get started, let’s look at the relationship between happiness and income as represented by gross domestic product (GDP) per person.
The column named `'Logged GDP per capita'` represents the natural logarithm of GDP for each country, divided by population and corrected for purchasing power parity (PPP).
log_gdp = df['Logged GDP per capita']
The following figure is a scatter plot of `score` versus `log_gdp`, with one marker for each country.

Now let's estimate the parameters of a regression model using PyMC3:

\[ y = a x + b + \epsilon \]

where \(y\) is the dependent variable (ladder score), \(x\) is the predictive variable (log GDP), and \(\epsilon\) is a series of values from a normal distribution with mean 0 and standard deviation \(\sigma\).
\(a\) and \(b\) are the slope and intercept of the regression line. They are unknown parameters, so we will use the data to estimate them.
Introducing PyMC3#
PyMC3 is a Python library that provides several MCMC methods.
Here's how we specify this model in PyMC3. After importing `pymc3`, we create a `Model` object named `model` and specify a model of the process that generates the data:
import pymc3 as pm
x_data = log_gdp
y_data = score
with pm.Model() as model:
    a = pm.Uniform('a', 0, 4)
    b = pm.Uniform('b', -4, 4)
    sigma = pm.Uniform('sigma', 0, 2)

    y_est = a * x_data + b

    y = pm.Normal('y',
                  mu=y_est, sd=sigma,
                  observed=y_data)
First we specify the prior distributions for the parameters `a`, `b`, and `sigma`: uniform distributions with ranges wide enough to cover the posterior distributions.

Then we compute `y_est`, the estimated value of the dependent variable, from the regression equation \(a x + b\).

Finally, we specify the likelihood: the probability of observing our data `y_data` given a normal distribution with mean `y_est` and standard deviation \(\sigma\) (equivalently, the likelihood of the residual `y_data - y_est` under a normal distribution with mean 0 and standard deviation \(\sigma\)).
Note: if you are not familiar with the `with` statement in Python, it is a way to associate a block of statements with an object. In this example, the indented statements are associated with the new `Model` object.
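As a reminder of how `with` behaves in general, here is a generic Python example (unrelated to the happiness data): the object created in the `with` header is active for the indented block and is cleaned up when the block ends. `pm.Model()` uses the same mechanism to register the random variables defined inside the block.

```python
# Generic example of the `with` statement: the file object is
# available inside the block and closed automatically afterwards.
with open('example.txt', 'w') as f:
    f.write('hello')
# Here the file has already been closed.
```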
Notice how the data are included in the model:

The values of the predictive variable, `x_data`, are used to compute `y_est`.

The values of the dependent variable, `y_data`, are provided as the observed values of `y`.
Now we can use this model to generate a sample from the posterior distribution.
options = dict(return_inferencedata=False)

with model:
    trace = pm.sample(500, **options)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, b, a]
Sampling 4 chains for 1_000 tune and 500 draw iterations (4_000 + 2_000 draws total) took 22 seconds.
PyMC3 works by running MCMC on your model.
Depending on the model, PyMC3 uses one of several MCMC methods; in this example, it uses the No U-Turn Sampler (NUTS), which is one of the most efficient and reliable methods we have.
As described in the last lecture, when the sampler starts, the first values it generates are usually not a representative sample from the posterior distribution, so these values are discarded.
Instead of using a single Markov chain, PyMC3 uses multiple chains. Then we can compare results from multiple chains to make sure they are consistent.
Although we asked for a sample of 500, PyMC3 ran four chains; each chain generated 1,000 tuning steps, which were discarded, followed by 500 draws, so the result contains 2,000 retained samples in total.
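If you want to confirm this, you can query the trace object directly. A quick check, assuming the run above with four chains and 500 retained draws per chain:

```python
# With 4 chains x 500 retained draws each, the combined sample
# for each parameter should contain 2000 values.
print(len(trace['a']))   # expected: 2000
print(trace.nchains)     # expected: 4
```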
When you run the sampler, you might get warning messages about “divergences” and the “acceptance probability”. Our previous lectures on the theory behind MCMC might give you some intuition into these warnings.
The result is an object that contains samples from the joint posterior distribution of `a`, `b`, and `sigma`.
import arviz as az
with model:
    az.plot_posterior(trace, var_names=['a', 'b', 'sigma']);

The graphs show the distributions of the samples, estimated by KDE, and 94% credible intervals. In the figure, “HDI” stands for “highest-density interval”.
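ArviZ can also report these quantities numerically. Here is a sketch using `az.summary`, which tabulates posterior means, standard deviations, and HDIs; exact column names depend on your ArviZ version.

```python
with model:
    # Tabular summary of the posterior for a, b, and sigma:
    # mean, sd, 94% HDI, and convergence diagnostics.
    summary = az.summary(trace, var_names=['a', 'b', 'sigma'])
print(summary)
```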
print('Slope MMSE: {:0.3f}'.format(trace['a'].mean()))
print('Intercept MMSE: {:0.3f}'.format(trace['b'].mean()))
Slope MMSE: 0.720
Intercept MMSE: -1.220
The estimated slope is about 0.72, which suggests that an increase of one unit in log-GDP is associated with an increase of 0.72 units on the happiness ladder.
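To put that slope in perspective: because the GDP variable is on a natural-log scale (as noted above), one unit corresponds to a large multiplicative change in income. A quick back-of-the-envelope check:

```python
import numpy as np

# One unit on a natural-log scale multiplies GDP per capita by e ~ 2.72,
# so a slope of 0.72 says: multiplying income by about 2.7 is associated
# with roughly 0.72 more points on the 0-10 ladder.
print(np.exp(1))
```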
The values in the posterior distribution of `sigma` seem plausible.

The simple regression model has only three parameters, so we could have used a grid algorithm. But the regression model in the happiness report has six predictive variables, so it has eight parameters in total, including the intercept and `sigma`.
It is not practical to compute a grid approximation for a model with eight parameters. Even a coarse grid, with 20 points along each dimension, would have more than 25 billion points. And with 153 countries, we would have to compute almost 4 trillion likelihoods.
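The arithmetic behind that claim is easy to check:

```python
# A grid with 20 points per dimension over 8 parameters:
grid_points = 20 ** 8                   # 25.6 billion grid points
# Evaluating the likelihood of each of 153 countries at every point:
likelihood_evals = grid_points * 153    # ~3.9 trillion evaluations
print(grid_points, likelihood_evals)
```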
But PyMC3 can handle a model with eight parameters comfortably, as we’ll see next.
Multiple Regression#
Before we implement the multiple regression model, we'll select the columns we need from the `DataFrame`.
columns = ['Ladder score',
           'Logged GDP per capita',
           'Social support',
           'Healthy life expectancy',
           'Freedom to make life choices',
           'Generosity',
           'Perceptions of corruption']

subset = df[columns]
subset.head(3)
| | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption |
|---|---|---|---|---|---|---|---|
| 0 | 7.8087 | 10.639267 | 0.954330 | 71.900826 | 0.949172 | -0.059482 | 0.195445 |
| 1 | 7.6456 | 10.774001 | 0.955991 | 72.402504 | 0.951444 | 0.066202 | 0.168489 |
| 2 | 7.5599 | 10.979933 | 0.942847 | 74.102448 | 0.921337 | 0.105911 | 0.303728 |
The predictive variables have different units: log-GDP is in log-dollars, life expectancy is in years, and the other variables are on arbitrary scales. To make these factors comparable, we'll standardize the data so that each variable has mean 0 and standard deviation 1.
standardized = (subset - subset.mean()) / subset.std()
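It's worth a quick sanity check that the standardization did what we expect: each column should now have mean approximately 0 and standard deviation 1.

```python
# Each standardized column should have mean ~0 and standard deviation ~1.
print(standardized.mean().round(6))
print(standardized.std().round(6))
```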
Now let’s build the model. We’ll extract the dependent variable.
y_data = standardized['Ladder score']
And the predictive variables.
x1 = standardized[columns[1]]
x2 = standardized[columns[2]]
x3 = standardized[columns[3]]
x4 = standardized[columns[4]]
x5 = standardized[columns[5]]
x6 = standardized[columns[6]]
And here's the model. `b0` is the intercept; `b1` through `b6` are the parameters associated with the predictive variables.
with pm.Model() as model2:
    b0 = pm.Uniform('b0', -4, 4)
    b1 = pm.Uniform('b1', -4, 4)
    b2 = pm.Uniform('b2', -4, 4)
    b3 = pm.Uniform('b3', -4, 4)
    b4 = pm.Uniform('b4', -4, 4)
    b5 = pm.Uniform('b5', -4, 4)
    b6 = pm.Uniform('b6', -4, 4)
    sigma = pm.Uniform('sigma', 0, 2)

    y_est = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 + b5*x5 + b6*x6

    y = pm.Normal('y',
                  mu=y_est, sd=sigma,
                  observed=y_data)
with model2:
    trace2 = pm.sample(500, **options)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, b6, b5, b4, b3, b2, b1, b0]
Sampling 4 chains for 1_000 tune and 500 draw iterations (4_000 + 2_000 draws total) took 19 seconds.
From `trace2` we can extract samples from the posterior distributions of the parameters and compute their means.
param_names = ['b1', 'b2', 'b3', 'b4', 'b5', 'b6']

means = [trace2[name].mean()
         for name in param_names]
We can also compute 94% credible intervals (between the 3rd and 97th percentiles).
import numpy as np

def credible_interval(sample):
    """Compute 94% credible interval."""
    ci = np.percentile(sample, [3, 97])
    return np.round(ci, 3)

cis = [credible_interval(trace2[name])
       for name in param_names]
And summarize the results:
index = columns[1:]
table = pd.DataFrame(index=index)
table['Posterior mean'] = np.round(means, 3)
table['94% CI'] = cis
table
| | Posterior mean | 94% CI |
|---|---|---|
| Logged GDP per capita | 0.250 | [0.076, 0.418] |
| Social support | 0.222 | [0.06, 0.372] |
| Healthy life expectancy | 0.222 | [0.06, 0.372] |
| Freedom to make life choices | 0.186 | [0.092, 0.283] |
| Generosity | 0.057 | [-0.031, 0.146] |
| Perceptions of corruption | -0.098 | [-0.192, -0.004] |
It looks like GDP has the strongest association with happiness (or satisfaction), followed by social support, life expectancy, and freedom.
After controlling for those factors, the parameters for generosity and perceptions of corruption are substantially smaller. Since the CI for generosity includes 0, it is plausible that generosity is not substantially related to happiness, at least as these quantities were measured in this study.
This example demonstrates the power of MCMC to handle models with more than a few parameters.
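As an aside, a model like this is often written more compactly by putting the coefficients in a single vector. Here is a hedged sketch of an equivalent formulation, assuming PyMC3's `shape` argument and `pm.math.dot`; the spelled-out version above is easier to read, which is why we used it.

```python
# Stack the six standardized predictors into a (153, 6) array.
X = standardized[columns[1:]].values

with pm.Model() as model2_vec:
    b0 = pm.Uniform('b0', -4, 4)
    # One Uniform prior per coefficient, as a length-6 vector.
    b = pm.Uniform('b', -4, 4, shape=6)
    sigma = pm.Uniform('sigma', 0, 2)

    # Matrix-vector product replaces the sum of six products.
    y_est = b0 + pm.math.dot(X, b)

    y = pm.Normal('y', mu=y_est, sd=sigma, observed=y_data)
```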
Summary#
In this lecture we first used PyMC3 to implement a simple regression model.
Then we implemented a multiple regression model that would not have been possible to compute with a grid approximation.
MCMC is more powerful than grid methods, but that power comes with some disadvantages, which stem from the need for the underlying Markov chain to converge, as discussed in the previous two lectures.
If you are using a package like PyMC3, it will try to alert you to convergence issues with warnings about tuning steps, divergences, and effective samples. You will have to use your knowledge of the underlying MCMC algorithm and think about how to address issues around convergence.
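A common way to check for such issues after sampling is to look at the diagnostics ArviZ computes. A minimal sketch, assuming the `trace2` object from above; exact column names depend on your ArviZ version:

```python
import arviz as az

with model2:
    # r_hat close to 1 and reasonably large effective sample sizes
    # (ess_bulk, ess_tail) suggest the chains have converged and mixed well.
    diagnostics = az.summary(trace2)
print(diagnostics[['r_hat', 'ess_bulk', 'ess_tail']])
```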