Classification#
Is this a cat or a dog?#
Classification might be the most well-known application of Bayesian methods, made famous in the 1990s as the basis of the first generation of spam filters.
Classifying Penguins#
Today we’ll learn to use Bayesian classification to classify penguins by species.
Approaching the problem#
Using a dataset of measurements of three types of penguins, we’ll learn to classify a penguin as one of those three types: Adélie, Chinstrap and Gentoo.
We’ll consider a dataset with two measurements:
- Flipper Length in millimeters
- Culmen (beak) Length in millimeters
And we’ll begin to approach the problem in our typical way:
1. Define a prior distribution for how likely we think our unknown example is to belong to each of the three possible species
2. Compute the likelihood of the data under each species
3. Compute the posterior probability of each hypothesis
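These steps are just Bayes' theorem applied to each species hypothesis:

$$P(\text{species} \mid \text{data}) = \frac{P(\text{species}) \, P(\text{data} \mid \text{species})}{P(\text{data})}$$

where the denominator is the total probability of the data summed over all three species.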
The Prior#
First let’s read in the data to get the possible classification options:
import numpy as np
import pandas as pd

# read in the data
df = pd.read_csv('images/penguins.csv')
# check species names
np.sort(df.Species.unique())
array(['Adelie Penguin (Pygoscelis adeliae)',
'Chinstrap penguin (Pygoscelis antarctica)',
'Gentoo penguin (Pygoscelis papua)'], dtype=object)
We don’t know anything about our sample, so we can say it is equally likely to belong to each of the three species.
#Set equal priors
p_dist = pd.DataFrame(np.sort(df.Species.unique()), columns = ["Species"])
p_dist['probs'] = 1/3
p_dist
|   | Species | probs |
|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | 0.333333 |
| 1 | Chinstrap penguin (Pygoscelis antarctica) | 0.333333 |
| 2 | Gentoo penguin (Pygoscelis papua) | 0.333333 |
The Likelihood#
For any given observation, we'll need to calculate its likelihood under each of the three species. We'll assume each trait is normally distributed within each species.
This requires six normal distributions (three species times two traits), each fit with the mean and standard deviation of the observed data. We'll write a helper function to build those distributions from a data frame:
from scipy.stats import norm
def make_norm_map(df, colname, by='Species'):
"""Make a map from species to norm object."""
norm_map = {}
grouped = df.groupby(by)[colname]
for species, group in grouped:
mean = group.mean()
std = group.std()
norm_map[species] = norm(mean, std)
return norm_map
flipper_map = make_norm_map(df, 'Flipper Length (mm)')
flipper_map.keys()
dict_keys(['Adelie Penguin (Pygoscelis adeliae)', 'Chinstrap penguin (Pygoscelis antarctica)', 'Gentoo penguin (Pygoscelis papua)'])
Let’s make sure it works by checking the probability density of a single flipper length under one species’ distribution:
data = 193
flipper_map['Adelie Penguin (Pygoscelis adeliae)'].pdf(data)
0.054732511875530694
Now our likelihood for a single observation is just that calculation for all three species:
likelihood = [flipper_map[hypo].pdf(data) for hypo in flipper_map.keys()]
likelihood
[0.054732511875530694, 0.05172135615888163, 5.866045366199098e-05]
The Update#
As usual, the update multiplies the prior by the likelihood and divides by a normalizing constant:
# our usual update function
def update(distribution, likelihood):
distribution['probs'] = distribution['probs'] * likelihood
prob_data = distribution['probs'].sum()
distribution['probs'] = distribution['probs'] / prob_data
return distribution
update(p_dist,likelihood)
|   | Species | probs |
|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | 0.513860 |
| 1 | Chinstrap penguin (Pygoscelis antarctica) | 0.485589 |
| 2 | Gentoo penguin (Pygoscelis papua) | 0.000551 |
It seems that flipper length does not distinguish strongly between Adélie and Chinstrap penguins.
But maybe culmen length can make this distinction, so let’s use it to do a second round of classification.
Repeating our flipper-length steps with culmen length:
# make normal distributions for each species from data
culmen_map = make_norm_map(df, 'Culmen Length (mm)')
# lets say we observed a culmen length of 48 mm
likelihood = [culmen_map[hypo].pdf(48) for hypo in culmen_map.keys()]
# update
update(p_dist,likelihood)
|   | Species | probs |
|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | 0.003455 |
| 1 | Chinstrap penguin (Pygoscelis antarctica) | 0.995299 |
| 2 | Gentoo penguin (Pygoscelis papua) | 0.001246 |
Naive Bayesian Classifier#
What we just implemented is the famous Naive Bayes Classifier! It simply ignores any correlation between features and performs an independent update for each observation.
Naive Bayes classifiers, while perhaps “naive” in ignoring correlations between features, are surprisingly accurate. This classifier alone can classify 94.7% of the penguins in our dataset correctly!
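As a sketch of how an accuracy figure like that could be computed (this is not the chapter's exact code), the example below fits per-species normals for each feature, classifies every row by multiplying the per-feature likelihoods, and reports the fraction classified correctly. It uses a small synthetic dataset with made-up measurements so it runs standalone; with the real `penguins.csv` the same loop would be applied to all three species.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Tiny synthetic stand-in for the penguin data (made-up numbers,
# two species only, chosen just to illustrate the procedure).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Species': ['A'] * 50 + ['B'] * 50,
    'Flipper Length (mm)': np.concatenate([rng.normal(190, 7, 50),
                                           rng.normal(217, 7, 50)]),
    'Culmen Length (mm)': np.concatenate([rng.normal(39, 3, 50),
                                          rng.normal(47, 3, 50)]),
})

def make_norm_map(df, colname, by='Species'):
    """Map each species to a normal distribution fit to one feature."""
    return {species: norm(group.mean(), group.std())
            for species, group in df.groupby(by)[colname]}

features = ['Flipper Length (mm)', 'Culmen Length (mm)']
norm_maps = {col: make_norm_map(df, col) for col in features}
species = sorted(df.Species.unique())

def classify_naive(row):
    # Multiply independent per-feature likelihoods; with a uniform
    # prior, the species with the largest product wins.
    scores = [np.prod([norm_maps[col][s].pdf(row[col]) for col in features])
              for s in species]
    return species[int(np.argmax(scores))]

predictions = df.apply(classify_naive, axis=1)
accuracy = (predictions == df.Species).mean()
print(f"accuracy: {accuracy:.1%}")
```

Because the prior is uniform and we only need the argmax, we can skip the normalizing step entirely when scoring.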
Making the Classifier Slightly Less Naive#
We actually already have all the tools to make a less naive classifier. Previously we used a univariate normal distribution for each feature within each species. Instead, we can use a single multivariate normal per species.
To review, a multivariate normal is captured by a vector of means and a covariance matrix.
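For reference, the density of a $k$-dimensional multivariate normal with mean vector $\boldsymbol{\mu}$ and covariance matrix $\Sigma$ is:

$$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^k \lvert \Sigma \rvert}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

The off-diagonal entries of $\Sigma$ are what let this distribution capture correlation between the two features.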
var1 = 'Flipper Length (mm)'
var2 = 'Culmen Length (mm)'
features = df[[var1, var2]]
# the vector of means
mean = features.mean()
mean
Flipper Length (mm) 200.915205
Culmen Length (mm) 43.921930
dtype: float64
# the covariance matrix
cov = features.cov()
cov
|   | Flipper Length (mm) | Culmen Length (mm) |
|---|---|---|
| Flipper Length (mm) | 197.731792 | 50.375765 |
| Culmen Length (mm) | 50.375765 | 29.807054 |
Luckily, in addition to the usual normal distribution, SciPy also provides a multivariate normal. We can plug our means and covariance into it to get probability densities for observations:
from scipy.stats import multivariate_normal
# make multivariate normal from means and covariance
multinorm = multivariate_normal(mean, cov)
# try an example datapoint
multinorm.pdf([193,48])
0.0007850305643354554
Like before, let’s write a function that creates a distribution for each species from the data. This time we need just one multivariate normal per species.
def make_multinorm_map(df, colnames):
"""Make a map from each species to a multivariate normal."""
multinorm_map = {}
grouped = df.groupby('Species')
for species, group in grouped:
features = group[colnames]
mean = features.mean()
cov = features.cov()
multinorm_map[species] = multivariate_normal(mean, cov)
return multinorm_map
multinorm_map = make_multinorm_map(df, [var1, var2])
And now let’s run the whole procedure: prior, likelihood, and update:
#Set equal priors
p_dist = pd.DataFrame(np.sort(df.Species.unique()), columns = ["Species"])
p_dist['probs'] = 1/3
# Like before, we observed a flipper length of 193 and a culmen length of 48 mm
likelihood = [multinorm_map[hypo].pdf([193,48]) for hypo in multinorm_map.keys()]
# update
update(p_dist,likelihood)
|   | Species | probs |
|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | 0.002740 |
| 1 | Chinstrap penguin (Pygoscelis antarctica) | 0.997257 |
| 2 | Gentoo penguin (Pygoscelis papua) | 0.000003 |
A penguin with those measurements is almost certainly a Chinstrap!
Interestingly, it turns out this more complicated classifier can classify 95.3% of the penguins in our dataset correctly, just barely more than the naive classifier’s 94.7%.
In fact, simple naive Bayes classifiers work well for many classification tasks. This is great news: they’re very easy to implement and take very little computational power.