Classification#
Is this a cat or a dog?#
Classification might be the most well-known application of Bayesian methods, made famous in the 1990s as the basis of the first generation of spam filters.
Classifying Penguins#
Today we’ll learn to use Bayesian classification to classify penguins by species.
Approaching the problem#
Using a dataset of measurements of three types of penguins, we’ll learn to classify a penguin as one of those three types: Adélie, Chinstrap and Gentoo.
We’ll consider a dataset with two measurements:
- Flipper Length in millimeters
- Culmen (beak) Length in millimeters
And we’ll begin to approach the problem in our typical way:
1. Define a prior distribution for how likely we think our unknown example is to belong to each of the three possible species
2. Compute the likelihood of the data under each species
3. Compute the posterior probability of each hypothesis
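These steps are just Bayes' theorem applied to each species hypothesis:

$$P(\text{species} \mid \text{data}) = \frac{P(\text{species}) \, P(\text{data} \mid \text{species})}{P(\text{data})}$$

where the denominator is the total probability of the data summed over all three species.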
The Prior#
First let’s read in the data to get the possible classification options:
import numpy as np
import pandas as pd

# read in the data
df = pd.read_csv('images/penguins.csv')
# check species names
np.sort(df.Species.unique())
array(['Adelie Penguin (Pygoscelis adeliae)',
'Chinstrap penguin (Pygoscelis antarctica)',
'Gentoo penguin (Pygoscelis papua)'], dtype=object)
We don’t know anything about our sample, so we can say it is equally likely to belong to each of the three species.
#Set equal priors
p_dist = pd.DataFrame(np.sort(df.Species.unique()), columns = ["Species"])
p_dist['probs'] = 1/3
p_dist
|   | Species | probs |
|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | 0.333333 |
| 1 | Chinstrap penguin (Pygoscelis antarctica) | 0.333333 |
| 2 | Gentoo penguin (Pygoscelis papua) | 0.333333 |
The Likelihood#
For any given observation, we'll need to calculate its likelihood under each of the three species. We'll assume each trait is normally distributed within each species.
This requires six normal distributions (three species times two traits), each fit with the mean and standard deviation of the observed data. We'll write a helper function to build those distributions from a data frame:
from scipy.stats import norm
def make_norm_map(df, colname, by='Species'):
"""Make a map from species to norm object."""
norm_map = {}
grouped = df.groupby(by)[colname]
for species, group in grouped:
mean = group.mean()
std = group.std()
norm_map[species] = norm(mean, std)
return norm_map
flipper_map = make_norm_map(df, 'Flipper Length (mm)')
flipper_map.keys()
dict_keys(['Adelie Penguin (Pygoscelis adeliae)', 'Chinstrap penguin (Pygoscelis antarctica)', 'Gentoo penguin (Pygoscelis papua)'])
Let’s make sure it works by checking the probability density of a single flipper length under one species’ distribution:
data = 193
flipper_map['Adelie Penguin (Pygoscelis adeliae)'].pdf(data)
0.054732511875530694
Now our likelihood for a single observation is just that calculation for all three species:
likelihood = [flipper_map[hypo].pdf(data) for hypo in flipper_map.keys()]
likelihood
[0.054732511875530694, 0.05172135615888163, 5.866045366199098e-05]
The Update#
As usual, the update multiplies the prior by the likelihood and divides by a normalizing constant:
# our usual update function
def update(distribution, likelihood):
distribution['probs'] = distribution['probs'] * likelihood
prob_data = distribution['probs'].sum()
distribution['probs'] = distribution['probs'] / prob_data
return distribution
update(p_dist,likelihood)
|   | Species | probs |
|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | 0.513860 |
| 1 | Chinstrap penguin (Pygoscelis antarctica) | 0.485589 |
| 2 | Gentoo penguin (Pygoscelis papua) | 0.000551 |
It seems that flipper length does not distinguish strongly between Adélie and Chinstrap penguins.
But maybe culmen length can make this distinction, so let’s use it to do a second round of classification.
Repeating our flipper-length steps with culmen length:
# make normal distributions for each species from data
culmen_map = make_norm_map(df, 'Culmen Length (mm)')
# lets say we observed a culmen length of 48 mm
likelihood = [culmen_map[hypo].pdf(48) for hypo in culmen_map.keys()]
# update
update(p_dist,likelihood)
|   | Species | probs |
|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | 0.003455 |
| 1 | Chinstrap penguin (Pygoscelis antarctica) | 0.995299 |
| 2 | Gentoo penguin (Pygoscelis papua) | 0.001246 |
Naive Bayesian Classifier#
What we just implemented is the famous Naive Bayes Classifier! It simply ignores any correlation between features and performs an independent update for each observation.
Naive Bayes classifiers, while perhaps “naive” in ignoring correlations between features, are surprisingly accurate. This classifier alone can classify 94.7% of the penguins in our dataset correctly!
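As a sketch of how an accuracy figure like that could be computed (this is not the chapter's exact code), the example below fits per-species normals for each feature, classifies every row by multiplying the per-feature likelihoods, and reports the fraction classified correctly. It uses a small synthetic dataset with made-up measurements so it runs standalone; with the real `penguins.csv` the same loop would be applied to all three species.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Tiny synthetic stand-in for the penguin data (made-up numbers,
# two species only, chosen just to illustrate the procedure).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Species': ['A'] * 50 + ['B'] * 50,
    'Flipper Length (mm)': np.concatenate([rng.normal(190, 7, 50),
                                           rng.normal(217, 7, 50)]),
    'Culmen Length (mm)': np.concatenate([rng.normal(39, 3, 50),
                                          rng.normal(47, 3, 50)]),
})

def make_norm_map(df, colname, by='Species'):
    """Map each species to a normal distribution fit to one feature."""
    return {species: norm(group.mean(), group.std())
            for species, group in df.groupby(by)[colname]}

features = ['Flipper Length (mm)', 'Culmen Length (mm)']
norm_maps = {col: make_norm_map(df, col) for col in features}
species = sorted(df.Species.unique())

def classify_naive(row):
    # Multiply independent per-feature likelihoods; with a uniform
    # prior, the species with the largest product wins.
    scores = [np.prod([norm_maps[col][s].pdf(row[col]) for col in features])
              for s in species]
    return species[int(np.argmax(scores))]

predictions = df.apply(classify_naive, axis=1)
accuracy = (predictions == df.Species).mean()
print(f"accuracy: {accuracy:.1%}")
```

Because the prior is uniform and we only need the argmax, we can skip the normalizing step entirely when scoring.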
Making the Classifier Slightly Less Naive#
We actually already have all the tools to make a less naive classifier. Previously we used a univariate normal distribution for each feature within each species. Instead, we can use a single multivariate normal per species.
To review, a multivariate normal is captured by a vector of means and a covariance matrix.
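For reference, the density of a $k$-dimensional multivariate normal with mean vector $\boldsymbol{\mu}$ and covariance matrix $\Sigma$ is:

$$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^k \lvert \Sigma \rvert}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

The off-diagonal entries of $\Sigma$ are what let this distribution capture correlation between the two features.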
var1 = 'Flipper Length (mm)'
var2 = 'Culmen Length (mm)'
features = df[[var1, var2]]
# the vector of means
mean = features.mean()
mean
Flipper Length (mm) 200.915205
Culmen Length (mm) 43.921930
dtype: float64
# the covariance matrix
cov = features.cov()
cov
|   | Flipper Length (mm) | Culmen Length (mm) |
|---|---|---|
| Flipper Length (mm) | 197.731792 | 50.375765 |
| Culmen Length (mm) | 50.375765 | 29.807054 |
Luckily, in addition to the usual normal distribution, SciPy also provides a multivariate normal. We can plug our means and covariance into it to get probability densities for observations:
from scipy.stats import multivariate_normal
# make multivariate normal from means and covariance
multinorm = multivariate_normal(mean, cov)
# try an example datapoint
multinorm.pdf([193,48])
0.0007850305643354554
Like before, let’s write a function that creates a distribution for each species from the data. This time we need just one multivariate normal per species.
def make_multinorm_map(df, colnames):
"""Make a map from each species to a multivariate normal."""
multinorm_map = {}
grouped = df.groupby('Species')
for species, group in grouped:
features = group[colnames]
mean = features.mean()
cov = features.cov()
multinorm_map[species] = multivariate_normal(mean, cov)
return multinorm_map
multinorm_map = make_multinorm_map(df, [var1, var2])
And now let’s run the whole procedure: prior, likelihood, and update:
#Set equal priors
p_dist = pd.DataFrame(np.sort(df.Species.unique()), columns = ["Species"])
p_dist['probs'] = 1/3
# Like before, we observed a flipper length of 193 and a culmen length of 48 mm
likelihood = [multinorm_map[hypo].pdf([193,48]) for hypo in multinorm_map.keys()]
# update
update(p_dist,likelihood)
|   | Species | probs |
|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | 0.002740 |
| 1 | Chinstrap penguin (Pygoscelis antarctica) | 0.997257 |
| 2 | Gentoo penguin (Pygoscelis papua) | 0.000003 |
A penguin with those measurements is almost certainly a Chinstrap!
Interestingly, it turns out this more complicated classifier can classify 95.3% of the penguins in our dataset correctly, just barely more than the naive classifier’s 94.7%.
In fact, simple naive Bayes classifiers work well for many classification tasks. This is great news: they’re very easy to implement and take very little computational power.