import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import matplotlib as mp
import sklearn
from IPython.display import Image

%matplotlib inline

Naive Bayes and Support Vector Machines#


Today we’ll look at two more very commonly-used, widely-applicable classification methods.

  • the Naive Bayes Classifier

  • the Support Vector Machine

Naive Bayes#

The classification problem most generally seeks to associate a label \(y\) with a set of features \(\mathbf{x}\).

Perhaps the most fundamental way to attack this problem is via probability.

We’ll now consider how to use probability to directly address the classification problem.

A key tool will be Bayes Rule, so let’s review that.

Bayes Rule#

We’ll start from the definition of conditional probability:

\[ P[A\,|\,C] = \frac{P[A \text{ and } C]}{P[C]} \]

So:

\[ P[A \text{ and } C] = P[A\,|\,C] \cdot P[C]. \]
plt.figure(figsize = (6,6))
ax=plt.gcf().gca()
circle = mp.patches.Circle([1, 1], 0.7, facecolor = 'blue', alpha = 0.2)
ax.add_artist(circle)
circle = mp.patches.Circle([1.7, 1], 0.4, facecolor = 'red', alpha = 0.2)
ax.add_artist(circle)
ax.axis('square')
ax.axis('off')
plt.xlim([0, 2.5])
plt.ylim([0, 2])
plt.text(1, .95, 'A', fontsize = 16)
plt.text(1.85, .95, 'C', fontsize = 16)
plt.text(1.35, .95, 'A & C', fontsize = 16);
# plt.title(r'$P[C\,|\,A]\,P[A] = P[A\,|\,C]\,P[C]$'); 

Switching the roles of \(A\) and \(C\), we can equally say:

\[ P[C \text{ and } A] = P[C\,|\,A] \cdot P[A]. \]

Now \(P[C \text{ and } A] = P[A \text{ and } C]\), so:

\[ P[A\,|\,C] \cdot P[C] = P[C\,|\,A] \cdot P[A] \]

Rearranging, we get Bayes’ Rule:

\[ P[A\,|\,C] = \frac{P[C\,|\,A] \cdot P[A]}{P[C]}. \]

So there is nothing “magic” about Bayes’ Rule: it simply follows from the basic laws of probability.

However, it is very useful!

The reason is that often, \(P[A\,|\,C]\) is not known, but the expressions on the right are known.

(We’ve already seen how this is used in the EM algorithm for Gaussian Mixture Modeling.)

Using Bayes Rule in a Classifier#

Let’s start with a simple example:

  • A doctor knows that meningitis causes a stiff neck 75% of the time.

  • This is \(P[S\,|\,M]\).

A patient presents with a stiff neck. What is the probability she has meningitis?

Well, we can’t say.

What we know is \(P[S\,|\,M]\), but what we want to know is \(P[M\,|\,S]\)!

Now perhaps the doctor also knows that

  • 1 in 20 people in the population have a stiff neck at any given time

    • \(P[S] = 1/20\)

  • 1 in 10,000 people in the population have meningitis at any given time

    • \(P[M] = 1/10000\)

Now, we can calculate the answer needed:

\[ P[M\,|\,S] = \frac{P[S\,|\,M]}{P[S]}\cdot P[M] = \frac{3/4}{1/20} \cdot 1/10000 = \fbox{$15 \cdot 1/10000$} = 0.0015. \]

I wrote the expressions above that way to point out the essence of Bayesian reasoning:

  • A random person has probability 1/10000 of having meningitis

  • When we learn that the person has a stiff neck, it increases their probability by a factor of 15.

We give these probabilities names according to their roles:

  • The random person’s probability (1/10000) is called the prior probability

  • The specific patient’s probability (15 \(\cdot\) 1/10000) is called the posterior probability.
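As a quick check, here is the same calculation in Python, using only the numbers given above:

p_s_given_m = 3 / 4     # P[S | M]: meningitis causes a stiff neck 75% of the time
p_s = 1 / 20            # P[S]: prior probability of a stiff neck
p_m = 1 / 10000         # P[M]: prior probability of meningitis

posterior = (p_s_given_m / p_s) * p_m    # Bayes' Rule
print(posterior)                         # 0.0015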

We are going to use this same principle to construct a classifier.

Given a set of items with features and class labels:

The class label will play the role of “meningitis” and the various attributes of the item will play the role of “stiff neck.”

We will then ask “how does the value of each feature change the probability of the class label?”

More formally:

Consider an item \(\mathbf{x}\) having attributes \(x_1, x_2, \dots, x_n\).

There are various classes (labels) for items: \(C_1, \dots, C_k\).

Our goal is to predict the class of \(\mathbf{x}\).

To do that, we will compute \(P[C_1\,|\,\mathbf{x}], P[C_2\,|\,\mathbf{x}], \dots, P[C_k\,|\,\mathbf{x}]\).

These form a soft classification of \(\mathbf{x}\).

From them, we can form a hard classification. One way would simply be to choose the class with highest probability.

This is the MAP (Maximum A Posteriori) Estimate:

\[\hat{C} = \arg\max_{\{C_i\}} P[C_i\,|\,\mathbf{x}].\]

Now \(P[C_i\,|\,\mathbf{x}] = P[C_i\,|\,x_1, x_2, \dots, x_n]\)

How can we approach the problem of computing \(P[C_i\,|\,\mathbf{x}]\)?

The key idea is that Bayes Rule makes clear that

\[ P[C_i\,|\,\mathbf{x}] = \frac{P[\mathbf{x}\,|\,C_i]}{P[\mathbf{x}]} \cdot P[C_i] \]

Now, when we vary \(C_i\) in the above expression, \(P[\mathbf{x}]\) is not changing.

So …

The \(\hat{C}\) that maximizes

\[P[C_i\,|\,x_1, x_2, \dots, x_n]\]

is the same as the \(\hat{C}\) that maximizes

\[P[x_1, x_2, \dots, x_n\,|\,C_i]\cdot P[C_i].\]

This gives us an angle of attack on the problem.

The difficult problem that remains is how to estimate

\[ P[x_1, x_2, \dots, x_n\,|\,C_i]. \]

To see the challenge, imagine if we tried to compute \(P[x_1, x_2, \dots, x_n\,|\,C_i]\) directly from data.

Attacking head-on, we could use a histogram to estimate the necessary distribution. In other words, simply count up how many times we see each combination of feature values.

Let’s say there are 20 features (\(n = 20\)), and 10 possible values for each feature.

Then for each class \(C_i\) we need to construct a histogram with \(10^{20}\) bins!

We would never have enough data to fill all those bins.

The underlying problem we face is the high dimensionality of the feature space.

The size of our histogram is exponential in the number of features.

So, we need to find a way to reduce the exponential size of the estimation problem.

We will do that by factoring the distribution \(P[x_1, x_2, \dots, x_n\,|\,C_i]\).

Here is where the “Naive” part comes in.

We will assume that, conditioned on the class, the attributes are independent of one another.

That is, within any given class, the value of one attribute tells us nothing about the values of the other attributes: all correlations among attributes are zero within each class.

This is indeed a “naive” assumption … but it can be surprisingly effective in practice.

That implies that:

\[ P[x_1, x_2, \dots, x_n\,|\,C_i] = P[x_1\,|\,C_i] \cdot P[x_2\,|\,C_i] \cdot \dots P[x_n\,|\,C_i] \]

This is very helpful computationally, because the factors \(P[x_j\,|\,C_i]\) are individually much lower-dimensional than the full distribution.

In a naive Bayes model, the quantity we calculate for each class \(C_i\) is

\[ \left(P[x_1\,|\,C_i] \cdot P[x_2\,|\,C_i] \cdot \dots P[x_n\,|\,C_i]\right) \cdot P[C_i] \]

You can see each conditional probability as a “correction factor” to \(P[C_i]\).

Each factor \(P[x_j\,|\,C_i]\) tells us how we should update our confidence in \(C_i\) based on the value of a particular feature \(x_j\).

So, what remains then is to estimate \(P[x_j\,|\,C_i]\) for all \(x_j\) and \(C_i\).

We will estimate these quantities from our training data.

To summarize the steps of Naive Bayes:

Training

  • Compute all the per-class attribute probabilities \(P[x_j\,|\,C_i]\) from training data.

  • Compute all the class probabilities \(P[C_i]\) from the training data.

Predicting

  • For a given item \(\mathbf{x} = (x_1, x_2, \dots, x_n)\),

    • For each class \(C_i,\) compute \(P[x_1\,|\,C_i] \cdot P[x_2\,|\,C_i] \cdot \dots P[x_n\,|\,C_i] \cdot P[C_i]\)

    • For a hard classification, return the class that maximizes the above expression.

      • That is, the MAP estimate.
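To make these steps concrete, here is a minimal from-scratch sketch of Naive Bayes for discrete features. The toy arrays X_train and y_train are hypothetical, and a practical implementation would also smooth the counts and work with log-probabilities:

def train_nb(X, y):
    """Estimate P[C_i] and P[x_j | C_i] from discrete training data."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    # cond[c][j][v] = P[feature j takes value v | class c]
    cond = {c: [{v: np.mean(X[y == c, j] == v) for v in np.unique(X[:, j])}
                for j in range(X.shape[1])]
            for c in classes}
    return classes, priors, cond

def predict_nb(x, classes, priors, cond):
    """Return the MAP class for a single item x."""
    scores = {}
    for c in classes:
        score = priors[c]
        for j, v in enumerate(x):
            score *= cond[c][j].get(v, 0.0)   # zero if this value was never seen in class c
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: two discrete features, two classes
X_train = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [1, 1]])
y_train = np.array([0, 0, 1, 1, 1])
model = train_nb(X_train, y_train)
print(predict_nb(np.array([1, 1]), *model))   # predicts class 1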

Computing Attribute Probabilities from Data#

All that remains is to compute the conditional attribute probabilities from data.

The strategy depends on the attribute type: discrete or continuous.

Figure

Discrete Attributes.

Discrete attributes, such as categorical or ordinal attributes, can be handled via histograms.

In the above table, to handle the Marital Status attribute for the \(\text{Evade} = \text{No}\) class, we need to compute

\[ P[\text{Single}\,|\,\text{Evade }=\text{ No}] = 2 / 7 = 0.29 \]

and

\[ P[\text{Married}\,|\,\text{Evade }=\text{ No}] = 4 / 7 = 0.57 \]

and

\[ P[\text{Divorced}\,|\,\text{Evade }=\text{ No}] = 1 / 7 = 0.14 \]

One problem that can arise is when a histogram bin has zero entries. Then the conditional probability for this attribute value is zero, which overrides all the other factors and yields a zero probability.

There are various strategies for making small corrections to the counts that avoid this problem.
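One common correction is Laplace ("add-one") smoothing, which adds a small count to every bin before normalizing. A minimal sketch, with hypothetical counts:

counts = np.array([4, 2, 0])   # hypothetical counts for three attribute values within one class

p_mle = counts / counts.sum()  # unsmoothed estimate: the empty bin gives probability 0

alpha = 1                      # add-one smoothing
p_laplace = (counts + alpha) / (counts.sum() + alpha * len(counts))

print(p_mle)       # approximately [0.667, 0.333, 0.0  ]
print(p_laplace)   # approximately [0.556, 0.333, 0.111]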

Figure

Continuous Attributes.

Continuous attributes can be handled via histograms as well, by binning up the values.

In the above example, we could create bins to hold ranges of values for Taxable Income.

However, another commonly used approach is to assume that the data follow a parametric probability distribution.

Most often the Gaussian is used (of course).

So we might form conditional probabilities for Taxable Income as

\[ P[\text{Taxable Income} = x\,|\,\text{Evade }=\text{ No}] = \mathcal{N}(x; \mu_\text{No}, \sigma_\text{No}) \]
from scipy.stats import norm
eno = np.array([125000, 100000, 70000, 120000, 60000, 220000, 75000])
eyes = np.array([95000, 85000, 75000])
mu_no = np.mean(eno)
sig_no = np.std(eno)
mu_yes = np.mean(eyes)
sig_yes = np.std(eyes)
plt.figure()
x = np.linspace(norm.ppf(0.001, loc = mu_no, scale = sig_no), norm.ppf(0.999, loc = mu_no, scale = sig_no), 100)
plt.plot(x, norm.pdf(x, loc = mu_no, scale = sig_no),'b-', lw = 5, alpha = 0.6, label = 'Evade = No')
x = np.linspace(norm.ppf(0.001, loc = mu_yes, scale = sig_yes), norm.ppf(0.999, loc = mu_yes, scale = sig_yes), 100)
plt.plot(x, norm.pdf(x, loc = mu_yes, scale = sig_yes),'g-', lw = 5, alpha = 0.6, label = 'Evade = Yes')
plt.xlim([0, 200000])
plt.xlabel('Taxable Income', size=14)
plt.legend(loc = 'best')
plt.title('Class-Conditional Distributions')
plt.ylabel(r'$p(x)$', size=14);
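To use these fitted densities in the classifier, we evaluate each class-conditional density at the item's income and multiply by the class prior. A small sketch continuing the code above, for a hypothetical item with Taxable Income of 90,000 (the priors 7/10 and 3/10 are the class frequencies in the example data above):

x = 90000
prior_no, prior_yes = 7/10, 3/10     # 7 "No" and 3 "Yes" examples above

score_no  = norm.pdf(x, loc = mu_no,  scale = sig_no)  * prior_no
score_yes = norm.pdf(x, loc = mu_yes, scale = sig_yes) * prior_yes

print('MAP class:', 'No' if score_no > score_yes else 'Yes')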

In summary:

  • Naive Bayes attacks the classification problem through probability,

    • which is perhaps the most natural formal tool for the problem.

  • Training is very simple, based on estimating class-conditional histograms or parametric densities of features.

  • Naive Bayes can work well in high-dimensional settings (many features)

    • Often the MAP estimate selects the correct label even when the estimated probabilities themselves are not very accurate.

  • Its principal drawback is its assumption of independence among the features.

Support Vector Machines#

We now turn to the support vector machine (SVM).

The SVM is based on explicit geometric considerations about how best to build a classifier.

As an example, here is a set of training data, considered as points in \(\mathbb{R}^d\):

Figure

We will start with the idea of a linear separator.

This is a hyperplane that forms a decision boundary.

Here is one possible separator:

Figure

Here is another possible separator:

Figure

Which separator is better?

Well, they both perfectly separate the two classes in the training data.

Figure

But what we really care about is accuracy on the test data – generalization ability.

It seems that \(B_1\) is a better choice, because it is farther from both classes.

So, new data falling in the region between training classes is more likely to be correctly classified by \(B_1\).

This leads to a principle for choosing the best separator:

  • We are concerned with the margin between the separator and the data, and

  • We prefer the separator that maximizes the margin.

In fact, there are theoretical results suggesting that this is an optimal strategy for choosing a separator that has good generalization ability.

Figure

Linear SVM: Separable Case#

Let’s see how we can train an SVM.

As usual, our training data consists of tuples \((\mathbf{x}_i, y_i)\), where \(\mathbf{x}_i \in \mathbb{R}^d\), and (by convention) \(y_i \in \{-1, 1\}\).

We’re going to assume (for now) that our data can be perfectly separated by a hyperplane.

Any hyperplane (such as \(B_1\) below) can be written as:

\[ w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b = 0\]

or more concisely:

\[ \mathbf{w}^T\mathbf{x} + b = 0. \]

So our decision boundary (ie, our classifier) has parameters \(\{w_i\}\) and \(b\).

Figure

For the positive-class points (circles) closest to the decision boundary,

\[ \mathbf{w}^T\mathbf{x}_+ + b = k \]

for some positive \(k\).

Figure

Likewise, for the negative-class points (squares) closest to the decision boundary,

\[ \mathbf{w}^T\mathbf{x}_- + b = -k \]

for the same \(k\).

We’ll rescale the parameters \(\mathbf{w}\) and \(b\) by dividing through by \(k\), thus obtaining new equations for the same hyperplanes:

\[ b_{11}: \mathbf{w}^T\mathbf{x}_+ + b = 1 \]
\[ b_{12}: \mathbf{w}^T\mathbf{x}_- + b = -1 \]

How far apart are these hyperplanes?

The vector \(\mathbf{w}\) is orthogonal to the hyperplanes \(b_{11}\) and \(b_{12}\).

So the distance between the two hyperplanes is the component of \((\mathbf{x}_+ - \mathbf{x}_-)\) that is in the direction of \(\mathbf{w}\).

In other words, the projection of \((\mathbf{x}_+ - \mathbf{x}_-)\) onto \(\mathbf{w}\).

Which is given by:

\[ d = \frac{\mathbf{w}^T}{\Vert\mathbf{w}\Vert}(\mathbf{x}_+ - \mathbf{x}_-). \]

But subtracting the equations for the hyperplanes, we get:

\[ \mathbf{w}^T(\mathbf{x}_+ - \mathbf{x}_-) = 2 \]

So we conclude:

\[ d = \frac{2}{\Vert\mathbf{w}\Vert}. \]
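A quick numeric check of this formula in two dimensions, with hypothetical values \(\mathbf{w} = (3, 4)\) and \(b = 0\):

w = np.array([3.0, 4.0])
b = 0.0
x_plus  = np.array([0.0,  0.25])    # lies on w^T x + b = 1
x_minus = np.array([0.0, -0.25])    # lies on w^T x + b = -1

d = (w / np.linalg.norm(w)) @ (x_plus - x_minus)
print(d, 2 / np.linalg.norm(w))     # both equal 0.4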

Now, we have a measure for how good a separator is in terms of its parameters \(\mathbf{w}\).

We want the separator to maximize \( d = \frac{2}{\Vert\mathbf{w}\Vert},\) or equivalently, minimize \( \Vert\mathbf{w}\Vert\).

What separators should we consider (ie, search through)?

The answer is, we consider all separators that correctly classify each point.

That is, for all training points (\(\mathbf{x}_i, y_i)\):

\[ \mathbf{w}^T\mathbf{x}_i + b \ge 1 \text{ if } y_i = 1 \]

and

\[ \mathbf{w}^T\mathbf{x}_i + b \le -1 \text{ if } y_i = -1 \]

So now we can formally state the problem of defining the maximum margin separator:

\[ \mathbf{w}^* = \arg\min_\mathbf{w} \Vert\mathbf{w}\Vert\]

Subject to:

\[ \mathbf{w}^T\mathbf{x}_i + b \ge 1 \text{ if } y_i = 1 \]
\[ \mathbf{w}^T\mathbf{x}_i + b \le -1 \text{ if } y_i = -1 \]

This is a constrained optimization problem with a quadratic objective, since minimizing \(\Vert\mathbf{w}\Vert\) is equivalent to minimizing \(\frac{1}{2}\Vert\mathbf{w}\Vert^2 = \frac{1}{2}\mathbf{w}^T\mathbf{w}\).

The quadratic form \(\mathbf{w}^T\mathbf{w}\) is positive definite, so the objective is strictly convex (it has a unique global minimum).

Such problems are called quadratic programs.

There are standard methods for solving them. The methods are effective but can be slow.

The complexity of the problem grows with the number of constraints (ie, the number of training points).

Ultimately, only a subset of the training points (constraints) will determine the final solution.

The points that determine the solution “support” it. They are the support vectors.
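As an illustration (this is not how dedicated SVM solvers work in practice), the quadratic program can be handed to a general-purpose constrained optimizer. A minimal sketch on a tiny, hypothetical, linearly separable 2-D dataset, using scipy.optimize.minimize:

from scipy.optimize import minimize

# Tiny separable dataset (hypothetical)
X_toy = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y_toy = np.array([1, 1, -1, -1])

# Variables: theta = (w1, w2, b).  Objective: (1/2) ||w||^2.
objective = lambda theta: 0.5 * np.dot(theta[:2], theta[:2])

# One constraint per training point: y_i (w^T x_i + b) - 1 >= 0
constraints = [{'type': 'ineq',
                'fun': lambda theta, xi=xi, yi=yi: yi * (xi @ theta[:2] + theta[2]) - 1}
               for xi, yi in zip(X_toy, y_toy)]

result = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = result.x[:2], result.x[2]
print('w =', w, ' b =', b, ' margin =', 2 / np.linalg.norm(w))

# The constraints that are active (functional margin equal to 1) belong to the support vectors
margins = y_toy * (X_toy @ w + b)
print('support vectors:', X_toy[np.isclose(margins, 1, atol=1e-3)])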

Figure

Linear SVM: Non-Separable Case#

It may well happen that there is no hyperplane that perfectly separates the classes.

Figure

In this case, we allow points to fall on the “wrong” side of their separator, but we add a penalty for this occurrence.

To express this formally, we introduce slack variables, one per data point: \(\xi_i\).

Each \(\xi_i\) measures how far the data point is on the “wrong side” of its separator.

So the constraints are loosened:

\[ \mathbf{w}^T\mathbf{x}_i + b \ge 1-\xi_i \text{ if } y_i = 1 \]
\[ \mathbf{w}^T\mathbf{x}_i + b \le -1+\xi_i \text{ if } y_i = -1 \]
\[ \xi_i \ge 0.\]

And the new problem is:

\[ \mathbf{w}^* = \arg\min_\mathbf{w} \left(\Vert\mathbf{w}\Vert + C (\sum_{i=1}^N \xi_i)^k\right),\]

subject to the above constraints. (Here \(k\) is a constant exponent, commonly taken to be 1.)

Notice we have introduced a hyperparameter: \(C\).

\(C\) essentially controls the complexity of the SVM.

Hence it needs to be set by cross-validation.

A small value of \(C\) allows many points to fall on the wrong side of their corresponding hyperplane (that is, to have a nonzero \(\xi_i\)), resulting in a large number of support vectors. So small \(C\) yields a smoother, more stable decision boundary.

Large \(C\) values will penalize violations heavily, and will result in fewer nonzero \(\xi_i\)s, leading to fewer support vectors.
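To see the slack variables in action, we can fit a soft-margin SVM (here with scikit-learn, which we use later in this lecture) and read each point's slack off the decision function, since at the optimum \(\xi_i = \max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))\). A small sketch on hypothetical overlapping data:

import sklearn.svm as svm

rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
                   rng.normal(1.5, 1.0, size=(20, 2))])   # two overlapping clusters
y_toy = np.array([-1] * 20 + [1] * 20)

clf = svm.SVC(kernel='linear', C=1.0).fit(X_toy, y_toy)

# Slack of each training point: xi_i = max(0, 1 - y_i * (w^T x_i + b))
xi = np.maximum(0, 1 - y_toy * clf.decision_function(X_toy))
print('points with nonzero slack:', int(np.sum(xi > 1e-8)))
print('number of support vectors:', len(clf.support_))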

Nonlinear SVM#

Finally, we consider the case in which the decision boundary is strongly nonlinear.

Figure

The basic idea here is that we take the data and transform it into another, higher-dimensional space.

Here is the same dataset on transformed coordinates:

\[ x_1 \rightarrow x_1 \]
\[ x_2 \rightarrow (x_1 + x_2)^4 \]
Figure

(We are not showing more dimensions here, but the principle is the same.)

In the higher dimensional space, the points may be (approximately) separable.

To achieve this using the framework of the SVM, we use a kernel.

A kernel is a similarity function \(K(\mathbf{x}, \mathbf{y})\).

There are many ways to define kernels.

The most popular kernels are:

  • Linear (aka Inner Product): \(K(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{y}\)

  • Polynomial: \(K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y})^d\)

  • Gaussian: \(K(\mathbf{x}, \mathbf{y}) = \text{exp}(-\gamma\Vert\mathbf{x}-\mathbf{y}\Vert^2)\)

The Gaussian kernel is also called a “Radial Basis Function”.

There is an efficient way to train an SVM using a kernel as the similarity function.

The basic idea is that the standard linear SVM is trained using Euclidean distance.

However, the squared Euclidean distance between two vectors is:

\[ \Vert\mathbf{x} - \mathbf{y}\Vert^2 = (\mathbf{x} - \mathbf{y})^T(\mathbf{x} - \mathbf{y}) = \mathbf{x}^T\mathbf{x} + \mathbf{y}^T\mathbf{y} - 2\mathbf{x}^T\mathbf{y}\]

So the Euclidean distance can be defined entirely in terms of the inner product kernel!

To train an SVM with a different kernel, we just replace all the inner products with calls to our new kernel function.
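For example, with scikit-learn (used below) we can check that handing the SVM a precomputed Gaussian kernel matrix gives the same classifier as the built-in rbf kernel. A sketch on hypothetical data:

import sklearn.svm as svm
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(40, 2))
y_toy = (X_toy[:, 0]**2 + X_toy[:, 1]**2 > 1).astype(int)   # a circular (nonlinear) boundary
gamma = 0.5

clf_rbf = svm.SVC(kernel='rbf', gamma=gamma).fit(X_toy, y_toy)   # built-in Gaussian kernel

K = rbf_kernel(X_toy, X_toy, gamma=gamma)                        # Gram matrix K(x_i, x_j)
clf_pre = svm.SVC(kernel='precomputed').fit(K, y_toy)

X_new = rng.normal(size=(10, 2))
K_new = rbf_kernel(X_new, X_toy, gamma=gamma)                    # K(new point, training point)
print(np.array_equal(clf_rbf.predict(X_new), clf_pre.predict(K_new)))   # expect True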

The result is that we can obtain highly curved decision boundaries (we’ll demonstrate next).

In practice, RBF works well in many cases.

SVM: Summary#

  • In practice, SVMs have shown good results on many problems.

  • In particular, SVMs are effective at dealing with high-dimensional data and cope relatively well with the curse of dimensionality.

  • Since all data is represented as vectors, and we are relying on distance functions like Euclidean distance, it is important to pay attention to feature scaling when using SVMs.

SVM and Naive Bayes in Python#

We work with a dataset describing Italian wine samples (https://archive.ics.uci.edu/ml/datasets/Wine).

We will take the alcohol content of the wine and its Malic acid content, and use them to predict the grape type (cultivar).

import sklearn.utils as utils
import sklearn.svm as svm
import sklearn.model_selection as model_selection
import sklearn.datasets as datasets
wine = pd.read_table("data/wine.data", sep=',')

wine.columns = [
            'region',
            'Alcohol',
            'Malic acid',
            'Ash',
            'Alcalinity of ash',
            'Magnesium',
            'Total phenols',
            'Flavanoids',
            'Nonflavanoid phenols',
            'Proanthocyanins',
            'Color intensity',
            'Hue',
            'OD280/OD315 of diluted wines',
            'Proline']

X = wine[['Alcohol', 'Malic acid']].values
y = wine['region'].values

We’ll first fit a linear SVM to the data:

np.random.seed(0)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, 
                                                                    y, 
                                                                    test_size = 0.5)
svc = svm.SVC(kernel = 'linear')
svc.fit(X_train, y_train)
y_pred_test = svc.predict(X_test)
print(f'Accuracy of SVM on test set: {svc.score(X_test, y_test):0.3f}')
Accuracy of SVM on test set: 0.764

Let’s visualize the decision boundaries:

from matplotlib.colors import ListedColormap
# Create color maps for 3-class classification problem, as with iris
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

def plot_estimator(estimator, X, y):
    
    try:
        X, y = X.values, y.values
    except AttributeError:
        pass
    
    estimator.fit(X, y)
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    plt.axis('tight')
    # plt.axis('off')
    plt.tight_layout()
plot_estimator(svc, X, y)
plt.xlabel('Alcohol')
plt.ylabel('Malic Acid');

Note that in practice we should pay attention to feature scaling when using SVMs. We haven’t done that here.
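For reference, a scaled version could look like the sketch below, which standardizes each feature inside a scikit-learn Pipeline before fitting. It reuses the train/test split from above; it is not what was run here:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance, then fit the linear SVM
scaled_svc = make_pipeline(StandardScaler(), svm.SVC(kernel='linear'))
scaled_svc.fit(X_train, y_train)
print(f'Accuracy of scaled SVM on test set: {scaled_svc.score(X_test, y_test):0.3f}')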

As described already, the SVM gets its name from the samples in the dataset from each class that lie closest to the other class.

These training samples are called “support vectors” because changing their position in the \(d\)-dimensional feature space would change the location of the decision boundary.

In scikit-learn, the support vectors themselves are stored in the support_vectors_ attribute of the fitted SVC object (their indices within the training set are in support_).

Here, we will use just two of the three classes for clarity.

The support vectors are circled. Can you visualize the two separator hyperplanes?

# Extract classes 1 and 2
X, y = X[np.in1d(y, [1, 2])], y[np.in1d(y, [1, 2])]

plot_estimator(svc, X, y)
plt.xlabel('Alcohol')
plt.ylabel('Malic Acid')
plt.scatter(svc.support_vectors_[:, 0], 
           svc.support_vectors_[:, 1], 
           s=120, 
           facecolors='none', 
           edgecolors = 'k',
           linewidths=2,
           zorder=10);
plt.title(f'2-class accuracy on entire dataset: {svc.score(X, y):0.3f}');

Regularization#

Since the classes are not linearly separable, there are nonzero slack variables, each of which is associated with a support vector.

Therefore we should consider how regularization is tuned via the \(C\) parameter.

In practice, a large \(C\) value means that the number of support vectors is small (less regularization, more model complexity), while a small \(C\) implies many support vectors (more regularization, less model complexity).

scikit-learn sets a default value of \(C=1\).

svc = svm.SVC(kernel='linear', C=1e4)
plot_estimator(svc, X, y)
plt.scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1], s=80, 
            facecolors='none', edgecolors = 'k', linewidths=2, zorder=10)
plt.title(f'C = 10000: small number of support vectors (acc: {svc.score(X, y):0.3f})');

svc = svm.SVC(kernel='linear', C=1e-2)
plot_estimator(svc, X, y)
plt.scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1], s=80, 
            facecolors='none', edgecolors = 'k', linewidths=2, zorder=10)
plt.title(f'C = 0.01: high number of support vectors (acc: {svc.score(X, y):0.3f})');

Kernels#

We can also choose from a suite of available kernels:

  • linear,

  • poly,

  • rbf,

  • sigmoid, or

  • precomputed.

Or, a custom kernel can be passed as a function.

Note that the radial basis function (rbf) kernel is just a Gaussian kernel, but with parameter \(\gamma = \frac{1}{\sigma^2}\).

Linear Kernel

svc_lin = svm.SVC(kernel = 'linear')
plot_estimator(svc_lin, X, y)
plt.scatter(svc_lin.support_vectors_[:, 0], svc_lin.support_vectors_[:, 1], 
            s=80, facecolors='none', edgecolors = 'k', linewidths=2, zorder=10)
plt.title(f'Linear kernel (accuracy on test set: {svc_lin.score(X_test, y_test):0.3f})');

Polynomial Kernel

svc_poly = svm.SVC(kernel='poly', degree=4)
plot_estimator(svc_poly, X, y)
plt.scatter(svc_poly.support_vectors_[:, 0], svc_poly.support_vectors_[:, 1], 
           s=80, facecolors='none', edgecolors = 'k', linewidths=2, zorder=10)
plt.title(f'Polynomial kernel (accuracy on test set: {svc_poly.score(X_test, y_test):0.3f})');

RBF kernel

svc_rbf = svm.SVC(kernel='rbf', gamma=100, C = 1e2)
plot_estimator(svc_rbf, X, y)
plt.scatter(svc_rbf.support_vectors_[:, 0], svc_rbf.support_vectors_[:, 1], 
           s=80, facecolors='none', edgecolors = 'k', linewidths=2, zorder=10)
plt.title(f'RBF kernel (accuracy on test set: {svc_rbf.score(X_test, y_test):0.3f})');

Cross-Validation#

Let’s evaluate our choice of hyperparameter \(C\).

We have seen how to tune hyperparameters already using model_selection.train_test_split().

Now we’ll use a utility, model_selection.cross_val_score(), which will do \(k\)-fold cross-validation for us, for a single hyperparameter setting, automatically:

f = svm.SVC(kernel = 'linear', C = 1)
scores = model_selection.cross_val_score(f, 
                                         wine[['Alcohol', 'Malic acid']], 
                                         wine['region'], 
                                         cv = 5)

print(f'Scores: {scores}')
print(f'Accuracy: {scores.mean():0.2f} (+/- {scores.std()/np.sqrt(5):0.2f})')
Scores: [0.69444444 0.80555556 0.82857143 0.74285714 0.68571429]
Accuracy: 0.75 (+/- 0.03)

Let’s use this to do a grid search to tune \(C\):

means = []
stds = []
folds = 5
C_vals = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
for C_val in C_vals:
    f = svm.SVC(kernel='linear', C = C_val)
    scores = model_selection.cross_val_score(f, wine[['Alcohol', 'Malic acid']], wine['region'], cv = folds)
    means.append(np.mean(scores))
    stds.append(np.std(scores) / np.sqrt(folds))
acc = np.array(means)
stderr = np.array(stds)
C_s = np.array(C_vals)
plt.errorbar(np.log10(C_vals), acc, stderr)
plt.xlabel('log10(C)')
plt.ylabel('Accuracy')
plt.title(r'Generalization Error of Linear SVM as Function of $C$');
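For reference, the same hyperparameter search can be written more compactly with scikit-learn’s GridSearchCV (a sketch; it was not run above):

grid = model_selection.GridSearchCV(svm.SVC(kernel='linear'),
                                    param_grid={'C': C_vals},
                                    cv=folds)
grid.fit(wine[['Alcohol', 'Malic acid']], wine['region'])
print(grid.best_params_, f'{grid.best_score_:0.3f}')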

SVM and NB: the Iris Data#

To compare SVM and Naive Bayes, we’ll look at the Iris dataset again, using just two features so that we can visualize the decision boundaries.

We will not hold out data since we’re just interested in the shapes of the decision boundaries.

iris = datasets.load_iris()
X = iris.data[:, :2]                     
y = iris.target
C = 1.0  

svc = svm.SVC(kernel = 'linear', C = C).fit(X, y)
rbf_svc = svm.SVC(kernel = 'rbf', gamma = 0.7, C = C).fit(X, y)
poly_svc = svm.SVC(kernel = 'poly', degree = 3, C = C).fit(X, y)

To use Naive Bayes “out of the box”, one has to treat all the features as either:

  • Gaussian

  • Multinomial (counts or categories)

  • Binary (Bernoulli)

scikit-learn provides a Naive Bayes classifier for each of these cases.

We’ll use the Gaussian.

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit(X, y)
# create a mesh to plot in
h = .02  # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['SVM with linear kernel',
          'Naive Bayes',
          'SVM with RBF kernel', 'SVM with poly kernel']

fig = plt.figure(figsize=(12,12))

for i, clf in enumerate((svc, gnb, rbf_svc, poly_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max] x [y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i], size = 20)

plt.tight_layout()