Here we will explore basic classification by implementing a Naive Bayes classifier from scratch.

Data

Summary

We will use the Pima Indians Diabetes data set from the UC Irvine Machine Learning Repository. A copy can be obtained from Kaggle.

This data set is often used as a simple starting point for classification. As described on the Kaggle page, it:

is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Some issues to keep in mind:

  • There are feature attributes of the patients and a label attribute indicating whether or not the patient is diabetic.
  • Values of 0 in some feature columns may indicate missing data (we take a quick look at this after loading the data below).

Load

Let’s load the data into a pandas dataframe and look at some rows.

import pandas as pd
df = pd.read_csv('diabetes.csv')
df.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
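
As a quick check on the missing-data issue noted above, we can count zero values per column. This is a minimal sketch using only pandas; whether a zero really means "missing" depends on the column (zero pregnancies is plausible, zero blood pressure is not).

# Count zeros in each column. Zeros in Glucose, BloodPressure, SkinThickness,
# Insulin and BMI are physiologically implausible and likely mark missing values;
# zeros in Outcome are simply the negative class label.
print((df == 0).sum())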

Split

Here we will split the data into training and testing segments. We will have:

  • X_train, y_train: training set features (X) and labels (y)
  • X_test, y_test: test set features (X) and labels (y)

from sklearn.model_selection import train_test_split
SEED = 20212021
y = df['Outcome']
X = df.drop(['Outcome'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
    random_state=SEED)

Here’s what the various data matrices look like now:

print(f'X_train.shape: { X_train.shape }')
print(f'y_train.shape: { y_train.shape }')
print(f'X_test.shape: { X_test.shape }')
print(f'y_test.shape: { y_test.shape }')
X_train.shape: (614, 8)
y_train.shape: (614,)
X_test.shape: (154, 8)
y_test.shape: (154,)

Naive Bayes Classifier

Our reference book will be the excellent “Applied Machine Learning” by David A. Forsyth (2019). As he writes:

One straightforward source of a classifier is a probability model. For the moment, assume we know $p(y|\mathbf{x})$ for our data. Assume also that all errors in classification are equally important. Then the following rule produces the smallest possible expected classification error rate:

For a test example $\mathbf{x}$, report the class $y$ that has the highest value of $p(y|\mathbf{x})$. If the largest value is achieved by more than one class, choose randomly from that set of classes.

Given Bayes’ rule:

$$p(y|\mathbf{x}) = \frac{p(\mathbf{x}|y)\,p(y)}{p(\mathbf{x})}$$

and assuming that the features $\mathbf{x}$ are independent of each other given the label $y$, to classify a data point we can:

Choose $y$ such that $\bigg[\sum_{j} \log p(x^{(j)}|y) + \log p(y)\bigg]$ is the largest

which handily drops the $p(\mathbf{x})$ term, since it is the same for every class.
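
To make that step explicit: because $\log$ is monotonic and $p(\mathbf{x})$ does not depend on $y$,

$$\arg\max_y \, p(y|\mathbf{x}) = \arg\max_y \, \frac{p(\mathbf{x}|y)\,p(y)}{p(\mathbf{x})} = \arg\max_y \, p(\mathbf{x}|y)\,p(y) = \arg\max_y \bigg[\sum_{j} \log p(x^{(j)}|y) + \log p(y)\bigg]$$

where the last equality also uses the naive independence assumption $p(\mathbf{x}|y) = \prod_{j} p(x^{(j)}|y)$.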

We will implement functions to calculate the $p(x^{(j)}|y)$ and $p(y)$ distributions.

$\log p(x^{(j)}|y)$

We must choose a distribution for modeling $\log p(x^{(j)}|y)$. A Gaussian can be appropriate even when a feature does not look exactly normally distributed, as long as the fitted per-class distributions still separate the classes well.
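
One informal way to eyeball this is to plot a feature's distribution separately for the two outcome classes; well-separated per-class histograms suggest that per-class Gaussians will do a reasonable job. A minimal sketch, assuming matplotlib is available and using Glucose as an example:

import matplotlib.pyplot as plt

# Overlay per-class histograms of a single feature to see how well
# two Gaussians (one per class) might separate the outcomes.
for outcome, group in df.groupby('Outcome'):
    plt.hist(group['Glucose'], bins=30, alpha=0.5, label=f'Outcome={outcome}')
plt.xlabel('Glucose')
plt.legend()
plt.show()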

For our use case, we can implement the Gaussian probability density function:

$$\frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$

import numpy as np
def gaussian_means(X, labels):
    # Per-feature means for each class; result has shape (n_features, n_classes).
    uniq = sorted(set(labels))
    ret = [list(X[labels == u].mean(axis=0)) for u in uniq]
    ret = np.array(ret).T
    assert ret.shape == (X.shape[1], 2)
    return ret

def gaussian_stds(X, labels):
    # Per-feature standard deviations for each class; same shape as the means.
    uniq = sorted(set(labels))
    ret = [list(X[labels == u].std(axis=0)) for u in uniq]
    ret = np.array(ret).T
    assert ret.shape == (X.shape[1], 2)
    return ret

def binary_log_pxy(X, means, stds):
    # Log likelihood log p(x|y): sum of per-feature Gaussian log densities,
    # computed separately under each class's means and stds.
    def calc(label):
        term1 = np.log(1 / (stds[:, label] * np.sqrt(2 * np.pi)))
        term2 = -0.5 * ((X - means[:, label])**2 / stds[:, label]**2)
        return np.sum(term1 + term2, axis=1)
    zeros = calc(0)
    ones = calc(1)
    ret = np.array(list(zip(zeros, ones)))
    assert ret.shape == (X.shape[0], 2)
    return ret
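
As an informal sanity check (assuming scipy is installed, which the classifier itself does not need), the class-0 column of binary_log_pxy should match scipy's Gaussian log density summed over features:

from scipy.stats import norm

means = gaussian_means(X_train, y_train)
stds = gaussian_stds(X_train, y_train)

# Our summed per-feature Gaussian log densities for class 0 ...
ours = binary_log_pxy(X_train, means, stds)[:, 0]
# ... should agree with scipy's norm.logpdf summed over features.
theirs = norm.logpdf(X_train, loc=means[:, 0], scale=stds[:, 0]).sum(axis=1)
assert np.allclose(ours, theirs)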

$\log p(y)$

Given that $y$ is binary in the Pima Indians data set, we can fit a Bernoulli distribution.

To find $p(y)$ we count the data points matching each class and divide by the total number of data points. We then take the log to return $\log p(y)$.

import math

def binary_log_py(labels):
    # Class priors log p(y): fraction of data points in each class, then log.
    ones = np.count_nonzero(labels)
    zer = math.log((labels.size - ones) / labels.size)
    one = math.log(ones / labels.size)
    ret = np.array([[zer], [one]])
    assert ret.shape == (2, 1)
    return ret
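
A quick usage check: the exponentiated log priors are just the two class frequencies, so they should sum to one.

log_py = binary_log_py(y_train)
print(np.exp(log_py))                        # empirical frequency of each class
assert np.isclose(np.exp(log_py).sum(), 1.0)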

$\log p(y) + \log p(x^{(j)}|y)$

Now that we have $\log p(y)$ and $\log p(x^{(j)}|y)$, we simply add them together. The resulting matrix has one row for each data point, with one column per class holding that class's (unnormalized) log probability.

def probs(X, means, stds, log_py):
    # Broadcast the (2, 1) priors across rows and add the per-class log likelihoods.
    ret = log_py.T + binary_log_pxy(X, means, stds)
    assert ret.shape == (X.shape[0], 2)
    return ret

Predict

Let’s wrap our code in a class, instantiate it (it is trained upon instantiation), and predict using our test features.

class NBClassifier():
    def __init__(self, X_train=None, y_train=None):
        # "Training" is just estimating per-class means, stds, and log priors.
        self.means = gaussian_means(X_train, y_train)
        self.stds = gaussian_stds(X_train, y_train)
        self.log_py = binary_log_py(y_train)

    def predict(self, features):
        # Pick the class with the highest log probability for each row.
        preds = probs(features, self.means, self.stds, self.log_py)
        return preds.argmax(axis=1)

def run_model(clf=None, X_train=None, y_train=None, X_test=None, y_test=None):
    train_pred = clf.predict(X_train)
    train_acc = (train_pred == y_train).mean()
    print(f'Train accuracy: {train_acc}')

    test_pred = clf.predict(X_test)
    test_acc = (test_pred == y_test).mean()
    print(f'Test accuracy: {test_acc}')

run_model(NBClassifier(X_train, y_train), X_train, y_train, X_test, y_test)
Train accuracy: 0.755700325732899
Test accuracy: 0.7857142857142857

Compare to scikit-learn Naive Bayes

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB().fit(X_train, y_train)
run_model(clf, X_train, y_train, X_test, y_test)
Train accuracy: 0.755700325732899
Test accuracy: 0.7857142857142857

Our from-scratch implementation matches scikit-learn's GaussianNB on both training and test accuracy.
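
Beyond the accuracies, we can also compare the fitted parameters directly. The sketch below assumes a scikit-learn version where GaussianNB exposes theta_ (per-class means) and var_ (per-class variances; older releases call it sigma_). The means should agree essentially exactly; the variances only approximately, since pandas .std() uses ddof=1 while sklearn uses the population variance plus a small var_smoothing term.

nb = NBClassifier(X_train, y_train)

# Per-class feature means: ours is (n_features, n_classes), sklearn's theta_
# is (n_classes, n_features), so compare against the transpose.
print(np.allclose(nb.means, clf.theta_.T))

# Variances differ slightly (ddof and smoothing), so use a loose tolerance.
print(np.allclose(nb.stds**2, clf.var_.T, rtol=1e-2))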