Basic Classification with Naive Bayes
Here we will explore basic classification by implementing a Naive Bayes classifier from scratch.
Data
Summary
We will use the Pima Indians data set from UC Irvine. A copy can be obtained from Kaggle.
This data set is often used as a simple starting point for classification. As described on the Kaggle page, it:
is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
Some issues to keep in mind:
- There are feature attributes of the patients and a label attribute indicating whether or not the patient is diabetic.
- Values of 0 may indicate missing data.
Load
Let’s load the data into a pandas dataframe and look at some rows.
import pandas as pd
df = pd.read_csv('diabetes.csv')
df.head()
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
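As noted above, zeros in several feature columns likely stand in for missing measurements. A quick way to gauge the extent of this (a small sketch using the dataframe loaded above):

# Count zero values per column. Zeros in columns like Glucose, BloodPressure,
# SkinThickness, Insulin, and BMI are physiologically implausible and likely mean "missing";
# zeros in Pregnancies and Outcome are legitimate values.
print((df == 0).sum())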
Split
Here we will split the data into training and testing segments. We will have:
- `X_train`, `y_train`: training set features (X) and labels (y)
- `X_test`, `y_test`: test set features (X) and labels (y)
from sklearn.model_selection import train_test_split
SEED = 20212021
y = df['Outcome']
X = df.drop(['Outcome'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=SEED)
Here’s what the various data matrices look like now:
print(f'X_train.shape: { X_train.shape }')
print(f'y_train.shape: { y_train.shape }')
print(f'X_test.shape: { X_test.shape }')
print(f'y_test.shape: { y_test.shape }')
X_train.shape: (614, 8)
y_train.shape: (614,)
X_test.shape: (154, 8)
y_test.shape: (154,)
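While we are here, it is also worth glancing at the class balance of the training labels, since these class proportions become the prior over labels later on. A quick look (a sketch):

# Proportion of each Outcome class in the training labels
print(y_train.value_counts(normalize=True))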
Naive Bayes Classifier
Our reference book will be the excellent “Applied Machine Learning” by David A. Forsyth (2019). As he writes:
One straightforward source of a classifier is a probability model. For the moment, assume we know $p(y \mid x)$ for our data. Assume also that all errors in classification are equally important. Then the following rule produces the smallest possible expected classification error rate:
For a test example $x$, report the class $y$ that has the highest value of $p(y \mid x)$. If the largest value is achieved by more than one class, choose randomly from that set of classes.
Given Bayes’ rule:

$$p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)}$$

and assuming that the features are independent of each other given the label $y$, to classify a data point $x$ we can:

Choose $y$ such that $p(x \mid y)\,p(y)$ is the largest,

which handily drops the $p(x)$ term.
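Spelling out the naive independence assumption for a data point $x$ with features $x_1, \dots, x_d$ gives the quantity we will actually compute, in log space:

$$p(x \mid y)\,p(y) = p(y)\prod_{j=1}^{d} p(x_j \mid y)
\quad\Longrightarrow\quad
\log p(x \mid y) + \log p(y) = \sum_{j=1}^{d} \log p(x_j \mid y) + \log p(y)$$

Working with logs turns a product of many small densities into a sum, which avoids numerical underflow.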
We will implement functions to calculate the $p(x \mid y)$ and $p(y)$ distributions.
We must choose a distribution for modeling $p(x \mid y)$. The Gaussian may be appropriate even in cases where the dataset does not appear to be normally distributed, as long as the per-class distributions separate the classes well.
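Concretely, each per-class, per-feature density $p(x_j \mid y)$ is modeled as a univariate Gaussian with class-conditional mean $\mu_{j,y}$ and standard deviation $\sigma_{j,y}$ (our notation):

$$p(x_j \mid y) = \frac{1}{\sigma_{j,y}\sqrt{2\pi}}\exp\!\left(-\frac{(x_j-\mu_{j,y})^2}{2\sigma_{j,y}^2}\right),
\qquad
\log p(x_j \mid y) = \log\frac{1}{\sigma_{j,y}\sqrt{2\pi}} - \frac{(x_j-\mu_{j,y})^2}{2\sigma_{j,y}^2}$$

The two terms on the right-hand side correspond to term1 and term2 in the implementation below.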
For our use case, we can implement the Gaussian probability density function:
import numpy as np

def gaussian_means(X, labels):
    # Per-class mean of each feature; returns an array of shape (n_features, n_classes)
    uniq = sorted(set(labels))
    ret = [list(X[labels == u].mean(axis=0)) for u in uniq]
    ret = np.array(ret).T
    assert ret.shape == (X.shape[1], 2)
    return ret

def gaussian_stds(X, labels):
    # Per-class standard deviation of each feature; shape (n_features, n_classes)
    uniq = sorted(set(labels))
    ret = [list(X[labels == u].std(axis=0)) for u in uniq]
    ret = np.array(ret).T
    assert ret.shape == (X.shape[1], 2)
    return ret

def binary_log_pxy(X, means, stds):
    # log p(x|y) for every data point under each of the two classes,
    # assuming a per-feature Gaussian for each class
    def calc(label):
        term1 = np.log(1 / (stds[:, label] * np.sqrt(2 * np.pi)))
        term2 = -0.5 * ((X - means[:, label])**2 / stds[:, label]**2)
        # Sum the per-feature log densities (the naive independence assumption)
        return np.sum(term1 + term2, axis=1)
    zeros = calc(0)
    ones = calc(1)
    ret = np.array(list(zip(zeros, ones)))
    assert ret.shape == (X.shape[0], 2)
    return ret
Given that $y$ is binary in the Pima Indians data set, we can fit a Bernoulli distribution for $p(y)$.
To find $p(y)$ we count the data points matching each class and divide by the total number of data points. We then take the log to return $\log p(y)$.
import math

def binary_log_py(labels):
    # Empirical class priors p(y): the count of each class divided by the total,
    # returned as log probabilities with shape (2, 1)
    ones = np.count_nonzero(labels)
    zer = math.log((labels.size - ones) / labels.size)
    one = math.log(ones / labels.size)
    ret = np.array([[zer], [one]])
    assert ret.shape == (2, 1)
    return ret
Now that we have $\log p(x \mid y)$ and $\log p(y)$, we simply add them together. The resulting matrix has one row per data point, with the (unnormalized) log probability of each class in the corresponding column.
def probs(X, means, stds, log_py):
    # Unnormalized log posterior log p(x|y) + log p(y): one row per data point,
    # one column per class
    ret = log_py.T + binary_log_pxy(X, means, stds)
    assert ret.shape == (X.shape[0], 2)
    return ret
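Before wrapping everything in a class, the helpers can be chained by hand as a quick sanity check (a sketch using the split from above; the variable names here are just throwaways):

means = gaussian_means(X_train, y_train)            # per-class feature means, shape (8, 2)
stds = gaussian_stds(X_train, y_train)              # per-class feature standard deviations, shape (8, 2)
log_py = binary_log_py(y_train)                     # log class priors, shape (2, 1)
log_post = probs(X_test[:5], means, stds, log_py)   # unnormalized log posteriors for the first 5 test rows
print(log_post.argmax(axis=1))                      # predicted class (0 or 1) for each of those rows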
Predict
Let’s wrap our code in a class, instantiate it (it is trained upon instantiation), and predict using our test features.
class NBClassifier():
    def __init__(self, X_train=None, y_train=None):
        # "Training" is just estimating the per-class Gaussians and the log priors
        self.means = gaussian_means(X_train, y_train)
        self.stds = gaussian_stds(X_train, y_train)
        self.log_py = binary_log_py(y_train)

    def predict(self, features):
        # Pick the class with the largest unnormalized log posterior for each row
        preds = probs(features, self.means, self.stds, self.log_py)
        return preds.argmax(axis=1)

def run_model(clf=None, X_train=None, y_train=None, X_test=None, y_test=None):
    # Report accuracy on both the training and the test split
    train_pred = clf.predict(X_train)
    train_acc = (train_pred == y_train).mean()
    print(f'Train accuracy: {train_acc}')
    test_pred = clf.predict(X_test)
    test_acc = (test_pred == y_test).mean()
    print(f'Test accuracy: {test_acc}')

run_model(NBClassifier(X_train, y_train), X_train, y_train, X_test, y_test)
Train accuracy: 0.755700325732899
Test accuracy: 0.7857142857142857
Compare to scikit-learn
Naive Bayes
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB().fit(X_train, y_train)
run_model(clf, X_train, y_train, X_test, y_test)
Train accuracy: 0.755700325732899
Test accuracy: 0.7857142857142857
Our implementation matches the output of scikit-learn.
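As a final sanity check (a small sketch; `our_clf` is just a fresh instance of the class above), we can compare the two models row by row, since matching aggregate accuracy does not by itself guarantee identical per-row predictions:

# Check per-row agreement between our classifier and scikit-learn's GaussianNB
our_clf = NBClassifier(X_train, y_train)
print((our_clf.predict(X_test) == clf.predict(X_test)).all())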