The following small project is an implementation we had to do as part of the Machine Learning class at Duke. The objective was to dig deeper into the mechanics of kNN so the algorithm can be customized when the standard implementation offered by scikit-learn does not meet the requirements at hand. As the performance comparison below shows, the implementation needs optimization to reduce prediction time, which will be addressed in another post.
Machine Learning Basics
The objective is to create a classification algorithm whilst addressing the following:
(a) Build a working version of a binary kNN classifier using the skeleton code below.
(b) Load the datasets to be evaluated here. Each includes training features ($\mathbf{X}$) and training labels ($\mathbf{y}$), plus the corresponding test features and labels, for both a low-dimensional ($p = 2$ features/predictors) and a high-dimensional ($p = 100$ features/predictors) dataset. Each of the training and test sets contains $n = 100$ observations.
(c) Train the classifier with $k = 5$, first on the low-dimensional dataset and then on the high-dimensional dataset, and evaluate the classification performance on the corresponding test data for each. Measure the time it takes to make the predictions in each case and the overall accuracy of each set of test data predictions.
(d) Compare my implementation’s accuracy and computation time to the scikit-learn KNeighborsClassifier class. How do the results and speed compare?
(e) Some supervised learning algorithms are more computationally intensive during training than testing. What are the drawbacks of the prediction process being slow?
ANSWER:
(a)
import pandas as pd
import numpy as np

class Knn:
    # k-Nearest Neighbor class object for classification training and testing
    def __init__(self):
        # This is called when the class is initialised
        self.x = []
        self.y = []

    def fit(self, x, y):
        # Save the training data to properties of this class
        self.x = np.array(x)
        self.y = np.array(y).ravel()

    @staticmethod
    def euclidean(points, ref_point):
        # Euclidean distance from ref_point to every row of points
        return np.sqrt(np.sum(np.square(points - ref_point), axis=1))

    def make_knn_prediction(self, distance, k=1):
        # Majority vote over the k nearest training labels;
        # rank(method='first') breaks distance ties deterministically
        nearest = distance.loc[distance.distance.rank(method='first') <= k].index
        return pd.Series(self.y[nearest]).mode().iloc[0]

    def predict(self, x, k):
        y_hat = []  # Variable to store the estimated class labels
        x = np.array(x)
        # Calculate the distance from each vector in x to the training data
        distances = pd.DataFrame(self.x)
        for x_ind in range(len(x)):
            distances.loc[:, 'distance'] = self.euclidean(self.x, x[x_ind])
            y_hat.append(self.make_knn_prediction(distances, k))
        # Return the estimated targets
        return np.array(y_hat)

# Metric of overall classification accuracy
# (a more general function, sklearn.metrics.accuracy_score, is also available)
def accuracy(y, y_hat):
    y, y_hat = np.asarray(y).ravel(), np.asarray(y_hat).ravel()
    return np.mean(y == y_hat)
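Before running the classifier on the assignment data, a quick smoke test helps confirm the class behaves sensibly. The toy points below are my own invention (not part of the assignment): two well-separated clusters, so a correct kNN should label the two probe points [0, 1].

# Smoke test on hand-made, well-separated toy data (not assignment data)
toy_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_model = Knn()
toy_model.fit(toy_x, toy_y)
print(toy_model.predict([[0.5, 0.5], [5.5, 5.5]], k=3))  # expect [0 1]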
(b)
import pandas as pd
# Read in low-dimensionality data
x_train_low = pd.read_csv('./Data/A2_X_train_low.csv',header=None)
y_train_low = pd.read_csv('./Data/A2_y_train_low.csv',header=None)
x_test_low = pd.read_csv('./Data/A2_X_test_low.csv',header=None)
y_test_low = pd.read_csv('./Data/A2_y_test_low.csv',header=None)
# Read in high-dimensionality data
x_train_high = pd.read_csv('./Data/A2_X_train_high.csv',header=None)
y_train_high = pd.read_csv('./Data/A2_y_train_high.csv',header=None)
x_test_high = pd.read_csv('./Data/A2_X_test_high.csv',header=None)
y_test_high = pd.read_csv('./Data/A2_y_test_high.csv',header=None)
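As an optional sanity check (my own addition, not required by the assignment), the shapes should match the description in (b): $n = 100$ observations with $p = 2$ and $p = 100$ features respectively.

# Confirm n = 100 and p = 2 / p = 100
for name, df in [('x_train_low', x_train_low), ('x_test_low', x_test_low),
                 ('x_train_high', x_train_high), ('x_test_high', x_test_high)]:
    print(name, df.shape)  # expect (100, 2) for low, (100, 100) for high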
(c)
# Evaluate the performance of your kNN classifier on a low- and a high-dimensional dataset
# and time the predictions of each
import time
import warnings
warnings.simplefilter('ignore')
# Initialize Knn class object and train on low-dimensionality data
knn_train_low = Knn()
knn_train_low.fit(x_train_low,y_train_low)
# Initialize Knn class object and train on high-dimensionality data
knn_train_high = Knn()
knn_train_high.fit(x_train_high,y_train_high)
# Predict and time my KNN on low-dimensionality data
t0 = time.time()
y_hat_low=knn_train_low.predict(x_test_low,5)
t1 = time.time()
print('Time it takes my own KNN classifier to predict on low-dimensional data is: {:.2f} seconds at '\
      'an accuracy of {:.2%}'.format(t1-t0, accuracy(np.array(y_test_low),y_hat_low)))
# Predict and time my KNN on high-dimensionality data
t0 = time.time()
y_hat_high=knn_train_high.predict(x_test_high,5)
t1 = time.time()
print('Time it takes my own KNN classifier to predict on high-dimensional data is: {:.2f} seconds at '\
      'an accuracy of {:.2%}'.format(t1-t0, accuracy(np.array(y_test_high),y_hat_high)))
(d)
from sklearn.neighbors import KNeighborsClassifier
import time
# Initialize sklearn KNeighborsClassifier and train on low-dimensionality data
sklearn_low = KNeighborsClassifier(n_neighbors=5)
sklearn_low.fit(x_train_low,np.array(y_train_low).ravel())
# Initialize sklearn KNeighborsClassifier and train on high-dimensionality data
sklearn_high = KNeighborsClassifier(n_neighbors=5)
sklearn_high.fit(x_train_high,np.array(y_train_high).ravel())
# Predict and time scikit learn KNN on low-dimensionality data
t0 = time.time()
# Use a new variable name so we don't shadow the accuracy() helper from (a)
acc_low = sklearn_low.score(x_test_low,np.array(y_test_low).ravel())
t1 = time.time()
print('Time it takes sklearn KNN classifier to predict on low-dimensional data is: {:.2f} seconds at '\
      'an accuracy of {:.2%}'.format(t1-t0, acc_low))
# Predict and time scikit learn KNN on high-dimensionality data
t0 = time.time()
acc_high = sklearn_high.score(x_test_high,np.array(y_test_high).ravel())
t1 = time.time()
print('Time it takes sklearn KNN classifier to predict on high-dimensional data is: {:.2f} seconds at '\
      'an accuracy of {:.2%}'.format(t1-t0, acc_high))
The accuracy is the same for my implementation and the one from scikit-learn, but my implementation's prediction time is far longer. This is expected: scikit-learn computes distances in vectorised, compiled code and can use space-partitioning structures such as KD-trees or ball trees, whereas my predict() loops over the test points in pure Python and rebuilds a pandas DataFrame column on every iteration.
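As a preview of the optimization post mentioned at the top, here is a minimal sketch of how the per-test-point loop could be replaced with a single broadcast distance computation. The function name predict_vectorised and the use of np.argpartition are my own choices, not part of the assignment code, and the broadcast assumes the full distance matrix fits in memory (trivial for n = 100).

import numpy as np

def predict_vectorised(x_train, y_train, x_test, k):
    # Hypothetical vectorised alternative to Knn.predict (my own sketch)
    x_train = np.asarray(x_train, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    y_train = np.asarray(y_train).ravel()
    # All pairwise squared Euclidean distances in one broadcast, shape
    # (n_test, n_train); memory-hungry for large n but loop-free
    d2 = np.sum((x_test[:, None, :] - x_train[None, :, :]) ** 2, axis=2)
    # argpartition finds the k smallest distances per row in linear time
    nearest = np.argpartition(d2, kth=k - 1, axis=1)[:, :k]
    votes = y_train[nearest]  # neighbour labels, shape (n_test, k)
    # Majority vote per test point, valid for any pair of binary labels
    classes, idx = np.unique(votes, return_inverse=True)
    counts = np.apply_along_axis(np.bincount, 1, idx.reshape(votes.shape),
                                 None, len(classes))
    return classes[counts.argmax(axis=1)]

# e.g. predict_vectorised(x_train_low, y_train_low, x_test_low, k=5)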
(e)
If the prediction process is slow, this introduces challenges whenever the model is to be used in real time: the delay in results might, for example, cost the company potential clients, or have grave consequences if the prediction is safety-related. In addition, slow predictions tie up computational resources that other processes could use, which is inefficient. Stakeholders or clients may ultimately discontinue such models if they believe the prediction time inhibits their process.
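kNN is the textbook example of this trade-off, and the asymmetry is easy to make visible: fit() merely stores the training data, so essentially all of the cost lands on predict(). A minimal sketch, reusing the objects defined above:

import time

model = Knn()
t0 = time.time()
model.fit(x_train_low, y_train_low)   # just copies arrays: near-instant
t_fit = time.time() - t0
t0 = time.time()
model.predict(x_test_low, 5)          # all the distance work happens here
t_predict = time.time() - t0
print('fit: {:.4f} s, predict: {:.4f} s'.format(t_fit, t_predict))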