The following small project is an implementation we had to do as part of the Machine Learning class at Duke. The objective was to dig deeper into the mechanics of kNN so the algorithm can be customized when the standard implementation offered by scikit-learn does not meet the requirements at hand. As the performance comparison below shows, the implementation needs optimization to reduce prediction time, which will be addressed in another post.
Machine Learning Basics
The objective is to create a classification algorithm whilst addressing the following:
(a) Build a working version of a binary kNN classifier using the skeleton code below.
(b) Load the datasets to be evaluated here. Each includes training features ($\mathbf{X}$) and training labels ($\mathbf{y}$), plus the corresponding test features and labels, for both a low-dimensional ($p = 2$ features/predictors) and a high-dimensional ($p = 100$ features/predictors) dataset. Each of the training and test sets contains $n = 100$ observations.
(c) Train the classifier with $k = 5$, first on the low-dimensional dataset and then on the high-dimensional dataset, and evaluate the classification performance on the corresponding test data for each. Measure the time it takes to make the predictions in each case and the overall accuracy of each set of test data predictions.
(d) Compare my implementation’s accuracy and computation time to the scikit-learn KNeighborsClassifier class. How do the results and speed compare?
(e) Some supervised learning algorithms are more computationally intensive during training than testing. What are the drawbacks of the prediction process being slow?
ANSWER:
(a)
import pandas as pd
import numpy as np

class Knn:
    # k-Nearest Neighbor class object for classification training and testing
    def __init__(self):
        # This is called when the class is initialised
        self.x = []
        self.y = []

    def fit(self, x, y):
        # Save the training data to properties of this class
        self.x = np.array(x)
        self.y = np.array(y).ravel()

    @staticmethod
    def euclidean(points, ref_point):
        # Euclidean distance from ref_point to every row of points
        return np.sqrt(np.sum(np.square(points - ref_point), axis=1))

    def make_knn_prediction(self, distance, k=1):
        # Majority vote over the k nearest training labels;
        # rank(method='first') breaks distance ties deterministically
        nearest = distance.loc[distance.distance.rank(method='first') <= k].index
        return pd.Series(self.y[nearest]).mode().iloc[0]

    def predict(self, x, k):
        y_hat = []  # Variable to store the estimated class labels
        x = np.array(x)
        # Calculate the distance from each vector in x to the training data
        distances = pd.DataFrame(self.x)
        for x_ind in range(len(x)):
            distances.loc[:, 'distance'] = self.euclidean(self.x, x[x_ind])
            y_hat.append(self.make_knn_prediction(distances, k))
        # Return the estimated targets
        return np.array(y_hat)

# Metric of overall classification accuracy
# (a more general function, sklearn.metrics.accuracy_score, is also available)
def accuracy(y, y_hat):
    y, y_hat = np.asarray(y).ravel(), np.asarray(y_hat).ravel()
    return np.mean(y == y_hat)
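Before running the classifier on the assignment data, a quick smoke test helps confirm the class behaves sensibly. The toy points below are my own invention (not part of the assignment): two well-separated clusters, so a correct kNN should label the two probe points [0, 1].

# Smoke test on hand-made, well-separated toy data (not assignment data)
toy_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_model = Knn()
toy_model.fit(toy_x, toy_y)
print(toy_model.predict([[0.5, 0.5], [5.5, 5.5]], k=3))  # expect [0 1]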
(b)
import pandas as pd
# Read in low-dimensionality data
x_train_low = pd.read_csv('./Data/A2_X_train_low.csv',header=None)
y_train_low = pd.read_csv('./Data/A2_y_train_low.csv',header=None)
x_test_low = pd.read_csv('./Data/A2_X_test_low.csv',header=None)
y_test_low = pd.read_csv('./Data/A2_y_test_low.csv',header=None)
# Read in high-dimensionality data
x_train_high = pd.read_csv('./Data/A2_X_train_high.csv',header=None)
y_train_high = pd.read_csv('./Data/A2_y_train_high.csv',header=None)
x_test_high = pd.read_csv('./Data/A2_X_test_high.csv',header=None)
y_test_high = pd.read_csv('./Data/A2_y_test_high.csv',header=None)
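As an optional sanity check (my own addition, not required by the assignment), the shapes should match the description in (b): $n = 100$ observations with $p = 2$ and $p = 100$ features respectively.

# Confirm n = 100 and p = 2 / p = 100
for name, df in [('x_train_low', x_train_low), ('x_test_low', x_test_low),
                 ('x_train_high', x_train_high), ('x_test_high', x_test_high)]:
    print(name, df.shape)  # expect (100, 2) for low, (100, 100) for high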
(c)
# Evaluate the performance of your kNN classifier on a low- and a high-dimensional dataset
# and time the predictions of each
import time
import warnings
warnings.simplefilter('ignore')
# Initialize Knn class object and train on low-dimensionality data
knn_train_low = Knn()
knn_train_low.fit(x_train_low,y_train_low)
# Initialize Knn class object and train on high-dimensionality data
knn_train_high = Knn()
knn_train_high.fit(x_train_high,y_train_high)
# Predict and time my KNN on low-dimensionality data
t0 = time.time()
y_hat_low=knn_train_low.predict(x_test_low,5)
t1 = time.time()
print('Time it takes my own KNN classifier to predict on low-dimensional data is: {:.2f} seconds at '\
      'an accuracy of {:.2%}'.format(t1-t0, accuracy(np.array(y_test_low),y_hat_low)))
# Predict and time my KNN on high-dimensionality data
t0 = time.time()
y_hat_high=knn_train_high.predict(x_test_high,5)
t1 = time.time()
print('Time it takes my own KNN classifier to predict on high-dimensional data is: {:.2f} seconds at '\
      'an accuracy of {:.2%}'.format(t1-t0, accuracy(np.array(y_test_high),y_hat_high)))
(d)
from sklearn.neighbors import KNeighborsClassifier
import time
# Initialize sklearn KNeighborsClassifier and train on low-dimensionality data
sklearn_low = KNeighborsClassifier(n_neighbors=5)
sklearn_low.fit(x_train_low,np.array(y_train_low).ravel())
# Initialize sklearn KNeighborsClassifier and train on high-dimensionality data
sklearn_high = KNeighborsClassifier(n_neighbors=5)
sklearn_high.fit(x_train_high,np.array(y_train_high).ravel())
# Predict and time scikit learn KNN on low-dimensionality data
t0 = time.time()
# Use a new variable name so we don't shadow the accuracy() helper from (a)
acc_low = sklearn_low.score(x_test_low,np.array(y_test_low).ravel())
t1 = time.time()
print('Time it takes sklearn KNN classifier to predict on low-dimensional data is: {:.2f} seconds at '\
      'an accuracy of {:.2%}'.format(t1-t0, acc_low))
# Predict and time scikit learn KNN on high-dimensionality data
t0 = time.time()
acc_high = sklearn_high.score(x_test_high,np.array(y_test_high).ravel())
t1 = time.time()
print('Time it takes sklearn KNN classifier to predict on high-dimensional data is: {:.2f} seconds at '\
      'an accuracy of {:.2%}'.format(t1-t0, acc_high))
The accuracy is the same for my implementation and the one from scikit-learn, but my implementation's prediction time is far longer. This is expected: scikit-learn computes distances in vectorised, compiled code and can use space-partitioning structures such as KD-trees or ball trees, whereas my predict() loops over the test points in pure Python and rebuilds a pandas DataFrame column on every iteration.
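As a preview of the optimization post mentioned at the top, here is a minimal sketch of how the per-test-point loop could be replaced with a single broadcast distance computation. The function name predict_vectorised and the use of np.argpartition are my own choices, not part of the assignment code, and the broadcast assumes the full distance matrix fits in memory (trivial for n = 100).

import numpy as np

def predict_vectorised(x_train, y_train, x_test, k):
    # Hypothetical vectorised alternative to Knn.predict (my own sketch)
    x_train = np.asarray(x_train, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    y_train = np.asarray(y_train).ravel()
    # All pairwise squared Euclidean distances in one broadcast, shape
    # (n_test, n_train); memory-hungry for large n but loop-free
    d2 = np.sum((x_test[:, None, :] - x_train[None, :, :]) ** 2, axis=2)
    # argpartition finds the k smallest distances per row in linear time
    nearest = np.argpartition(d2, kth=k - 1, axis=1)[:, :k]
    votes = y_train[nearest]  # neighbour labels, shape (n_test, k)
    # Majority vote per test point, valid for any pair of binary labels
    classes, idx = np.unique(votes, return_inverse=True)
    counts = np.apply_along_axis(np.bincount, 1, idx.reshape(votes.shape),
                                 None, len(classes))
    return classes[counts.argmax(axis=1)]

# e.g. predict_vectorised(x_train_low, y_train_low, x_test_low, k=5)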
(e)
If the prediction process is slow, this introduces challenges whenever the model is to be used in real time: the delay in results might, for example, cost the company potential clients, or have grave consequences if the prediction is safety-related. In addition, slow predictions tie up computational resources that other processes could use, which is inefficient. Stakeholders or clients may ultimately discontinue such models if they believe the prediction time inhibits their process.
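kNN is the textbook example of this trade-off, and the asymmetry is easy to make visible: fit() merely stores the training data, so essentially all of the cost lands on predict(). A minimal sketch, reusing the objects defined above:

import time

model = Knn()
t0 = time.time()
model.fit(x_train_low, y_train_low)   # just copies arrays: near-instant
t_fit = time.time() - t0
t0 = time.time()
model.predict(x_test_low, 5)          # all the distance work happens here
t_predict = time.time() - t0
print('fit: {:.4f} s, predict: {:.4f} s'.format(t_fit, t_predict))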