Striking the right balance between bias and variance is always a challenge: reducing one tends to increase the other, so a compromise must be struck. This project illustrates the tradeoff by varying the parameter k of a k-nearest-neighbors (k-NN) classifier and observing the effect on both training and test error.
Machine Learning: K-Nearest Neighbors
Bias-variance tradeoff I: Understanding the tradeoff. This exercise will illustrate the impact of the bias-variance tradeoff on classifier performance by looking at classifier decision boundaries.
(a) Create a synthetic dataset (with both features and targets). Use the make_moons function with the parameter noise=0.35 to generate 1000 random samples.
(b) Create a scatter plot of your random samples, with each class shown in a different color.
(c) Create 3 different data subsets by selecting 100 of the 1000 data points at random three times. For each of these 100-sample datasets, fit three k-Nearest Neighbor classifiers with: $k = \{1, 25, 50\}$. This will result in 9 combinations (3 datasets, with 3 trained classifiers).
(d) For each combination of dataset and trained classifier, in a 3-by-3 grid, plot the decision boundary (similar in style to Figure 2.15 from Introduction to Statistical Learning). Each column should represent a different value of $k$ and each row should represent a different dataset.
(e) What do you notice about the difference between the rows and the columns? Which decision boundaries appear to best separate the two classes of data? Which decision boundaries vary the most as the data change?
(f) Explain the bias-variance tradeoff using the example of the plots you made in this exercise.
ANSWER
(a)
from sklearn import datasets
dataset_q5 = datasets.make_moons(n_samples=1000,noise=0.35)
(b)
import matplotlib.pyplot as plt
plt.figure(figsize=(10,6), dpi= 100)
scatter = plt.scatter(x=dataset_q5[0][:,0], y=dataset_q5[0][:,1], c=dataset_q5[1], alpha=0.7)
plt.legend(handles=scatter.legend_elements()[0], labels=['0','1'])
plt.title('Scatter plot of randomly generated data')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
(c)
import random
from sklearn.neighbors import KNeighborsClassifier

# Draw three random 100-point subsets of the 1000 samples (one seed per subset)
subsets = []
for seed in [123, 1234, 12345]:
    random.seed(seed)
    ind = random.sample(range(0, 1000), 100)
    subsets.append((dataset_q5[0][ind], dataset_q5[1][ind]))

# Fit a k-NN classifier for every (subset, k) combination: 3 datasets x 3 values of k = 9 models
knn_models = {}
for j, (sub_x, sub_y) in enumerate(subsets, start=1):
    for k in [1, 25, 50]:
        knn_models[(j, k)] = KNeighborsClassifier(n_neighbors=k).fit(sub_x, sub_y)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
%config InlineBackend.figure_format = 'retina' # Makes the plots clear on high-res screens
h = .02 # step size in the mesh
# Create color maps
cmap_light = ListedColormap(['orange', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'darkblue'])
fig, axs = plt.subplots(3, 3, figsize=(17, 17))
for j, (sub_x, sub_y) in enumerate(subsets, start=1):
    for col, k in enumerate([1, 25, 50]):
        # Mesh covering this subset, padded by 1 unit on each side
        x_min, x_max = sub_x[:, 0].min() - 1, sub_x[:, 0].max() + 1
        y_min, y_max = sub_x[:, 1].min() - 1, sub_x[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
        # Predict the class over the mesh and put the result into a color plot
        Z = knn_models[(j, k)].predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
        axs[j-1, col].pcolormesh(xx, yy, Z, cmap=cmap_light)
        # Plot also the 100 training points of this subset
        axs[j-1, col].scatter(sub_x[:, 0], sub_x[:, 1], c=sub_y, cmap=cmap_bold,
                              edgecolor='k', s=20)
        axs[j-1, col].set_xlim([xx.min(), xx.max()])
        axs[j-1, col].set_ylim([yy.min(), yy.max()])
        axs[j-1, col].set(title='Dataset = %d, Neighbours = %d' % (j, k),
                          xlabel='X1', ylabel='X2')
fig.suptitle('KNN Models for 3 Datasets and K = 1,25,50')
plt.show()
(e) Moving across a row (increasing k), the decision boundary becomes smoother and underfitting increases. Moving down a column (changing the dataset), the decision boundary varies the most for k = 1 and becomes more stable as k increases. k = 1 appears to separate the two classes most aggressively, but at the cost of high variance; k = 25 gives the best balance between a stable decision boundary and good separation of the two classes.
(f) The bias-variance tradeoff is the challenge of minimizing the overall generalization error. As a model's bias decreases (moving across a row towards k = 1), it fits the training data more tightly. The tradeoff is increased variance in the decision boundary, as shown by the k = 1 column, where the boundary changes substantially as we move down across the different datasets.
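To make this concrete numerically, the sketch below (an illustrative addition, not part of the original answer; the number of repeats, the seed, and the choice of query points are assumptions) refits k-NN classifiers on repeated random 100-point subsets of dataset_q5 and reports how often their predictions at a fixed set of points disagree across subsets, a rough proxy for the variance of the decision boundary.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Rough variance proxy (illustrative sketch): how often do classifiers trained on
# different random 100-point subsets disagree at the same query points?
rng = np.random.default_rng(0)          # assumed seed for reproducibility
X_all, y_all = dataset_q5               # features and labels generated in (a)
query_points = X_all[:200]              # fixed points at which predictions are compared
n_repeats = 20                          # assumed number of random subsets

for k in [1, 25, 50]:
    preds = []
    for _ in range(n_repeats):
        idx = rng.choice(len(X_all), size=100, replace=False)
        preds.append(KNeighborsClassifier(n_neighbors=k)
                     .fit(X_all[idx], y_all[idx]).predict(query_points))
    preds = np.array(preds)
    # Fraction of query points whose predicted label is not unanimous across subsets
    disagreement = np.mean(preds.min(axis=0) != preds.max(axis=0))
    print('k = %2d: prediction disagreement across subsets = %.2f' % (k, disagreement))
In this sketch, disagreement is typically highest for k = 1 and drops as k grows, matching the visual instability of the k = 1 column in the plots above.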
Bias-variance tradeoff II: Quantifying the tradeoff. This exercise will explore the impact of the bias-variance tradeoff on classifier performance by looking at training and test error.
Here, the value of $k$ determines how flexible our model is.
(a) Using the same approach as earlier to generate random samples (the make_moons function), create a new set of 1000 random samples; call this dataset your test set and the previously created dataset your training set.
(b) Train a kNN classifier on your training set for $k = 1,2,…500$. Apply each of these trained classifiers to both your training dataset and your test dataset and plot the classification error (fraction of mislabeled datapoints).
(c) What trend do you see in the results?
(d) What values of $k$ represent high bias and which represent high variance?
(e) What is the optimal value of $k$ and why?
(f) In kNN classifiers, the value of k controls the flexibility of the model – what controls the flexibility of other models?
ANSWER
(a)
dataset_q7 = datasets.make_moons(n_samples=1000,noise=0.35)
test_x = dataset_q7[0]
test_y = dataset_q7[1]
train_x = dataset_q5[0]
train_y = dataset_q5[1]
(b)
from sklearn.neighbors import KNeighborsClassifier
test_errors = []                      # out-of-sample (test) error rate for each k
train_errors = []                     # in-sample (training) error rate for each k
min_test_error = 1
min_test_error_k = 0
for k in range(1, 501):
    knn = KNeighborsClassifier(n_neighbors=k).fit(train_x, train_y)
    test_errors.append(1 - knn.score(test_x, test_y))
    train_errors.append(1 - knn.score(train_x, train_y))
    if test_errors[-1] < min_test_error:
        min_test_error = test_errors[-1]
        min_test_error_k = k
import numpy as np
%config InlineBackend.figure_format = 'retina' # Makes the plots clear on high-res screens
#plot colors
color0 = '#121619' # Dark grey
color1 = '#00B050' # Green
# Create the plot
plt.figure(figsize=(17,10), dpi= 100) # Adjust the figure size and dpi (dots per inch)
# to clearly display the plot and balance plot proportions
plt.plot(range(1,501),test_errors,color=color1,label='Test (out-of-sample)')
plt.plot(range(1,501),train_errors,color='grey',label='Training (in-sample)')
plt.xlim(1, 501)
plt.xticks((1,100,200,300,400,500))
# Hide the right, left, and top spines
ax = plt.gca()
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.legend()
plt.title('Training vs Test Classification Error for K-Nearest Models')
plt.xlabel('K-Nearest Neighbors')
plt.ylabel('Binary Classification Error Rate')
plt.tight_layout()
plt.show()
(c) The test error generally decreases as K increases, up to an optimal K, and then starts increasing again. In the region where the test error is decreasing, the training error is significantly lower, reflecting overfitting in that region. As K increases, the test and training error rates converge, reflecting the reduction in overfitting. As K gets large, both the training and test error rates eventually start increasing again due to underfitting.
(d) Values of K close to 1 correspond to high variance, while larger values of K (e.g. greater than 100) correspond to high bias.
(e)
print('Optimal value of K is: %.0f, because it has the lowest test error.' % min_test_error_k)
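Strictly speaking, picking K by minimizing the test error uses the test set for model selection. As a hedged alternative (the 5-fold setting and the coarse grid of K values below are assumptions made for this sketch, not part of the original answer), k-fold cross-validation on the training data alone can be used to choose K and should land in a similar region:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Sketch: choose K using only the training data via 5-fold cross-validation
k_grid = list(range(1, 101, 2))          # assumed coarse grid to keep the search quick
cv_errors = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 train_x, train_y, cv=5).mean()
             for k in k_grid]
best_k_cv = k_grid[int(np.argmin(cv_errors))]
print('K chosen by 5-fold cross-validation: %d' % best_k_cv)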
(f) In other models, the number of parameters typically controls the flexibility. As the number of parameters increases, the model becomes more flexible, since it has more levers with which to fit the data.
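As an illustrative sketch of the same idea in another model family (the decision tree and the specific depths are assumptions, not part of the original answer), the maximum depth of a decision tree plays a role analogous to 1/K: deeper trees are more flexible, driving the training error down while the test error can rise again.
from sklearn.tree import DecisionTreeClassifier

# Sketch: tree depth as the flexibility knob (deeper = more flexible),
# using the training and test sets defined above
for depth in [1, 3, 10, None]:           # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(train_x, train_y)
    print('max_depth=%s: train error = %.3f, test error = %.3f'
          % (depth, 1 - tree.score(train_x, train_y), 1 - tree.score(test_x, test_y)))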