Classification through Logistic Regression¶
Introduction¶
For this problem I will derive, implement through gradient descent, and test the performance of a logistic regression classifier for a binary classification problem.
In this case, I’ll assume our logistic regression problem will be applied to a two dimensional feature space. Our logistic regression model is:
$$f(\mathbf{x}_i,\mathbf{w})=\sigma(\mathbf{w}^{\intercal} \mathbf{x}_i)$$where the sigmoid function is defined as $\sigma(x) = \dfrac{e^x}{1+e^{x}}= \dfrac{1}{1+e^{-x}}$. Also, since this is a two-dimensional problem, we define $\mathbf{w}^{\intercal} \mathbf{x}_i = w_0 x_{i,0} + w_1 x_{i,1} + w_2 x_{i,2}$ and here, $\mathbf{x}_i=[x_{i,0}, x_{i,1}, x_{i,2}]^{\intercal}$, and $x_{i,0} \triangleq 1$
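As a quick sanity check that the two forms of the sigmoid above really agree, here is a minimal sketch (the name `sigmoid` is my own):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)), the second form of the definition
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5, 5, 11)
# The first form, e^x / (1 + e^x), matches the second everywhere
print(np.allclose(sigmoid(xs), np.exp(xs) / (1.0 + np.exp(xs))))  # True
print(sigmoid(0.0))  # 0.5 -- the decision boundary sits at w^T x = 0
```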
As in class, we will interpret the response of the logistic regression classifier to be the likelihood of the data given the model parameters. For one sample, $(y_i, \mathbf{x}_i)$, this is given as:
$$P(Y=y_i|X=\mathbf{x}_i) = f(\mathbf{x}_i,\mathbf{w})=\sigma(\mathbf{w}^{\intercal} \mathbf{x}_i)$$Aside: the careful reader will recognize this expression looks different from when we talk about the likelihood of our data given the true class label, typically expressed as $P(x|y)$, or the posterior probability of a class label given our data, typically expressed as $P(y|x)$. In the context of training a logistic regression model, we know the training $\mathbf{x}$ values and $y$ values, so the above probability is primarily a function of the logistic regression parameters, $\mathbf{w}$. It’s our goal to choose the parameters $\mathbf{w}$ that maximize this probability of our data.
Finding the cost function that we can use to choose the model parameters, $\mathbf{w}$, that best fit the training data in (a) – (d)¶
(a) We need to determine the likelihood function for all $N$ samples in our training dataset that we will maximize.
(b) I’ll express part (a) as a cost function of the model parameters, $C(\mathbf{w})$, that is the negative of the logarithm of (a).
(c) I’ll then calculate the gradient of the cost function with respect to the model parameters, $\nabla_{\mathbf{w}}C(\mathbf{w})$, expressing it in terms of the partial derivatives of the cost function with respect to each of the parameters, e.g. $\nabla_{\mathbf{w}}C(\mathbf{w}) = \left[\dfrac{\partial C}{\partial w_0}, \dfrac{\partial C}{\partial w_1}, \dfrac{\partial C}{\partial w_2}\right]$.
(d) Finally, I’ll write out the gradient descent update equation, assuming $\eta$ represents the learning rate.
Prepare and plot the data in (e) and (f)¶
Implementing gradient descent and logistic regression algorithm (g) – (k)¶
(g) I’ll start by creating a function or class to implement logistic regression. It’ll take as inputs the data, $\mathbf{x}_i$, and the model parameters, $\mathbf{w}=\left[w_0,w_1,w_2\right]^{\intercal}$, and output the class confidence probabilities, $P(Y=y_i|X=\mathbf{x}_i)$.
(h) Then I’ll create a function that computes the cost function $C(\mathbf{w})$ for a given dataset and corresponding class labels.
(i) In addition, I’ll create a function or class to run gradient descent on the training data. I’ll refer to this as “batch” gradient descent since it takes into account the gradient based on all our data at each iteration (or “epoch”) of the algorithm. In doing this I’ll need to make some assumptions about and/or experiment with the following:
- The initialization of the algorithm – I’ll randomly initialize each weight to a value between 0 and 1.
- The learning rate or how slow/fast the algorithm will proceed in the direction opposite the gradient – This I’ll experiment with.
- Stopping criteria, or when the algorithm should stop searching for the optimum – I’ll stop when the solution changes by no more than $10^{-6}$ between iterations. Since we have a weight vector, I’ll check whether the L2 norm of the weight vector changes by no more than $10^{-6}$ between iterations.
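The stopping rule in the last bullet can be sketched as follows (a minimal illustration; `converged` and the example vectors are my own names, not part of the assignment):

```python
import numpy as np

def converged(w_new, w_old, tol=1e-6):
    # Stop when the L2 norm of the weight vector changes by no more than tol
    return abs(np.linalg.norm(w_new) - np.linalg.norm(w_old)) < tol

print(converged(np.array([0.2, 0.5, 0.5]), np.array([0.2, 0.5, 0.5])))  # True
print(converged(np.array([0.2, 0.5, 0.5]), np.array([0.9, 0.5, 0.5])))  # False
```

Note that this compares the change in the norm, not the norm of the change, so it could fire early if the weights rotated at constant magnitude; `np.linalg.norm(w_new - w_old) < tol` would be a stricter alternative.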
(j) The approach is designed so that at each step in the gradient descent algorithm, it will produce updated parameter estimates. For each set of estimates, I’ll calculate the cost function for both the training and the test data.
(k) We’ll then show the gradient descent process for different learning rates by plotting the resulting cost as a function of each iteration (or “epoch”).
Testing the model performance through cross validation (l) – (o)¶
(l) I’ll test the performance of the trained classifier using K-fold cross validation and plot Receiver Operating Characteristic (ROC) curves of the cross-validated performance.
ANSWER
(a) $P(\mathbf{y}|\mathbf{X}) = \prod_{i=1}^{N}\sigma(\mathbf{w}^{\intercal} \mathbf{x}_i)^{y_i}\left[1-\sigma(\mathbf{w}^{\intercal} \mathbf{x}_i)\right]^{1-y_i}$
(b) $C(\mathbf{w}) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \sigma(\mathbf{w}^{\intercal} \mathbf{x}_i) + (1-y_i)\log(1-\sigma(\mathbf{w}^{\intercal} \mathbf{x}_i))\right]$
(c) $$\nabla_{\mathbf{w}}C(\mathbf{w}) = \begin{bmatrix} -\frac{1}{N}\dfrac{\partial}{\partial w_0} \sum_{i=1}^{N}\left[y_i \log \sigma(\mathbf{w}^{\intercal} \mathbf{x}_i) + (1-y_i)\log(1-\sigma(\mathbf{w}^{\intercal} \mathbf{x}_i))\right] \\ -\frac{1}{N}\dfrac{\partial}{\partial w_1} \sum_{i=1}^{N}\left[y_i \log \sigma(\mathbf{w}^{\intercal} \mathbf{x}_i) + (1-y_i)\log(1-\sigma(\mathbf{w}^{\intercal} \mathbf{x}_i))\right] \\ -\frac{1}{N}\dfrac{\partial}{\partial w_2} \sum_{i=1}^{N}\left[y_i \log \sigma(\mathbf{w}^{\intercal} \mathbf{x}_i) + (1-y_i)\log(1-\sigma(\mathbf{w}^{\intercal} \mathbf{x}_i))\right] \end{bmatrix}$$
$$\nabla_{\mathbf{w}}C(\mathbf{w}) = \begin{bmatrix} \frac{1}{N}\sum_{i=1}^{N}[\sigma(\mathbf{w}^{\intercal} \mathbf{x}_i) - y_i]x_{i,0} \\ \frac{1}{N}\sum_{i=1}^{N}[\sigma(\mathbf{w}^{\intercal} \mathbf{x}_i) - y_i]x_{i,1} \\ \frac{1}{N}\sum_{i=1}^{N}[\sigma(\mathbf{w}^{\intercal} \mathbf{x}_i) - y_i]x_{i,2} \end{bmatrix}$$(d) $\mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \eta\,\nabla_{\mathbf{w}}C(\mathbf{w}^{(i)})$
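Stacking the $N$ samples row-wise into a design matrix $\mathbf{X}$ (whose first column holds the all-ones $x_{i,0}$ entries) and the labels into a vector $\mathbf{y}$, the three partial derivatives collapse into a single vectorized expression, which is the form most convenient to implement with NumPy:

$$\nabla_{\mathbf{w}}C(\mathbf{w}) = \frac{1}{N}\mathbf{X}^{\intercal}\left(\sigma(\mathbf{X}\mathbf{w}) - \mathbf{y}\right)$$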
(e)
import pandas as pd
data_q1 = pd.read_csv('./data/A3_Q1_data.csv')
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.figure(figsize=(10,6), dpi= 100)
ax = plt.scatter(data_q1.x1, data_q1.x2, c=data_q1.y, alpha=0.7, edgecolor='k', s=10)
plt.legend(handles=ax.legend_elements()[0], labels=['0','1'])
plt.title('Scatter plot of dataset')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
The data does not look linearly separable from visual inspection because of the overlap between the purple and yellow points shown in the plot. Logistic regression is suitable because the response is binary, but it might not perform well because no linear boundary can cleanly separate the two classes.
(f) The dataset had no missing values. The x1 and x2 variables had similar minimum and maximum values (and hence similar ranges), and their standard deviations were also similar, so I decided scaling was not necessary.
(g)
import numpy as np
def confidence_probabilities(X, w):
    # Sigmoid of the linear response z = Xw: the class-1 confidence per sample
    z = np.dot(X, w)
    g = 1 / (1 + np.exp(-z))
    return g
(h)
def cost(X, Y, w):
    # Average cross-entropy (negative log-likelihood) over the dataset
    m = len(X)
    prediction = confidence_probabilities(X, w)
    J = (1/m) * np.sum((-Y * np.log(prediction)) - ((1-Y) * np.log(1-prediction)))
    return J
(i)
import numpy as np
def gradient(X, Y, w):
    # Gradient of the cost from part (c): (1/m) X^T (sigma(Xw) - y)
    m = len(X)
    prediction = confidence_probabilities(X, w)
    cost_change = (1/m) * (np.transpose(X) @ (prediction - Y))
    return cost_change
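A quick way to build confidence in the analytic gradient from part (c) is to compare it against central finite differences on a small synthetic dataset (a self-contained sketch; the helper names `cost_fn` and `grad_fn` are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_fn(X, Y, w):
    # Average cross-entropy, as in part (b)
    p = sigmoid(X @ w)
    return -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))

def grad_fn(X, Y, w):
    # Analytic gradient from part (c): (1/N) X^T (sigma(Xw) - y)
    return X.T @ (sigmoid(X @ w) - Y) / len(X)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])  # x0 = 1 column
Y = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=3)

# Central differences: dC/dw_j ~ [C(w + h e_j) - C(w - h e_j)] / (2h)
h = 1e-6
num_grad = np.zeros(3)
for j in range(3):
    e = np.zeros(3)
    e[j] = h
    num_grad[j] = (cost_fn(X, Y, w + e) - cost_fn(X, Y, w - e)) / (2 * h)

print(np.max(np.abs(num_grad - grad_fn(X, Y, w))))  # small, e.g. below 1e-6
```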
def gradient_descent(X_train, Y_train, X_test=None, Y_test=None, learning_rate=0.1):
    # Batch gradient descent: weights start uniformly at random in [0, 1),
    # and iteration stops once the L2 norm of the weight vector changes
    # by no more than 1e-6 between iterations (or after 1000 iterations).
    w = np.random.uniform(low=0.0, high=1.0, size=3).reshape(3, 1)
    cost_training = []
    cost_test = []
    for it in range(1000):
        w_new = w - learning_rate * gradient(X_train, Y_train, w)
        w_delta = np.abs(np.linalg.norm(w_new) - np.linalg.norm(w))
        w = w_new
        cost_training.append(cost(X_train, Y_train, w))
        if not (X_test is None or Y_test is None):
            cost_test.append(cost(X_test, Y_test, w))
        if w_delta < 10**(-6):
            print('Ending iteration: %s because L2-Norm converged' % (it + 1))
            break
    return w, cost_training, cost_test
(k)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    data_q1.loc[:, ['x1', 'x2']], data_q1.y, test_size=0.3, random_state=123)
X_train.insert(0,'x0',1)
X_test.insert(0,'x0',1)
weights_all,cost_training_all,cost_test_all = [], [],[]
for c in [0.3, 0.5, 0.8, 1, 3, 5]:
    weights, cost_training, cost_test = gradient_descent(
        np.array(X_train), np.array(y_train).reshape(len(y_train), 1),
        np.array(X_test), np.array(y_test).reshape(len(y_test), 1),
        learning_rate=c)
    weights_all.append(weights)
    cost_training_all.append(cost_training)
    cost_test_all.append(cost_test)
pd.DataFrame(cost_training_all,
             index=["LR = 0.3", "LR = 0.5", "LR = 0.8", "LR = 1", "LR = 3", "LR = 5"]
             ).transpose().plot(figsize=(17, 10),
                                title="Learning Rate effect on Training Dataset Cost Function")
plt.xlabel('Iteration Number')
plt.ylabel('Average Cost')
# Hide the right, left, and top spines
ax = plt.gca()
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
pd.DataFrame(cost_test_all,
             index=["LR = 0.3", "LR = 0.5", "LR = 0.8", "LR = 1", "LR = 3", "LR = 5"]
             ).transpose().plot(figsize=(17, 10),
                                title="Learning Rate effect on Test Dataset Cost Function")
plt.xlabel('Iteration Number')
plt.ylabel('Average Cost')
# Hide the right, left, and top spines
ax = plt.gca()
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['top'].set_visible(False)
The learning rate affects how quickly the cost converges. As shown in the two plots above, increasing the learning rate changes the slope of the cost curves. The learning rate acts like a step size, which presents a trade-off one must consider when choosing a value. If the step size is too small, the algorithm may take a long time to converge; if it is too big, the algorithm risks overshooting the optimal solution and, in some cases, failing to converge at all.
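The trade-off is easy to see on a toy one-dimensional problem, minimizing $f(w)=w^2$ (gradient $2w$), where the update $w \leftarrow w - \eta \cdot 2w$ contracts only when $0 < \eta < 1$ (an illustrative sketch, not part of the assignment):

```python
def descend(eta, steps=50, w=1.0):
    # Gradient descent on f(w) = w^2; each step multiplies w by (1 - 2*eta)
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

print(abs(descend(0.1)))  # shrinks toward 0: converges
print(abs(descend(1.1)))  # grows each step: diverges
```

For the logistic cost the safe range depends on the curvature of the data rather than being exactly $(0, 1)$, which is presumably why rates above 1 can still converge on this dataset.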
I have chosen a learning rate of 3 because it converges in just under 100 iterations, and the cost-function convergence pattern is consistent on both the training and test sets, which reduces the likelihood of unexpected results if the data is partitioned in another way.
weights, cost_training, cost_test = gradient_descent(
    np.array(X_train), np.array(y_train).reshape(len(y_train), 1),
    np.array(X_test), np.array(y_test).reshape(len(y_test), 1), learning_rate=3)
(l)
from sklearn.model_selection import StratifiedKFold
prediction_scores = np.empty((y_train.shape[0], 1), dtype='float')
kf = StratifiedKFold(n_splits=5, shuffle=True)
for train_index, val_index in kf.split(X_train, y_train):
    # Extract the training and validation data for this fold
    x_train_cv, x_val_cv = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_cv = y_train.iloc[train_index]
    # Train the logistic regression classifier on this fold
    weights_, cost_training_, cost_test_ = gradient_descent(
        np.array(x_train_cv), np.array(y_train_cv).reshape(len(y_train_cv), 1),
        learning_rate=3)
    # Score the classifier on the validation data for this fold
    prediction_ = confidence_probabilities(x_val_cv, weights_)
    # Save the predictions for this fold
    prediction_scores[val_index] = prediction_
from sklearn.metrics import roc_curve,roc_auc_score
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
# generate a no skill prediction (majority class)
ns_probs = [0 for _ in range(len(y_train))]
ns_fpr, ns_tpr, _ = roc_curve(y_train, ns_probs)
fpr, tpr, thresholds = roc_curve(y_train, prediction_scores.astype(float).ravel())
auc_ = auc(fpr, tpr)
def plot_roc_curve(fpr, tpr, ns_fpr, ns_tpr, auc=0, title='ROC Curve: Cross Validation'):
    plt.figure(figsize=(10, 6), dpi=100)
    plt.plot(fpr, tpr, label='Logistic')
    plt.plot(ns_fpr, ns_tpr, linestyle='--', label='Random Guess')
    plt.axis([0, 1, 0, 1])
    plt.text(1, 0.5, 'AUC:{:.2f}'.format(auc), horizontalalignment='right',
             verticalalignment='center',
             bbox=dict(boxstyle="round",
                       ec=(1., 0.5, 0.5),
                       fc=(1., 0.8, 0.8)))
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(title)
    # Show the legend
    plt.legend()
    # Hide the right, left, and top spines
    ax = plt.gca()
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['top'].set_visible(False)
    plt.show()
plot_roc_curve(fpr, tpr, ns_fpr, ns_tpr, auc=auc_)
(m) We use cross validation to estimate how the model will perform on test data, or, more generally, when presented with unseen data in the future.
(n)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
plt.figure(figsize=(10,6), dpi= 100)
# Create color maps
cmap_light = ListedColormap(['orange', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'darkblue'])
h=0.02
x_min, x_max = X_train.iloc[:, 1].min() - 1, X_train.iloc[:, 1].max() + 1
y_min, y_max = X_train.iloc[:, 2].min() - 1, X_train.iloc[:, 2].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))
# TRAINING SET DECISION BOUNDARY PLOT
Z = confidence_probabilities(np.c_[np.ones(xx.ravel().shape[0]), xx.ravel(), yy.ravel()],
                             weights)
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
# Plot also the training points
cls_0 = X_train[y_train == 0]
cls_1 = X_train[y_train == 1]
plt.scatter(cls_0.iloc[:, 1], cls_0.iloc[:, 2], c='darkorange',edgecolor='k',\
s=10,label="Class 0")
plt.scatter(cls_1.iloc[:, 1], cls_1.iloc[:, 2], c='darkblue',edgecolor='k',alpha=0.5,\
s=10,label="Class 1")
plt.legend()
plt.title("Training Data Decision Boundary")
plt.xlabel("X1")
plt.ylabel("X2")
plt.show()
plt.figure(figsize=(10,6), dpi= 100)
x_min, x_max = X_test.iloc[:, 1].min() - 1, X_test.iloc[:, 1].max() + 1
y_min, y_max = X_test.iloc[:, 2].min() - 1, X_test.iloc[:, 2].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))
# TEST SET DECISION BOUNDARY PLOT
Z = confidence_probabilities(np.c_[np.ones(xx.ravel().shape[0]), xx.ravel(), yy.ravel()],
                             weights)
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
# Plot also the test points
cls_0 = X_test[y_test == 0]
cls_1 = X_test[y_test == 1]
plt.scatter(cls_0.iloc[:, 1], cls_0.iloc[:, 2], c='darkorange',edgecolor='k',\
s=10,label="Class 0")
plt.scatter(cls_1.iloc[:, 1], cls_1.iloc[:, 2], c='darkblue',edgecolor='k',\
alpha=0.5, s=10,label="Class 1")
plt.legend()
plt.title("Test Data Decision Boundary")
plt.xlabel("X1")
plt.ylabel("X2")
plt.show()
The linear decision boundary seems inappropriate for separating the two classes. A non-linear decision boundary could improve performance, but the overlap makes me believe that most models would struggle to find a decision boundary that generalizes well, i.e. without overfitting.
(o)
# generate a no skill prediction (majority class)
ns_probs = [0 for _ in range(len(y_test))]
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
prediction_scores_ = confidence_probabilities(X_test,weights)
fpr, tpr, thresholds = roc_curve(y_test, np.ravel(prediction_scores_))
auc_ = auc(fpr, tpr)
plot_roc_curve (fpr,tpr,ns_fpr,ns_tpr,auc=auc_,title = 'ROC Curve: Test Dataset')
Model performance is better than chance, as shown by the curve lying significantly above the chance line as well as by the higher AUC.

