[LOGR1] - Logistic regression
Simple example of logistic regression with a sklearn solution.
Objectives:
- Logistic regression aims to provide the probability of belonging to a class.
- Discover a 100% TensorFlow implementation... and learn to love Keras
What we're going to do:
X contains the features of each observation
y contains the class label (1 or 0)
We'll look for a value of $\theta$ such that the linear combination $\theta^{T}X$ can be used to compute our probability:
$\hat{p} = h_\theta(X) = \sigma(\theta^T{X})$
Where $\sigma$ is the logistic function, an S-shaped (sigmoid) curve:
$ \sigma(t) = \dfrac{1}{1 + \exp(-t)} $
The predicted value $\hat{y}$ will then be calculated as follows:
$ \hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5 \\ 1 & \text{if } \hat{p} \geq 0.5 \end{cases} $
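As a quick illustration, here is a minimal NumPy sketch of this decision rule (the helpers sigmoid and predict_class are our own, not part of this notebook):
import numpy as np

def sigmoid(t):
    '''Logistic function: sigma(t) = 1 / (1 + exp(-t))'''
    return 1.0 / (1.0 + np.exp(-t))

def predict_class(theta, X):
    '''y_hat = 1 where p_hat >= 0.5, else 0'''
    p_hat = sigmoid(X @ theta)            # one probability per observation
    return (p_hat >= 0.5).astype(int)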
Calculation of the cost of the regression:
For a training observation x, the cost can be calculated as follows:
$ c(\theta) = \begin{cases} -\log(\hat{p}) & \text{if } y = 1 \\ -\log(1 - \hat{p}) & \text{if } y = 0 \end{cases} $
The regression cost function (log loss) over the whole training set can be written as follows:
$ J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m}{\left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + (1 - y^{(i)}) \log\left(1 - \hat{p}^{(i)}\right)\right]} $
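As a sketch, $J(\theta)$ can be computed directly from the predicted probabilities (log_loss_theta is a hypothetical helper of ours, reusing numpy as np from the sketch above):
def log_loss_theta(theta, X, y, eps=1e-12):
    '''Mean log loss J(theta) over the m training observations'''
    p_hat = 1.0 / (1.0 + np.exp(-(X @ theta)))
    p_hat = np.clip(p_hat, eps, 1.0 - eps)        # clip to avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1.0 - y) * np.log(1.0 - p_hat))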
Step 1 - Import and init
You can also adjust the verbosity by changing the value of TF_CPP_MIN_LOG_LEVEL:
- 0 = all messages are logged (default)
- 1 = INFO messages are not printed
- 2 = INFO and WARNING messages are not printed
- 3 = INFO, WARNING and ERROR messages are not printed
# import os
# os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
import matplotlib
import matplotlib.pyplot as plt
# import math
import random
# import os
import sys
import fidle
# Init Fidle environment
run_id, run_dir, datasets_dir = fidle.init('LOGR1')
FIDLE - Environment initialization
Version              : 2.3.0
Run id               : LOGR1
Run dir              : ./run/LOGR1
Datasets dir         : /gpfswork/rech/mlh/uja62cb/fidle-project/datasets-fidle
Start time           : 03/03/24 21:03:14
Hostname             : r3i6n3 (Linux)
Tensorflow log level : Info + Warning + Error (=0)
Update keras cache   : False
Update torch cache   : False
Save figs            : ./run/LOGR1/figs (True)
numpy                : 1.24.4
sklearn              : 1.3.2
yaml                 : 6.0.1
matplotlib           : 3.8.2
pandas               : 2.1.3
1.1 - Useful stuff (hidden)
def vector_infos(name, V):
    '''Displaying some information about a vector'''
    with np.printoptions(precision=4, suppress=True):
        print("{:16} : ndim={} shape={:10} Mean = {} Std = {}".format(name, V.ndim, str(V.shape), V.mean(axis=0), V.std(axis=0)))
def do_i_have_it(hours_of_work, hours_of_sleep):
    '''Returns the exam result based on work and sleep hours'''
    hours_of_sleep_min = 5
    hours_of_work_min  = 4
    hours_of_game_max  = 3
    # ---- Have to sleep and work
    if hours_of_sleep < hours_of_sleep_min: return 0
    if hours_of_work  < hours_of_work_min:  return 0
    # ---- Gameboy is not good for you
    hours_of_game = 24 - 10 - hours_of_sleep - hours_of_work + random.gauss(0, 0.4)
    if hours_of_game > hours_of_game_max: return 0
    # ---- Fine, you got it
    return 1
def make_students_dataset(size, noise):
    '''Builds a dataset for <size> students'''
    x = []
    y = []
    for i in range(size):
        w = random.gauss(5, 1)      # hours of work
        s = random.gauss(7, 1.5)    # hours of sleep
        r = do_i_have_it(w, s)
        x.append([w, s])
        y.append(r)
    return (np.array(x), np.array(y))
def plot_data(x, y, colors=('green','red'), legend=True):
    '''Plots a dataset'''
    fig, ax = plt.subplots(1, 1)
    fig.set_size_inches(10, 8)
    ax.plot(x[y==1, 0], x[y==1, 1], 'o', color=colors[0], markersize=4, label="y=1 (positive)")
    ax.plot(x[y==0, 0], x[y==0, 1], 'o', color=colors[1], markersize=4, label="y=0 (negative)")
    if legend: ax.legend()
    plt.tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)
    plt.xlabel('Hours of work')
    plt.ylabel('Hours of sleep')
    plt.show()
def plot_results(x_test, y_test, y_pred):
    '''Plots a result'''
    precision = metrics.precision_score(y_test, y_pred)
    recall    = metrics.recall_score(y_test, y_pred)
    print("Precision = {:5.3f}   Recall = {:5.3f}".format(precision, recall))
    x_pred_positives = x_test[ y_pred == 1 ]     # items predicted positive
    x_real_positives = x_test[ y_test == 1 ]     # items actually positive
    x_pred_negatives = x_test[ y_pred == 0 ]     # items predicted negative
    x_real_negatives = x_test[ y_test == 0 ]     # items actually negative
    fig, axs = plt.subplots(2, 2)
    fig.subplots_adjust(wspace=.1, hspace=0.2)
    fig.set_size_inches(14, 10)
    axs[0,0].plot(x_pred_positives[:,0], x_pred_positives[:,1], 'o', color='lightgreen',  markersize=10, label="Predicted positives")
    axs[0,0].plot(x_real_positives[:,0], x_real_positives[:,1], 'o', color='green',       markersize=4,  label="Actual positives")
    axs[0,0].legend()
    axs[0,0].tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)
    axs[0,0].set_xlabel('$x_1$')
    axs[0,0].set_ylabel('$x_2$')
    axs[0,1].plot(x_pred_negatives[:,0], x_pred_negatives[:,1], 'o', color='lightsalmon', markersize=10, label="Predicted negatives")
    axs[0,1].plot(x_real_negatives[:,0], x_real_negatives[:,1], 'o', color='red',         markersize=4,  label="Actual negatives")
    axs[0,1].legend()
    axs[0,1].tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)
    axs[0,1].set_xlabel('$x_1$')
    axs[0,1].set_ylabel('$x_2$')
    axs[1,0].plot(x_pred_positives[:,0], x_pred_positives[:,1], 'o', color='lightgreen',  markersize=10, label="Predicted positives")
    axs[1,0].plot(x_pred_negatives[:,0], x_pred_negatives[:,1], 'o', color='lightsalmon', markersize=10, label="Predicted negatives")
    axs[1,0].plot(x_real_positives[:,0], x_real_positives[:,1], 'o', color='green',       markersize=4,  label="Actual positives")
    axs[1,0].plot(x_real_negatives[:,0], x_real_negatives[:,1], 'o', color='red',         markersize=4,  label="Actual negatives")
    axs[1,0].tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)
    axs[1,0].set_xlabel('$x_1$')
    axs[1,0].set_ylabel('$x_2$')
    axs[1,1].pie([precision, 1-precision], explode=[0, 0.1], labels=["", "Errors"],
                 autopct='%1.1f%%', shadow=False, startangle=70, colors=["lightsteelblue", "coral"])
    axs[1,1].axis('equal')
    plt.show()
1.2 - Parameters
data_size = 1000 # Number of observations
data_cols = 2 # observation size
data_noise = 0.2
random_seed = 123
Step 2 - Prepare the data
2.1 - Get a dataset
x_data, y_data = make_students_dataset(data_size, data_noise)
2.2 - Show it
plot_data(x_data, y_data)
vector_infos('Dataset X',x_data)
vector_infos('Dataset y',y_data)
Dataset X        : ndim=2 shape=(1000, 2)  Mean = [5.0027 6.9802] Std = [0.9491 1.4651]
Dataset y        : ndim=1 shape=(1000,)    Mean = 0.656 Std = 0.47504105085771264
2.3 - Preparation of data
We're going to:
- split the data into:
  - a training set
  - a test set
- normalize the data
# ---- Split data
n = int(data_size * 0.8)
x_train = x_data[:n]
y_train = y_data[:n]
x_test = x_data[n:]
y_test = y_data[n:]
# ---- Normalization
mean = np.mean(x_train, axis=0)
std = np.std(x_train, axis=0)
x_train = (x_train-mean)/std
x_test = (x_test-mean)/std
# ---- About it
vector_infos('X_train',x_train)
vector_infos('y_train',y_train)
vector_infos('X_test',x_test)
vector_infos('y_test',y_test)
y_train_h = y_train.reshape(-1,)   # needed for the visualization
X_train          : ndim=2 shape=(800, 2)   Mean = [-0. -0.] Std = [1. 1.]
y_train          : ndim=1 shape=(800,)     Mean = 0.65375 Std = 0.4757740403805151
X_test           : ndim=2 shape=(200, 2)   Mean = [0.028 0.077] Std = [0.9318 0.99 ]
y_test           : ndim=1 shape=(200,)     Mean = 0.665 Std = 0.4719904660054057
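For reference, scikit-learn offers equivalent utilities; here is a sketch (not used in this notebook; the *2 variable names are ours), assuming the same x_data and y_data:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same 80/20 split, without shuffling, as above
x_train2, x_test2, y_train2, y_test2 = train_test_split(x_data, y_data, test_size=0.2, shuffle=False)

# Mean and std are learned on the training set only, then applied to both sets
scaler   = StandardScaler().fit(x_train2)
x_train2 = scaler.transform(x_train2)
x_test2  = scaler.transform(x_test2)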
2.4 - Have a look
fidle.utils.display_md('**This is what we know :**')
plot_data(x_train, y_train)
fidle.utils.display_md('**This is what we want to classify :**')
plot_data(x_test, y_test, colors=("gray","gray"), legend=False)
This is what we know :
This is what we want to classify :
Step 3 - Logistic regression
# ---- Create an instance
# Use SAGA solver (Stochastic Average Gradient descent solver)
#
logreg = LogisticRegression(C=1e5, verbose=0, solver='saga')
# ---- Fit the data.
#
logreg.fit(x_train, y_train)
# ---- Do a prediction
#
y_pred = logreg.predict(x_test)
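These predicted classes are simply the probabilities $\hat{p}$ thresholded at 0.5, as described above; a quick sanity check (our own snippet, not part of the original notebook):
p_hat = logreg.predict_proba(x_test)[:, 1]        # column 1 = estimated P(y=1)
print("Agreement with the 0.5 threshold:", np.mean((p_hat >= 0.5).astype(int) == y_pred))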
3.3 - Evaluation
Precision = ability to avoid false positives = $\dfrac{T_p}{T_p+F_p}$
Recall = ability to find all the actual positives = $\dfrac{T_p}{T_p+F_n}$
With:
- $T_p$ (true positive) : correct positive answer
- $F_p$ (false positive) : incorrect positive answer
- $T_n$ (true negative) : correct negative answer
- $F_n$ (false negative) : incorrect negative answer
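To connect these definitions with the scores printed below, the counts can also be computed by hand (a sketch; the variable names are ours):
tp = np.sum((y_pred == 1) & (y_test == 1))        # true positives
fp = np.sum((y_pred == 1) & (y_test == 0))        # false positives
fn = np.sum((y_pred == 0) & (y_test == 1))        # false negatives
print("Precision =", tp / (tp + fp), "  Recall =", tp / (tp + fn))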
plot_results(x_test,y_test, y_pred)
Precision = 0.881   Recall = 0.887
Step 4 - Bending the space to a model #2 ;-)
We're going to augment our observations with new features: ${x_1}^2$, ${x_2}^2$, ${x_1}^3$ and ${x_2}^3$
$ X=\begin{bmatrix}1 & x_{11} & x_{12}\\ \vdots & & \vdots\\ 1 & x_{m1} & x_{m2}\end{bmatrix} \text{ and } X_{ng}=\begin{bmatrix}1 & x_{11} & x_{12} & x_{11}^2 & x_{12}^2 & x_{11}^3 & x_{12}^3\\ \vdots & & & & & & \vdots\\ 1 & x_{m1} & x_{m2} & x_{m1}^2 & x_{m2}^2 & x_{m1}^3 & x_{m2}^3\end{bmatrix} $
Note: sklearn.preprocessing.PolynomialFeatures can do that for us, but here we'll do it ourselves:
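For the record, a sketch of the PolynomialFeatures route (not used below; note that it also generates the cross terms $x_1 x_2$, $x_1^2 x_2$, ..., which our manual version skips):
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=3, include_bias=False)
x_train_poly = poly.fit_transform(x_train)        # fit on the training set...
x_test_poly  = poly.transform(x_test)             # ...then transform the test set the same way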
4.1 - Extend data
x_train_enhanced = np.c_[x_train,
                         x_train[:, 0] ** 2,
                         x_train[:, 1] ** 2,
                         x_train[:, 0] ** 3,
                         x_train[:, 1] ** 3]
x_test_enhanced  = np.c_[x_test,
                         x_test[:, 0] ** 2,
                         x_test[:, 1] ** 2,
                         x_test[:, 0] ** 3,
                         x_test[:, 1] ** 3]
4.2 - Run the classifier
# ---- Create an instance
# Use SAGA solver (Stochastic Average Gradient descent solver)
#
logreg = LogisticRegression(C=1e5, verbose=0, solver='saga', max_iter=5000, n_jobs=-1)
# ---- Fit the data.
#
logreg.fit(x_train_enhanced, y_train)
# ---- Do a prediction
#
y_pred = logreg.predict(x_test_enhanced)
4.3 - Evaluation
plot_results(x_test_enhanced, y_test, y_pred)
Precision = 0.926   Recall = 0.940
fidle.end()
End time : 03/03/24 21:03:16
Duration : 00:00:02 970ms
This notebook ends here :-)
https://fidle.cnrs.fr