[LINR1] - Linear regression with direct resolution¶
Low-level implementation, using numpy, of a direct resolution for a linear regression
Objectives :¶
- Just one, the illustration of a direct resolution :-)
What we're going to do :¶
Equation : $$Y = X.\theta + N$$
where $N$ is a noise vector
and $\theta = (a,b)$ is the parameter vector, such that $y = a + b.x$
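Written out row by row (with the first column of $X$ set to 1, exactly as $X$ is built in step 2 below), this is simply :
$$\begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{pmatrix} = \begin{pmatrix} 1 & x^{(1)} \\ \vdots & \vdots \\ 1 & x^{(n)} \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} + N$$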
Step 1 - Import and init¶
import numpy as np
import math
import matplotlib
import matplotlib.pyplot as plt
import sys
import fidle
# Init Fidle environment
run_id, run_dir, datasets_dir = fidle.init('LINR1')
FIDLE - Environment initialization
Version              : 2.3.0
Run id               : LINR1
Run dir              : ./run/LINR1
Datasets dir         : /gpfswork/rech/mlh/uja62cb/fidle-project/datasets-fidle
Start time           : 03/03/24 21:02:53
Hostname             : r3i6n3 (Linux)
Tensorflow log level : Info + Warning + Error (=0)
Update keras cache   : False
Update torch cache   : False
Save figs            : ./run/LINR1/figs (True)
numpy                : 1.24.4
sklearn              : 1.3.2
yaml                 : 6.0.1
matplotlib           : 3.8.2
pandas               : 2.1.3
Step 2 - Retrieve a set of points¶
# ---- Parameters
nb    = 100      # Number of points
xmin  = 0        # Distribution / x
xmax  = 10
a     = 4        # Intercept of the underlying line
b     = 2        # Slope : y = a + b.x  (+ noise)
noise = 7        # Noise level (std of the Gaussian noise)
theta = np.array([[a],[b]])
# ---- X matrix : nb rows of (1, x)
# the first column is set to 1 so that X.theta <=> 1.a + x.b
Xc1 = np.ones((nb,1))
Xc2 = np.random.uniform(xmin,xmax,(nb,1))
X = np.c_[ Xc1, Xc2 ]
# ---- Noise
# N = np.random.uniform(-noise,noise,(nb,1))
N = noise * np.random.normal(0,1,(nb,1))
# ---- Y vector
Y = (X @ theta) + N
# print("X:\n",X,"\nY:\n ",Y)
Show it¶
width = 12
height = 6
fig, ax = plt.subplots()
fig.set_size_inches(width,height)
ax.plot(X[:,1], Y, ".")
ax.tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)
ax.set_xlabel('x axis')
ax.set_ylabel('y axis')
fidle.scrawler.save_fig('01-set_of_points')
plt.show()
Step 3 - Direct calculation of the normal equation¶
We'll try to find an optimal value of $\theta$, minimizing a cost function.
The cost function classically used for linear regression is the root mean square error (RMSE):
$$RMSE(X,h_\theta)=\sqrt{\frac1n\sum_{i=1}^n\left[h_\theta(X^{(i)})-Y^{(i)}\right]^2}$$
With the simplified variant : $$MSE(X,h_\theta)=\frac1n\sum_{i=1}^n\left[h_\theta(X^{(i)})-Y^{(i)}\right]^2$$
The optimal value of $\theta$ is given by the normal equation : $$\hat{\theta} = (X^T.X)^{-1}.X^T.Y$$
Derivation : https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression
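As an aside (this small helper is an addition, not part of the original notebook), the MSE above can be evaluated directly with numpy for any candidate $\theta$ :
def mse(X, Y, theta):
    # Mean squared error of the predictions X @ theta against the targets Y
    err = X @ theta - Y
    return float(np.mean(err**2))
print("MSE of the true theta :", mse(X, Y, theta))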
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
print("Theta :\n",theta,"\n\ntheta hat :\n",theta_hat)
Theta :
 [[4]
 [2]]

theta hat :
 [[4.56940717]
 [1.94517426]]
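As a sanity check (also an addition to the original notebook), the same estimate can be obtained with numpy's least-squares solver, which solves the same problem without explicitly inverting $X^T.X$ :
# Solve min ||X.theta - Y||^2 directly with the built-in least-squares routine
theta_lstsq, residuals, rank, sv = np.linalg.lstsq(X, Y, rcond=None)
print("theta lstsq :\n", theta_lstsq)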
Show it¶
Xd = np.array([[1,xmin], [1,xmax]])
Yd = Xd @ theta_hat
fig, ax = plt.subplots()
fig.set_size_inches(width,height)
ax.plot(X[:,1], Y, ".")
ax.plot(Xd[:,1], Yd, "-")
ax.tick_params(axis='both', which='both', bottom=False, left=False, labelbottom=False, labelleft=False)
ax.set_xlabel('x axis')
ax.set_ylabel('y axis')
fidle.scrawler.save_fig('02-regression-line')
plt.show()
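To close the loop with the RMSE defined above, the residual error of the fitted line can be computed as a quick check (this cell is an addition, not part of the original notebook); it should be close to the noise level used for the generation :
Y_hat = X @ theta_hat                      # predictions of the fitted model
rmse  = np.sqrt(np.mean((Y - Y_hat)**2))   # root mean square error of the fit
print(f"RMSE of the fit : {rmse:.2f}  (noise std used for generation : {noise})")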
fidle.end()
End time : 03/03/24 21:02:55
Duration : 00:00:01 403ms
This notebook ends here :-)
https://fidle.cnrs.fr