[LWINE1] - Wine quality prediction with a Dense Network (DNN)¶
Another example of regression, with a wine quality prediction, using PyTorch Lightning.
Objectives :¶
- Predict the quality of wines, based on their analysis
- Understand the principles and architecture of regression with a dense neural network, including saving and restoring the trained model.
The Wine Quality datasets are made up of analyses of a large number of wines, with an associated quality score (between 0 and 10).
This dataset is provided by :
Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez
A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region(CVRVV), Porto, Portugal, @2009
This dataset can be retrieved from the University of California Irvine (UCI) Machine Learning Repository.
Due to privacy and logistic issues, only physicochemical and sensory variables are available.
There is no data about grape types, wine brand, wine selling price, etc. The available attributes are :
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
- quality (score between 0 and 10)
What we're going to do :¶
- (Retrieve data)
- (Preparing the data)
- (Build a model)
- Train and save the model
- Restore saved model
- Evaluate the model
- Make some predictions
Step 1 - Import and init¶
# Import some packages
import os
import sys
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import lightning.pytorch as pl
import torch.nn.functional as F
import torchvision.transforms as T
from importlib import reload
from IPython.display import Markdown
from torch.utils.data import Dataset, DataLoader, random_split
from modules.progressbar import CustomTrainProgressBar
from modules.data_load import WineQualityDataset, Normalize, ToTensor
from lightning.pytorch.loggers.tensorboard import TensorBoardLogger
from torchmetrics.functional.regression import mean_absolute_error, mean_squared_error
import fidle
# Init Fidle environment
run_id, run_dir, datasets_dir = fidle.init('LWINE1')
FIDLE - Environment initialization
Version : 2.3.0
Run id : LWINE1
Run dir : ./run/LWINE1
Datasets dir : /gpfswork/rech/mlh/uja62cb/fidle-project/datasets-fidle
Start time : 03/03/24 21:04:48
Hostname : r6i0n6 (Linux)
Tensorflow log level : Warning + Error (=1)
Update keras cache : False
Update torch cache : False
Save figs : ./run/LWINE1/figs (True)
numpy : 1.24.4
sklearn : 1.3.2
yaml : 6.0.1
matplotlib : 3.8.2
pandas : 2.1.3
torch : 2.1.1
torchvision : 0.16.1+fdea156
lightning : 2.1.2
Verbosity during training :
- 0 = silent
- 1 = progress bar
- 2 = one line per epoch
fit_verbosity = 1
dataset_name = 'winequality-red.csv'
Override parameters (batch mode) - Just forget this cell
fidle.override('fit_verbosity', 'dataset_name')
** Overrided parameters : ** fit_verbosity : 2
Step 2 - Retrieve data¶
csv_file_path=f'{datasets_dir}/WineQuality/origine/{dataset_name}'
datasets=WineQualityDataset(csv_file_path)
display(datasets.data.head(5).style.format("{0:.2f}"))
print('Missing Data : ',datasets.data.isna().sum().sum(), ' Shape is : ', datasets.data.shape)
|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.40 | 0.70 | 0.00 | 1.90 | 0.08 | 11.00 | 34.00 | 1.00 | 3.51 | 0.56 | 9.40 | 5.00 |
| 1 | 7.80 | 0.88 | 0.00 | 2.60 | 0.10 | 25.00 | 67.00 | 1.00 | 3.20 | 0.68 | 9.80 | 5.00 |
| 2 | 7.80 | 0.76 | 0.04 | 2.30 | 0.09 | 15.00 | 54.00 | 1.00 | 3.26 | 0.65 | 9.80 | 5.00 |
| 3 | 11.20 | 0.28 | 0.56 | 1.90 | 0.07 | 17.00 | 60.00 | 1.00 | 3.16 | 0.58 | 9.80 | 6.00 |
| 4 | 7.40 | 0.70 | 0.00 | 1.90 | 0.08 | 11.00 | 34.00 | 1.00 | 3.51 | 0.56 | 9.40 | 5.00 |
Missing Data : 0 Shape is : (1599, 12)
Step 3 - Preparing the data¶
3.1 - Data normalization¶
Note :
- All input features must be normalized.
- To do this we subtract the mean and divide by the standard deviation of each input feature.
- Then we convert the NumPy feature and target (quality) arrays to torch tensors, as sketched below.
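The Normalize and ToTensor transforms are imported from modules.data_load, whose source is not shown in this notebook. As a minimal sketch, assuming each sample is a dict with 'features' and 'quality' keys and that the CSV uses ';' as separator (as the UCI files do), they could look like this:
import numpy as np
import pandas as pd
import torch

class Normalize:
    # Standardize features: subtract the column means, divide by the column stds
    def __init__(self, csv_file_path):
        data = pd.read_csv(csv_file_path, header=0, sep=';')
        features = data.drop(columns=['quality']).values.astype(np.float32)
        self.mean = features.mean(axis=0)
        self.std = features.std(axis=0)
    def __call__(self, sample):
        features = (sample['features'] - self.mean) / self.std
        return {'features': features, 'quality': sample['quality']}

class ToTensor:
    # Convert the numpy feature and target arrays to torch tensors
    def __call__(self, sample):
        return {'features': torch.from_numpy(np.asarray(sample['features'], dtype=np.float32)),
                'quality': torch.from_numpy(np.asarray(sample['quality'], dtype=np.float32))}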
transforms=T.Compose([Normalize(csv_file_path), ToTensor()])
dataset=WineQualityDataset(csv_file_path,transform=transforms)
display(Markdown("before normalization :"))
display(datasets[:]["features"])
print()
display(Markdown("After normalization :"))
display(dataset[:]["features"])
before normalization :
array([[ 7.4  ,  0.7  ,  0.   , ...,  3.51 ,  0.56 ,  9.4  ],
       [ 7.8  ,  0.88 ,  0.   , ...,  3.2  ,  0.68 ,  9.8  ],
       [ 7.8  ,  0.76 ,  0.04 , ...,  3.26 ,  0.65 ,  9.8  ],
       ...,
       [ 6.3  ,  0.51 ,  0.13 , ...,  3.42 ,  0.75 , 11.   ],
       [ 5.9  ,  0.645,  0.12 , ...,  3.57 ,  0.71 , 10.2  ],
       [ 6.   ,  0.31 ,  0.47 , ...,  3.39 ,  0.66 , 11.   ]], dtype=float32)
After normalization :
tensor([[-0.5282,  0.9616, -1.3910,  ...,  1.2882, -0.5790, -0.9599],
        [-0.2985,  1.9668, -1.3910,  ..., -0.7197,  0.1289, -0.5846],
        [-0.2985,  1.2967, -1.1857,  ..., -0.3311, -0.0481, -0.5846],
        ...,
        [-1.1600, -0.0995, -0.7237,  ...,  0.7053,  0.5419,  0.5415],
        [-1.3897,  0.6544, -0.7750,  ...,  1.6769,  0.3059, -0.2092],
        [-1.3323, -1.2165,  1.0217,  ...,  0.5110,  0.0109,  0.5415]])
3.2 - Split data¶
We will use 80% of the data for training and 20% for validation.
x will be the feature data from the analyses and y the target (quality).
# ---- Split => train, test
#
data_train_len = int(len(dataset)*0.8) # get 80 %
data_test_len = len(dataset) -data_train_len # test = all - train
# ---- Split => x,y with random_split
#
data_train_subset, data_test_subset=random_split(dataset, [data_train_len, data_test_len])
x_train = data_train_subset[:]["features"]
y_train = data_train_subset[:]["quality" ]
x_test = data_test_subset [:]["features"]
y_test = data_test_subset [:]["quality" ]
print('Original data shape was : ',dataset.data.shape)
print('x_train : ',x_train.shape, 'y_train : ',y_train.shape)
print('x_test : ',x_test.shape, 'y_test : ',y_test.shape)
Original data shape was : (1599, 12)
x_train : torch.Size([1279, 11]) y_train : torch.Size([1279, 1])
x_test : torch.Size([320, 11]) y_test : torch.Size([320, 1])
3.3 - Use a DataLoader for training¶
The Dataset retrieves our dataset's features and labels one sample at a time. While training a model, we typically want to pass samples in minibatches and reshuffle the data at every epoch to reduce overfitting. DataLoader is an iterable that abstracts this complexity for us behind a simple API.
# train batch data
train_loader= DataLoader(
dataset=data_train_subset,
shuffle=True,
batch_size=20,
num_workers=2
)
# test batch data
test_loader= DataLoader(
dataset=data_test_subset,
shuffle=False,
batch_size=20,
num_workers=2
)
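As a quick sanity check (not part of the original notebook), we can pull one batch from the train loader and verify its shapes; with batch_size=20 and 11 input features we expect:
# Fetch a single minibatch and check the tensor shapes
batch = next(iter(train_loader))
print(batch["features"].shape)   # torch.Size([20, 11])
print(batch["quality"].shape)    # torch.Size([20, 1])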
Step 4 - Build a model¶
class LitRegression(pl.LightningModule):
def __init__(self,in_features=11):
super().__init__()
self.model = nn.Sequential(
nn.Linear(in_features, 128), # hidden layer 1
nn.ReLU(), # activation function
nn.Linear(128, 128), # hidden layer 2
nn.ReLU(), # activation function
nn.Linear(128, 1)) # output layer
def forward(self, x): # forward pass
x = self.model(x)
return x
# optimizer
def configure_optimizers(self):
optimizer = torch.optim.RMSprop(self.parameters(),lr=1e-4)
return optimizer
def training_step(self, batch, batch_idx):
# defines the train loop.
x_features, y_target = batch["features"],batch["quality"]
# forward pass
y_pred = self.model(x_features)
# loss function MSE
loss = F.mse_loss(y_pred, y_target)
# metrics mae
mae = mean_absolute_error(y_pred,y_target)
# metrics mse
mse = mean_squared_error(y_pred,y_target)
metrics= {"train_loss": loss,
"train_mae" : mae,
"train_mse" : mse
}
# logs metrics for each training_step
self.log_dict(metrics,
on_step = False,
on_epoch = True,
logger = True,
prog_bar = True,
)
return loss
def validation_step(self, batch, batch_idx):
# defines the val loop.
x_features, y_target = batch["features"],batch["quality"]
# forward pass
y_pred = self.model(x_features)
# loss function MSE
loss = F.mse_loss(y_pred, y_target)
# metrics
mae = mean_absolute_error(y_pred,y_target)
# metrics
mse = mean_squared_error(y_pred,y_target)
metrics= {"val_loss": loss,
"val_mae" : mae,
"val_mse" : mse
}
# logs metrics for each validation_step
self.log_dict(metrics,
on_step = False,
on_epoch = True,
logger = True,
prog_bar = True,
)
return metrics
Step 5 - Train the model¶
5.1 - Get it¶
reg=LitRegression(in_features=11)
print(reg)
LitRegression(
  (model): Sequential(
    (0): Linear(in_features=11, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=128, bias=True)
    (3): ReLU()
    (4): Linear(in_features=128, out_features=1, bias=True)
  )
)
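The 18.2 K parameter count can be verified by hand: (11×128 + 128) + (128×128 + 128) + (128×1 + 1) = 1536 + 16512 + 129 = 18177 weights and biases, which Lightning rounds to 18.2 K.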
5.2 - Add callback¶
os.makedirs('./run/models', exist_ok=True)
save_dir = "./run/models/"
filename ='best-model-{epoch}-{val_loss:.2f}'
savemodel_callback = pl.callbacks.ModelCheckpoint(dirpath=save_dir,
filename=filename,
save_top_k=1,
verbose=False,
monitor="val_loss"
)
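At save time, Lightning expands the placeholders in the filename pattern, giving names such as best-model-epoch=12-val_loss=0.42.ckpt (the epoch and loss values here are purely illustrative). With save_top_k=1 and monitor="val_loss", only the single checkpoint with the lowest validation loss is kept.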
5.3 - Train it¶
# loggers data
os.makedirs(f'{run_dir}/logs', mode=0o750, exist_ok=True)
logger= TensorBoardLogger(save_dir=f'{run_dir}/logs',name="reg_logs")
# train model
trainer = pl.Trainer(accelerator='auto',
max_epochs=100,
logger=logger,
num_sanity_val_steps=0,
callbacks=[savemodel_callback,CustomTrainProgressBar()])
trainer.fit(model=reg, train_dataloaders=train_loader, val_dataloaders=test_loader)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: ./run/LWINE1/logs/reg_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name  | Type       | Params
-------------------------------------
0 | model | Sequential | 18.2 K
-------------------------------------
18.2 K    Trainable params
0         Non-trainable params
18.2 K    Total params
0.073     Total estimated model params size (MB)
SLURM auto-requeueing enabled. Setting signal handlers.
`Trainer.fit` stopped: `max_epochs=100` reached.
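The path of the best checkpoint (lowest val_loss) is now available as savemodel_callback.best_model_path; this is what we will reload in Step 7.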
Step 6 - Evaluate¶
6.1 - Final loss and metrics¶
score=trainer.validate(model=reg, dataloaders=test_loader, verbose=False)
print('x_test / loss : {:5.4f}'.format(score[0]['val_loss']))
print('x_test / mae : {:5.4f}'.format(score[0]['val_mae']))
print('x_test / mse : {:5.4f}'.format(score[0]['val_mse']))
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
SLURM auto-requeueing enabled. Setting signal handlers.
x_test / loss : 0.4169
x_test / mae : 0.4864
x_test / mse : 0.4169
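Note that val_loss and val_mse are identical: the loss being minimized is the MSE itself (F.mse_loss), so the same quantity is logged twice.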
6.2 - Training history¶
To access the logs with TensorBoard :
- Under Docker, from a terminal launched via the jupyterlab launcher, use the following command:
tensorboard --logdir <path-to-logs> --host 0.0.0.0
- If you're not using Docker, from a terminal :
tensorboard --logdir <path-to-logs>
Note: only one TensorBoard instance can run at a time.
Step 7 - Restore a model :¶
7.1 - Reload model¶
# Load the model from a checkpoint
loaded_model = LitRegression.load_from_checkpoint(savemodel_callback.best_model_path)
print("Loaded:")
print(loaded_model)
Loaded:
LitRegression(
  (model): Sequential(
    (0): Linear(in_features=11, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=128, bias=True)
    (3): ReLU()
    (4): Linear(in_features=128, out_features=1, bias=True)
  )
)
7.2 - Evaluate it :¶
score=trainer.validate(model=loaded_model, dataloaders=test_loader, verbose=False)
print('x_test / loss : {:5.4f}'.format(score[0]['val_loss']))
print('x_test / mae : {:5.4f}'.format(score[0]['val_mae']))
print('x_test / mse : {:5.4f}'.format(score[0]['val_mse']))
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
SLURM auto-requeueing enabled. Setting signal handlers.
x_test / loss : 0.4014
x_test / mae : 0.4743
x_test / mse : 0.4014
7.3 - Make a prediction¶
# ---- Pick n random entries from our test set
n = 200
ii = np.random.randint(0, len(x_test), n)   # random indices (0 included; duplicates possible)
x_sample = x_test[ii]
y_sample = y_test[ii]
# ---- Make some predictions
# Set the model in evaluation mode and disable gradient tracking for inference
loaded_model.eval()
with torch.no_grad():
    y_pred = loaded_model( x_sample )
# ---- Show it
print('Wine Prediction Real Delta')
for i in range(n):
pred = y_pred[i][0].item()
real = y_sample[i][0].item()
delta = real-pred
print(f'{i:03d} {pred:.2f} {real} {delta:+.2f} ')
Wine Prediction Real Delta
000   5.12     7.0  +1.88
001   6.26     7.0  +0.74
002   5.06     5.0  -0.06
003   6.56     6.0  -0.56
004   6.28     6.0  -0.28
005   6.55     7.0  +0.45
006   5.45     5.0  -0.45
007   6.16     7.0  +0.84
008   6.59     7.0  +0.41
009   6.04     6.0  -0.04
...
199   6.56     6.0  -0.56
(the remaining rows are omitted here for readability)
fidle.end()
End time : 03/03/24 21:06:09
Duration : 00:01:21 246ms
This notebook ends here :-)
https://fidle.cnrs.fr