
[K3GTSRB3] - Training monitoring¶

Episode 3 : Monitoring, analysis and checkpoints during a training session, using Keras 3

Objectives :¶

  • Understand what happens during the training process
  • Implement monitoring, backup and recovery solutions

The German Traffic Sign Recognition Benchmark (GTSRB) is a dataset of more than 50,000 photos of road signs, divided into 43 classes.
The final aim is to recognise them !
The description is available here : http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset

What we're going to do :¶

  • Monitoring and understanding our model training
  • Add recovery points
  • Analyze the results
  • Restore and run recovery points

Step 1 - Import and init¶

1.1 - Python stuff¶

In [1]:
import os
os.environ['KERAS_BACKEND'] = 'torch'

import keras

import numpy as np
import random

import fidle

import modules.my_loader as my_loader
import modules.my_models as my_models
import modules.my_tools  as my_tools
from modules.my_TensorboardCallback import TensorboardCallback


# Init Fidle environment
run_id, run_dir, datasets_dir = fidle.init('K3GTSRB3')
Module my_loader loaded.
Module my_models loaded.


FIDLE - Environment initialization

Version              : 2.3.2
Run id               : K3GTSRB3_1
Run dir              : ./run/K3GTSRB3_1
Datasets dir         : /lustre/fswork/projects/rech/mlh/uja62cb/fidle-project/datasets-fidle
Start time           : 22/12/24 21:33:45
Hostname             : r3i5n3 (Linux)
Tensorflow log level : Info + Warning + Error  (=0)
Update keras cache   : False
Update torch cache   : False
Save figs            : ./run/K3GTSRB3_1/figs (True)
keras                : 3.7.0
numpy                : 2.1.2
sklearn              : 1.5.2
yaml                 : 6.0.2
matplotlib           : 3.9.2
pandas               : 2.2.3
torch                : 2.5.0

** run_id has been overrided from K3GTSRB3 to K3GTSRB3_1

1.2 - Parameters¶

scale is the proportion of the dataset that will be used during the training (1 means 100%).

  • A 24x24 L dataset at 20% scale, 10 epochs, needs about 1'30 on a CPU laptop. (Accuracy=91.4)
  • A 48x48 RGB dataset at 20% scale, 10 epochs, needs about 6'30 on a CPU laptop. (Accuracy=91.5)

model_name is the model name from modules.my_models :

  • model_01 for 24x24 or 48x48 images
  • model_02 for 48x48 images

fit_verbosity is the verbosity during training :

  • 0 = silent, 1 = progress bar, 2 = one line per epoch
In [2]:
enhanced_dir = './data'
# enhanced_dir = f'{datasets_dir}/GTSRB/enhanced'

model_name   = 'model_01'
dataset_name = 'set-24x24-L'
batch_size   = 64
epochs       = 10
scale        = 1
fit_verbosity = 1

Override parameters (batch mode) - Just forget this cell

In [3]:
fidle.override('enhanced_dir', 'model_name', 'dataset_name', 'batch_size', 'epochs', 'scale', 'fit_verbosity')
** Overrided parameters : **
enhanced_dir         : /lustre/fswork/projects/rech/mlh/uja62cb/fidle-project/datasets-fidle/GTSRB/enhanced
model_name           : model_01
dataset_name         : set-48x48-RGB
batch_size           : 64
epochs               : 5
scale                : 1
fit_verbosity        : 2

Step 2 - Load dataset¶

The dataset is one of the enhanced datasets saved previously...

In [4]:
x_train,y_train,x_test,y_test, x_meta,y_meta = my_loader.read_dataset(enhanced_dir, dataset_name, scale)
Original shape  : (39209, 48, 48, 3) (39209,)
Datasets have been resized with a factor  1
Rescaled shape  : (39209, 48, 48, 3) (39209,)
Datasets have been shuffled.
Dataset "set-48x48-RGB" is loaded and shuffled. (1.3 Go in 0:00:00)

Step 3 - Have a look at the dataset¶

In [5]:
print("x_train : ", x_train.shape)
print("y_train : ", y_train.shape)
print("x_test  : ", x_test.shape)
print("y_test  : ", y_test.shape)

fidle.scrawler.images(x_train, y_train, range(24), columns=8, x_size=1, y_size=1, save_as='02-dataset-small')
x_train :  (39209, 48, 48, 3)
y_train :  (39209,)
x_test  :  (12630, 48, 48, 3)
y_test  :  (12630,)
Saved: ./run/K3GTSRB3_1/figs/02-dataset-small

Step 4 - Get a model¶

In [6]:
(n,lx,ly,lz) = x_train.shape

model = my_models.get_model( model_name, lx,ly,lz )
model.summary()

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ conv2d (Conv2D)                      │ (None, 46, 46, 96)          │           2,688 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ max_pooling2d (MaxPooling2D)         │ (None, 23, 23, 96)          │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout (Dropout)                    │ (None, 23, 23, 96)          │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ conv2d_1 (Conv2D)                    │ (None, 21, 21, 192)         │         166,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ max_pooling2d_1 (MaxPooling2D)       │ (None, 10, 10, 192)         │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_1 (Dropout)                  │ (None, 10, 10, 192)         │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ flatten (Flatten)                    │ (None, 19200)               │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense (Dense)                        │ (None, 1500)                │      28,801,500 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_2 (Dropout)                  │ (None, 1500)                │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_1 (Dense)                      │ (None, 43)                  │          64,543 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 29,034,811 (110.76 MB)
 Trainable params: 29,034,811 (110.76 MB)
 Non-trainable params: 0 (0.00 B)
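For reference, the summary above is consistent with a definition along these lines. This is only a sketch: the real model_01 lives in modules/my_models.py, and details such as dropout rates and activations are assumptions here.

# ---- Sketch of a model matching the summary above (assumed 3x3 kernels,
#      2x2 pooling, relu activations and illustrative dropout rates)
def get_model_sketch(lx, ly, lz, n_classes=43):
    model = keras.models.Sequential([
        keras.layers.Input(shape=(lx, ly, lz)),
        keras.layers.Conv2D(96, (3, 3), activation='relu'),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Dropout(0.2),
        keras.layers.Conv2D(192, (3, 3), activation='relu'),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Dropout(0.2),
        keras.layers.Flatten(),
        keras.layers.Dense(1500, activation='relu'),
        keras.layers.Dropout(0.5),
        keras.layers.Dense(n_classes, activation='softmax')
    ])
    return model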

Step 5 - Prepare callbacks¶

We will add 2 callbacks :

TensorBoard
Training logs, which can be visualised with the TensorBoard tool.

Model backup
It is possible to save the model every n epochs or at each improvement.
The model can be saved completely or partially (weights only).
See the Keras documentation.
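For instance, a weights-only variant of the checkpoint callback could look like this (a sketch; note that in Keras 3 the file name must end in .weights.h5 when save_weights_only=True):

# ---- Sketch: save only the weights (not the full model) at each epoch
weightsonly_callback = keras.callbacks.ModelCheckpoint(
                                filepath = run_dir + "/models/{epoch:02d}.weights.h5",
                                save_weights_only = True,
                                save_freq = "epoch")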

In [7]:
fidle.utils.mkdir(run_dir + '/models')
fidle.utils.mkdir(run_dir + '/logs')

# ---- Callback for tensorboard (This one is homemade !)
#
tensorboard_callback = TensorboardCallback(
                                log_dir=run_dir + "/logs/tb_" + fidle.Chrono.tag_now())

# ---- Callback to save best model
#
bestmodel_callback = keras.callbacks.ModelCheckpoint( 
                                filepath= run_dir + "/models/best-model.keras",
                                monitor='val_accuracy', 
                                mode='max', 
                                save_best_only=True)

# ---- Callback to save the model at each epoch
#
savemodel_callback = keras.callbacks.ModelCheckpoint(
                                filepath= run_dir + "/models/{epoch:02d}.keras",
                                save_freq="epoch")

Step 6 - Train the model¶

To access logs with tensorboard :

  • Under Docker, from a terminal launched via the jupyterlab launcher, use the following command:
    tensorboard --logdir <path-to-logs> --host 0.0.0.0
  • If you're not using Docker, from a terminal :
    tensorboard --logdir <path-to-logs>

Note: Only one TensorBoard instance can be used at a time.
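If you are working directly in JupyterLab, the TensorBoard notebook extension can also be used (assuming the tensorboard package is installed):

%load_ext tensorboard
%tensorboard --logdir <path-to-logs>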

Train it :
Note: The training curve is visible in real time with Tensorboard (see step 5)

In [8]:
chrono=fidle.Chrono()
chrono.start()

# ---- Shuffle train data
x_train,y_train=fidle.utils.shuffle_np_dataset(x_train,y_train)

# ---- Train
# Note: To be faster in our example, we can take only 2000 values
#
history = model.fit(  x_train, y_train,
                      batch_size=batch_size,
                      epochs=epochs,
                      verbose=fit_verbosity,
                      validation_data=(x_test, y_test),
                      callbacks=[tensorboard_callback, bestmodel_callback, savemodel_callback] )

model.save(f'{run_dir}/models/last-model.keras')

chrono.show()
Datasets have been shuffled.
Epoch 1/5
613/613 - 13s - 21ms/step - accuracy: 0.7524 - loss: 0.8763 - val_accuracy: 0.9148 - val_loss: 0.3034
Epoch 2/5
613/613 - 13s - 21ms/step - accuracy: 0.9594 - loss: 0.1389 - val_accuracy: 0.9370 - val_loss: 0.2248
Epoch 3/5
613/613 - 13s - 21ms/step - accuracy: 0.9780 - loss: 0.0777 - val_accuracy: 0.9458 - val_loss: 0.2165
Epoch 4/5
613/613 - 11s - 19ms/step - accuracy: 0.9839 - loss: 0.0546 - val_accuracy: 0.9356 - val_loss: 0.2628
Epoch 5/5
613/613 - 13s - 21ms/step - accuracy: 0.9870 - loss: 0.0450 - val_accuracy: 0.9538 - val_loss: 0.1885
Duration :  63.61 seconds

Evaluate it :

In [9]:
max_val_accuracy = max(history.history["val_accuracy"])
print("Max validation accuracy is : {:.4f}".format(max_val_accuracy))
Max validation accuracy is : 0.9538
In [10]:
score = model.evaluate(x_test, y_test, verbose=0)

print('Test loss      : {:5.4f}'.format(score[0]))
print('Test accuracy  : {:5.4f}'.format(score[1]))
Test loss      : 0.1885
Test accuracy  : 0.9538

Step 7 - History¶

model.fit() returns the learning history
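fidle.scrawler.history() plots it for us, but the same curves can be drawn by hand: history.history is a plain dict of per-epoch lists, for example:

# ---- Plot the accuracy curves by hand from history.history
import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'],     label='Train accuracy')
plt.plot(history.history['val_accuracy'], label='Validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()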

In [11]:
fidle.scrawler.history(history, save_as='03-history')
Saved: ./run/K3GTSRB3_1/figs/03-history_0
Saved: ./run/K3GTSRB3_1/figs/03-history_1

Step 8 - Evaluation and confusion¶

In [12]:
y_sigmoid = model.predict(x_test, verbose=fit_verbosity)
y_pred    = np.argmax(y_sigmoid, axis=-1)

fidle.scrawler.confusion_matrix(y_test,y_pred,range(43), figsize=(12, 12),normalize=False, save_as='04-confusion-matrix')
395/395 - 1s - 4ms/step
Saved: ./run/K3GTSRB3_1/figs/04-confusion-matrix
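As a complement to the confusion matrix, scikit-learn (already available in this environment) can report per-class precision and recall from the same predictions, for example:

# ---- Per-class precision / recall / F1 (complementary view)
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, zero_division=0))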

Step 9 - Restore and evaluate¶

List saved models :¶

In [13]:
# !ls -1rt "$run_dir"/models/

Restore a model :¶

In [14]:
loaded_model = keras.models.load_model(f'{run_dir}/models/best-model.keras')
# loaded_model.summary()
print("Loaded.")
Loaded.

Evaluate it :¶

In [15]:
score = loaded_model.evaluate(x_test, y_test, verbose=0)

print('Test loss      : {:5.4f}'.format(score[0]))
print('Test accuracy  : {:5.4f}'.format(score[1]))
Test loss      : 0.1885
Test accuracy  : 0.9538
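The per-epoch checkpoints written by savemodel_callback can be restored the same way, for instance to compare an intermediate epoch with the best model (the available file names depend on how many epochs were actually run):

# ---- Sketch: restore the checkpoint saved at the end of epoch 3 and evaluate it
epoch3_model = keras.models.load_model(f'{run_dir}/models/03.keras')
score = epoch3_model.evaluate(x_test, y_test, verbose=0)
print('Epoch 3 test accuracy : {:5.4f}'.format(score[1]))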

Make a prediction :¶

In [16]:
# ---- Pick a random image
#
i   = random.randint(0, len(x_test)-1)
x,y = x_test[i], y_test[i]

# ---- Do prediction
#
prediction = loaded_model.predict( np.array([x]), verbose=fit_verbosity )

# ---- Show result

my_tools.show_prediction( prediction, x, y, x_meta )
1/1 - 0s - 19ms/step


Output layer from model is (x100) :

[[ 0.    0.   99.97  0.    0.    0.03  0.    0.    0.    0.    0.    0.    0.    0.    0.
   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]]


Graphically :

Saved: ./run/K3GTSRB3_1/figs/05-prediction-proba


In pictures :

The image :               Prediction :            Real stuff:
Saved: ./run/K3GTSRB3_1/figs/06-prediction-images
YEEES ! that's right!
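The raw output above is the softmax vector over the 43 classes; the most probable classes can also be listed directly, for example:

# ---- Top-3 most probable classes for this prediction
probs = prediction[0]
top3  = np.argsort(probs)[::-1][:3]
for c in top3:
    print(f'Class {c:2d} : {probs[c]*100:5.2f} %')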
In [17]:
fidle.end()

End time : 22/12/24 21:35:09
Duration : 00:01:23 212ms
This notebook ends here :-)
https://fidle.cnrs.fr


Step 10 - To go further ;-)¶

What you can do:

  • Try different models
  • Use a subset of the dataset
  • Try different datasets
  • Try to recognize exotic signs !
  • Test different hyperparameters (epochs, batch size, optimizer, etc.); a small sweep is sketched below
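For example, a very small hyperparameter sweep can reuse the code above. A sketch (each run retrains the model from scratch, so it takes a while; the values are illustrative):

# ---- Sketch: compare a few batch sizes
for bs in [32, 64, 128]:
    m = my_models.get_model(model_name, lx, ly, lz)
    m.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
    h = m.fit(x_train, y_train,
              batch_size=bs, epochs=epochs, verbose=0,
              validation_data=(x_test, y_test))
    print(f'batch_size={bs:3d}   best val_accuracy={max(h.history["val_accuracy"]):.4f}')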
