[K3GTSRB3] - Training monitoring¶
Episode 3 : Monitoring, analysis and checkpoints during a training session, using Keras3
Objectives :¶
- Understand what happens during the training process
- Implement monitoring, backup and recovery solutions
The German Traffic Sign Recognition Benchmark (GTSRB) is a dataset with more than 50,000 photos of road signs from about 40 classes.
The final aim is to recognise them !
A description is available here : http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset
What we're going to do :¶
- Monitoring and understanding our model training
- Add recovery points
- Analyze the results
- Restore and run recovery points
Step 1 - Import and init¶
1.1 - Python stuff¶
import os
os.environ['KERAS_BACKEND'] = 'torch'
import keras
import numpy as np
import random
import fidle
import modules.my_loader as my_loader
import modules.my_models as my_models
import modules.my_tools as my_tools
from modules.my_TensorboardCallback import TensorboardCallback
# Init Fidle environment
run_id, run_dir, datasets_dir = fidle.init('K3GTSRB3')
Module my_loader loaded. Module my_models loaded.
FIDLE - Environment initialization
Version : 2.3.2
Run id : K3GTSRB3_2
Run dir : ./run/K3GTSRB3_2
Datasets dir : /lustre/fswork/projects/rech/mlh/uja62cb/fidle-project/datasets-fidle
Start time : 22/12/24 21:35:20
Hostname : r3i5n3 (Linux)
Tensorflow log level : Info + Warning + Error (=0)
Update keras cache : False
Update torch cache : False
Save figs : ./run/K3GTSRB3_2/figs (True)
keras : 3.7.0
numpy : 2.1.2
sklearn : 1.5.2
yaml : 6.0.2
matplotlib : 3.9.2
pandas : 2.2.3
torch : 2.5.0
** run_id has been overrided from K3GTSRB3 to K3GTSRB3_2
1.2 - Parameters¶
scale
is the proportion of the dataset that will be used during training (1 means 100%).
- A 24x24 L dataset at 20% scale, 10 epochs, needs 1'30 on a CPU laptop. (Accuracy=91.4)
- A 48x48 RGB dataset at 20% scale, 10 epochs, needs 6'30 on a CPU laptop. (Accuracy=91.5)
model_name
is the model name from modules.my_models :
- model_01 for 24x24 or 48x48 images
- model_02 for 48x48 images
fit_verbosity
is the verbosity during training :
- 0 = silent, 1 = progress bar, 2 = one line per epoch
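As a rough illustration of what `scale` means, the sketch below keeps only the first fraction of an already shuffled dataset. The helper name `rescale` is hypothetical; it is not the `my_loader` implementation, which is not shown in this notebook.

```python
def rescale(x, y, scale):
    """Keep the first `scale` fraction of the samples (hypothetical helper)."""
    if not 0 < scale <= 1:
        raise ValueError("scale must be in (0, 1]")
    n = int(len(x) * scale)
    return x[:n], y[:n]

# Toy data standing in for images and labels
x = list(range(10))
y = [v % 3 for v in x]

x_small, y_small = rescale(x, y, 0.2)
print(len(x_small), len(y_small))
```

With `scale = 1` the dataset is returned unchanged, which is the setting used in this run.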
enhanced_dir = './data'
# enhanced_dir = f'{datasets_dir}/GTSRB/enhanced'
model_name = 'model_01'
dataset_name = 'set-24x24-L'
batch_size = 64
epochs = 10
scale = 1
fit_verbosity = 1
Override parameters (batch mode) - Just forget this cell
fidle.override('enhanced_dir', 'model_name', 'dataset_name', 'batch_size', 'epochs', 'scale', 'fit_verbosity')
** Overrided parameters : **
enhanced_dir : /lustre/fswork/projects/rech/mlh/uja62cb/fidle-project/datasets-fidle/GTSRB/enhanced
model_name : model_02
dataset_name : set-48x48-RGB
batch_size : 64
epochs : 5
scale : 1
fit_verbosity : 2
Step 2 - Load dataset¶
The dataset is one of the saved datasets...
x_train,y_train,x_test,y_test, x_meta,y_meta = my_loader.read_dataset(enhanced_dir, dataset_name, scale)
Original shape : (39209, 48, 48, 3) (39209,)
Datasets have been resized with a factor 1
Rescaled shape : (39209, 48, 48, 3) (39209,)
Datasets have been shuffled.
Dataset "set-48x48-RGB" is loaded and shuffled. (1.3 Go in 0:00:00)
Step 3 - Have a look at the dataset¶
print("x_train : ", x_train.shape)
print("y_train : ", y_train.shape)
print("x_test : ", x_test.shape)
print("y_test : ", y_test.shape)
fidle.scrawler.images(x_train, y_train, range(24), columns=8, x_size=1, y_size=1, save_as='02-dataset-small')
x_train : (39209, 48, 48, 3)
y_train : (39209,)
x_test : (12630, 48, 48, 3)
y_test : (12630,)
Step 4 - Get a model¶
(n,lx,ly,lz) = x_train.shape
model = my_models.get_model( model_name, lx,ly,lz )
model.summary()
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ conv2d (Conv2D)                      │ (None, 46, 46, 96)          │           2,688 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ max_pooling2d (MaxPooling2D)         │ (None, 23, 23, 96)          │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout (Dropout)                    │ (None, 23, 23, 96)          │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ conv2d_1 (Conv2D)                    │ (None, 21, 21, 192)         │         166,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ max_pooling2d_1 (MaxPooling2D)       │ (None, 10, 10, 192)         │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_1 (Dropout)                  │ (None, 10, 10, 192)         │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ flatten (Flatten)                    │ (None, 19200)               │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense (Dense)                        │ (None, 1500)                │      28,801,500 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_2 (Dropout)                  │ (None, 1500)                │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_1 (Dense)                      │ (None, 43)                  │          64,543 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
Total params: 29,034,811 (110.76 MB)
Trainable params: 29,034,811 (110.76 MB)
Non-trainable params: 0 (0.00 B)
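The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer shapes; the 3x3 kernel size is an assumption inferred from the output shapes (48 to 46, then 23 to 21), since the code of `my_models` is not shown here.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # One weight per kernel cell and input channel, plus one bias, per filter
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(in_units, out_units):
    # One weight per input, plus one bias, per output unit
    return (in_units + 1) * out_units

counts = {
    "conv2d":   conv2d_params(3, 3, 3, 96),      # RGB input, 96 filters
    "conv2d_1": conv2d_params(3, 3, 96, 192),
    "dense":    dense_params(10 * 10 * 192, 1500),  # flatten gives 19200 values
    "dense_1":  dense_params(1500, 43),          # 43 classes
}
total = sum(counts.values())
size_mb = total * 4 / 1024**2   # float32: 4 bytes per parameter

for name, n in counts.items():
    print(f"{name:10s} {n:>12,}")
print(f"total      {total:>12,}  ({size_mb:.2f} MB)")
```

Almost all of the parameters sit in the first dense layer, which is why the flattened size (19200) matters so much for the model size.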
Step 5 - Prepare callbacks¶
We will add two kinds of callbacks :
TensorBoard
Training logs, which can be visualised with the TensorBoard tool.
Model backup
The model can be saved every N epochs or at each improvement.
It can be saved completely or partially (weights only).
See the Keras documentation.
fidle.utils.mkdir(run_dir + '/models')
fidle.utils.mkdir(run_dir + '/logs')
# ---- Callback for tensorboard (This one is homemade !)
#
tenseorboard_callback = TensorboardCallback(
log_dir=run_dir + "/logs/tb_" + fidle.Chrono.tag_now())
# ---- Callback to save best model
#
bestmodel_callback = keras.callbacks.ModelCheckpoint(
filepath= run_dir + "/models/best-model.keras",
monitor='val_accuracy',
mode='max',
save_best_only=True)
# ---- Callback to save the model at each epoch
#
savemodel_callback = keras.callbacks.ModelCheckpoint(
filepath= run_dir + "/models/{epoch:02d}.keras",
save_freq="epoch")
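To picture what the two checkpoint callbacks write to disk, here is a simplified simulation in plain Python. It is only a sketch of the logic (one file per epoch, plus `best-model.keras` whenever `val_accuracy` improves), not the Keras implementation; the function name and the accuracy values are hypothetical, and it assumes epoch numbers in file names start at 1.

```python
def simulate_checkpoints(val_accuracies):
    """Return, for each epoch, the list of files that would be written."""
    best = float("-inf")
    written = []
    for epoch, val_acc in enumerate(val_accuracies, start=1):
        # Same template as the filepath used above: {epoch:02d}.keras
        files = ["{epoch:02d}.keras".format(epoch=epoch)]
        if val_acc > best:            # mode='max': higher is better
            best = val_acc
            files.append("best-model.keras")
        written.append(files)
    return written

# Hypothetical validation accuracies, chosen so that epoch 3 does not improve
for files in simulate_checkpoints([0.90, 0.94, 0.93, 0.95]):
    print(files)
```

In this toy run, `best-model.keras` is overwritten at epochs 1, 2 and 4, while the per-epoch files are all kept.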
Step 6 - Train the model¶
To access the logs with TensorBoard :
- Under Docker, from a terminal launched via the JupyterLab launcher, use the following command:
tensorboard --logdir <path-to-logs> --host 0.0.0.0
- If you're not using Docker, from a terminal :
tensorboard --logdir <path-to-logs>
Note: Only one TensorBoard instance can be used at a time.
Train it :
Note: The training curve is visible in real time with Tensorboard (see step 5)
chrono=fidle.Chrono()
chrono.start()
# ---- Shuffle train data
x_train,y_train=fidle.utils.shuffle_np_dataset(x_train,y_train)
# ---- Train
# Note: To be faster in our example, we can take only 2000 values
#
history = model.fit( x_train, y_train,
batch_size=batch_size,
epochs=epochs,
verbose=fit_verbosity,
validation_data=(x_test, y_test),
callbacks=[tenseorboard_callback, bestmodel_callback, savemodel_callback] )
model.save(f'{run_dir}/models/last-model.keras')
chrono.show()
Datasets have been shuffled.
Epoch 1/5
613/613 - 13s - 21ms/step - accuracy: 0.7922 - loss: 0.7408 - val_accuracy: 0.9082 - val_loss: 0.3315
Epoch 2/5
613/613 - 12s - 20ms/step - accuracy: 0.9670 - loss: 0.1137 - val_accuracy: 0.9407 - val_loss: 0.2081
Epoch 3/5
613/613 - 12s - 20ms/step - accuracy: 0.9811 - loss: 0.0654 - val_accuracy: 0.9485 - val_loss: 0.2015
Epoch 4/5
613/613 - 12s - 20ms/step - accuracy: 0.9867 - loss: 0.0460 - val_accuracy: 0.9521 - val_loss: 0.1972
Epoch 5/5
613/613 - 12s - 20ms/step - accuracy: 0.9894 - loss: 0.0353 - val_accuracy: 0.9565 - val_loss: 0.1890
Duration : 63.91 seconds
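The `613/613` shown for each epoch is the number of batches per epoch. A minimal check, using the training set size and `batch_size` of this run:

```python
import math

def steps_per_epoch(n_samples, batch_size):
    # The last batch may be smaller than the others, hence the ceiling
    return math.ceil(n_samples / batch_size)

# 39209 training images, batch_size = 64
print(steps_per_epoch(39209, 64))

# 12630 test images; 32 is assumed here to be the default batch size of predict()
print(steps_per_epoch(12630, 32))
```

The first value matches the training log; the second matches the `395/395` printed by `model.predict()` in step 8, under the stated batch size assumption.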
Evaluate it :
max_val_accuracy = max(history.history["val_accuracy"])
print("Max validation accuracy is : {:.4f}".format(max_val_accuracy))
Max validation accuracy is : 0.9565
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss : {:5.4f}'.format(score[0]))
print('Test accuracy : {:5.4f}'.format(score[1]))
Test loss : 0.1890
Test accuracy : 0.9565
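The maximum validation accuracy and the test accuracy are identical here because the test set is also used as `validation_data`, and the best epoch happens to be the last one. A small sketch, with the values copied from the training log above:

```python
# val_accuracy per epoch, copied from the training log of this run
val_accuracy = [0.9082, 0.9407, 0.9485, 0.9521, 0.9565]

best_value = max(val_accuracy)
best_epoch = val_accuracy.index(best_value) + 1   # epochs are numbered from 1
is_last = best_epoch == len(val_accuracy)

print(f"best epoch: {best_epoch}, val_accuracy: {best_value:.4f}, last epoch: {is_last}")
```

If the best epoch were not the last one, the model in memory after `fit()` would score lower than the saved `best-model.keras`.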
Step 7 - History¶
model.fit() returns the training history.
fidle.scrawler.history(history, save_as='03-history')
Step 8 - Evaluation and confusion¶
# One score per class for each image, then the index of the highest score
y_probs = model.predict(x_test, verbose=fit_verbosity)
y_pred = np.argmax(y_probs, axis=-1)
fidle.scrawler.confusion_matrix(y_test,y_pred,range(43), figsize=(12, 12),normalize=False, save_as='04-confusion-matrix')
395/395 - 1s - 4ms/step
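For readers unfamiliar with confusion matrices, here is a minimal pure Python sketch of how one can be built from true and predicted labels. The labels are toy values, not GTSRB results, and the row/column convention (rows are true classes) is an assumption that may differ from the one used by `fidle.scrawler.confusion_matrix`.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count (true, predicted) pairs; rows are true classes, columns predicted."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Toy labels for 3 classes (hypothetical)
y_true_toy = [0, 0, 1, 1, 2, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true_toy, y_pred_toy, 3)
for row in cm:
    print(row)

# Correct predictions are on the diagonal
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true_toy)
print(f"accuracy = {accuracy:.3f}")
```

Off-diagonal cells show which classes get confused with which, something a single accuracy figure cannot tell.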
Step 9 - Restore and evaluate¶
# !ls -1rt "$run_dir"/models/
Restore a model :¶
loaded_model = keras.models.load_model(f'{run_dir}/models/best-model.keras')
# loaded_model.summary()
print("Loaded.")
Loaded.
Evaluate it :¶
score = loaded_model.evaluate(x_test, y_test, verbose=0)
print('Test loss : {:5.4f}'.format(score[0]))
print('Test accuracy : {:5.4f}'.format(score[1]))
Test loss : 0.1890
Test accuracy : 0.9565
Make a prediction :¶
# ---- Pick a random image
#
i = random.randint(0, len(x_test)-1)
x,y = x_test[i], y_test[i]
# ---- Do prediction
#
prediction = loaded_model.predict( np.array([x]), verbose=fit_verbosity )
# ---- Show result
my_tools.show_prediction( prediction, x, y, x_meta )
1/1 - 0s - 11ms/step
Output layer from model is (x100) :
[[ 0. 0. 0. 0. 100. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Graphically :
In pictures :
The image : Prediction : Real stuff:
YEEES ! that's right!
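The predicted class is simply the index of the largest value in the output vector. A minimal sketch, using a vector shaped like the one printed above (the displayed values are rounded, so the exact 1.0 is an approximation):

```python
# 43 class scores, with all the mass on index 4 as in the output above
prediction = [0.0] * 43
prediction[4] = 1.0

# Index of the highest score (pure Python equivalent of an argmax)
predicted_class = max(range(len(prediction)), key=prediction.__getitem__)
confidence = prediction[predicted_class]

print(f"class {predicted_class} with score {confidence:.0%}")
```

The prediction is right when `predicted_class` equals the true label `y` of the image.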
fidle.end()
End time : 22/12/24 21:36:43
Duration : 00:01:23 270ms
This notebook ends here :-)
https://fidle.cnrs.fr
Step 10 - To go further ;-)¶
What you can do:
- Try different models
- Use a subset of the dataset
- Try different datasets
- Try to recognize exotic signs !
- Test different hyperparameters (epochs, batch size, optimization, etc.)
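One simple way to organise such tests is to enumerate every combination of candidate values. The sketch below only builds the list of configurations; the candidate values are hypothetical and each configuration would still have to be passed to a training run like the one in step 6.

```python
from itertools import product

# Hypothetical candidate values, for illustration only
grid = {
    "epochs": [5, 10],
    "batch_size": [32, 64],
    "optimizer": ["adam", "sgd"],
}

# Cartesian product of all candidate values, one dict per configuration
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

for cfg in configs:
    print(cfg)
print(len(configs), "configurations")
```

The number of runs grows quickly with each added parameter, so using a reduced `scale` can help keep such a sweep affordable.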