[K3IMDB5] - Sentiment analysis with a RNN network¶
Still the same problem, but with a network combining embedding and RNN, using Keras 3 and PyTorch.
Objectives :¶
- The objective is to guess whether film reviews are positive or negative based on the analysis of the text.
- Use of a model combining embedding and a recurrent (GRU) layer
The original dataset can be found there.
Note that IMDb.com offers several easy-to-use datasets.
For simplicity's sake, we'll use the dataset directly embedded in Keras.
What we're going to do :¶
- Retrieve data
- Preparing the data
- Build an Embedding/GRU model
- Train the model
- Evaluate the result
Step 1 - Init python stuff¶
import os
os.environ['KERAS_BACKEND'] = 'torch'
import keras
import keras.datasets.imdb as imdb
import json,re
import numpy as np
import fidle
# Init Fidle environment
run_id, run_dir, datasets_dir = fidle.init('K3IMDB5')
FIDLE - Environment initialization
Version              : 2.3.0
Run id               : K3IMDB5
Run dir              : ./run/K3IMDB5
Datasets dir         : /gpfswork/rech/mlh/uja62cb/fidle-project/datasets-fidle
Start time           : 03/03/24 21:06:58
Hostname             : r3i6n3 (Linux)
Tensorflow log level : Warning + Error (=1)
Update keras cache   : False
Update torch cache   : False
Save figs            : ./run/K3IMDB5/figs (True)
keras                : 3.0.4
numpy                : 1.24.4
sklearn              : 1.3.2
yaml                 : 6.0.1
matplotlib           : 3.8.2
pandas               : 2.1.3
torch                : 2.1.1
Step 2 - Parameters¶
The words in the vocabulary are ranked from the most frequent to the rarest.
- vocab_size is the number of words we will keep in our vocabulary (all other words will be considered as unknown)
- hide_most_frequently is the number of most common words to ignore
- review_len is the length to which every review will be padded or truncated
- dense_vector_size is the size of the generated dense embedding vectors
- fit_verbosity is the verbosity during training : 0 = silent, 1 = progress bar, 2 = one line per epoch
- scale is a dataset scale factor - note that scale=1 needs a training time of more than 10 minutes
vocab_size = 10000
hide_most_frequently = 0
review_len = 256
dense_vector_size = 32
epochs = 10
batch_size = 128
fit_verbosity = 1
scale = 0.2
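To see concretely what vocab_size and hide_most_frequently do, here is a minimal sketch (a hypothetical helper, not the actual Keras code) of the filtering rule that imdb.load_data() applies : word ids outside the kept range are replaced by the out-of-vocabulary id.
# --- Minimal sketch (hypothetical helper, not the actual Keras code) :
#     word ids outside [skip_top, num_words) are replaced by oov_char
#
def apply_vocab_filter(review, num_words=vocab_size, skip_top=hide_most_frequently, oov_char=2):
    return [w if skip_top <= w < num_words else oov_char for w in review]

print( apply_vocab_filter([1, 14, 22, 12345]) )   # the rare word 12345 becomes 2 (<unknown>)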
Override parameters (batch mode) - Just forget this cell
fidle.override('vocab_size', 'hide_most_frequently', 'review_len', 'dense_vector_size')
fidle.override('batch_size', 'epochs', 'fit_verbosity', 'scale')
** Overrided parameters : **
fit_verbosity : 2
scale : 1
Step 3 - Retrieve data¶
The IMDb dataset can be retrieved directly from Keras - see documentation
Note : Due to their nature, textual data can be somewhat complex.
3.1 - Get dataset¶
For simplicity, we will use a pre-formatted dataset - See documentation
However, Keras offers some useful tools for formatting textual data - See documentation
Load dataset :
# ----- Retrieve x,y
#
start_char = 1 # Start of a sequence (padding is 0)
oov_char = 2 # Out-of-vocabulary
index_from = 3 # First word id
(x_train, y_train), (x_test, y_test) = imdb.load_data( num_words = vocab_size,
skip_top = hide_most_frequently,
start_char = start_char,
oov_char = oov_char,
index_from = index_from)
# ---- Rescale
#
n1 = int(scale * len(x_train))
n2 = int(scale * len(x_test))
x_train, y_train = x_train[:n1], y_train[:n1]
x_test, y_test = x_test[:n2], y_test[:n2]
# ---- About
#
print("Max(x_train,x_test) : ", fidle.utils.rmax([x_train,x_test]) )
print("Min(x_train,x_test) : ", fidle.utils.rmin([x_train,x_test]) )
print("Len(x_train) : ", len(x_train))
print("Len(x_test) : ", len(x_test))
Max(x_train,x_test) :  9999
Min(x_train,x_test) :  1
Len(x_train)        :  25000
Len(x_test)         :  25000
3.2 - Have a look for humans (optional)¶
When we loaded the dataset, we asked that <start> be coded as 1 and <unknown word> as 2
So, we shifted the word indices by 3 with the parameter index_from=3
Load dictionary :
# ---- Retrieve dictionary {word:index}, and encode it in ascii
# Shift the dictionary from +3
# Add <pad>, <start> and <unknown> tags
# Create a reverse dictionary : {index:word}
#
word_index = imdb.get_word_index()
word_index = {w:(i+index_from) for w,i in word_index.items()}
word_index.update( {'<pad>':0, '<start>':1, '<unknown>':2, '<undef>':3,} )
index_word = {index:word for word,index in word_index.items()}
# ---- A nice function to translate an encoded review back into text :
#
def dataset2text(review):
return ' '.join([index_word.get(i, '?') for i in review])
Have a look :
print('\nDictionary size : ', len(word_index))
for k in range(440,455): print(f'{k:2d} : {index_word[k]}')
fidle.utils.subtitle('Review example :')
print(x_train[12])
fidle.utils.subtitle('After translation :')
print(dataset2text(x_train[12]))
Dictionary size :  88588
440 : hope
441 : entertaining
442 : she's
443 : mr
444 : overall
445 : evil
446 : called
447 : loved
448 : based
449 : oh
450 : several
451 : fans
452 : mother
453 : drama
454 : beginning
Review example :
[1, 13, 119, 954, 189, 1554, 13, 92, 459, 48, 4, 116, 9, 1492, 2291, 42, 726, 4, 1939, 168, 2031, 13, 423, 14, 20, 549, 18, 4, 2, 547, 32, 4, 96, 39, 4, 454, 7, 4, 22, 8, 4, 55, 130, 168, 13, 92, 359, 6, 158, 1511, 2, 42, 6, 1913, 19, 194, 4455, 4121, 6, 114, 8, 72, 21, 465, 9667, 304, 4, 51, 9, 14, 20, 44, 155, 8, 6, 226, 162, 616, 651, 51, 9, 14, 20, 44, 10, 10, 14, 218, 4843, 629, 42, 3017, 21, 48, 25, 28, 35, 534, 5, 6, 320, 8, 516, 5, 42, 25, 181, 8, 130, 56, 547, 3571, 5, 1471, 851, 14, 2286]
After translation :
<start> i love cheesy horror flicks i don't care if the acting is sub par or whether the monsters look corny i liked this movie except for the <unknown> feeling all the way from the beginning of the film to the very end look i don't need a 10 page <unknown> or a sign with big letters explaining a plot to me but dark floors takes the what is this movie about thing to a whole new annoying level what is this movie about br br this isn't exceptionally scary or thrilling but if you have an hour and a half to kill and or you want to end up feeling frustrated and confused rent this winner
Step 4 - Preprocess the data (padding)¶
In order to be processed by an NN, all entries must have the same length.
We chose a review length of review_len
We will therefore complete the short ones with padding (using <pad>) and truncate the longer ones
x_train = keras.preprocessing.sequence.pad_sequences(x_train,
value = 0,
padding = 'post',
maxlen = review_len)
x_test = keras.preprocessing.sequence.pad_sequences(x_test,
value = 0 ,
padding = 'post',
maxlen = review_len)
fidle.utils.subtitle('After padding :')
print(x_train[12])
fidle.utils.subtitle('In real words :')
print(dataset2text(x_train[12]))
After padding :
[ 1 13 119 954 189 1554 13 92 459 48 4 116 9 1492 2291 42 726 4 1939 168 2031 13 423 14 20 549 18 4 2 547 32 4 96 39 4 454 7 4 22 8 4 55 130 168 13 92 359 6 158 1511 2 42 6 1913 19 194 4455 4121 6 114 8 72 21 465 9667 304 4 51 9 14 20 44 155 8 6 226 162 616 651 51 9 14 20 44 10 10 14 218 4843 629 42 3017 21 48 25 28 35 534 5 6 320 8 516 5 42 25 181 8 130 56 547 3571 5 1471 851 14 2286 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
In real words :
<start> i love cheesy horror flicks i don't care if the acting is sub par or whether the monsters look corny i liked this movie except for the <unknown> feeling all the way from the beginning of the film to the very end look i don't need a 10 page <unknown> or a sign with big letters explaining a plot to me but dark floors takes the what is this movie about thing to a whole new annoying level what is this movie about br br this isn't exceptionally scary or thrilling but if you have an hour and a half to kill and or you want to end up feeling frustrated and confused rent this winner <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
Step 5 - Build the model¶
More documentation about the functions used in this model : Embedding, GRU, Dense - see the Keras documentation
model = keras.Sequential()
model.add(keras.layers.Embedding(input_dim = vocab_size, output_dim = dense_vector_size))
model.add(keras.layers.GRU(50))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(optimizer = 'rmsprop',
loss = 'binary_crossentropy',
metrics = ['accuracy'])
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape              ┃    Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ embedding (Embedding)           │ ?                         │ 0 (unbuilt)│
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ gru (GRU)                       │ ?                         │ 0 (unbuilt)│
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense (Dense)                   │ ?                         │ 0 (unbuilt)│
└─────────────────────────────────┴───────────────────────────┴────────────┘
Total params: 0 (0.00 B)
Trainable params: 0 (0.00 B)
Non-trainable params: 0 (0.00 B)
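Note : an LSTM would work just as well here; swapping the GRU for an LSTM layer is a one-line change. A minimal sketch of this variant (not trained in this notebook) :
# --- Variant : same model with an LSTM instead of a GRU
#
model_lstm = keras.Sequential([
    keras.layers.Embedding(input_dim=vocab_size, output_dim=dense_vector_size),
    keras.layers.LSTM(50),                       # replaces keras.layers.GRU(50)
    keras.layers.Dense(1, activation='sigmoid')
])
model_lstm.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])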
Step 6 - Train the model¶
6.1 - Add callback¶
os.makedirs(f'{run_dir}/models', mode=0o750, exist_ok=True)
save_dir = f'{run_dir}/models/best_model.keras'
save_dir = f'{run_dir}/models/best_model.keras'
savemodel_callback = keras.callbacks.ModelCheckpoint( filepath=save_dir, monitor='val_accuracy', mode='max', save_best_only=True)
6.2 - Train it¶
Note : With scale=0.2, batch_size=128 and epochs=10, training needs about 4 minutes on a CPU laptop.
history = model.fit(x_train,
y_train,
epochs = epochs,
batch_size = batch_size,
validation_data = (x_test, y_test),
verbose = fit_verbosity,
callbacks = [savemodel_callback])
Epoch 1/10
196/196 - 102s - 520ms/step - accuracy: 0.5095 - loss: 0.6922 - val_accuracy: 0.5083 - val_loss: 0.6906
Epoch 2/10
196/196 - 101s - 515ms/step - accuracy: 0.5547 - loss: 0.6796 - val_accuracy: 0.5869 - val_loss: 0.8755
Epoch 3/10
196/196 - 101s - 515ms/step - accuracy: 0.7925 - loss: 0.4866 - val_accuracy: 0.8474 - val_loss: 0.4261
Epoch 4/10
196/196 - 101s - 515ms/step - accuracy: 0.8647 - loss: 0.3742 - val_accuracy: 0.8649 - val_loss: 0.3688
Epoch 5/10
196/196 - 101s - 514ms/step - accuracy: 0.8858 - loss: 0.3292 - val_accuracy: 0.8749 - val_loss: 0.3410
Epoch 6/10
196/196 - 102s - 520ms/step - accuracy: 0.8993 - loss: 0.2999 - val_accuracy: 0.8652 - val_loss: 0.3847
Epoch 7/10
196/196 - 102s - 521ms/step - accuracy: 0.9084 - loss: 0.2719 - val_accuracy: 0.7688 - val_loss: 0.5159
Epoch 8/10
196/196 - 102s - 520ms/step - accuracy: 0.9159 - loss: 0.2490 - val_accuracy: 0.8809 - val_loss: 0.3145
Epoch 9/10
196/196 - 102s - 521ms/step - accuracy: 0.9268 - loss: 0.2218 - val_accuracy: 0.8799 - val_loss: 0.3224
Epoch 10/10
196/196 - 102s - 521ms/step - accuracy: 0.9343 - loss: 0.1941 - val_accuracy: 0.8768 - val_loss: 0.3255
6.4 - Training history¶
fidle.scrawler.history(history, save_as='02-history')
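If you are not running in the Fidle environment, a minimal matplotlib equivalent (a sketch assuming the standard Keras History object returned by model.fit) could be :
# --- Minimal sketch : plot the training curves without fidle
#
import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'],     label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.xlabel('epoch'); plt.ylabel('accuracy'); plt.legend(); plt.show()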
Step 7 - Evaluation¶
Reload and evaluate the best model :
model = keras.models.load_model(f'{run_dir}/models/best_model.keras')
# ---- Evaluate
score = model.evaluate(x_test, y_test, verbose=0)
print('x_test / loss : {:5.4f}'.format(score[0]))
print('x_test / accuracy : {:5.4f}'.format(score[1]))
values=[score[1], 1-score[1]]
fidle.scrawler.donut(values,["Accuracy","Errors"], title="#### Accuracy donut is :", save_as='03-donut')
# ---- Confusion matrix
y_sigmoid = model.predict(x_test, verbose=fit_verbosity)
y_pred = y_sigmoid.copy()
y_pred[ y_sigmoid< 0.5 ] = 0
y_pred[ y_sigmoid>=0.5 ] = 1
fidle.scrawler.confusion_matrix_txt(y_test,y_pred,labels=range(2))
fidle.scrawler.confusion_matrix(y_test,y_pred,range(2), figsize=(8, 8),normalize=False, save_as='04-confusion-matrix')
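To try the model on a review of your own, a hypothetical helper (not part of the original notebook) could encode the words with word_index, pad the sequence as above, and call the model :
# --- Hypothetical helper : score a raw text review (closer to 1 = positive)
#
def predict_review(text):
    words = re.findall(r"[a-z0-9']+", text.lower())
    ids   = [start_char] + [word_index.get(w, oov_char) for w in words]
    ids   = [i if i < vocab_size else oov_char for i in ids]
    x     = keras.preprocessing.sequence.pad_sequences([ids], value=0, padding='post', maxlen=review_len)
    return float( model.predict(x, verbose=0)[0,0] )

print( predict_review("this film was a wonderful surprise, i loved it") )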
fidle.end()
End time : 03/03/24 21:28:33
Duration : 00:21:34 470ms
This notebook ends here :-)
https://fidle.cnrs.fr