[K3IMDB2] - Sentiment analysis with text embedding¶
A very classic example of word embedding, with a dataset from the Internet Movie Database (IMDb), using Keras 3 on PyTorch.
Objectives :¶
- The objective is to guess whether film reviews are positive or negative based on the analysis of the text.
- Understand how to handle textual data and perform sentiment analysis
The original dataset can be found there.
Note that IMDb.com offers several easy-to-use datasets.
For simplicity's sake, we'll use the dataset embedded directly in Keras.
What we're going to do :¶
- Retrieve data
- Prepare the data
- Build a model
- Train the model
- Evaluate the result
import os
os.environ['KERAS_BACKEND'] = 'torch'
import keras
import keras.datasets.imdb as imdb
import h5py,json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import fidle
# Init Fidle environment
run_id, run_dir, datasets_dir = fidle.init('K3IMDB2')
FIDLE - Environment initialization
Version              : 2.3.0
Run id               : K3IMDB2
Run dir              : ./run/K3IMDB2
Datasets dir         : /gpfswork/rech/mlh/uja62cb/fidle-project/datasets-fidle
Start time           : 03/03/24 21:07:00
Hostname             : r6i0n6 (Linux)
Tensorflow log level : Warning + Error (=1)
Update keras cache   : False
Update torch cache   : False
Save figs            : ./run/K3IMDB2/figs (True)
keras                : 3.0.4
numpy                : 1.24.4
sklearn              : 1.3.2
yaml                 : 6.0.1
matplotlib           : 3.8.2
pandas               : 2.1.3
torch                : 2.1.1
1.2 - Parameters¶
The words in the vocabulary are ranked from the most frequent to the rarest.
- vocab_size is the number of words we will keep in our vocabulary (all other words will be treated as unknown)
- hide_most_frequently is the number of most frequent words to ignore
- review_len is the (fixed) length of each review
- dense_vector_size is the size of the generated dense vectors
- output_dir is where we will save our dataset and dictionaries (./data is a good choice)
- fit_verbosity is the verbosity during training : 0 = silent, 1 = progress bar, 2 = one line per epoch
vocab_size = 5000
hide_most_frequently = 0
review_len = 256
dense_vector_size = 32
epochs = 30
batch_size = 512
output_dir = './data'
fit_verbosity = 1
Override parameters (batch mode) - Just forget this cell
fidle.override('vocab_size', 'hide_most_frequently', 'review_len', 'dense_vector_size')
fidle.override('batch_size', 'epochs', 'output_dir', 'fit_verbosity')
** Overrided parameters : ** fit_verbosity : 2
Step 2 - Retrieve data¶
The IMDb dataset can be fetched directly from Keras - see the documentation
Note : due to its nature, textual data can be somewhat complex to handle.
For more details about how this dataset is managed, see notebook IMDB1
2.1 - Get dataset¶
# ----- Retrieve x,y
#
start_char = 1 # Start of a sequence (padding is 0)
oov_char = 2 # Out-of-vocabulary
index_from = 3 # First word id
(x_train, y_train), (x_test, y_test) = imdb.load_data( num_words = vocab_size,
skip_top = hide_most_frequently,
start_char = start_char,
oov_char = oov_char,
index_from = index_from)
# ---- About
#
print("Max(x_train,x_test) : ", fidle.utils.rmax([x_train,x_test]) )
print("Min(x_train,x_test) : ", fidle.utils.rmin([x_train,x_test]) )
print("Len(x_train) : ", len(x_train))
print("Len(x_test) : ", len(x_test))
Max(x_train,x_test) :  4999
Min(x_train,x_test) :  1
Len(x_train)        :  25000
Len(x_test)         :  25000
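At this stage, each review is still a plain Python list of word ids, with its own length. A quick check (purely illustrative, not part of the original notebook) makes this visible :
# ---- Illustrative check : reviews have variable lengths before padding
#
for i in range(3):
    print(f'Review {i} : {len(x_train[i])} words, first ids : {x_train[i][:10]}')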
2.2 - Load dictionary¶
Not essential, but nice if you want to take a closer look at our reviews ;-)
# ---- Retrieve dictionary {word:index}, and encode it in ascii
#      Shift the word indices by +3 (index_from)
# Add <pad>, <start> and <unknown> tags
# Create a reverse dictionary : {index:word}
#
word_index = imdb.get_word_index()
word_index = {w:(i+index_from) for w,i in word_index.items()}
word_index.update( {'<pad>':0, '<start>':1, '<unknown>':2, '<undef>':3,} )
index_word = {index:word for word,index in word_index.items()}
# ---- A nice function to decode a review back to readable text :
#
def dataset2text(review):
return ' '.join([index_word.get(i, '?') for i in review])
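For example (a purely illustrative use of the helper above), we can decode one training review back into readable text :
# ---- Illustrative only : decode a (still unpadded) review to text
#
print(dataset2text(x_train[12]))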
Step 3 - Preprocess the data (padding)¶
In order to be processed by a neural network, all entries must have the same length.
We chose a review length of review_len.
We will therefore complete shorter reviews with padding (0, i.e. <pad>)
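As a tiny sketch (illustrative only, with made-up sequences) of what pad_sequences does with padding='post' : short sequences are completed with 0 at the end, and sequences longer than maxlen are truncated (from the start, with the default truncating='pre').
# ---- Illustrative only : post-padding with 0, truncation at maxlen
#
demo = keras.preprocessing.sequence.pad_sequences([[11, 12, 13], [1, 2, 3, 4, 5, 6, 7, 8]],
                                                  value   = 0,
                                                  padding = 'post',
                                                  maxlen  = 6)
print(demo)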
x_train = keras.preprocessing.sequence.pad_sequences(x_train,
value = 0,
padding = 'post',
maxlen = review_len)
x_test = keras.preprocessing.sequence.pad_sequences(x_test,
value = 0 ,
padding = 'post',
maxlen = review_len)
fidle.utils.subtitle('After padding :')
print(x_train[12])
After padding :
[ 1 13 119 954 189 1554 13 92 459 48 4 116 9 1492 2291 42 726 4 1939 168 2031 13 423 14 20 549 18 4 2 547 32 4 96 39 4 454 7 4 22 8 4 55 130 168 13 92 359 6 158 1511 2 42 6 1913 19 194 4455 4121 6 114 8 72 21 465 2 304 4 51 9 14 20 44 155 8 6 226 162 616 651 51 9 14 20 44 10 10 14 218 4843 629 42 3017 21 48 25 28 35 534 5 6 320 8 516 5 42 25 181 8 130 56 547 3571 5 1471 851 14 2286 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
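A quick sanity check (illustrative, not in the original notebook) : after padding, x_train and x_test are regular 2D arrays of shape (number of reviews, review_len), i.e. (25000, 256) here.
# ---- Illustrative only : every review now has exactly review_len word ids
#
print('x_train shape : ', x_train.shape)
print('x_test  shape : ', x_test.shape)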
Save the dataset and the dictionary (for future use, but not mandatory)
# ---- Write the dataset to an h5 file, could be useful
#
fidle.utils.mkdir(output_dir)
with h5py.File(f'{output_dir}/dataset_imdb.h5', 'w') as f:
f.create_dataset("x_train", data=x_train)
f.create_dataset("y_train", data=y_train)
f.create_dataset("x_test", data=x_test)
f.create_dataset("y_test", data=y_test)
print('Dataset h5 file saved.')
with open(f'{output_dir}/word_index.json', 'w') as fp:
json.dump(word_index, fp)
print('Word to index saved.')
Dataset h5 file saved. Word to index saved.
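As a sketch of the "future use" mentioned above (file names taken from the cell just before; the variable names x_train2 and word_index2 are assumptions), the saved files could be reloaded like this :
# ---- Illustrative only : reload the saved dataset and dictionary
#
with h5py.File(f'{output_dir}/dataset_imdb.h5', 'r') as f:
    x_train2 = f['x_train'][:]
    y_train2 = f['y_train'][:]
with open(f'{output_dir}/word_index.json', 'r') as fp:
    word_index2 = json.load(fp)      # word -> index dictionary, as before
print(x_train2.shape, len(word_index2))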
Step 4 - Build the model¶
More documentation about the layers used in this model can be found in the Keras documentation.
model = keras.Sequential(name='Embedding model')
model.add(keras.layers.Input( shape=(review_len,) ))
model.add(keras.layers.Embedding( input_dim = vocab_size,
output_dim = dense_vector_size))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(dense_vector_size, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile( optimizer = 'adam',
loss = 'binary_crossentropy',
metrics = ['accuracy'])
model.summary()
Model: "Embedding model"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape              ┃    Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ embedding (Embedding)           │ (None, 256, 32)           │    160,000 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ global_average_pooling1d        │ (None, 32)                │          0 │
│ (GlobalAveragePooling1D)        │                           │            │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense (Dense)                   │ (None, 32)                │      1,056 │
├─────────────────────────────────┼───────────────────────────┼────────────┤
│ dense_1 (Dense)                 │ (None, 1)                 │         33 │
└─────────────────────────────────┴───────────────────────────┴────────────┘
Total params: 161,089 (629.25 KB)
Trainable params: 161,089 (629.25 KB)
Non-trainable params: 0 (0.00 B)
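To see what the first two layers produce, here is a small illustrative sketch (the names demo and v are assumptions, not part of the notebook) : the Embedding layer turns each of the review_len word ids into a dense vector of size dense_vector_size, and GlobalAveragePooling1D averages these vectors into a single one per review.
# ---- Illustrative only : Embedding maps ids to vectors, GlobalAveragePooling1D averages them
#
demo = keras.Sequential([
    keras.layers.Input(shape=(review_len,)),
    keras.layers.Embedding(input_dim=vocab_size, output_dim=dense_vector_size),
    keras.layers.GlobalAveragePooling1D()
])
v = demo.predict(x_train[:1], verbose=0)
print(v.shape)      # (1, dense_vector_size) : one dense vector per review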
Step 5 - Train the model¶
5.1 - Add callback(s)¶
os.makedirs(f'{run_dir}/models', mode=0o750, exist_ok=True)
save_dir = f'{run_dir}/models/best_model.keras'
savemodel_callback = keras.callbacks.ModelCheckpoint( filepath=save_dir, monitor='val_accuracy', mode='max', save_best_only=True)
5.2 - Train it¶
%%time
history = model.fit(x_train,
y_train,
epochs = epochs,
batch_size = batch_size,
validation_data = (x_test, y_test),
verbose = fit_verbosity,
callbacks = [savemodel_callback])
Epoch 1/30 49/49 - 6s - 129ms/step - accuracy: 0.6310 - loss: 0.6881 - val_accuracy: 0.7067 - val_loss: 0.6781
Epoch 2/30 49/49 - 6s - 120ms/step - accuracy: 0.7486 - loss: 0.6488 - val_accuracy: 0.7591 - val_loss: 0.6116
Epoch 3/30 49/49 - 6s - 124ms/step - accuracy: 0.8014 - loss: 0.5535 - val_accuracy: 0.8093 - val_loss: 0.5045
Epoch 4/30 49/49 - 6s - 121ms/step - accuracy: 0.8455 - loss: 0.4435 - val_accuracy: 0.8464 - val_loss: 0.4136
Epoch 5/30 49/49 - 6s - 121ms/step - accuracy: 0.8675 - loss: 0.3661 - val_accuracy: 0.8596 - val_loss: 0.3611
Epoch 6/30 49/49 - 6s - 121ms/step - accuracy: 0.8805 - loss: 0.3202 - val_accuracy: 0.8668 - val_loss: 0.3324
Epoch 7/30 49/49 - 6s - 124ms/step - accuracy: 0.8912 - loss: 0.2893 - val_accuracy: 0.8705 - val_loss: 0.3156
Epoch 8/30 49/49 - 6s - 121ms/step - accuracy: 0.8986 - loss: 0.2678 - val_accuracy: 0.8767 - val_loss: 0.3035
Epoch 9/30 49/49 - 6s - 121ms/step - accuracy: 0.9044 - loss: 0.2513 - val_accuracy: 0.8775 - val_loss: 0.2969
Epoch 10/30 49/49 - 6s - 121ms/step - accuracy: 0.9100 - loss: 0.2375 - val_accuracy: 0.8796 - val_loss: 0.2925
Epoch 11/30 49/49 - 6s - 121ms/step - accuracy: 0.9150 - loss: 0.2260 - val_accuracy: 0.8802 - val_loss: 0.2901
Epoch 12/30 49/49 - 6s - 124ms/step - accuracy: 0.9190 - loss: 0.2160 - val_accuracy: 0.8799 - val_loss: 0.2923
Epoch 13/30 49/49 - 6s - 121ms/step - accuracy: 0.9216 - loss: 0.2084 - val_accuracy: 0.8799 - val_loss: 0.2896
Epoch 14/30 49/49 - 6s - 120ms/step - accuracy: 0.9244 - loss: 0.2010 - val_accuracy: 0.8806 - val_loss: 0.2901
Epoch 15/30 49/49 - 6s - 118ms/step - accuracy: 0.9278 - loss: 0.1945 - val_accuracy: 0.8799 - val_loss: 0.2925
Epoch 16/30 49/49 - 6s - 120ms/step - accuracy: 0.9295 - loss: 0.1893 - val_accuracy: 0.8785 - val_loss: 0.2954
Epoch 17/30 49/49 - 6s - 118ms/step - accuracy: 0.9308 - loss: 0.1849 - val_accuracy: 0.8785 - val_loss: 0.2982
Epoch 18/30 49/49 - 6s - 117ms/step - accuracy: 0.9339 - loss: 0.1792 - val_accuracy: 0.8780 - val_loss: 0.3017
Epoch 19/30 49/49 - 6s - 117ms/step - accuracy: 0.9350 - loss: 0.1757 - val_accuracy: 0.8771 - val_loss: 0.3061
Epoch 20/30 49/49 - 6s - 120ms/step - accuracy: 0.9366 - loss: 0.1714 - val_accuracy: 0.8760 - val_loss: 0.3098
Epoch 21/30 49/49 - 6s - 117ms/step - accuracy: 0.9382 - loss: 0.1683 - val_accuracy: 0.8756 - val_loss: 0.3152
Epoch 22/30 49/49 - 6s - 117ms/step - accuracy: 0.9401 - loss: 0.1645 - val_accuracy: 0.8741 - val_loss: 0.3207
Epoch 23/30 49/49 - 6s - 117ms/step - accuracy: 0.9405 - loss: 0.1625 - val_accuracy: 0.8722 - val_loss: 0.3265
Epoch 24/30 49/49 - 6s - 117ms/step - accuracy: 0.9406 - loss: 0.1606 - val_accuracy: 0.8719 - val_loss: 0.3299
Epoch 25/30 49/49 - 6s - 120ms/step - accuracy: 0.9427 - loss: 0.1569 - val_accuracy: 0.8691 - val_loss: 0.3381
Epoch 26/30 49/49 - 6s - 117ms/step - accuracy: 0.9425 - loss: 0.1562 - val_accuracy: 0.8714 - val_loss: 0.3392
Epoch 27/30 49/49 - 6s - 117ms/step - accuracy: 0.9448 - loss: 0.1514 - val_accuracy: 0.8691 - val_loss: 0.3453
Epoch 28/30 49/49 - 6s - 118ms/step - accuracy: 0.9462 - loss: 0.1490 - val_accuracy: 0.8677 - val_loss: 0.3513
Epoch 29/30 49/49 - 6s - 124ms/step - accuracy: 0.9485 - loss: 0.1466 - val_accuracy: 0.8695 - val_loss: 0.3544
Epoch 30/30 49/49 - 6s - 119ms/step - accuracy: 0.9471 - loss: 0.1453 - val_accuracy: 0.8685 - val_loss: 0.3603
CPU times: user 2min 53s, sys: 945 ms, total: 2min 54s
Wall time: 2min 56s
Step 6 - Evaluate¶
6.1 - Training history¶
fidle.scrawler.history(history, save_as='02-history')
6.2 - Reload and evaluate best model¶
model = keras.models.load_model(f'{run_dir}/models/best_model.keras')
# ---- Evaluate
score = model.evaluate(x_test, y_test, verbose=0)
print('x_test / loss : {:5.4f}'.format(score[0]))
print('x_test / accuracy : {:5.4f}'.format(score[1]))
values=[score[1], 1-score[1]]
fidle.scrawler.donut(values,["Accuracy","Errors"], title="#### Accuracy donut is :", save_as='03-donut')
# ---- Confusion matrix
y_sigmoid = model.predict(x_test, verbose=fit_verbosity)
y_pred = y_sigmoid.copy()
y_pred[ y_sigmoid< 0.5 ] = 0
y_pred[ y_sigmoid>=0.5 ] = 1
fidle.scrawler.confusion_matrix_txt(y_test,y_pred,labels=range(2))
fidle.scrawler.confusion_matrix(y_test,y_pred,range(2), figsize=(8, 8),normalize=False, save_as='04-confusion-matrix')
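As a final, purely illustrative check (the review index 12 is an arbitrary choice, not part of the original notebook), we can look at one test review together with its predicted and true sentiment (1 = positive, 0 = negative in this dataset) :
# ---- Illustrative only : one review, its predicted sentiment and the ground truth
#
i = 12
print(dataset2text(x_test[i]))
print('Predicted :', 'positive' if y_pred[i, 0] == 1 else 'negative',
      ' / True :',   'positive' if y_test[i] == 1 else 'negative')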
fidle.end()
End time : 03/03/24 21:10:19
Duration : 00:03:20 782ms
This notebook ends here :-)
https://fidle.cnrs.fr