[K3IMDB2] - Sentiment analysis with text embedding¶
A very classical example of word embedding with a dataset from Internet Movie Database (IMDB), using Keras 3 on PyTorch
Objectives :¶
- The objective is to guess whether film reviews are positive or negative based on the analysis of the text.
- Understand the management of textual data and sentiment analysis
The original dataset can be found there
Note that IMDb.com offers several easy-to-use datasets
For simplicity's sake, we'll use the dataset directly embedded in Keras
What we're going to do :¶
- Retrieve data
- Prepare the data
- Build a model
- Train the model
- Evaluate the result
import os
os.environ['KERAS_BACKEND'] = 'torch'
import keras
import keras.datasets.imdb as imdb
import h5py,json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import fidle
# Init Fidle environment
run_id, run_dir, datasets_dir = fidle.init('K3IMDB2')
Version              : 2.3.2
Run id               : K3IMDB2
Run dir              : ./run/K3IMDB2
Datasets dir         : /lustre/fswork/projects/rech/mlh/uja62cb/fidle-project/datasets-fidle
Start time           : 22/12/24 21:22:36
Hostname             : r3i7n1 (Linux)
Tensorflow log level : Info + Warning + Error (=0)
Update keras cache   : False
Update torch cache   : False
Save figs            : ./run/K3IMDB2/figs (True)
keras                : 3.7.0
numpy                : 2.1.2
sklearn              : 1.5.2
yaml                 : 6.0.2
matplotlib           : 3.9.2
pandas               : 2.2.3
torch                : 2.5.0
1.2 - Parameters¶
The words in the vocabulary are ranked from the most frequent to the rarest.
- vocab_size is the number of words we will keep in our vocabulary (the other words will be considered as unknown).
- hide_most_frequently is the number of ignored words, among the most common ones
- review_len is the review length
- dense_vector_size is the size of the generated dense vectors
- output_dir is where we will save our dataset and dictionaries. (./data is a good choice)
- fit_verbosity is the verbosity during training : 0 = silent, 1 = progress bar, 2 = one line per epoch
vocab_size = 5000
hide_most_frequently = 0
review_len = 256
dense_vector_size = 32
epochs = 30
batch_size = 512
output_dir = './data'
fit_verbosity = 1
Step 2 - Retrieve data¶
The IMDb dataset can be loaded directly from Keras - see documentation
Note : Due to their nature, textual data can be somewhat complex.
For more details about the management of this dataset, see notebook IMDB1
2.1 - Get dataset¶
# ----- Retrieve x,y
start_char = 1 # Start of a sequence (padding is 0)
oov_char = 2 # Out-of-vocabulary
index_from = 3 # First word id
(x_train, y_train), (x_test, y_test) = imdb.load_data( num_words  = vocab_size,
                                                       skip_top   = hide_most_frequently,
                                                       start_char = start_char,
                                                       oov_char   = oov_char,
                                                       index_from = index_from)
# ---- About
print("Max(x_train,x_test) : ", fidle.utils.rmax([x_train,x_test]) )
print("Min(x_train,x_test) : ", fidle.utils.rmin([x_train,x_test]) )
print("Len(x_train) : ", len(x_train))
print("Len(x_test) : ", len(x_test))
Max(x_train,x_test) : 4999
Min(x_train,x_test) : 1
Len(x_train) : 25000
Len(x_test) : 25000
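To make the roles of start_char, oov_char, index_from, num_words and skip_top more concrete, here is a minimal pure-Python sketch of the encoding convention. It only illustrates the idea and is not the Keras implementation; the helper `encode_review` and the toy ranks are hypothetical.

```python
def encode_review(ranks, num_words, skip_top=0, start_char=1, oov_char=2, index_from=3):
    """Encode a review given as word frequency ranks (1 = most frequent word).

    Hypothetical helper mimicking the convention used above: ranks are
    shifted by index_from, a start marker is prepended, and any index
    outside [skip_top, num_words) is replaced by oov_char.
    """
    shifted = [start_char] + [r + index_from for r in ranks]
    return [i if skip_top <= i < num_words else oov_char for i in shifted]


# Toy review: ranks 1 and 10 are common words, rank 6000 is a rare one
toy_ranks = [1, 10, 6000]
encoded = encode_review(toy_ranks, num_words=5000)
print(encoded)   # [1, 4, 13, 2]
```

Under this convention, a vocabulary limited to 5000 entries can never produce an index above 4999, which is consistent with the maximum reported above.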
2.2 - Load dictionary¶
Not essential, but nice if you want to take a closer look at our reviews ;-)
# ---- Retrieve dictionary {word:index}, and encode it in ascii
# Shift the dictionary by +3 (index_from)
# Add <pad>, <start> and <unknown> tags
# Create a reverse dictionary : {index:word}
word_index = imdb.get_word_index()
word_index = {w:(i+index_from) for w,i in word_index.items()}
word_index.update( {'<pad>':0, '<start>':1, '<unknown>':2, '<undef>':3,} )
index_word = {index:word for word,index in word_index.items()}
# ---- A nice function to transpose :
def dataset2text(review):
    return ' '.join([index_word.get(i, '?') for i in review])
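The same steps can be tried on a miniature vocabulary without downloading anything. This is only a sketch: `toy_word_rank` is a made-up stand-in for the dictionary returned by Keras, and the `toy_` names are hypothetical.

```python
# Hypothetical miniature vocabulary {word: frequency rank}
toy_word_rank = {'the': 1, 'film': 2, 'great': 3}
index_from = 3

# Same steps as above: shift the ranks, add the special tags, invert the mapping
toy_word_index = {w: r + index_from for w, r in toy_word_rank.items()}
toy_word_index.update({'<pad>': 0, '<start>': 1, '<unknown>': 2, '<undef>': 3})
toy_index_word = {i: w for w, i in toy_word_index.items()}


def toy_dataset2text(review):
    """Translate a list of indices back to words; '?' marks an index with no entry."""
    return ' '.join(toy_index_word.get(i, '?') for i in review)


decoded = toy_dataset2text([1, 4, 5, 2, 6, 0, 0])
print(decoded)   # <start> the film <unknown> great <pad> <pad>
```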
Step 3 - Preprocess the data (padding)¶
In order to be processed by an NN, all entries must have the same length.
We chose a review length of review_len
Shorter reviews are therefore completed with padding (0, i.e. <pad>), and longer ones are truncated.
x_train = keras.preprocessing.sequence.pad_sequences(x_train,
value = 0,
padding = 'post',
maxlen = review_len)
x_test = keras.preprocessing.sequence.pad_sequences(x_test,
value = 0 ,
padding = 'post',
maxlen = review_len)
fidle.utils.subtitle('After padding :')
After padding :
[ 1 13 119 954 189 1554 13 92 459 48 4 116 9 1492 2291 42 726 4 1939 168 2031 13 423 14 20 549 18 4 2 547 32 4 96 39 4 454 7 4 22 8 4 55 130 168 13 92 359 6 158 1511 2 42 6 1913 19 194 4455 4121 6 114 8 72 21 465 2 304 4 51 9 14 20 44 155 8 6 226 162 616 651 51 9 14 20 44 10 10 14 218 4843 629 42 3017 21 48 25 28 35 534 5 6 320 8 516 5 42 25 181 8 130 56 547 3571 5 1471 851 14 2286 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
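What the padding step does to a single review can be sketched in a few lines of plain Python. The helper `pad_post` is hypothetical and assumes maxlen > 0; it mirrors the settings used above (padding = 'post', with the default truncating = 'pre' of pad_sequences), not the Keras code itself.

```python
def pad_post(sequence, maxlen, value=0):
    """Return a list of exactly maxlen items (maxlen > 0 is assumed).

    Short sequences are completed with `value` at the end ('post' padding);
    long ones keep only their last maxlen items ('pre' truncation).
    """
    kept = sequence[-maxlen:]                      # drop items from the start if too long
    return kept + [value] * (maxlen - len(kept))   # append padding if too short


short = pad_post([1, 13, 119], maxlen=5)
long_ = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)   # [1, 13, 119, 0, 0]
print(long_)   # [3, 4, 5, 6, 7]
```

Note that with 'pre' truncation a long review loses its beginning, including the <start> marker.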
Save dataset and dictionary (For future use but not mandatory)
# ---- Make sure the output directory exists
os.makedirs(output_dir, exist_ok=True)
# ---- Write dataset in a h5 file, could be useful
with h5py.File(f'{output_dir}/dataset_imdb.h5', 'w') as f:
    f.create_dataset("x_train", data=x_train)
    f.create_dataset("y_train", data=y_train)
    f.create_dataset("x_test", data=x_test)
    f.create_dataset("y_test", data=y_test)
print('Dataset h5 file saved.')
# ---- Write the {word:index} dictionary in a json file
with open(f'{output_dir}/word_index.json', 'w') as fp:
    json.dump(word_index, fp)
print('Word to index saved.')
Dataset h5 file saved.
Word to index saved.
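Reloading the dictionary later is straightforward, with one detail worth keeping in mind: JSON object keys are always strings. The sketch below uses a made-up miniature dictionary and a temporary directory (both hypothetical) to show why it is convenient to save {word:index} and rebuild {index:word} after loading.

```python
import json
import os
import tempfile

# Hypothetical miniature dictionary standing in for word_index
toy_word_index = {'<pad>': 0, '<start>': 1, '<unknown>': 2, 'the': 4}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'word_index.json')
    with open(path, 'w') as fp:
        json.dump(toy_word_index, fp)
    with open(path) as fp:
        reloaded = json.load(fp)

# Words are strings, so {word: index} survives the round trip unchanged
print(reloaded == toy_word_index)

# Rebuild the reverse dictionary after loading: saving {index: word}
# directly would turn the int keys into str
reloaded_index_word = {index: word for word, index in reloaded.items()}
print(reloaded_index_word[4])
```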
Step 4 - Build the model¶
More documentation about the functions used in this model :
model = keras.Sequential(name='Embedding model')
model.add(keras.layers.Input( shape=(review_len,) ))
model.add(keras.layers.Embedding( input_dim  = vocab_size,
                                  output_dim = dense_vector_size))
# Average the word vectors of each review : (review_len, dense_vector_size) -> (dense_vector_size,)
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(dense_vector_size, activation='relu'))
model.add(keras.layers.Dense(1,                 activation='sigmoid'))
model.compile( optimizer = 'adam',
               loss      = 'binary_crossentropy',
               metrics   = ['accuracy'])
model.summary()
Model: "Embedding model"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Layer (type)                 ┃ Output Shape      ┃   Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ embedding (Embedding)        │ (None, 256, 32)   │   160,000 │
├──────────────────────────────┼───────────────────┼───────────┤
│ global_average_pooling1d     │ (None, 32)        │         0 │
│ (GlobalAveragePooling1D)     │                   │           │
├──────────────────────────────┼───────────────────┼───────────┤
│ dense (Dense)                │ (None, 32)        │     1,056 │
├──────────────────────────────┼───────────────────┼───────────┤
│ dense_1 (Dense)              │ (None, 1)         │        33 │
└──────────────────────────────┴───────────────────┴───────────┘
Total params: 161,089 (629.25 KB)
Trainable params: 161,089 (629.25 KB)
Non-trainable params: 0 (0.00 B)
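The shapes and parameter counts in this summary can be checked by hand. The NumPy sketch below uses random stand-ins for the trained weights and for the reviews, so it only illustrates the shape flow (embedding lookup, then averaging over the words) and the arithmetic; it is not the Keras computation itself.

```python
import numpy as np

vocab_size, review_len, dense_vector_size = 5000, 256, 32
batch = 4   # arbitrary number of reviews for the illustration

rng = np.random.default_rng(0)
# Random stand-ins for the trained embedding table and for padded reviews
embedding_table = rng.normal(size=(vocab_size, dense_vector_size))
x = rng.integers(0, vocab_size, size=(batch, review_len))

embedded = embedding_table[x]     # one vector per word    -> (4, 256, 32)
pooled = embedded.mean(axis=1)    # average over the words -> (4, 32)
print(embedded.shape, pooled.shape)

# Parameter counts: one vector per vocabulary entry, then weights + biases
n_embedding = vocab_size * dense_vector_size                          # 160,000
n_dense = dense_vector_size * dense_vector_size + dense_vector_size   # 1,056
n_output = dense_vector_size * 1 + 1                                  # 33
total = n_embedding + n_dense + n_output
size_kb = f'{total * 4 / 1024:.2f} KB'   # assuming float32, 4 bytes per parameter
print(total, size_kb)
```

The plain mean above also averages the padded positions, since the model shown here does not enable masking in its Embedding layer.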
os.makedirs(f'{run_dir}/models', mode=0o750, exist_ok=True)
save_dir = f'{run_dir}/models/best_model.keras'
savemodel_callback = keras.callbacks.ModelCheckpoint( filepath=save_dir, monitor='val_accuracy', mode='max', save_best_only=True)
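The effect of monitor='val_accuracy', mode='max' and save_best_only=True can be pictured with a small pure-Python sketch. The helper `best_epochs` and the accuracy values are hypothetical; the sketch assumes a save happens only when the monitored value strictly improves.

```python
def best_epochs(val_accuracies):
    """Return the 1-based epochs at which a 'save best only' policy would save.

    Hypothetical helper: a save happens only when the monitored value
    strictly improves on the best one seen so far (mode='max').
    """
    best = float('-inf')
    saved_at = []
    for epoch, value in enumerate(val_accuracies, start=1):
        if value > best:
            best = value
            saved_at.append(epoch)
    return saved_at


# Made-up validation accuracies, for illustration only
saved = best_epochs([0.70, 0.82, 0.81, 0.86, 0.86, 0.85])
print(saved)   # [1, 2, 4]
```

The file on disk therefore always holds the weights of the best epoch so far, not those of the last epoch.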
5.2 - Train it¶
history = model.fit(x_train,
                    y_train,
                    epochs          = epochs,
                    batch_size      = batch_size,
                    validation_data = (x_test, y_test),
                    verbose         = fit_verbosity,
                    callbacks       = [savemodel_callback])
CPU times: user 15.6 s, sys: 240 ms, total: 15.9 s Wall time: 16.1 s
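For a sense of scale, the amount of work behind this training run follows from the parameters set earlier. The arithmetic below is a sketch that assumes the final, smaller batch of each epoch is kept.

```python
import math

n_train, batch_size, epochs = 25000, 512, 30

steps_per_epoch = math.ceil(n_train / batch_size)        # the last batch is smaller
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)   # 49 424 1470
```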
fidle.scrawler.history(history, save_as='02-history')
6.2 - Reload and evaluate best model¶
model = keras.models.load_model(f'{run_dir}/models/best_model.keras')
# ---- Evaluate
score = model.evaluate(x_test, y_test, verbose=0)
print('x_test / loss : {:5.4f}'.format(score[0]))
print('x_test / accuracy : {:5.4f}'.format(score[1]))
values=[score[1], 1-score[1]]
fidle.scrawler.donut(values,["Accuracy","Errors"], title="#### Accuracy donut is :", save_as='03-donut')
# ---- Confusion matrix
y_sigmoid = model.predict(x_test, verbose=fit_verbosity)
y_pred = y_sigmoid.copy()
y_pred[ y_sigmoid< 0.5 ] = 0
y_pred[ y_sigmoid>=0.5 ] = 1
fidle.scrawler.confusion_matrix(y_test,y_pred,range(2), figsize=(8, 8),normalize=False, save_as='04-confusion-matrix')
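The thresholding and the confusion matrix can also be worked out by hand on a few values. The probabilities and labels below are made up for illustration; the sketch uses the same 0.5 threshold as above, with class 0 = negative and class 1 = positive.

```python
# Made-up sigmoid outputs and labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.91, 0.40, 0.35, 0.50, 0.72, 0.08]

# Same rule as above: probabilities >= 0.5 are read as class 1 (positive)
y_hat = [1 if p >= 0.5 else 0 for p in y_prob]

# matrix[true][predicted]
matrix = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_hat):
    matrix[t][p] += 1

# Accuracy is the share of reviews on the diagonal of the matrix
accuracy = (matrix[0][0] + matrix[1][1]) / len(y_true)
print(matrix, accuracy)
```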
