
[K3IMDB2] - Sentiment analysis with text embedding¶

A classic example of word embedding with a dataset from the Internet Movie Database (IMDb), using Keras 3 on a PyTorch backend

Objectives :¶

  • The objective is to guess whether film reviews are positive or negative based on the analysis of the text.
  • Understand the handling of textual data and sentiment analysis

The original dataset can be found there
Note that IMDb.com offers several easy-to-use datasets
For simplicity's sake, we'll use the dataset directly embedded in Keras

What we're going to do :¶

  • Retrieve data
  • Prepare the data
  • Build a model
  • Train the model
  • Evaluate the result

Step 1 - Import and init¶

1.1 - Python stuff¶

In [1]:
import os
os.environ['KERAS_BACKEND'] = 'torch'

import keras
import keras.datasets.imdb as imdb

import h5py,json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import fidle

# Init Fidle environment
run_id, run_dir, datasets_dir = fidle.init('K3IMDB2')


FIDLE - Environment initialization

Version              : 2.3.2
Run id               : K3IMDB2
Run dir              : ./run/K3IMDB2
Datasets dir         : /lustre/fswork/projects/rech/mlh/uja62cb/fidle-project/datasets-fidle
Start time           : 22/12/24 21:22:36
Hostname             : r3i7n1 (Linux)
Tensorflow log level : Info + Warning + Error  (=0)
Update keras cache   : False
Update torch cache   : False
Save figs            : ./run/K3IMDB2/figs (True)
keras                : 3.7.0
numpy                : 2.1.2
sklearn              : 1.5.2
yaml                 : 6.0.2
matplotlib           : 3.9.2
pandas               : 2.2.3
torch                : 2.5.0

1.2 - Parameters¶

The words in the vocabulary are ranked from the most frequent to the rarest.

  • vocab_size is the number of words we keep in our vocabulary (all other words will be considered as unknown)
  • hide_most_frequently is the number of most frequent words to ignore
  • review_len is the length to which every review will be padded or truncated
  • dense_vector_size is the size of the generated dense vectors
  • output_dir is where we will save our dictionaries (./data is a good choice)
  • fit_verbosity is the verbosity during training : 0 = silent, 1 = progress bar, 2 = one line per epoch

In [2]:
vocab_size           = 5000
hide_most_frequently = 0

review_len           = 256
dense_vector_size    = 32

epochs               = 30
batch_size           = 512

output_dir           = './data'
fit_verbosity        = 1

Override parameters (batch mode) - Just forget this cell

In [3]:
fidle.override('vocab_size', 'hide_most_frequently', 'review_len', 'dense_vector_size')
fidle.override('batch_size', 'epochs', 'output_dir', 'fit_verbosity')
** Overrided parameters : **
fit_verbosity        : 2

Step 2 - Retrieve data¶

The IMDb dataset can be retrieved directly from Keras - see the documentation
Note : due to its nature, textual data can be somewhat complex to handle.

For more details about the handling of this dataset, see the IMDB1 notebook

2.1 - Get dataset¶

In [4]:
# ----- Retrieve x,y
#
start_char = 1      # Start of a sequence (padding is 0)
oov_char   = 2      # Out-of-vocabulary
index_from = 3      # First word id

(x_train, y_train), (x_test, y_test) = imdb.load_data( num_words  = vocab_size, 
                                                       skip_top   = hide_most_frequently,
                                                       start_char = start_char, 
                                                       oov_char   = oov_char, 
                                                       index_from = index_from)

# ---- About
#
print("Max(x_train,x_test)  : ", fidle.utils.rmax([x_train,x_test]) )
print("Min(x_train,x_test)  : ", fidle.utils.rmin([x_train,x_test]) )
print("Len(x_train)         : ", len(x_train))
print("Len(x_test)          : ", len(x_test))
Max(x_train,x_test)  :  4999
Min(x_train,x_test)  :  1
Len(x_train)         :  25000
Len(x_test)          :  25000

2.2 - Load dictionary¶

Not essential, but nice if you want to take a closer look at our reviews ;-)

In [5]:
# ---- Retrieve dictionary {word:index}, and encode it in ascii
#      Shift the dictionary by +3 (index_from)
#      Add <pad>, <start> and <unknown> tags
#      Create a reverse dictionary : {index:word}
#
word_index = imdb.get_word_index()
word_index = {w:(i+index_from) for w,i in word_index.items()}
word_index.update( {'<pad>':0, '<start>':1, '<unknown>':2, '<undef>':3,} )
index_word = {index:word for word,index in word_index.items()} 

# ---- A handy function to decode a review back to text :
#
def dataset2text(review):
    return ' '.join([index_word.get(i, '?') for i in review])
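
As a quick check (an illustration, not part of the original cell), this decoder can be used to display any review as readable text; review 12 below is an arbitrary choice:

# ---- Illustrative usage : decode one review back to words
print( dataset2text(x_train[12]) )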

Step 3 - Preprocess the data (padding)¶

In order to be processed by a neural network, all entries must have the same length.
We chose a review length of review_len.
We will therefore pad the reviews with 0 (as <pad>) up to that length.
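
As a minimal illustration (not part of the notebook pipeline, and assuming keras is already imported as in cell 1), this is what post-padding does on two toy sequences:

# ---- Toy example of post-padding to a fixed length of 5
demo = keras.preprocessing.sequence.pad_sequences( [[1, 7, 4], [1, 9]],
                                                    value=0, padding='post', maxlen=5 )
print(demo)
# -> [[1 7 4 0 0]
#     [1 9 0 0 0]]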

In [6]:
x_train = keras.preprocessing.sequence.pad_sequences(x_train,
                                                     value   = 0,
                                                     padding = 'post',
                                                     maxlen  = review_len)

x_test  = keras.preprocessing.sequence.pad_sequences(x_test,
                                                     value   = 0 ,
                                                     padding = 'post',
                                                     maxlen  = review_len)

fidle.utils.subtitle('After padding :')
print(x_train[12])


After padding :

[   1   13  119  954  189 1554   13   92  459   48    4  116    9 1492
 2291   42  726    4 1939  168 2031   13  423   14   20  549   18    4
    2  547   32    4   96   39    4  454    7    4   22    8    4   55
  130  168   13   92  359    6  158 1511    2   42    6 1913   19  194
 4455 4121    6  114    8   72   21  465    2  304    4   51    9   14
   20   44  155    8    6  226  162  616  651   51    9   14   20   44
   10   10   14  218 4843  629   42 3017   21   48   25   28   35  534
    5    6  320    8  516    5   42   25  181    8  130   56  547 3571
    5 1471  851   14 2286    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0]

Save the dataset and the dictionary (for future use, not mandatory)

In [7]:
# ---- Write the dataset to an h5 file, could be useful
#
fidle.utils.mkdir(output_dir)

with h5py.File(f'{output_dir}/dataset_imdb.h5', 'w') as f:
    f.create_dataset("x_train",    data=x_train)
    f.create_dataset("y_train",    data=y_train)
    f.create_dataset("x_test",     data=x_test)
    f.create_dataset("y_test",     data=y_test)
    print('Dataset h5 file saved.')

with open(f'{output_dir}/word_index.json', 'w') as fp:
    json.dump(word_index, fp)
    print('Word to index saved.')
Dataset h5 file saved.
Word to index saved.

Step 4 - Build the model¶

More documentation about the functions used in this model :

  • Embedding
  • GlobalAveragePooling1D
In [8]:
model = keras.Sequential(name='Embedding model')

model.add(keras.layers.Input( shape=(review_len,) ))
model.add(keras.layers.Embedding( input_dim    = vocab_size,
                                  output_dim   = dense_vector_size))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(dense_vector_size, activation='relu'))
model.add(keras.layers.Dense(1,                 activation='sigmoid'))

model.compile( optimizer = 'adam',
               loss      = 'binary_crossentropy',
               metrics   = ['accuracy'])

model.summary()
Model: "Embedding model"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ embedding (Embedding)                │ (None, 256, 32)             │         160,000 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ global_average_pooling1d             │ (None, 32)                  │               0 │
│ (GlobalAveragePooling1D)             │                             │                 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense (Dense)                        │ (None, 32)                  │           1,056 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_1 (Dense)                      │ (None, 1)                   │              33 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 161,089 (629.25 KB)
 Trainable params: 161,089 (629.25 KB)
 Non-trainable params: 0 (0.00 B)
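
The parameter counts can be read directly from the architecture : the Embedding layer is a trainable lookup table of vocab_size x dense_vector_size = 5000 x 32 = 160,000 weights, GlobalAveragePooling1D has no weights because it only averages the review_len embedding vectors of a review into a single 32-dimensional vector, and the two Dense layers add 32x32+32 = 1,056 and 32+1 = 33 weights. A minimal numpy sketch of the pooling step (an illustration, not code from the notebook, ignoring masking):

# ---- What GlobalAveragePooling1D computes, in numpy
import numpy as np
emb    = np.random.rand(1, 256, 32)   # (batch, review_len, dense_vector_size)
pooled = emb.mean(axis=1)             # average over the 256 positions
print(pooled.shape)                   # -> (1, 32), as in the summary above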

Step 5 - Train the model¶

5.1 Add Callbacks¶

In [9]:
os.makedirs(f'{run_dir}/models',   mode=0o750, exist_ok=True)
save_dir = f'{run_dir}/models/best_model.keras'

savemodel_callback = keras.callbacks.ModelCheckpoint( filepath=save_dir, monitor='val_accuracy', mode='max', save_best_only=True)
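
Since only the best model (highest val_accuracy) is kept, the training below can simply run for all 30 epochs. As an optional variant (not used in this notebook), an EarlyStopping callback could stop training once val_accuracy no longer improves:

# ---- Optional sketch, not used here : stop when val_accuracy stops improving
earlystop_callback = keras.callbacks.EarlyStopping( monitor  = 'val_accuracy',
                                                     mode     = 'max',
                                                     patience = 5,
                                                     restore_best_weights = True )
# It would then be passed to model.fit() alongside savemodel_callback.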

5.2 - Train it¶

In [10]:
%%time

history = model.fit(x_train,
                    y_train,
                    epochs          = epochs,
                    batch_size      = batch_size,
                    validation_data = (x_test, y_test),
                    verbose         = fit_verbosity,
                    callbacks       = [savemodel_callback])
Epoch 1/30
49/49 - 1s - 20ms/step - accuracy: 0.5996 - loss: 0.6883 - val_accuracy: 0.7153 - val_loss: 0.6783
Epoch 2/30
49/49 - 1s - 11ms/step - accuracy: 0.7361 - loss: 0.6511 - val_accuracy: 0.7635 - val_loss: 0.6168
Epoch 3/30
49/49 - 1s - 11ms/step - accuracy: 0.7927 - loss: 0.5625 - val_accuracy: 0.8134 - val_loss: 0.5152
Epoch 4/30
49/49 - 1s - 11ms/step - accuracy: 0.8401 - loss: 0.4555 - val_accuracy: 0.8440 - val_loss: 0.4239
Epoch 5/30
49/49 - 1s - 11ms/step - accuracy: 0.8655 - loss: 0.3755 - val_accuracy: 0.8574 - val_loss: 0.3684
Epoch 6/30
49/49 - 1s - 11ms/step - accuracy: 0.8790 - loss: 0.3270 - val_accuracy: 0.8653 - val_loss: 0.3378
Epoch 7/30
49/49 - 1s - 11ms/step - accuracy: 0.8893 - loss: 0.2950 - val_accuracy: 0.8722 - val_loss: 0.3182
Epoch 8/30
49/49 - 1s - 14ms/step - accuracy: 0.8977 - loss: 0.2726 - val_accuracy: 0.8752 - val_loss: 0.3063
Epoch 9/30
49/49 - 1s - 11ms/step - accuracy: 0.9026 - loss: 0.2550 - val_accuracy: 0.8769 - val_loss: 0.2995
Epoch 10/30
49/49 - 1s - 11ms/step - accuracy: 0.9088 - loss: 0.2412 - val_accuracy: 0.8788 - val_loss: 0.2934
Epoch 11/30
49/49 - 1s - 11ms/step - accuracy: 0.9133 - loss: 0.2296 - val_accuracy: 0.8792 - val_loss: 0.2919
Epoch 12/30
49/49 - 1s - 10ms/step - accuracy: 0.9175 - loss: 0.2195 - val_accuracy: 0.8790 - val_loss: 0.2906
Epoch 13/30
49/49 - 1s - 11ms/step - accuracy: 0.9206 - loss: 0.2117 - val_accuracy: 0.8796 - val_loss: 0.2912
Epoch 14/30
49/49 - 1s - 11ms/step - accuracy: 0.9226 - loss: 0.2046 - val_accuracy: 0.8803 - val_loss: 0.2901
Epoch 15/30
49/49 - 1s - 10ms/step - accuracy: 0.9261 - loss: 0.1974 - val_accuracy: 0.8794 - val_loss: 0.2925
Epoch 16/30
49/49 - 1s - 10ms/step - accuracy: 0.9285 - loss: 0.1921 - val_accuracy: 0.8801 - val_loss: 0.2937
Epoch 17/30
49/49 - 1s - 10ms/step - accuracy: 0.9290 - loss: 0.1869 - val_accuracy: 0.8790 - val_loss: 0.2966
Epoch 18/30
49/49 - 1s - 10ms/step - accuracy: 0.9320 - loss: 0.1820 - val_accuracy: 0.8780 - val_loss: 0.2996
Epoch 19/30
49/49 - 1s - 10ms/step - accuracy: 0.9346 - loss: 0.1773 - val_accuracy: 0.8767 - val_loss: 0.3032
Epoch 20/30
49/49 - 1s - 10ms/step - accuracy: 0.9354 - loss: 0.1740 - val_accuracy: 0.8764 - val_loss: 0.3066
Epoch 21/30
49/49 - 1s - 10ms/step - accuracy: 0.9376 - loss: 0.1701 - val_accuracy: 0.8749 - val_loss: 0.3120
Epoch 22/30
49/49 - 1s - 10ms/step - accuracy: 0.9382 - loss: 0.1680 - val_accuracy: 0.8753 - val_loss: 0.3151
Epoch 23/30
49/49 - 1s - 10ms/step - accuracy: 0.9405 - loss: 0.1637 - val_accuracy: 0.8733 - val_loss: 0.3213
Epoch 24/30
49/49 - 1s - 10ms/step - accuracy: 0.9408 - loss: 0.1611 - val_accuracy: 0.8738 - val_loss: 0.3251
Epoch 25/30
49/49 - 1s - 10ms/step - accuracy: 0.9426 - loss: 0.1576 - val_accuracy: 0.8731 - val_loss: 0.3314
Epoch 26/30
49/49 - 1s - 10ms/step - accuracy: 0.9438 - loss: 0.1555 - val_accuracy: 0.8725 - val_loss: 0.3339
Epoch 27/30
49/49 - 1s - 10ms/step - accuracy: 0.9437 - loss: 0.1532 - val_accuracy: 0.8699 - val_loss: 0.3408
Epoch 28/30
49/49 - 1s - 10ms/step - accuracy: 0.9449 - loss: 0.1507 - val_accuracy: 0.8704 - val_loss: 0.3459
Epoch 29/30
49/49 - 1s - 10ms/step - accuracy: 0.9460 - loss: 0.1489 - val_accuracy: 0.8691 - val_loss: 0.3492
Epoch 30/30
49/49 - 1s - 10ms/step - accuracy: 0.9473 - loss: 0.1465 - val_accuracy: 0.8654 - val_loss: 0.3581
CPU times: user 15.6 s, sys: 240 ms, total: 15.9 s
Wall time: 16.1 s

Step 6 - Evaluate¶

6.1 - Training history¶

In [11]:
fidle.scrawler.history(history, save_as='02-history')
Saved: ./run/K3IMDB2/figs/02-history_0
Saved: ./run/K3IMDB2/figs/02-history_1

6.2 - Reload and evaluate best model¶

In [12]:
model = keras.models.load_model(f'{run_dir}/models/best_model.keras')

# ---- Evaluate
score  = model.evaluate(x_test, y_test, verbose=0)

print('x_test / loss      : {:5.4f}'.format(score[0]))
print('x_test / accuracy  : {:5.4f}'.format(score[1]))

values=[score[1], 1-score[1]]
fidle.scrawler.donut(values,["Accuracy","Errors"], title="#### Accuracy donut is :", save_as='03-donut')

# ---- Confusion matrix

y_sigmoid = model.predict(x_test, verbose=fit_verbosity)

y_pred = y_sigmoid.copy()
y_pred[ y_sigmoid< 0.5 ] = 0
y_pred[ y_sigmoid>=0.5 ] = 1    

fidle.scrawler.confusion_matrix_txt(y_test,y_pred,labels=range(2))
fidle.scrawler.confusion_matrix(y_test,y_pred,range(2), figsize=(8, 8),normalize=False, save_as='04-confusion-matrix')
x_test / loss      : 0.2901
x_test / accuracy  : 0.8803

Accuracy donut is :¶

Saved: ./run/K3IMDB2/figs/03-donut
782/782 - 1s - 2ms/step

Confusion matrix is :¶

  0 1
0 0.89 0.11
1 0.13 0.87
Saved: ./run/K3IMDB2/figs/04-confusion-matrix
In [13]:
fidle.end()

End time : 22/12/24 21:23:11
Duration : 00:00:35 932ms
This notebook ends here :-)
https://fidle.cnrs.fr

