[K3IMDB1] - Sentiment analysis with one-hot encoding¶
A basic example of sentiment analysis with sparse encoding, using a dataset from the Internet Movie Database (IMDB), with Keras 3 on PyTorch.
Objectives :¶
- The objective is to guess whether film reviews are positive or negative based on the analysis of the text.
- Understand the management of textual data and sentiment analysis
The original dataset can be found there
Note that IMDb.com offers several easy-to-use datasets
For simplicity's sake, we'll use the dataset directly embedded in Keras
What we're going to do :¶
- Retrieve data
- Prepare the data
- Build a model
- Train the model
- Evaluate the result
import os
os.environ['KERAS_BACKEND'] = 'torch'
import keras
import keras.datasets.imdb as imdb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import fidle
# Init Fidle environment
run_id, run_dir, datasets_dir = fidle.init('K3IMDB1')
FIDLE - Environment initialization
Version : 2.3.2
Run id : K3IMDB1
Run dir : ./run/K3IMDB1
Datasets dir : /lustre/fswork/projects/rech/mlh/uja62cb/fidle-project/datasets-fidle
Start time : 22/12/24 21:22:34
1.2 - Parameters¶
The words in the vocabulary are ranked from the most frequent to the rarest.
- vocab_size is the number of words we will keep in our vocabulary (all other words will be treated as unknown).
- hide_most_frequently is the number of words to ignore, among the most frequent ones.
- fit_verbosity is the verbosity during training : 0 = silent, 1 = progress bar, 2 = one line per epoch
vocab_size = 5000
hide_most_frequently = 0
epochs = 10
batch_size = 512
fit_verbosity = 1
sentence = "I've never seen a movie like this before"
dictionary = {"a":0, "before":1, "fantastic":2, "i've":3, "is":4, "like":5, "movie":6, "never":7, "seen":8, "this":9}
We encode our sentence as a numerical vector :¶
sentence_words = sentence.lower().split()
sentence_vect = [ dictionary[w] for w in sentence_words ]
print('Words sentence are : ', sentence_words)
print('Our vectorized sentence is : ', sentence_vect)
Words sentence are : ["i've", 'never', 'seen', 'a', 'movie', 'like', 'this', 'before']
Our vectorized sentence is : [3, 7, 8, 0, 6, 5, 9, 1]
Next, we one-hot encode our vectorized sentence as a tensor :¶
# ---- We get a (dictionary size x sentence length) matrix of zeros
onehot = np.zeros( (10,8) )
# ---- We set a 1 for each word : row = word id, column = position in the sentence
for i,w in enumerate(sentence_vect):
    onehot[w,i] = 1
# --- Show it
print('In a basic way :\n\n', onehot, '\n\nWith a pandas view :\n')
data={ f'{sentence_words[i]:.^10}':onehot[:,i] for i,w in enumerate(sentence_vect) }
df = pd.DataFrame(data)
df.index = list(dictionary)
# --- Styled display (rendered in a notebook)
df.style.format('{:1.0f}').highlight_max(axis=0).set_properties(**{'text-align': 'center'})
In a basic way :
[[0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0.]]
With a pandas view :
| ...i've... | ..never... | ...seen... | ....a..... | ..movie... | ...like... | ...this... | ..before.. |
a | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
before | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
fantastic | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
i've | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
is | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
like | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
movie | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
never | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
seen | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
this | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
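The same matrix can be built without an explicit loop. This is a minimal sketch using the toy dictionary and sentence from this section. Indexing an identity matrix is one common NumPy idiom for this, not necessarily the approach the notebook intends:

```python
import numpy as np

dictionary = {"a": 0, "before": 1, "fantastic": 2, "i've": 3, "is": 4,
              "like": 5, "movie": 6, "never": 7, "seen": 8, "this": 9}
sentence = "I've never seen a movie like this before"

# Words -> integer ids
sentence_vect = [dictionary[w] for w in sentence.lower().split()]

# Row k of the identity matrix is the one-hot vector of id k.
# Picking one row per word gives (sentence length x dictionary size);
# transposing gives (dictionary size x sentence length), as in the table.
onehot = np.eye(len(dictionary))[sentence_vect].T

print(onehot.shape)
```

Each column contains exactly one 1, in the row of the corresponding word.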
Step 3 - Retrieve data¶
The IMDb dataset can be obtained directly from Keras - see documentation
Note : Due to their nature, textual data can be somewhat complex.
3.1 - Data structure :¶
The dataset is composed of 2 parts:
- reviews, this will be our x
- opinions (positive/negative), this will be our y
There is also a dictionary, because the words in the reviews are stored as indexes
<dataset> = (<reviews>, <opinions>)
with : <reviews> = [ <review1>, <review2>, ... ]
<opinions> = [ <rate1>, <rate2>, ... ] where <ratei> = integer
where : <reviewi> = [ <w1>, <w2>, ...] <wi> are the index (int) of the word in the dictionary
<ratei> = int 0 for negative opinion, 1 for positive
<dictionary> = [ <word1>:<w1>, <word2>:<w2>, ... ]
with : <wordi> = word
<wi> = int
3.2 - Load dataset¶
For simplicity, we will use a pre-formatted dataset - See documentation
However, Keras offers some useful tools for formatting textual data - See documentation
By default :
- Start of a sequence will be marked with : 1
- Out of vocabulary word will be : 2
- First index will be : 3
# ----- Retrieve x,y
start_char = 1 # Start of a sequence (padding is 0)
oov_char = 2 # Out-of-vocabulary
index_from = 3 # First word id
(x_train, y_train), (x_test, y_test) = imdb.load_data( num_words = vocab_size,
skip_top = hide_most_frequently,
start_char = start_char,
oov_char = oov_char,
index_from = index_from)
# ---- About
print("Max(x_train,x_test) : ", fidle.utils.rmax([x_train,x_test]) )
print("Min(x_train,x_test) : ", fidle.utils.rmin([x_train,x_test]) )
print("Len(x_train) : ", len(x_train))
print("Len(x_test) : ", len(x_test))
Max(x_train,x_test) : 4999
Min(x_train,x_test) : 1
Len(x_train) : 25000
Len(x_test) : 25000
print('\nReview example (x_train[12]) :\n\n',x_train[12])
print('\nOpinions (y_train) :\n\n',y_train)
Review example (x_train[12]) :
[1, 13, 119, 954, 189, 1554, 13, 92, 459, 48, 4, 116, 9, 1492, 2291, 42, 726, 4, 1939, 168, 2031, 13, 423, 14, 20, 549, 18, 4, 2, 547, 32, 4, 96, 39, 4, 454, 7, 4, 22, 8, 4, 55, 130, 168, 13, 92, 359, 6, 158, 1511, 2, 42, 6, 1913, 19, 194, 4455, 4121, 6, 114, 8, 72, 21, 465, 2, 304, 4, 51, 9, 14, 20, 44, 155, 8, 6, 226, 162, 616, 651, 51, 9, 14, 20, 44, 10, 10, 14, 218, 4843, 629, 42, 3017, 21, 48, 25, 28, 35, 534, 5, 6, 320, 8, 516, 5, 42, 25, 181, 8, 130, 56, 547, 3571, 5, 1471, 851, 14, 2286]
Opinions (y_train) :
[1 0 0 ... 0 1 0]
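To make the roles of start_char, oov_char, index_from, the vocabulary size and the number of skipped words more concrete, here is a minimal sketch of how a review given as frequency ranks could be turned into the indexes seen above. The function encode_review is hypothetical, and its filtering rule reflects my understanding of the loader's behaviour; the real implementation may differ in details.

```python
def encode_review(ranks, num_words=None, skip_top=0,
                  start_char=1, oov_char=2, index_from=3):
    """Hypothetical helper: ranks (1 = most frequent word) -> dataset indexes."""
    # Prepend the start marker and shift every rank by index_from
    ids = [start_char] + [r + index_from for r in ranks]
    # Keep only indexes inside [skip_top, num_words), replace the others by oov_char
    upper = num_words if num_words is not None else max(ids) + 1
    return [i if skip_top <= i < upper else oov_char for i in ids]

# Ranks 1 and 10 are kept; ranks 4997 and 6000 fall outside a 5000-word vocabulary
encoded = encode_review([1, 10, 4997, 6000], num_words=5000)
print(encoded)
```

With this rule the largest index that can appear is num_words - 1, which is consistent with the maximum of 4999 reported above for vocab_size = 5000.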
4.2 - Load dictionary¶
# ---- Retrieve dictionary {word:index}
word_index = imdb.get_word_index()
# ---- Shift the dictionary from <index_from>
word_index = {w:(i+index_from) for w,i in word_index.items()}
# ---- Add <pad>, <start>, <unknown> and <undef> tags
word_index.update( {'<pad>':0, '<start>':1, '<unknown>':2, '<undef>':3,} )
# ---- Create a reverse dictionary : {index:word}
index_word = {index:word for word,index in word_index.items()}
# ---- About dictionary
print('\nDictionary size : ', len(word_index))
print('\nSmall extract :\n')
for k in range(440,455):print(f' {k:2d} : {index_word[k]}' )
# ---- Add a nice function to translate a review back into text :
def dataset2text(review):
    return ' '.join([index_word.get(i, '?') for i in review])
Dictionary size : 88588
Small extract :
440 : hope
441 : entertaining
442 : she's
443 : mr
444 : overall
445 : evil
446 : called
447 : loved
448 : based
449 : oh
450 : several
451 : fans
452 : mother
453 : drama
454 : beginning
4.3 - Have a look, for human¶
fidle.utils.subtitle('Review example :')
print(x_train[12])
fidle.utils.subtitle('After translation :')
print(dataset2text(x_train[12]))
Review example :
[1, 13, 119, 954, 189, 1554, 13, 92, 459, 48, 4, 116, 9, 1492, 2291, 42, 726, 4, 1939, 168, 2031, 13, 423, 14, 20, 549, 18, 4, 2, 547, 32, 4, 96, 39, 4, 454, 7, 4, 22, 8, 4, 55, 130, 168, 13, 92, 359, 6, 158, 1511, 2, 42, 6, 1913, 19, 194, 4455, 4121, 6, 114, 8, 72, 21, 465, 2, 304, 4, 51, 9, 14, 20, 44, 155, 8, 6, 226, 162, 616, 651, 51, 9, 14, 20, 44, 10, 10, 14, 218, 4843, 629, 42, 3017, 21, 48, 25, 28, 35, 534, 5, 6, 320, 8, 516, 5, 42, 25, 181, 8, 130, 56, 547, 3571, 5, 1471, 851, 14, 2286]
After translation :
<start> i love cheesy horror flicks i don't care if the acting is sub par or whether the monsters look corny i liked this movie except for the <unknown> feeling all the way from the beginning of the film to the very end look i don't need a 10 page <unknown> or a sign with big letters explaining a plot to me but dark <unknown> takes the what is this movie about thing to a whole new annoying level what is this movie about br br this isn't exceptionally scary or thrilling but if you have an hour and a half to kill and or you want to end up feeling frustrated and confused rent this winner
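The shift of the dictionary and the translation back to text can be tried on a tiny example. This is a sketch with a made-up three-word rank dictionary (raw_index) and a hypothetical text2dataset helper for the opposite direction, neither of which is part of the notebook:

```python
# Made-up rank dictionary, in the {word: rank} style of the real one
raw_index = {"the": 1, "movie": 2, "great": 3}
index_from = 3

# Shift the ranks, then reserve the first indexes for the special tags
word_index = {w: i + index_from for w, i in raw_index.items()}
word_index.update({"<pad>": 0, "<start>": 1, "<unknown>": 2})
index_word = {i: w for w, i in word_index.items()}

def text2dataset(text):
    """Hypothetical inverse of dataset2text: text -> list of indexes."""
    unknown = word_index["<unknown>"]
    words = text.lower().split()
    return [word_index["<start>"]] + [word_index.get(w, unknown) for w in words]

def dataset2text(review):
    return " ".join(index_word.get(i, "?") for i in review)

encoded = text2dataset("The movie was great")
decoded = dataset2text(encoded)
print(encoded)
print(decoded)
```

The word "was" is not in the toy dictionary, so it comes back as <unknown>, just like the out-of-vocabulary words in the translated review.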
4.4 - A few statistics¶
# ---- Distribution of review lengths
sizes=[len(i) for i in x_train]
plt.hist(sizes, bins=400)
plt.gca().set(title='Distribution of reviews by size - [{:5.2f}, {:5.2f}]'.format(min(sizes),max(sizes)),
              xlabel='Size', ylabel='Density', xlim=[0,1500])
plt.show()
# ---- Percentage of out-of-vocabulary words per review
unk=[ 100*(s.count(oov_char)/len(s)) for s in x_train]
plt.hist(unk, bins=100)
plt.gca().set(title='Percent of unknown words - [{:5.2f}, {:5.2f}]'.format(min(unk),max(unk)),
              xlabel='# unknown', ylabel='Density', xlim=[0,30])
plt.show()
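The two quantities plotted here are simple per-review computations. A small sketch on three made-up reviews (the reviews list is purely illustrative) shows what goes into each histogram:

```python
from statistics import mean, median

oov_char = 2

# Three made-up encoded reviews; index 2 marks an out-of-vocabulary word
reviews = [[1, 14, 2, 9],
           [1, 2, 2, 7, 5, 3, 8, 11],
           [1, 6, 4]]

# Length of each review, in tokens (start marker included)
sizes = [len(r) for r in reviews]

# Share of out-of-vocabulary tokens in each review, in percent
unk = [100 * r.count(oov_char) / len(r) for r in reviews]

print(sizes, mean(sizes), median(sizes))
print(unk)
```

On the real dataset, a smaller vocab_size would be expected to shift the second distribution to the right, since more words fall outside the vocabulary.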
Step 5 - Basic approach with "one-hot" vector encoding¶
Basic approach.
Each sentence is encoded as a single vector whose length equals the size of the dictionary.
The value of each component is 1 if the corresponding word is present in the sentence, and 0 otherwise (strictly speaking, this is a multi-hot encoding : word order and repetitions are lost).
For a sentence s=[3,4,7] and a dictionary of 10 words...
We will have a vector v=[0,0,0,1,1,0,0,1,0,0]
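A short sketch makes the consequence of this encoding visible: two different sentences that use the same set of words end up with the same vector. The helper name multi_hot is only illustrative.

```python
def multi_hot(sentence, vector_size):
    """Set component k to 1 if word index k occurs in the sentence."""
    v = [0] * vector_size
    for word in sentence:
        v[word] = 1
    return v

v1 = multi_hot([3, 4, 7], 10)
v2 = multi_hot([7, 3, 3, 4], 10)   # other order, one repeated word

print(v1)
print(v1 == v2)
```

For s=[3,4,7] and 10 words the vector has exactly 10 components, with ones at positions 3, 4 and 7.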
5.1 - Our one-hot encoder function¶
def one_hot_encoder(x, vector_size=10000):
    # ---- Set all to 0
    x_encoded = np.zeros((len(x), vector_size))
    # ---- For each sentence
    for i,sentence in enumerate(x):
        for word in sentence:
            x_encoded[i, word] = 1.
    return x_encoded
5.2 - Encoding..¶
x_train = one_hot_encoder(x_train, vector_size=vocab_size)
x_test = one_hot_encoder(x_test, vector_size=vocab_size)
print("To have a look, x_train[12] became :", x_train[12] )
To have a look, x_train[12] became : [0. 1. 1. ... 0. 0. 0.]
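The encoded arrays are dense, so their size is worth a quick check. The sketch below is a variant of the encoder with an explicit dtype argument (an addition for illustration, not part of the notebook's function) and an arithmetic estimate of the full-size array, computed without allocating it:

```python
import numpy as np

def one_hot_encoder(x, vector_size, dtype=np.float32):
    """Variant of the encoder above with a configurable dtype."""
    x_encoded = np.zeros((len(x), vector_size), dtype=dtype)
    for i, sentence in enumerate(x):
        # Integer-list indexing sets all word positions of the row at once
        x_encoded[i, sentence] = 1.0
    return x_encoded

toy = one_hot_encoder([[1, 3], [0, 0, 2]], vector_size=4)
print(toy, toy.nbytes)

# Estimated size of one 25000 x 5000 array, in bytes
n_reviews, vocab_size = 25000, 5000
bytes_float64 = n_reviews * vocab_size * np.dtype(np.float64).itemsize
bytes_float32 = n_reviews * vocab_size * np.dtype(np.float32).itemsize
print(bytes_float64, bytes_float32)
```

np.zeros defaults to float64, so with vocab_size = 5000 each of x_train and x_test takes about 1 GB; float32 halves that.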
Step 6 - Build a nice model¶
model = keras.Sequential(name='My IMDB classifier')
model.add(keras.layers.Input( shape=(vocab_size,) ))
model.add(keras.layers.Dense( 32, activation='relu'))
model.add(keras.layers.Dense( 32, activation='relu'))
model.add(keras.layers.Dense( 1, activation='sigmoid'))
model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])
model.summary()
Model: "My IMDB classifier"
Layer (type) | Output Shape | Param # |
dense (Dense) | (None, 32) | 160,032 |
dense_1 (Dense) | (None, 32) | 1,056 |
dense_2 (Dense) | (None, 1) | 33 |
Total params: 161,121 (629.38 KB)
Trainable params: 161,121 (629.38 KB)
Non-trainable params: 0 (0.00 B)
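The parameter counts in the summary can be checked by hand: a dense layer has one weight per (input, output) pair plus one bias per output. A small sketch, assuming 4 bytes per parameter (float32) for the size estimate:

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of a dense layer."""
    return (n_in + 1) * n_out

# (inputs, outputs) of the three Dense layers, with vocab_size = 5000
layers = [(5000, 32), (32, 32), (32, 1)]

counts = [dense_params(n_in, n_out) for n_in, n_out in layers]
total = sum(counts)
size_kb = total * 4 / 1024   # assumes float32 weights

print(counts, total, round(size_kb, 2))
```

Almost all parameters sit in the first layer, whose size is driven directly by vocab_size.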
os.makedirs(f'{run_dir}/models', mode=0o750, exist_ok=True)
save_dir = f'{run_dir}/models/best_model.keras'
savemodel_callback = keras.callbacks.ModelCheckpoint( filepath=save_dir, monitor='val_accuracy', mode='max', save_best_only=True)
7.2 - Train it¶
history = model.fit(x_train,
                    y_train,
                    epochs = epochs,
                    batch_size = batch_size,
                    validation_data = (x_test, y_test),
                    verbose = fit_verbosity,
                    callbacks = [savemodel_callback])
Epoch 1/10
49/49 - 2s - 43ms/step - accuracy: 0.7986 - loss: 0.4631 - val_accuracy: 0.8628 - val_loss: 0.3470
Epoch 2/10
49/49 - 2s - 36ms/step - accuracy: 0.8958 - loss: 0.2775 - val_accuracy: 0.8630 - val_loss: 0.3312
Epoch 3/10
49/49 - 2s - 36ms/step - accuracy: 0.9076 - loss: 0.2381 - val_accuracy: 0.8827 - val_loss: 0.2914
Epoch 4/10
49/49 - 2s - 36ms/step - accuracy: 0.9188 - loss: 0.2090 - val_accuracy: 0.8774 - val_loss: 0.2993
Epoch 5/10
49/49 - 2s - 36ms/step - accuracy: 0.9234 - loss: 0.1973 - val_accuracy: 0.8806 - val_loss: 0.2995
Epoch 6/10
49/49 - 2s - 36ms/step - accuracy: 0.9308 - loss: 0.1777 - val_accuracy: 0.8725 - val_loss: 0.3281
Epoch 7/10
49/49 - 2s - 36ms/step - accuracy: 0.9390 - loss: 0.1629 - val_accuracy: 0.8752 - val_loss: 0.3248
Epoch 8/10
49/49 - 2s - 36ms/step - accuracy: 0.9460 - loss: 0.1457 - val_accuracy: 0.8734 - val_loss: 0.3461
Epoch 9/10
49/49 - 2s - 36ms/step - accuracy: 0.9508 - loss: 0.1343 - val_accuracy: 0.8683 - val_loss: 0.3643
Epoch 10/10
49/49 - 2s - 36ms/step - accuracy: 0.9588 - loss: 0.1163 - val_accuracy: 0.8534 - val_loss: 0.4389
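The log shows training accuracy rising steadily while validation accuracy peaks early, which is the usual sign of overfitting. The sketch below copies the validation figures from the log and finds the best epoch, along with the epochs at which a checkpoint that only saves on improvement of val_accuracy would have written the model:

```python
# Validation metrics copied from the training log, epochs 1 to 10
val_accuracy = [0.8628, 0.8630, 0.8827, 0.8774, 0.8806,
                0.8725, 0.8752, 0.8734, 0.8683, 0.8534]
val_loss = [0.3470, 0.3312, 0.2914, 0.2993, 0.2995,
            0.3281, 0.3248, 0.3461, 0.3643, 0.4389]

# Epoch numbers are 1-based
best_acc_epoch = val_accuracy.index(max(val_accuracy)) + 1
best_loss_epoch = val_loss.index(min(val_loss)) + 1

# Epochs where val_accuracy strictly improves on the best value so far
saved_epochs, best = [], float("-inf")
for epoch, acc in enumerate(val_accuracy, start=1):
    if acc > best:
        best = acc
        saved_epochs.append(epoch)

print(best_acc_epoch, best_loss_epoch, saved_epochs)
```

Both criteria point to epoch 3, so the remaining epochs only improve the fit on the training set.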
fidle.scrawler.history(history, save_as='02-history')
8.2 - Reload and evaluate best model¶
model = keras.models.load_model(f'{run_dir}/models/best_model.keras')
# ---- Evaluate
score = model.evaluate(x_test, y_test, verbose=0)
print('\n\nModel evaluation :\n')
print(' x_test / loss : {:5.4f}'.format(score[0]))
print(' x_test / accuracy : {:5.4f}'.format(score[1]))
values=[score[1], 1-score[1]]
fidle.scrawler.donut(values,["Accuracy","Errors"], title="#### Accuracy donut is :", save_as='03-donut')
# ---- Confusion matrix
y_sigmoid = model.predict(x_test, verbose=fit_verbosity)
y_pred = y_sigmoid.copy()
y_pred[ y_sigmoid< 0.5 ] = 0
y_pred[ y_sigmoid>=0.5 ] = 1
fidle.scrawler.confusion_matrix(y_test,y_pred,range(2), figsize=(8, 8),normalize=False, save_as='04-confusion-matrix')
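The thresholding step and the counts behind a confusion matrix can be reproduced with plain NumPy. This sketch uses six made-up predictions (y_true and y_sigmoid are illustrative values, not model outputs) and lays out the matrix with true classes as rows and predicted classes as columns, which is an assumption about the plotted layout:

```python
import numpy as np

# Made-up labels and sigmoid outputs, shaped like model.predict() results
y_true = np.array([1, 0, 1, 1, 0, 0])
y_sigmoid = np.array([[0.91], [0.40], [0.35], [0.77], [0.62], [0.08]])

# Threshold at 0.5 : class 1 if the output is >= 0.5, else class 0
y_pred = (y_sigmoid.ravel() >= 0.5).astype(int)

# cm[t, p] counts the samples of true class t predicted as class p
cm = np.zeros((2, 2), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)

# Correct predictions lie on the diagonal
accuracy = np.trace(cm) / cm.sum()

print(y_pred)
print(cm)
print(accuracy)
```

The off-diagonal cells separate the two kinds of errors (negative reviews predicted positive, and the reverse), which a single accuracy figure hides.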
