
[K3IMDB3] - Reload and reuse a saved model¶

Retrieving a saved model to perform sentiment analysis on movie reviews, using Keras 3 and PyTorch

Objectives :¶

  • The objective is to predict whether our own film reviews are positive or negative, based on an analysis of their text.
  • For this, we will use our previously saved model.

What we're going to do :¶

  • Preparing our data
  • Retrieve our saved model
  • Evaluate the result

Step 1 - Init python stuff¶

In [1]:
import os
os.environ['KERAS_BACKEND'] = 'torch'

import keras

import json,re
import numpy as np

import fidle

# Init Fidle environment
run_id, run_dir, datasets_dir = fidle.init('K3IMDB3')


FIDLE - Environment initialization

Version              : 2.3.2
Run id               : K3IMDB3
Run dir              : ./run/K3IMDB3
Datasets dir         : /lustre/fswork/projects/rech/mlh/uja62cb/fidle-project/datasets-fidle
Start time           : 22/12/24 21:23:22
Hostname             : r3i7n1 (Linux)
Tensorflow log level : Info + Warning + Error  (=0)
Update keras cache   : False
Update torch cache   : False
Save figs            : ./run/K3IMDB3/figs (True)
keras                : 3.7.0
numpy                : 2.1.2
sklearn              : 1.5.2
yaml                 : 6.0.2
matplotlib           : 3.9.2
pandas               : 2.2.3
torch                : 2.5.0

1.2 - Parameters¶

The words in the vocabulary are ranked from the most frequent to the rarest.

  • vocab_size is the number of words we keep in our vocabulary (all other words will be treated as unknown).
  • review_len is the maximum length of a review.
  • saved_models is the directory where our models were previously saved.
  • dictionaries_dir is the directory where our dictionaries were previously saved (./data is a good choice).
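As a minimal sketch of what the vocab_size cut-off means (a hypothetical helper, not part of the notebook): since words are ranked by frequency, any word whose index is beyond vocab_size is later replaced by the out-of-vocabulary code.

```python
vocab_size = 10000
oov_char   = 2   # out-of-vocabulary code used later in this notebook

def cap_to_vocab(w_index, vocab_size=vocab_size, oov_char=oov_char):
    # Hypothetical helper: word ranks beyond the vocabulary map to oov
    return w_index if w_index < vocab_size else oov_char

print(cap_to_vocab(215))     # frequent word: index kept
print(cap_to_vocab(25000))   # rare word: mapped to oov (2)
```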

In [2]:
vocab_size           = 10000
review_len           = 256

saved_models         = './run/K3IMDB2'
dictionaries_dir     = './data'

Override parameters (batch mode) - Just forget this cell

In [3]:
fidle.override('vocab_size', 'review_len', 'saved_models', 'dictionaries_dir')

Step 2 : Preparing the data¶

2.1 - Our reviews :¶

In [4]:
reviews = [ "This film is particularly nice, a must see.",
             "This film is a great classic that cannot be ignored.",
             "I don't remember ever having seen such a movie...",
             "This movie is just abominable and doesn't deserve to be seen!"]

2.2 - Retrieve dictionaries¶

Note : This dictionary was generated by the 02-Embedding-Keras notebook.

In [5]:
with open(f'{dictionaries_dir}/word_index.json', 'r') as fp:
    word_index = json.load(fp)
    index_word = { i:w      for w,i in word_index.items() }
    print('Dictionaries loaded. ', len(word_index), 'entries' )
Dictionaries loaded.  88588 entries
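To make the inversion above concrete, here is the same comprehension applied to a tiny hypothetical dictionary (the indices are illustrative, not the real IMDB ones):

```python
word_index = {'the': 4, 'film': 22, 'great': 87}       # toy entries
index_word = {i: w for w, i in word_index.items()}     # invert: index -> word
print(index_word[22])   # -> film
```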

2.3 - Clean, index and pad¶

Reviews are split into words, punctuation is removed, review length is limited and padding is added...
Note : 0 is "padding", 1 is "start" and 2 is "unknown" (out of vocabulary)

In [6]:
start_char = 1      # Start of a sequence (padding is 0)
oov_char   = 2      # Out-of-vocabulary
index_from = 3      # First word id

nb_reviews = len(reviews)
x_data     = []

# ---- For all reviews
for review in reviews:
    print('Words are : ', end='')
    # ---- First index must be <start>
    index_review=[start_char]
    print(f'{start_char} ', end='')
    # ---- For all words
    for w in review.split(' '):
        # ---- Clean it
        w_clean = re.sub(r"[^a-zA-Z0-9]", "", w)
        # ---- Not empty ?
        if len(w_clean)>0:
            # ---- Get the index - unknown words map to oov_char
            w_index = word_index.get(w, oov_char)
            if w_index>=vocab_size : w_index=oov_char
            # ---- Add the index (or oov_char if out of vocabulary)
            index_review.append(w_index)
            print(f'{w_index} ', end='')
    # ---- Add the indexed review
    x_data.append(index_review)
    print()

# ---- Padding
x_data = keras.preprocessing.sequence.pad_sequences(x_data, value   = 0, padding = 'post', maxlen  = review_len)
Words are : 1 2 22 9 572 2 6 215 2 
Words are : 1 2 22 9 6 87 356 15 566 30 2 
Words are : 1 2 92 377 126 260 110 141 6 2 
Words are : 1 2 20 9 43 2 5 152 1833 8 30 2 
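The pad_sequences call above truncates each review to review_len and right-pads it with zeros (padding='post'). A hand-rolled, plain-Python equivalent, just to illustrate the behaviour:

```python
def pad_post(seq, maxlen, value=0):
    # Truncate to maxlen, then pad on the right with `value`
    seq = seq[:maxlen]
    return seq + [value] * (maxlen - len(seq))

print(pad_post([1, 2, 22, 9], 8))   # -> [1, 2, 22, 9, 0, 0, 0, 0]
```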

2.4 - Have a look¶

In [7]:
def translate(x):
    return ' '.join( [index_word.get(i,'?') for i in x] )

for i in range(nb_reviews):
    imax=np.where(x_data[i]==0)[0][0]+5
    print(f'\nText review {i}  :',    reviews[i])
    print(f'tokens vector  :', list(x_data[i][:imax]), '(...)')
    print('Translation    :', translate(x_data[i][:imax]), '(...)')
Text review 0  : This film is particularly nice, a must see.
tokens vector  : [np.int32(1), np.int32(2), np.int32(22), np.int32(9), np.int32(572), np.int32(2), np.int32(6), np.int32(215), np.int32(2), np.int32(0), np.int32(0), np.int32(0), np.int32(0), np.int32(0)] (...)
Translation    : <start> <unknown> film is particularly <unknown> a must <unknown> <pad> <pad> <pad> <pad> <pad> (...)

Text review 1  : This film is a great classic that cannot be ignored.
tokens vector  : [np.int32(1), np.int32(2), np.int32(22), np.int32(9), np.int32(6), np.int32(87), np.int32(356), np.int32(15), np.int32(566), np.int32(30), np.int32(2), np.int32(0), np.int32(0), np.int32(0), np.int32(0), np.int32(0)] (...)
Translation    : <start> <unknown> film is a great classic that cannot be <unknown> <pad> <pad> <pad> <pad> <pad> (...)

Text review 2  : I don't remember ever having seen such a movie...
tokens vector  : [np.int32(1), np.int32(2), np.int32(92), np.int32(377), np.int32(126), np.int32(260), np.int32(110), np.int32(141), np.int32(6), np.int32(2), np.int32(0), np.int32(0), np.int32(0), np.int32(0), np.int32(0)] (...)
Translation    : <start> <unknown> don't remember ever having seen such a <unknown> <pad> <pad> <pad> <pad> <pad> (...)

Text review 3  : This movie is just abominable and doesn't deserve to be seen!
tokens vector  : [np.int32(1), np.int32(2), np.int32(20), np.int32(9), np.int32(43), np.int32(2), np.int32(5), np.int32(152), np.int32(1833), np.int32(8), np.int32(30), np.int32(2), np.int32(0), np.int32(0), np.int32(0), np.int32(0), np.int32(0)] (...)
Translation    : <start> <unknown> movie is just <unknown> and doesn't deserve to be <unknown> <pad> <pad> <pad> <pad> <pad> (...)
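The np.where(x_data[i]==0)[0][0]+5 expression above simply finds the position of the first padding token and keeps five extra tokens for context. A plain-Python equivalent on a toy padded vector:

```python
x = [1, 2, 22, 9, 0, 0, 0, 0]   # toy padded review
imax = x.index(0) + 5           # first pad position, plus a little context
print(x[:imax])                 # -> [1, 2, 22, 9, 0, 0, 0, 0]
```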

Step 3 - Bring back the model¶

In [8]:
model = keras.models.load_model(f'{saved_models}/models/best_model.keras')

Step 4 - Predict¶

In [9]:
y_pred   = model.predict(x_data, verbose=0)

And the winner is :¶

In [10]:
for i,review in enumerate(reviews):
    rate    = y_pred[i][0]
    opinion =  'NEGATIVE :-(' if rate<0.5 else 'POSITIVE :-)'    
    print(f'{review:<70} => {rate:.2f} - {opinion}')
This film is particularly nice, a must see.                            => 0.52 - POSITIVE :-)
This film is a great classic that cannot be ignored.                   => 0.68 - POSITIVE :-)
I don't remember ever having seen such a movie...                      => 0.49 - NEGATIVE :-(
This movie is just abominable and doesn't deserve to be seen!          => 0.31 - NEGATIVE :-(
In [11]:
fidle.end()

End time : 22/12/24 21:23:22
Duration : 00:00:00 489ms
This notebook ends here :-)
https://fidle.cnrs.fr

