[K3IMDB4] - Reload embedded vectors¶
Retrieving embedded vectors from our trained model, using Keras 3 and PyTorch
Objectives :¶
- The objective is to retrieve and visualize our embedded vectors
- For this, we will use our previously saved model.
What we're going to do :¶
- Retrieve our saved model
- Extract vectors and play with them
Step 1 - Init python stuff¶
In [1]:
import os
os.environ['KERAS_BACKEND'] = 'torch'
import keras
import json,re
import numpy as np
import fidle
# Init Fidle environment
run_id, run_dir, datasets_dir = fidle.init('K3IMDB4')
FIDLE - Environment initialization
Version              : 2.3.0
Run id               : K3IMDB4
Run dir              : ./run/K3IMDB4
Datasets dir         : /gpfswork/rech/mlh/uja62cb/fidle-project/datasets-fidle
Start time           : 03/03/24 21:11:35
Hostname             : r6i0n6 (Linux)
Tensorflow log level : Warning + Error (=1)
Update keras cache   : False
Update torch cache   : False
Save figs            : ./run/K3IMDB4/figs (True)
keras                : 3.0.4
numpy                : 1.24.4
sklearn              : 1.3.2
yaml                 : 6.0.1
matplotlib           : 3.8.2
pandas               : 2.1.3
torch                : 2.1.1
1.2 - Parameters¶
The words in the vocabulary are ranked from the most frequent to the rarest.
vocab_size
is the number of words we keep in our vocabulary (all other words are treated as unknown).
review_len
is the fixed length of each review.
saved_models
is where our models were previously saved.
dictionaries_dir
is where our dictionaries were previously saved (./data is a good choice).
In [2]:
vocab_size = 5000
review_len = 256
saved_models = './run/K3IMDB2'
dictionaries_dir = './data'
Override parameters (batch mode) - Just forget this cell
In [3]:
fidle.override('vocab_size', 'review_len', 'saved_models', 'dictionaries_dir')
Step 2 - Get the embedding vectors !¶
2.1 - Load model and dictionaries¶
Note : This dictionary was generated by the 02-Embedding-Keras notebook.
In [4]:
model = keras.models.load_model(f'{saved_models}/models/best_model.keras')
print('Model loaded.')
with open(f'{dictionaries_dir}/word_index.json', 'r') as fp:
word_index = json.load(fp)
index_word = { i:w for w,i in word_index.items() }
print('Dictionaries loaded. ', len(word_index), 'entries' )
Model loaded.
Dictionaries loaded.  88588 entries
2.2 - Retrieve embeddings¶
In [5]:
embeddings = model.layers[0].get_weights()[0]
print('Shape of embeddings : ',embeddings.shape)
Shape of embeddings : (5000, 32)
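Here the Embedding layer is assumed to be model.layers[0], which matches the model built in the previous notebook. If in doubt, a minimal sketch to locate it by type rather than by position:
# Locate the Embedding layer by type instead of assuming it is layers[0] (sketch)
embedding_layer = next(l for l in model.layers if isinstance(l, keras.layers.Embedding))
embeddings      = embedding_layer.get_weights()[0]
print('Shape of embeddings : ', embeddings.shape)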
2.3 - Build a nice dictionary¶
In [6]:
word_embedding = { index_word[i]:embeddings[i] for i in range(vocab_size) }
In [7]:
word_embedding['nice']
Out[7]:
array([-0.19265787, -0.1782463 , -0.12522878, 0.13078946, -0.18754876, -0.21244262, -0.22118522, 0.19628656, -0.12525214, 0.16055878, 0.14045134, 0.12333236, -0.15863857, 0.19821374, -0.19368635, 0.19027653, 0.16695064, 0.14144616, -0.1473321 , 0.20249814, -0.18405882, 0.13139895, 0.12899621, -0.14405546, 0.15086392, -0.16722818, -0.16204427, -0.11995099, 0.18977174, 0.11766762, -0.18468359, -0.15323788], dtype=float32)
A few useful functions to play with¶
In [8]:
# Return the L2 distance between 2 words
#
def l2w(w1, w2):
    v1 = word_embedding[w1]
    v2 = word_embedding[w2]
    return np.linalg.norm(v2 - v1)

# Show the L2 distance between 2 words
#
def show_l2(w1, w2):
    print(f'\nL2 between [{w1}] and [{w2}] : ', l2w(w1, w2))

# Display the 14 closest words to a given word
# (search is limited to the 1000 most frequent words; indices 0-3 are special tokens)
#
def neighbors(w1):
    dd = {}
    for i in range(4, 1000):
        w2 = index_word[i]
        dd[w2] = l2w(w1, w2)
    dd = {k: v for k, v in sorted(dd.items(), key=lambda item: item[1])}
    # entry [0] is the word itself (distance 0), so we keep entries 1..14
    print(f'\nNeighbors of [{w1}] : ', list(dd.keys())[1:15])
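Besides the L2 distance used above, cosine similarity is another common way to compare embedding vectors. A minimal sketch, where the cosine helper is ours and not part of the notebook:
# Cosine similarity between two word vectors (1.0 = same direction, 0.0 = orthogonal)
def cosine(w1, w2):
    v1, v2 = word_embedding[w1], word_embedding[w2]
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print('cosine(nice, pleasant) : ', cosine('nice', 'pleasant'))
print('cosine(nice, horrible) : ', cosine('nice', 'horrible'))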
Examples¶
In [9]:
show_l2('nice', 'pleasant')
show_l2('nice', 'horrible')
neighbors('horrible')
neighbors('great')
L2 between [nice] and [pleasant] :  0.5500398
L2 between [nice] and [horrible] :  3.900939

Neighbors of [horrible] :  ['save', 'annoying', 'dull', 'mess', 'terrible', 'ridiculous', 'badly', 'poor', 'avoid', 'worse', 'fails', 'boring', 'predictable', 'lame']

Neighbors of [great] :  ['fantastic', 'enjoyed', 'definitely', '9', 'enjoyable', 'brilliant', 'fun', 'loved', 'masterpiece', 'surprised', 'wonderful', 'highly', 'hilarious', 'amazing']
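To visualize the vectors, as stated in the objectives, one option is to project them to 2D. A minimal sketch using sklearn's PCA and matplotlib (both listed in the environment above); the choice of the 100 most frequent words is arbitrary:
# Project the embeddings of the 100 most frequent words to 2D and plot them
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words   = [index_word[i] for i in range(4, 104)]      # skip the special tokens (indices 0-3)
vectors = np.array([word_embedding[w] for w in words])
xy      = PCA(n_components=2).fit_transform(vectors)

plt.figure(figsize=(12, 8))
plt.scatter(xy[:, 0], xy[:, 1], s=10)
for (x, y), w in zip(xy, words):
    plt.annotate(w, (x, y), fontsize=8)
plt.title('IMDB word embeddings - PCA projection')
plt.show()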
In [10]:
fidle.end()
End time : 03/03/24 21:11:35
Duration : 00:00:00 386ms
This notebook ends here :-)
https://fidle.cnrs.fr