In this lab, you will be building a sentiment classification model to distinguish between positive and negative movie reviews. You will train it on the IMDB Reviews dataset and visualize the word embeddings generated after training.

In [1]:
import tensorflow_datasets as tfds
import tensorflow as tf
import io

You will load the dataset via TensorFlow Datasets, a collection of prepared datasets for machine learning. If you're running this notebook on your local machine, make sure the tensorflow-datasets package is installed before importing it. You can install it via pip: pip install -q tensorflow-datasets

The tfds.load() method downloads the dataset into your working directory. You can set the with_info parameter to True if you want to see the description of the dataset. The as_supervised parameter, on the other hand, is set to True to load the data as (input, label) pairs.

To ensure smooth operation, the data was pre-downloaded and saved in the data folder. When the data is already downloaded, you can read it by passing two additional arguments: with data_dir="./data/" you specify the folder where the data is located (if different from the default), and by setting download=False you explicitly tell the method to read the data from that folder rather than downloading it.
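For reference, here is a sketch of that call once the data is already in the folder (it mirrors the cell below, only with download switched off):

# Read the pre-downloaded copy instead of downloading it again
# imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True, data_dir="./data/", download=False)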

In [3]:
# Load the IMDB Reviews dataset
# Set download=True on the first run so the data is downloaded into the directory above.
# On later runs, the data can be read directly from that folder by setting download=False.

imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True, data_dir="./data/", download=True)
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to data\imdb_reviews\plain_text\1.0.0...
Dataset imdb_reviews downloaded and prepared to data\imdb_reviews\plain_text\1.0.0. Subsequent calls will reuse this data.
In [5]:
print(info)
tfds.core.DatasetInfo(
    name='imdb_reviews',
    full_name='imdb_reviews/plain_text/1.0.0',
    description="""
    Large Movie Review Dataset. This is a dataset for binary sentiment
    classification containing substantially more data than previous benchmark
    datasets. We provide a set of 25,000 highly polar movie reviews for training,
    and 25,000 for testing. There is additional unlabeled data for use as well.
    """,
    config_description="""
    Plain text
    """,
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    data_dir='data\\imdb_reviews\\plain_text\\1.0.0',
    file_format=tfrecord,
    download_size=Unknown size,
    dataset_size=129.83 MiB,
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
        'text': Text(shape=(), dtype=string),
    }),
    supervised_keys=('text', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=25000, num_shards=1>,
        'train': <SplitInfo num_examples=25000, num_shards=1>,
        'unsupervised': <SplitInfo num_examples=50000, num_shards=1>,
    },
    citation="""@InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
      title     = {Learning Word Vectors for Sentiment Analysis},
      booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
      month     = {June},
      year      = {2011},
      address   = {Portland, Oregon, USA},
      publisher = {Association for Computational Linguistics},
      pages     = {142--150},
      url       = {http://www.aclweb.org/anthology/P11-1015}
    }""",
)

As you can see in the output above, there are a total of 100,000 examples in the dataset, split into train, test, and unsupervised sets. For this lab, you will only use the train and test sets because you will need labeled examples to train your model.

The imdb dataset that you downloaded earlier is a dictionary that maps each split name to a tf.data.Dataset object.

You can preview the raw format of a few examples by using the take() method and iterating over it as shown below:

In [8]:
for example in imdb['train'].take(2):
    print(example)
(<tf.Tensor: shape=(), dtype=string, numpy=b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)

You can see that each example is a 2-element tuple of tensors containing the text first, then the label (shown in the numpy() property). The next code cell assigns the train and test splits to separate variables so you can preprocess the text and feed it to the model later.
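Before that, because each element is a (text, label) pair, you can also unpack it directly in the loop. A minimal sketch (not part of the original lab) that prints the first 50 bytes of one review and its label:

# Unpack the (text, label) tuple and use .numpy() to get plain values
for review, label in imdb['train'].take(1):
    print(review.numpy()[:50], label.numpy())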

In [9]:
# Get the train and test sets
train_dataset, test_dataset = imdb['train'], imdb['test']

Now you can do the text preprocessing steps you've learned: convert the strings to integer sequences, then pad them to a uniform length. The parameters are separated into their own code cell below so they will be easy to tweak later if you want.

In [10]:
# Parameters

VOCAB_SIZE = 10000
MAX_LENGTH = 120
EMBEDDING_DIM = 16
PADDING_TYPE = 'pre'
TRUNC_TYPE = 'post'

An important thing to note here is you should generate the vocabulary based only on the training set. You should not include the test set because that is meant to represent data that the model hasn't seen before. With that, you can expect more unknown tokens (i.e. the value 1) in the integer sequences of the test data. Also for clarity in demonstrating the transformations, you will first separate the reviews and labels. You will see other ways to implement the data pipeline in the next labs.

In [11]:
# Instantiate vectorization layer
vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)

# Get the string inputs and integer outputs of the training set
train_reviews = train_dataset.map(lambda review, label:review)
train_label = train_dataset.map(lambda review, label:label)

# Get the string inputs and integer outputs of the test set
test_reviews = test_dataset.map(lambda review,label:review)
test_label = test_dataset.map(lambda review,label:label)

# Generate the vocabulary based only on the training set
vectorize_layer.adapt(train_reviews)
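As a quick check of the point above about unknown tokens, you can pass the adapted layer a sentence containing a made-up word; anything outside the learned vocabulary maps to the out-of-vocabulary index 1 (the string below is just an illustration, not from the dataset):

# The made-up last word is not in the vocabulary, so it maps to the OOV index 1
print(vectorize_layer(["this movie was zzzmadeupwordzzz"]))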

You will define a padding function to generate the padded sequences. Note that the pad_sequences() function expects an iterable (e.g. list) while the input to this function is a tf.data.Dataset. Here's one way to do the conversion:

  • Put all the elements in a single batch. The function below uses padded_batch(), which pads every element in the batch to the length of the longest one.
    • You will need to specify the batch size, and it has to match the number of elements in the dataset. From the output of the dataset info earlier, you know that this should be 25000.
    • Instead of hardcoding that number, you can also use the cardinality() method. This computes the number of elements in a tf.data.Dataset.
  • Use the get_single_element() method on the single batch to output a Tensor.
  • Convert back to a tf.data.Dataset. You'll see why this is needed in the next cell.
In [14]:
def padding_func(sequences):
    # Put all the elements in a single padded batch
    sequences = sequences.padded_batch(batch_size=tf.data.experimental.cardinality(sequences).numpy(),
                                       padded_shapes=[None])
    
    # Output a tensor from a single batch
    sequences = sequences.get_single_element()
    
    # Pad the sequences
    padded_sequences = tf.keras.utils.pad_sequences(sequences.numpy(),
                                                    maxlen=MAX_LENGTH,
                                                    truncating=TRUNC_TYPE,
                                                    padding=PADDING_TYPE)
    
    # Convert back to tf.data.Dataset
    padded_sequences=tf.data.Dataset.from_tensor_slices(padded_sequences)
    
    return padded_sequences

A note on the batching method: ragged_batch() would keep the variable-length sequences as a tf.RaggedTensor, while padded_batch() pads them to a uniform length and outputs a dense tensor that can be passed straight to .numpy() and pad_sequences(). Either works for collecting the dataset into a single batch; this notebook uses padded_batch().
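You can see the difference on a toy dataset. This standalone sketch assumes a TensorFlow version where Dataset.ragged_batch() is available:

# Toy dataset with elements of length 1, 2, and 3
toy = tf.data.Dataset.range(1, 4, output_type=tf.int32).map(lambda x: tf.fill([x], x))

# ragged_batch() keeps the variable lengths: <tf.RaggedTensor [[1], [2, 2], [3, 3, 3]]>
print(next(iter(toy.ragged_batch(3))))

# padded_batch() pads every element to the longest one: [[1 0 0] [2 2 0] [3 3 3]]
print(next(iter(toy.padded_batch(3, padded_shapes=[None]))))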

This is the pipeline to convert the raw string inputs to padded integer sequences:

  • Use the map() method to pass each string to the TextVectorization layer defined earlier.
  • Use the apply() method to use the padding function on the entire dataset.

The difference between map() and apply() is that the mapping function in map() expects its input to be single elements (i.e. element-wise transformations), while the transformation function in apply() expects its input to be the entire dataset in the pipeline.
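A toy illustration of that difference (a sketch, not part of the lab):

ds = tf.data.Dataset.range(5)

# map(): the function receives one element at a time
doubled = ds.map(lambda x: x * 2)
print(list(doubled.as_numpy_iterator()))      # [0, 2, 4, 6, 8]

# apply(): the function receives the whole dataset and returns a new dataset
first_three = ds.apply(lambda dataset: dataset.take(3))
print(list(first_three.as_numpy_iterator()))  # [0, 1, 2]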

In [15]:
# Apply the layer to the train and test data
train_sequences = train_reviews.map(lambda text: vectorize_layer(text)).apply(padding_func)
test_sequences = test_reviews.map(lambda text: vectorize_layer(text)).apply(padding_func)
In [16]:
# View 2 training sequences
for example in train_sequences.take(2):
  print(example)
  print()
tf.Tensor(
[  11   14   34  412  384   18   90   28    1    8   33 1320 3555   42
  487    1  191   24   85  152   19   11  217  317   28   65  240  215
    8  489   54   65   85  112   96   22 5652   11   93  639  741   11
   18    7   34  394 9515  170 2464  408    2   88 1216  137   66  144
   51    2    1 7552   66  245   65 2867   16    1 2858    1    1 1428
 5045    3   40    1 1581   17 3555   14  158   19    4 1216  890 8030
    8    4   18   12   14 4054    5   99  146 1240   10  237  707   12
   48   24   93   39   11 7329  152   39 1320    1   50  398   10   96
 1155  850  141    9    0    0    0    0], shape=(120,), dtype=int32)

tf.Tensor(
[  10   26   75  617    6  777 2355  299   95   19   11    7  603  662
    6    4 2128    5  180  571   63 1404  107 2408    3 3902   21    2
    1    3  253   41 4777    4  169  186   21   11 4254   10 1503 2355
   80    2   20   14 1971    2  114  942   14 1737 1297  593    3  356
  180  445    6  597   19   17   57 1772    5   49   14 3997   98   42
  134   10  933   10  194   26 1027  171    5    2   20   19   10  284
    2 2065    5    9    3  279   41  445    6  597    5   30  200    1
  201   99  146 4522   16  229  329   10  175  369   11   20   31   32
    0    0    0    0    0    0    0    0], shape=(120,), dtype=int32)

In [24]:
# Recombine sequences with labels -
# Zipping requires both to be in tf.data.Dataset format

train_dataset_vectorized = tf.data.Dataset.zip((train_sequences, train_label))
test_dataset_vectorized = tf.data.Dataset.zip((test_sequences, test_label))
In [25]:
# View 2 training sequences and their labels
for example in train_dataset_vectorized.take(2):
  print(example)
  print()
(<tf.Tensor: shape=(120,), dtype=int32, numpy=
array([  11,   14,   34,  412,  384,   18,   90,   28,    1,    8,   33,
       1320, 3555,   42,  487,    1,  191,   24,   85,  152,   19,   11,
        217,  317,   28,   65,  240,  215,    8,  489,   54,   65,   85,
        112,   96,   22, 5652,   11,   93,  639,  741,   11,   18,    7,
         34,  394, 9515,  170, 2464,  408,    2,   88, 1216,  137,   66,
        144,   51,    2,    1, 7552,   66,  245,   65, 2867,   16,    1,
       2858,    1,    1, 1428, 5045,    3,   40,    1, 1581,   17, 3555,
         14,  158,   19,    4, 1216,  890, 8030,    8,    4,   18,   12,
         14, 4054,    5,   99,  146, 1240,   10,  237,  707,   12,   48,
         24,   93,   39,   11, 7329,  152,   39, 1320,    1,   50,  398,
         10,   96, 1155,  850,  141,    9,    0,    0,    0,    0])>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)

(<tf.Tensor: shape=(120,), dtype=int32, numpy=
array([  10,   26,   75,  617,    6,  777, 2355,  299,   95,   19,   11,
          7,  603,  662,    6,    4, 2128,    5,  180,  571,   63, 1404,
        107, 2408,    3, 3902,   21,    2,    1,    3,  253,   41, 4777,
          4,  169,  186,   21,   11, 4254,   10, 1503, 2355,   80,    2,
         20,   14, 1971,    2,  114,  942,   14, 1737, 1297,  593,    3,
        356,  180,  445,    6,  597,   19,   17,   57, 1772,    5,   49,
         14, 3997,   98,   42,  134,   10,  933,   10,  194,   26, 1027,
        171,    5,    2,   20,   19,   10,  284,    2, 2065,    5,    9,
          3,  279,   41,  445,    6,  597,    5,   30,  200,    1,  201,
         99,  146, 4522,   16,  229,  329,   10,  175,  369,   11,   20,
         31,   32,    0,    0,    0,    0,    0,    0,    0,    0])>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)

In [27]:
# Lastly, you will optimize and batch the dataset

SHUFFLE_BATCH_SIZE = 1000               # buffer size used by shuffle()
PREFETCH_BATCH_SIZE = tf.data.AUTOTUNE  # let tf.data tune the prefetch buffer size
BATCH_SIZE = 32                         # number of examples per batch

train_dataset_final = (train_dataset_vectorized.cache()
                      .shuffle(SHUFFLE_BATCH_SIZE)
                      .prefetch(PREFETCH_BATCH_SIZE)
                      .batch(BATCH_SIZE))

test_dataset_final = (test_dataset_vectorized.cache()
                     .prefetch(PREFETCH_BATCH_SIZE)
                     .batch(BATCH_SIZE))

With the data already preprocessed, you can proceed to building your sentiment classification model. The input will be an Embedding layer. The main idea here is to represent each word in your vocabulary with a vector. These vectors have trainable weights, so as your neural network learns, words that are most likely to appear in a positive review will converge towards similar weights. Similarly, words in negative reviews will be clustered more closely together. You can read more about word embeddings in the TensorFlow Word Embeddings guide.
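To make the idea concrete, an Embedding layer is just a trainable lookup table from integer token indices to vectors. A minimal standalone sketch (the sizes here are made up for illustration and are unrelated to the model below):

# A lookup table with 10 possible indices, each mapped to a trainable 4-dimensional vector
demo_embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=4)

# A batch of one sequence with 3 token indices -> output shape (batch, sequence, embedding_dim) = (1, 3, 4)
print(demo_embedding(tf.constant([[1, 2, 3]])).shape)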

After the Embedding layer, you will flatten its output and feed it into a Dense layer. You will explore other architectures for these hidden layers in the next labs.

The output layer is a single neuron with a sigmoid activation to distinguish between the 2 classes. As is typical with binary classifiers, you will use binary_crossentropy as your loss function while training.

In [28]:
# Build the model
model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LENGTH,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Setup the training parameters
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

# Print the model summary
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 120, 16)           160000    
                                                                 
 flatten (Flatten)           (None, 1920)              0         
                                                                 
 dense (Dense)               (None, 6)                 11526     
                                                                 
 dense_1 (Dense)             (None, 1)                 7         
                                                                 
=================================================================
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________
In [29]:
NUM_EPOCHS = 5

# Train the model
model.fit(train_dataset_final, epochs=NUM_EPOCHS, validation_data=test_dataset_final)
Epoch 1/5
782/782 [==============================] - 15s 7ms/step - loss: 0.5048 - accuracy: 0.7288 - val_loss: 0.3851 - val_accuracy: 0.8261
Epoch 2/5
782/782 [==============================] - 5s 6ms/step - loss: 0.2394 - accuracy: 0.9054 - val_loss: 0.4269 - val_accuracy: 0.8194
Epoch 3/5
782/782 [==============================] - 5s 6ms/step - loss: 0.0943 - accuracy: 0.9745 - val_loss: 0.5099 - val_accuracy: 0.8128
Epoch 4/5
782/782 [==============================] - 4s 6ms/step - loss: 0.0234 - accuracy: 0.9965 - val_loss: 0.5992 - val_accuracy: 0.8093
Epoch 5/5
782/782 [==============================] - 4s 5ms/step - loss: 0.0072 - accuracy: 0.9992 - val_loss: 0.6804 - val_accuracy: 0.8055
Out[29]:
<keras.callbacks.History at 0x1940de217b0>

After training, you can visualize the trained weights in the Embedding layer to see words that are clustered together. The TensorFlow Embedding Projector can reduce the 16-dimensional vectors you defined earlier into fewer components so they can be plotted in the projector. First, you will need to get these weights, and you can do that with the cell below:

In [30]:
# Get the embedding layer from the model (i.e. first layer)
embedding_layer = model.layers[0]

# Get the weights of the embedding layer
embedding_weights = embedding_layer.get_weights()[0]

# Print the shape. Expected is (vocab_size, embedding_dim)
print(embedding_weights.shape)
(10000, 16)

You will need to generate two files:

  • vecs.tsv - contains the vector weights of each word in the vocabulary
  • meta.tsv - contains the words in the vocabulary

You will get the word list from the TextVectorization layer you adapted earlier, then start the loop to generate the files. You will loop vocab_size-1 times, skipping index 0 because it is reserved for padding.

In [31]:
# Open writeable files
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')

# Get the word list
vocabulary = vectorize_layer.get_vocabulary()

# Initialize the loop. Start counting at `1` because `0` is just for the padding
for word_num in range(1, len(vocabulary)):

  # Get the word associated with the current index
  word_name = vocabulary[word_num]

  # Get the embedding weights associated with the current index
  word_embedding = embedding_weights[word_num]

  # Write the word name
  out_m.write(word_name + "\n")

  # Write the word embedding
  out_v.write('\t'.join([str(x) for x in word_embedding]) + "\n")

# Close the files
out_v.close()
out_m.close()
In [32]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
In [33]:
# Reduce dimensions to 2D for visualization
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedding_2d_tsne = tsne.fit_transform(embedding_weights) 
In [ ]:
pca = PCA(n_components=2)
embedding_2d_pca = pca.fit_transform(embedding_weights)
In [38]:
plt.figure(figsize=(12, 8))

# Plot the t-SNE projection computed above (swap in embedding_2d_pca to compare with PCA)
sns.scatterplot(x=embedding_2d_tsne[:, 0], y=embedding_2d_tsne[:, 1], alpha=0.6)

# Annotate some words (subset for clarity)
num_words_to_label = 200  # Adjust based on readability
for i in range(num_words_to_label):
    plt.text(embedding_2d_tsne[i, 0], embedding_2d_tsne[i, 1], vocabulary[i], fontsize=9)

plt.title("Word Embedding Visualization")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.show()
[Output: scatter plot titled "Word Embedding Visualization"]
In [37]:
positive_words = {"good", "great", "excellent", "love", "amazing"}
negative_words = {"bad", "terrible", "worst", "hate", "awful"}

plt.figure(figsize=(12, 8))

for i, word in enumerate(vocabulary[:300]):  # Plot only first 300 words for readability
    color = "blue" if word in positive_words else "red" if word in negative_words else "gray"
    plt.scatter(embedding_2d_tsne[i, 0], embedding_2d_tsne[i, 1], color=color, alpha=0.6)
    if i < 100:  # Annotate only a few for clarity
        plt.text(embedding_2d_tsne[i, 0], embedding_2d_tsne[i, 1], word, fontsize=9)

plt.title("Word Embedding Clusters (Sentiment-based)")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.show()
[Output: scatter plot titled "Word Embedding Clusters (Sentiment-based)"]