Importing Libraries
In [1]:
import io
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pickle
Defining the global variables
In [4]:
VOCAB_SIZE = 1000
EMBEDDING_DIM = 16
MAX_LENGTH = 120
TRAINING_SPLIT = 0.8
Loading and exploring data
In [5]:
data_dir = "bbc-text.csv"
data = np.loadtxt(data_dir, delimiter=',', skiprows=1, dtype='str', comments=None)
print(f"Shape of the data: {data.shape}")
print(f"{data[0]}\n{data[1]}")
Shape of the data: (2225, 2) ['tech' 'tv future in the hands of viewers with home theatre systems plasma high-definition tvs and digital video recorders moving into the living room the way people watch tv will be radically different in five years time. that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend programmes and other content will be delivered to viewers via home networks through cable satellite telecoms companies and broadband service providers to front rooms and portable devices. one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes like the us s tivo and the uk s sky+ system allow people to record store play pause and forward wind tv programmes when they want. essentially the technology allows for much more personalised tv. they are also being built-in to high-definition tv sets which are big business in japan and the us but slower to take off in europe because of the lack of high-definition programming. not only can people forward wind through adverts they can also forget about abiding by network and channel schedules putting together their own a-la-carte entertainment. but some us networks and cable and satellite companies are worried about what it means for them in terms of advertising revenues as well as brand identity and viewer loyalty to channels. although the us leads in this technology at the moment it is also a concern that is being raised in europe particularly with the growing uptake of services like sky+. what happens here today we will see in nine months to a years time in the uk adam hume the bbc broadcast s futurologist told the bbc news website. for the likes of the bbc there are no issues of lost advertising revenue yet. it is a more pressing issue at the moment for commercial uk broadcasters but brand loyalty is important for everyone. we will be talking more about content brands rather than network brands said tim hanlon from brand communications firm starcom mediavest. the reality is that with broadband connections anybody can be the producer of content. he added: the challenge now is that it is hard to promote a programme with so much choice. what this means said stacey jolna senior vice president of tv guide tv group is that the way people find the content they want to watch has to be simplified for tv viewers. it means that networks in us terms or channels could take a leaf out of google s book and be the search engine of the future instead of the scheduler to help people find what they want to watch. this kind of channel model might work for the younger ipod generation which is used to taking control of their gadgets and what they play on them. but it might not suit everyone the panel recognised. older generations are more comfortable with familiar schedules and channel brands because they know what they are getting. they perhaps do not want so much of the choice put into their hands mr hanlon suggested. on the other end you have the kids just out of diapers who are pushing buttons already - everything is possible and available to them said mr hanlon. ultimately the consumer will tell the market they want. of the 50 000 new gadgets and technologies being showcased at ces many of them are about enhancing the tv-watching experience. 
high-definition tv sets are everywhere and many new models of lcd (liquid crystal display) tvs have been launched with dvr capability built into them instead of being external boxes. one such example launched at the show is humax s 26-inch lcd tv with an 80-hour tivo dvr and dvd recorder. one of the us s biggest satellite tv companies directtv has even launched its own branded dvr at the show with 100-hours of recording capability instant replay and a search function. the set can pause and rewind tv for up to 90 hours. and microsoft chief bill gates announced in his pre-show keynote speech a partnership with tivo called tivotogo which means people can play recorded programmes on windows pcs and mobile devices. all these reflect the increasing trend of freeing up multimedia so that people can watch what they want when they want.'] ['business' 'worldcom boss left books alone former worldcom boss bernie ebbers who is accused of overseeing an $11bn (£5.8bn) fraud never made accounting decisions a witness has told jurors. david myers made the comments under questioning by defence lawyers who have been arguing that mr ebbers was not responsible for worldcom s problems. the phone company collapsed in 2002 and prosecutors claim that losses were hidden to protect the firm s shares. mr myers has already pleaded guilty to fraud and is assisting prosecutors. on monday defence lawyer reid weingarten tried to distance his client from the allegations. during cross examination he asked mr myers if he ever knew mr ebbers make an accounting decision . not that i am aware of mr myers replied. did you ever know mr ebbers to make an accounting entry into worldcom books mr weingarten pressed. no replied the witness. mr myers has admitted that he ordered false accounting entries at the request of former worldcom chief financial officer scott sullivan. defence lawyers have been trying to paint mr sullivan who has admitted fraud and will testify later in the trial as the mastermind behind worldcom s accounting house of cards. mr ebbers team meanwhile are looking to portray him as an affable boss who by his own admission is more pe graduate than economist. whatever his abilities mr ebbers transformed worldcom from a relative unknown into a $160bn telecoms giant and investor darling of the late 1990s. worldcom s problems mounted however as competition increased and the telecoms boom petered out. when the firm finally collapsed shareholders lost about $180bn and 20 000 workers lost their jobs. mr ebbers trial is expected to last two months and if found guilty the former ceo faces a substantial jail sentence. he has firmly declared his innocence.']
In [6]:
print(f"There are {len(data)} sentence-label pairs in the dataset.\n")
print(f"First sentence has {len((data[0,1]).split())} words.\n")
print(f"The first 5 labels are {data[:5,0]}")
There are 2225 sentence-label pairs in the dataset.

First sentence has 737 words.

The first 5 labels are ['tech' 'business' 'sport' 'sport' 'entertainment']
Splitting into Training and Validation Datasets
In [7]:
def train_val_datasets(data):
    '''
    Splits data into training and validation sets

    Args:
        data (np.array): array with two columns, the first one is the label, the second is the text

    Returns:
        (tf.data.Dataset, tf.data.Dataset): tuple containing the train and validation datasets
    '''
    # Compute the number of sentences that will be used for training (should be an integer)
    train_size = int(len(data) * TRAINING_SPLIT)

    # Slice the dataset to get only the texts. Remember that texts are in the second column
    texts = data[:, 1]

    # Slice the dataset to get only the labels. Remember that labels are in the first column
    labels = data[:, 0]

    # Split the sentences and labels into train/validation splits
    train_texts = texts[:train_size]
    validation_texts = texts[train_size:]
    train_labels = labels[:train_size]
    validation_labels = labels[train_size:]

    # Create the train and validation datasets from the splits
    train_dataset = tf.data.Dataset.from_tensor_slices((train_texts, train_labels))
    validation_dataset = tf.data.Dataset.from_tensor_slices((validation_texts, validation_labels))

    return train_dataset, validation_dataset
In [8]:
# Create the datasets
train_dataset, validation_dataset = train_val_datasets(data)
print(f"There are {train_dataset.cardinality()} sentence-label pairs for training.\n")
print(f"There are {validation_dataset.cardinality()} sentence-label pairs for validation.\n")
There are 1780 sentence-label pairs for training.

There are 445 sentence-label pairs for validation.
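As an optional sanity check (not part of the original flow), you can peek at a single sentence-label pair from the training dataset:

# Peek at one (text, label) pair from the training dataset (illustrative check only)
for text, label in train_dataset.take(1):
    print(label.numpy(), text.numpy()[:80])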
Standardizing (cleaning) and vectorizing the text
In [9]:
def standardize_func(sentence):
    # List of stopwords
    stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "her", "here", "hers", "herself", "him", "himself", "his", "how", "i", "if", "in", "into", "is", "it", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "should", "so", "some", "such", "than", "that", "the", "their", "theirs", "them", "themselves", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "were", "what", "when", "where", "which", "while", "who", "whom", "why", "why", "with", "would", "you", "your", "yours", "yourself", "yourselves", "'m", "'d", "'ll", "'re", "'ve", "'s", "'d"]

    # Convert the sentence to lowercase
    sentence = tf.strings.lower(sentence)

    # Remove stopwords
    for word in stopwords:
        if word[0] == "'":
            sentence = tf.strings.regex_replace(sentence, rf"{word}\b", "")
        else:
            sentence = tf.strings.regex_replace(sentence, rf"\b{word}\b", "")

    # Remove punctuation
    sentence = tf.strings.regex_replace(sentence, r'[!"#$%&()\*\+,-\./:;<=>?@\[\\\]^_`{|}~\']', "")

    return sentence
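As a quick, optional check of the standardizer, you can run it on a short made-up sentence; the exact result depends on the stopword list above, but stopwords and punctuation should disappear and the text should be lowercased.

# Try the standardizer on an illustrative sentence (not part of the original assignment)
example = tf.constant("The TV is in the living room!")
print(standardize_func(example).numpy())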
In [10]:
def fit_vectorizer(train_sentences, standardize_func):
    '''
    Defines and adapts the text vectorizer

    Args:
        train_sentences (tf.data.Dataset): sentences from the train dataset to fit the TextVectorization layer
        standardize_func (FunctionType): function to remove stopwords and punctuation, and lowercase texts.

    Returns:
        TextVectorization: adapted instance of TextVectorization layer
    '''
    # Instantiate the TextVectorization class with the custom standardizer, vocabulary size and sequence length
    vectorizer = tf.keras.layers.TextVectorization(
        standardize=standardize_func,
        max_tokens=VOCAB_SIZE,
        output_sequence_length=MAX_LENGTH
    )

    # Adapt the vectorizer to the training sentences
    vectorizer.adapt(train_sentences)

    return vectorizer
In [11]:
# Create the vectorizer
text_only_dataset = train_dataset.map(lambda text, label: text)
vectorizer = fit_vectorizer(text_only_dataset, standardize_func)
vocab_size = vectorizer.vocabulary_size()
print(f"Vocabulary contains {vocab_size} words\n")
Vocabulary contains 1000 words
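To see what the adapted vectorizer actually learned, you can optionally inspect the most frequent tokens; index 0 is reserved for padding and index 1 for the out-of-vocabulary token.

# Inspect the first few entries of the learned vocabulary (illustrative check)
print(vectorizer.get_vocabulary()[:10])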
Encoding the labels
In [12]:
def fit_label_encoder(train_labels, validation_labels):
    """Creates an instance of a StringLookup layer and adapts it on all labels

    Args:
        train_labels (tf.data.Dataset): dataset of train labels
        validation_labels (tf.data.Dataset): dataset of validation labels

    Returns:
        tf.keras.layers.StringLookup: adapted encoder for train and validation labels
    """
    # Join the two label datasets
    labels = train_labels.concatenate(validation_labels)

    # Instantiate the StringLookup layer. No OOV tokens are needed since every label is known
    label_encoder = tf.keras.layers.StringLookup(oov_token='[UNK]', num_oov_indices=0)

    # Adapt the StringLookup layer on the combined labels
    label_encoder.adapt(labels)

    return label_encoder
In [13]:
# Create the label encoder
train_labels_only = train_dataset.map(lambda text, label: label)
validation_labels_only = validation_dataset.map(lambda text, label: label)
label_encoder = fit_label_encoder(train_labels_only, validation_labels_only)
print(f'Unique labels: {label_encoder.get_vocabulary()}')
Unique labels: ['sport', 'business', 'politics', 'tech', 'entertainment']
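As an optional check, encoding a couple of label strings should return integer ids between 0 and 4, in the same order as get_vocabulary() above.

# Encode two example labels into integer ids (illustrative check)
print(label_encoder(tf.constant(["sport", "tech"])).numpy())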
Preprocessing the dataset for model training
In [14]:
def preprocess_dataset(dataset, text_vectorizer, label_encoder):
    """Apply the preprocessing to a dataset

    Args:
        dataset (tf.data.Dataset): dataset to preprocess
        text_vectorizer (tf.keras.layers.TextVectorization): text vectorizer
        label_encoder (tf.keras.layers.StringLookup): label encoder

    Returns:
        tf.data.Dataset: transformed dataset
    """
    # Convert the dataset sentences to sequences, and encode the text labels
    dataset = dataset.map(
        lambda text, label: (text_vectorizer(text), label_encoder(label))
    )

    # Set a batch size of 32
    dataset = dataset.batch(32)

    return dataset
In [15]:
# Preprocess your dataset
train_proc_dataset = preprocess_dataset(train_dataset, vectorizer, label_encoder)
validation_proc_dataset = preprocess_dataset(validation_dataset, vectorizer, label_encoder)
print(f"Number of batches in the train dataset: {train_proc_dataset.cardinality()}")
print(f"Number of batches in the validation dataset: {validation_proc_dataset.cardinality()}")
Number of batches in the train dataset: 56
Number of batches in the validation dataset: 14
In [16]:
train_batch = next(train_proc_dataset.as_numpy_iterator())
validation_batch = next(validation_proc_dataset.as_numpy_iterator())
print(f"Shape of the train batch: {train_batch[0].shape}")
print(f"Shape of the validation batch: {validation_batch[0].shape}")
Shape of the train batch: (32, 120)
Shape of the validation batch: (32, 120)
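If you want to eyeball what the model will actually see, you can optionally map the token ids of the first training example back to words using the vectorizer's vocabulary; padding positions show up as empty strings.

# Decode the first vectorized training example back to tokens (illustrative check)
vocab = vectorizer.get_vocabulary()
first_sequence = train_batch[0][0]
print(" ".join(vocab[idx] for idx in first_sequence[:20]))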
Model Creation
In [17]:
# GRADED FUNCTION: create_model
def create_model():
    """
    Creates a text classifier model

    Returns:
        tf.keras Model: the text classifier model
    """
    # Define the model
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(MAX_LENGTH,)),
        tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM),  # Embedding layer with VOCAB_SIZE (1,000) tokens and output dimension 16
        tf.keras.layers.GlobalAveragePooling1D(),  # GlobalAveragePooling1D layer
        tf.keras.layers.Dense(16, activation='relu'),  # Hidden Dense layer with ReLU activation
        tf.keras.layers.Dense(5, activation='softmax')  # Output Dense layer with 5 units and softmax activation
    ])

    # Compile the model with an appropriate loss, optimizer and metrics
    model.compile(
        loss='sparse_categorical_crossentropy',  # Loss for integer-encoded multi-class labels
        optimizer='adam',  # Adam optimizer
        metrics=['accuracy']  # Metric for evaluation
    )

    return model
In [18]:
# Get the untrained model
model = create_model()
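Before training, it can be useful to confirm the architecture and parameter counts (an optional step, not required by the rest of the notebook):

# Print a summary of the layers and parameter counts
model.summary()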
In [19]:
example_batch = train_proc_dataset.take(1)

try:
    model.evaluate(example_batch, verbose=False)
except:
    print("Your model is not compatible with the dataset you defined earlier. Check that the loss function and last layer are compatible with one another.")
else:
    predictions = model.predict(example_batch, verbose=False)
    print(f"predictions have shape: {predictions.shape}")
predictions have shape: (32, 5)
In [20]:
history = model.fit(train_proc_dataset, epochs=30, validation_data=validation_proc_dataset)
Epoch 1/30
56/56 [==============================] - 3s 34ms/step - loss: 1.6002 - accuracy: 0.2287 - val_loss: 1.5867 - val_accuracy: 0.2921
Epoch 2/30
56/56 [==============================] - 2s 34ms/step - loss: 1.5657 - accuracy: 0.3573 - val_loss: 1.5304 - val_accuracy: 0.4090
Epoch 3/30
56/56 [==============================] - 2s 35ms/step - loss: 1.4752 - accuracy: 0.4584 - val_loss: 1.4097 - val_accuracy: 0.5169
Epoch 4/30
56/56 [==============================] - 2s 34ms/step - loss: 1.3158 - accuracy: 0.5489 - val_loss: 1.2291 - val_accuracy: 0.5888
Epoch 5/30
56/56 [==============================] - 2s 34ms/step - loss: 1.1124 - accuracy: 0.6472 - val_loss: 1.0352 - val_accuracy: 0.7191
Epoch 6/30
56/56 [==============================] - 2s 33ms/step - loss: 0.9138 - accuracy: 0.7978 - val_loss: 0.8637 - val_accuracy: 0.8202
Epoch 7/30
56/56 [==============================] - 2s 34ms/step - loss: 0.7398 - accuracy: 0.8916 - val_loss: 0.7181 - val_accuracy: 0.8854
Epoch 8/30
56/56 [==============================] - 2s 33ms/step - loss: 0.5916 - accuracy: 0.9303 - val_loss: 0.5962 - val_accuracy: 0.9056
Epoch 9/30
56/56 [==============================] - 2s 36ms/step - loss: 0.4698 - accuracy: 0.9534 - val_loss: 0.4981 - val_accuracy: 0.9124
Epoch 10/30
56/56 [==============================] - 2s 35ms/step - loss: 0.3739 - accuracy: 0.9624 - val_loss: 0.4219 - val_accuracy: 0.9146
Epoch 11/30
56/56 [==============================] - 2s 35ms/step - loss: 0.3013 - accuracy: 0.9646 - val_loss: 0.3646 - val_accuracy: 0.9258
Epoch 12/30
56/56 [==============================] - 2s 35ms/step - loss: 0.2469 - accuracy: 0.9730 - val_loss: 0.3219 - val_accuracy: 0.9303
Epoch 13/30
56/56 [==============================] - 2s 36ms/step - loss: 0.2060 - accuracy: 0.9770 - val_loss: 0.2899 - val_accuracy: 0.9371
Epoch 14/30
56/56 [==============================] - 2s 34ms/step - loss: 0.1747 - accuracy: 0.9792 - val_loss: 0.2657 - val_accuracy: 0.9438
Epoch 15/30
56/56 [==============================] - 2s 34ms/step - loss: 0.1502 - accuracy: 0.9820 - val_loss: 0.2470 - val_accuracy: 0.9438
Epoch 16/30
56/56 [==============================] - 2s 35ms/step - loss: 0.1306 - accuracy: 0.9848 - val_loss: 0.2323 - val_accuracy: 0.9438
Epoch 17/30
56/56 [==============================] - 2s 34ms/step - loss: 0.1145 - accuracy: 0.9860 - val_loss: 0.2207 - val_accuracy: 0.9438
Epoch 18/30
56/56 [==============================] - 2s 34ms/step - loss: 0.1011 - accuracy: 0.9888 - val_loss: 0.2114 - val_accuracy: 0.9438
Epoch 19/30
56/56 [==============================] - 2s 32ms/step - loss: 0.0897 - accuracy: 0.9899 - val_loss: 0.2039 - val_accuracy: 0.9438
Epoch 20/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0800 - accuracy: 0.9933 - val_loss: 0.1977 - val_accuracy: 0.9461
Epoch 21/30
56/56 [==============================] - 2s 34ms/step - loss: 0.0716 - accuracy: 0.9938 - val_loss: 0.1926 - val_accuracy: 0.9461
Epoch 22/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0642 - accuracy: 0.9944 - val_loss: 0.1885 - val_accuracy: 0.9461
Epoch 23/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0577 - accuracy: 0.9949 - val_loss: 0.1850 - val_accuracy: 0.9461
Epoch 24/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0520 - accuracy: 0.9949 - val_loss: 0.1823 - val_accuracy: 0.9461
Epoch 25/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0469 - accuracy: 0.9972 - val_loss: 0.1799 - val_accuracy: 0.9461
Epoch 26/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0424 - accuracy: 0.9978 - val_loss: 0.1780 - val_accuracy: 0.9483
Epoch 27/30
56/56 [==============================] - 2s 34ms/step - loss: 0.0384 - accuracy: 0.9983 - val_loss: 0.1765 - val_accuracy: 0.9483
Epoch 28/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0348 - accuracy: 0.9989 - val_loss: 0.1752 - val_accuracy: 0.9483
Epoch 29/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0316 - accuracy: 0.9989 - val_loss: 0.1743 - val_accuracy: 0.9483
Epoch 30/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0288 - accuracy: 0.9989 - val_loss: 0.1735 - val_accuracy: 0.9506
In [21]:
def plot_graphs(history, metric):
    plt.plot(history.history[metric])
    plt.plot(history.history[f'val_{metric}'])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, f'val_{metric}'])
    plt.show()

plot_graphs(history, "accuracy")
plot_graphs(history, "loss")
In [22]:
with open('history.pkl', 'wb') as f:
    pickle.dump(history.history, f)
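If you later need the metrics again, the saved history can be loaded back from the same file (assuming history.pkl was written by the cell above):

# Reload the saved training history (illustrative; assumes history.pkl exists)
with open('history.pkl', 'rb') as f:
    saved_history = pickle.load(f)
print(saved_history.keys())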