Importing Libraries
In [1]:
import io
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pickle
Defining the global variables
In [4]:
VOCAB_SIZE = 1000
EMBEDDING_DIM = 16
MAX_LENGTH = 120
TRAINING_SPLIT = 0.8
Loading and exploring data
In [5]:
data_dir = "bbc-text.csv"
data = np.loadtxt(data_dir, delimiter=',', skiprows=1, dtype='str', comments=None)
print(f"Shape of the data: {data.shape}")
print(f"{data[0]}\n{data[1]}")
Shape of the data: (2225, 2) ['tech' 'tv future in the hands of viewers with home theatre systems plasma high-definition tvs and digital video recorders moving into the living room the way people watch tv will be radically different in five years time. that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend programmes and other content will be delivered to viewers via home networks through cable satellite telecoms companies and broadband service providers to front rooms and portable devices. one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes like the us s tivo and the uk s sky+ system allow people to record store play pause and forward wind tv programmes when they want. essentially the technology allows for much more personalised tv. they are also being built-in to high-definition tv sets which are big business in japan and the us but slower to take off in europe because of the lack of high-definition programming. not only can people forward wind through adverts they can also forget about abiding by network and channel schedules putting together their own a-la-carte entertainment. but some us networks and cable and satellite companies are worried about what it means for them in terms of advertising revenues as well as brand identity and viewer loyalty to channels. although the us leads in this technology at the moment it is also a concern that is being raised in europe particularly with the growing uptake of services like sky+. what happens here today we will see in nine months to a years time in the uk adam hume the bbc broadcast s futurologist told the bbc news website. for the likes of the bbc there are no issues of lost advertising revenue yet. it is a more pressing issue at the moment for commercial uk broadcasters but brand loyalty is important for everyone. we will be talking more about content brands rather than network brands said tim hanlon from brand communications firm starcom mediavest. the reality is that with broadband connections anybody can be the producer of content. he added: the challenge now is that it is hard to promote a programme with so much choice. what this means said stacey jolna senior vice president of tv guide tv group is that the way people find the content they want to watch has to be simplified for tv viewers. it means that networks in us terms or channels could take a leaf out of google s book and be the search engine of the future instead of the scheduler to help people find what they want to watch. this kind of channel model might work for the younger ipod generation which is used to taking control of their gadgets and what they play on them. but it might not suit everyone the panel recognised. older generations are more comfortable with familiar schedules and channel brands because they know what they are getting. they perhaps do not want so much of the choice put into their hands mr hanlon suggested. on the other end you have the kids just out of diapers who are pushing buttons already - everything is possible and available to them said mr hanlon. ultimately the consumer will tell the market they want. of the 50 000 new gadgets and technologies being showcased at ces many of them are about enhancing the tv-watching experience. 
high-definition tv sets are everywhere and many new models of lcd (liquid crystal display) tvs have been launched with dvr capability built into them instead of being external boxes. one such example launched at the show is humax s 26-inch lcd tv with an 80-hour tivo dvr and dvd recorder. one of the us s biggest satellite tv companies directtv has even launched its own branded dvr at the show with 100-hours of recording capability instant replay and a search function. the set can pause and rewind tv for up to 90 hours. and microsoft chief bill gates announced in his pre-show keynote speech a partnership with tivo called tivotogo which means people can play recorded programmes on windows pcs and mobile devices. all these reflect the increasing trend of freeing up multimedia so that people can watch what they want when they want.'] ['business' 'worldcom boss left books alone former worldcom boss bernie ebbers who is accused of overseeing an $11bn (£5.8bn) fraud never made accounting decisions a witness has told jurors. david myers made the comments under questioning by defence lawyers who have been arguing that mr ebbers was not responsible for worldcom s problems. the phone company collapsed in 2002 and prosecutors claim that losses were hidden to protect the firm s shares. mr myers has already pleaded guilty to fraud and is assisting prosecutors. on monday defence lawyer reid weingarten tried to distance his client from the allegations. during cross examination he asked mr myers if he ever knew mr ebbers make an accounting decision . not that i am aware of mr myers replied. did you ever know mr ebbers to make an accounting entry into worldcom books mr weingarten pressed. no replied the witness. mr myers has admitted that he ordered false accounting entries at the request of former worldcom chief financial officer scott sullivan. defence lawyers have been trying to paint mr sullivan who has admitted fraud and will testify later in the trial as the mastermind behind worldcom s accounting house of cards. mr ebbers team meanwhile are looking to portray him as an affable boss who by his own admission is more pe graduate than economist. whatever his abilities mr ebbers transformed worldcom from a relative unknown into a $160bn telecoms giant and investor darling of the late 1990s. worldcom s problems mounted however as competition increased and the telecoms boom petered out. when the firm finally collapsed shareholders lost about $180bn and 20 000 workers lost their jobs. mr ebbers trial is expected to last two months and if found guilty the former ceo faces a substantial jail sentence. he has firmly declared his innocence.']
In [6]:
print(f"There are {len(data)} sentence-label pairs in the dataset.\n")
print(f"First sentence has {len((data[0,1]).split())} words.\n")
print(f"The first 5 labels are {data[:5,0]}")
There are 2225 sentence-label pairs in the dataset.

First sentence has 737 words.

The first 5 labels are ['tech' 'business' 'sport' 'sport' 'entertainment']
Splitting into Training and Validation Datasets
In [7]:
def train_val_datasets(data):
    '''
    Splits data into training and validation sets

    Args:
        data (np.array): array with two columns, the first one is the label, the second is the text

    Returns:
        (tf.data.Dataset, tf.data.Dataset): tuple containing the train and validation datasets
    '''
    # Compute the number of sentences that will be used for training (should be an integer)
    train_size = int(len(data) * TRAINING_SPLIT)

    # Slice the dataset to get only the texts. Remember that texts are in the second column
    texts = data[:, 1]

    # Slice the dataset to get only the labels. Remember that labels are in the first column
    labels = data[:, 0]

    # Split the sentences and labels into train/validation splits
    train_texts = texts[:train_size]
    validation_texts = texts[train_size:]
    train_labels = labels[:train_size]
    validation_labels = labels[train_size:]

    # Create the train and validation datasets from the splits
    train_dataset = tf.data.Dataset.from_tensor_slices((train_texts, train_labels))
    validation_dataset = tf.data.Dataset.from_tensor_slices((validation_texts, validation_labels))

    return train_dataset, validation_dataset
In [8]:
# Create the datasets
train_dataset, validation_dataset = train_val_datasets(data)
print(f"There are {train_dataset.cardinality()} sentence-label pairs for training.\n")
print(f"There are {validation_dataset.cardinality()} sentence-label pairs for validation.\n")
There are 1780 sentence-label pairs for training.

There are 445 sentence-label pairs for validation.
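As an optional sanity check (not part of the original flow), you can peek at a single sentence-label pair from the training dataset:

# Peek at one (text, label) pair from the training dataset (illustrative check only)
for text, label in train_dataset.take(1):
    print(label.numpy(), text.numpy()[:80])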
Standardizing (cleaning) and vectorizing the text
In [9]:
def standardize_func(sentence):
    # List of stopwords
    stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "her", "here", "hers", "herself", "him", "himself", "his", "how", "i", "if", "in", "into", "is", "it", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "should", "so", "some", "such", "than", "that", "the", "their", "theirs", "them", "themselves", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "were", "what", "when", "where", "which", "while", "who", "whom", "why", "why", "with", "would", "you", "your", "yours", "yourself", "yourselves", "'m", "'d", "'ll", "'re", "'ve", "'s", "'d"]

    # Convert the sentence to lowercase
    sentence = tf.strings.lower(sentence)

    # Remove stopwords
    for word in stopwords:
        if word[0] == "'":
            sentence = tf.strings.regex_replace(sentence, rf"{word}\b", "")
        else:
            sentence = tf.strings.regex_replace(sentence, rf"\b{word}\b", "")

    # Remove punctuation
    sentence = tf.strings.regex_replace(sentence, r'[!"#$%&()\*\+,-\./:;<=>?@\[\\\]^_`{|}~\']', "")

    return sentence
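As a quick, optional check of the standardizer, you can run it on a short made-up sentence; the exact result depends on the stopword list above, but stopwords and punctuation should disappear and the text should be lowercased.

# Try the standardizer on an illustrative sentence (not part of the original assignment)
example = tf.constant("The TV is in the living room!")
print(standardize_func(example).numpy())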
In [10]:
def fit_vectorizer(train_sentences, standardize_func):
    '''
    Defines and adapts the text vectorizer

    Args:
        train_sentences (tf.data.Dataset): sentences from the train dataset to fit the TextVectorization layer
        standardize_func (FunctionType): function to remove stopwords and punctuation, and lowercase texts.

    Returns:
        TextVectorization: adapted instance of TextVectorization layer
    '''
    # Instantiate the TextVectorization class with the custom standardizer, vocabulary size and sequence length
    vectorizer = tf.keras.layers.TextVectorization(
        standardize=standardize_func,
        max_tokens=VOCAB_SIZE,
        output_sequence_length=MAX_LENGTH
    )

    # Adapt the vectorizer to the training sentences
    vectorizer.adapt(train_sentences)

    return vectorizer
In [11]:
# Create the vectorizer
text_only_dataset = train_dataset.map(lambda text, label: text)
vectorizer = fit_vectorizer(text_only_dataset, standardize_func)
vocab_size = vectorizer.vocabulary_size()
print(f"Vocabulary contains {vocab_size} words\n")
Vocabulary contains 1000 words
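To see what the adapted vectorizer actually learned, you can optionally inspect the most frequent tokens; index 0 is reserved for padding and index 1 for the out-of-vocabulary token.

# Inspect the first few entries of the learned vocabulary (illustrative check)
print(vectorizer.get_vocabulary()[:10])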
Encoding the labels
In [12]:
def fit_label_encoder(train_labels, validation_labels):
    """Creates an instance of a StringLookup layer and adapts it on all labels

    Args:
        train_labels (tf.data.Dataset): dataset of train labels
        validation_labels (tf.data.Dataset): dataset of validation labels

    Returns:
        tf.keras.layers.StringLookup: adapted encoder for train and validation labels
    """
    # Join the two label datasets
    labels = train_labels.concatenate(validation_labels)

    # Instantiate the StringLookup layer. No OOV tokens are needed since every label is known
    label_encoder = tf.keras.layers.StringLookup(oov_token='[UNK]', num_oov_indices=0)

    # Adapt the StringLookup layer on the combined labels
    label_encoder.adapt(labels)

    return label_encoder
In [13]:
# Create the label encoder
train_labels_only = train_dataset.map(lambda text, label: label)
validation_labels_only = validation_dataset.map(lambda text, label: label)
label_encoder = fit_label_encoder(train_labels_only, validation_labels_only)
print(f'Unique labels: {label_encoder.get_vocabulary()}')
Unique labels: ['sport', 'business', 'politics', 'tech', 'entertainment']
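As an optional check, encoding a couple of label strings should return integer ids between 0 and 4, in the same order as get_vocabulary() above.

# Encode two example labels into integer ids (illustrative check)
print(label_encoder(tf.constant(["sport", "tech"])).numpy())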
Preprocessing the dataset for model training
In [14]:
def preprocess_dataset(dataset, text_vectorizer, label_encoder):
    """Apply the preprocessing to a dataset

    Args:
        dataset (tf.data.Dataset): dataset to preprocess
        text_vectorizer (tf.keras.layers.TextVectorization): text vectorizer
        label_encoder (tf.keras.layers.StringLookup): label encoder

    Returns:
        tf.data.Dataset: transformed dataset
    """
    # Convert the dataset sentences to sequences, and encode the text labels
    dataset = dataset.map(
        lambda text, label: (text_vectorizer(text), label_encoder(label))
    )

    # Set a batch size of 32
    dataset = dataset.batch(32)

    return dataset
In [15]:
# Preprocess your dataset
train_proc_dataset = preprocess_dataset(train_dataset, vectorizer, label_encoder)
validation_proc_dataset = preprocess_dataset(validation_dataset, vectorizer, label_encoder)
print(f"Number of batches in the train dataset: {train_proc_dataset.cardinality()}")
print(f"Number of batches in the validation dataset: {validation_proc_dataset.cardinality()}")
Number of batches in the train dataset: 56
Number of batches in the validation dataset: 14
In [16]:
train_batch = next(train_proc_dataset.as_numpy_iterator())
validation_batch = next(validation_proc_dataset.as_numpy_iterator())
print(f"Shape of the train batch: {train_batch[0].shape}")
print(f"Shape of the validation batch: {validation_batch[0].shape}")
Shape of the train batch: (32, 120)
Shape of the validation batch: (32, 120)
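If you want to eyeball what the model will actually see, you can optionally map the token ids of the first training example back to words using the vectorizer's vocabulary; padding positions show up as empty strings.

# Decode the first vectorized training example back to tokens (illustrative check)
vocab = vectorizer.get_vocabulary()
first_sequence = train_batch[0][0]
print(" ".join(vocab[idx] for idx in first_sequence[:20]))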
Model Creation
In [17]:
# GRADED FUNCTION: create_model
def create_model():
    """
    Creates a text classifier model

    Returns:
        tf.keras Model: the text classifier model
    """
    # Define the model
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(MAX_LENGTH,)),
        tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM),  # Embedding layer with VOCAB_SIZE (1,000) tokens and output dimension 16
        tf.keras.layers.GlobalAveragePooling1D(),  # GlobalAveragePooling1D layer
        tf.keras.layers.Dense(16, activation='relu'),  # Hidden Dense layer with ReLU activation
        tf.keras.layers.Dense(5, activation='softmax')  # Output Dense layer with 5 units and softmax activation
    ])

    # Compile the model with an appropriate loss, optimizer and metrics
    model.compile(
        loss='sparse_categorical_crossentropy',  # Loss for integer-encoded multi-class labels
        optimizer='adam',  # Adam optimizer
        metrics=['accuracy']  # Metric for evaluation
    )

    return model
In [18]:
# Get the untrained model
model = create_model()
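Before training, it can be useful to confirm the architecture and parameter counts (an optional step, not required by the rest of the notebook):

# Print a summary of the layers and parameter counts
model.summary()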
In [19]:
example_batch = train_proc_dataset.take(1)

try:
    model.evaluate(example_batch, verbose=False)
except:
    print("Your model is not compatible with the dataset you defined earlier. Check that the loss function and last layer are compatible with one another.")
else:
    predictions = model.predict(example_batch, verbose=False)
    print(f"predictions have shape: {predictions.shape}")
predictions have shape: (32, 5)
In [20]:
history = model.fit(train_proc_dataset, epochs=30, validation_data=validation_proc_dataset)
Epoch 1/30
56/56 [==============================] - 3s 34ms/step - loss: 1.6002 - accuracy: 0.2287 - val_loss: 1.5867 - val_accuracy: 0.2921
Epoch 2/30
56/56 [==============================] - 2s 34ms/step - loss: 1.5657 - accuracy: 0.3573 - val_loss: 1.5304 - val_accuracy: 0.4090
Epoch 3/30
56/56 [==============================] - 2s 35ms/step - loss: 1.4752 - accuracy: 0.4584 - val_loss: 1.4097 - val_accuracy: 0.5169
Epoch 4/30
56/56 [==============================] - 2s 34ms/step - loss: 1.3158 - accuracy: 0.5489 - val_loss: 1.2291 - val_accuracy: 0.5888
Epoch 5/30
56/56 [==============================] - 2s 34ms/step - loss: 1.1124 - accuracy: 0.6472 - val_loss: 1.0352 - val_accuracy: 0.7191
Epoch 6/30
56/56 [==============================] - 2s 33ms/step - loss: 0.9138 - accuracy: 0.7978 - val_loss: 0.8637 - val_accuracy: 0.8202
Epoch 7/30
56/56 [==============================] - 2s 34ms/step - loss: 0.7398 - accuracy: 0.8916 - val_loss: 0.7181 - val_accuracy: 0.8854
Epoch 8/30
56/56 [==============================] - 2s 33ms/step - loss: 0.5916 - accuracy: 0.9303 - val_loss: 0.5962 - val_accuracy: 0.9056
Epoch 9/30
56/56 [==============================] - 2s 36ms/step - loss: 0.4698 - accuracy: 0.9534 - val_loss: 0.4981 - val_accuracy: 0.9124
Epoch 10/30
56/56 [==============================] - 2s 35ms/step - loss: 0.3739 - accuracy: 0.9624 - val_loss: 0.4219 - val_accuracy: 0.9146
Epoch 11/30
56/56 [==============================] - 2s 35ms/step - loss: 0.3013 - accuracy: 0.9646 - val_loss: 0.3646 - val_accuracy: 0.9258
Epoch 12/30
56/56 [==============================] - 2s 35ms/step - loss: 0.2469 - accuracy: 0.9730 - val_loss: 0.3219 - val_accuracy: 0.9303
Epoch 13/30
56/56 [==============================] - 2s 36ms/step - loss: 0.2060 - accuracy: 0.9770 - val_loss: 0.2899 - val_accuracy: 0.9371
Epoch 14/30
56/56 [==============================] - 2s 34ms/step - loss: 0.1747 - accuracy: 0.9792 - val_loss: 0.2657 - val_accuracy: 0.9438
Epoch 15/30
56/56 [==============================] - 2s 34ms/step - loss: 0.1502 - accuracy: 0.9820 - val_loss: 0.2470 - val_accuracy: 0.9438
Epoch 16/30
56/56 [==============================] - 2s 35ms/step - loss: 0.1306 - accuracy: 0.9848 - val_loss: 0.2323 - val_accuracy: 0.9438
Epoch 17/30
56/56 [==============================] - 2s 34ms/step - loss: 0.1145 - accuracy: 0.9860 - val_loss: 0.2207 - val_accuracy: 0.9438
Epoch 18/30
56/56 [==============================] - 2s 34ms/step - loss: 0.1011 - accuracy: 0.9888 - val_loss: 0.2114 - val_accuracy: 0.9438
Epoch 19/30
56/56 [==============================] - 2s 32ms/step - loss: 0.0897 - accuracy: 0.9899 - val_loss: 0.2039 - val_accuracy: 0.9438
Epoch 20/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0800 - accuracy: 0.9933 - val_loss: 0.1977 - val_accuracy: 0.9461
Epoch 21/30
56/56 [==============================] - 2s 34ms/step - loss: 0.0716 - accuracy: 0.9938 - val_loss: 0.1926 - val_accuracy: 0.9461
Epoch 22/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0642 - accuracy: 0.9944 - val_loss: 0.1885 - val_accuracy: 0.9461
Epoch 23/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0577 - accuracy: 0.9949 - val_loss: 0.1850 - val_accuracy: 0.9461
Epoch 24/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0520 - accuracy: 0.9949 - val_loss: 0.1823 - val_accuracy: 0.9461
Epoch 25/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0469 - accuracy: 0.9972 - val_loss: 0.1799 - val_accuracy: 0.9461
Epoch 26/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0424 - accuracy: 0.9978 - val_loss: 0.1780 - val_accuracy: 0.9483
Epoch 27/30
56/56 [==============================] - 2s 34ms/step - loss: 0.0384 - accuracy: 0.9983 - val_loss: 0.1765 - val_accuracy: 0.9483
Epoch 28/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0348 - accuracy: 0.9989 - val_loss: 0.1752 - val_accuracy: 0.9483
Epoch 29/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0316 - accuracy: 0.9989 - val_loss: 0.1743 - val_accuracy: 0.9483
Epoch 30/30
56/56 [==============================] - 2s 33ms/step - loss: 0.0288 - accuracy: 0.9989 - val_loss: 0.1735 - val_accuracy: 0.9506
In [21]:
def plot_graphs(history, metric):
    plt.plot(history.history[metric])
    plt.plot(history.history[f'val_{metric}'])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, f'val_{metric}'])
    plt.show()

plot_graphs(history, "accuracy")
plot_graphs(history, "loss")
In [22]:
with open('history.pkl', 'wb') as f:
    pickle.dump(history.history, f)
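If you later need the metrics again, the saved history can be loaded back from the same file (assuming history.pkl was written by the cell above):

# Reload the saved training history (illustrative; assumes history.pkl exists)
with open('history.pkl', 'rb') as f:
    saved_history = pickle.load(f)
print(saved_history.keys())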