In NLP, the first step is building a vocabulary from the text corpus, followed by converting text into numerical features for training neural networks. TensorFlow and Keras simplify this with APIs like the TextVectorization preprocessing layer. Its adapt() method processes a list of sentences and assigns each unique word an integer ID.
The TextVectorization layer tokenizes text and converts words into numerical representations. Key hyperparameters include max_tokens (vocabulary size), output_mode ('int' for integer sequences, 'multi_hot', 'count', or 'tf_idf'), output_sequence_length (fixed-length sequences), and standardize (text preprocessing options). It supports the adapt() method for learning the vocabulary from data and is typically used in NLP pipelines before embedding layers or models.
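For instance, a layer with non-default settings could be configured like this (just a sketch to show the parameters; the values here are arbitrary, not recommendations):
import tensorflow as tf
# Illustrative configuration: the values below are placeholders
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=1000,                            # keep at most 1000 tokens in the vocabulary
    output_mode='int',                          # emit integer token IDs (the default)
    output_sequence_length=10,                  # pad or truncate every sequence to length 10
    standardize='lower_and_strip_punctuation'   # default: lowercase and strip punctuation
)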
import tensorflow as tf
sentences = ['i love my dog', 'I, love my cat']
vectorize_layer = tf.keras.layers.TextVectorization()
vectorize_layer.adapt(sentences)
vocabulary = vectorize_layer.get_vocabulary(include_special_tokens=False)
- tf.keras.layers.TextVectorization() : Initializes a TextVectorization layer, which will process text and generate a vocabulary. Default settings:
- max_tokens=None (unlimited vocabulary)
- output_mode='int' (integer tokenization)
- standardize='lower_and_strip_punctuation' (lowercases text and removes punctuation)
- vectorize_layer.adapt(sentences) : Learns the vocabulary from the given sentences. The text is lowercased and punctuation is removed (the default standardization).
- vectorize_layer.get_vocabulary(include_special_tokens=False) : Retrieves the learned vocabulary as a list of words, excluding special tokens (e.g., [UNK] for unknown words).
vocabulary # Words are ordered by how often they appear (most frequent first)
['my', 'love', 'i', 'dog', 'cat']
# Let's add another sentence to the corpus
sentences = ['i love my dog', 'I, love my cat', 'You love my dog!']
vectorize_layer = tf.keras.layers.TextVectorization()
vectorize_layer.adapt(sentences)
vocabulary = vectorize_layer.get_vocabulary(include_special_tokens=False)
for index, word in enumerate(vocabulary): # More frequent words have lower indices when using TextVectorization
print(index, ' - ', word)
0 - my
1 - love
2 - i
3 - dog
4 - you
5 - cat
vocabulary
['my', 'love', 'i', 'dog', 'you', 'cat']
# Including special tokens for handling unknown words or for padding
vocabulary = vectorize_layer.get_vocabulary()
vocabulary # Index 0 is reserved for padding and index 1 for out-of-vocabulary words
['', '[UNK]', 'my', 'love', 'i', 'dog', 'you', 'cat']
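If you want to map words to their IDs yourself (e.g., for inspection or debugging), you can build a lookup from this list. A minimal sketch (the word_index name is just illustrative):
# Build a word -> index lookup from the vocabulary list
word_index = {word: index for index, word in enumerate(vocabulary)}
word_index.get('dog')   # 5 in the vocabulary above
word_index.get('my')    # 2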
Text data has to be converted into numeric sequences, and these need to be of uniform size before being fed into the model.
sentences = ['i love my dog', 'I, love my cat', 'You love my dog!', 'Do you think my dog is amazing?']
vectorize_layer = tf.keras.layers.TextVectorization()
vectorize_layer.adapt(sentences)
vocabulary = vectorize_layer.get_vocabulary()
for index, words in enumerate(vocabulary):
print(index, ' - ', words)
0 -
1 - [UNK]
2 - my
3 - love
4 - dog
5 - you
6 - i
7 - think
8 - is
9 - do
10 - cat
11 - amazing
Now we can use this layer to convert sentences into integer sequences.
sample_input = 'I love my dog'
sequence = vectorize_layer(sample_input)
sequence
<tf.Tensor: shape=(4,), dtype=int64, numpy=array([6, 3, 2, 4], dtype=int64)>
A string is passed to the layer that has learned the vocabulary, and the layer outputs the integer sequence as a tf.Tensor. For a list of input sentences, the layer has to be applied to each one.
sentences_dataset = tf.data.Dataset.from_tensor_slices(sentences) # converting list to tf.data.Dataset
sequences = sentences_dataset.map(vectorize_layer)
for sentence, sequence in zip(sentences, sequences):
print(f'{sentence} ----> {sequence}')
i love my dog ----> [6 3 2 4]
I, love my cat ----> [ 6 3 2 10]
You love my dog! ----> [5 3 2 4]
Do you think my dog is amazing? ----> [ 9 5 7 2 4 8 11]
Integer sequences have varying lengths, making them unsuitable for direct model input. To make them uniform in length, we apply either padding or truncation, with padding being the preferred approach because it retains information. The vocabulary reserves index 0 as a special token for padding. When a list of strings is passed to the layer, post-padding is applied: 0s are appended to each sequence until it matches the longest sequence length.
sequence_post = vectorize_layer(sentences)
for sentence, sequence in zip(sentences, sequence_post):
print(f'{sentence} ----> {sequence}')
i love my dog ----> [6 3 2 4 0 0 0]
I, love my cat ----> [ 6 3 2 10 0 0 0]
You love my dog! ----> [5 3 2 4 0 0 0]
Do you think my dog is amazing? ----> [ 9 5 7 2 4 8 11]
If you want pre-padding, you can use the pad_sequences() utility to prepend padding tokens to the sequences. Notice that the padding argument is set to 'pre'. This is just for clarity; the function already uses it as the default, so you can opt to drop it.
sequences_pre = tf.keras.utils.pad_sequences(sequences, padding='pre')
for sentence, sequence in zip(sentences, sequences_pre):
print(f'{sentence} ----> {sequence}')
i love my dog ----> [0 0 0 6 3 2 4]
I, love my cat ----> [ 0 0 0 6 3 2 10]
You love my dog! ----> [0 0 0 5 3 2 4]
Do you think my dog is amazing? ----> [ 9 5 7 2 4 8 11]
# You can set a maximum sequence length too
sequences_pre = tf.keras.utils.pad_sequences(sequences, padding='pre', maxlen=5) # Will keep the last 5
for sentence, sequence in zip(sentences, sequences_pre):
print(f'{sentence} ----> {sequence}')
i love my dog ----> [0 6 3 2 4]
I, love my cat ----> [ 0 6 3 2 10]
You love my dog! ----> [0 5 3 2 4]
Do you think my dog is amazing? ----> [ 7 2 4 8 11]
# By default, the tokens will truncate from the front as seen above
# You can truncate the tokens from the end too
sequences_pre = tf.keras.utils.pad_sequences(sequences, padding='post', maxlen=5, truncating='post')
for sentence, sequence in zip(sentences, sequences_pre):
print(f'{sentence} ----> {sequence}')
i love my dog ----> [6 3 2 4 0]
I, love my cat ----> [ 6 3 2 10 0]
You love my dog! ----> [5 3 2 4 0]
Do you think my dog is amazing? ----> [9 5 7 2 4]
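Instead of padding afterwards with pad_sequences(), you can also ask the layer itself for fixed-length output by setting output_sequence_length. A brief sketch (the length of 5 and the layer name are just illustrative):
# The layer post-pads and truncates to a fixed length by itself when output_sequence_length is set
vectorize_layer_fixed = tf.keras.layers.TextVectorization(output_sequence_length=5)
vectorize_layer_fixed.adapt(sentences)
vectorize_layer_fixed(sentences)   # every row has length 5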
Another way to prepare for pre-padding is to set the TextVectorization layer to output a ragged tensor. This means the output will not be automatically post-padded:
vectorize_layer = tf.keras.layers.TextVectorization(ragged=True)
vectorize_layer.adapt(sentences)
ragged_sequences = vectorize_layer(sentences)
ragged_sequences
<tf.RaggedTensor [[6, 3, 2, 4], [6, 3, 2, 10], [5, 3, 2, 4], [9, 5, 7, 2, 4, 8, 11]]>
sequences_pre = tf.keras.utils.pad_sequences(ragged_sequences.numpy())
sequences_pre
array([[ 0,  0,  0,  6,  3,  2,  4],
       [ 0,  0,  0,  6,  3,  2, 10],
       [ 0,  0,  0,  5,  3,  2,  4],
       [ 9,  5,  7,  2,  4,  8, 11]])
sequence_post = tf.keras.utils.pad_sequences(ragged_sequences.numpy(), padding='post')
sequence_post
array([[ 6,  3,  2,  4,  0,  0,  0],
       [ 6,  3,  2, 10,  0,  0,  0],
       [ 5,  3,  2,  4,  0,  0,  0],
       [ 9,  5,  7,  2,  4,  8, 11]])
Now let's look at out-of-vocabulary (OOV) words. The layer uses the token 1 ([UNK]) for words that are not in the vocabulary.
sentences_with_oov = ['i really love my dog', 'my dog loves my manatee']
sequences_with_oov = vectorize_layer(sentences_with_oov)
for sentence, sequence in zip(sentences_with_oov, sequences_with_oov):
print(f'{sentence} -----> {sequence}')
i really love my dog -----> [6 1 3 2 4]
my dog loves my manatee -----> [2 4 1 2 1]
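OOV tokens also show up when you cap the vocabulary size: with max_tokens set, only the most frequent words get their own IDs and the rest map to 1. A rough sketch (max_tokens=5 and the layer name are arbitrary choices for illustration):
# A capped vocabulary forces rarer words onto the [UNK] token (ID 1)
small_vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=5)
small_vectorize_layer.adapt(sentences)
small_vectorize_layer.get_vocabulary()    # only 5 entries, including '' and '[UNK]'
small_vectorize_layer(['i love my dog'])  # any word outside the top tokens becomes 1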
Let's put it all together now¶
Creating tokens from sentences with post-padding and handling OOV words¶
import tensorflow as tf
# List of sentences
sentences = ['i love my dog', 'I, love my cat', 'I have been to the place multiple times']
# Initializing text vectorization layer
vectorize_layer = tf.keras.layers.TextVectorization()
# Adapting the layer with the sentences
vectorize_layer.adapt(sentences)
# To look at the vocabulary
vocabulary = vectorize_layer.get_vocabulary() # pass include_special_tokens=False to exclude '' and '[UNK]'
# Vectorizing the sentences with post padding
sequence_post = vectorize_layer(sentences)
for sentence, sequence in zip(sentences, sequence_post):
print(sentence, ' ----> ', sequence)
i love my dog ----> tf.Tensor([ 2 4 3 11 0 0 0 0], shape=(8,), dtype=int64)
I, love my cat ----> tf.Tensor([ 2 4 3 12 0 0 0 0], shape=(8,), dtype=int64)
I have been to the place multiple times ----> tf.Tensor([ 2 10 13 5 7 8 9 6], shape=(8,), dtype=int64)
# Another way of post-padding
vectorize_layer = tf.keras.layers.TextVectorization(ragged=True)
vectorize_layer.adapt(sentences)
ragged_sequences = vectorize_layer(sentences)
sequence_post = tf.keras.utils.pad_sequences(ragged_sequences.numpy(), padding='post') #maxlen, truncating
for sentence, sequence in zip(sentences, sequence_post):
print(f'{sentence} -----> {sequence}')
i love my dog ----> [ 2 4 3 11 0 0 0 0]
I, love my cat ----> [ 2 4 3 12 0 0 0 0]
I have been to the place multiple times ----> [ 2 10 13 5 7 8 9 6]
sentences_with_oov = ['I love to travel']
sequences_with_oov = vectorize_layer(sentences_with_oov)
sequences_with_oov = tf.keras.utils.pad_sequences(sequences_with_oov.numpy(), padding='post')
for sentence, sequence in zip(sentences_with_oov, sequences_with_oov):
print(f'{sentence} -----> {sequence}')
I love to travel -----> [2 4 5 1]
Creating tokens from sentences with pre-padding and handling OOV words¶
vectorize_layer = tf.keras.layers.TextVectorization(ragged=True)
vectorize_layer.adapt(sentences)
ragged_sequences = vectorize_layer(sentences)
sequence_pre = tf.keras.utils.pad_sequences(ragged_sequences.numpy(), padding='pre') #maxlen, truncating
for sentence, sequence in zip(sentences, sequence_pre):
print(sentence, ' ----> ', sequence)
i love my dog ----> [ 0 0 0 0 2 4 3 11]
I, love my cat ----> [ 0 0 0 0 2 4 3 12]
I have been to the place multiple times ----> [ 2 10 13 5 7 8 9 6]
sentences_with_oov = ['I love to travel', 'I love']
sequences_with_oov = vectorize_layer(sentences_with_oov)
sequences_with_oov = tf.keras.utils.pad_sequences(sequences_with_oov.numpy(), padding='pre')
for sentence, sequence in zip(sentences_with_oov, sequences_with_oov):
print(f'{sentence} -----> {sequence}')
I love to travel -----> [2 4 5 1]
I love -----> [0 0 2 4]
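Once the sentences are tokenized and padded, they are typically fed into an embedding layer followed by a model head. As a closing sketch (the vocabulary size, embedding dimension, and the classification head below are assumptions for illustration, not part of the examples above):
# Minimal end-to-end sketch: TextVectorization feeding an Embedding layer inside a model
vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=100, output_sequence_length=8)
vectorize_layer.adapt(sentences)   # learn the vocabulary from the training text
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),               # raw string input
    vectorize_layer,                                           # text -> padded integer sequence
    tf.keras.layers.Embedding(input_dim=100, output_dim=16),   # input_dim matches max_tokens (assumed)
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid')             # e.g., a binary-classification head
])
model.compile(loss='binary_crossentropy', optimizer='adam')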