GoEmotions Dataset: Generating Text with Specific Emotions
In a previous blog post, we covered how to download the GoEmotions dataset and train a simple classifier using the BERT architecture. Here, we'll take it a step further and train a generative model that can write text expressing a specific emotion.
This text expresses several emotions at once, including excitement, approval and gratitude.
Downloading and re-classifying the sentiment140 dataset
The sentiment140 dataset, available on TFDS, was created for training classifiers that determine the sentiment expressed about a specific subject. It contains 1.6M tweets classified into positive, negative and neutral polarity.
Since we already have a model capable of classifying 27 different emotions, this dataset can instead be used for a different purpose – such as training a generative model to create new, short-form pieces of text. But first, let's re-classify the sentiment140 dataset to label each tweet with fine-grained emotions using our pretrained classifier:
import csv
import pathlib

import tensorflow as tf
import tensorflow_datasets as tfds

batch_size = 128
ds_orig = tfds.load('sentiment140', split='train')
output_path = pathlib.Path('.') / 'sentiment140_goemotions.csv'

# Download and load our pretrained classifier.
url_root = 'https://huggingface.co/owahltinez/goemotions_bert/resolve/main'
model_path = tf.keras.utils.get_file(
    origin=f'{url_root}/bert_model.tar.gz', untar=True)
classifier = tf.keras.models.load_model(
    pathlib.Path(model_path) / 'goemotions_bert')

# Reclassify the dataset by mapping the original text to fine-grained emotions.
ds_emo = ds_orig.map(lambda x: x['text']).batch(batch_size)
ds_emo = ds_emo.map(lambda x: (x, tf.math.argmax(classifier(x), axis=1)))

# Save the resulting dataset into a CSV file. `emotions` is the list of
# GoEmotions label names used to train the classifier in the previous post.
with open(output_path, 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['sentence', 'label'])
    for x_batch, y_batch in iter(ds_emo):
        for x, y in zip(x_batch, y_batch):
            label = emotions[y.numpy()]
            sentence = x.numpy().decode('utf8')
            writer.writerow([sentence, label])
This can take a very long time, even with hardware-accelerated processing! The resulting dataset is available as a CSV file here. It's worth noting that only the most probable label is recorded for each piece of text (without any thresholding); all other labels are discarded.
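If you wanted to keep multiple labels per tweet instead, one option is to apply a probability threshold rather than taking the argmax. The snippet below is a minimal sketch of that idea and is not part of the pipeline used here; it assumes the classifier outputs per-emotion probabilities (apply a sigmoid first if it returns logits), and the threshold value is arbitrary:

import numpy as np

# Hypothetical multi-label alternative (not used in this post): keep every
# emotion whose score clears a threshold instead of only the top one.
threshold = 0.3  # Arbitrary value, chosen only for illustration.
for x_batch in ds_orig.map(lambda x: x['text']).batch(batch_size).take(1):
    scores = classifier(x_batch).numpy()  # Assumed shape: (batch, num_emotions).
    for text, row in zip(x_batch.numpy(), scores):
        labels = [emotions[i] for i in np.flatnonzero(row >= threshold)]
        print(text.decode('utf8'), labels)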
Extracting text samples for a single emotion
With the re-classified dataset, you can extract all instances of text that contain a specific emotion. Here's how you can filter the dataset:
def filter_dataset_by_emotion(ds, emotion, batch_size):
    # `ds` yields batched dictionaries with 'sentence' and 'label' features.
    ds = ds.unbatch()
    ds = ds.filter(lambda x: tf.equal(x['label'], emotion))
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

ds_emo = filter_dataset_by_emotion(ds, 'disappointment', batch_size)
The resulting dataset will only contain instances of text that were classified as displaying disappointment. Although some of the text is misclassified, most of the examples carry a relatively consistent theme and tone.
As a side note, the model interestingly tends to label tiredness and boredom as disappointment. Neither of those is among the labeled emotions, so it's possible that the concept the model learned as "disappointment" captures them as well.
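The snippets above assume ds is the re-classified dataset loaded back from the CSV file with 'sentence' and 'label' features. The original post does not show that loading step, so the sketch below is just one possible way to do it, using tf.data.experimental.make_csv_dataset with the column names written earlier:

import tensorflow as tf

batch_size = 128

# One possible way to load the re-classified CSV produced earlier into `ds`;
# this step is not shown in the original post, so treat it as an assumption.
ds = tf.data.experimental.make_csv_dataset(
    'sentiment140_goemotions.csv',
    batch_size=batch_size,
    column_names=['sentence', 'label'],
    column_defaults=[tf.string, tf.string],
    header=True,
    shuffle=False,    # Keep file order; adapt() below samples the first batches.
    num_epochs=None,  # Repeat forever so fit() can draw many steps per epoch.
)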
Implementing the miniature GPT architecture
This part is a copy of the architecture described in this great Keras tutorial. Rather than unnecessarily reproducing all of its contents here, you can refer to the proposed implementation for more details. Alternatively, you can skip to the bottom of this post for a link to the final version of the code.
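For reference, the sketch below condenses what a create_model() along those lines might look like: a token-plus-position embedding, a single transformer block with causal self-attention, and a dense head producing per-position logits. It is a rough approximation of the tutorial's model rather than a copy of it; the layer sizes are illustrative, it reuses the maxlen and vocab_size values from the preprocessing step below, and it assumes a TensorFlow version that supports use_causal_mask (2.10+). The model returns both the logits and the transformer block output, which is why training later compiles with loss=[loss_fn, None]:

import tensorflow as tf

maxlen = 80         # Same value as in the preprocessing step below.
vocab_size = 20000  # Same value as in the preprocessing step below.
embed_dim = 256     # Illustrative embedding size.
num_heads = 2       # Illustrative number of attention heads.
ff_dim = 256        # Illustrative feed-forward size.

class TokenAndPositionEmbedding(tf.keras.layers.Layer):
    """Sums a learned token embedding with a learned position embedding."""

    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.pos_emb = tf.keras.layers.Embedding(maxlen, embed_dim)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.token_emb(x) + self.pos_emb(positions)

def create_model():
    inputs = tf.keras.layers.Input(shape=(maxlen,), dtype=tf.int32)
    x = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)(inputs)
    # A single transformer block with causal (left-to-right) self-attention.
    attn = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_dim)(x, x, use_causal_mask=True)
    x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + attn)
    ffn = tf.keras.Sequential([
        tf.keras.layers.Dense(ff_dim, activation='relu'),
        tf.keras.layers.Dense(embed_dim),
    ])
    x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + ffn(x))
    # Per-position logits over the vocabulary.
    outputs = tf.keras.layers.Dense(vocab_size)(x)
    return tf.keras.Model(inputs=inputs, outputs=[outputs, x])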
Preprocessing text for the miniature GPT model
The miniature GPT model requires tokenized text as input. A common preprocessing technique prior to tokenization is normalizing the text, for example by stripping whitespace and lowercasing everything. In the previous blog post, the pretrained BERT tokenizer was used; here, both the tokenization and the normalization are done by the tf.keras.layers.TextVectorization class.
In written English, which is the language of the training data, capitalization is often used to convey emotion; for example, "nice" can be considered neutral or approving, whereas "NICE" often represents a more excited or enthusiastic tone. So, in a real-world application, you should probably not lowercase text as part of normalization. Here it is done to speed up training for illustration purposes; otherwise you would need a much larger vocabulary (since "nice" and "NICE" would be different words, with different model representations) and a larger number of model parameters. Similarly, other forms of written text may rely on subtle cues that need to be taken into account; this can be particularly important in languages other than English, or with sources of text that use emojis, for example.
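If you did want to preserve capitalization, TextVectorization accepts a custom standardize callable. The snippet below is a minimal sketch of that idea and is not used in this post: it strips punctuation the same way the default standardization does but leaves the case intact, at the cost of a larger vocabulary; the max_tokens value is arbitrary:

import re
import string
import tensorflow as tf

# Hypothetical case-preserving standardization: strip punctuation but keep
# capitalization, so "nice" and "NICE" remain distinct tokens.
def keep_case_standardize(text):
    return tf.strings.regex_replace(
        text, '[%s]' % re.escape(string.punctuation), '')

case_preserving_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=50_000,  # Larger vocabulary to absorb cased variants.
    standardize=keep_case_standardize,
    output_mode='int',
)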
Additionally, we need to define our inputs and labels. In this case, we use all but the last word of a given text as the input, and all but the first word as the label; for example, for the sentence "i feel so tired today", the input would be "i feel so tired" and the label "feel so tired today". In essence, this encourages the model to learn which word is likely to follow an incomplete sentence.
This is how you can preprocess a text dataset for model training:
def prepare_lm_inputs_labels(text, vectorize_layer):
    # Shift word sequences by one position so that the target for position i
    # is the word at position i + 1.
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y

maxlen = 80         # Max sequence size
vocab_size = 20000  # Only consider the top 20k words

# Create a vectorization layer to tokenize text.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size - 1,
    output_mode='int',
    output_sequence_length=maxlen + 1,
)

# Adapt the vectorization layer to our text samples.
vectorize_layer.adapt(ds.map(lambda x: x['sentence']).take(10_000))
vocab = vectorize_layer.get_vocabulary()

# Build a tokenization index for input prompts.
word_to_index = {}
for index, word in enumerate(vocab):
    word_to_index[word] = index

text_ds = ds_emo.map(lambda x: x['sentence'])
text_ds = text_ds.map(lambda x: prepare_lm_inputs_labels(x, vectorize_layer))
text_ds = text_ds.prefetch(tf.data.AUTOTUNE)
One thing worth noting is that we're not feeding the entire dataset to the TextVectorization.adapt() method, because it would be too slow to iterate through the whole dataset. In this case, we simply use the first 10,000 batches, but other approaches might be equally valid as long as they contain a representative portion of all the vocabulary contained in the dataset.
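As one minor variation (not used in this post), you could shuffle the batches before sampling so the adapted vocabulary is less biased by the order of the CSV file, at a similar cost to taking the first batches:

# Hypothetical alternative: mix batches within a sliding window before
# sampling, so the adapted vocabulary depends less on file ordering.
sample = ds.map(lambda x: x['sentence']).shuffle(buffer_size=1_000).take(10_000)
vectorize_layer.adapt(sample)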
Training the model and monitoring results
The final step of the process is to train the model. Nothing out of the ordinary here, except for the use of Keras' callback API to help us monitor the model outputs during training:
# `TextGenerator` is the callback from the Keras tutorial linked above; it
# prints generated text at the end of each epoch.
start_prompt = 'i feel like'
start_tokens = [word_to_index.get(_, 1) for _ in start_prompt.split()]
num_tokens_generated = 40
text_gen_callback = TextGenerator(num_tokens_generated, start_tokens, vocab)

model = create_model()
optimizer = tf.keras.optimizers.Adam(learning_rate=1E-4)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer, loss=[loss_fn, None], jit_compile=True)
model.fit(
    text_ds, epochs=100, steps_per_epoch=1000, callbacks=[text_gen_callback])
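For readers who don't want to jump over to the tutorial, the sketch below condenses what the TextGenerator callback does: starting from the prompt tokens, it repeatedly predicts next-token logits, samples from the top-k candidates, and prints the decoded text at the end of each epoch. The constructor arguments mirror the call above, but details such as the print interval and padding handling are simplified compared with the tutorial's implementation:

import numpy as np
import tensorflow as tf

class TextGenerator(tf.keras.callbacks.Callback):
    """Condensed sketch of the tutorial's callback: prints generated text."""

    def __init__(self, max_tokens, start_tokens, vocab, top_k=10):
        super().__init__()
        self.max_tokens = max_tokens    # Number of tokens to generate.
        self.start_tokens = start_tokens
        self.vocab = vocab              # Maps token index -> word.
        self.k = top_k

    def sample_from(self, logits):
        # Sample the next token among the top-k most likely candidates.
        values, indices = tf.math.top_k(logits, k=self.k, sorted=True)
        probs = tf.keras.activations.softmax(tf.expand_dims(values, 0))[0]
        probs = probs.numpy().astype('float64')
        probs /= probs.sum()  # Guard against float rounding in np.random.choice.
        return np.random.choice(indices.numpy(), p=probs)

    def on_epoch_end(self, epoch, logs=None):
        tokens = list(self.start_tokens)
        generated = 0
        while generated < self.max_tokens and len(tokens) < maxlen:
            # Pad the prompt to the model's fixed sequence length (maxlen).
            sample_index = len(tokens) - 1
            x = np.array([tokens + [0] * (maxlen - len(tokens))])
            logits, _ = self.model.predict(x, verbose=0)
            tokens.append(self.sample_from(logits[0][sample_index]))
            generated += 1
        print('Generated:', ' '.join(self.vocab[t] for t in tokens))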
Since training the model only takes a couple of hours and the number of parameters is relatively small, you shouldn't expect results as good as those reported by state-of-the-art methods! However, after cherry-picking some of the more coherent sample outputs, the model seems to work relatively well:
Not only are some of the outputs coherent and similar to the inputs, but they also appear to capture the same concept of "disappointment".
Another interesting point worth mentioning is the presence of the [UNK] token, which stands in for a word that the model predicts should go in the sentence but that does not exist in the vocabulary.
Code and data availability
The code used in this blog post is available as part of the following GitHub repositories and gists:
Models and data can be found in these HuggingFace Spaces.