GoEmotions Dataset: Training a Classifier
Last year, we announced the release of the GoEmotions dataset. In this blog post, we'll cover how the data can be downloaded, and how a simple model can be trained to classify fine-grained emotions given a piece of text.
GoEmotions taxonomy: Includes 28 emotion categories, including “neutral”.
Downloading the GoEmotions dataset from TFDS
The GoEmotions dataset is very easy to download! As per the GitHub repo's instructions, it's split across 3 CSV files:
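For example, here is a minimal sketch with pandas; the shard URLs are assumptions based on the repo's download instructions and may change:

```python
import pandas as pd

# URLs follow the GoEmotions GitHub repo's download instructions (assumed stable).
BASE_URL = "https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/"
SHARDS = ["goemotions_1.csv", "goemotions_2.csv", "goemotions_3.csv"]

# Read the three shards and concatenate them into a single DataFrame.
df = pd.concat((pd.read_csv(BASE_URL + shard) for shard in SHARDS), ignore_index=True)
print(df.shape)
print(df.columns.tolist())
```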
Alternatively, if you are a TensorFlow user, you can get the data directly through TensorFlow Datasets:
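A minimal sketch (the TFDS version exposes a simplified feature set: roughly, the comment text plus one boolean column per emotion):

```python
import tensorflow_datasets as tfds

# Load the three splits registered in TFDS for GoEmotions.
ds_train, ds_validation, ds_test = tfds.load(
    "goemotions", split=["train", "validation", "test"]
)

# Inspect the available features of a single split.
print(ds_train.element_spec)
```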
The raw CSV data is quite straightforward, as described in the GitHub repo. These are the fields:
- text: The text of the comment (with masked tokens, as described in the paper).
- id: The unique id of the comment.
- author: The Reddit username of the comment's author.
- subreddit: The subreddit that the comment belongs to.
- link_id: The link id of the comment.
- parent_id: The parent id of the comment.
- created_utc: The timestamp of the comment.
- rater_id: The unique id of the annotator.
- example_very_unclear: Whether the annotator marked the example as being very unclear or difficult to label (in this case they did not choose any emotion labels).
- Separate columns for each of the 28 emotion categories, with binary labels (0 or 1).
To see the list of labels, you can inspect the columns of the downloaded data, or get them directly from this file:
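As a sketch, assuming the label list is the emotions.txt file shipped in the GoEmotions folder of the google-research repository:

```python
import urllib.request

# Assumed location of the label list inside the google-research GitHub repository.
EMOTIONS_URL = (
    "https://raw.githubusercontent.com/google-research/google-research/"
    "master/goemotions/data/emotions.txt"
)

# Read the file and split it into one label name per line.
with urllib.request.urlopen(EMOTIONS_URL) as response:
    emotions = response.read().decode("utf-8").splitlines()

print(emotions)
```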
For the purposes of training, you may want to preprocess the data. Here's how you would get the different data splits and build multi-hot label vectors for the different emotions (each example can carry more than one label):
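One possible sketch using the TFDS version of the data; the feature names (comment_text plus one boolean column per emotion) and split names are assumptions based on the catalog entry:

```python
import numpy as np
import tensorflow_datasets as tfds

# Load each split as a single batch of tensors and convert everything to NumPy.
data, info = tfds.load("goemotions", batch_size=-1, with_info=True)
data = tfds.as_numpy(data)

# Every feature except the raw text is a boolean emotion column.
label_names = sorted(name for name in info.features.keys() if name != "comment_text")

def split_to_arrays(split):
    """Return the raw texts and a multi-hot label matrix for one split."""
    texts = split["comment_text"]
    labels = np.stack([split[name] for name in label_names], axis=1).astype("float32")
    return texts, labels

train_texts, train_labels = split_to_arrays(data["train"])
val_texts, val_labels = split_to_arrays(data["validation"])
test_texts, test_labels = split_to_arrays(data["test"])
```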
Downloading pretrained BERT weights and defining a model
One of the most obvious uses of this dataset is to train a classifier to determine what emotions are depicted in a piece of text. There are many different models that can be used for natural language processing (NLP) tasks. A relatively safe choice is the BERT model, which is capable of turning free-form text into an embedding.
There are many different ways to preprocess text, and an even greater number of ways to encode the preprocessed text into an embedding. This is an illustration of a simplified version of BERT's architecture:
Illustration of the BERT model architecture.
Using a pre-trained BERT model from TFHub and Keras' functional model API, you can create your own model based on BERT in just a few lines of code:
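Here is a minimal sketch, assuming a standard uncased BERT encoder and its matching preprocessing model from TFHub (any compatible pair of handles would do), with 28 sigmoid outputs because an example can carry several emotion labels:

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the ops used by the preprocessing model

# Assumed TFHub handles; any matching preprocessor/encoder pair can be substituted.
PREPROCESS_HANDLE = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER_HANDLE = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
NUM_EMOTIONS = 28

# Keras functional API: raw strings in, one sigmoid score per emotion out.
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
encoder_inputs = hub.KerasLayer(PREPROCESS_HANDLE, name="preprocessing")(text_input)
encoder_outputs = hub.KerasLayer(ENCODER_HANDLE, trainable=True, name="bert")(encoder_inputs)
pooled_output = encoder_outputs["pooled_output"]  # [batch, 768] sentence embedding
dropout = tf.keras.layers.Dropout(0.1)(pooled_output)
predictions = tf.keras.layers.Dense(NUM_EMOTIONS, activation="sigmoid", name="classifier")(dropout)

model = tf.keras.Model(inputs=text_input, outputs=predictions)
model.summary()
```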
Data augmentation with the nlpaug library
Although extremely useful, the GoEmotions dataset is relatively small compared to many other NLP datasets. To encourage generalization and prevent overfitting, you can make use of the nlpaug library to perform data augmentation and synthetically increase the size of the dataset:
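A possible sketch with nlpaug's word-level augmenters (the parameter values are illustrative assumptions, not the exact configuration behind any published results):

```python
import nlpaug.augmenter.word as naw

# Word-level augmenters; parameter choices here are illustrative assumptions.
spelling_aug = naw.SpellingAug(aug_max=1)               # misspell one randomly chosen word
swap_aug = naw.RandomWordAug(action="swap", aug_max=1)  # swap a randomly chosen word with a neighbour

def augment_example(text):
    """Return two misspelled variants and two word-swapped variants of `text`."""
    return spelling_aug.augment(text, n=2) + swap_aug.augment(text, n=2)

# Keep the original example and add the four synthetic variants, reusing its labels.
augmented_texts, augmented_labels = [], []
for text, label in zip(train_texts, train_labels):
    text = text.decode("utf-8") if isinstance(text, bytes) else text
    for synthetic in [text] + augment_example(text):
        augmented_texts.append(synthetic)
        augmented_labels.append(label)
```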
This way, for every real example in the dataset, you generate two additional synthetic examples with a spelling mistake in a randomly chosen word, and two more with two randomly chosen words swapped in the text.
While this might seem like a crude technique, it works well because it encourages the model to learn that a word and its possible misspellings should map to similar embeddings. Similarly, it encourages the trained model to have a similar understanding of a sentence and a variation with its words slightly out of order.
Fine-Tuning the BERT model
With all the pieces in place, you can fine-tune the BERT model using the battle-tested Keras API:
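For instance, a minimal sketch that reuses the arrays from the preprocessing step above (the learning rate, batch size, and number of epochs are illustrative assumptions):

```python
# Multi-label setup: binary cross-entropy over independent sigmoid outputs.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.AUC(name="auc", multi_label=True)],
)

# The augmented arrays from the previous step can be used here instead.
history = model.fit(
    train_texts,
    train_labels,
    validation_data=(val_texts, val_labels),
    batch_size=32,
    epochs=3,
)
```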
After a couple of hours of training time (using a GPU-accelerated Colab runtime), you will have a classifier that uses the BERT architecture and is capable of somewhat-accurately determining the emotions depicted in a piece of text. Inspecting some of the predictions for unseen pieces of text from the test subset can be helpful in evaluating the model:
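For example, a quick way to eyeball a few predictions, reusing the names from the earlier sketches:

```python
import numpy as np

# Score a handful of unseen test examples and list the emotions above a 0.5 threshold.
sample_texts = test_texts[:5]
probabilities = model.predict(sample_texts)

for text, scores in zip(sample_texts, probabilities):
    predicted = [label_names[i] for i in np.where(scores > 0.5)[0]]
    readable = text.decode("utf-8") if isinstance(text, bytes) else text
    print(f"{readable!r} -> {predicted or 'no emotion above threshold'}")
```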
In many cases, the model is uncertain of the classification (no class scores higher than 0.5) and it does not predict any emotions. In many others, the prediction is wrong according to the dataset, but it could arguably still be considered a pretty decent classification.
Code and data availability
The code used in this blog post is available as part of the following GitHub repositories and gists:
Models and data can be found in the following Hugging Face Spaces: