Classifying Movie Reviews Using Sentiment Analysis and ULMFiT
We will build a classifier that labels movie reviews as positive or negative based on their sentiment, using a fine-tuned language model as its foundation.
- Preprocessing the text
- Training a Text Classifier
- Creating the Classifier DataLoaders
- Fine-Tuning the Classifier
In this blog post, we will train a language model on movie reviews and then use it as the basis for a classifier that labels each review as positive or negative, based on its sentiment.
A language model is a model that can guess the next word in a text, having read the words that come before it. This kind of task is called self-supervised learning: we train the model using labels that are embedded in the independent variable, rather than requiring external labels.
The dataset we'll be using is the IMDb Large Movie Review Dataset, which contains 25,000 highly polarized movie reviews for training, and 25,000 for testing.
For this task, we will be using the Universal Language Model Fine-Tuning (ULMFiT) approach. First, we will train our language model starting from a model pretrained on Wikipedia. Then we will go one step further by fine-tuning this pretrained language model on the IMDB corpus, and then use that as the base for our classifier. In other words, we'll be fine-tuning the sequence-based language model prior to fine-tuning the classification model. Even if our language model knows the basics of the language we are using in the task, it is beneficial for it to learn the style of the corpus we are targeting. For example, the IMDB dataset contains lots of names of movie directors and actors, and often a less formal style of language than that seen on Wikipedia.
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
from fastbook import *
Let's download the dataset.
from fastai.text.all import *
path = untar_data(URLs.IMDB)
Path.BASE_PATH = path
path.ls()
We'll grab the text files using get_text_files, which gets all the text files in a path. We can optionally pass folders to restrict the search to a particular list of subfolders.
files = get_text_files(path, folders=['train', 'test', 'unsup'])
Here's a random review we can look at.
txt = files[0].open().read()
txt
Tokenization
Tokenization converts the text into a list of words or characters or substrings, depending on the granularity of the model.
There are three main approaches: word-based, subword-based, and character-based.
We'll use word tokenization here, since we're dealing with plain English text in which words are separated by spaces.
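To make the difference in granularity concrete, here is a minimal plain-Python sketch (not using fastai's tokenizers) that contrasts a naive word-level split with a character-level split; the sample sentence is made up for illustration.
# naive illustration of tokenization granularity (not fastai's tokenizers)
sample = "This movie was great!"
word_tokens = sample.split(' ')   # word-level: split on spaces
char_tokens = list(sample)        # character-level: one token per character
print(word_tokens)                # ['This', 'movie', 'was', 'great!']
print(char_tokens[:10])           # ['T', 'h', 'i', 's', ' ', 'm', 'o', 'v', 'i', 'e']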
Word Tokenization
We'll use WordTokenizer for word tokenization. It wraps fastai's current default word tokenizer, which is spaCy for now. The spaCy library has a sophisticated rules engine with special rules for URLs, individual special English words, and much more.
We'll use fastai's coll_repr(collection, n)
function to display the results. This displays the first n items of a collection, along with the full size. Also, fastai's tokenizers take a collection of documents to tokenize, so we have to wrap txt
in a list.
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))
As you can see, the words and punctuation have been separated into individual tokens.
Let's use the Tokenizer
class to add some additional functionality to the tokenization process.
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 35))
Everything is now lowercased, and there are some tokens that start with the characters "xx". These are special tokens that fastai adds by default by applying a number of rules when preprocessing text; the rules are designed to make it easier for a model to recognize the important parts of a sentence. We'll see a few of them in action right after the list below.
Some of the main special tokens are:
- xxbos: indicates the beginning of a text
- xxmaj: indicates the next word begins with a capital letter
- xxunk: indicates the next word is unknown
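Running the Tokenizer we built above on a short made-up sentence shows some of these rules in action; the exact output depends on fastai's default rules, but markers such as xxbos, xxmaj, and xxup should appear.
# tokenize a made-up sentence to see the special tokens fastai inserts
print(coll_repr(tkn('I absolutely LOVED this movie!!'), 15))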
Numericalization
Numericalization is the process of mapping tokens to integers. It makes a list of all the unique tokens that appear (the vocab) and converts each token into a number by looking up its index in the vocab.
Let's take a look at this in action. We need to call setup
(a special fastai method) on Numericalize
to create the vocab.
txts = L(o.open().read() for o in files[:2000])
txts[0][:]
toks200 = txts[:200].map(tkn)
toks200[0]
We can pass this to setup
to create our vocab.
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab, 20)
The special tokens appear first, and then every word appears once, in descending order of frequency.
Once we've created our Numericalize
object, we can use it as if it were a function.
nums = num(toks)[:20]
nums
Our tokens have been converted to a tensor of integers that our model can receive. We can check if they map back to the original text.
' '.join(num.vocab[o] for o in nums)
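Since our vocab was built from only 200 tokenized reviews, any token not seen during setup falls back to xxunk. As a quick illustrative check, assuming fastai's default behavior of mapping unknown tokens to index 0:
# a token that (almost certainly) isn't in our small vocab maps to xxunk
idx = num(['supercalifragilistic'])[0].item()
idx, num.vocab[idx]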
Now that we have numbers, we need to put them in batches for our model.
Putting our Texts into Batches for a Language Model
Fastai provides an LMDataLoader class which automatically handles creating a dependent variable that is offset from the independent variable by one token. It shuffles the collection of documents at every epoch and concatenates them into a stream of tokens, then cuts that stream into a batch of fixed-size consecutive mini-streams. Our model will read the mini-streams in order, and thanks to an inner state, it will produce the same activations whatever sequence length we pick.
nums200 = toks200.map(num)
dl = LMDataLoader(nums200)
Let's confirm that this gives the expected results, by grabbing the first batch.
x, y = first(dl)
x.shape, y.shape
Here 64
is the batch size, and 72
is the sequence length.
Let's look at the first row of the independent variable, which should be the start of the first text.
' '.join(num.vocab[o] for o in x[0][:200])
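We can also confirm that the dependent variable is the same stream of tokens, offset by one position:
# the targets are the inputs shifted by one token
' '.join(num.vocab[o] for o in y[0][:20])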
This concludes all the preprocessing steps we need to apply to our data. We are now ready to train our text classifier.
Training a Text Classifier
As we discussed earlier, there are two steps to training a state-of-the-art text classifier using transfer learning: first we need to fine-tune our language model pretrained on Wikipedia to the corpus of IMDB reviews, and then we can use that model to train a classifier.
Language Model using DataBlock
Fastai handles tokenization and numericalization automatically when TextBlock
is passed to DataBlock
.
Let's create the DataLoaders for our language model using TextBlock.
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dls_lm = DataBlock(
blocks=TextBlock.from_folder(path, is_lm=True),
get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
Setting up the numericalizer's vocab can take a long time. But TextBlock
performs a few optimizations:
- it saves the tokenized documents in a temporary folder, so it doesn't have to tokenize them more than once.
- it runs multiple tokenization processes in parallel, to take advantage of your computer's CPU.
The from_folder class method tells TextBlock how to access the texts for the initial preprocessing.
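Once the DataLoaders are built, we can peek at the vocabulary that TextBlock created; the exact size depends on the corpus and fastai's default vocab settings.
len(dls_lm.vocab)   # number of tokens in the vocab built from the IMDB corpus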
We can look at a couple of examples from our DataLoaders.
dls_lm.show_batch(max_n=2)
Now that our data is ready, we can fine-tune the pretrained language model.
Fine-tuning the Language Model
To convert the integer word indices into activations that we can use for our neural network, we will use embeddings. We'll feed those embeddings into a recurrent neural network (RNN), using an architecture called AWD-LSTM.
The embeddings in the pretrained model are merged with random embeddings added for words that weren't in the pretraining vocabulary. This is handled automatically inside language_model_learner
.
learn = language_model_learner(
dls_lm, AWD_LSTM, drop_mult=0.3,
metrics=[accuracy, Perplexity()]
).to_fp16()
Since predicting the next word is effectively a classification problem (choosing one word out of the vocab), the loss function used by default is cross-entropy loss. The Perplexity metric is the exponential of the loss (i.e. torch.exp(cross_entropy)) and is often used in NLP problems. The accuracy metric tells us how often our model is right when trying to predict the next word.
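As a quick standalone illustration of the relationship between the two, the sketch below (with made-up logits and targets) computes perplexity as the exponential of the cross-entropy loss.
import torch
import torch.nn.functional as F

# made-up logits for 2 predictions over a 10-word vocab, plus made-up targets
logits  = torch.randn(2, 10)
targets = torch.tensor([3, 7])

loss = F.cross_entropy(logits, targets)
perplexity = torch.exp(loss)   # perplexity is just exp(cross-entropy)
loss.item(), perplexity.item()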
We now have everything in place for the first stage of the process, i.e. fine-tuning our language model (pretrained on Wikipedia) on the corpus of IMDB reviews: we've built the DataLoaders and Learner we need.
It takes a long time to train each epoch, so we'll be saving the intermediate model results during the training process. Since fine_tune
doesn't do that for us, we'll use fit_one_cycle
. language_model_learner
automatically calls freeze
when using a pretrained model (which is the default), so this will only train the embeddings (the only part of the model that contains randomly initialized weights - i.e. embeddings for words that are in our IMDB vocab, but aren't in the pretrained model vocab).
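As an aside, fastai also provides a SaveModelCallback that can write checkpoints during training; the sketch below assumes the callback's fname and every_epoch arguments behave as in recent fastai versions, and could be used instead of the plain call in the next cell (in this post we simply call learn.save manually after training).
# optional alternative: checkpoint the language model after every epoch of fine-tuning
learn.fit_one_cycle(3, 2e-2, cbs=SaveModelCallback(fname='lang_model_ckpt', every_epoch=True))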
learn.fit_one_cycle(3, 2e-2)
Since the model took a long time to train, let's save the state of our model.
learn.save('lang_model')
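If we come back to this notebook later (or the kernel restarts), we can restore that state before continuing:
# reload the saved language model weights before continuing
learn = learn.load('lang_model')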
Once the initial training has completed, we can continue fine-tuning the model after unfreezing.
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
Once this is done, we save all of our model except the final layer that converts activations into probabilities of picking each token in our vocabulary. The model without this final layer is called the encoder. We can save it with save_encoder.
learn.save_encoder('finetuned')
This completes the first stage of the process, i.e. fine-tuning the language model.
Creating the Classifier DataLoaders
Now that we have fine-tuned the language model, we need to fine-tune the classifier. The language model only predicts the next word of a document, so it doesn't need any external labels. Our classifier, however, needs to predict the sentiment of a movie review.
Let's create a DataBlock
for our classifier.
dls_clas = DataBlock(
blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
get_y=parent_label,
get_items=partial(get_text_files, folders=['train', 'test']),
splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
In this DataBlock, TextBlock.from_folder
no longer has the is_lm=True
parameter. Instead we pass the vocab
we created for the language model fine-tuning. The reason we pass the vocab
of the language model is to make sure that we use the same correspondence of token to index. Otherwise the embeddings we learned in our fine-tuned language model won't make any sense to this model, and the fine-tuning step won't be of any use.
By not passing is_lm
, we tell TextBlock
that we have regular labeled data, rather than using the next tokens as labels.
show_batch shows us the dependent variable (the sentiment, in this case) along with each independent variable (the movie review).
dls_clas.show_batch(max_n=3)
We do have to collate multiple documents into a mini-batch. We will expand the shortest texts to make them all the same size, using a special padding token that will be ignored by our model. Also, to avoid memory issues and improve performance, we will batch together texts that are roughly the same length by sorting the documents in the training set by length prior to each epoch. As a result, the documents collated into a single batch will tend to be of similar lengths. We won't pad every batch to the same size, but will instead use the size of the largest document in each batch as the target size.
The sorting and padding are done automatically by the data block API when using a TextBlock with is_lm=False.
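We can grab one batch from the classifier DataLoaders to see this in action; the second dimension of x is set by the longest document in that particular batch, so the exact shape will vary from batch to batch.
# one padded mini-batch of reviews (x) and their sentiment labels (y)
x, y = first(dls_clas.train)
x.shape, y.shape   # e.g. (128, <length of longest review in this batch>), (128,)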
Let's now create a model to classify our texts.
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
metrics=accuracy).to_fp16()
The final step prior to training the classifier is to load the encoder from our fine-tuned language model. We use load_encoder instead of load because we only have pretrained weights available for the encoder; load by default raises an exception if an incomplete model is loaded.
learn = learn.load_encoder('finetuned')
learn.fit_one_cycle(1, 2e-2)
Rather than unfreezing the whole model at once, we unfreeze it gradually, which tends to work better for this kind of NLP transfer learning. We can pass -2 to freeze_to to freeze everything except the last two parameter groups. The slice passed to fit_one_cycle applies discriminative learning rates, giving the earlier layers a lower learning rate than the later ones.
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 5e-3))
We can unfreeze a bit more and continue training.
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))
And now we'll finally unfreeze the whole model.
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))
Considering the resources we used and the time taken, a 94.3% accuracy is amazing! Let's check some of the results achieved by our model.
learn.show_results()
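Finally, we can try the classifier on a review of our own. learn.predict returns the predicted category along with its index and the probabilities for each class; the review below, and whatever it predicts, are just an illustration.
# classify a made-up review: returns (predicted label, label index, class probabilities)
learn.predict("I really enjoyed this movie, the acting was fantastic!")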