This post continues my previous one, Classifying movie reviews using Sentiment Analysis and ULMFit. Read that first if you want to better understand the methodology behind the process used in this task.

The dataset we’ll be using is the IMDb Large Movie Review Dataset, which contains 25,000 highly polarized movie reviews for training, and 25,000 for testing.

!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

Mounted at /content/gdrive

from fastbook import *

from fastai.text.all import *
path = untar_data(URLs.IMDB)

Path.BASE_PATH = path
path.ls()

(#7) [Path('imdb.vocab'),Path('train'),Path('README'),Path('tmp_clas'),Path('test'),Path('tmp_lm'),Path('unsup')]

We’ll grab the text files using get_text_files, which collects all the text files in a path. We can optionally pass folders to restrict the search to a particular list of subfolders.

files = get_text_files(path, folders=['train', 'test', 'unsup'])
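To make the folders argument concrete, here is a rough standard-library sketch of the same idea. The helper name list_text_files and the tiny directory tree are made up for illustration; this is not how fastai implements get_text_files, only an approximation of the behaviour described above.

```python
from pathlib import Path
import tempfile

def list_text_files(root, folders=None):
    """Recursively collect .txt files under root.

    If folders is given, only those top-level subfolders are searched.
    (Hypothetical helper, not part of fastai.)
    """
    root = Path(root)
    search_roots = [root / f for f in folders] if folders else [root]
    files = []
    for base in search_roots:
        if base.is_dir():
            files.extend(sorted(base.rglob("*.txt")))
    return files

# Throwaway directory tree that loosely mimics the IMDb layout.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for rel in ["train/pos/0_9.txt", "test/neg/1_2.txt", "tmp_lm/skip.txt"]:
        p = tmp / rel
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_text("a review")
    found = list_text_files(tmp, folders=["train", "test"])
    found_names = [f.relative_to(tmp).as_posix() for f in found]
    print(found_names)
```

The file under tmp_lm is skipped because that folder is not in the list, which is the same reason we pass only train, test and unsup in this post.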

txt = files[0].open().read()
txt

"Dressed to Kill (1980) is a mystery horror film from Brian De Palma and it really works.The atmosphere is right there.The atmosphere that makes you scared.And isn't that what a horror film is supposed to do.All the actors are in the right places.Michael Caine is perfect as Dr. Robert Elliott, the shrink with a little secret.Angie Dickinson as Kate Miller, the sexually frustrated mature woman is terrific.Keith Gordon as her son Peter is brilliant.Nancy Allen as Liz Blake the call girl is fantastic.Dennis Franz does his typical detective role.His Detective Marino is one of the most colorful in this movie.There are plenty of creepy scenes in this movie.The elevator scene is one of them.There have been made comparisons between this and Alfred Hitchcock's Psycho (1960).There are some similarities between these two movies.Both of these movies may cause some sleepless nights."

Training a Text Classifier

Language Model using DataBlock

Fastai handles tokenization and numericalization automatically when TextBlock is passed to DataBlock. Let’s build the DataLoaders for our language model using TextBlock.

get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=72)

dls_lm.show_batch(max_n=2)
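If tokenization and numericalization feel abstract, the toy sketch below shows the two steps in plain Python. It is a deliberately simplified illustration and not fastai's tokenizer: the function names are hypothetical, the tokenizer only lowercases and splits on whitespace, and the only special token is an unknown-word marker that borrows fastai's xxunk name.

```python
from collections import Counter

def tokenize(text):
    """Toy tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def build_vocab(token_lists, min_freq=1):
    """Assign an integer id to every token; id 0 is reserved for unknown words."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = ["xxunk"] + [tok for tok, c in counts.most_common() if c >= min_freq]
    tok2id = {tok: i for i, tok in enumerate(vocab)}
    return vocab, tok2id

def numericalize(tokens, tok2id):
    """Replace each token by its id, falling back to 0 (xxunk) for unseen words."""
    return [tok2id.get(tok, 0) for tok in tokens]

docs = ["This movie is great", "This movie is terrible"]
token_lists = [tokenize(d) for d in docs]
vocab, tok2id = build_vocab(token_lists)

# "boring" never appeared in docs, so it maps to the unknown id 0.
ids = numericalize(tokenize("this movie is boring"), tok2id)
print(vocab)
print(ids)
```

The integer ids are what the model actually sees; the vocab list is what lets us turn predictions back into words.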

Now that our data is ready, we can fine-tune the pretrained language model.

Fine-tuning the Language Model

To convert the integer word indices into activations that we can use for our neural network, we will use embeddings. We’ll feed those embeddings into a recurrent neural network (RNN), using an architecture called AWD-LSTM. The embeddings in the pretrained model are merged with random embeddings added for words that weren’t in the pretraining vocabulary. This is handled automatically inside language_model_learner.

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()
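The vocabulary merge described above can be pictured with a small sketch. This follows the description in this post (pretrained rows are reused, new words get random rows); the function name merge_embeddings, the toy vectors and the uniform range are all assumptions for illustration, and the library's own initialisation for new words may differ in detail.

```python
import random

def merge_embeddings(pretrained_vocab, pretrained_vectors, new_vocab, dim, seed=0):
    """Build an embedding table for new_vocab.

    Words known to the pretrained model keep their vector; unseen words
    get a small random vector. (Hypothetical helper, not fastai's code.)
    """
    rng = random.Random(seed)
    lookup = dict(zip(pretrained_vocab, pretrained_vectors))
    merged = []
    for word in new_vocab:
        if word in lookup:
            merged.append(list(lookup[word]))  # reuse the pretrained row
        else:
            merged.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])  # new word
    return merged

pretrained_vocab = ["movie", "film"]
pretrained_vectors = [[1.0, 2.0], [3.0, 4.0]]
new_vocab = ["movie", "film", "caine"]  # "caine" is not in the pretrained vocab

merged = merge_embeddings(pretrained_vocab, pretrained_vectors, new_vocab, dim=2)
print(merged)
```

The rows for new words start out uninformative, which is one reason we fine-tune the language model on IMDb before using it for anything else.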

learn.fit_one_cycle(3, 2e-2)

learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

Text Generation

Let’s use our model to generate random reviews. Since it is trained to guess what the next word of the sentence is, we can use the model to write new reviews.

TEXT = 'This movie is terrible'
N_WORDS = 70
N_SENTENCES = 5
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)
         for _ in range(N_SENTENCES)]

print('\n\n'.join(preds))

This movie is terrible ! It 's terrible . The actors are ALL bad . The story is bad . The special effects are TERRIBLE . And that 's really the only thing that will save this movie . The plot is pathetic . The movie is just boring . a group of people are about to end up in a mental hospital where they

This movie is terrible . My friend went to see it and we were so disappointed . I 'm not usually a fan of the book but i had earlier read that Chris Columbus wrote some of the best writing , directing and directing since , well , there are no words to describe how bad this piece of garbage was . It was a complete waste of time

This movie is terrible . And it has been very long . i did n't think it was even worth the rental , but it was very recommended . If you are into action movies , be sure to rent Titanic . You will be disappointed . It is a well made movie . The acting is good enough to keep your interest . Everything about this

This movie is terrible . Not only is it offensive in spots , it only gets worse . It has no story line . No acting and dead and cheap special effects . What a waste of talent . My 3 year old son was laughing , not laughing . Well , i really loved the first film . This one is clearly one of the

This movie is terrible , i do n't know why i could n't find it , it was so awful that i had to leave the room after this horrible film was finished . 

 The plot was so stupid that it went on way too long . It was painful to watch . The fact that the audience was so bored was incredible . The only reason this film
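The five reviews differ because each next word is sampled rather than picked greedily, and the temperature argument controls how adventurous that sampling is. The sketch below shows one common way to apply a temperature to a probability distribution (raise each probability to 1/T and renormalise). The function names and the three-word distribution are made up for illustration, and I am not claiming this is exactly what learn.predict does internally.

```python
import random

def apply_temperature(probs, temperature):
    """Rescale a distribution: p_i ** (1 / T), then renormalise to sum to 1."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_next(vocab, probs, temperature=1.0, rng=random):
    """Draw one word from vocab according to the temperature-adjusted probabilities."""
    weights = apply_temperature(probs, temperature)
    return rng.choices(vocab, weights=weights, k=1)[0]

vocab = ["terrible", "boring", "great"]
probs = [0.6, 0.3, 0.1]  # made-up next-word probabilities

sharper = apply_temperature(probs, 0.75)  # T < 1 favours the likeliest word
flatter = apply_temperature(probs, 1.5)   # T > 1 spreads probability out

print(sharper)
print(flatter)
print(sample_next(vocab, probs, temperature=0.75))
```

With a temperature below 1, as in the call above, the model sticks closer to its most likely continuations; higher values give more varied but less coherent text.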

Output of dls_lm.show_batch(max_n=2):

	text	text_
0	xxbos xxmaj being that i am not a fan of xxmaj snoop xxmaj dogg , as an actor , that made me even more anxious to check out this flick . i remember he was interviewed on " jay xxmaj leno , " and said that he turned down a role in the big - budget xxmaj adam xxmaj sandler comedy " the xxmaj longest xxmaj yard " to be in this	xxmaj being that i am not a fan of xxmaj snoop xxmaj dogg , as an actor , that made me even more anxious to check out this flick . i remember he was interviewed on " jay xxmaj leno , " and said that he turned down a role in the big - budget xxmaj adam xxmaj sandler comedy " the xxmaj longest xxmaj yard " to be in this film
1	viewer , the first number in the series does provide an unexpected element of suspense in addition to capable costuming from xxmaj ha xxmaj nguyen , fine stunt performing , and a polished turn from xxmaj carr . xxmaj an unrated version is available that seemingly promises to provide additional footage of the ardent romantic actions shared by the mismatched lovers . xxbos xxmaj the xxmaj minion is about … well ,	, the first number in the series does provide an unexpected element of suspense in addition to capable costuming from xxmaj ha xxmaj nguyen , fine stunt performing , and a polished turn from xxmaj carr . xxmaj an unrated version is available that seemingly promises to provide additional footage of the ardent romantic actions shared by the mismatched lovers . xxbos xxmaj the xxmaj minion is about … well , a
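Notice that in the show_batch output the text_ column is the text column shifted by one token: that is the target the language model learns to predict. A minimal sketch of building such input/target pairs from a stream of token ids is below. The function lm_pairs is hypothetical, and it ignores the batching and stream concatenation that a real language-model data loader performs.

```python
def lm_pairs(token_ids, seq_len):
    """Cut a token stream into (input, target) windows.

    Each target window is the input window shifted one token to the right.
    Incomplete trailing windows are dropped.
    """
    pairs = []
    for start in range(0, len(token_ids) - 1, seq_len):
        x = token_ids[start:start + seq_len]
        y = token_ids[start + 1:start + seq_len + 1]
        if len(x) == len(y):
            pairs.append((x, y))
    return pairs

stream = list(range(10))  # stand-in for numericalized text
pairs = lm_pairs(stream, seq_len=4)
for x, y in pairs:
    print(x, "->", y)
```

In the real DataLoaders above, seq_len=72 plays the role of the window length used here.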

Training log for learn.fit_one_cycle(3, 2e-2):

epoch	train_loss	valid_loss	accuracy	perplexity	time
0	4.128321	4.070849	0.284800	58.606724	29:54
1	3.995339	3.938066	0.296213	51.319229	29:57
2	3.860701	3.867283	0.303124	47.812309	30:00

Training log for learn.fit_one_cycle(10, 2e-3), after learn.unfreeze():

epoch	train_loss	valid_loss	accuracy	perplexity	time
0	3.675387	3.746690	0.317715	42.380569	32:10
1	3.645742	3.704438	0.322705	40.627209	32:08
2	3.605402	3.664308	0.327991	39.029121	31:54
3	3.535574	3.633687	0.331826	37.852131	31:51
4	3.451682	3.618303	0.334019	37.274242	31:41
5	3.417034	3.603825	0.336183	36.738476	31:49
6	3.359589	3.594853	0.337721	36.410355	31:44
7	3.266180	3.592850	0.338945	36.337505	31:36
8	3.213485	3.597207	0.339176	36.496162	31:34
9	3.178523	3.602469	0.339008	36.688713	31:36
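A quick note on the perplexity column: it is simply the exponential of the validation loss (the cross-entropy), so lower is better for both. The short check below recomputes it from the valid_loss values of the 3-epoch run above; the helper name perplexity is my own.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the average cross-entropy loss."""
    return math.exp(cross_entropy_loss)

# valid_loss and perplexity values copied from the 3-epoch training log.
valid_losses = [4.070849, 3.938066, 3.867283]
reported = [58.606724, 51.319229, 47.812309]

computed = [perplexity(loss) for loss in valid_losses]
for loss, got, want in zip(valid_losses, computed, reported):
    print(f"loss={loss:.6f}  computed={got:.3f}  reported={want:.3f}")
```

Roughly speaking, a perplexity of about 36 at the end of fine-tuning means the model is, on average, as uncertain as if it were choosing uniformly among about 36 words at each step.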