Download this tutorial

LSTM-based Language Models

A statistical language model is simply a probability distribution over sequences of words or characters [1]. In this tutorial, we’ll restrict our attention to word-based language models. Given a reliable language model, we can answer questions like which among the following strings are we more likely to encounter?

  1. ‘On Monday, Mr. Lamar’s “DAMN.” took home an even more elusive honor, one that may never have even seemed within reach: the Pulitzer Prize”

  2. “Frog zealot flagged xylophone the bean wallaby anaphylaxis extraneous porpoise into deleterious carrot banana apricot.”

Even if we’ve never seen either of these sentences in our entire lives, and even though no rapper has previously been awarded a Pulitzer Prize, we wouldn’t be shocked to see the first sentence in the New York Times. By comparison, we can all agree that the second sentence, consisting of incoherent babble, is comparatively unlikely. A statistical language model can assign precise probabilities to each of these and other strings of words.

Given a large corpus of text, we can estimate (or, in this case, train) a language model \(\hat{p}(x_1, ..., x_n)\). And given such a model, we can sample strings \(\mathbf{x} \sim \hat{p}(x_1, ..., x_n)\), generating new strings according to their estimated probability. Among other useful applications, we can use language models to score candidate transcriptions from speech recognition models, given a preference to sentences that seem more probable (at the expense of those deemed anomalous).

These days recurrent neural networks (RNNs) are the preferred method for language models. In this notebook, we will go through an example of using GluonNLP to

  1. implement a typical LSTM language model architecture

  2. train the language model on a corpus of real data

  3. bring in your own dataset for training

  4. grab off-the-shelf pre-trained state-of-the-art language models (i.e., AWD language model) using GluonNLP.

What is a language model (LM)?

The standard approach to language modeling consists of training a model that given a trailing window of text, predicts the next word in the sequence. When we train the model we feed in the inputs \(x_1, x_2, ...\) and try at each time step to predict the corresponding next word \(x_2, ..., x_{n+1}\). To generate text from a language model, we can iteratively predict the next word, and then feed this word as an input to the model at the subsequent time step. The image included below demonstrates this idea.

Train your own language model

Now let’s go through the step-by-step process on how to train your own language model using GluonNLP.

Preparation

We’ll start by taking care of our basic dependencies and setting up our environment.

Firstly, we import the required modules for GluonNLP and the LM.

In [1]:
import warnings
warnings.filterwarnings('ignore')

import glob
import time
import math

import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon.utils import download

import gluonnlp as nlp

Then we setup the environment for GluonNLP.

Please note that we should change num_gpus according to how many NVIDIA GPUs are available on the target machine in the following code.

In [2]:
num_gpus = 1
context = [mx.gpu(i) for i in range(num_gpus)] if num_gpus else [mx.cpu()]
log_interval = 200

Next we setup the hyperparameters for the LM we are using.

Note that BPTT stands for “back propagation through time,” and LR stands for learning rate. A link to more information on truncated BPTT can be found here.

In [3]:
batch_size = 20 * len(context)
lr = 20
epochs = 3
bptt = 35
grad_clip = 0.25

Loading the dataset

Now, we load the dataset, extract the vocabulary, numericalize, and batchify in order to perform truncated BPTT.

In [4]:
dataset_name = 'wikitext-2'

# Load the dataset
train_dataset, val_dataset, test_dataset = [
    nlp.data.WikiText2(
        segment=segment, bos=None, eos='<eos>', skip_empty=False)
    for segment in ['train', 'val', 'test']
]

# Extract the vocabulary and numericalize with "Counter"
vocab = nlp.Vocab(
    nlp.data.Counter(train_dataset), padding_token=None, bos_token=None)

# Batchify for BPTT
bptt_batchify = nlp.data.batchify.CorpusBPTTBatchify(
    vocab, bptt, batch_size, last_batch='discard')
train_data, val_data, test_data = [
    bptt_batchify(x) for x in [train_dataset, val_dataset, test_dataset]
]
Downloading /root/.mxnet/datasets/wikitext-2/wikitext-2-v1.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/wikitext-2/wikitext-2-v1.zip...

And then we load the pre-defined language model architecture as so:

In [5]:
model_name = 'standard_lstm_lm_200'
model, vocab = nlp.model.get_model(model_name, vocab=vocab, dataset_name=None)
print(model)
print(vocab)

# Initialize the model
model.initialize(mx.init.Xavier(), ctx=context)

# Initialize the trainer and optimizer and specify some hyperparameters
trainer = gluon.Trainer(model.collect_params(), 'sgd', {
    'learning_rate': lr,
    'momentum': 0,
    'wd': 0
})

# Specify the loss function, in this case, cross-entropy with softmax.
loss = gluon.loss.SoftmaxCrossEntropyLoss()
StandardRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 200, float32)
    (1): Dropout(p = 0.2, axes=())
  )
  (encoder): LSTM(200 -> 200, TNC, num_layers=2, dropout=0.2)
  (decoder): HybridSequential(
    (0): Dense(200 -> 33278, linear)
  )
)
Vocab(size=33278, unk="<unk>", reserved="['<eos>']")

Training the LM

Now that everything is ready, we can start training the model.

We first define a helper function for detaching the gradients on specific states for easier truncated BPTT.

In [6]:
def detach(hidden):
    if isinstance(hidden, (tuple, list)):
        hidden = [detach(i) for i in hidden]
    else:
        hidden = hidden.detach()
    return hidden

And then a helper evaluation function.

In [7]:
# Note that ctx is short for context
def evaluate(model, data_source, batch_size, ctx):
    total_L = 0.0
    ntotal = 0
    hidden = model.begin_state(
        batch_size=batch_size, func=mx.nd.zeros, ctx=ctx)
    for i, (data, target) in enumerate(data_source):
        data = data.as_in_context(ctx)
        target = target.as_in_context(ctx)
        output, hidden = model(data, hidden)
        hidden = detach(hidden)
        L = loss(output.reshape(-3, -1), target.reshape(-1))
        total_L += mx.nd.sum(L).asscalar()
        ntotal += L.size
    return total_L / ntotal

The main training loop

Our loss function will be the standard cross-entropy loss function used for multi-class classification, applied at each time step to compare the model’s predictions to the true next word in the sequence. We can calculate gradients with respect to our parameters using truncated BPTT. In this case, we’ll back propagate for \(35\) time steps, updating our weights with stochastic gradient descent and a learning rate of \(20\); these correspond to the hyperparameters that we specified earlier in the notebook.

In [8]:
# Function for actually training the model
def train(model, train_data, val_data, test_data, epochs, lr):
    best_val = float("Inf")
    start_train_time = time.time()
    parameters = model.collect_params().values()

    for epoch in range(epochs):
        total_L = 0.0
        start_epoch_time = time.time()
        start_log_interval_time = time.time()
        hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx)
                   for ctx in context]

        for i, (data, target) in enumerate(train_data):
            data_list = gluon.utils.split_and_load(data, context,
                                                   batch_axis=1, even_split=True)
            target_list = gluon.utils.split_and_load(target, context,
                                                     batch_axis=1, even_split=True)
            hiddens = detach(hiddens)
            L = 0
            Ls = []

            with autograd.record():
                for j, (X, y, h) in enumerate(zip(data_list, target_list, hiddens)):
                    output, h = model(X, h)
                    batch_L = loss(output.reshape(-3, -1), y.reshape(-1,))
                    L = L + batch_L.as_in_context(context[0]) / (len(context) * X.size)
                    Ls.append(batch_L / (len(context) * X.size))
                    hiddens[j] = h
            L.backward()
            grads = [p.grad(x.context) for p in parameters for x in data_list]
            gluon.utils.clip_global_norm(grads, grad_clip)

            trainer.step(1)

            total_L += sum([mx.nd.sum(l).asscalar() for l in Ls])

            if i % log_interval == 0 and i > 0:
                cur_L = total_L / log_interval
                print('[Epoch %d Batch %d/%d] loss %.2f, ppl %.2f, '
                      'throughput %.2f samples/s'%(
                    epoch, i, len(train_data), cur_L, math.exp(cur_L),
                    batch_size * log_interval / (time.time() - start_log_interval_time)))
                total_L = 0.0
                start_log_interval_time = time.time()

        mx.nd.waitall()

        print('[Epoch %d] throughput %.2f samples/s'%(
                    epoch, len(train_data)*batch_size / (time.time() - start_epoch_time)))

        val_L = evaluate(model, val_data, batch_size, context[0])
        print('[Epoch %d] time cost %.2fs, valid loss %.2f, valid ppl %.2f'%(
            epoch, time.time()-start_epoch_time, val_L, math.exp(val_L)))

        if val_L < best_val:
            best_val = val_L
            test_L = evaluate(model, test_data, batch_size, context[0])
            model.save_parameters('{}_{}-{}.params'.format(model_name, dataset_name, epoch))
            print('test loss %.2f, test ppl %.2f'%(test_L, math.exp(test_L)))
        else:
            lr = lr*0.25
            print('Learning rate now %f'%(lr))
            trainer.set_learning_rate(lr)

    print('Total training throughput %.2f samples/s'%(
                            (batch_size * len(train_data) * epochs) /
                            (time.time() - start_train_time)))

We can now actually perform the training

In [9]:
train(model, train_data, val_data, test_data, epochs, lr)
[Epoch 0 Batch 200/2983] loss 7.65, ppl 2098.04, throughput 1509.25 samples/s
[Epoch 0 Batch 400/2983] loss 6.74, ppl 843.74, throughput 1581.06 samples/s
[Epoch 0 Batch 600/2983] loss 6.34, ppl 568.86, throughput 1568.91 samples/s
[Epoch 0 Batch 800/2983] loss 6.19, ppl 485.83, throughput 1565.35 samples/s
[Epoch 0 Batch 1000/2983] loss 6.05, ppl 422.86, throughput 1561.80 samples/s
[Epoch 0 Batch 1200/2983] loss 5.97, ppl 390.63, throughput 1566.37 samples/s
[Epoch 0 Batch 1400/2983] loss 5.85, ppl 348.17, throughput 1573.16 samples/s
[Epoch 0 Batch 1600/2983] loss 5.86, ppl 349.49, throughput 1543.06 samples/s
[Epoch 0 Batch 1800/2983] loss 5.70, ppl 299.47, throughput 1578.21 samples/s
[Epoch 0 Batch 2000/2983] loss 5.68, ppl 291.84, throughput 1551.47 samples/s
[Epoch 0 Batch 2200/2983] loss 5.58, ppl 264.42, throughput 1512.75 samples/s
[Epoch 0 Batch 2400/2983] loss 5.58, ppl 264.22, throughput 1565.65 samples/s
[Epoch 0 Batch 2600/2983] loss 5.57, ppl 262.79, throughput 1579.01 samples/s
[Epoch 0 Batch 2800/2983] loss 5.46, ppl 235.67, throughput 1538.36 samples/s
[Epoch 0] throughput 1556.55 samples/s
[Epoch 0] time cost 39.67s, valid loss 5.41, valid ppl 223.20
test loss 5.33, test ppl 206.37
[Epoch 1 Batch 200/2983] loss 5.47, ppl 236.74, throughput 1487.53 samples/s
[Epoch 1 Batch 400/2983] loss 5.45, ppl 232.48, throughput 1572.55 samples/s
[Epoch 1 Batch 600/2983] loss 5.28, ppl 197.12, throughput 1557.06 samples/s
[Epoch 1 Batch 800/2983] loss 5.30, ppl 199.92, throughput 1532.08 samples/s
[Epoch 1 Batch 1000/2983] loss 5.27, ppl 194.17, throughput 1578.75 samples/s
[Epoch 1 Batch 1200/2983] loss 5.26, ppl 191.98, throughput 1549.28 samples/s
[Epoch 1 Batch 1400/2983] loss 5.26, ppl 192.71, throughput 1578.71 samples/s
[Epoch 1 Batch 1600/2983] loss 5.32, ppl 204.99, throughput 1521.92 samples/s
[Epoch 1 Batch 1800/2983] loss 5.19, ppl 178.83, throughput 1563.78 samples/s
[Epoch 1 Batch 2000/2983] loss 5.20, ppl 181.93, throughput 1546.04 samples/s
[Epoch 1 Batch 2200/2983] loss 5.12, ppl 167.45, throughput 1531.01 samples/s
[Epoch 1 Batch 2400/2983] loss 5.15, ppl 171.91, throughput 1566.16 samples/s
[Epoch 1 Batch 2600/2983] loss 5.16, ppl 174.85, throughput 1564.70 samples/s
[Epoch 1 Batch 2800/2983] loss 5.09, ppl 162.19, throughput 1579.29 samples/s
[Epoch 1] throughput 1550.90 samples/s
[Epoch 1] time cost 39.82s, valid loss 5.18, valid ppl 177.86
test loss 5.10, test ppl 163.81
[Epoch 2 Batch 200/2983] loss 5.13, ppl 169.14, throughput 1528.17 samples/s
[Epoch 2 Batch 400/2983] loss 5.16, ppl 173.68, throughput 1562.42 samples/s
[Epoch 2 Batch 600/2983] loss 4.97, ppl 144.35, throughput 1551.41 samples/s
[Epoch 2 Batch 800/2983] loss 5.02, ppl 151.14, throughput 1567.79 samples/s
[Epoch 2 Batch 1000/2983] loss 5.01, ppl 150.12, throughput 1550.63 samples/s
[Epoch 2 Batch 1200/2983] loss 5.01, ppl 150.15, throughput 1556.96 samples/s
[Epoch 2 Batch 1400/2983] loss 5.04, ppl 154.77, throughput 1518.96 samples/s
[Epoch 2 Batch 1600/2983] loss 5.11, ppl 165.59, throughput 1548.73 samples/s
[Epoch 2 Batch 1800/2983] loss 4.99, ppl 146.23, throughput 1538.83 samples/s
[Epoch 2 Batch 2000/2983] loss 5.02, ppl 150.75, throughput 1550.50 samples/s
[Epoch 2 Batch 2200/2983] loss 4.92, ppl 137.59, throughput 1564.40 samples/s
[Epoch 2 Batch 2400/2983] loss 4.96, ppl 142.35, throughput 1557.69 samples/s
[Epoch 2 Batch 2600/2983] loss 4.99, ppl 146.40, throughput 1558.95 samples/s
[Epoch 2 Batch 2800/2983] loss 4.92, ppl 136.88, throughput 1562.97 samples/s
[Epoch 2] throughput 1551.95 samples/s
[Epoch 2] time cost 39.77s, valid loss 5.07, valid ppl 158.44
test loss 4.99, test ppl 147.28
Total training throughput 1443.03 samples/s

Using your own dataset

When we train a language model, we fit to the statistics of a given dataset. While many papers focus on a few standard datasets, such as WikiText or the Penn Tree Bank, that’s just to provide a standard benchmark for the purpose of comparing models against one another. In general, for any given use case, you’ll want to train your own language model using a dataset of your own choice. Here, for demonstration, we’ll grab some .txt files corresponding to Sherlock Holmes novels.

We first download the new dataset.

In [10]:
TRAIN_PATH = "./sherlockholmes.train.txt"
VALID_PATH = "./sherlockholmes.valid.txt"
TEST_PATH = "./sherlockholmes.test.txt"
PREDICT_PATH = "./tinyshakespeare/input.txt"
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.train.txt",
    TRAIN_PATH,
    sha1_hash="d65a52baaf32df613d4942e0254c81cff37da5e8")
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.valid.txt",
    VALID_PATH,
    sha1_hash="71133db736a0ff6d5f024bb64b4a0672b31fc6b3")
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.test.txt",
    TEST_PATH,
    sha1_hash="b7ccc4778fd3296c515a3c21ed79e9c2ee249f70")
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/tinyshakespeare/input.txt",
    PREDICT_PATH,
    sha1_hash="04486597058d11dcc2c556b1d0433891eb639d2e")

print(glob.glob("sherlockholmes.*.txt"))
Downloading ./sherlockholmes.train.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.train.txt...
Downloading ./sherlockholmes.valid.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.valid.txt...
Downloading ./sherlockholmes.test.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.test.txt...
Downloading ./tinyshakespeare/input.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/tinyshakespeare/input.txt...
['sherlockholmes.train.txt', 'sherlockholmes.test.txt', 'sherlockholmes.valid.txt']

Then we specify the tokenizer as well as batchify the dataset.

In [11]:
import nltk
moses_tokenizer = nlp.data.SacreMosesTokenizer()

sherlockholmes_datasets = [
    nlp.data.CorpusDataset(
        'sherlockholmes.{}.txt'.format(name),
        sample_splitter=nltk.tokenize.sent_tokenize,
        tokenizer=moses_tokenizer,
        flatten=True,
        eos='<eos>') for name in ['train', 'valid', 'test']
]

sherlockholmes_train_data, sherlockholmes_val_data, sherlockholmes_test_data = [
    bptt_batchify(dataset) for dataset in sherlockholmes_datasets
]

We setup the evaluation to see whether our previous model trained on the other dataset does well on the new dataset.

In [12]:
sherlockholmes_L = evaluate(model, sherlockholmes_val_data, batch_size,
                            context[0])
print('Best validation loss %.2f, test ppl %.2f' %
      (sherlockholmes_L, math.exp(sherlockholmes_L)))
Best validation loss 4.71, test ppl 110.73

Or we have the option of training the model on the new dataset with just one line of code.

In [13]:
train(
    model,
    sherlockholmes_train_data, # This is your input training data, we leave batchifying and tokenizing as an exercise for the reader
    sherlockholmes_val_data,
    sherlockholmes_test_data, # This would be your test data, again left as an exercise for the reader
    epochs=3,
    lr=20)
[Epoch 0] throughput 1511.17 samples/s
[Epoch 0] time cost 2.28s, valid loss 3.01, valid ppl 20.38
test loss 2.94, test ppl 18.98
[Epoch 1] throughput 1568.36 samples/s
[Epoch 1] time cost 2.20s, valid loss 2.91, valid ppl 18.40
test loss 2.88, test ppl 17.86
[Epoch 2] throughput 1514.27 samples/s
[Epoch 2] time cost 2.27s, valid loss 3.02, valid ppl 20.57
Learning rate now 5.000000
Total training throughput 1275.65 samples/s

Using a pre-trained AWD LSTM language model

AWD LSTM language model is the state-of-the-art RNN language model [1]. The main technique leveraged is to add weight-dropout on the recurrent hidden to hidden matrices to prevent overfitting on the recurrent connections.

Load the vocabulary and the pre-trained model

In [14]:
awd_model_name = 'awd_lstm_lm_1150'
awd_model, vocab = nlp.model.get_model(
    awd_model_name,
    vocab=vocab,
    dataset_name=dataset_name,
    pretrained=True,
    ctx=context[0])

print(awd_model)
print(vocab)
Vocab file is not found. Downloading.
Downloading /root/.mxnet/models/1568637698.2623506wikitext-2-be36dc52.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/wikitext-2-be36dc52.zip...
Downloading /root/.mxnet/models/awd_lstm_lm_1150_wikitext-2-f9562ed0.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/awd_lstm_lm_1150_wikitext-2-f9562ed0.zip...
AWDRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 400, float32)
    (1): Dropout(p = 0.65, axes=(0,))
  )
  (encoder): Sequential(
    (0): LSTM(400 -> 1150, TNC)
    (1): LSTM(1150 -> 1150, TNC)
    (2): LSTM(1150 -> 400, TNC)
  )
  (decoder): HybridSequential(
    (0): Dense(400 -> 33278, linear)
  )
)
Vocab(size=33278, unk="<unk>", reserved="['<eos>']")

Evaluate the pre-trained model on the validation and test datasets

In [15]:
val_L = evaluate(awd_model, val_data, batch_size, context[0])
test_L = evaluate(awd_model, test_data, batch_size, context[0])

print('Best validation loss %.2f, val ppl %.2f' % (val_L, math.exp(val_L)))
print('Best test loss %.2f, test ppl %.2f' % (test_L, math.exp(test_L)))
Best validation loss 4.23, val ppl 68.80
Best test loss 4.19, test ppl 65.73

Using a cache LSTM LM

Cache LSTM language model [2] adds a cache-like memory to neural network language models. It can be used in conjunction with the aforementioned AWD LSTM language model or other LSTM models. It exploits the hidden outputs to define a probability distribution over the words in the cache. It generates state-of-the-art results at inference time.

Load the pre-trained model and define the hyperparameters

In [16]:
window = 2
theta = 0.662
lambdas = 0.1279
bptt = 2000
cache_model = nlp.model.train.get_cache_model(name=awd_model_name,
                                             dataset_name=dataset_name,
                                             window=window,
                                             theta=theta,
                                             lambdas=lambdas,
                                             ctx=context[0])

print(cache_model)
CacheCell(
  (lm_model): AWDRNN(
    (embedding): HybridSequential(
      (0): Embedding(33278 -> 400, float32)
      (1): Dropout(p = 0.65, axes=(0,))
    )
    (encoder): Sequential(
      (0): LSTM(400 -> 1150, TNC)
      (1): LSTM(1150 -> 1150, TNC)
      (2): LSTM(1150 -> 400, TNC)
    )
    (decoder): HybridSequential(
      (0): Dense(400 -> 33278, linear)
    )
  )
)

Define specific get_batch and evaluation helper functions for the cache model

Note that these helper functions are very similar to the ones we defined above, but are slightly different.

In [17]:
val_test_batch_size = 1
val_test_batchify = nlp.data.batchify.CorpusBatchify(vocab, val_test_batch_size)
val_data = val_test_batchify(val_dataset)
test_data = val_test_batchify(test_dataset)
In [18]:
def get_batch(data_source, i, seq_len=None):
    seq_len = min(seq_len if seq_len else bptt, len(data_source) - 1 - i)
    data = data_source[i:i + seq_len]
    target = data_source[i + 1:i + 1 + seq_len]
    return data, target
In [19]:
def evaluate_cache(model, data_source, batch_size, ctx):
    total_L = 0.0
    hidden = model.begin_state(
        batch_size=batch_size, func=mx.nd.zeros, ctx=ctx)
    next_word_history = None
    cache_history = None
    for i in range(0, len(data_source) - 1, bptt):
        if i > 0:
            print('Batch %d, ppl %f' % (i, math.exp(total_L / i)))
        if i == bptt:
            return total_L / i
        data, target = get_batch(data_source, i)
        data = data.as_in_context(ctx)
        target = target.as_in_context(ctx)
        L = 0
        outs, next_word_history, cache_history, hidden = model(
            data, target, next_word_history, cache_history, hidden)
        for out in outs:
            L += (-mx.nd.log(out)).asscalar()
        total_L += L / data.shape[1]
        hidden = detach(hidden)
    return total_L / len(data_source)

Evaluate the pre-trained model on the validation and test datasets

In [20]:
val_L = evaluate_cache(cache_model, val_data, val_test_batch_size, context[0])
test_L = evaluate_cache(cache_model, test_data, val_test_batch_size, context[0])

print('Best validation loss %.2f, val ppl %.2f'%(val_L, math.exp(val_L)))
print('Best test loss %.2f, test ppl %.2f'%(test_L, math.exp(test_L)))
Batch 2000, ppl 60.767822
Batch 2000, ppl 67.390510
Best validation loss 4.11, val ppl 60.77
Best test loss 4.21, test ppl 67.39

References

[1] Merity, S., et al. “Regularizing and optimizing LSTM language models”. ICLR 2018

[2] Grave, E., et al. “Improving neural language models with a continuous cache”. ICLR 2017