LSTM-based Language Models

A statistical language model is simply a probability distribution over sequences of words or characters [1]. In this tutorial, we’ll restrict our attention to word-based language models. Given a reliable language model, we can answer questions like: which of the following strings are we more likely to encounter?

  1. “On Monday, Mr. Lamar’s ‘DAMN.’ took home an even more elusive honor, one that may never have even seemed within reach: the Pulitzer Prize.”
  2. “Frog zealot flagged xylophone the bean wallaby anaphylaxis extraneous porpoise into deleterious carrot banana apricot.”

Even if we’ve never seen either of these sentences in our entire lives, and even though no rapper has previously been awarded a Pulitzer Prize, we wouldn’t be shocked to see the first sentence in the New York Times. By comparison, we can all agree that the second sentence, consisting of incoherent babble, is far less likely. A statistical language model can assign precise probabilities to each string of words.

Given a large corpus of text, we can estimate (i.e. train) a language model \(\hat{p}(x_1, ..., x_n)\). And given such a model, we can sample strings \(\mathbf{x} \sim \hat{p}(x_1, ..., x_n)\), generating new strings according to their estimated probability. Among other useful applications, we can use language models to score candidate transcriptions from speech recognition models, giving preference to sentences that seem more probable (at the expense of those deemed anomalous).
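
Under the hood, such a model factors the joint probability of a sequence with the chain rule, predicting each word from the ones that precede it:

\[\hat{p}(x_1, \ldots, x_n) = \prod_{t=1}^{n} \hat{p}(x_t \mid x_1, \ldots, x_{t-1})\]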

These days, recurrent neural networks (RNNs) are the preferred method for language modeling. In this notebook, we will go through an example of using GluonNLP to:

  1. implement a typical LSTM language model architecture,
  2. train the language model on a corpus of real data,
  3. bring in your own dataset for training, and
  4. grab off-the-shelf pre-trained state-of-the-art language models (i.e., the AWD language model) using GluonNLP.

Language model definition - one sentence

The standard approach to language modeling consists of training a model that, given a trailing window of text, predicts the next word in the sequence. When we train the model, we feed in the inputs \(x_1, ..., x_n\) and try at each time step to predict the corresponding next word \(x_2, ..., x_{n+1}\). To generate text from a language model, we can iteratively predict the next word and then feed this word in as an input to the model at the subsequent time step, as sketched below.
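
As a minimal sketch of that generation loop (a helper we define here for illustration, not a GluonNLP API; it assumes a trained model and vocab like those built later in this notebook):

import mxnet as mx

def generate(model, vocab, prompt, num_words, ctx=mx.cpu()):
    # Greedy decoding: warm the hidden state up on the prompt, then
    # repeatedly feed the most likely next word back in as the next input.
    hidden = model.begin_state(batch_size=1, func=mx.nd.zeros, ctx=ctx)
    inputs = mx.nd.array(vocab[prompt], ctx=ctx).reshape(-1, 1)
    output, hidden = model(inputs, hidden)  # output: (len(prompt), 1, vocab_size)
    words = list(prompt)
    for _ in range(num_words):
        next_id = int(output[-1].argmax(axis=-1).asscalar())
        words.append(vocab.idx_to_token[next_id])
        inputs = mx.nd.array([next_id], ctx=ctx).reshape(1, 1)
        output, hidden = model(inputs, hidden)
    return ' '.join(words)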

Train your own language model

Now let’s step through how to train your own language model using GluonNLP.

Preparation

We’ll start by taking care of our basic dependencies and setting up our environment.

Load gluonnlp

In [1]:
import warnings
warnings.filterwarnings('ignore')

import glob
import time
import math

import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon.utils import download

import gluonnlp as nlp

Set environment

In [2]:
num_gpus = 1  # set to 0 to fall back to the CPU
context = [mx.gpu(i) for i in range(num_gpus)] if num_gpus else [mx.cpu()]
log_interval = 200  # print training statistics every `log_interval` batches

Set hyperparameters

In [3]:
batch_size = 20 * len(context)  # total batch size across all devices
lr = 20  # initial learning rate; large, but standard for SGD on this model
epochs = 3
bptt = 35  # length of the truncated BPTT window
grad_clip = 0.25  # threshold for global gradient-norm clipping

Load the dataset, extract the vocabulary, numericalize, and batchify for truncated backpropagation through time (BPTT)

In [4]:
dataset_name = 'wikitext-2'
train_dataset, val_dataset, test_dataset = [
    nlp.data.WikiText2(
        segment=segment, bos=None, eos='<eos>', skip_empty=False)
    for segment in ['train', 'val', 'test']
]

vocab = nlp.Vocab(
    nlp.data.Counter(train_dataset), padding_token=None, bos_token=None)

bptt_batchify = nlp.data.batchify.CorpusBPTTBatchify(
    vocab, bptt, batch_size, last_batch='discard')
train_data, val_data, test_data = [
    bptt_batchify(x) for x in [train_dataset, val_dataset, test_dataset]
]
Downloading /var/lib/jenkins/workspace/gluon-nlp-gpu-py3@2/tests/data/datasets/wikitext-2/wikitext-2-v1.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/wikitext-2/wikitext-2-v1.zip...
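
As a quick sanity check (our addition, not part of the original pipeline), each batch produced by the batchify above is a (data, target) pair of shape (bptt, batch_size), where the target is the data shifted one time step ahead:

for data, target in train_data:
    print(data.shape, target.shape)  # expect (35, 20) and (35, 20)
    break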

Load pre-defined language model architecture

In [5]:
model_name = 'standard_lstm_lm_200'
# dataset_name=None keeps our own vocab; the weights are initialized below.
model, vocab = nlp.model.get_model(model_name, vocab=vocab, dataset_name=None)
print(model)
print(vocab)

model.initialize(mx.init.Xavier(), ctx=context)

trainer = gluon.Trainer(model.collect_params(), 'sgd', {
    'learning_rate': lr,
    'momentum': 0,
    'wd': 0
})
loss = gluon.loss.SoftmaxCrossEntropyLoss()
StandardRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 200, float32)
    (1): Dropout(p = 0.2, axes=())
  )
  (encoder): LSTM(200 -> 200, TNC, num_layers=2, dropout=0.2)
  (decoder): HybridSequential(
    (0): Dense(200 -> 33278, linear)
  )
)
Vocab(size=33278, unk="<unk>", reserved="['<eos>']")
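
Before training, a quick forward pass confirms the shapes involved (a smoke test we add here, not part of the original recipe):

hidden = model.begin_state(batch_size=batch_size, func=mx.nd.zeros, ctx=context[0])
for data, target in train_data:
    output, hidden = model(data.as_in_context(context[0]), hidden)
    print(output.shape)  # (bptt, batch_size, vocab_size), i.e. (35, 20, 33278)
    break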

Training

Now that everything is ready, we can start training the model.

Detach gradients on states for truncated BPTT

In [6]:
def detach(hidden):
    """Detach hidden states from the graph so that gradients from the
    current BPTT window do not propagate into previous windows."""
    if isinstance(hidden, (tuple, list)):
        hidden = [detach(i) for i in hidden]
    else:
        hidden = hidden.detach()
    return hidden

Evaluation

In [7]:
def evaluate(model, data_source, batch_size, ctx):
    """Return the average per-token cross-entropy loss on `data_source`."""
    total_L = 0.0
    ntotal = 0
    hidden = model.begin_state(
        batch_size=batch_size, func=mx.nd.zeros, ctx=ctx)
    for i, (data, target) in enumerate(data_source):
        data = data.as_in_context(ctx)
        target = target.as_in_context(ctx)
        output, hidden = model(data, hidden)
        hidden = detach(hidden)
        # Merge the time and batch axes before computing the token-level loss.
        L = loss(output.reshape(-3, -1), target.reshape(-1))
        total_L += mx.nd.sum(L).asscalar()
        ntotal += L.size
    return total_L / ntotal

Training loop

Our loss function is the standard cross-entropy loss used for multiclass classification, applied at each time step to compare the model’s predictions against the true next word in the sequence. We calculate gradients with truncated backpropagation through time (BPTT): here, we backpropagate for \(35\) time steps and update the weights with stochastic gradient descent at a learning rate of \(20\), the hyperparameters we chose earlier in the notebook. The perplexity values reported below are simply \(\exp(\text{loss})\).

In [8]:
def train(model, train_data, val_data, test_data, epochs, lr):
    best_val = float("Inf")
    start_train_time = time.time()
    parameters = model.collect_params().values()
    for epoch in range(epochs):
        total_L = 0.0
        start_epoch_time = time.time()
        start_log_interval_time = time.time()
        # One hidden state per device.
        hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx)
                   for ctx in context]
        for i, (data, target) in enumerate(train_data):
            # Split each (bptt, batch_size) batch across devices along the batch axis.
            data_list = gluon.utils.split_and_load(data, context,
                                                   batch_axis=1, even_split=True)
            target_list = gluon.utils.split_and_load(target, context,
                                                     batch_axis=1, even_split=True)
            # Truncate BPTT: stop gradients at the window boundary.
            hiddens = detach(hiddens)
            L = 0
            Ls = []
            with autograd.record():
                for j, (X, y, h) in enumerate(zip(data_list, target_list, hiddens)):
                    output, h = model(X, h)
                    batch_L = loss(output.reshape(-3, -1), y.reshape(-1,))
                    # Normalize by token count (and device count) so that
                    # trainer.step(1) applies an averaged gradient.
                    L = L + batch_L.as_in_context(context[0]) / (len(context) * X.size)
                    Ls.append(batch_L / (len(context) * X.size))
                    hiddens[j] = h
            L.backward()
            # Clip the global gradient norm, gathered across all devices.
            grads = [p.grad(x.context) for p in parameters for x in data_list]
            gluon.utils.clip_global_norm(grads, grad_clip)

            trainer.step(1)

            total_L += sum([mx.nd.sum(l).asscalar() for l in Ls])

            if i % log_interval == 0 and i > 0:
                cur_L = total_L / log_interval
                print('[Epoch %d Batch %d/%d] loss %.2f, ppl %.2f, '
                      'throughput %.2f samples/s'%(
                    epoch, i, len(train_data), cur_L, math.exp(cur_L),
                    batch_size * log_interval / (time.time() - start_log_interval_time)))
                total_L = 0.0
                start_log_interval_time = time.time()

        mx.nd.waitall()

        print('[Epoch %d] throughput %.2f samples/s'%(
                    epoch, len(train_data)*batch_size / (time.time() - start_epoch_time)))
        val_L = evaluate(model, val_data, batch_size, context[0])
        print('[Epoch %d] time cost %.2fs, valid loss %.2f, valid ppl %.2f'%(
            epoch, time.time()-start_epoch_time, val_L, math.exp(val_L)))

        if val_L < best_val:
            best_val = val_L
            test_L = evaluate(model, test_data, batch_size, context[0])
            model.save_parameters('{}_{}-{}.params'.format(model_name, dataset_name, epoch))
            print('test loss %.2f, test ppl %.2f'%(test_L, math.exp(test_L)))
        else:
            # Anneal the learning rate when the validation loss fails to improve.
            lr = lr*0.25
            print('Learning rate now %f'%(lr))
            trainer.set_learning_rate(lr)

    print('Total training throughput %.2f samples/s'%(
                            (batch_size * len(train_data) * epochs) /
                            (time.time() - start_train_time)))

Train and evaluate

In [9]:
train(model, train_data, val_data, test_data, epochs, lr)
[Epoch 0 Batch 200/2983] loss 7.66, ppl 2116.74, throughput 473.69 samples/s
[Epoch 0 Batch 400/2983] loss 6.78, ppl 876.86, throughput 481.02 samples/s
[Epoch 0 Batch 600/2983] loss 6.37, ppl 581.20, throughput 484.26 samples/s
[Epoch 0 Batch 800/2983] loss 6.19, ppl 489.33, throughput 472.00 samples/s
[Epoch 0 Batch 1000/2983] loss 6.05, ppl 426.12, throughput 295.67 samples/s
[Epoch 0 Batch 1200/2983] loss 5.97, ppl 390.35, throughput 472.18 samples/s
[Epoch 0 Batch 1400/2983] loss 5.86, ppl 352.17, throughput 487.57 samples/s
[Epoch 0 Batch 1600/2983] loss 5.87, ppl 352.60, throughput 471.31 samples/s
[Epoch 0 Batch 1800/2983] loss 5.72, ppl 304.93, throughput 440.76 samples/s
[Epoch 0 Batch 2000/2983] loss 5.68, ppl 294.30, throughput 282.83 samples/s
[Epoch 0 Batch 2200/2983] loss 5.58, ppl 264.72, throughput 470.13 samples/s
[Epoch 0 Batch 2400/2983] loss 5.59, ppl 267.97, throughput 287.70 samples/s
[Epoch 0 Batch 2600/2983] loss 5.58, ppl 264.02, throughput 288.43 samples/s
[Epoch 0 Batch 2800/2983] loss 5.46, ppl 235.90, throughput 459.25 samples/s
[Epoch 0] throughput 403.16 samples/s
[Epoch 0] time cost 166.48s, valid loss 5.43, valid ppl 227.08
test loss 5.33, test ppl 206.87
[Epoch 1 Batch 200/2983] loss 5.48, ppl 240.74, throughput 494.81 samples/s
[Epoch 1 Batch 400/2983] loss 5.46, ppl 235.02, throughput 468.18 samples/s
[Epoch 1 Batch 600/2983] loss 5.30, ppl 200.56, throughput 478.01 samples/s
[Epoch 1 Batch 800/2983] loss 5.31, ppl 202.93, throughput 460.68 samples/s
[Epoch 1 Batch 1000/2983] loss 5.28, ppl 196.90, throughput 212.99 samples/s
[Epoch 1 Batch 1200/2983] loss 5.27, ppl 194.23, throughput 478.18 samples/s
[Epoch 1 Batch 1400/2983] loss 5.27, ppl 194.27, throughput 490.79 samples/s
[Epoch 1 Batch 1600/2983] loss 5.33, ppl 206.94, throughput 317.55 samples/s
[Epoch 1 Batch 1800/2983] loss 5.21, ppl 182.46, throughput 481.55 samples/s
[Epoch 1 Batch 2000/2983] loss 5.22, ppl 184.75, throughput 490.92 samples/s
[Epoch 1 Batch 2200/2983] loss 5.12, ppl 167.51, throughput 471.95 samples/s
[Epoch 1 Batch 2400/2983] loss 5.16, ppl 173.75, throughput 472.31 samples/s
[Epoch 1 Batch 2600/2983] loss 5.18, ppl 176.93, throughput 485.88 samples/s
[Epoch 1 Batch 2800/2983] loss 5.09, ppl 162.14, throughput 488.83 samples/s
[Epoch 1] throughput 428.65 samples/s
[Epoch 1] time cost 152.19s, valid loss 5.18, valid ppl 178.17
test loss 5.11, test ppl 164.87
[Epoch 2 Batch 200/2983] loss 5.15, ppl 172.26, throughput 312.66 samples/s
[Epoch 2 Batch 400/2983] loss 5.16, ppl 174.83, throughput 308.01 samples/s
[Epoch 2 Batch 600/2983] loss 4.99, ppl 146.82, throughput 475.04 samples/s
[Epoch 2 Batch 800/2983] loss 5.04, ppl 153.94, throughput 451.37 samples/s
[Epoch 2 Batch 1000/2983] loss 5.02, ppl 151.97, throughput 452.73 samples/s
[Epoch 2 Batch 1200/2983] loss 5.02, ppl 151.43, throughput 465.83 samples/s
[Epoch 2 Batch 1400/2983] loss 5.05, ppl 155.47, throughput 315.01 samples/s
[Epoch 2 Batch 1600/2983] loss 5.12, ppl 166.56, throughput 465.08 samples/s
[Epoch 2 Batch 1800/2983] loss 4.99, ppl 147.42, throughput 465.32 samples/s
[Epoch 2 Batch 2000/2983] loss 5.02, ppl 152.01, throughput 474.05 samples/s
[Epoch 2 Batch 2200/2983] loss 4.93, ppl 138.07, throughput 316.31 samples/s
[Epoch 2 Batch 2400/2983] loss 4.96, ppl 142.83, throughput 464.09 samples/s
[Epoch 2 Batch 2600/2983] loss 4.99, ppl 146.75, throughput 452.46 samples/s
[Epoch 2 Batch 2800/2983] loss 4.92, ppl 136.91, throughput 475.76 samples/s
[Epoch 2] throughput 411.21 samples/s
[Epoch 2] time cost 158.51s, valid loss 5.07, valid ppl 159.43
test loss 5.00, test ppl 148.53
Total training throughput 342.15 samples/s

Use your own dataset

When we train a language model, we fit it to the statistics of a given dataset. While many papers focus on a few standard datasets, such as WikiText or the Penn Treebank, that’s mainly to provide standard benchmarks for comparing models against each other. In general, for any given use case, you’ll want to train your own language model on a dataset of your own choice. Here, for demonstration, we’ll grab some .txt files corresponding to Sherlock Holmes novels.

In [10]:
TRAIN_PATH = "./sherlockholmes.train.txt"
VALID_PATH = "./sherlockholmes.valid.txt"
TEST_PATH = "./sherlockholmes.test.txt"
PREDICT_PATH = "./tinyshakespeare/input.txt"
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.train.txt",
    TRAIN_PATH,
    sha1_hash="d65a52baaf32df613d4942e0254c81cff37da5e8")
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.valid.txt",
    VALID_PATH,
    sha1_hash="71133db736a0ff6d5f024bb64b4a0672b31fc6b3")
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.test.txt",
    TEST_PATH,
    sha1_hash="b7ccc4778fd3296c515a3c21ed79e9c2ee249f70")
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/tinyshakespeare/input.txt",
    PREDICT_PATH,
    sha1_hash="04486597058d11dcc2c556b1d0433891eb639d2e")
sherlockholmes_dataset = glob.glob("sherlockholmes.*.txt")
print(sherlockholmes_dataset)
Downloading ./sherlockholmes.train.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.train.txt...
Downloading ./sherlockholmes.valid.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.valid.txt...
Downloading ./sherlockholmes.test.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.test.txt...
Downloading ./tinyshakespeare/input.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/tinyshakespeare/input.txt...
['sherlockholmes.train.txt', 'sherlockholmes.test.txt', 'sherlockholmes.valid.txt']
In [11]:
import nltk
# nltk's sent_tokenize relies on the punkt models; if they are missing,
# run nltk.download('punkt') once beforehand.
moses_tokenizer = nlp.data.SacreMosesTokenizer()

sherlockholmes_val = nlp.data.CorpusDataset(
    'sherlockholmes.valid.txt',
    sample_splitter=nltk.tokenize.sent_tokenize,
    tokenizer=moses_tokenizer,
    flatten=True,
    eos='<eos>')

sherlockholmes_val_data = bptt_batchify(sherlockholmes_val)
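
The same recipe extends to the other splits. A hedged sketch (the file names follow the downloads above; adjust them for your own corpus):

sherlockholmes_train, sherlockholmes_test = [
    nlp.data.CorpusDataset(
        'sherlockholmes.{}.txt'.format(segment),
        sample_splitter=nltk.tokenize.sent_tokenize,
        tokenizer=moses_tokenizer,
        flatten=True,
        eos='<eos>')
    for segment in ['train', 'test']
]
sherlockholmes_train_data = bptt_batchify(sherlockholmes_train)
sherlockholmes_test_data = bptt_batchify(sherlockholmes_test)

Note that the demonstration below reuses the validation split for all three data arguments of train; with the splits above you could pass proper train/validation/test data instead.
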
In [12]:
sherlockholmes_L = evaluate(model, sherlockholmes_val_data, batch_size,
                            context[0])
print('Best validation loss %.2f, test ppl %.2f' %
      (sherlockholmes_L, math.exp(sherlockholmes_L)))
Best validation loss 4.76, test ppl 116.73
In [13]:
train(
    model,
    sherlockholmes_val_data,
    sherlockholmes_val_data,
    sherlockholmes_val_data,
    epochs=3,
    lr=20)
[Epoch 0] throughput 245.08 samples/s
[Epoch 0] time cost 5.00s, valid loss 3.52, valid ppl 33.76
test loss 3.52, test ppl 33.76
[Epoch 1] throughput 271.07 samples/s
[Epoch 1] time cost 4.95s, valid loss 3.20, valid ppl 24.43
test loss 3.20, test ppl 24.43
[Epoch 2] throughput 243.77 samples/s
[Epoch 2] time cost 5.10s, valid loss 2.82, valid ppl 16.73
test loss 2.82, test ppl 16.73
Total training throughput 96.12 samples/s

Use pre-trained AWD LSTM language model

The AWD LSTM language model is the state-of-the-art RNN language model [1]. Its main technique is weight dropout, applied to the recurrent hidden-to-hidden weight matrices to prevent overfitting on the recurrent connections.
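
As a minimal sketch of the weight-drop idea (DropConnect on the recurrent weights; an illustration under our own simplifications, not GluonNLP's actual implementation):

def weight_drop(weight, p=0.5):
    # Zero entries of the hidden-to-hidden weight matrix with probability p
    # and rescale the survivors: standard dropout applied to weights
    # rather than to activations.
    mask = mx.nd.random.uniform(shape=weight.shape) > p
    return weight * mask / (1 - p)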

Load vocabulary and pre-trained model

In [14]:
awd_model_name = 'awd_lstm_lm_1150'
awd_model, vocab = nlp.model.get_model(
    awd_model_name,
    vocab=vocab,
    dataset_name=dataset_name,
    pretrained=True,
    ctx=context[0])
print(awd_model)
print(vocab)
AWDRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 400, float32)
    (1): Dropout(p = 0.65, axes=(0,))
  )
  (encoder): Sequential(
    (0): LSTM(400 -> 1150, TNC)
    (1): LSTM(1150 -> 1150, TNC)
    (2): LSTM(1150 -> 400, TNC)
  )
  (decoder): HybridSequential(
    (0): Dense(400 -> 33278, linear)
  )
)
Vocab(size=33278, unk="<unk>", reserved="['<eos>']")

Evaluate the pre-trained model on val and test datasets

In [15]:
val_L = evaluate(awd_model, val_data, batch_size, context[0])
test_L = evaluate(awd_model, test_data, batch_size, context[0])
print('Best validation loss %.2f, val ppl %.2f' % (val_L, math.exp(val_L)))
print('Best test loss %.2f, test ppl %.2f' % (test_L, math.exp(test_L)))
Best validation loss 4.23, val ppl 68.80
Best test loss 4.19, test ppl 65.73

Use Cache LSTM language model

The Cache LSTM language model [2] adds a cache-like memory to neural network language models such as the AWD LSTM language model above. It exploits the hidden outputs to define a probability distribution over the words in the cache, and it achieves state-of-the-art results at inference time.
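
Concretely, following [2], the next-word distribution is a linear interpolation between the usual softmax and a cache distribution over recently seen words:

\[p(w \mid x_{1:t}) = (1 - \lambda)\, p_{\mathrm{vocab}}(w \mid x_{1:t}) + \lambda\, p_{\mathrm{cache}}(w \mid x_{1:t}), \qquad p_{\mathrm{cache}}(w) \propto \sum_{i=1}^{t-1} \mathbb{1}\{x_i = w\}\, \exp(\theta\, h_t^\top h_i)\]

Here \(\theta\) controls how peaked the cache distribution is and \(\lambda\) is the interpolation weight; they correspond to the theta and lambdas hyperparameters below, while window bounds how many recent hidden states the cache retains.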

Load pre-trained model and define hyperparameters

In [16]:
window = 2  # number of recent hidden states the cache retains
theta = 0.662  # flatness of the cache distribution
lambdas = 0.1279  # interpolation weight between cache and softmax distributions
bptt = 2000
cache_model = nlp.model.train.get_cache_model(name=awd_model_name,
                                              dataset_name=dataset_name,
                                              window=window,
                                              theta=theta,
                                              lambdas=lambdas,
                                              ctx=context[0])
print(cache_model)
CacheCell(
  (lm_model): AWDRNN(
    (embedding): HybridSequential(
      (0): Embedding(33278 -> 400, float32)
      (1): Dropout(p = 0.65, axes=(0,))
    )
    (encoder): Sequential(
      (0): LSTM(400 -> 1150, TNC)
      (1): LSTM(1150 -> 1150, TNC)
      (2): LSTM(1150 -> 400, TNC)
    )
    (decoder): HybridSequential(
      (0): Dense(400 -> 33278, linear)
    )
  )
)

Define specific get_batch and evaluation for cache model

In [17]:
val_test_batch_size = 1
val_test_batchify = nlp.data.batchify.CorpusBatchify(vocab, val_test_batch_size)
val_data = val_test_batchify(val_dataset)
test_data = val_test_batchify(test_dataset)
In [18]:
def get_batch(data_source, i, seq_len=None):
    # Slice out a (source, target) pair starting at position i; the target
    # is the source shifted one token ahead, up to `bptt` tokens long.
    seq_len = min(seq_len if seq_len else bptt, len(data_source) - 1 - i)
    data = data_source[i:i + seq_len]
    target = data_source[i + 1:i + 1 + seq_len]
    return data, target
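
As a quick usage check (our addition): the first chunk of the validation stream and its one-step-shifted target have matching shapes.

data, target = get_batch(val_data, 0)
print(data.shape, target.shape)  # both (bptt, 1), i.e. (2000, 1)
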
In [19]:
def evaluate_cache(model, data_source, batch_size, ctx):
    total_L = 0.0
    hidden = model.begin_state(
        batch_size=batch_size, func=mx.nd.zeros, ctx=ctx)
    next_word_history = None
    cache_history = None
    for i in range(0, len(data_source) - 1, bptt):
        if i > 0:
            print('Batch %d, ppl %f' % (i, math.exp(total_L / i)))
        if i == bptt:
            # Stop after the first bptt tokens to keep this demonstration fast.
            return total_L / i
        data, target = get_batch(data_source, i)
        data = data.as_in_context(ctx)
        target = target.as_in_context(ctx)
        L = 0
        outs, next_word_history, cache_history, hidden = model(
            data, target, next_word_history, cache_history, hidden)
        for out in outs:
            L += (-mx.nd.log(out)).asscalar()
        total_L += L / data.shape[1]
        hidden = detach(hidden)
    return total_L / len(data_source)

Evaluate the pre-trained model on val and test datasets

In [20]:
val_L = evaluate_cache(cache_model, val_data, val_test_batch_size, context[0])
test_L = evaluate_cache(cache_model, test_data, val_test_batch_size, context[0])
print('Best validation loss %.2f, val ppl %.2f'%(val_L, math.exp(val_L)))
print('Best test loss %.2f, test ppl %.2f'%(test_L, math.exp(test_L)))
Batch 2000, ppl 60.767823
Batch 2000, ppl 67.390510
Best validation loss 4.11, val ppl 60.77
Best test loss 4.21, test ppl 67.39

Reference

[1] Merity, S., et al. “Regularizing and optimizing LSTM language models”. ICLR 2018

[2] Grave, E., et al. “Improving neural language models with a continuous cache”. ICLR 2017