
Language Modeling

A statistical language model is simply a probability distribution over sequences of words or characters [1]. In this tutorial, we’ll restrict our attention to word-based language models. Given a reliable language model, we can answer questions like: which of the following strings are we more likely to encounter?

  1. “On Monday, Mr. Lamar’s ‘DAMN.’ took home an even more elusive honor, one that may never have even seemed within reach: the Pulitzer Prize.”
  2. “Frog zealot flagged xylophone the bean wallaby anaphylaxis extraneous porpoise into deleterious carrot banana apricot.”

Even if we’ve never seen either of these sentences in our entire lives, and even though no rapper has previously been awarded a Pulitzer Prize, we wouldn’t be shocked to see the first sentence in the New York Times. In contrast, we can all agree that the second sentence, consisting of incoherent babble, is far less likely. A statistical language model can assign precise probabilities to each of these strings.

Given a large corpus of text, we can estimate (i.e., train) a language model \(\hat{p}(x_1, ..., x_n)\). And given such a model, we can sample strings \(\mathbf{x} \sim \hat{p}(x_1, ..., x_n)\), generating new strings according to their estimated probability. Among other useful applications, we can use language models to score candidate transcriptions from speech recognition models, giving preference to sentences that seem more probable (at the expense of those deemed anomalous).
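
Such a model factorizes via the chain rule of probability, \(\hat{p}(x_1, ..., x_n) = \prod_{t=1}^{n} \hat{p}(x_t \mid x_1, ..., x_{t-1})\), so estimating the joint distribution reduces to modeling the conditional probability of each word given the words that precede it.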

These days, recurrent neural networks (RNNs) are the preferred method for language modeling. In this notebook, we will go through an example of using GluonNLP to (i) implement a typical LSTM language model architecture, (ii) train the language model on a corpus of real data, (iii) bring in your own dataset for training, and (iv) grab an off-the-shelf pre-trained state-of-the-art language model (i.e., the AWD language model) using GluonNLP.

Language model definition - one sentence

The standard approach to language modeling consists of training a model that, given a trailing window of text, predicts the next word in the sequence. When we train the model, we feed in the inputs \(x_1, x_2, \ldots\) and try at each time step to predict the corresponding next words \(x_2, \ldots, x_{n+1}\). To generate text from a language model, we can iteratively predict the next word and then feed this word as the input to the model at the subsequent time step.
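
To make this concrete, the following is a minimal greedy-generation sketch and is not part of the tutorial’s code: it assumes the trained model and vocab objects built later in this notebook and a single context ctx, feeds one word at a time, and appends the model’s most probable prediction. A real sampler would draw from the softmax distribution instead of taking the argmax.

def generate(model, vocab, prefix, num_words, ctx):
    # Start from a fresh hidden state for a single sequence.
    hidden = model.begin_state(batch_size=1, func=mx.nd.zeros, ctx=ctx)
    tokens = prefix.split()
    # Warm up the hidden state on all but the last prefix word.
    for token in tokens[:-1]:
        inp = mx.nd.array([vocab[token]], ctx=ctx).reshape(1, 1)
        _, hidden = model(inp, hidden)
    for _ in range(num_words):
        # Feed the most recent word; output has shape (seq_len=1, batch=1, vocab_size).
        inp = mx.nd.array([vocab[tokens[-1]]], ctx=ctx).reshape(1, 1)
        output, hidden = model(inp, hidden)
        next_idx = int(output[0, 0].argmax(axis=0).asscalar())  # greedy choice
        tokens.append(vocab.idx_to_token[next_idx])
    return ' '.join(tokens)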

Train your own language model

Now let’s step through how to train your own language model using GluonNLP.

Preparation

We’ll start by taking care of our basic dependencies and setting up our environment.

Load gluonnlp

In [1]:
import warnings
warnings.filterwarnings('ignore')

import glob
import time
import math

import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon.utils import download

import gluonnlp as nlp

Set environment

In [2]:
num_gpus = 1
context = [mx.gpu(i) for i in range(num_gpus)] if num_gpus else [mx.cpu()]
log_interval = 200

Set hyperparameters

In [3]:
batch_size = 20 * len(context)
lr = 20
epochs = 3
bptt = 35
grad_clip = 0.25

Load dataset, extract vocabulary, numericalize, and batchify for truncated BPTT

In [4]:
dataset_name = 'wikitext-2'
train_dataset, val_dataset, test_dataset = [
    nlp.data.WikiText2(
        segment=segment, bos=None, eos='<eos>', skip_empty=False)
    for segment in ['train', 'val', 'test']
]

vocab = nlp.Vocab(
    nlp.data.Counter(train_dataset), padding_token=None, bos_token=None)

bptt_batchify = nlp.data.batchify.CorpusBPTTBatchify(
    vocab, bptt, batch_size, last_batch='discard')
train_data, val_data, test_data = [
    bptt_batchify(x) for x in [train_dataset, val_dataset, test_dataset]
]
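
As a quick, optional sanity check, each element produced by the BPTT batchify above is a (data, target) pair of shape (bptt, batch_size), with the target shifted one time step ahead of the data:

for sample_data, sample_target in train_data:
    # With the settings above we expect (35, 20) for both shapes.
    print(sample_data.shape, sample_target.shape)
    break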

Load pre-defined language model architecture

In [5]:
model_name = 'standard_lstm_lm_200'
model, vocab = nlp.model.get_model(model_name, vocab=vocab, dataset_name=None)
print(model)
print(vocab)

model.initialize(mx.init.Xavier(), ctx=context)

trainer = gluon.Trainer(model.collect_params(), 'sgd', {
    'learning_rate': lr,
    'momentum': 0,
    'wd': 0
})
loss = gluon.loss.SoftmaxCrossEntropyLoss()
StandardRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 200, float32)
    (1): Dropout(p = 0.2, axes=())
  )
  (encoder): LSTM(200 -> 200, TNC, num_layers=2, dropout=0.2)
  (decoder): HybridSequential(
    (0): Dense(200 -> 33278, linear)
  )
)
Vocab(size=33278, unk="<unk>", reserved="['<eos>']")

Training

Now that everything is ready, we can start training the model.

Detach gradients on states for truncated BPTT

In [6]:
def detach(hidden):
    if isinstance(hidden, (tuple, list)):
        hidden = [detach(i) for i in hidden]
    else:
        hidden = hidden.detach()
    return hidden

Evaluation

In [7]:
def evaluate(model, data_source, batch_size, ctx):
    total_L = 0.0
    ntotal = 0
    hidden = model.begin_state(
        batch_size=batch_size, func=mx.nd.zeros, ctx=ctx)
    for i, (data, target) in enumerate(data_source):
        data = data.as_in_context(ctx)
        target = target.as_in_context(ctx)
        output, hidden = model(data, hidden)
        hidden = detach(hidden)
        L = loss(output.reshape(-3, -1), target.reshape(-1))
        total_L += mx.nd.sum(L).asscalar()
        ntotal += L.size
    return total_L / ntotal

Training loop

Our loss function will be the standard cross-entropy loss used for multiclass classification, applied at each time step to compare the model’s predictions to the true next word in the sequence. We calculate gradients with respect to our parameters using truncated backpropagation through time (BPTT). In this case, we backpropagate for \(35\) time steps and update our weights with stochastic gradient descent using a learning rate of \(20\), the hyperparameters we chose earlier in the notebook.
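
The perplexity reported alongside the loss is simply the exponential of the average per-word cross-entropy, \(\mathrm{ppl} = \exp(L)\), which is why the code below computes it with math.exp.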

In [8]:
def train(model, train_data, val_data, test_data, epochs, lr):
    best_val = float("Inf")
    start_train_time = time.time()
    parameters = model.collect_params().values()
    for epoch in range(epochs):
        total_L = 0.0
        start_epoch_time = time.time()
        start_log_interval_time = time.time()
        hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx)
                   for ctx in context]
        for i, (data, target) in enumerate(train_data):
            data_list = gluon.utils.split_and_load(data, context,
                                                   batch_axis=1, even_split=True)
            target_list = gluon.utils.split_and_load(target, context,
                                                     batch_axis=1, even_split=True)
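            # Detach the hidden states carried over from the previous batch so that
            # gradients are truncated to the current bptt-length segment.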
            hiddens = detach(hiddens)
            L = 0
            Ls = []
            with autograd.record():
                for j, (X, y, h) in enumerate(zip(data_list, target_list, hiddens)):
                    output, h = model(X, h)
                    batch_L = loss(output.reshape(-3, -1), y.reshape(-1,))
                    L = L + batch_L.as_in_context(context[0]) / (len(context) * X.size)
                    Ls.append(batch_L / (len(context) * X.size))
                    hiddens[j] = h
            L.backward()
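            # Rescale all gradients if their global norm exceeds grad_clip, to stabilize training.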
            grads = [p.grad(x.context) for p in parameters for x in data_list]
            gluon.utils.clip_global_norm(grads, grad_clip)

            trainer.step(1)

            total_L += sum([mx.nd.sum(l).asscalar() for l in Ls])

            if i % log_interval == 0 and i > 0:
                cur_L = total_L / log_interval
                print('[Epoch %d Batch %d/%d] loss %.2f, ppl %.2f, '
                      'throughput %.2f samples/s'%(
                    epoch, i, len(train_data), cur_L, math.exp(cur_L),
                    batch_size * log_interval / (time.time() - start_log_interval_time)))
                total_L = 0.0
                start_log_interval_time = time.time()

        mx.nd.waitall()

        print('[Epoch %d] throughput %.2f samples/s'%(
                    epoch, len(train_data)*batch_size / (time.time() - start_epoch_time)))
        val_L = evaluate(model, val_data, batch_size, context[0])
        print('[Epoch %d] time cost %.2fs, valid loss %.2f, valid ppl %.2f'%(
            epoch, time.time()-start_epoch_time, val_L, math.exp(val_L)))

        if val_L < best_val:
            best_val = val_L
            test_L = evaluate(model, test_data, batch_size, context[0])
            model.save_parameters('{}_{}-{}.params'.format(model_name, dataset_name, epoch))
            print('test loss %.2f, test ppl %.2f'%(test_L, math.exp(test_L)))
        else:
            lr = lr*0.25
            print('Learning rate now %f'%(lr))
            trainer.set_learning_rate(lr)

    print('Total training throughput %.2f samples/s'%(
                            (batch_size * len(train_data) * epochs) /
                            (time.time() - start_train_time)))

Train and evaluate

In [9]:
train(model, train_data, val_data, test_data, epochs, lr)
[Epoch 0 Batch 200/2983] loss 7.65, ppl 2092.05, throughput 406.32 samples/s
[Epoch 0 Batch 400/2983] loss 6.77, ppl 868.06, throughput 455.34 samples/s
[Epoch 0 Batch 600/2983] loss 6.35, ppl 574.50, throughput 453.26 samples/s
[Epoch 0 Batch 800/2983] loss 6.18, ppl 484.26, throughput 442.66 samples/s
[Epoch 0 Batch 1000/2983] loss 6.05, ppl 424.09, throughput 465.71 samples/s
[Epoch 0 Batch 1200/2983] loss 5.97, ppl 391.27, throughput 443.73 samples/s
[Epoch 0 Batch 1400/2983] loss 5.86, ppl 352.38, throughput 448.44 samples/s
[Epoch 0 Batch 1600/2983] loss 5.87, ppl 353.18, throughput 461.51 samples/s
[Epoch 0 Batch 1800/2983] loss 5.71, ppl 303.17, throughput 435.81 samples/s
[Epoch 0 Batch 2000/2983] loss 5.68, ppl 294.33, throughput 461.47 samples/s
[Epoch 0 Batch 2200/2983] loss 5.57, ppl 262.26, throughput 461.88 samples/s
[Epoch 0 Batch 2400/2983] loss 5.58, ppl 265.06, throughput 437.62 samples/s
[Epoch 0 Batch 2600/2983] loss 5.57, ppl 262.18, throughput 436.76 samples/s
[Epoch 0 Batch 2800/2983] loss 5.46, ppl 235.60, throughput 445.21 samples/s
[Epoch 0] throughput 430.88 samples/s
[Epoch 0] time cost 152.98s, valid loss 5.42, valid ppl 226.34
test loss 5.34, test ppl 207.81
[Epoch 1 Batch 200/2983] loss 5.47, ppl 237.44, throughput 438.16 samples/s
[Epoch 1 Batch 400/2983] loss 5.45, ppl 233.24, throughput 463.20 samples/s
[Epoch 1 Batch 600/2983] loss 5.29, ppl 198.16, throughput 463.65 samples/s
[Epoch 1 Batch 800/2983] loss 5.31, ppl 201.69, throughput 440.04 samples/s
[Epoch 1 Batch 1000/2983] loss 5.28, ppl 195.54, throughput 465.51 samples/s
[Epoch 1 Batch 1200/2983] loss 5.27, ppl 194.81, throughput 457.99 samples/s
[Epoch 1 Batch 1400/2983] loss 5.26, ppl 192.87, throughput 447.62 samples/s
[Epoch 1 Batch 1600/2983] loss 5.33, ppl 206.20, throughput 461.00 samples/s
[Epoch 1 Batch 1800/2983] loss 5.20, ppl 181.51, throughput 450.60 samples/s
[Epoch 1 Batch 2000/2983] loss 5.22, ppl 184.40, throughput 442.42 samples/s
[Epoch 1 Batch 2200/2983] loss 5.12, ppl 167.19, throughput 451.90 samples/s
[Epoch 1 Batch 2400/2983] loss 5.14, ppl 171.00, throughput 453.57 samples/s
[Epoch 1 Batch 2600/2983] loss 5.16, ppl 175.02, throughput 465.34 samples/s
[Epoch 1 Batch 2800/2983] loss 5.08, ppl 161.46, throughput 441.33 samples/s
[Epoch 1] throughput 438.88 samples/s
[Epoch 1] time cost 150.07s, valid loss 5.17, valid ppl 175.38
test loss 5.09, test ppl 162.49
[Epoch 2 Batch 200/2983] loss 5.14, ppl 170.54, throughput 450.55 samples/s
[Epoch 2 Batch 400/2983] loss 5.15, ppl 173.01, throughput 451.59 samples/s
[Epoch 2 Batch 600/2983] loss 4.98, ppl 145.53, throughput 464.55 samples/s
[Epoch 2 Batch 800/2983] loss 5.03, ppl 152.89, throughput 463.36 samples/s
[Epoch 2 Batch 1000/2983] loss 5.02, ppl 151.35, throughput 457.40 samples/s
[Epoch 2 Batch 1200/2983] loss 5.02, ppl 150.75, throughput 453.92 samples/s
[Epoch 2 Batch 1400/2983] loss 5.04, ppl 154.63, throughput 459.28 samples/s
[Epoch 2 Batch 1600/2983] loss 5.12, ppl 167.17, throughput 445.95 samples/s
[Epoch 2 Batch 1800/2983] loss 4.99, ppl 146.89, throughput 438.52 samples/s
[Epoch 2 Batch 2000/2983] loss 5.02, ppl 151.49, throughput 455.41 samples/s
[Epoch 2 Batch 2200/2983] loss 4.93, ppl 137.92, throughput 449.28 samples/s
[Epoch 2 Batch 2400/2983] loss 4.95, ppl 141.49, throughput 463.58 samples/s
[Epoch 2 Batch 2600/2983] loss 4.98, ppl 145.40, throughput 440.56 samples/s
[Epoch 2 Batch 2800/2983] loss 4.91, ppl 135.64, throughput 452.87 samples/s
[Epoch 2] throughput 441.13 samples/s
[Epoch 2] time cost 149.42s, valid loss 5.07, valid ppl 159.14
test loss 5.01, test ppl 149.26
Total training throughput 358.38 samples/s

Use your own dataset

When we train a language model, we fit to the statistics of a given dataset. While many papers focus on a few standard datasets, such as WikiText or the Penn Tree Bank, that’s just to provide a standard benchmark for the purpose of comparing models against each other. In general, for any given use case, you’ll want to train your own language model using a dataset of your choosing. Here, for demonstration, we’ll grab some .txt files corresponding to Sherlock Holmes novels.

In [10]:
TRAIN_PATH = "./sherlockholmes.train.txt"
VALID_PATH = "./sherlockholmes.valid.txt"
TEST_PATH = "./sherlockholmes.test.txt"
PREDICT_PATH = "./tinyshakespeare/input.txt"
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.train.txt",
    TRAIN_PATH,
    sha1_hash="d65a52baaf32df613d4942e0254c81cff37da5e8")
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.valid.txt",
    VALID_PATH,
    sha1_hash="71133db736a0ff6d5f024bb64b4a0672b31fc6b3")
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.test.txt",
    TEST_PATH,
    sha1_hash="b7ccc4778fd3296c515a3c21ed79e9c2ee249f70")
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/tinyshakespeare/input.txt",
    PREDICT_PATH,
    sha1_hash="04486597058d11dcc2c556b1d0433891eb639d2e")
sherlockholmes_dataset = glob.glob("sherlockholmes.*.txt")
print(sherlockholmes_dataset)
Downloading ./sherlockholmes.train.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.train.txt...
Downloading ./sherlockholmes.valid.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.valid.txt...
Downloading ./sherlockholmes.test.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.test.txt...
Downloading ./tinyshakespeare/input.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/tinyshakespeare/input.txt...
['sherlockholmes.train.txt', 'sherlockholmes.test.txt', 'sherlockholmes.valid.txt']
In [11]:
import nltk
moses_tokenizer = nlp.data.SacreMosesTokenizer()

sherlockholmes_val = nlp.data.CorpusDataset(
    'sherlockholmes.valid.txt',
    sample_splitter=nltk.tokenize.sent_tokenize,
    tokenizer=moses_tokenizer,
    flatten=True,
    eos='<eos>')

sherlockholmes_val_data = bptt_batchify(sherlockholmes_val)
In [12]:
sherlockholmes_L = evaluate(model, sherlockholmes_val_data, batch_size,
                            context[0])
print('Best validation loss %.2f, val ppl %.2f' %
      (sherlockholmes_L, math.exp(sherlockholmes_L)))
Best validation loss 4.85, val ppl 128.32
In [13]:
train(
    model,
    sherlockholmes_val_data,
    sherlockholmes_val_data,
    sherlockholmes_val_data,
    epochs=3,
    lr=20)
[Epoch 0] throughput 255.21 samples/s
[Epoch 0] time cost 4.97s, valid loss 3.57, valid ppl 35.67
test loss 3.57, test ppl 35.67
[Epoch 1] throughput 258.54 samples/s
[Epoch 1] time cost 4.82s, valid loss 3.28, valid ppl 26.49
test loss 3.28, test ppl 26.49
[Epoch 2] throughput 568.23 samples/s
[Epoch 2] time cost 3.10s, valid loss 3.07, valid ppl 21.52
test loss 3.07, test ppl 21.52
Total training throughput 126.97 samples/s

Use pre-trained AWD LSTM language model

The AWD LSTM language model is a state-of-the-art RNN language model [1]. Its main technique is weight dropout (DropConnect) applied to the recurrent hidden-to-hidden weight matrices, which prevents overfitting on the recurrent connections.
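
As a rough illustration, and not GluonNLP’s actual implementation, weight dropout zeroes out entries of the recurrent weight matrix itself, sampling one mask per forward pass, rather than dropping activations. A hypothetical helper might look like this:

def weight_drop(w_hh, drop_rate):
    # Sample a binary mask over the recurrent weight matrix and rescale so
    # the expected magnitude of the weights is unchanged.
    mask = mx.nd.random.uniform(shape=w_hh.shape) > drop_rate
    return w_hh * mask / (1 - drop_rate)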

Load vocabulary and pre-trained model

In [14]:
awd_model_name = 'awd_lstm_lm_1150'
awd_model, vocab = nlp.model.get_model(
    awd_model_name,
    vocab=vocab,
    dataset_name=dataset_name,
    pretrained=True,
    ctx=context[0])
print(awd_model)
print(vocab)
AWDRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 400, float32)
    (1): Dropout(p = 0.65, axes=(0,))
  )
  (encoder): Sequential(
    (0): LSTM(400 -> 1150, TNC)
    (1): LSTM(1150 -> 1150, TNC)
    (2): LSTM(1150 -> 400, TNC)
  )
  (decoder): HybridSequential(
    (0): Dense(400 -> 33278, linear)
  )
)
Vocab(size=33278, unk="<unk>", reserved="['<eos>']")

Evaluate the pre-trained model on val and test datasets

In [15]:
val_L = evaluate(awd_model, val_data, batch_size, context[0])
test_L = evaluate(awd_model, test_data, batch_size, context[0])
print('Best validation loss %.2f, val ppl %.2f' % (val_L, math.exp(val_L)))
print('Best test loss %.2f, test ppl %.2f' % (test_L, math.exp(test_L)))
Best validation loss 4.23, val ppl 68.80
Best test loss 4.19, test ppl 65.73

Use Cache LSTM language model

The cache LSTM language model [2] adds a cache-like memory to neural network language models such as the AWD LSTM language model. It exploits the hidden outputs to define a probability distribution over the words in the cache, and it achieves state-of-the-art results purely at inference time, without any additional training.
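
As a rough sketch of the idea in [2], and not GluonNLP’s implementation, the cache stores recent hidden states together with the words that followed them, scores each cache entry against the current hidden state, and linearly interpolates the resulting cache distribution with the model’s standard softmax. The hypothetical numpy helper below illustrates the computation:

import numpy as np

def cache_probs(h_t, cached_hiddens, cached_words, p_vocab, theta, lam):
    # Similarity between the current hidden state and every cached hidden state.
    scores = np.exp(theta * cached_hiddens.dot(h_t))   # shape: (cache_size,)
    p_cache = np.zeros_like(p_vocab)
    np.add.at(p_cache, cached_words, scores)           # accumulate scores per word
    p_cache /= p_cache.sum()
    # Linear interpolation between the cache distribution and the model softmax.
    return (1 - lam) * p_vocab + lam * p_cache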

Load pre-trained model and define hyperparameters

In [16]:
window = 2
theta = 0.662
lambdas = 0.1279
bptt = 2000
cache_model = nlp.model.train.get_cache_model(name=awd_model_name,
                                             dataset_name=dataset_name,
                                             window=window,
                                             theta=theta,
                                             lambdas=lambdas,
                                             ctx=context[0])
print(cache_model)
CacheCell(
  (lm_model): AWDRNN(
    (embedding): HybridSequential(
      (0): Embedding(33278 -> 400, float32)
      (1): Dropout(p = 0.65, axes=(0,))
    )
    (encoder): Sequential(
      (0): LSTM(400 -> 1150, TNC)
      (1): LSTM(1150 -> 1150, TNC)
      (2): LSTM(1150 -> 400, TNC)
    )
    (decoder): HybridSequential(
      (0): Dense(400 -> 33278, linear)
    )
  )
)

Define get_batch and an evaluation function for the cache model

In [17]:
val_test_batch_size = 1
val_test_batchify = nlp.data.batchify.CorpusBatchify(vocab, val_test_batch_size)
val_data = val_test_batchify(val_dataset)
test_data = val_test_batchify(test_dataset)
In [18]:
def get_batch(data_source, i, seq_len=None):
    seq_len = min(seq_len if seq_len else bptt, len(data_source) - 1 - i)
    data = data_source[i:i + seq_len]
    target = data_source[i + 1:i + 1 + seq_len]
    return data, target
In [19]:
def evaluate_cache(model, data_source, batch_size, ctx):
    total_L = 0.0
    hidden = model.begin_state(
        batch_size=batch_size, func=mx.nd.zeros, ctx=ctx)
    next_word_history = None
    cache_history = None
    for i in range(0, len(data_source) - 1, bptt):
        if i > 0:
            print('Batch %d, ppl %f' % (i, math.exp(total_L / i)))
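        # Note: for brevity this evaluation stops after the first bptt positions;
        # remove the early return below to score the entire dataset.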
        if i == bptt:
            return total_L / i
        data, target = get_batch(data_source, i)
        data = data.as_in_context(ctx)
        target = target.as_in_context(ctx)
        L = 0
        outs, next_word_history, cache_history, hidden = model(
            data, target, next_word_history, cache_history, hidden)
        for out in outs:
            L += (-mx.nd.log(out)).asscalar()
        total_L += L / data.shape[1]
        hidden = detach(hidden)
    return total_L / len(data_source)

Evaluate the pre-trained model on val and test datasets

In [20]:
val_L = evaluate_cache(cache_model, val_data, val_test_batch_size, context[0])
test_L = evaluate_cache(cache_model, test_data, val_test_batch_size, context[0])
print('Best validation loss %.2f, val ppl %.2f'%(val_L, math.exp(val_L)))
print('Best test loss %.2f, test ppl %.2f'%(test_L, math.exp(test_L)))
Batch 2000, ppl 60.767823
Batch 2000, ppl 67.390510
Best validation loss 4.11, val ppl 60.77
Best test loss 4.21, test ppl 67.39

Reference

[1] Merity, S., et al. “Regularizing and optimizing LSTM language models”. ICLR 2018

[2] Grave, E., et al. “Improving neural language models with a continuous cache”. ICLR 2017