gluonnlp.model.train

GluonNLP Toolkit supplies models with a separate train mode because these models behave differently during training
and inference; for example, the number and type of outputs from the forward pass differ.

Language Modeling

AWDRNN AWD language model by salesforce.
StandardRNN Standard RNN language model.
CacheCell Cache language model.
get_cache_model Returns a cache model using a pre-trained language model.
BigRNN Big language model with LSTMP and importance sampling.

Word Embeddings

EmbeddingModel Abstract base class for embedding models for training.
SimpleEmbeddingModel A trainable embedding model.
FasttextEmbeddingModel FastText embedding model.

API Reference

NLP training model.

class gluonnlp.model.train.AWDRNN(mode, vocab_size, embed_size=400, hidden_size=1150, num_layers=3, tie_weights=True, dropout=0.4, weight_drop=0.5, drop_h=0.2, drop_i=0.65, drop_e=0.1, **kwargs)[source]

AWD language model by salesforce.

Reference: https://github.com/salesforce/awd-lstm-lm

License: BSD 3-Clause

Parameters:
  • mode (str) – The type of RNN to use. Options are ‘lstm’, ‘gru’, ‘rnn_tanh’, ‘rnn_relu’.
  • vocab_size (int) – Size of the input vocabulary.
  • embed_size (int) – Dimension of embedding vectors.
  • hidden_size (int) – Number of hidden units for RNN.
  • num_layers (int) – Number of RNN layers.
  • tie_weights (bool, default True) – Whether to tie the weight matrices of the output dense layer and the input embedding layer.
  • dropout (float) – Dropout rate to use for encoder output.
  • weight_drop (float) – Dropout rate to use on encoder h2h weights.
  • drop_h (float) – Dropout rate to use on the outputs of the intermediate layers of the encoder.
  • drop_i (float) – Dropout rate to use on the output of the embedding.
  • drop_e (float) – Dropout rate to use on the embedding layer.
forward(inputs, begin_state=None)[source]

Implements the forward computation used by both the AWD language model and the cache model.

Parameters:
  • inputs (NDArray) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.
  • begin_state (list) – initial recurrent state tensors; the list length equals num_layers, and each initial state has shape (1, batch_size, num_hidden).
Returns:

  • out (NDArray) – output tensor with shape (sequence_length, batch_size, input_size) when layout is “TNC”.
  • out_states (list) – output recurrent state tensors; the list length equals num_layers, and each state has shape (1, batch_size, num_hidden).
  • encoded_raw (list) – the outputs of the model’s encoder; the list length equals num_layers, and each encoder output has shape (sequence_length, batch_size, num_hidden).
  • encoded_dropped (list) – the outputs of the model’s encoder with dropout applied; the list length equals num_layers, and each dropped output has shape (sequence_length, batch_size, num_hidden).

class gluonnlp.model.train.StandardRNN(mode, vocab_size, embed_size, hidden_size, num_layers, dropout=0.5, tie_weights=False, **kwargs)[source]

Standard RNN language model.

Parameters:
  • mode (str) – The type of RNN to use. Options are ‘lstm’, ‘gru’, ‘rnn_tanh’, ‘rnn_relu’.
  • vocab_size (int) – Size of the input vocabulary.
  • embed_size (int) – Dimension of embedding vectors.
  • hidden_size (int) – Number of hidden units for RNN.
  • num_layers (int) – Number of RNN layers.
  • dropout (float) – Dropout rate to use for encoder output.
  • tie_weights (bool, default False) – Whether to tie the weight matrices of output dense layer and input embedding layer.
forward(inputs, begin_state=None)[source]

Defines the forward computation. Arguments can be either NDArray or Symbol.

Parameters:
  • inputs (NDArray) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.
  • begin_state (list) – initial recurrent state tensors; the list length equals num_layers-1, and each initial state has shape (num_layers, batch_size, num_hidden).
Returns:

  • out (NDArray) – output tensor with shape (sequence_length, batch_size, input_size) when layout is “TNC”.
  • out_states (list) – output recurrent state tensors; the list length equals num_layers-1, and each state has shape (num_layers, batch_size, num_hidden).
  • encoded_raw (list) – the list containing the last output of the model’s encoder, with shape (sequence_length, batch_size, num_hidden).
  • encoded_dropped (list) – the list containing the last output of the model’s encoder with dropout applied, with shape (sequence_length, batch_size, num_hidden).

class gluonnlp.model.train.BigRNN(vocab_size, embed_size, hidden_size, num_layers, projection_size, num_sampled, embed_dropout=0.0, encode_dropout=0.0, sparse_weight=True, sparse_grad=True, **kwargs)[source]

Big language model with LSTMP and importance sampling.

Reference: https://github.com/rafaljozefowicz/lm

License: MIT

Parameters:
  • vocab_size (int) – Size of the input vocabulary.
  • embed_size (int) – Dimension of embedding vectors.
  • hidden_size (int) – Number of hidden units for LSTMP.
  • num_layers (int) – Number of LSTMP layers.
  • projection_size (int) – Number of projection units for LSTMP.
  • num_sampled (int) – Number of sampled classes for the decoder.
  • embed_dropout (float) – Dropout rate to use for embedding output.
  • encoder_dropout (float) – Dropout rate to use for encoder output.
  • sparse_weight (bool) – Whether to use RowSparseNDArray for the weights of the input and output embeddings.
  • sparse_grad (bool) – Whether to use RowSparseNDArray for the gradients w.r.t. weights of input and output embeddings.
Note: If sparse_grad is set to True, the gradients w.r.t. the input and output embeddings will be sparse. Only a subset of optimizers support sparse gradients, including SGD, AdaGrad and Adam. By default lazy_update is turned on for these optimizers, which may perform differently from standard updates. For more details, please check the Optimization API at: https://mxnet.incubator.apache.org/api/python/optimization/optimization.html

Note: If sparse_weight is set to True, the parameters in the embedding block and the decoder block will be stored in row_sparse format, which helps reduce memory consumption and communication overhead during multi-GPU training. However, sparse parameters cannot be shared with other blocks, nor can a block containing sparse parameters be hybridized.
forward(inputs, label, begin_state, sampled_values)[source]

Defines the forward computation.

Parameters:
  • inputs (NDArray) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.
  • label (NDArray) – truth tensor with shape (sequence_length, batch_size).
  • begin_state (list) – initial recurrent state tensors; the list length equals num_layers*2. For each layer, the two initial states have shape (batch_size, num_hidden) and (batch_size, num_projection).
  • sampled_values (list) – a list of three tensors: sampled_classes with shape (num_samples,), expected_count_sampled with shape (num_samples,), and expected_count_true with shape (sequence_length, batch_size).
Returns:

  • out (NDArray) – output tensor with shape (sequence_length, batch_size, 1+num_samples) when layout is “TNC”.
  • out_states (list) – output recurrent state tensors; the list length equals num_layers*2. For each layer, the two output states have shape (batch_size, num_hidden) and (batch_size, num_projection).
  • new_target (NDArray) – output tensor with shape (sequence_length, batch_size) when layout is “TNC”.

class gluonnlp.model.train.CacheCell(lm_model, vocab_size, window, theta, lambdas, **kwargs)[source]

Cache language model.

We implement the neural cache language model proposed in the following work:

@article{grave2016improving,
title={Improving neural language models with a continuous cache},
author={Grave, Edouard and Joulin, Armand and Usunier, Nicolas},
journal={ICLR},
year={2017}
}
Parameters:
  • lm_model (gluonnlp.model.StandardRNN or gluonnlp.model.AWDRNN) – The pre-trained language model to augment with a cache. Options are ‘gluonnlp.model.StandardRNN’ and ‘gluonnlp.model.AWDRNN’.
  • vocab_size (int) – Size of the input vocabulary.
  • window (int) – Size of the cache window.
  • theta (float) –

    The scalar that controls the flatness of the cache distribution used to predict the next word, as shown below:

    \[p_{cache} \propto \sum_{i=1}^{t-1} \mathbb{1}_{w=x_{i+1}} exp(\theta {h_t}^T h_i)\]

    where \(p_{cache}\) is the cache distribution, \(\mathbb{1}\) is the indicator function, and \(h_i\) is the output of timestep i.

  • lambdas (float) –

    Linear scalar that interpolates between the cache distribution and the vocabulary distribution; the formulation is as below:

    \[p = (1 - \lambda) p_{vocab} + \lambda p_{cache}\]

    where \(p_{vocab}\) is the vocabulary distribution and \(p_{cache}\) is the cache distribution.
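The two formulas above can be checked with a short NumPy sketch. All values below are made up: a toy vocabulary of 5 words, 4 cached time steps, and a uniform p_vocab standing in for the language model’s softmax output:

```python
import numpy as np

rng = np.random.RandomState(0)
theta, lam = 0.6, 0.2
vocab_size = 5

h = rng.randn(4, 8)                  # cached hidden states h_1..h_4 (dim 8)
h_t = rng.randn(8)                   # current hidden state
next_words = np.array([2, 0, 2, 4])  # x_{i+1} stored alongside each h_i

# p_cache is proportional to sum_i 1_{w = x_{i+1}} * exp(theta * h_t^T h_i)
scores = np.exp(theta * h.dot(h_t))
p_cache = np.zeros(vocab_size)
np.add.at(p_cache, next_words, scores)  # accumulate scores per next word
p_cache /= p_cache.sum()

# p = (1 - lambda) * p_vocab + lambda * p_cache
p_vocab = np.full(vocab_size, 1.0 / vocab_size)
p = (1 - lam) * p_vocab + lam * p_cache
```

Words never seen as a "next word" in the cache (here, words 1 and 3) get zero cache probability, but the interpolation with p_vocab keeps their overall probability nonzero.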

begin_state(*args, **kwargs)[source]

Initialize the hidden states.

forward(inputs, target, next_word_history, cache_history, begin_state=None)[source]

Defines the forward computation for the cache cell. Arguments can be either NDArray or Symbol.

Parameters:
  • inputs (NDArray) – The input data
  • target (NDArray) – The label
  • next_word_history (NDArray) – The next word in memory
  • cache_history (NDArray) – The hidden state in cache history
Returns:

  • out (NDArray) – The linear interpolation of the cache language model with the regular word-level language model
  • next_word_history (NDArray) – The next words to be kept in the memory for look up (size is equal to the window size)
  • cache_history (NDArray) – The hidden states to be kept in the memory for look up (size is equal to the window size)

load_parameters(filename, ctx=cpu(0))[source]

Load parameters from file.

Parameters:
  • filename (str) – Path to the parameter file.
  • ctx (Context or list of Context, default cpu()) – Context(s) to initialize the loaded parameters on.
save_parameters(filename)[source]

Save parameters to file.

Parameters:
  • filename (str) – Path to the file.
class gluonnlp.model.train.EmbeddingModel(embedding_size, **kwargs)[source]

Abstract base class for embedding models for training.

An embedding model is a Gluon block with helper methods to directly work with the textual token representation.

Parameters:embedding_size (int) – Dimension of embeddings.
class gluonnlp.model.train.SimpleEmbeddingModel(token_to_idx, embedding_size, weight_initializer=None, sparse_grad=True, dtype='float32', **kwargs)[source]

A trainable embedding model.

This class is a simple wrapper around mxnet.gluon.nn.Embedding. It trains independent embedding vectors for every token. It implements the gluonnlp.model.train.EmbeddingModel interface, which provides convenient helper methods.

Parameters:
  • token_to_idx (dict) – token_to_idx mapping of the vocabulary that this model is to be trained with. token_to_idx is used for __getitem__ and __contains__. For initialization, len(token_to_idx) is used to specify the size of the embedding matrix.
  • embedding_size (int) – Dimension of embeddings.
  • weight_initializer (mxnet.initializer.Initializer, optional) – Initializer for the embeddings matrix.
  • sparse_grad (bool, default True) – Specifies mxnet.gluon.nn.Embedding sparse_grad argument.
  • dtype (str, default 'float32') – dtype argument passed to gluon.nn.Embedding
forward(words, wordsmask=None)[source]

Compute embedding of words in batch.

Parameters:
  • words (mx.nd.NDArray) – Array of token indices.
  • wordsmask (mx.nd.NDArray) – Mask for embeddings returned by the word level embedding operator.
class gluonnlp.model.train.FasttextEmbeddingModel(token_to_idx, subword_function, embedding_size, weight_initializer=None, sparse_grad=True, dtype='float32', **kwargs)[source]

FastText embedding model.

The FasttextEmbeddingModel combines a word level embedding matrix and a subword level embedding matrix. It implements the gluonnlp.model.train.EmbeddingModel interface which provides convenient functions.

Parameters:
  • token_to_idx (dict) – token_to_idx mapping of the vocabulary that this model is to be trained with. token_to_idx is used for __getitem__ and __contains__. For initialization, len(token_to_idx) is used to specify the size of the word embedding matrix.
  • subword_function (gluonnlp.vocab.SubwordFunction) – The subword function used to obtain the subword indices during training of this model. The subword_function is used for __getitem__ and __contains__. For initialization, len(subword_function) is used to specify the size of the subword embedding matrix.
  • embedding_size (int) – Dimension of embeddings.
  • weight_initializer (mxnet.initializer.Initializer, optional) – Initializer for the embeddings and subword embeddings matrix.
  • sparse_grad (bool, default True) – Specifies mxnet.gluon.nn.Embedding sparse_grad argument.
  • dtype (str, default 'float32') – dtype argument passed to gluon.nn.Embedding
forward(words, subwords, wordsmask=None, subwordsmask=None, words_to_unique_subwords_indices=None)[source]

Compute embedding of words in batch.

Parameters:
  • words (mx.nd.NDArray) – Array of token indices.
  • subwords (mx.nd.NDArray) – The subwords associated with the tokens in words. If words_to_unique_subwords_indices is specified, subwords may contain only the subwords of the unique tokens in words, with words_to_unique_subwords_indices containing the reverse mapping.
  • wordsmask (mx.nd.NDArray, optional) – Mask for embeddings returned by the word level embedding operator.
  • subwordsmask (mx.nd.NDArray, optional) – A mask for the subword embeddings looked up from subwords. Applied before sum reducing the subword embeddings.
  • words_to_unique_subwords_indices (mx.nd.NDArray, optional) – Mapping from the position in the words array to the position in the subwords array.
classmethod load_fasttext_format(path, ctx=cpu(0), **kwargs)[source]

Create an instance of the class and load weights.

Load the weights from the fastText binary format created by https://github.com/facebookresearch/fastText

Parameters:
  • path (str) – Path to the .bin model file.
  • ctx (mx.Context, default mx.cpu()) – Context to initialize the weights on.
  • kwargs (dict) – Keyword arguments are passed to the class initializer.
gluonnlp.model.train.get_cache_model(name, dataset_name='wikitext-2', window=2000, theta=0.6, lambdas=0.2, ctx=cpu(0), **kwargs)[source]

Returns a cache model using a pre-trained language model.

We implement the neural cache language model proposed in the following work:

@article{grave2016improving,
title={Improving neural language models with a continuous cache},
author={Grave, Edouard and Joulin, Armand and Usunier, Nicolas},
journal={ICLR},
year={2017}
}
Parameters:
  • name (str) – Name of the cache language model.
  • dataset_name (str or None, default 'wikitext-2') – The dataset name on which the pre-trained model is trained. Options are ‘wikitext-2’. If specified, the returned vocabulary is extracted from the training set of the dataset. If None, then vocab is required (for specifying the embedding weight size) and is directly returned.
  • window (int) – Size of the cache window.
  • theta (float) –

    The scalar that controls the flatness of the cache distribution used to predict the next word, as shown below:

    \[p_{cache} \propto \sum_{i=1}^{t-1} \mathbb{1}_{w=x_{i+1}} exp(\theta {h_t}^T h_i)\]

    where \(p_{cache}\) is the cache distribution, \(\mathbb{1}\) is the indicator function, and \(h_i\) is the output of timestep i.

  • lambdas (float) –

    Linear scalar that interpolates between the cache distribution and the vocabulary distribution; the formulation is as below:

    \[p = (1 - \lambda) p_{vocab} + \lambda p_{cache}\]

    where \(p_{vocab}\) is the vocabulary distribution and \(p_{cache}\) is the cache distribution.

  • vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified.
  • pretrained (bool, default False) – Whether to load the pre-trained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pre-trained weights.
  • root (str, default '~/.mxnet/models') – Location for keeping the pre-trained model parameters.
Returns:

The model.

Return type:

Block