# gluonnlp.model.train

GluonNLP Toolkit supplies train-mode versions of models whose behavior differs between training and inference; for example, the number and type of outputs from the forward pass differ.

## Language Modeling

- AWDRNN: AWD language model by Salesforce.
- StandardRNN: Standard RNN language model.
- CacheCell: Cache language model.
- get_cache_model: Returns a cache model using a pre-trained language model.
- BigRNN: Big language model with LSTMP and importance sampling.

## Word Embeddings

- EmbeddingModel: Abstract base class for embedding models for training.
- CSREmbeddingModel: A trainable embedding model.
- FasttextEmbeddingModel: FastText embedding model.

## API Reference

NLP training model.

class gluonnlp.model.train.AWDRNN(mode, vocab_size, embed_size=400, hidden_size=1150, num_layers=3, tie_weights=True, dropout=0.4, weight_drop=0.5, drop_h=0.2, drop_i=0.65, drop_e=0.1, **kwargs)[source]

AWD language model by Salesforce.

Parameters:
- mode (str) – The type of RNN to use. Options are ‘lstm’, ‘gru’, ‘rnn_tanh’, ‘rnn_relu’.
- vocab_size (int) – Size of the input vocabulary.
- embed_size (int) – Dimension of embedding vectors.
- hidden_size (int) – Number of hidden units per RNN layer.
- num_layers (int) – Number of RNN layers.
- tie_weights (bool, default True) – Whether to tie the weight matrices of the output dense layer and the input embedding layer.
- dropout (float) – Dropout rate to use for encoder output.
- weight_drop (float) – Dropout rate to use on encoder h2h weights.
- drop_h (float) – Dropout rate to use on the output of intermediate layers of the encoder.
- drop_i (float) – Dropout rate to use on the output of the embedding layer.
- drop_e (float) – Dropout rate to use on the embedding layer.
forward(inputs, begin_state=None)[source]

Implements the forward computation used by the AWD language model and the cache model.

Parameters:
- inputs (NDArray) – Input tensor with shape (sequence_length, batch_size) when layout is “TNC”.
- begin_state (list) – Initial recurrent state tensors, one per layer (length equal to num_layers), each with shape (1, batch_size, num_hidden).

Returns:
- out (NDArray) – Output tensor with shape (sequence_length, batch_size, input_size) when layout is “TNC”.
- out_states (list) – Output recurrent state tensors, one per layer (length equal to num_layers), each with shape (1, batch_size, num_hidden).
- encoded_raw (list) – Outputs of the model’s encoder, one per layer (length equal to num_layers), each with shape (sequence_length, batch_size, num_hidden).
- encoded_dropped (list) – Outputs of the model’s encoder after dropout, one per layer (length equal to num_layers), each with shape (sequence_length, batch_size, num_hidden).
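For example, a minimal forward pass might look like the following sketch (shapes and hyperparameters are illustrative, and it assumes the usual Gluon begin_state convention used in the GluonNLP language-model scripts); the same pattern applies to StandardRNN below.

```python
import mxnet as mx
import gluonnlp as nlp

vocab_size, seq_len, batch_size = 10000, 35, 8
model = nlp.model.train.AWDRNN('lstm', vocab_size)  # default embed/hidden sizes
model.initialize(mx.init.Xavier(), ctx=mx.cpu())

# Token indices laid out as (sequence_length, batch_size), i.e. "TNC" layout.
inputs = mx.nd.random.randint(0, vocab_size, shape=(seq_len, batch_size)).astype('float32')
hidden = model.begin_state(batch_size=batch_size, func=mx.nd.zeros)

out, out_states, encoded_raw, encoded_dropped = model(inputs, hidden)
print(out.shape)  # decoder logits for each position
```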
class gluonnlp.model.train.StandardRNN(mode, vocab_size, embed_size, hidden_size, num_layers, dropout=0.5, tie_weights=False, **kwargs)[source]

Standard RNN language model.

Parameters:
- mode (str) – The type of RNN to use. Options are ‘lstm’, ‘gru’, ‘rnn_tanh’, ‘rnn_relu’.
- vocab_size (int) – Size of the input vocabulary.
- embed_size (int) – Dimension of embedding vectors.
- hidden_size (int) – Number of hidden units per RNN layer.
- num_layers (int) – Number of RNN layers.
- dropout (float) – Dropout rate to use for encoder output.
- tie_weights (bool, default False) – Whether to tie the weight matrices of the output dense layer and the input embedding layer.
forward(inputs, begin_state=None)[source]

Defines the forward computation. Arguments can be either NDArray or Symbol.

Parameters:
- inputs (NDArray) – Input tensor with shape (sequence_length, batch_size) when layout is “TNC”.
- begin_state (list) – Initial recurrent state tensors with length equal to num_layers-1, each with shape (num_layers, batch_size, num_hidden).

Returns:
- out (NDArray) – Output tensor with shape (sequence_length, batch_size, input_size) when layout is “TNC”.
- out_states (list) – Output recurrent state tensors with length equal to num_layers-1, each with shape (num_layers, batch_size, num_hidden).
- encoded_raw (list) – The list containing the last output of the model’s encoder, with shape (sequence_length, batch_size, num_hidden).
- encoded_dropped (list) – The list containing the last output of the model’s encoder after dropout, with shape (sequence_length, batch_size, num_hidden).
class gluonnlp.model.train.BigRNN(vocab_size, embed_size, hidden_size, num_layers, projection_size, num_sampled, embed_dropout=0.0, encode_dropout=0.0, sparse_weight=True, sparse_grad=True, **kwargs)[source]

Big language model with LSTMP and importance sampling.

Reference: https://github.com/rafaljozefowicz/lm

Parameters:
- vocab_size (int) – Size of the input vocabulary.
- embed_size (int) – Dimension of embedding vectors.
- hidden_size (int) – Number of hidden units for LSTMP.
- num_layers (int) – Number of LSTMP layers.
- projection_size (int) – Number of projection units for LSTMP.
- num_sampled (int) – Number of sampled classes for the decoder.
- embed_dropout (float) – Dropout rate to use for embedding output.
- encode_dropout (float) – Dropout rate to use for encoder output.
- sparse_weight (bool) – Whether to use RowSparseNDArray for weights of input and output embeddings.
- sparse_grad (bool) – Whether to use RowSparseNDArray for the gradients w.r.t. weights of input and output embeddings.

Note: If sparse_grad is set to True, the gradients w.r.t. input and output embeddings will be sparse. Only a subset of optimizers support sparse gradients, including SGD, AdaGrad and Adam. By default lazy_update is turned on for these optimizers, which may perform differently from standard updates. For more details, please check the Optimization API at: https://mxnet.incubator.apache.org/api/python/optimization/optimization.html

Note: If sparse_weight is set to True, the parameters in the embedding block and decoder block will be stored in row_sparse format, which helps reduce memory consumption and communication overhead during multi-GPU training. However, sparse parameters cannot be shared with other blocks, nor can a block containing sparse parameters be hybridized.
forward(inputs, label, begin_state, sampled_values)[source]

Defines the forward computation.

Parameters:
- inputs (NDArray) – Input tensor with shape (sequence_length, batch_size) when layout is “TNC”.
- label (NDArray) – Label tensor with shape (sequence_length, batch_size) when layout is “TNC”.
- begin_state (list) – Initial recurrent state tensors with length equal to num_layers*2. For each layer the two initial states have shape (batch_size, num_hidden) and (batch_size, num_projection).
- sampled_values (list) – A list of three tensors: sampled_classes with shape (num_samples,), expected_count_sampled with shape (num_samples,), and expected_count_true with shape (sequence_length, batch_size).

Returns:
- out (NDArray) – Output tensor with shape (sequence_length, batch_size, 1+num_samples) when layout is “TNC”.
- out_states (list) – Output recurrent state tensors with length equal to num_layers*2. For each layer the two output states have shape (batch_size, num_hidden) and (batch_size, num_projection).
- new_target (NDArray) – Output tensor with shape (sequence_length, batch_size) when layout is “TNC”.
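A hedged sketch of one forward step follows; it draws the sampled classes with mx.nd.contrib.rand_zipfian (the order of the returned expected counts follows the MXNet documentation), and all sizes are illustrative.

```python
import mxnet as mx
import gluonnlp as nlp

vocab_size, seq_len, batch_size, num_sampled = 10000, 20, 4, 128
model = nlp.model.train.BigRNN(vocab_size, embed_size=128, hidden_size=512,
                               num_layers=1, projection_size=128,
                               num_sampled=num_sampled)
model.initialize(mx.init.Xavier(), ctx=mx.cpu())

inputs = mx.nd.random.randint(0, vocab_size, shape=(seq_len, batch_size)).astype('float32')
label = mx.nd.random.randint(0, vocab_size, shape=(seq_len, batch_size)).astype('float32')
hidden = model.begin_state(batch_size=batch_size, func=mx.nd.zeros)

# rand_zipfian returns (sampled_classes, expected_count_true, expected_count_sampled).
sampled, cnt_true, cnt_sampled = mx.nd.contrib.rand_zipfian(
    label.reshape((-1,)), num_sampled, vocab_size)
sampled_values = [sampled, cnt_sampled, cnt_true.reshape(label.shape)]

out, out_states, new_target = model(inputs, label, hidden, sampled_values)
print(out.shape)  # (seq_len, batch_size, 1 + num_sampled)
```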
class gluonnlp.model.train.CacheCell(lm_model, vocab_size, window, theta, lambdas, **kwargs)[source]

Cache language model.

We implement the neural cache language model proposed in the following work:

@article{grave2016improving,
title={Improving neural language models with a continuous cache},
author={Grave, Edouard and Joulin, Armand and Usunier, Nicolas},
journal={ICLR},
year={2017}
}

Parameters:
- lm_model (gluonnlp.model.train.StandardRNN or gluonnlp.model.train.AWDRNN) – The underlying language model to wrap with a cache.
- vocab_size (int) – Size of the input vocabulary.
- window (int) – Size of the cache window.
- theta (float) – The scalar that controls the flatness of the cache distribution that predicts the next word:

  $$p_{cache} \propto \sum_{i=1}^{t-1} \mathbb{1}_{w = x_{i+1}} \exp(\theta\, h_t^T h_i)$$

  where $p_{cache}$ is the cache distribution, $\mathbb{1}$ is the identity function, and $h_i$ is the output of timestep i.
- lambdas (float) – Linear scalar that interpolates between the cache distribution and the vocabulary distribution:

  $$p = (1 - \lambda)\, p_{vocab} + \lambda\, p_{cache}$$

  where $p_{vocab}$ is the vocabulary distribution and $p_{cache}$ is the cache distribution.
begin_state(*args, **kwargs)[source]

Initialize the hidden states.

forward(inputs, target, next_word_history, cache_history, begin_state=None)[source]

Defines the forward computation for the cache cell. Arguments can be either NDArray or Symbol.

Parameters:
- inputs (NDArray) – The input data.
- target (NDArray) – The labels.
- next_word_history (NDArray) – The next words in memory.
- cache_history (NDArray) – The hidden states in the cache history.
- begin_state (list, optional) – The initial recurrent states.

Returns:
- out (NDArray) – The linear interpolation of the cache language model with the regular word-level language model.
- next_word_history (NDArray) – The next words to be kept in the memory for lookup (size is equal to the window size).
- cache_history (NDArray) – The hidden states to be kept in the memory for lookup (size is equal to the window size).
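The sketch below shows how the histories thread through successive calls, wrapping a train-mode AWDRNN; that the histories may start as None on the first window, and that the updated recurrent state is also returned, are assumptions based on the GluonNLP cache language model script.

```python
import mxnet as mx
import gluonnlp as nlp

vocab_size, seq_len, batch_size = 10000, 35, 1
lm = nlp.model.train.AWDRNN('lstm', vocab_size)
cache = nlp.model.train.CacheCell(lm, vocab_size, window=200, theta=0.6, lambdas=0.2)
cache.initialize(mx.init.Xavier(), ctx=mx.cpu())

inputs = mx.nd.random.randint(0, vocab_size, shape=(seq_len, batch_size)).astype('float32')
target = mx.nd.random.randint(0, vocab_size, shape=(seq_len, batch_size)).astype('float32')
hidden = cache.begin_state(batch_size=batch_size, func=mx.nd.zeros)

# Histories start empty and are fed back in on the next window (an assumption).
result = cache(inputs, target, None, None, hidden)
out, next_word_history, cache_history = result[:3]
```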
load_parameters(filename, ctx=cpu(0))[source]

Load parameters from file.

Parameters:
- filename (str) – Path to parameter file.
- ctx (Context or list of Context, default cpu()) – Context(s) to initialize loaded parameters on.
save_parameters(filename)[source]

Save parameters to file.

Parameters:
- filename (str) – Path to file.
class gluonnlp.model.train.EmbeddingModel(prefix=None, params=None)[source]

Abstract base class for embedding models for training.

An embedding model is a Gluon block with additional __contains__ and __getitem__ support for computing embeddings given a string or list of strings. See the documentation of __contains__ and __getitem__ for details.

class gluonnlp.model.train.CSREmbeddingModel(token_to_idx, output_dim, weight_initializer=None, sparse_grad=True, dtype='float32', **kwargs)[source]

A trainable embedding model.

This class is a simple wrapper around mxnet.gluon.nn.Embedding. It trains independent embedding vectors for every token. It implements the gluonnlp.model.train.EmbeddingModel interface, which provides convenient helper methods.

Parameters:
- token_to_idx (dict) – token_to_idx mapping of the vocabulary that this model is to be trained with. token_to_idx is used for __getitem__ and __contains__. For initialization len(token_to_idx) is used to specify the size of the embedding matrix.
- output_dim (int) – Dimension of the dense embedding.
- weight_initializer (mxnet.initializer.Initializer, optional) – Initializer for the embeddings matrix.
- sparse_grad (bool, default True) – Specifies mxnet.gluon.nn.Embedding sparse_grad argument.
- dtype (str, default 'float32') – dtype argument passed to gluon.nn.Embedding.
hybrid_forward(F, words, weight)[source]

Compute embedding of words in batch.

Parameters:
- words (mx.nd.NDArray) – Array of token indices.
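Since the class implements the EmbeddingModel interface, embeddings are most conveniently looked up through __contains__ and __getitem__; a minimal sketch with an illustrative two-token vocabulary:

```python
import mxnet as mx
import gluonnlp as nlp

token_to_idx = {'hello': 0, 'world': 1}
model = nlp.model.train.CSREmbeddingModel(token_to_idx, output_dim=16)
model.initialize(mx.init.Uniform(), ctx=mx.cpu())

print('hello' in model)           # True, via __contains__
vecs = model[['hello', 'world']]  # embedding lookup via __getitem__
print(vecs.shape)                 # (2, 16)
```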
class gluonnlp.model.train.FasttextEmbeddingModel(token_to_idx, subword_function, output_dim, weight_initializer=None, sparse_grad=True, dtype='float32', **kwargs)[source]

FastText embedding model.

The FasttextEmbeddingModel combines a word level embedding matrix and a subword level embedding matrix. It implements the gluonnlp.model.train.EmbeddingModel interface which provides convenient functions.

Parameters:
- token_to_idx (dict) – token_to_idx mapping of the vocabulary that this model is to be trained with. token_to_idx is used for __getitem__ and __contains__. For initialization len(token_to_idx) is used to specify the size of the word embedding matrix.
- subword_function (gluonnlp.vocab.SubwordFunction) – The subword function used to obtain the subword indices during training of this model. The subword_function is used for __getitem__ and __contains__. For initialization len(subword_function) is used to specify the size of the subword embedding matrix.
- output_dim (int) – Dimension of embeddings.
- weight_initializer (mxnet.initializer.Initializer, optional) – Initializer for the embeddings and subword embeddings matrix.
- sparse_grad (bool, default True) – Specifies mxnet.gluon.nn.Embedding sparse_grad argument.
- dtype (str, default 'float32') – dtype argument passed to gluon.nn.Embedding.
hybrid_forward(F, words, weight)[source]

Compute embedding of words in batch.

Parameters:
- words (mxnet.ndarray.sparse.CSRNDArray) – Sparse array containing weights for every word and subword index.

Returns:
- The weighted sum of word and subword embeddings.
classmethod load_fasttext_format(path, ctx=cpu(0), **kwargs)[source]

Create an instance of the class and load weights.

Parameters:
- path (str) – Path to the .bin model file.
- ctx (mx.Context, default mx.cpu()) – Context to initialize the weights on.
- kwargs (dict) – Keyword arguments are passed to the class initializer.
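A short sketch of loading pre-trained weights (the .bin path is hypothetical); because of the subword function, even out-of-vocabulary tokens receive embeddings:

```python
import mxnet as mx
from gluonnlp.model.train import FasttextEmbeddingModel

# 'wiki.simple.bin' is a hypothetical path to a fastText binary model file.
model = FasttextEmbeddingModel.load_fasttext_format('wiki.simple.bin', ctx=mx.cpu())

vecs = model[['hello', 'helloooo']]  # the second token may be OOV; subwords cover it
print(vecs.shape)
```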
gluonnlp.model.train.get_cache_model(name, dataset_name='wikitext-2', window=2000, theta=0.6, lambdas=0.2, ctx=cpu(0), **kwargs)[source]

Returns a cache model using a pre-trained language model.

We implement the neural cache language model proposed in the following work:

@article{grave2016improving,
title={Improving neural language models with a continuous cache},
author={Grave, Edouard and Joulin, Armand and Usunier, Nicolas},
journal={ICLR},
year={2017}
}

Parameters:
- name (str) – Name of the cache language model.
- dataset_name (str or None, default 'wikitext-2') – The dataset name on which the pre-trained model is trained. Options are ‘wikitext-2’. If specified, the returned vocabulary is extracted from the training set of the dataset. If None, vocab is required, for specifying the embedding weight size, and is directly returned.
- window (int) – Size of the cache window.
- theta (float) – The scalar that controls the flatness of the cache distribution that predicts the next word:

  $$p_{cache} \propto \sum_{i=1}^{t-1} \mathbb{1}_{w = x_{i+1}} \exp(\theta\, h_t^T h_i)$$

  where $p_{cache}$ is the cache distribution, $\mathbb{1}$ is the identity function, and $h_i$ is the output of timestep i.
- lambdas (float) – Linear scalar that interpolates between the cache distribution and the vocabulary distribution:

  $$p = (1 - \lambda)\, p_{vocab} + \lambda\, p_{cache}$$

  where $p_{vocab}$ is the vocabulary distribution and $p_{cache}$ is the cache distribution.
- vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified.
- pretrained (bool, default False) – Whether to load the pre-trained weights for the model.
- ctx (Context, default CPU) – The context in which to load the pre-trained weights.
- root (str, default '~/.mxnet/models') – Location for keeping the pre-trained model parameters.

Returns:
- The cache model.

Return type: Block
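For example, building a cache model on top of a pre-trained AWD-LSTM might look as follows (the model name follows GluonNLP's pre-trained model naming scheme; the hyperparameters are illustrative):

```python
import mxnet as mx
import gluonnlp as nlp

cache_model = nlp.model.train.get_cache_model('awd_lstm_lm_1150',
                                              dataset_name='wikitext-2',
                                              window=2000, theta=0.6,
                                              lambdas=0.2, ctx=mx.cpu())
```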