gluonnlp.model

Gluon NLP Toolkit supplies models for common NLP tasks with pre-trained weights. By default, all requested pre-trained weights are downloaded from a public repository and stored in ~/.mxnet/models/.

Language Modeling

awd_lstm_lm_1150 3-layer LSTM language model with weight-drop, variational dropout, and tied weights.
awd_lstm_lm_600 3-layer LSTM language model with weight-drop, variational dropout, and tied weights.
AWDRNN AWD language model by salesforce.
standard_lstm_lm_200 Standard 2-layer LSTM language model with tied embedding and output weights.
standard_lstm_lm_650 Standard 2-layer LSTM language model with tied embedding and output weights.
standard_lstm_lm_1500 Standard 2-layer LSTM language model with tied embedding and output weights.
big_rnn_lm_2048_512 Big 1-layer LSTMP language model.
StandardRNN Standard RNN language model.
get_model Returns a pre-defined model by name.
BigRNN Big language model with LSTMP for inference.

Convolutional Encoder

ConvolutionalEncoder Convolutional encoder.

Highway Network

Highway Highway network.

Attention Cell

AttentionCell Abstract class for attention cells.
MultiHeadAttentionCell Multi-head Attention Cell.
MLPAttentionCell Concat the query and the key and use a single-hidden-layer MLP to get the attention score.
DotProductAttentionCell Dot product attention between the query and the key.

Other Modeling Utilities

WeightDropParameter A container that holds parameters (weights) of Blocks and performs dropout.
apply_weight_drop Apply weight drop to the parameter of a block.
L2Normalization Normalize the input array by dividing the L2 norm along the given axis.
ISLogits Block that computes sampled output training logits and labels suitable for importance sampled softmax loss.
NCELogits Block that computes sampled output training logits and labels suitable for noise contrastive estimation loss.
SparseISLogits Block that computes sampled output training logits and labels suitable for importance sampled softmax loss.
SparseNCELogits Block that computes sampled output training logits and labels suitable for noise contrastive estimation loss.

API Reference

Module for pre-defined NLP models.

This module contains definitions for the following model architectures: AWD.

You can construct a model with random weights by calling its constructor. Because NLP models are tied to vocabularies, you can either specify a dataset name to load and use the vocabulary of that dataset:

import gluonnlp as nlp
awd, vocab = nlp.model.awd_lstm_lm_1150(dataset_name='wikitext-2')

or directly specify a vocabulary object:

awd, vocab = nlp.model.awd_lstm_lm_1150(None, vocab=custom_vocab)

We provide pre-trained weights for all of the listed models. A pre-trained model can be constructed by passing pretrained=True:

awd, vocab = nlp.model.awd_lstm_lm_1150(dataset_name='wikitext-2',
                                        pretrained=True)
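
The returned model and vocabulary can be used together. Below is a minimal sketch that scores a short, illustrative token sequence with the pre-trained model (any whitespace-tokenized text works):

import mxnet as mx
import gluonnlp as nlp

awd, vocab = nlp.model.awd_lstm_lm_1150(dataset_name='wikitext-2', pretrained=True)
tokens = ['the', 'quick', 'brown', 'fox']
# Map tokens to indices and arrange them as (sequence_length, batch_size=1).
inputs = mx.nd.array(vocab[tokens]).reshape(-1, 1)
hidden = awd.begin_state(batch_size=1, func=mx.nd.zeros)
output, hidden = awd(inputs, hidden)
print(output.shape)  # (sequence_length, 1, len(vocab))
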
class gluonnlp.model.AWDRNN(mode, vocab_size, embed_size, hidden_size, num_layers, tie_weights, dropout, weight_drop, drop_h, drop_i, drop_e, **kwargs)[source]

AWD language model by salesforce.

Reference: https://github.com/salesforce/awd-lstm-lm

License: BSD 3-Clause

Parameters:
  • mode (str) – The type of RNN to use. Options are ‘lstm’, ‘gru’, ‘rnn_tanh’, ‘rnn_relu’.
  • vocab_size (int) – Size of the input vocabulary.
  • embed_size (int) – Dimension of embedding vectors.
  • hidden_size (int) – Number of hidden units for RNN.
  • num_layers (int) – Number of RNN layers.
  • tie_weights (bool, default False) – Whether to tie the weight matrices of output dense layer and input embedding layer.
  • dropout (float) – Dropout rate to use for encoder output.
  • weight_drop (float) – Dropout rate to use on encoder h2h weights.
  • drop_h (float) – Dropout rate to use on the output of intermediate layers of the encoder.
  • drop_i (float) – Dropout rate to use on the output of the embedding.
  • drop_e (float) – Dropout rate to use on the embedding layer.
forward(inputs, begin_state=None)[source]

Implement forward computation.

Parameters:
  • inputs (NDArray) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.
  • begin_state (list) – initial recurrent state tensors, a list of length num_layers. Each initial state has shape (1, batch_size, num_hidden).
Returns:

  • out (NDArray) – output tensor with shape (sequence_length, batch_size, input_size) when layout is “TNC”.
  • out_states (list) – output recurrent state tensors, a list of length num_layers. Each state has shape (1, batch_size, num_hidden).
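
For illustration, a randomly initialized AWDRNN with small, illustrative hyperparameters (mirroring the awd_lstm_lm_600 configuration) can be run as follows:

import mxnet as mx
import gluonnlp as nlp

model = nlp.model.AWDRNN(mode='lstm', vocab_size=1000, embed_size=200, hidden_size=600,
                         num_layers=3, tie_weights=True, dropout=0.4, weight_drop=0.5,
                         drop_h=0.2, drop_i=0.65, drop_e=0.1)
model.initialize()
inputs = mx.nd.ones((35, 4))  # (sequence_length, batch_size) of token ids
hidden = model.begin_state(batch_size=4, func=mx.nd.zeros)
output, out_states = model(inputs, hidden)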

class gluonnlp.model.StandardRNN(mode, vocab_size, embed_size, hidden_size, num_layers, dropout, tie_weights, **kwargs)[source]

Standard RNN language model.

Parameters:
  • mode (str) – The type of RNN to use. Options are ‘lstm’, ‘gru’, ‘rnn_tanh’, ‘rnn_relu’.
  • vocab_size (int) – Size of the input vocabulary.
  • embed_size (int) – Dimension of embedding vectors.
  • hidden_size (int) – Number of hidden units for RNN.
  • num_layers (int) – Number of RNN layers.
  • dropout (float) – Dropout rate to use for encoder output.
  • tie_weights (bool, default False) – Whether to tie the weight matrices of output dense layer and input embedding layer.
forward(inputs, begin_state=None)[source]

Defines the forward computation. Arguments can be either NDArray or Symbol.

Parameters:
  • inputs (NDArray) –
    input tensor with shape (sequence_length, batch_size)
    when layout is “TNC”.
  • begin_state (list) – initial recurrent state tensors, a list of length num_layers-1. Each initial state has shape (num_layers, batch_size, num_hidden).
Returns:

  • out (NDArray) – output tensor with shape (sequence_length, batch_size, input_size) when layout is “TNC”.
  • out_states (list) – output recurrent state tensors, a list of length num_layers-1. Each state has shape (num_layers, batch_size, num_hidden).
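
For illustration, a minimal sketch with illustrative sizes (embed_size equals hidden_size here so that the weights can be tied):

import mxnet as mx
import gluonnlp as nlp

model = nlp.model.StandardRNN(mode='lstm', vocab_size=1000, embed_size=200,
                              hidden_size=200, num_layers=2, dropout=0.4,
                              tie_weights=True)
model.initialize()
inputs = mx.nd.ones((35, 4))  # (sequence_length, batch_size) of token ids
hidden = model.begin_state(batch_size=4, func=mx.nd.zeros)
output, out_states = model(inputs, hidden)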

class gluonnlp.model.BigRNN(vocab_size, embed_size, hidden_size, num_layers, projection_size, embed_dropout=0.0, encode_dropout=0.0, **kwargs)[source]

Big language model with LSTMP for inference.

Parameters:
  • vocab_size (int) – Size of the input vocabulary.
  • embed_size (int) – Dimension of embedding vectors.
  • hidden_size (int) – Number of hidden units for LSTMP.
  • num_layers (int) – Number of LSTMP layers.
  • projection_size (int) – Number of projection units for LSTMP.
  • embed_dropout (float) – Dropout rate to use for embedding output.
  • encode_dropout (float) – Dropout rate to use for encoder output.
forward(inputs, begin_state)[source]

Implement forward computation.

Parameters:
  • inputs (NDArray) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.
  • begin_state (list) – initial recurrent state tensors, a list of length num_layers*2. For each layer, the two initial states have shape (batch_size, num_hidden) and (batch_size, num_projection).
Returns:

  • out (NDArray) – output tensor with shape (sequence_length, batch_size, vocab_size) when layout is “TNC”.
  • out_states (list) – output recurrent state tensors, a list of length num_layers*2. For each layer, the two output states have shape (batch_size, num_hidden) and (batch_size, num_projection).
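
A minimal sketch with small, illustrative sizes (the pre-trained big_rnn_lm_2048_512 uses hidden_size=2048 and projection_size=512); note that begin_state is required here:

import mxnet as mx
import gluonnlp as nlp

model = nlp.model.BigRNN(vocab_size=1000, embed_size=64, hidden_size=128,
                         num_layers=1, projection_size=32)
model.initialize()
inputs = mx.nd.ones((20, 4))  # (sequence_length, batch_size) of token ids
states = model.begin_state(batch_size=4, func=mx.nd.zeros)  # num_layers*2 state tensors
output, states = model(inputs, states)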

gluonnlp.model.awd_lstm_lm_1150(dataset_name=None, vocab=None, pretrained=False, ctx=cpu(0), root='~/.mxnet/models', **kwargs)[source]

3-layer LSTM language model with weight-drop, variational dropout, and tied weights.

Embedding size is 400, and hidden layer size is 1150.

Parameters:
  • dataset_name (str or None, default None) – The name of the dataset on which the pre-trained model was trained. Options are ‘wikitext-2’. If specified, the returned vocabulary is extracted from the training set of that dataset. If None, then vocab is required (it determines the embedding weight size) and is returned directly. The pre-trained model achieves 73.32/69.74 ppl on the validation and test sets of wikitext-2, respectively.
  • vocab (gluonnlp.Vocab or None, default None) – Vocab object to be used with the language model. Required when dataset_name is not specified.
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
  • root (str, default '~/.mxnet/models') – Location for keeping the model parameters.
Returns: The model and the vocabulary.
Return type: gluon.Block, gluonnlp.Vocab

gluonnlp.model.awd_lstm_lm_600(dataset_name=None, vocab=None, pretrained=False, ctx=cpu(0), root='~/.mxnet/models', **kwargs)[source]

3-layer LSTM language model with weight-drop, variational dropout, and tied weights.

Embedding size is 200, and hidden layer size is 600.

Parameters:
  • dataset_name (str or None, default None) – The name of the dataset on which the pre-trained model was trained. Options are ‘wikitext-2’. If specified, the returned vocabulary is extracted from the training set of that dataset. If None, then vocab is required (it determines the embedding weight size) and is returned directly. The pre-trained model achieves 84.61/80.96 ppl on the validation and test sets of wikitext-2, respectively.
  • vocab (gluonnlp.Vocab or None, default None) – Vocab object to be used with the language model. Required when dataset_name is not specified.
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
  • root (str, default '~/.mxnet/models') – Location for keeping the model parameters.
Returns: The model and the vocabulary.
Return type: gluon.Block, gluonnlp.Vocab

gluonnlp.model.standard_lstm_lm_200(dataset_name=None, vocab=None, pretrained=False, ctx=cpu(0), root='~/.mxnet/models', **kwargs)[source]

Standard 2-layer LSTM language model with tied embedding and output weights.

Both embedding and hidden dimensions are 200.

Parameters:
  • dataset_name (str or None, default None) – The name of the dataset on which the pre-trained model was trained. Options are ‘wikitext-2’. If specified, the returned vocabulary is extracted from the training set of that dataset. If None, then vocab is required (it determines the embedding weight size) and is returned directly. The pre-trained model achieves 108.25/102.26 ppl on the validation and test sets of wikitext-2, respectively.
  • vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified.
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
  • root (str, default '~/.mxnet/models') – Location for keeping the model parameters.
Returns: The model and the vocabulary.
Return type: gluon.Block, gluonnlp.Vocab

gluonnlp.model.standard_lstm_lm_650(dataset_name=None, vocab=None, pretrained=False, ctx=cpu(0), root='~/.mxnet/models', **kwargs)[source]

Standard 2-layer LSTM language model with tied embedding and output weights.

Both embedding and hidden dimensions are 650.

Parameters:
  • dataset_name (str or None, default None) – The name of the dataset on which the pre-trained model was trained. Options are ‘wikitext-2’. If specified, the returned vocabulary is extracted from the training set of that dataset. If None, then vocab is required (it determines the embedding weight size) and is returned directly. The pre-trained model achieves 98.96/93.90 ppl on the validation and test sets of wikitext-2, respectively.
  • vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified.
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
  • root (str, default '~/.mxnet/models') – Location for keeping the model parameters.
Returns: The model and the vocabulary.
Return type: gluon.Block, gluonnlp.Vocab

gluonnlp.model.standard_lstm_lm_1500(dataset_name=None, vocab=None, pretrained=False, ctx=cpu(0), root='~/.mxnet/models', **kwargs)[source]

Standard 2-layer LSTM language model with tied embedding and output weights.

Both embedding and hidden dimensions are 1500.

Parameters:
  • dataset_name (str or None, default None) – The name of the dataset on which the pre-trained model was trained. Options are ‘wikitext-2’. If specified, the returned vocabulary is extracted from the training set of that dataset. If None, then vocab is required (it determines the embedding weight size) and is returned directly. The pre-trained model achieves 98.29/92.83 ppl on the validation and test sets of wikitext-2, respectively.
  • vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified.
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
  • root (str, default '~/.mxnet/models') – Location for keeping the model parameters.
Returns: The model and the vocabulary.
Return type: gluon.Block, gluonnlp.Vocab

gluonnlp.model.big_rnn_lm_2048_512(dataset_name=None, vocab=None, pretrained=False, ctx=cpu(0), root='~/.mxnet/models', **kwargs)[source]

Big 1-layer LSTMP language model.

Both embedding and projection size are 512. Hidden size is 2048.

Parameters:
  • dataset_name (str or None, default None) – The name of the dataset on which the pre-trained model was trained. Options are ‘gbw’. If specified, the returned vocabulary is extracted from the training set of that dataset. If None, then vocab is required (it determines the embedding weight size) and is returned directly. The pre-trained model achieves 44.05 ppl on the test set of GBW.
  • vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified.
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
  • root (str, default '~/.mxnet/models') – Location for keeping the model parameters.
Returns: The model and the vocabulary.
Return type: gluon.Block, gluonnlp.Vocab

class gluonnlp.model.BeamSearchScorer(alpha=1.0, K=5.0, prefix=None, params=None)[source]

Score function used in beam search.

Implements the length-penalized score function used in the GNMT paper:

scores = (log_probs + scores) / length_penalty
length_penalty = (K + length)^\alpha / (K + 1)^\alpha
Parameters:
  • alpha (float, default 1.0) –
  • K (float, default 5.0) –
hybrid_forward(F, log_probs, scores, step)[source]

Overrides to construct symbolic graph for this Block.

Parameters:
  • x (Symbol or NDArray) – The first input tensor.
  • *args (list of Symbol or list of NDArray) – Additional input tensors.
class gluonnlp.model.BeamSearchSampler(beam_size, decoder, eos_id, scorer=BeamSearchScorer( ), max_length=100)[source]

Draw samples from the decoder by beam search.

Parameters:
  • beam_size (int) – The beam size.
  • decoder (callable) –

    Function of the one-step-ahead decoder, should have the form:

    log_probs, new_states = decoder(step_input, states)
    

    The step_input, log_probs, and states should follow these rules:

    • step_input has shape (batch_size,),
    • log_probs has shape (batch_size, V),
    • states and new_states have the same structure and the leading dimension of the inner NDArrays is the batch dimension.
  • eos_id (int) – Id of the EOS token. No other elements will be appended to the sample if it reaches eos_id.
  • scorer (BeamSearchScorer, default BeamSearchScorer(alpha=1.0, K=5)) – The score function used in beam search.
  • max_length (int, default 100) – The maximum search length.
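
A minimal sketch with a toy decoder. The decoder below is hypothetical and simply returns uniform log-probabilities over a 5-token vocabulary; a real decoder would wrap a language or translation model:

import mxnet as mx
import gluonnlp as nlp

vocab_size, eos_id = 5, 4

def toy_decoder(step_input, states):
    # step_input: (batch_size,) -> log_probs: (batch_size, vocab_size)
    log_probs = mx.nd.log_softmax(mx.nd.zeros((step_input.shape[0], vocab_size)))
    return log_probs, states

sampler = nlp.model.BeamSearchSampler(beam_size=3, decoder=toy_decoder,
                                      eos_id=eos_id, max_length=5)
inputs = mx.nd.array([0, 1])   # (batch_size,) start tokens
states = mx.nd.zeros((2, 1))   # leading dimension is the batch dimension
samples, scores, valid_lengths = sampler(inputs, states)
print(samples.shape)           # (batch_size, beam_size, sequence_length)
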
class gluonnlp.model.AttentionCell(prefix=None, params=None)[source]

Abstract class for attention cells. Extend the class to implement your own attention method. One typical usage is to define your own _compute_weight() function to calculate the weights:

cell = AttentionCell()
out = cell(query, key, value, mask)
forward(query, key, value=None, mask=None)[source]

Defines the forward computation. Arguments can be either NDArray or Symbol.

hybrid_forward(F, query, key, value, mask=None)[source]

Overrides to construct symbolic graph for this Block.

Parameters:
  • x (Symbol or NDArray) – The first input tensor.
  • *args (list of Symbol or list of NDArray) – Additional input tensors.
class gluonnlp.model.MultiHeadAttentionCell(base_cell, query_units, key_units, value_units, num_heads, use_bias=True, weight_initializer=None, bias_initializer='zeros', prefix=None, params=None)[source]

Multi-head Attention Cell.

In the MultiHeadAttentionCell, the input query/key/value will be linearly projected for num_heads times with different projection matrices. Each projected key, value, query will be used to calculate the attention weights and values. The output of each head will be concatenated to form the final output.

The idea is first proposed in “[Arxiv2014] Neural Turing Machines” and is later adopted in “[NIPS2017] Attention is All You Need” to solve the Neural Machine Translation problem.

Parameters:
  • base_cell (AttentionCell) –
  • query_units (int) – Total number of projected units for the query. Must be divisible by num_heads.
  • key_units (int) – Total number of projected units for the key. Must be divisible by num_heads.
  • value_units (int) – Total number of projected units for the value. Must be divisible by num_heads.
  • num_heads (int) – Number of parallel attention heads
  • use_bias (bool, default True) – Whether to use bias when projecting the query/key/values
  • weight_initializer (str or Initializer or None, default None) – Initializer of the weights.
  • bias_initializer (str or Initializer, default ‘zeros’) – Initializer of the bias.
  • prefix (str or None, default None) – See document of Block.
  • params (str or None, default None) – See document of Block.
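
A minimal sketch that wraps a dot-product base cell with four heads (all shapes and unit counts are illustrative):

import mxnet as mx
import gluonnlp as nlp

cell = nlp.model.MultiHeadAttentionCell(
    base_cell=nlp.model.DotProductAttentionCell(scaled=True),
    query_units=32, key_units=32, value_units=32, num_heads=4)
cell.initialize()
query = mx.nd.random.uniform(shape=(2, 3, 16))  # (batch_size, query_length, query_dim)
key = mx.nd.random.uniform(shape=(2, 5, 16))    # (batch_size, key_length, key_dim)
context_vec, att_weights = cell(query, key)     # value defaults to the key
print(context_vec.shape)                        # (2, 3, 32): heads are concatenated
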
class gluonnlp.model.MLPAttentionCell(units, act=Activation(tanh), normalized=False, dropout=0.0, weight_initializer=None, bias_initializer='zeros', prefix=None, params=None)[source]

Concat the query and the key and use a single-hidden-layer MLP to get the attention score. We provide two modes: the standard mode and the normalized mode.

In the standard mode:

score = v tanh(W [h_q, h_k] + b)

In the normalized mode (Same as TensorFlow):

score = g v / ||v||_2 tanh(W [h_q, h_k] + b)

This type of attention was first proposed in “[ICLR2015] Neural Machine Translation by Jointly Learning to Align and Translate”.

Parameters:
  • units (int) –
  • act (Activation, default nn.Activation('tanh')) –
  • normalized (bool, default False) – Whether to normalize the weight that maps the embedded hidden states to the final score. This strategy can be interpreted as a type of “[NIPS2016] Weight Normalization”.
  • dropout (float, default 0.0) – Attention dropout.
  • weight_initializer (str or Initializer or None, default None) – Initializer of the weights.
  • bias_initializer (str or Initializer, default ‘zeros’) – Initializer of the bias.
  • prefix (str or None, default None) – See document of Block.
  • params (ParameterDict or None, default None) – See document of Block.
class gluonnlp.model.DotProductAttentionCell(units=None, luong_style=False, scaled=True, normalized=False, use_bias=True, dropout=0.0, weight_initializer=None, bias_initializer='zeros', prefix=None, params=None)[source]

Dot product attention between the query and the key.

Depending on parameters, defined as:

units is None:
    score = <h_q, h_k>
units is not None and luong_style is False:
    score = <W_q h_q, W_k h_k>
units is not None and luong_style is True:
    score = <W h_q, h_k>
Parameters:
  • units (int or None, default None) –

    Project the query and key to vectors with units dimension before applying the attention. If set to None, the query vector and the key vector are directly used to compute the attention and should have the same dimension:

    If the units is None,
        score = <h_q, h_k>
    Else if the units is not None and luong_style is False:
        score = <W_q h_q, W_k h_k>
    Else if the units is not None and luong_style is True:
        score = <W h_q, h_k>
    
  • luong_style (bool, default False) –

    If turned on, the score will be:

    score = <W h_q, h_k>
    

    units must be the same as the dimension of the key vector

  • scaled (bool, default True) –

    Whether to divide the attention weights by the sqrt of the query dimension. This is first proposed in “[NIPS2017] Attention is all you need.”:

    score = <h_q, h_k> / sqrt(dim_q)
    
  • normalized (bool, default False) –

    If turned on, the cosine distance is used, i.e:

    score = <h_q / ||h_q||, h_k / ||h_k||>
    
  • use_bias (bool, default True) – Whether to use bias in the projection layers.
  • dropout (float, default 0.0) – Attention dropout
  • weight_initializer (str or Initializer or None, default None) – Initializer of the weights
  • bias_initializer (str or Initializer, default ‘zeros’) – Initializer of the bias
  • prefix (str or None, default None) – See document of Block.
  • params (str or None, default None) – See document of Block.
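
A minimal sketch of plain scaled dot-product attention, including an optional mask over the key positions (shapes are illustrative):

import mxnet as mx
import gluonnlp as nlp

cell = nlp.model.DotProductAttentionCell(scaled=True)
cell.initialize()
query = mx.nd.random.uniform(shape=(2, 3, 8))  # (batch_size, query_length, dim)
key = mx.nd.random.uniform(shape=(2, 5, 8))    # (batch_size, key_length, dim)
value = mx.nd.random.uniform(shape=(2, 5, 8))  # (batch_size, key_length, value_dim)
mask = mx.nd.ones((2, 3, 5))                   # (batch_size, query_length, key_length)
context_vec, att_weights = cell(query, key, value, mask)
print(context_vec.shape)   # (2, 3, 8)
print(att_weights.shape)   # (2, 3, 5)
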
gluonnlp.model.apply_weight_drop(block, local_param_name, rate, axes=(), weight_dropout_mode='training')[source]

Apply weight drop to the parameter of a block.

Parameters:
  • block (Block or HybridBlock) – The block whose parameter will have weight drop applied.
  • local_param_name (str) – The parameter name used on the block, such as ‘weight’.
  • rate (float) – Fraction of the input units to drop. Must be a number between 0 and 1.
  • axes (tuple of int, default ()) – The axes on which dropout mask is shared. If empty, regular dropout is applied.
  • weight_dropout_mode ({'training', 'always'}, default 'training') – Whether the weight dropout should be applied only at training time, or always be applied.
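
A minimal sketch that applies weight drop to the weight of a Dense block, following the parameter-name convention above; the dropout mask on the weight is only active in training mode:

import mxnet as mx
from mxnet import autograd
from mxnet.gluon import nn
import gluonnlp as nlp

net = nn.Dense(10, in_units=20)
nlp.model.apply_weight_drop(net, 'weight', rate=0.5)
net.initialize()
x = mx.nd.ones((2, 20))
with autograd.record():   # training mode: a dropout mask is applied to the weight
    train_out = net(x)
test_out = net(x)         # inference mode: the full weight is used
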
class gluonnlp.model.WeightDropParameter(parameter, rate=0.0, mode='training', axes=())[source]

A container that holds parameters (weights) of Blocks and performs dropout.

Parameters:
  • parameter (Parameter) – The parameter which drops out.
  • rate (float, default 0.0) – Fraction of the input units to drop. Must be a number between 0 and 1. Dropout is not applied if rate is 0.
  • mode (str, default 'training') – Whether to only turn on dropout during training or to also turn on for inference. Options are ‘training’ and ‘always’.
  • axes (tuple of int, default ()) – Axes on which dropout mask is shared.
data(ctx=None)[source]

Returns a copy of this parameter on one context. Must have been initialized on this context before.

Parameters:ctx (Context) – Desired context.
Returns: A copy of this parameter on ctx.
Return type: NDArray on ctx
class gluonnlp.model.RNNCellLayer(rnn_cell, layout='TNC', **kwargs)[source]

A block that takes an RNN cell and makes it act like an RNN layer.

Parameters:
  • rnn_cell (Cell) – The cell to wrap into a layer-like block.
  • layout (str, default 'TNC') – The output layout of the layer.
forward(inputs, states=None)[source]

Defines the forward computation. Arguments can be either NDArray or Symbol.

class gluonnlp.model.L2Normalization(axis=-1, eps=1e-06, **kwargs)[source]

Normalize the input array by dividing the L2 norm along the given axis.

The output is computed as:

out = data / (sqrt(sum(data**2, axis)) + eps)
Parameters:
  • axis (int, default -1) – The axis to compute the norm value.
  • eps (float, default 1E-6) – The epsilon value to avoid dividing by zero.
hybrid_forward(F, x)[source]

Overrides to construct symbolic graph for this Block.

Parameters:
  • x (Symbol or NDArray) – The first input tensor.
  • *args (list of Symbol or list of NDArray) – Additional input tensors.
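
A minimal sketch normalizing each row of a small array:

import mxnet as mx
import gluonnlp as nlp

net = nlp.model.L2Normalization(axis=-1)
x = mx.nd.array([[3.0, 4.0], [0.0, 5.0]])
y = net(x)
print(y)  # each row now has (approximately) unit L2 norm
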
class gluonnlp.model.Highway(input_size, num_layers, activation='relu', highway_bias=<gluonnlp.initializer.initializer.HighwayBias object>, **kwargs)[source]

Highway network.

We implemented the highway network proposed in the following work:

@article{srivastava2015highway,
  title={Highway networks},
  author={Srivastava, Rupesh Kumar and Greff, Klaus and Schmidhuber, J{\"u}rgen},
  journal={arXiv preprint arXiv:1505.00387},
  year={2015}
}

The full version of the work:

@inproceedings{srivastava2015training,
 title={Training very deep networks},
 author={Srivastava, Rupesh K and Greff, Klaus and Schmidhuber, J{\"u}rgen},
 booktitle={Advances in neural information processing systems},
 pages={2377--2385},
 year={2015}
}

A Highway layer is defined as below:

\[y = (1 - t) * x + t * f(A(x))\]

which is a gated combination of a linear transform and a non-linear transform of its input, where \(x\) is the input tensor, \(A\) is a linear transform, \(f\) is an element-wise non-linear transform, \(t\) is an element-wise transform gate, and \(1 - t\) is the carry gate.

Parameters:
  • input_size (int) – The dimension of the input tensor. We assume the input has shape (batch_size, input_size).
  • num_layers (int) – The number of highway layers to apply to the input.
  • activation (str, default 'relu') – The non-linear activation function to use. If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
  • highway_bias (HighwayBias, default HighwayBias(nonlinear_transform_bias=0.0, transform_gate_bias=-2.0)) – The biases applied to the highway layer. We set the default according to the above original work.
hybrid_forward(F, inputs, **kwargs)[source]

Forward computation for highway layer

Parameters:inputs (NDArray) – The input tensor, of shape (batch_size, input_size).
Returns:outputs – The output tensor, of the same shape as the input tensor: (batch_size, input_size).
Return type:NDArray
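
A minimal sketch of a two-layer highway network over 16-dimensional inputs:

import mxnet as mx
import gluonnlp as nlp

net = nlp.model.Highway(input_size=16, num_layers=2)
net.initialize()
x = mx.nd.random.uniform(shape=(4, 16))  # (batch_size, input_size)
y = net(x)
print(y.shape)                           # (4, 16): same shape as the input
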
class gluonnlp.model.ConvolutionalEncoder(embed_size=15, num_filters=(25, 50, 75, 100, 125, 150), ngram_filter_sizes=(1, 2, 3, 4, 5, 6), conv_layer_activation='tanh', num_highway=1, highway_layer_activation='relu', highway_bias=<gluonnlp.initializer.initializer.HighwayBias object>, output_size=None, **kwargs)[source]

Convolutional encoder.

We implement the convolutional encoder proposed in the following work:

@inproceedings{kim2016character,
 title={Character-Aware Neural Language Models.},
 author={Kim, Yoon and Jernite, Yacine and Sontag, David and Rush, Alexander M},
 booktitle={AAAI},
 pages={2741--2749},
 year={2016}
}
Parameters:
  • embed_size (int, default 15) – The input dimension to the encoder. We set the default according to the original work’s experiments on PTB dataset with Char-small model setting.
  • num_filters (Tuple[int], default (25, 50, 75, 100, 125, 150)) – The output dimension for each convolutional layer according to the filter sizes, which are the number of the filters learned by the layers. We set the default according to the original work’s experiments on PTB dataset with Char-small model setting.
  • ngram_filter_sizes (Tuple[int], default (1, 2, 3, 4, 5, 6)) – The size of each convolutional layer, and len(ngram_filter_sizes) equals to the number of convolutional layers. We set the default according to the original work’s experiments on PTB dataset with Char-small model setting.
  • conv_layer_activation (str, default 'tanh') – Activation function to be used after the convolutional layer. If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x). We set the default according to the original work’s experiments on PTB dataset with Char-small model setting.
  • num_highway (int, default '1') – The number of layers of the Highway layer. We set the default according to the original work’s experiments on PTB dataset with Char-small model setting.
  • highway_layer_activation (str, default 'relu') – Activation function to be used after the highway layer. If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x). We set the default according to the original work’s experiments on PTB dataset with Char-small model setting.
  • highway_bias (HighwayBias, default HighwayBias(nonlinear_transform_bias=0.0, transform_gate_bias=-2.0)) – The biases applied to the highway layer. We set the default according to the above original work.
  • output_size (int, default None) – The output dimension after conducting the convolutions and max pooling, and applying highways, as well as linear projection.
hybrid_forward(F, inputs, mask=None)[source]

Forward computation for char_encoder

Parameters:
  • inputs (NDArray) – The input tensor is of shape (seq_len, batch_size, embedding_size) TNC.
  • mask (NDArray) – The mask applied to the input of shape (seq_len, batch_size), the mask will be broadcasted along the embedding dimension.
Returns: output – The output of the encoder, with shape (batch_size, output_size).
Return type: NDArray
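
A minimal sketch using the default Char-small settings; the input follows the TNC layout described above, and the sequence length should be at least the largest ngram filter size:

import mxnet as mx
import gluonnlp as nlp

encoder = nlp.model.ConvolutionalEncoder()        # embed_size defaults to 15
encoder.initialize()
inputs = mx.nd.random.uniform(shape=(10, 2, 15))  # (seq_len, batch_size, embed_size)
mask = mx.nd.ones((10, 2))                        # (seq_len, batch_size)
out = encoder(inputs, mask)
print(out.shape)                                  # (batch_size, output dimension)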

class gluonnlp.model.ISLogits(num_classes, num_sampled, in_unit, remove_accidental_hits=True, dtype='float32', weight_initializer=None, bias_initializer='zeros', sparse_grad=True, prefix=None, params=None)[source]

Block that computes sampled output training logits and labels suitable for importance sampled softmax loss.

Please use loss.SoftmaxCrossEntropyLoss for sampled softmax loss.

Example:

# network with importance sampling for training
encoder = Encoder(..)
decoder = ISLogits(..)
train_net.add(encoder)
train_net.add(decoder)
loss = SoftmaxCrossEntropyLoss()

# training
for x, y, sampled_values in train_batches:
    sampled_cls, cnt_sampled, cnt_true = sampled_values
    logits, new_targets = train_net(x, sampled_cls, cnt_sampled, cnt_true, y)
    l = loss(logits, new_targets)

# network for testing
test_net.add(encoder)
test_net.add(Dense(..., params=decoder.params))

# testing
for x, y in test_batches:
    logits = test_net(x)
    l = loss(logits, y)
Parameters:
  • num_classes (int) – Number of possible classes.
  • num_sampled (int) – Number of classes randomly sampled for each batch.
  • in_unit (int) – Dimensionality of the input space.
  • remove_accidental_hits (bool, default True) – Whether to remove “accidental hits” when a sampled candidate is equal to one of the true classes.
  • dtype (str or np.dtype, default 'float32') – Data type of output embeddings.
  • weight_initializer (str or Initializer, optional) – Initializer for the kernel weights matrix.
  • bias_initializer (str or Initializer, optional) – Initializer for the bias vector.
  • sparse_grad (bool, default True.) – Whether to use sparse gradient.
  • Inputs
    • x: A tensor of shape (batch_size, in_unit). The forward activation of the input network.
    • sampled_candidates: A tensor of shape (num_sampled,). The sampled candidate classes.
    • expected_count_sampled: A tensor of shape (num_sampled,). The expected count for sampled candidates.
    • expected_count_true: A tensor of shape (num_sampled). The expected count for true classes.
    • label: A tensor of shape (batch_size,1). The target classes.
  • Outputs
    • out: A tensor of shape (batch_size, 1+num_sampled). The output probability for the true class and sampled classes
    • new_targets: A tensor of shape (batch_size,). The new target classes.
Note: If sparse_grad is set to True, the gradient w.r.t. the input and output embeddings will be sparse. Only a subset of optimizers support sparse gradients, including SGD, AdaGrad and Adam. By default lazy_update is turned on for these optimizers, which may perform differently from standard updates. For more details, please check the Optimization API at: https://mxnet.incubator.apache.org/api/python/optimization/optimization.html
class gluonnlp.model.NCELogits(num_classes, num_sampled, in_unit, remove_accidental_hits=False, dtype='float32', weight_initializer=None, bias_initializer='zeros', sparse_grad=True, prefix=None, params=None)[source]

Block that computes sampled output training logits and labels suitable for noise contrastive estimation loss.

Please use loss.SigmoidBinaryCrossEntropyLoss for noise contrastive estimation loss during training.

Example:

# network with sampling for training
encoder = Encoder(..)
decoder = NCELogits(..)
train_net.add(encoder)
train_net.add(decoder)
loss_train = SigmoidBinaryCrossEntropyLoss()

# training
for x, y, sampled_values in train_batches:
    sampled_cls, cnt_sampled, cnt_true = sampled_values
    logits, new_targets = train_net(x, sampled_cls, cnt_sampled, cnt_true, y)
    l = loss_train(logits, new_targets)

# network for testing
test_net.add(encoder)
test_net.add(Dense(..., params=decoder.params))
loss_test = SoftmaxCrossEntropyLoss()

# testing
for x, y in test_batches:
    logits = test_net(x)
    l = loss_test(logits, y)
Parameters:
  • num_classes (int) – Number of possible classes.
  • num_sampled (int) – Number of classes randomly sampled for each batch.
  • in_unit (int) – Dimensionality of the input space.
  • remove_accidental_hits (bool, default False) – Whether to remove “accidental hits” when a sampled candidate is equal to one of the true classes.
  • dtype (str or np.dtype, default 'float32') – Data type of output embeddings.
  • weight_initializer (str or Initializer, optional) – Initializer for the kernel weights matrix.
  • bias_initializer (str or Initializer, optional) – Initializer for the bias vector.
  • sparse_grad (bool, default True.) – Whether to use sparse gradient.
  • Inputs
    • x: A tensor of shape (batch_size, in_unit). The forward activation of the input network.
    • sampled_candidates: A tensor of shape (num_sampled,). The sampled candidate classes.
    • expected_count_sampled: A tensor of shape (num_sampled,). The expected count for sampled candidates.
    • expected_count_true: A tensor of shape (num_sampled). The expected count for true classes.
    • label: A tensor of shape (batch_size,1). The target classes.
  • Outputs
    • out: A tensor of shape (batch_size, 1+num_sampled). The output probability for the true class and sampled classes
    • new_targets: A tensor of shape (batch_size,). The new target classes.
Note: If sparse_grad is set to True, the gradient w.r.t. the input and output embeddings will be sparse. Only a subset of optimizers support sparse gradients, including SGD, AdaGrad and Adam. By default lazy_update is turned on for these optimizers, which may perform differently from standard updates. For more details, please check the Optimization API at: https://mxnet.incubator.apache.org/api/python/optimization/optimization.html
class gluonnlp.model.SparseISLogits(num_classes, num_sampled, in_unit, remove_accidental_hits=True, dtype='float32', weight_initializer=None, bias_initializer='zeros', prefix=None, params=None)[source]

Block that computes sampled output training logits and labels suitable for importance sampled softmax loss.

Please use loss.SoftmaxCrossEntropyLoss for sampled softmax loss.

The block is designed for distributed training with an extremely large number of classes to reduce communication overhead and memory consumption. Both the weight and the gradient w.r.t. the weight are RowSparseNDArray.

Example:

# network with importance sampled softmax for training
encoder = Encoder(..)
train_net.add(encoder)
train_net.add(SparseISLogits(.., prefix='decoder'))
loss = SoftmaxCrossEntropyLoss()

# training
for x, y, sampled_values in train_batches:
    sampled_cls, cnt_sampled, cnt_true = sampled_values
    logits, new_targets = train_net(x, sampled_cls, cnt_sampled, cnt_true, y)
    l = loss(logits, new_targets)

# save params
train_net.save_parameters('net.params')

# network for testing
test_net.add(encoder)
test_net.add(Dense(..., prefix='decoder'))

# load params
test_net.load_parameters('net.params')

# testing
for x, y in test_batches:
    logits = test_net(x)
    l = loss(logits, y)
Parameters:
  • num_classes (int) – Number of possible classes.
  • num_sampled (int) – Number of classes randomly sampled for each batch.
  • in_unit (int) – Dimensionality of the input space.
  • remove_accidental_hits (bool, default True) – Whether to remove “accidental hits” when a sampled candidate is equal to one of the true classes.
  • dtype (str or np.dtype, default 'float32') – Data type of output embeddings.
  • weight_initializer (str or Initializer, optional) – Initializer for the kernel weights matrix.
  • bias_initializer (str or Initializer, optional) – Initializer for the bias vector.
  • Inputs
    • x: A tensor of shape (batch_size, in_unit). The forward activation of the input network.
    • sampled_candidates: A tensor of shape (num_sampled,). The sampled candidate classes.
    • expected_count_sampled: A tensor of shape (num_sampled,). The expected count for sampled candidates.
    • expected_count_true: A tensor of shape (num_sampled). The expected count for true classes.
    • label: A tensor of shape (batch_size,1). The target classes.
  • Outputs
    • out: A tensor of shape (batch_size, 1+num_sampled). The output probability for the true class and sampled classes
    • new_targets: A tensor of shape (batch_size,). The new target classes.
class gluonnlp.model.SparseNCELogits(num_classes, num_sampled, in_unit, remove_accidental_hits=True, dtype='float32', weight_initializer=None, bias_initializer='zeros', prefix=None, params=None)[source]

Block that computes sampled output training logits and labels suitable for noise contrastive estimation loss.

Please use loss.SigmoidBinaryCrossEntropyLoss for noise contrastive estimation loss during training.

The block is designed for distributed training with an extremely large number of classes to reduce communication overhead and memory consumption. Both the weight and the gradient w.r.t. the weight are RowSparseNDArray.

Example:

# network with importance sampled softmax for training
encoder = Encoder(..)
train_net.add(encoder)
train_net.add(SparseNCELogits(.., prefix='decoder'))
train_loss = SigmoidBinaryCrossEntropyLoss()

# training
for x, y, sampled_values in train_batches:
    sampled_cls, cnt_sampled, cnt_true = sampled_values
    logits, new_targets = train_net(x, sampled_cls, cnt_sampled, cnt_true, y)
    l = train_loss(logits, new_targets)

# save params
train_net.save_parameters('net.params')

# network for testing
test_net.add(encoder)
test_net.add(Dense(..., prefix='decoder'))

# load params
test_net.load_parameters('net.params')
test_loss = SoftmaxCrossEntropyLoss()

# testing
for x, y in test_batches:
    logits = test_net(x)
    l = test_loss(logits, y)
Parameters:
  • num_classes (int) – Number of possible classes.
  • num_sampled (int) – Number of classes randomly sampled for each batch.
  • in_unit (int) – Dimensionality of the input space.
  • remove_accidental_hits (bool, default True) – Whether to remove “accidental hits” when a sampled candidate is equal to one of the true classes.
  • dtype (str or np.dtype, default 'float32') – Data type of output embeddings.
  • weight_initializer (str or Initializer, optional) – Initializer for the kernel weights matrix.
  • bias_initializer (str or Initializer, optional) – Initializer for the bias vector.
  • Inputs
    • x: A tensor of shape (batch_size, in_unit). The forward activation of the input network.
    • sampled_candidates: A tensor of shape (num_sampled,). The sampled candidate classes.
    • expected_count_sampled: A tensor of shape (num_sampled,). The expected count for sampled candidates.
    • expected_count_true: A tensor of shape (num_sampled). The expected count for true classes.
    • label: A tensor of shape (batch_size,1). The target classes.
  • Outputs
    • out: A tensor of shape (batch_size, 1+num_sampled). The output probability for the true class and sampled classes
    • new_targets: A tensor of shape (batch_size,). The new target classes.
gluonnlp.model.get_model(name, dataset_name='wikitext-2', **kwargs)[source]

Returns a pre-defined model by name.

Parameters:
  • name (str) – Name of the model.
  • dataset_name (str or None, default 'wikitext-2') – The name of the dataset on which the pre-trained model was trained. Options are ‘wikitext-2’. If specified, the returned vocabulary is extracted from the training set of that dataset. If None, then vocab is required (it determines the embedding weight size) and is returned directly.
  • vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified.
  • pretrained (bool, default False) – Whether to load the pretrained weights for model.
  • ctx (Context, default CPU) – The context in which to load the pretrained weights.
  • root (str, default '~/.mxnet/models') – Location for keeping the model parameters.
Returns: The model.
Return type: Block
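
For example, the call below is equivalent to invoking gluonnlp.model.awd_lstm_lm_1150 directly with the same keyword arguments:

import gluonnlp as nlp

model, vocab = nlp.model.get_model('awd_lstm_lm_1150',
                                   dataset_name='wikitext-2',
                                   pretrained=True)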