gluonnlp.model
GluonNLP Toolkit supplies models for common NLP tasks with pretrained weights. By default, all requested pretrained weights are downloaded from a public repository and stored in ~/.mxnet/models/.
Model Registry
The model registry provides an easy interface to obtain predefined and pretrained models.
Returns a predefined model by name. 
The get_model function returns a predefined model given the name of a registered model. The following sections of this page present a list of registered names for each model category.
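As a rough illustration of what a name-based registry does, the sketch below maps registered names to constructor functions. This is not GluonNLP's implementation: `_MODELS`, `register`, `get_model_sketch`, and `toy_lm` are hypothetical names invented for this example.

```python
# Hypothetical sketch of a name-based model registry; not GluonNLP's code.
_MODELS = {}


def register(name):
    """Decorator that records a constructor under a registry name."""
    def decorator(fn):
        _MODELS[name.lower()] = fn
        return fn
    return decorator


def get_model_sketch(name, **kwargs):
    """Look up a registered constructor by name and call it with kwargs."""
    key = name.lower()
    if key not in _MODELS:
        raise ValueError('Model %s is not registered. Available: %s'
                         % (name, ', '.join(sorted(_MODELS))))
    return _MODELS[key](**kwargs)


@register('toy_lm')
def toy_lm(hidden_size=8, **kwargs):
    # Stand-in "model": a plain dict describing the configuration.
    return {'name': 'toy_lm', 'hidden_size': hidden_size}
```

Calling `get_model_sketch('toy_lm', hidden_size=16)` returns the stand-in configuration, while an unknown name raises a `ValueError` that lists the registered names.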
Language Modeling
Components
AWD language model by Salesforce.

Bidirectional LM encoder.

Long-Short Term Memory Projected (LSTMP) network cell with cell clip and projection clip.

Standard RNN language model.

Big language model with LSTMP for inference.
Predefined models
3-layer LSTM language model with weight-drop, variational dropout, and tied weights.

3-layer LSTM language model with weight-drop, variational dropout, and tied weights.

Standard 2-layer LSTM language model with tied embedding and output weights.

Standard 2-layer LSTM language model with tied embedding and output weights.

Standard 2-layer LSTM language model with tied embedding and output weights.

Big 1-layer LSTMP language model.
Machine Translation
Base class of the encoders in sequence to sequence learning models.

Structure of the Transformer Encoder.

Structure of the Transformer Encoder Cell.

Structure of the Position-wise FeedForward Neural Network for Transformer.
Transformer pretrained model.
Bidirectional Encoder Representations from Transformers
Components
Generic Model for BERT (Bidirectional Encoder Representations from Transformers). 

BERT style Layer Normalization. 

Structure of the BERT Encoder. 

Structure of the Transformer Encoder Cell for BERT. 

Structure of the Position-wise FeedForward Neural Network for BERT.
Predefined models
Generic BERT BASE model.

Generic BERT LARGE model.
Convolutional Encoder
Convolutional encoder.
ELMo
Components
ELMo Bidirectional language model 

ELMo character encoder 
Predefined models
ELMo 2-layer BiLSTM with 1024 hidden units, 128 projection size, 1 highway layer.

ELMo 2-layer BiLSTM with 2048 hidden units, 256 projection size, 1 highway layer.

ELMo 2-layer BiLSTM with 4096 hidden units, 512 projection size, 2 highway layers.
Attention Cell
Abstract class for attention cells.

Multi-head Attention Cell.

Concatenate the query and the key and use a single-hidden-layer MLP to get the attention score.

Dot product attention between the query and the key.
Sequence Sampling
Score function used in beam search.

Draw samples from the decoder by beam search.

Draw samples from the decoder according to the step-wise distribution.
Other Modeling Utilities
A container that holds the parameters (weights) of Blocks and performs dropout.

Apply weight drop to the parameter of a block. 

Normalize the input array by dividing the L2 norm along the given axis. 

Gaussian Error Linear Unit. 

Importance sampled Dense block, which computes sampled pred output and labels for importance sampled softmax loss during training. 

Noise contrastive estimated Dense block, which computes sampled pred output and labels for noise contrastive estimation loss during training. 

Importance sampled Dense block with sparse weights, which computes sampled pred output and labels for importance sampled softmax loss during training. 

Noise contrastive estimated Dense block with sparse weights, which computes sampled pred output and labels for noise contrastive estimation loss during training. 
API Reference
Module for predefined NLP models.
This module contains definitions for the following model architectures:  AWD
You can construct a model with random weights by calling its constructor. Because NLP models are tied to vocabularies, you can either specify a dataset name to load and use the vocabulary of that dataset:
import gluonnlp as nlp
awd, vocab = nlp.model.awd_lstm_lm_1150(dataset_name='wikitext2')
or directly specify a vocabulary object:
awd, vocab = nlp.model.awd_lstm_lm_1150(None, vocab=custom_vocab)
We provide pretrained models for all the listed models.
These models can be constructed by passing pretrained=True:
awd, vocab = nlp.model.awd_lstm_lm_1150(dataset_name='wikitext2',
                                        pretrained=True)
You can construct a predefined ELMo model structure:
import gluonnlp as nlp
elmo = nlp.model.elmo_2x1024_128_2048cnn_1xhighway(dataset_name='gbw')
You can also get an ELMo model with pretrained parameters:
import gluonnlp as nlp
elmo = nlp.model.elmo_2x1024_128_2048cnn_1xhighway(dataset_name='gbw', pretrained=True)

class gluonnlp.model.AWDRNN(mode, vocab_size, embed_size, hidden_size, num_layers, tie_weights, dropout, weight_drop, drop_h, drop_i, drop_e, **kwargs)
AWD language model by Salesforce.
Reference: https://github.com/salesforce/awd-lstm-lm
License: BSD 3-Clause
 Parameters
mode (str) – The type of RNN to use. Options are ‘lstm’, ‘gru’, ‘rnn_tanh’, ‘rnn_relu’.
vocab_size (int) – Size of the input vocabulary.
embed_size (int) – Dimension of embedding vectors.
hidden_size (int) – Number of hidden units for RNN.
num_layers (int) – Number of RNN layers.
tie_weights (bool, default False) – Whether to tie the weight matrices of output dense layer and input embedding layer.
dropout (float) – Dropout rate to use for encoder output.
weight_drop (float) – Dropout rate to use on encoder h2h weights.
drop_h (float) – Dropout rate to use on the output of intermediate layers of the encoder.
drop_i (float) – Dropout rate to use on the output of the embedding.
drop_e (float) – Dropout rate to use on the embedding layer.

hybrid_forward(F, inputs, begin_state=None)
Implement forward computation.
 Parameters
inputs (NDArray) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.
begin_state (list) – initial recurrent state tensors, a list of length num_layers. Each initial state has shape (1, batch_size, num_hidden).
 Returns
out (NDArray) – output tensor with shape (sequence_length, batch_size, input_size) when layout is “TNC”.
out_states (list) – output recurrent state tensors, a list of length num_layers. Each state has shape (1, batch_size, num_hidden).

class gluonnlp.model.StandardRNN(mode, vocab_size, embed_size, hidden_size, num_layers, dropout, tie_weights, **kwargs)
Standard RNN language model.
 Parameters
mode (str) – The type of RNN to use. Options are ‘lstm’, ‘gru’, ‘rnn_tanh’, ‘rnn_relu’.
vocab_size (int) – Size of the input vocabulary.
embed_size (int) – Dimension of embedding vectors.
hidden_size (int) – Number of hidden units for RNN.
num_layers (int) – Number of RNN layers.
dropout (float) – Dropout rate to use for encoder output.
tie_weights (bool, default False) – Whether to tie the weight matrices of output dense layer and input embedding layer.

hybrid_forward(F, inputs, begin_state=None)
Defines the forward computation. Arguments can be either NDArray or Symbol.
 Parameters
inputs (NDArray) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.
begin_state (list) – initial recurrent state tensors, a list of length num_layers-1. Each initial state has shape (num_layers, batch_size, num_hidden).
 Returns
out (NDArray) – output tensor with shape (sequence_length, batch_size, input_size) when layout is “TNC”.
out_states (list) – output recurrent state tensors, a list of length num_layers-1. Each state has shape (num_layers, batch_size, num_hidden).

class gluonnlp.model.BigRNN(vocab_size, embed_size, hidden_size, num_layers, projection_size, embed_dropout=0.0, encode_dropout=0.0, **kwargs)
Big language model with LSTMP for inference.
 Parameters
vocab_size (int) – Size of the input vocabulary.
embed_size (int) – Dimension of embedding vectors.
hidden_size (int) – Number of hidden units for LSTMP.
num_layers (int) – Number of LSTMP layers.
projection_size (int) – Number of projection units for LSTMP.
embed_dropout (float) – Dropout rate to use for embedding output.
encode_dropout (float) – Dropout rate to use for encoder output.

forward(inputs, begin_state)
Implement forward computation.
 Parameters
inputs (NDArray) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.
begin_state (list) – initial recurrent state tensors, a list of length num_layers*2. For each layer, the two initial states have shape (batch_size, num_hidden) and (batch_size, num_projection).
 Returns
out (NDArray) – output tensor with shape (sequence_length, batch_size, vocab_size) when layout is “TNC”.
out_states (list) – output recurrent state tensors, a list of length num_layers*2. For each layer, the two output states have shape (batch_size, num_hidden) and (batch_size, num_projection).

gluonnlp.model.awd_lstm_lm_1150(dataset_name=None, vocab=None, pretrained=False, ctx=cpu(0), root='/var/lib/jenkins/.mxnet/models', **kwargs)
3-layer LSTM language model with weight-drop, variational dropout, and tied weights.
Embedding size is 400, and hidden layer size is 1150.
 Parameters
dataset_name (str or None, default None) – The dataset name on which the pretrained model is trained. Options are ‘wikitext2’. If specified, then the returned vocabulary is extracted from the training set of the dataset. If None, then vocab is required, for specifying embedding weight size, and is directly returned. The pretrained model achieves 73.32/69.74 ppl on Val and Test of wikitext2 respectively.
vocab (gluonnlp.Vocab or None, default None) – Vocab object to be used with the language model. Required when dataset_name is not specified.
pretrained (bool, default False) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
 Returns
 Return type
gluon.Block, gluonnlp.Vocab

gluonnlp.model.awd_lstm_lm_600(dataset_name=None, vocab=None, pretrained=False, ctx=cpu(0), root='/var/lib/jenkins/.mxnet/models', **kwargs)
3-layer LSTM language model with weight-drop, variational dropout, and tied weights.
Embedding size is 200, and hidden layer size is 600.
 Parameters
dataset_name (str or None, default None) – The dataset name on which the pretrained model is trained. Options are ‘wikitext2’. If specified, then the returned vocabulary is extracted from the training set of the dataset. If None, then vocab is required, for specifying embedding weight size, and is directly returned. The pretrained model achieves 84.61/80.96 ppl on Val and Test of wikitext2 respectively.
vocab (gluonnlp.Vocab or None, default None) – Vocab object to be used with the language model. Required when dataset_name is not specified.
pretrained (bool, default False) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
 Returns
 Return type
gluon.Block, gluonnlp.Vocab

gluonnlp.model.standard_lstm_lm_200(dataset_name=None, vocab=None, pretrained=False, ctx=cpu(0), root='/var/lib/jenkins/.mxnet/models', **kwargs)
Standard 2-layer LSTM language model with tied embedding and output weights.
Both embedding and hidden dimensions are 200.
 Parameters
dataset_name (str or None, default None) – The dataset name on which the pretrained model is trained. Options are ‘wikitext2’. If specified, then the returned vocabulary is extracted from the training set of the dataset. If None, then vocab is required, for specifying embedding weight size, and is directly returned. The pretrained model achieves 108.25/102.26 ppl on Val and Test of wikitext2 respectively.
vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified.
pretrained (bool, default False) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
 Returns
 Return type
gluon.Block, gluonnlp.Vocab

gluonnlp.model.standard_lstm_lm_650(dataset_name=None, vocab=None, pretrained=False, ctx=cpu(0), root='/var/lib/jenkins/.mxnet/models', **kwargs)
Standard 2-layer LSTM language model with tied embedding and output weights.
Both embedding and hidden dimensions are 650.
 Parameters
dataset_name (str or None, default None) – The dataset name on which the pretrained model is trained. Options are ‘wikitext2’. If specified, then the returned vocabulary is extracted from the training set of the dataset. If None, then vocab is required, for specifying embedding weight size, and is directly returned. The pretrained model achieves 98.96/93.90 ppl on Val and Test of wikitext2 respectively.
vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified.
pretrained (bool, default False) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
 Returns
 Return type
gluon.Block, gluonnlp.Vocab

gluonnlp.model.standard_lstm_lm_1500(dataset_name=None, vocab=None, pretrained=False, ctx=cpu(0), root='/var/lib/jenkins/.mxnet/models', **kwargs)
Standard 2-layer LSTM language model with tied embedding and output weights.
Both embedding and hidden dimensions are 1500.
 Parameters
dataset_name (str or None, default None) – The dataset name on which the pretrained model is trained. Options are ‘wikitext2’. If specified, then the returned vocabulary is extracted from the training set of the dataset. If None, then vocab is required, for specifying embedding weight size, and is directly returned. The pretrained model achieves 98.29/92.83 ppl on Val and Test of wikitext2 respectively.
vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified.
pretrained (bool, default False) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
 Returns
 Return type
gluon.Block, gluonnlp.Vocab

gluonnlp.model.big_rnn_lm_2048_512(dataset_name=None, vocab=None, pretrained=False, ctx=cpu(0), root='/var/lib/jenkins/.mxnet/models', **kwargs)
Big 1-layer LSTMP language model.
Both embedding and projection size are 512. Hidden size is 2048.
 Parameters
dataset_name (str or None, default None) – The dataset name on which the pretrained model is trained. Options are ‘gbw’. If specified, then the returned vocabulary is extracted from the training set of the dataset. If None, then vocab is required, for specifying embedding weight size, and is directly returned. The pretrained model achieves 44.05 ppl on Test of GBW dataset.
vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified.
pretrained (bool, default False) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
 Returns
 Return type
gluon.Block, gluonnlp.Vocab

class gluonnlp.model.BeamSearchScorer(alpha=1.0, K=5.0, from_logits=True, **kwargs)
Score function used in beam search.
Implements the length-penalized score function used in the GNMT paper:
scores = (log_probs + scores) / length_penalty
length_penalty = (K + length)^\alpha / (K + 1)^\alpha
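To make the length-penalized score concrete, here is a minimal pure-Python sketch that evaluates the documented formula for a single candidate. The function names `length_penalty` and `penalized_score` are chosen for this example only, and the sketch mirrors the formula as written rather than the library's internal bookkeeping.

```python
def length_penalty(length, alpha=1.0, K=5.0):
    """Compute (K + length)^alpha / (K + 1)^alpha."""
    return ((K + length) ** alpha) / ((K + 1.0) ** alpha)


def penalized_score(log_prob, prev_score, length, alpha=1.0, K=5.0):
    """Compute (log_probs + scores) / length_penalty for one candidate."""
    return (log_prob + prev_score) / length_penalty(length, alpha, K)


# With alpha=1 and K=5, a hypothesis of length 7 has penalty (5+7)/(5+1) = 2.
print(length_penalty(7))               # 2.0
print(penalized_score(-1.0, -3.0, 7))  # -2.0
```

A larger alpha penalizes short hypotheses more strongly, while alpha=0 makes the penalty 1 and disables length normalization.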
 Parameters

class gluonnlp.model.BeamSearchSampler(beam_size, decoder, eos_id, scorer=BeamSearchScorer( ), max_length=100)
Draw samples from the decoder by beam search.
 Parameters
beam_size (int) – The beam size.
decoder (callable) –
Function of the one-step-ahead decoder, should have the form:
outputs, new_states = decoder(step_input, states)
The outputs, input should follow these rules:
step_input has shape (batch_size,),
outputs has shape (batch_size, V),
states and new_states have the same structure and the leading dimension of the inner NDArrays is the batch dimension.
eos_id (int) – Id of the EOS token. No other elements will be appended to the sample if it reaches eos_id.
scorer (BeamSearchScorer, default BeamSearchScorer(alpha=1.0, K=5)) – The score function used in beam search.
max_length (int, default 100) – The maximum search length.

class gluonnlp.model.HybridBeamSearchSampler(batch_size, beam_size, decoder, eos_id, scorer=BeamSearchScorer( ), max_length=100, vocab_size=None, prefix=None, params=None)
Draw samples from the decoder by beam search.
 Parameters
batch_size (int) – The batch size.
beam_size (int) – The beam size.
decoder (callable, must be hybridizable) –
Function of the one-step-ahead decoder, should have the form:
outputs, new_states = decoder(step_input, states)
The outputs, input should follow these rules:
step_input has shape (batch_size,),
outputs has shape (batch_size, V),
states and new_states have the same structure and the leading dimension of the inner NDArrays is the batch dimension.
eos_id (int) – Id of the EOS token. No other elements will be appended to the sample if it reaches eos_id.
scorer (BeamSearchScorer, default BeamSearchScorer(alpha=1.0, K=5), must be hybridizable) – The score function used in beam search.
max_length (int, default 100) – The maximum search length.
vocab_size (int, default None, meaning decoder._vocab_size) – The vocabulary size

hybrid_forward(F, inputs, states)
Sample by beam search.
 Parameters
F –
inputs (NDArray or Symbol) – The initial input of the decoder. Shape is (batch_size,).
states (Object that contains NDArrays or Symbols) – The initial states of the decoder.
 Returns
samples (NDArray or Symbol) – Samples drawn by beam search. Shape (batch_size, beam_size, length). dtype is int32.
scores (NDArray or Symbol) – Scores of the samples. Shape (batch_size, beam_size). We make sure that scores[i, :] are in descending order.
valid_length (NDArray or Symbol) – The valid length of the samples. Shape (batch_size, beam_size). dtype will be int32.

class gluonnlp.model.SequenceSampler(beam_size, decoder, eos_id, max_length=100, temperature=1.0, top_k=None)
Draw samples from the decoder according to the step-wise distribution.
 Parameters
beam_size (int) – The beam size.
decoder (callable) –
Function of the one-step-ahead decoder, should have the form:
outputs, new_states = decoder(step_input, states)
The outputs, input should follow these rules:
step_input has shape (batch_size,)
outputs is the unnormalized prediction before softmax with shape (batch_size, V)
states and new_states have the same structure and the leading dimension of the inner NDArrays is the batch dimension.
eos_id (int) – Id of the EOS token. No other elements will be appended to the sample if it reaches eos_id.
max_length (int, default 100) – The maximum search length.
temperature (float, default 1.0) – Softmax temperature.
top_k (int or None, default None) – Sample only from the top-k candidates. If None, all candidates are considered.
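The effect of temperature and top_k on a single sampling step can be sketched in pure Python as below. This is an illustrative sketch, not the sampler's actual code: the helper name `sample_token` is hypothetical, and dividing the unnormalized scores by the temperature before the softmax is the usual convention, assumed here.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one token id from unnormalized scores."""
    # Keep only the top_k highest-scoring candidates (all if top_k is None).
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    candidates = order if top_k is None else order[:top_k]
    # Temperature-scaled softmax weights over the remaining candidates.
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With `top_k=1` the step reduces to greedy decoding. Lower temperatures sharpen the distribution and higher temperatures flatten it.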

class gluonnlp.model.AttentionCell(prefix=None, params=None)
Abstract class for attention cells. Extend the class to implement your own attention method. One typical usage is to define your own _compute_weight() function to calculate the weights:
cell = AttentionCell()
out = cell(query, key, value, mask)

cast(dtype)
Cast this Block to use another data type.
 Parameters
dtype (str or numpy.dtype) – The new data type.


class gluonnlp.model.MultiHeadAttentionCell(base_cell, query_units, key_units, value_units, num_heads, use_bias=True, weight_initializer=None, bias_initializer='zeros', prefix=None, params=None)
Multi-head Attention Cell.
In the MultiHeadAttentionCell, the input query/key/value will be linearly projected num_heads times with different projection matrices. Each projected key, value, and query will be used to calculate the attention weights and values. The outputs of the heads are concatenated to form the final output.
The idea was first proposed in “[Arxiv2014] Neural Turing Machines” and was later adopted in “[NIPS2017] Attention is All You Need” to solve the Neural Machine Translation problem.
 Parameters
base_cell (AttentionCell) –
query_units (int) – Total number of projected units for query. Must be divisible by num_heads.
key_units (int) – Total number of projected units for key. Must be divisible by num_heads.
value_units (int) – Total number of projected units for value. Must be divisible by num_heads.
num_heads (int) – Number of parallel attention heads
use_bias (bool, default True) – Whether to use bias when projecting the query/key/values
weight_initializer (str or Initializer or None, default None) – Initializer of the weights.
bias_initializer (str or Initializer, default ‘zeros’) – Initializer of the bias.

class gluonnlp.model.MLPAttentionCell(units, act=Activation(tanh), normalized=False, dropout=0.0, weight_initializer=None, bias_initializer='zeros', prefix=None, params=None)
Concatenate the query and the key and use a single-hidden-layer MLP to get the attention score. We provide two modes, the standard mode and the normalized mode.
In the standard mode:
score = v tanh(W [h_q, h_k] + b)
In the normalized mode (same as TensorFlow):
score = g v / ||v||_2 tanh(W [h_q, h_k] + b)
This type of attention is first proposed in
 Parameters
units (int) –
act (Activation, default nn.Activation('tanh')) –
normalized (bool, default False) – Whether to normalize the weight that maps the embedded hidden states to the final score. This strategy can be interpreted as a type of “[NIPS2016] Weight Normalization”.
dropout (float, default 0.0) – Attention dropout.
weight_initializer (str or Initializer or None, default None) – Initializer of the weights.
bias_initializer (str or Initializer, default ‘zeros’) – Initializer of the bias.
params (ParameterDict or None, default None) – See document of Block.
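The two MLPAttentionCell score modes can be sketched for a single query/key pair in pure Python. This is a minimal illustration under assumptions: the function name `mlp_attention_score` and its list-based arguments are hypothetical, and the normalized mode is enabled here by passing a scalar `g`.

```python
import math


def mlp_attention_score(h_q, h_k, W, b, v, g=None):
    """score = v . tanh(W [h_q, h_k] + b); with g set, v becomes g * v / ||v||_2."""
    x = list(h_q) + list(h_k)  # concatenate the query and the key
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
              for row, bi in zip(W, b)]
    if g is not None:  # normalized mode
        norm = math.sqrt(sum(vi * vi for vi in v))
        v = [g * vi / norm for vi in v]
    return sum(vi * hi for vi, hi in zip(v, hidden))
```

Here `W` has one row per hidden unit and one column per entry of the concatenated vector, so `units` corresponds to `len(W)`.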

class gluonnlp.model.DotProductAttentionCell(units=None, luong_style=False, scaled=True, normalized=False, use_bias=True, dropout=0.0, weight_initializer=None, bias_initializer='zeros', prefix=None, params=None)
Dot product attention between the query and the key.
Depending on parameters, defined as:
units is None:
    score = <h_q, h_k>
units is not None and luong_style is False:
    score = <W_q h_q, W_k h_k>
units is not None and luong_style is True:
    score = <W h_q, h_k>
 Parameters
units (int or None, default None) –
Project the query and key to vectors with units dimension before applying the attention. If set to None, the query vector and the key vector are directly used to compute the attention and should have the same dimension:
If the units is None,
    score = <h_q, h_k>
Else if the units is not None and luong_style is False:
    score = <W_q h_q, W_k h_k>
Else if the units is not None and luong_style is True:
    score = <W h_q, h_k>
luong_style (bool, default False) –
If turned on, the score will be:
score = <W h_q, h_k>
units must be the same as the dimension of the key vector
scaled (bool, default True) –
Whether to divide the attention weights by the sqrt of the query dimension. This was first proposed in “[NIPS2017] Attention is all you need.”:
score = <h_q, h_k> / sqrt(dim_q)
normalized (bool, default False) –
If turned on, the cosine similarity is used, i.e.:
score = <h_q / ||h_q||, h_k / ||h_k||>
use_bias (bool, default True) – Whether to use bias in the projection layers.
dropout (float, default 0.0) – Attention dropout
weight_initializer (str or Initializer or None, default None) – Initializer of the weights
bias_initializer (str or Initializer, default ‘zeros’) – Initializer of the bias
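For the projection-free case (units is None), the scaled and normalized variants of the score can be sketched as follows. The helper `dot_attention_score` is a hypothetical name, and allowing both flags at once is an assumption of this sketch, not a statement about how the cell combines them.

```python
import math


def dot_attention_score(h_q, h_k, scaled=True, normalized=False):
    """Score for the units=None case: <h_q, h_k>, optionally normalized and scaled."""
    if normalized:
        # Divide each vector by its L2 norm, giving the cosine of the angle.
        n_q = math.sqrt(sum(x * x for x in h_q))
        n_k = math.sqrt(sum(x * x for x in h_k))
        h_q = [x / n_q for x in h_q]
        h_k = [x / n_k for x in h_k]
    score = sum(q * k for q, k in zip(h_q, h_k))
    if scaled:
        # Divide by the square root of the query dimension.
        score /= math.sqrt(len(h_q))
    return score
```

Scaling keeps the magnitude of the score roughly independent of the vector dimension, which helps keep the subsequent softmax from saturating.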

gluonnlp.model.apply_weight_drop(block, local_param_regex, rate, axes=(), weight_dropout_mode='training')
Apply weight drop to the parameter of a block.
 Parameters
block (Block or HybridBlock) – The block to whose parameters weight drop is applied.
local_param_regex (str) – The regex for parameter names used in the self.params.get(), such as ‘weight’.
rate (float) – Fraction of the input units to drop. Must be a number between 0 and 1.
axes (tuple of int, default ()) – The axes on which dropout mask is shared. If empty, regular dropout is applied.
weight_dropout_mode ({'training', 'always'}, default 'training') – Whether the weight dropout should be applied only at training time, or always be applied.
Examples
>>> net = gluon.rnn.LSTM(10, num_layers=2, bidirectional=True)
>>> gluonnlp.model.apply_weight_drop(net, r'.*h2h_weight', 0.5)
>>> net.collect_params()
lstm0_ (
  Parameter lstm0_l0_i2h_weight (shape=(40, 0), dtype=float32)
  WeightDropParameter lstm0_l0_h2h_weight (shape=(40, 10), dtype=float32, rate=0.5, mode=training)
  Parameter lstm0_l0_i2h_bias (shape=(40,), dtype=float32)
  Parameter lstm0_l0_h2h_bias (shape=(40,), dtype=float32)
  Parameter lstm0_r0_i2h_weight (shape=(40, 0), dtype=float32)
  WeightDropParameter lstm0_r0_h2h_weight (shape=(40, 10), dtype=float32, rate=0.5, mode=training)
  Parameter lstm0_r0_i2h_bias (shape=(40,), dtype=float32)
  Parameter lstm0_r0_h2h_bias (shape=(40,), dtype=float32)
  Parameter lstm0_l1_i2h_weight (shape=(40, 20), dtype=float32)
  WeightDropParameter lstm0_l1_h2h_weight (shape=(40, 10), dtype=float32, rate=0.5, mode=training)
  Parameter lstm0_l1_i2h_bias (shape=(40,), dtype=float32)
  Parameter lstm0_l1_h2h_bias (shape=(40,), dtype=float32)
  Parameter lstm0_r1_i2h_weight (shape=(40, 20), dtype=float32)
  WeightDropParameter lstm0_r1_h2h_weight (shape=(40, 10), dtype=float32, rate=0.5, mode=training)
  Parameter lstm0_r1_i2h_bias (shape=(40,), dtype=float32)
  Parameter lstm0_r1_h2h_bias (shape=(40,), dtype=float32)
)
>>> ones = mx.nd.ones((3, 4, 5))
>>> net.initialize()
>>> with mx.autograd.train_mode():
...     net(ones).max().asscalar() != net(ones).max().asscalar()
True

class gluonnlp.model.WeightDropParameter(parameter, rate=0.0, mode='training', axes=())
A container that holds the parameters (weights) of Blocks and performs dropout.
 Parameters
parameter (Parameter) – The parameter to which dropout is applied.
rate (float, default 0.0) – Fraction of the input units to drop. Must be a number between 0 and 1. Dropout is not applied if rate is 0.
mode (str, default 'training') – Whether to only turn on dropout during training or to also turn on for inference. Options are ‘training’ and ‘always’.
axes (tuple of int, default ()) – Axes on which dropout mask is shared.
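What dropout on a weight matrix means, and how the mode option gates it, can be sketched in pure Python as below. This is a conceptual sketch only: `drop_weights` is a hypothetical helper, the rescaling of surviving weights by 1 / (1 - rate) is the common inverted-dropout convention assumed here, and mask sharing along axes is not modeled.

```python
import random


def drop_weights(weights, rate, training=True, mode='training', rng=random):
    """Return a copy of a 2-D weight list with entries randomly zeroed."""
    # 'training' mode drops only while training; 'always' drops at inference too.
    active = rate > 0 and (mode == 'always' or training)
    if not active:
        return [row[:] for row in weights]
    keep = 1.0 - rate
    # Assumed inverted dropout: surviving weights are rescaled by 1 / keep.
    return [[w / keep if rng.random() < keep else 0.0 for w in row]
            for row in weights]
```

Because the mask is applied to the weights rather than to the activations, the same dropped connections are used for every time step of a recurrent layer within one forward pass.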

class gluonnlp.model.RNNCellLayer(rnn_cell, layout='TNC', **kwargs)
A block that takes an RNN cell and makes it act like an RNN layer.
 Parameters
rnn_cell (Cell) – The cell to wrap into a layer-like block.
layout (str, default 'TNC') – The output layout of the layer.

class gluonnlp.model.L2Normalization(axis=1, eps=1e-06, **kwargs)
Normalize the input array by dividing by the L2 norm along the given axis.
out = data / (sqrt(sum(data**2, axis)) + eps)
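The normalization formula can be checked with a small pure-Python sketch that operates on a 2-D list along axis 1. The function name `l2_normalize` is hypothetical and only mirrors the expression above.

```python
import math


def l2_normalize(rows, eps=1e-6):
    """Normalize each row (axis=1 of a 2-D list) by its L2 norm plus eps."""
    out = []
    for row in rows:
        denom = math.sqrt(sum(x * x for x in row)) + eps
        out.append([x / denom for x in row])
    return out
```

For the row `[3.0, 4.0]` the norm is 5, so the result is approximately `[0.6, 0.8]`. The eps term prevents division by zero for an all-zero row.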
 Parameters

class gluonnlp.model.GELU(**kwargs)
Gaussian Error Linear Unit.
This is a smoother version of the ReLU. https://arxiv.org/abs/1606.08415
 Parameters
Inputs –
data: input tensor with arbitrary shape.
Outputs –
out: output tensor with the same shape as data.
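For reference, the exact GELU from the cited paper is x * Phi(x), where Phi is the standard normal cumulative distribution function. The scalar sketch below uses that exact form. Whether the block itself uses this form or a tanh-based approximation is not stated here, so treat the function as illustrative.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Unlike ReLU, the function is smooth around zero and lets small negative values through with a reduced magnitude.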

class gluonnlp.model.Highway(input_size, num_layers, activation='relu', highway_bias=<gluonnlp.initializer.initializer.HighwayBias object>, **kwargs)
Highway network.
We implemented the highway network proposed in the following work:
@article{srivastava2015highway,
  title={Highway networks},
  author={Srivastava, Rupesh Kumar and Greff, Klaus and Schmidhuber, J{\"u}rgen},
  journal={arXiv preprint arXiv:1505.00387},
  year={2015}
}
The full version of the work:
@inproceedings{srivastava2015training,
  title={Training very deep networks},
  author={Srivastava, Rupesh K and Greff, Klaus and Schmidhuber, J{\"u}rgen},
  booktitle={Advances in neural information processing systems},
  pages={2377--2385},
  year={2015}
}
A Highway layer is defined as below:
\[y = (1 - t) * x + t * f(A(x))\]
which is a gated combination of a linear transform and a nonlinear transform of its input, where \(x\) is the input tensor, \(A\) is a linear transformation, \(f\) is an element-wise nonlinear transformation, \(t\) is an element-wise transform gate, and \(1 - t\) is the carry gate.
 Parameters
input_size (int) – The dimension of the input tensor. We assume the input has shape (batch_size, input_size).
num_layers (int) – The number of highway layers to apply to the input.
activation (str, default 'relu') – The nonlinear activation function to use. If you don’t specify anything, no activation is applied (i.e., “linear” activation: a(x) = x).
highway_bias (HighwayBias, default HighwayBias(nonlinear_transform_bias=0.0, transform_gate_bias=-2.0)) – The biases applied to the highway layer. We set the default according to the above original work.
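A single highway layer, y = (1 - t) * x + t * f(A(x)), can be sketched for one input vector in pure Python. This is a sketch under assumptions: `highway_layer` and its argument names are hypothetical, f is taken to be ReLU, and the transform gate t is computed as a sigmoid of its own affine map, as in the Highway Networks paper.

```python
import math


def highway_layer(x, A, b_h, W_t, b_t):
    """One highway layer: y = (1 - t) * x + t * relu(A x + b_h)."""
    def affine(M, bias):
        return [sum(m * xi for m, xi in zip(row, x)) + bi
                for row, bi in zip(M, bias)]

    h = [max(0.0, a) for a in affine(A, b_h)]  # f(A(x)) with f = relu
    # Transform gate: sigmoid of a separate affine map (assumed parameterization).
    t = [1.0 / (1.0 + math.exp(-g)) for g in affine(W_t, b_t)]
    return [(1.0 - ti) * xi + ti * hi for xi, ti, hi in zip(x, t, h)]
```

A strongly negative gate bias pushes t toward 0, so the layer initially carries its input through almost unchanged. This is the motivation for a negative default transform gate bias.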

class gluonnlp.model.ConvolutionalEncoder(embed_size=15, num_filters=(25, 50, 75, 100, 125, 150), ngram_filter_sizes=(1, 2, 3, 4, 5, 6), conv_layer_activation='tanh', num_highway=1, highway_layer_activation='relu', highway_bias=<gluonnlp.initializer.initializer.HighwayBias object>, output_size=None, **kwargs)
Convolutional encoder.
We implement the convolutional encoder proposed in the following work:
@inproceedings{kim2016character,
  title={Character-Aware Neural Language Models.},
  author={Kim, Yoon and Jernite, Yacine and Sontag, David and Rush, Alexander M},
  booktitle={AAAI},
  pages={2741--2749},
  year={2016}
}
 Parameters
embed_size (int, default 15) – The input dimension to the encoder. We set the default according to the original work’s experiments on PTB dataset with Char-small model setting.
num_filters (Tuple[int], default (25, 50, 75, 100, 125, 150)) – The output dimension for each convolutional layer according to the filter sizes, which are the number of the filters learned by the layers. We set the default according to the original work’s experiments on PTB dataset with Char-small model setting.
ngram_filter_sizes (Tuple[int], default (1, 2, 3, 4, 5, 6)) – The size of each convolutional layer; len(ngram_filter_sizes) equals the number of convolutional layers. We set the default according to the original work’s experiments on PTB dataset with Char-small model setting.
conv_layer_activation (str, default 'tanh') – Activation function to be used after convolutional layer. We set the default according to the original work’s experiments on PTB dataset with Char-small model setting.
num_highway (int, default 1) – The number of layers of the Highway layer. We set the default according to the original work’s experiments on PTB dataset with Char-small model setting.
highway_layer_activation (str, default 'relu') – Activation function to be used after highway layer. If you don’t specify anything, no activation is applied (i.e., “linear” activation: a(x) = x). We set the default according to the original work’s experiments on PTB dataset with Char-small model setting.
highway_bias (HighwayBias, default HighwayBias(nonlinear_transform_bias=0.0, transform_gate_bias=-2.0)) – The biases applied to the highway layer. We set the default according to the above original work.
output_size (int, default None) – The output dimension after conducting the convolutions and max pooling, and applying highways, as well as linear projection.

hybrid_forward(F, inputs, mask=None)
Forward computation for char_encoder
 Parameters
inputs (NDArray) – The input tensor of shape (seq_len, batch_size, embedding_size), in TNC layout.
mask (NDArray) – The mask applied to the input, of shape (seq_len, batch_size). The mask is broadcast along the embedding dimension.
 Returns
output – The output of the encoder with shape (batch_size, output_size)
 Return type
NDArray
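As a rough illustration of what the encoder computes for a single sequence, the sketch below applies convolution filters of several n-gram widths to a list of character embeddings and max-pools each filter over time, so the feature size before the highway and projection layers is the total number of filters. It is pure Python with made-up weights; `conv_max_over_time` and `encode` are hypothetical helpers, not part of GluonNLP, and the highway layers, masking, and final projection are omitted.

```python
import math

def conv_max_over_time(embeddings, kernel, bias=0.0):
    """One 1-D convolution filter, followed by tanh and max-over-time pooling.

    embeddings: list of seq_len vectors, each of length embed_size.
    kernel: list of `width` vectors, each of length embed_size.
    """
    width = len(kernel)
    responses = []
    for start in range(len(embeddings) - width + 1):
        total = bias
        for offset in range(width):
            for e, w in zip(embeddings[start + offset], kernel[offset]):
                total += e * w
        responses.append(math.tanh(total))
    return max(responses)

def encode(embeddings, filter_banks):
    """filter_banks maps an n-gram width to a list of kernels of that width."""
    features = []
    for width in sorted(filter_banks):
        for kernel in filter_banks[width]:
            features.append(conv_max_over_time(embeddings, kernel))
    return features

# Toy input: 4 characters with embed_size = 2 (illustrative values only).
chars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
banks = {
    1: [[[1.0, 0.0]], [[0.0, 1.0]]],   # two filters of width 1
    2: [[[1.0, 0.0], [0.0, 1.0]]],     # one filter of width 2
}
features = encode(chars, banks)        # one feature per filter
```

The length of `features` equals the number of filters across all widths, which corresponds to sum(num_filters) in the parameter list above.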

class
gluonnlp.model.
ISDense
(num_classes, num_sampled, in_unit, remove_accidental_hits=True, dtype='float32', weight_initializer=None, bias_initializer='zeros', sparse_grad=True, prefix=None, params=None)[source]¶ Importance sampled Dense block, which computes sampled pred output and labels for importance sampled softmax loss during training.
Reference:
Exploring the Limits of Language Modeling Jozefowicz, Rafal and Vinyals, Oriol and Schuster, Mike and Shazeer, Noam and Wu, Yonghui https://arxiv.org/pdf/1602.02410
Please use loss.SoftmaxCrossEntropyLoss for sampled softmax loss.
Note
If sparse_grad is set to True, the gradient w.r.t input and output embeddings will be sparse. Only a subset of optimizers support sparse gradients, including SGD, AdaGrad and Adam. By default lazy_update is turned on for these optimizers, which may perform differently from standard updates. For more details, please check the Optimization API at https://mxnet.incubator.apache.org/api/python/optimization/optimization.html
Example:
# network with importance sampling for training
encoder = Encoder(..)
decoder = ISDense(..)
train_net.add(encoder)
train_net.add(decoder)
loss = SoftmaxCrossEntropyLoss()

# training
for x, y, sampled_values in train_batches:
    pred, new_targets = train_net(x, sampled_values, y)
    l = loss(pred, new_targets)

# network for testing
test_net.add(encoder)
test_net.add(Dense(..., params=decoder.params))

# testing
for x, y in test_batches:
    pred = test_net(x)
    l = loss(pred, y)
 Parameters
num_classes (int) – Number of possible classes.
num_sampled (int) – Number of classes randomly sampled for each batch.
in_unit (int) – Dimensionality of the input space.
remove_accidental_hits (bool, default True) – Whether to remove “accidental hits” when a sampled candidate is equal to one of the true classes.
dtype (str or np.dtype, default 'float32') – Data type of output embeddings.
weight_initializer (str or Initializer, optional) – Initializer for the kernel weights matrix.
bias_initializer (str or Initializer, optional) – Initializer for the bias vector.
sparse_grad (bool, default True) – Whether to use sparse gradient.
Inputs –
x: A tensor of shape (batch_size, in_unit). The forward activation of the input network.
sampled_values : A list of three tensors for sampled_classes with shape (num_samples,), expected_count_sampled with shape (num_samples,), and expected_count_true with shape (sequence_length, batch_size).
label: A tensor of shape (batch_size,1). The target classes.
Outputs –
out: A tensor of shape (batch_size, 1+num_sampled). The output probability for the true class and sampled classes
new_targets: A tensor of shape (batch_size,). The new target classes.
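The sketch below illustrates, for a single example, the sampled-softmax idea from the referenced paper: the true-class score goes first, each score is corrected by the log of its expected count, and accidental hits are masked out. This is a conceptual pure-Python sketch under those assumptions and is not the ISDense implementation; `sampled_logits` and `softmax_cross_entropy` are hypothetical helpers, and all numbers are illustrative.

```python
import math

def sampled_logits(logit_true, logits_sampled, label, sampled_classes,
                   expected_true, expected_sampled, remove_accidental_hits=True):
    """Build the (1 + num_sampled) scores for one example.

    Index 0 holds the true class, so the new target is always 0.
    """
    row = [logit_true - math.log(expected_true)]
    for cls, logit, count in zip(sampled_classes, logits_sampled, expected_sampled):
        if remove_accidental_hits and cls == label:
            # A sampled candidate equals the true class: exclude it from the softmax.
            row.append(float("-inf"))
        else:
            row.append(logit - math.log(count))
    new_target = 0
    return row, new_target

def softmax_cross_entropy(row, target):
    """Cross-entropy of softmax(row) against an integer target (log-sum-exp form)."""
    top = max(row)
    log_norm = top + math.log(sum(math.exp(v - top) for v in row))
    return log_norm - row[target]

row, new_target = sampled_logits(
    logit_true=2.0, logits_sampled=[1.0, 2.0, 0.5], label=7,
    sampled_classes=[3, 7, 9],          # class 7 is an accidental hit
    expected_true=0.5, expected_sampled=[0.25, 0.5, 0.125])
loss = softmax_cross_entropy(row, new_target)
```

Because the true class always sits at index 0, the new targets carry no class information, which is consistent with the (batch_size,) shape documented for new_targets.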

class
gluonnlp.model.
NCEDense
(num_classes, num_sampled, in_unit, remove_accidental_hits=False, dtype='float32', weight_initializer=None, bias_initializer='zeros', sparse_grad=True, prefix=None, params=None)[source]¶ Noise contrastive estimated Dense block, which computes sampled pred output and labels for noise contrastive estimation loss during training.
Reference:
Exploring the Limits of Language Modeling Jozefowicz, Rafal and Vinyals, Oriol and Schuster, Mike and Shazeer, Noam and Wu, Yonghui https://arxiv.org/pdf/1602.02410
Please use loss.SigmoidBinaryCrossEntropyLoss for noise contrastive estimation loss during training.
Note
If sparse_grad is set to True, the gradient w.r.t input and output embeddings will be sparse. Only a subset of optimizers support sparse gradients, including SGD, AdaGrad and Adam. By default lazy_update is turned on for these optimizers, which may perform differently from standard updates. For more details, please check the Optimization API at: https://mxnet.incubator.apache.org/api/python/optimization/optimization.html
Example:
# network with sampling for training
encoder = Encoder(..)
decoder = NCEDense(..)
train_net.add(encoder)
train_net.add(decoder)
loss_train = SigmoidBinaryCrossEntropyLoss()

# training
for x, y, sampled_values in train_batches:
    pred, new_targets = train_net(x, sampled_values, y)
    l = loss_train(pred, new_targets)

# network for testing
test_net.add(encoder)
test_net.add(Dense(..., params=decoder.params))
loss_test = SoftmaxCrossEntropyLoss()

# testing
for x, y in test_batches:
    pred = test_net(x)
    l = loss_test(pred, y)
 Parameters
num_classes (int) – Number of possible classes.
num_sampled (int) – Number of classes randomly sampled for each batch.
in_unit (int) – Dimensionality of the input space.
remove_accidental_hits (bool, default False) – Whether to remove “accidental hits” when a sampled candidate is equal to one of the true classes.
dtype (str or np.dtype, default 'float32') – Data type of output embeddings.
weight_initializer (str or Initializer, optional) – Initializer for the kernel weights matrix.
bias_initializer (str or Initializer, optional) – Initializer for the bias vector.
sparse_grad (bool, default True) – Whether to use sparse gradient.
Inputs –
x: A tensor of shape (batch_size, in_unit). The forward activation of the input network.
sampled_values : A list of three tensors for sampled_classes with shape (num_samples,), expected_count_sampled with shape (num_samples,), and expected_count_true with shape (sequence_length, batch_size).
label: A tensor of shape (batch_size,1). The target classes.
Outputs –
out: A tensor of shape (batch_size, 1+num_sampled). The output probability for the true class and sampled classes
new_targets: A tensor of shape (batch_size, 1+num_sampled). The new target classes.
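Unlike the importance-sampled case, noise contrastive estimation treats every column as its own binary decision, which is why new_targets has one entry per column. The sketch below shows this for one example, assuming the usual convention of 1 for the true class and 0 for each noise sample; it is pure Python, the helper names are hypothetical, and it is not the NCEDense implementation.

```python
import math

def nce_targets(num_sampled):
    """Binary targets for one example: 1 for the true class, 0 for each noise sample."""
    return [1.0] + [0.0] * num_sampled

def sigmoid_binary_cross_entropy(logits, targets):
    """Mean binary cross-entropy on raw logits, in a numerically stable form."""
    total = 0.0
    for z, t in zip(logits, targets):
        # max(z, 0) - z*t + log(1 + exp(-|z|)) equals -[t*log(s) + (1-t)*log(1-s)]
        # with s = sigmoid(z).
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

logits = [3.0, -2.0, -1.0, 0.0]       # true class first, then num_sampled = 3 noise classes
targets = nce_targets(num_sampled=3)
loss = sigmoid_binary_cross_entropy(logits, targets)
```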

class
gluonnlp.model.
SparseISDense
(num_classes, num_sampled, in_unit, remove_accidental_hits=True, dtype='float32', weight_initializer=None, bias_initializer='zeros', prefix=None, params=None)[source]¶ Importance sampled Dense block with sparse weights, which computes sampled pred output and labels for importance sampled softmax loss during training.
Reference:
Exploring the Limits of Language Modeling Jozefowicz, Rafal and Vinyals, Oriol and Schuster, Mike and Shazeer, Noam and Wu, Yonghui https://arxiv.org/pdf/1602.02410
Please use loss.SoftmaxCrossEntropyLoss for sampled softmax loss.
The block is designed for distributed training with an extremely large number of classes, to reduce communication overhead and memory consumption. Both the weight and the gradient w.r.t. the weight are RowSparseNDArray.
Note
Unlike the ISDense block, the weight parameter is stored in row_sparse format, which helps reduce memory consumption and communication overhead during multi-GPU training. However, sparse parameters cannot be shared with other blocks, nor can a block containing sparse parameters be hybridized. Therefore, the parameters have to be saved before they are used for testing.
Example:
# network with importance sampled softmax for training
encoder = Encoder(..)
train_net.add(encoder)
train_net.add(SparseISDense(.., prefix='decoder'))
loss = SoftmaxCrossEntropyLoss()

# training
for x, y, sampled_values in train_batches:
    pred, new_targets = train_net(x, sampled_values, y)
    l = loss(pred, new_targets)

# save params
train_net.save_parameters('net.params')

# network for testing
test_net.add(encoder)
test_net.add(Dense(..., prefix='decoder'))

# load params
test_net.load_parameters('net.params')

# testing
for x, y in test_batches:
    pred = test_net(x)
    l = loss(pred, y)
 Parameters
num_classes (int) – Number of possible classes.
num_sampled (int) – Number of classes randomly sampled for each batch.
in_unit (int) – Dimensionality of the input space.
remove_accidental_hits (bool, default True) – Whether to remove “accidental hits” when a sampled candidate is equal to one of the true classes.
dtype (str or np.dtype, default 'float32') – Data type of output embeddings.
weight_initializer (str or Initializer, optional) – Initializer for the kernel weights matrix.
bias_initializer (str or Initializer, optional) – Initializer for the bias vector.
Inputs –
x: A tensor of shape (batch_size, in_unit). The forward activation of the input network.
sampled_values : A list of three tensors for sampled_classes with shape (num_samples,), expected_count_sampled with shape (num_samples,), and expected_count_true with shape (sequence_length, batch_size).
label: A tensor of shape (batch_size,1). The target classes.
Outputs –
out: A tensor of shape (batch_size, 1+num_sampled). The output probability for the true class and sampled classes
new_targets: A tensor of shape (batch_size,). The new target classes.

class
gluonnlp.model.
SparseNCEDense
(num_classes, num_sampled, in_unit, remove_accidental_hits=True, dtype='float32', weight_initializer=None, bias_initializer='zeros', prefix=None, params=None)[source]¶ Noise contrastive estimated Dense block with sparse weights, which computes sampled pred output and labels for noise contrastive estimation loss during training.
Reference:
Exploring the Limits of Language Modeling Jozefowicz, Rafal and Vinyals, Oriol and Schuster, Mike and Shazeer, Noam and Wu, Yonghui https://arxiv.org/pdf/1602.02410
Please use loss.SigmoidBinaryCrossEntropyLoss for noise contrastive estimation loss during training.
The block is designed for distributed training with an extremely large number of classes, to reduce communication overhead and memory consumption. Both the weight and the gradient w.r.t. the weight are RowSparseNDArray.
Note
Unlike the NCEDense block, the weight parameter is stored in row_sparse format, which helps reduce memory consumption and communication overhead during multi-GPU training. However, sparse parameters cannot be shared with other blocks, nor can a block containing sparse parameters be hybridized. Therefore, the parameters have to be saved before they are used for testing.
Example:
# network with noise contrastive estimation for training
encoder = Encoder(..)
train_net.add(encoder)
train_net.add(SparseNCEDense(.., prefix='decoder'))
train_loss = SigmoidBinaryCrossEntropyLoss()

# training
for x, y, sampled_values in train_batches:
    pred, new_targets = train_net(x, sampled_values, y)
    l = train_loss(pred, new_targets)

# save params
train_net.save_parameters('net.params')

# network for testing
test_net.add(encoder)
test_net.add(Dense(..., prefix='decoder'))

# load params
test_net.load_parameters('net.params')
test_loss = SoftmaxCrossEntropyLoss()

# testing
for x, y in test_batches:
    pred = test_net(x)
    l = test_loss(pred, y)
 Parameters
num_classes (int) – Number of possible classes.
num_sampled (int) – Number of classes randomly sampled for each batch.
in_unit (int) – Dimensionality of the input space.
remove_accidental_hits (bool, default True) – Whether to remove “accidental hits” when a sampled candidate is equal to one of the true classes.
dtype (str or np.dtype, default 'float32') – Data type of output embeddings.
weight_initializer (str or Initializer, optional) – Initializer for the kernel weights matrix.
bias_initializer (str or Initializer, optional) – Initializer for the bias vector.
Inputs –
x: A tensor of shape (batch_size, in_unit). The forward activation of the input network.
sampled_values : A list of three tensors for sampled_classes with shape (num_samples,), expected_count_sampled with shape (num_samples,), and expected_count_true with shape (sequence_length, batch_size).
label: A tensor of shape (batch_size,1). The target classes.
Outputs –
out: A tensor of shape (batch_size, 1+num_sampled). The output probability for the true class and sampled classes
new_targets: A tensor of shape (batch_size, 1+num_sampled). The new target classes.

gluonnlp.model.
get_model
(name, **kwargs)[source]¶ Returns a predefined model by name.
 Parameters
name (str) – Name of the model.
dataset_name (str or None, default None) – The dataset name on which the pretrained model is trained. For language models, the option is ‘wikitext-2’. For ELMo, options are ‘gbw’ and ‘5bw’. ‘gbw’ represents the 1 Billion Word Language Model Benchmark http://www.statmt.org/lm-benchmark/; ‘5bw’ represents a dataset of 5.5B tokens consisting of Wikipedia (1.9B) and all of the monolingual news crawl data from WMT 2008-2012 (3.6B). If specified, the returned vocabulary is extracted from the training set of the dataset. If None, then vocab is required for specifying the embedding weight size, and is returned directly.
vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified. No Vocabulary object is required with the ELMo model.
pretrained (bool, default False) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
 Returns
 Return type
gluon.Block, gluonnlp.Vocab, (optional) gluonnlp.Vocab
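To make the lookup-by-name behaviour concrete, here is a minimal sketch of a name-to-constructor registry that forwards keyword arguments such as dataset_name and vocab. It is an illustration only and not GluonNLP's implementation: `_MODELS`, `register`, and `toy_lm` are hypothetical names, and the returned "model" is just a placeholder dictionary.

```python
_MODELS = {}

def register(fn):
    """Record a model constructor under its function name."""
    _MODELS[fn.__name__] = fn
    return fn

@register
def toy_lm(dataset_name=None, vocab=None, pretrained=False, **kwargs):
    # Hypothetical constructor: returns a (model, vocab) pair as a language model would.
    if dataset_name is None and vocab is None:
        raise ValueError("vocab is required when dataset_name is not specified")
    vocab = vocab if vocab is not None else "vocab-of-" + dataset_name
    return {"name": "toy_lm", "pretrained": pretrained}, vocab

def get_model(name, **kwargs):
    """Look up a registered constructor by name and forward the keyword arguments."""
    if name not in _MODELS:
        raise ValueError("Model %s is not supported. Available options are: %s"
                         % (name, ", ".join(sorted(_MODELS))))
    return _MODELS[name](**kwargs)

model, vocab = get_model("toy_lm", dataset_name="wikitext-2")
```

In this sketch every argument other than name is passed through untouched, so each constructor decides which of dataset_name, vocab, pretrained, ctx, and root it accepts.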

class
gluonnlp.model.
BiLMEncoder
(mode, num_layers, input_size, hidden_size, dropout=0.0, skip_connection=True, proj_size=None, cell_clip=None, proj_clip=None, **kwargs)[source]¶ Bidirectional LM encoder.
We implement the encoder of the biLM proposed in the following work:
@inproceedings{Peters:2018, author={Peters, Matthew E. and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke}, title={Deep contextualized word representations}, booktitle={Proc. of NAACL}, year={2018} }
 Parameters
mode (str) – The type of RNN cell to use. Options are ‘lstmpc’, ‘rnn_tanh’, ‘rnn_relu’, ‘lstm’, ‘gru’.
num_layers (int) – The number of RNN cells in the encoder.
input_size (int) – The initial input size of the RNN cell.
hidden_size (int) – The hidden size of the RNN cell.
dropout (float) – The dropout rate to use for encoder output.
skip_connection (bool) – Whether to add skip connections (add RNN cell input to output).
proj_size (int) – The projection size of each LSTMPCellWithClip cell.
cell_clip (float) – Clip cell state between [-cell_clip, cell_clip] in LSTMPCellWithClip cell.
proj_clip (float) – Clip projection between [-proj_clip, proj_clip] in LSTMPCellWithClip cell.

hybrid_forward
(F, inputs, states=None, mask=None)[source]¶ Defines the forward computation for cache cell. Arguments can be either NDArray or Symbol.
 Parameters
inputs (NDArray) – The input data, with layout ‘TNC’.
states (Tuple[List[List[NDArray]]]) – The states. states[0] holds the states used in the forward layer, and states[1] holds the states used in the backward layer. In both, each layer has a list of two initial tensors with shape (batch_size, proj_size) and (batch_size, hidden_size).
 Returns
out (NDArray) – The output data with shape (num_layers, seq_len, batch_size, 2*input_size).
[states_forward, states_backward] (List) – states_forward: the output states from the forward layer, with the same structure as states[0]. states_backward: the output states from the backward layer, with the same structure as states[1].

class
gluonnlp.model.
LSTMPCellWithClip
(hidden_size, projection_size, i2h_weight_initializer=None, h2h_weight_initializer=None, h2r_weight_initializer=None, i2h_bias_initializer='zeros', h2h_bias_initializer='zeros', input_size=0, cell_clip=None, projection_clip=None, prefix=None, params=None)[source]¶ Long-Short Term Memory Projected (LSTMP) network cell with cell clip and projection clip. Each call computes the following function:
\[\begin{split}\DeclareMathOperator{\sigmoid}{sigmoid} \begin{array}{ll} i_t = \sigmoid(W_{ii} x_t + b_{ii} + W_{ri} r_{(t-1)} + b_{ri}) \\ f_t = \sigmoid(W_{if} x_t + b_{if} + W_{rf} r_{(t-1)} + b_{rf}) \\ g_t = \tanh(W_{ig} x_t + b_{ig} + W_{rg} r_{(t-1)} + b_{rg}) \\ o_t = \sigmoid(W_{io} x_t + b_{io} + W_{ro} r_{(t-1)} + b_{ro}) \\ c_t = c_{\text{clip}}(f_t * c_{(t-1)} + i_t * g_t) \\ h_t = o_t * \tanh(c_t) \\ r_t = p_{\text{clip}}(W_{hr} h_t) \end{array}\end{split}\]where \(c_{\text{clip}}\) is the cell clip applied to the next cell state, \(r_t\) is the projected recurrent activation at time t, and \(p_{\text{clip}}\) is the projection clip applied to the projected output. \(h_t\) is the hidden state at time t, \(c_t\) is the cell state at time t, \(x_t\) is the input at time t, and \(i_t\), \(f_t\), \(g_t\), \(o_t\) are the input, forget, cell, and out gates, respectively.
 Parameters
hidden_size (int) – Number of units in cell state symbol.
projection_size (int) – Number of units in output symbol.
i2h_weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
h2h_weight_initializer (str or Initializer) – Initializer for the recurrent weights matrix, used for the linear transformation of the hidden state.
h2r_weight_initializer (str or Initializer) – Initializer for the projection weights matrix, used for the linear transformation of the recurrent state.
i2h_bias_initializer (str or Initializer, default 'lstmbias') – Initializer for the bias vector. By default, bias for the forget gate is initialized to 1 while all other biases are initialized to zero.
h2h_bias_initializer (str or Initializer) – Initializer for the bias vector.
prefix (str) – Prefix for name of Blocks (and name of weight if params is None).
params (Parameter or None) – Container for weight sharing between cells. Created if None.
cell_clip (float) – Clip cell state between [-cell_clip, cell_clip] in LSTMPCellWithClip cell.
projection_clip (float) – Clip projection between [-projection_clip, projection_clip] in LSTMPCellWithClip cell.

hybrid_forward
(F, inputs, states, i2h_weight, h2h_weight, h2r_weight, i2h_bias, h2h_bias)[source]¶ Hybrid forward computation for Long-Short Term Memory Projected network cell with cell clip and projection clip.
 Parameters
inputs – Input tensor with shape (batch_size, input_size).
states – A list of two initial recurrent state tensors, with shape (batch_size, projection_size) and (batch_size, hidden_size) respectively.
 Returns
out – Output tensor with shape (batch_size, num_hidden).
next_states – A list of two output recurrent state tensors. Each has the same shape as the corresponding tensor in states.
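The LSTMP equations for this cell can be traced on a single example with the pure-Python sketch below. It assumes that clipping means clamping each element to [-limit, limit] and folds the input and recurrent biases into one bias per gate; `lstmp_step` and its argument layout are hypothetical and do not mirror the parameter names of LSTMPCellWithClip.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def clip(v, limit):
    """Clamp v to [-limit, limit]; a limit of None disables clipping."""
    return v if limit is None else max(-limit, min(limit, v))

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def lstmp_step(x, r_prev, c_prev, W_i, W_r, b, W_hr, cell_clip=None, proj_clip=None):
    """One LSTMP time step for a single example.

    W_i, W_r and b are dicts keyed by gate name ('i', 'f', 'g', 'o').
    Returns the next states [r_t, c_t].
    """
    def gate(name, act):
        pre = [a + r + bias for a, r, bias in
               zip(matvec(W_i[name], x), matvec(W_r[name], r_prev), b[name])]
        return [act(p) for p in pre]

    i, f = gate('i', sigmoid), gate('f', sigmoid)
    g, o = gate('g', math.tanh), gate('o', sigmoid)
    c = [clip(ft * cp + it * gt, cell_clip) for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    r = [clip(v, proj_clip) for v in matvec(W_hr, h)]
    return r, c

# Toy sizes: input_size = 1, hidden_size = 2, projection_size = 1.
gates = ('i', 'f', 'g', 'o')
W_i = {k: [[1.0], [1.0]] for k in gates}
W_r = {k: [[0.0], [0.0]] for k in gates}
b = {k: [0.0, 0.0] for k in gates}
W_hr = [[10.0, 0.0]]

r, c = lstmp_step([5.0], [0.0], [3.0, -3.0], W_i, W_r, b, W_hr,
                  cell_clip=1.0, proj_clip=0.5)
r_free, c_free = lstmp_step([5.0], [0.0], [3.0, -3.0], W_i, W_r, b, W_hr)
```

With the clips enabled, the cell state is held inside [-1, 1] and the projection inside [-0.5, 0.5]; the unclipped call shows how far the same inputs would otherwise drift.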

class
gluonnlp.model.
ELMoBiLM
(rnn_type, output_size, filters, char_embed_size, char_vocab_size, num_highway, conv_layer_activation, max_chars_per_token, input_size, hidden_size, proj_size, num_layers, cell_clip, proj_clip, skip_connection=True, **kwargs)[source]¶ ELMo Bidirectional language model
Run a pretrained bidirectional language model, outputting the weighted ELMo representation.
We implement the ELMo Bidirectional language model (BiLm) proposed in the following work:
@inproceedings{Peters:2018, author={Peters, Matthew E. and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke}, title={Deep contextualized word representations}, booktitle={Proc. of NAACL}, year={2018} }
 Parameters
rnn_type (str) – The type of RNN cell to use. The option for pretrained models is ‘lstmpc’.
output_size (int) – The output dimension after conducting the convolutions and max pooling, and applying highways, as well as linear projection.
filters (list of tuple) – List of tuples representing the settings for convolution layers. Each element is (ngram_filter_size, num_filters).
char_embed_size (int) – The input dimension to the encoder.
char_vocab_size (int) – Size of character-level vocabulary.
num_highway (int) – The number of layers of the Highway layer.
conv_layer_activation (str) – Activation function to be used after convolutional layer.
max_chars_per_token (int) – The maximum number of characters of a token.
input_size (int) – The initial input size of the RNN cell.
hidden_size (int) – The hidden size of the RNN cell.
proj_size (int) – The projection size of each LSTMPCellWithClip cell
num_layers (int) – The number of RNN cells.
cell_clip (float) – Clip cell state between [-cell_clip, cell_clip] in LSTMPCellWithClip cell.
proj_clip (float) – Clip projection between [-proj_clip, proj_clip] in LSTMPCellWithClip cell.
skip_connection (bool) – Whether to add skip connections (add RNN cell input to output)

hybrid_forward
(F, inputs, states=None, mask=None)[source]¶
 Parameters
inputs (NDArray) – Shape (batch_size, sequence_length, max_character_per_token) of character ids representing the current batch.
states ((list of list of NDArray, list of list of NDArray)) – The states. First tuple element is the forward layer states, while the second is the states from backward layer. Each is a list of states for each layer. The state of each layer has a list of two initial tensors with shape (batch_size, proj_size) and (batch_size, hidden_size).
mask (NDArray) – Shape (batch_size, sequence_length) with sequence mask.
 Returns
output (list of NDArray) – A list of activations at each layer of the network, each of shape (batch_size, sequence_length, embedding_size)
states ((list of list of NDArray, list of list of NDArray)) – The states. First tuple element is the forward layer states, while the second is the states from backward layer. Each is a list of states for each layer. The state of each layer has a list of two initial tensors with shape (batch_size, proj_size) and (batch_size, hidden_size).

class
gluonnlp.model.
ELMoCharacterEncoder
(output_size, filters, char_embed_size, num_highway, conv_layer_activation, max_chars_per_token, char_vocab_size, **kwargs)[source]¶ ELMo character encoder
Compute context-free character-based token representation with character-level convolution.
This encoder has input character ids of shape (batch_size, sequence_length, max_character_per_word) and returns (batch_size, sequence_length, embedding_size).
 Parameters
output_size (int) – The output dimension after conducting the convolutions and max pooling, and applying highways, as well as linear projection.
filters (list of tuple) – List of tuples representing the settings for convolution layers. Each element is (ngram_filter_size, num_filters).
char_embed_size (int) – The input dimension to the encoder.
num_highway (int) – The number of layers of the Highway layer.
conv_layer_activation (str) – Activation function to be used after convolutional layer.
max_chars_per_token (int) – The maximum number of characters of a token.
char_vocab_size (int) – Size of character-level vocabulary.

hybrid_forward
(F, inputs)[source]¶ Compute context insensitive token embeddings for ELMo representations.
 Parameters
inputs (NDArray) – Shape (batch_size, sequence_length, max_character_per_token) of character ids representing the current batch.
 Returns
token_embedding – Shape (batch_size, sequence_length, embedding_size) with context insensitive token representations.
 Return type
NDArray

gluonnlp.model.
elmo_2x1024_128_2048cnn_1xhighway
(dataset_name=None, pretrained=False, ctx=cpu(0), root='~/.mxnet/models', **kwargs)[source]¶ ELMo 2-layer BiLSTM with 1024 hidden units, 128 projection size, 1 highway layer.
 Parameters
dataset_name (str or None, default None) – The dataset name on which the pretrained model is trained. Options are ‘gbw’.
pretrained (bool, default False) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
 Returns
 Return type
gluon.Block

gluonnlp.model.
elmo_2x2048_256_2048cnn_1xhighway
(dataset_name=None, pretrained=False, ctx=cpu(0), root='~/.mxnet/models', **kwargs)[source]¶ ELMo 2-layer BiLSTM with 2048 hidden units, 256 projection size, 1 highway layer.
 Parameters
dataset_name (str or None, default None) – The dataset name on which the pretrained model is trained. Options are ‘gbw’.
pretrained (bool, default False) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
 Returns
 Return type
gluon.Block

gluonnlp.model.
elmo_2x4096_512_2048cnn_2xhighway
(dataset_name=None, pretrained=False, ctx=cpu(0), root='~/.mxnet/models', **kwargs)[source]¶ ELMo 2-layer BiLSTM with 4096 hidden units, 512 projection size, 2 highway layers.
 Parameters
dataset_name (str or None, default None) – The dataset name on which the pretrained model is trained. Options are ‘gbw’ and ‘5bw’.
pretrained (bool, default False) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
 Returns
 Return type
gluon.Block

class
gluonnlp.model.
Seq2SeqEncoder
(prefix=None, params=None)[source]¶ Base class of the encoders in sequence to sequence learning models.

class
gluonnlp.model.
TransformerEncoder
(attention_cell='multi_head', num_layers=2, units=512, hidden_size=2048, max_length=50, num_heads=4, scaled=True, dropout=0.0, use_residual=True, output_attention=False, weight_initializer=None, bias_initializer='zeros', prefix=None, params=None)[source]¶ Structure of the Transformer Encoder.
 Parameters
attention_cell (AttentionCell or str, default 'multi_head') – Arguments of the attention cell. Can be ‘multi_head’, ‘scaled_luong’, ‘scaled_dot’, ‘dot’, ‘cosine’, ‘normed_mlp’, ‘mlp’
num_layers (int) – Number of attention layers.
units (int) – Number of units for the output.
hidden_size (int) – Number of units in the hidden layer of position-wise feed-forward networks.
max_length (int) – Maximum length of the input sequence.
num_heads (int) – Number of heads in multi-head attention.
scaled (bool) – Whether to scale the softmax input by the sqrt of the input dimension in multi-head attention.
dropout (float) – Dropout probability of the attention probabilities.
use_residual (bool) –
output_attention (bool) – Whether to output the attention weights
weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
bias_initializer (str or Initializer) – Initializer for the bias vector.
prefix (str, default None) – Prefix for name of Blocks (and name of weight if params is None).
params (Parameter or None) – Container for weight sharing between cells. Created if None.
Inputs –
inputs : input sequence of shape (batch_size, length, C_in)
states : list of tensors for initial states and masks.
valid_length : valid lengths of each sequence. Usually used when part of the sequence has been padded. Shape is (batch_size,)
Outputs –
outputs : the output of the encoder. Shape is (batch_size, length, C_out)
additional_outputs : list of tensors. Either an empty list or a list containing the attention weights in this step. The attention weights will have shape (batch_size, length, mem_length) or (batch_size, num_heads, length, mem_length)
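The role of the scaled flag and of valid_length can be seen in a single-head, single-sequence sketch of dot-product attention. This is pure Python under simplifying assumptions (no learned projections, no multiple heads, no dropout), and `scaled_dot_attention` is a hypothetical helper, not the attention cell GluonNLP uses.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_attention(queries, keys, values, valid_length, scaled=True):
    """Single-head attention for one sequence; key positions >= valid_length are ignored."""
    dim = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = []
        for j, k in enumerate(keys):
            score = sum(a * b for a, b in zip(q, k))
            if scaled:
                score /= math.sqrt(dim)
            # Padded positions get -inf so that their softmax weight is exactly 0.
            scores.append(score if j < valid_length else float("-inf"))
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[d] for wj, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights

# Self-attention over a length-3 sequence whose last position is padding.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_attention(seq, seq, seq, valid_length=2)
```

Here `weights` has shape (length, mem_length) for the one sequence, which is the per-example slice of the attention weights described in additional_outputs.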

class
gluonnlp.model.
PositionwiseFFN
(units=512, hidden_size=2048, dropout=0.0, use_residual=True, ffn1_dropout=False, weight_initializer=None, bias_initializer='zeros', prefix=None, params=None, activation='relu', layer_norm_eps=None)[source]¶ Structure of the Position-wise Feed-Forward Neural Network for Transformer.
Computes the position-wise encoding of the inputs.
 Parameters
units (int) – Number of units for the output
hidden_size (int) – Number of units in the hidden layer of position-wise feed-forward networks
dropout (float) – Dropout probability for the output
use_residual (bool) – Add residual connection between the input and the output
ffn1_dropout (bool, default False) – If True, apply dropout after both the first and second Position-wise Feed-Forward Neural Network layers. If False, only apply dropout after the second.
weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
bias_initializer (str or Initializer) – Initializer for the bias vector.
prefix (str, default None) – Prefix for name of Blocks (and name of weight if params is None).
params (Parameter or None) – Container for weight sharing between cells. Created if None.
activation (str, default 'relu') – Activation methods in PositionwiseFFN
layer_norm_eps (float, default None) – Epsilon for layer_norm
Inputs –
inputs : input sequence of shape (batch_size, length, C_in).
Outputs –
outputs : output encoding of shape (batch_size, length, C_out).
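A position-wise network applies the same two dense layers to every position independently. The sketch below shows that for one sequence, assuming the common arrangement of dense, ReLU, dense, optional residual connection, and layer normalization; dropout and the learnable layer-norm scale and shift are left out, and the helper names are hypothetical rather than taken from PositionwiseFFN.

```python
import math

def dense(x, weight, bias):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def positionwise_ffn(sequence, w1, b1, w2, b2, use_residual=True):
    """Apply the same two-layer network to every position of `sequence` independently."""
    outputs = []
    for x in sequence:
        hidden = [max(0.0, h) for h in dense(x, w1, b1)]   # ReLU activation
        out = dense(hidden, w2, b2)
        if use_residual:
            out = [o + v for o, v in zip(out, x)]
        outputs.append(layer_norm(out))
    return outputs

# Toy sizes: units = 2, hidden_size = 3, sequence length = 2.
w1, b1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
w2, b2 = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0]
sequence = [[1.0, 2.0], [1.0, 2.0]]    # two identical positions
outputs = positionwise_ffn(sequence, w1, b1, w2, b2)
```

Because no information flows between positions, identical input vectors at different positions produce identical outputs.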

class
gluonnlp.model.
TransformerEncoderCell
(attention_cell='multi_head', units=128, hidden_size=512, num_heads=4, scaled=True, dropout=0.0, use_residual=True, output_attention=False, weight_initializer=None, bias_initializer='zeros', prefix=None, params=None, activation='relu', layer_norm_eps=None)[source]¶ Structure of the Transformer Encoder Cell.
 Parameters
attention_cell (AttentionCell or str, default 'multi_head') – Arguments of the attention cell. Can be ‘multi_head’, ‘scaled_luong’, ‘scaled_dot’, ‘dot’, ‘cosine’, ‘normed_mlp’, ‘mlp’
units (int) – Number of units for the output
hidden_size (int) – Number of units in the hidden layer of position-wise feed-forward networks
num_heads (int) – Number of heads in multi-head attention
scaled (bool) – Whether to scale the softmax input by the sqrt of the input dimension in multi-head attention
dropout (float) –
use_residual (bool) –
output_attention (bool) – Whether to output the attention weights
weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
bias_initializer (str or Initializer) – Initializer for the bias vector.
prefix (str, default None) – Prefix for name of Blocks (and name of weight if params is None).
params (Parameter or None) – Container for weight sharing between cells. Created if None.
activation (str, default 'relu') – Activation methods in PositionwiseFFN
layer_norm_eps (float, default None) – Epsilon for layer_norm
Inputs –
inputs : input sequence. Shape (batch_size, length, C_in)
mask : mask for inputs. Shape (batch_size, length, length)
Outputs –
outputs: output tensor of the transformer encoder cell. Shape (batch_size, length, C_out)
additional_outputs: the additional outputs of the transformer encoder cell.
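A mask of shape (batch_size, length, length) can be derived from per-sequence valid lengths. The sketch below assumes the convention that entry [b][i][j] is 1 when query position i may attend to key position j, i.e. when j lies inside the valid part of sequence b; `attention_mask` is a hypothetical helper written in pure Python.

```python
def attention_mask(valid_lengths, length):
    """mask[b][i][j] is 1 when key position j is a real (non-padded) token of sequence b."""
    return [[[1 if j < n else 0 for j in range(length)] for _ in range(length)]
            for n in valid_lengths]

# Batch of two sequences padded to length 3, with 3 and 1 valid tokens.
mask = attention_mask(valid_lengths=[3, 1], length=3)
```

Every query row of a sequence shares the same pattern, so padded key positions are hidden from all queries of that sequence.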

gluonnlp.model.
transformer_en_de_512
(dataset_name=None, src_vocab=None, tgt_vocab=None, pretrained=False, ctx=cpu(0), root='~/.mxnet/models', **kwargs)[source]¶ Transformer pretrained model.
Embedding size is 400, and hidden layer size is 1150.
 Parameters
src_vocab (gluonnlp.Vocab or None, default None) –
tgt_vocab (gluonnlp.Vocab or None, default None) –
pretrained (bool, default False) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
 Returns
 Return type
gluon.Block, gluonnlp.Vocab, gluonnlp.Vocab

class
gluonnlp.model.
BERTModel
(encoder, vocab_size=None, token_type_vocab_size=None, units=None, embed_size=None, embed_dropout=0.0, embed_initializer=None, word_embed=None, token_type_embed=None, use_pooler=True, use_decoder=True, use_classifier=True, use_token_type_embed=True, prefix=None, params=None)[source]¶ Generic Model for BERT (Bidirectional Encoder Representations from Transformers).
 Parameters
encoder (BERTEncoder) – Bidirectional encoder that encodes the input sentence.
vocab_size (int or None, default None) – The size of the vocabulary.
token_type_vocab_size (int or None, default None) – The vocabulary size of token types (number of segments).
units (int or None, default None) – Number of units for the final pooler layer.
embed_size (int or None, default None) – Size of the embedding vectors. It is used to generate the word and token type embeddings if word_embed and token_type_embed are None.
embed_dropout (float, default 0.0) – Dropout rate of the embedding weights. It is used to generate the word and token type embeddings if word_embed and token_type_embed are None.
embed_initializer (Initializer, default None) – Initializer of the embedding weights. It is used to generate the word and token type embeddings if word_embed and token_type_embed are None.
word_embed (Block or None, default None) – The word embedding. If set to None, word_embed will be constructed using embed_size and embed_dropout.
token_type_embed (Block or None, default None) – The token type embedding (segment embedding). If set to None, token_type_embed will be constructed using embed_size and embed_dropout.
use_pooler (bool, default True) – Whether to include the pooler which converts the encoded sequence tensor of shape (batch_size, seq_length, units) to a tensor of shape (batch_size, units) for segment level classification task.
use_decoder (bool, default True) – Whether to include the decoder for masked language model prediction.
use_classifier (bool, default True) – Whether to include the classifier for next sentence classification.
use_token_type_embed (bool, default True) – Whether to include token type embedding (segment embedding).
params (ParameterDict or None) – See document of mx.gluon.Block.
Inputs –
inputs: input sequence tensor, shape (batch_size, seq_length)
 token_types: optional input token type tensor, shape (batch_size, seq_length).
If the inputs contain two sequences, then the token type of the first sequence differs from that of the second one.
valid_length: optional tensor of input sequence valid lengths, shape (batch_size,)
 masked_positions: optional tensor of position of tokens for masked LM decoding,
shape (batch_size, num_masked_positions).
Outputs –
 sequence_outputs: Encoded sequence, which can be either a tensor of the last
layer of the Encoder, or a list of all sequence encodings of all layers. In both cases shape of the tensor(s) is/are (batch_size, seq_length, units).
 attention_outputs: output list of all intermediate encodings per layer
Returned only if BERTEncoder.output_attention is True. List of num_layers length of tensors of shape (batch_size, num_attention_heads, seq_length, seq_length)
 pooled_output: output tensor of pooled representation of the first tokens.
Returned only if use_pooler is True. Shape (batch_size, units)
 next_sentence_classifier_output: output tensor of next sentence classification.
Returned only if use_classifier is True. Shape (batch_size, 2)
 masked_lm_outputs: output tensor of sequence decoding for masked language model
prediction. Returned only if use_decoder is True. Shape (batch_size, num_masked_positions, vocab_size)
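To make the role of masked_positions concrete, the following is a minimal pure-Python sketch of how encoded vectors might be selected at the masked positions before masked language model decoding. It is not GluonNLP's implementation: the helper name gather_masked_positions and the nested-list "tensors" are assumptions made for illustration only.

```python
def gather_masked_positions(sequence_outputs, masked_positions):
    """Select the encoded vectors at the masked positions.

    sequence_outputs: nested list, shape (batch_size, seq_length, units)
    masked_positions: nested list, shape (batch_size, num_masked_positions)
    returns: nested list, shape (batch_size, num_masked_positions, units)
    """
    gathered = []
    for encodings, positions in zip(sequence_outputs, masked_positions):
        # Pick one encoded vector per masked position of this sequence.
        gathered.append([encodings[p] for p in positions])
    return gathered


# batch_size=2, seq_length=3, units=2
sequence_outputs = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]],
    [[3.0, 3.1], [4.0, 4.1], [5.0, 5.1]],
]
masked_positions = [[2, 0], [1, 1]]  # num_masked_positions=2
selected = gather_masked_positions(sequence_outputs, masked_positions)
```

A decoder would then map each selected vector of size units to vocab_size scores, which gives the (batch_size, num_masked_positions, vocab_size) shape listed for masked_lm_outputs.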

class
gluonnlp.model.
RoBERTaModel
(encoder, vocab_size=None, units=None, embed_size=None, embed_dropout=0.0, embed_initializer=None, word_embed=None, use_decoder=True, prefix=None, params=None)[source]¶ Generic Model for BERT (Bidirectional Encoder Representations from Transformers).
 Parameters
encoder (BERTEncoder) – Bidirectional encoder that encodes the input sentence.
vocab_size (int or None, default None) – The size of the vocabulary.
units (int or None, default None) – Number of units for the final pooler layer.
embed_size (int or None, default None) – Size of the embedding vectors. It is used to generate the word embedding if word_embed is None.
embed_dropout (float, default 0.0) – Dropout rate of the embedding weights. It is used to generate the word embedding if word_embed is None.
embed_initializer (Initializer, default None) – Initializer of the embedding weights. It is used to generate the word embedding if word_embed is None.
word_embed (Block or None, default None) – The word embedding. If set to None, word_embed will be constructed using embed_size and embed_dropout.
use_decoder (bool, default True) – Whether to include the decoder for masked language model prediction.
params (ParameterDict or None) – See document of mx.gluon.Block.
Inputs –
inputs: input sequence tensor, shape (batch_size, seq_length)
valid_length: optional tensor of input sequence valid lengths, shape (batch_size,)
 masked_positions: optional tensor of position of tokens for masked LM decoding,
shape (batch_size, num_masked_positions).
Outputs –
 sequence_outputs: Encoded sequence, which can be either a tensor of the last
layer of the Encoder, or a list of all sequence encodings of all layers. In both cases shape of the tensor(s) is/are (batch_size, seq_length, units).
 attention_outputs: output list of all intermediate encodings per layer
Returned only if BERTEncoder.output_attention is True. List of num_layers length of tensors of shape (batch_size, num_attention_heads, seq_length, seq_length)
 masked_lm_outputs: output tensor of sequence decoding for masked language model
prediction. Returned only if use_decoder is True. Shape (batch_size, num_masked_positions, vocab_size)

class
gluonnlp.model.
BERTEncoder
(attention_cell='multi_head', num_layers=2, units=512, hidden_size=2048, max_length=50, num_heads=4, scaled=True, dropout=0.0, use_residual=True, output_attention=False, output_all_encodings=False, weight_initializer=None, bias_initializer='zeros', prefix=None, params=None, activation='gelu', layer_norm_eps=None)[source]¶ Structure of the BERT Encoder.
Different from the original encoder for transformer, BERTEncoder uses learnable positional embedding, BERTPositionwiseFFN and BERTLayerNorm.
 Parameters
attention_cell (AttentionCell or str, default 'multi_head') – Arguments of the attention cell. Can be ‘multi_head’, ‘scaled_luong’, ‘scaled_dot’, ‘dot’, ‘cosine’, ‘normed_mlp’, ‘mlp’
num_layers (int) – Number of attention layers.
units (int) – Number of units for the output.
hidden_size (int) – number of units in the hidden layer of positionwise feedforward networks
max_length (int) – Maximum length of the input sequence
num_heads (int) – Number of heads in multihead attention
scaled (bool) – Whether to scale the softmax input by the sqrt of the input dimension in multihead attention
dropout (float) – Dropout probability of the attention probabilities.
use_residual (bool) –
output_attention (bool, default False) – Whether to output the attention weights
output_all_encodings (bool, default False) – Whether to output encodings of all encoder cells
weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
bias_initializer (str or Initializer) – Initializer for the bias vector.
prefix (str, default None) – Prefix for name of Blocks (and name of weight if params is None).
params (Parameter or None) – Container for weight sharing between cells. Created if None.
activation (str, default 'gelu') – Activation methods in PositionwiseFFN
layer_norm_eps (float, default None) – Epsilon for layer_norm
Inputs –
inputs : input sequence of shape (batch_size, length, C_in)
states : list of tensors for initial states and masks.
valid_length : valid lengths of each sequence. Usually used when part of the sequence has been padded. Shape is (batch_size,)
Outputs –
outputs : the output of the encoder. Shape is (batch_size, length, C_out)
additional_outputs : list of tensors. Either an empty list or a list that contains the attention weights in this step. The attention weights will have shape (batch_size, num_heads, length, mem_length)
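The learnable positional embedding mentioned above can be pictured as a table with one vector per position that is added to the inputs. The sketch below is a pure-Python illustration under that assumption, not the library's code: add_position_embedding and position_table are hypothetical names, and the random values stand in for parameters that would be learned during training.

```python
import random


def add_position_embedding(inputs, position_table):
    """Add a per-position vector to every sequence in the batch.

    inputs: nested list, shape (batch_size, length, units)
    position_table: nested list, shape (max_length, units); in a real model
        these values would be trainable parameters.
    """
    length = len(inputs[0])
    if length > len(position_table):
        raise ValueError("sequence length exceeds max_length")
    return [
        [
            [x + p for x, p in zip(token, position_table[t])]
            for t, token in enumerate(sequence)
        ]
        for sequence in inputs
    ]


max_length, units = 4, 2
rng = random.Random(0)
position_table = [
    [rng.uniform(-0.1, 0.1) for _ in range(units)] for _ in range(max_length)
]
inputs = [[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]]  # batch_size=1, length=3
outputs = add_position_embedding(inputs, position_table)
```

Because the table has only max_length rows, inputs longer than max_length cannot be embedded, which is why the sketch raises an error in that case.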

class
gluonnlp.model.
BERTEncoderCell
(attention_cell='multi_head', units=128, hidden_size=512, num_heads=4, scaled=True, dropout=0.0, use_residual=True, output_attention=False, weight_initializer=None, bias_initializer='zeros', prefix=None, params=None, activation='gelu', layer_norm_eps=None)[source]¶ Structure of the Transformer Encoder Cell for BERT.
Different from the original encoder cell for transformer, BERTEncoderCell adds bias terms for attention and the projection on attention output. It also uses BERTPositionwiseFFN and BERTLayerNorm.
 Parameters
attention_cell (AttentionCell or str, default 'multi_head') – Arguments of the attention cell. Can be ‘multi_head’, ‘scaled_luong’, ‘scaled_dot’, ‘dot’, ‘cosine’, ‘normed_mlp’, ‘mlp’
units (int) – Number of units for the output
hidden_size (int) – number of units in the hidden layer of positionwise feedforward networks
num_heads (int) – Number of heads in multihead attention
scaled (bool) – Whether to scale the softmax input by the sqrt of the input dimension in multihead attention
dropout (float) –
use_residual (bool) –
output_attention (bool) – Whether to output the attention weights
weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
bias_initializer (str or Initializer) – Initializer for the bias vector.
prefix (str, default None) – Prefix for name of Blocks (and name of weight if params is None).
params (Parameter or None) – Container for weight sharing between cells. Created if None.
activation (str, default 'gelu') – Activation methods in PositionwiseFFN
layer_norm_eps (float, default None) – Epsilon for layer_norm
Inputs –
inputs : input sequence. Shape (batch_size, length, C_in)
mask : mask for inputs. Shape (batch_size, length, length)
Outputs –
 outputs: output tensor of the transformer encoder cell.
Shape (batch_size, length, C_out)
additional_outputs: the additional output of all the transformer encoder cell.
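The mask input of shape (batch_size, length, length) is typically derived from the valid length of each padded sequence. The following pure-Python sketch shows one common convention, in which every query position may attend only to non-padded key positions. The helper build_attention_mask is hypothetical and is not part of GluonNLP.

```python
def build_attention_mask(valid_length, length):
    """Return a mask of shape (batch_size, length, length).

    mask[b][i][j] is 1 when key position j is a real token of sequence b
    (j < valid_length[b]) and 0 when it is padding.
    """
    return [
        [[1 if j < n else 0 for j in range(length)] for _ in range(length)]
        for n in valid_length
    ]


# Two sequences padded to length 3; the second has a single real token.
mask = build_attention_mask(valid_length=[3, 1], length=3)
```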

class
gluonnlp.model.
BERTPositionwiseFFN
(units=512, hidden_size=2048, dropout=0.0, use_residual=True, weight_initializer=None, bias_initializer='zeros', prefix=None, params=None, activation='gelu', layer_norm_eps=None)[source]¶ Structure of the Positionwise FeedForward Neural Network for BERT.
Different from the original positionwise feed forward network for transformer, BERTPositionwiseFFN uses GELU for activation and BERTLayerNorm for layer normalization.
 Parameters
units (int) – Number of units for the output
hidden_size (int) – Number of units in the hidden layer of positionwise feedforward networks
dropout (float) – Dropout probability for the output
use_residual (bool) – Add residual connection between the input and the output
weight_initializer (str or Initializer) – Initializer for the input weights matrix, used for the linear transformation of the inputs.
bias_initializer (str or Initializer) – Initializer for the bias vector.
prefix (str, default None) – Prefix for name of Blocks (and name of weight if params is None).
params (Parameter or None) – Container for weight sharing between cells. Created if None.
activation (str, default 'gelu') – Activation methods in PositionwiseFFN
layer_norm_eps (float, default None) – Epsilon for layer_norm
Inputs –
inputs : input sequence of shape (batch_size, length, C_in).
Outputs –
outputs : output encoding of shape (batch_size, length, C_out).
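As a rough illustration of the computation described above, the sketch below applies a dense layer, the GELU activation, and a second dense layer to a single token vector, with an optional residual connection. It is a pure-Python sketch and not the library's code: the function names and the tiny weight matrices are assumptions, and dropout and layer normalization are omitted for brevity.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def positionwise_ffn(token, w1, b1, w2, b2, use_residual=True):
    """Apply dense -> GELU -> dense to one token vector.

    token: list of C_in floats
    w1: hidden_size x C_in, b1: hidden_size
    w2: C_out x hidden_size, b2: C_out
    The residual connection requires C_in == C_out.
    """
    hidden = [
        gelu(sum(w * x for w, x in zip(row, token)) + b)
        for row, b in zip(w1, b1)
    ]
    out = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]
    if use_residual:
        out = [o + x for o, x in zip(out, token)]
    return out


token = [1.0, -1.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hidden_size=3, C_in=2
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # C_out=2
b2 = [0.0, 0.0]
result = positionwise_ffn(token, w1, b1, w2, b2)
```

The same weights are applied independently at every position of the sequence, which is what "positionwise" refers to.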

class
gluonnlp.model.
BERTLayerNorm
(epsilon=1e-12, in_channels=0, prefix=None, params=None)[source]¶ BERT style Layer Normalization.
Epsilon is added inside the square root and set to 1e-12 by default.
 Inputs:
data: input tensor with arbitrary shape.
 Outputs:
out: output tensor with the same shape as data.
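The placement of epsilon inside the square root can be sketched as follows. This is a minimal pure-Python illustration for a single vector, not the library's implementation: the function name bert_layer_norm is hypothetical, and gamma and beta stand in for the learnable scale and shift.

```python
import math


def bert_layer_norm(x, gamma=None, beta=None, epsilon=1e-12):
    """Normalize a vector over its last axis.

    epsilon is added to the variance before the square root is taken.
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + epsilon)  # epsilon inside the sqrt
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


normalized = bert_layer_norm([1.0, 2.0, 3.0, 4.0])
```

With the default scale and shift, the result has approximately zero mean and unit variance.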

class
gluonnlp.model.
BERTClassifier
(bert, num_classes=2, dropout=0.0, prefix=None, params=None)[source]¶ Model for sentence (pair) classification task with BERT.
The model feeds token ids and token type ids into BERT to get the pooled BERT sequence representation, then applies a Dense layer for classification.
 Parameters

hybrid_forward
(F, inputs, token_types, valid_length=None)[source]¶ Generate the unnormalized score for the given input sequences.
 Parameters
inputs (NDArray or Symbol, shape (batch_size, seq_length)) – Input words for the sequences.
token_types (NDArray or Symbol, shape (batch_size, seq_length)) – Token types for the sequences, used to indicate whether the word belongs to the first sentence or the second one.
valid_length (NDArray or None, shape (batch_size)) – Valid length of the sequence. This is used to mask the padded tokens.
 Returns
outputs – Shape (batch_size, num_classes)
 Return type
NDArray
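Because the returned scores are unnormalized, a softmax is usually applied to turn them into class probabilities. The pure-Python sketch below shows that post-processing step on made-up scores. The values and variable names are illustrative assumptions and do not come from a real model.

```python
import math


def softmax(scores):
    """Convert one row of unnormalized scores into probabilities."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical classifier output, shape (batch_size=2, num_classes=2)
outputs = [[2.0, 0.0], [-1.0, 1.0]]
probabilities = [softmax(row) for row in outputs]
predictions = [row.index(max(row)) for row in probabilities]
```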

class
gluonnlp.model.
RoBERTaClassifier
(roberta, num_classes=2, dropout=0.0, prefix=None, params=None)[source]¶ Model for sentence (pair) classification task with RoBERTa.
The model feeds token ids into RoBERTa to get the sequence representation, then applies a Dense layer for classification.
 Parameters
roberta (RoBERTaModel) – The RoBERTa model.
num_classes (int, default 2) – The number of target classes.
dropout (float or None, default 0.0) – Dropout probability for the RoBERTa output.
params (ParameterDict or None) – See document of mx.gluon.Block.
Inputs –
inputs: input sequence tensor, shape (batch_size, seq_length)
valid_length: optional tensor of input sequence valid lengths, shape (batch_size,)
Outputs –
output: Regression output, shape (batch_size, num_classes)

hybrid_forward
(F, inputs, valid_length=None)[source]¶ Generate the unnormalized score for the given input sequences.
 Parameters
inputs (NDArray or Symbol, shape (batch_size, seq_length)) – Input words for the sequences.
valid_length (NDArray or Symbol, or None, shape (batch_size)) – Valid length of the sequence. This is used to mask the padded tokens.
 Returns
outputs – Shape (batch_size, num_classes)
 Return type
NDArray or Symbol

gluonnlp.model.
bert_12_768_12
(dataset_name=None, vocab=None, pretrained=True, ctx=cpu(0), root='/var/lib/jenkins/.mxnet/models', use_pooler=True, use_decoder=True, use_classifier=True, pretrained_allow_missing=False, **kwargs)[source]¶ Generic BERT BASE model.
The number of layers (L) is 12, number of units (H) is 768, and the number of self-attention heads (A) is 12.
 Parameters
dataset_name (str or None, default None) – If not None, the dataset name is used to load a vocabulary for the dataset. If the pretrained argument is set to True, the dataset name is further used to select the pretrained parameters to load. The supported datasets are ‘book_corpus_wiki_en_cased’, ‘book_corpus_wiki_en_uncased’, ‘wiki_cn_cased’, ‘openwebtext_book_corpus_wiki_en_uncased’, ‘wiki_multilingual_uncased’, ‘wiki_multilingual_cased’, ‘scibert_scivocab_uncased’, ‘scibert_scivocab_cased’, ‘scibert_basevocab_uncased’, ‘scibert_basevocab_cased’, ‘biobert_v1.0_pmc’, ‘biobert_v1.0_pubmed’, ‘biobert_v1.0_pubmed_pmc’, ‘biobert_v1.1_pubmed’, ‘clinicalbert’
vocab (gluonnlp.vocab.BERTVocab or None, default None) – Vocabulary for the dataset. Must be provided if dataset_name is not specified. Ignored if dataset_name is specified.
pretrained (bool, default True) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
use_pooler (bool, default True) – Whether to include the pooler which converts the encoded sequence tensor of shape (batch_size, seq_length, units) to a tensor of shape (batch_size, units) for segment level classification task.
use_decoder (bool, default True) – Whether to include the decoder for masked language model prediction. Note that ‘biobert_v1.0_pmc’, ‘biobert_v1.0_pubmed’, ‘biobert_v1.0_pubmed_pmc’, ‘biobert_v1.1_pubmed’, ‘clinicalbert’ do not include these parameters.
use_classifier (bool, default True) – Whether to include the classifier for next sentence classification. Note that ‘biobert_v1.0_pmc’, ‘biobert_v1.0_pubmed’, ‘biobert_v1.0_pubmed_pmc’, ‘biobert_v1.1_pubmed’ do not include these parameters.
pretrained_allow_missing (bool, default False) – Whether to ignore if any parameters for the BERTModel are missing in the pretrained weights for model. Some BERTModels for example do not provide decoder or classifier weights. In that case it is still possible to construct a BERTModel with use_decoder=True and/or use_classifier=True, but the respective parameters will be missing from the pretrained file. If pretrained_allow_missing=True, this will be ignored and the parameters will be left uninitialized. Otherwise AssertionError is raised.
The pretrained parameters for dataset_name 'openwebtext_book_corpus_wiki_en_uncased' were obtained by running the GluonNLP BERT pretraining script on OpenWebText.
The pretrained parameters for dataset_name 'scibert_scivocab_uncased', 'scibert_scivocab_cased', 'scibert_basevocab_uncased', 'scibert_basevocab_cased' were obtained by converting the parameters published by "Beltagy, I., Cohan, A., & Lo, K. (2019). Scibert: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676."
The pretrained parameters for dataset_name 'biobert_v1.0_pmc', 'biobert_v1.0_pubmed', 'biobert_v1.0_pubmed_pmc', 'biobert_v1.1_pubmed' were obtained by converting the parameters published by "Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2019). Biobert: a pretrained biomedical language representation model for biomedical text mining. arXiv preprint."
The pretrained parameters for dataset_name 'clinicalbert' were obtained by converting the parameters published by "Huang, K., Altosaar, J., & Ranganath, R. (2019). ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv preprint."
 Returns
 Return type

gluonnlp.model.
bert_24_1024_16
(dataset_name=None, vocab=None, pretrained=True, ctx=cpu(0), use_pooler=True, use_decoder=True, use_classifier=True, root='/var/lib/jenkins/.mxnet/models', pretrained_allow_missing=False, **kwargs)[source]¶ Generic BERT LARGE model.
The number of layers (L) is 24, number of units (H) is 1024, and the number of self-attention heads (A) is 16.
 Parameters
dataset_name (str or None, default None) – If not None, the dataset name is used to load a vocabulary for the dataset. If the pretrained argument is set to True, the dataset name is further used to select the pretrained parameters to load. Options include ‘book_corpus_wiki_en_uncased’ and ‘book_corpus_wiki_en_cased’.
vocab (gluonnlp.vocab.BERTVocab or None, default None) – Vocabulary for the dataset. Must be provided if dataset_name is not specified. Ignored if dataset_name is specified.
pretrained (bool, default True) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
use_pooler (bool, default True) – Whether to include the pooler which converts the encoded sequence tensor of shape (batch_size, seq_length, units) to a tensor of shape (batch_size, units) for segment level classification task.
use_decoder (bool, default True) – Whether to include the decoder for masked language model prediction.
use_classifier (bool, default True) – Whether to include the classifier for next sentence classification.
pretrained_allow_missing (bool, default False) – Whether to ignore if any parameters for the BERTModel are missing in the pretrained weights for model. Some BERTModels for example do not provide decoder or classifier weights. In that case it is still possible to construct a BERTModel with use_decoder=True and/or use_classifier=True, but the respective parameters will be missing from the pretrained file. If pretrained_allow_missing=True, this will be ignored and the parameters will be left uninitialized. Otherwise AssertionError is raised.
 Returns
 Return type

gluonnlp.model.
ernie_12_768_12
(dataset_name=None, vocab=None, pretrained=True, ctx=cpu(0), root='/var/lib/jenkins/.mxnet/models', use_pooler=True, use_decoder=True, use_classifier=True, **kwargs)[source]¶ Baidu ERNIE model.
Reference: https://arxiv.org/pdf/1904.09223.pdf
The number of layers (L) is 12, number of units (H) is 768, and the number of self-attention heads (A) is 12.
 Parameters
dataset_name (str or None, default None) – If not None, the dataset name is used to load a vocabulary for the dataset. If the pretrained argument is set to True, the dataset name is further used to select the pretrained parameters to load. The supported datasets are ‘baidu_ernie’
vocab (gluonnlp.vocab.BERTVocab or None, default None) – Vocabulary for the dataset. Must be provided if dataset_name is not specified. Ignored if dataset_name is specified.
pretrained (bool, default True) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
use_pooler (bool, default True) – Whether to include the pooler which converts the encoded sequence tensor of shape (batch_size, seq_length, units) to a tensor of shape (batch_size, units) for segment level classification task.
use_decoder (bool, default True) – Whether to include the decoder for masked language model prediction.
use_classifier (bool, default True) – Whether to include the classifier for next sentence classification.
 Returns
 Return type

gluonnlp.model.
roberta_12_768_12
(dataset_name=None, vocab=None, pretrained=True, ctx=cpu(0), use_decoder=True, root='/var/lib/jenkins/.mxnet/models', **kwargs)[source]¶ Generic RoBERTa BASE model.
The number of layers (L) is 12, number of units (H) is 768, and the number of self-attention heads (A) is 12.
 Parameters
dataset_name (str or None, default None) – If not None, the dataset name is used to load a vocabulary for the dataset. If the pretrained argument is set to True, the dataset name is further used to select the pretrained parameters to load. Options include ‘book_corpus_wiki_en_uncased’ and ‘book_corpus_wiki_en_cased’.
vocab (gluonnlp.vocab.Vocab or None, default None) – Vocabulary for the dataset. Must be provided if dataset_name is not specified. Ignored if dataset_name is specified.
pretrained (bool, default True) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
use_decoder (bool, default True) – Whether to include the decoder for masked language model prediction.
 Returns
 Return type
RoBERTaModel, gluonnlp.vocab.Vocab

gluonnlp.model.
roberta_24_1024_16
(dataset_name=None, vocab=None, pretrained=True, ctx=cpu(0), use_decoder=True, root='/var/lib/jenkins/.mxnet/models', **kwargs)[source]¶ Generic RoBERTa LARGE model.
The number of layers (L) is 24, number of units (H) is 1024, and the number of self-attention heads (A) is 16.
 Parameters
dataset_name (str or None, default None) – If not None, the dataset name is used to load a vocabulary for the dataset. If the pretrained argument is set to True, the dataset name is further used to select the pretrained parameters to load. Options include ‘book_corpus_wiki_en_uncased’ and ‘book_corpus_wiki_en_cased’.
vocab (gluonnlp.vocab.Vocab or None, default None) – Vocabulary for the dataset. Must be provided if dataset_name is not specified. Ignored if dataset_name is specified.
pretrained (bool, default True) – Whether to load the pretrained weights for model.
ctx (Context, default CPU) – The context in which to load the pretrained weights.
root (str, default '$MXNET_HOME/models') – Location for keeping the model parameters. MXNET_HOME defaults to ‘~/.mxnet’.
use_decoder (bool, default True) – Whether to include the decoder for masked language model prediction.
 Returns
 Return type
RoBERTaModel, gluonnlp.vocab.Vocab
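
The root parameter of the model functions above defaults to '$MXNET_HOME/models', with MXNET_HOME falling back to '~/.mxnet'. The sketch below shows how such a default could be resolved to a concrete directory using only the Python standard library. The helper resolve_model_root is hypothetical and is not a GluonNLP function.

```python
import os


def resolve_model_root(environ=None):
    """Return the directory implied by '$MXNET_HOME/models'.

    Falls back to '~/.mxnet' when MXNET_HOME is not set.
    """
    environ = os.environ if environ is None else environ
    home = environ.get("MXNET_HOME", os.path.join("~", ".mxnet"))
    return os.path.expanduser(os.path.join(home, "models"))


default_root = resolve_model_root({})
custom_root = resolve_model_root({"MXNET_HOME": "/data/mxnet"})
```

Passing an explicit mapping instead of reading os.environ directly keeps the helper easy to test. Calling it with no argument uses the real process environment.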