Language Model

Word Language Model

Reference: Merity, S., et al. “Regularizing and optimizing LSTM language models”. ICLR 2018

The key features used to reproduce the results for the pre-trained models are listed in the following table.

The dataset used for training the models is WikiText-2.

| Model | awd_lstm_lm_1150_wikitext-2 | awd_lstm_lm_600_wikitext-2 | standard_lstm_lm_1500_wikitext-2 | standard_lstm_lm_650_wikitext-2 | standard_lstm_lm_200_wikitext-2 |
|---|---|---|---|---|---|
| Mode | LSTM | LSTM | LSTM | LSTM | LSTM |
| Num layers | 3 | 3 | 2 | 2 | 2 |
| Embed size | 400 | 200 | 1500 | 650 | 200 |
| Hidden size | 1150 | 600 | 1500 | 650 | 200 |
| Dropout | 0.4 | 0.2 | 0.65 | 0.5 | 0.2 |
| Dropout_h | 0.2 | 0.1 | 0 | 0 | 0 |
| Dropout_i | 0.65 | 0.3 | 0 | 0 | 0 |
| Dropout_e | 0.1 | 0.05 | 0 | 0 | 0 |
| Weight_drop | 0.5 | 0.2 | 0 | 0 | 0 |
| Val PPL | 68.71 | 84.89 | 86.51 | 90.96 | 107.59 |
| Test PPL | 65.62 | 80.67 | 82.29 | 86.91 | 101.64 |
| Command | [1] | [2] | [3] | [4] | [5] |
| Training logs | log | log | log | log | log |

For all the above model settings, we set Tied = True and NTASGD = True.
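Tied = True shares a single weight matrix between the input embedding and the output softmax layer. Below is a minimal numpy sketch of the tying idea only, not the script's implementation; the vocabulary size and sample shapes are illustrative.

# Weight tying ("Tied = True"): the decoder matrix is the transpose of the
# (vocab_size, embed_size) embedding table, so both layers share one parameter.
# Sizes loosely mirror awd_lstm_lm_1150_wikitext-2 (embed size 400).
import numpy as np

vocab_size, embed_size = 33278, 400   # illustrative WikiText-2-sized vocab
rng = np.random.default_rng(0)

E = rng.normal(scale=0.1, size=(vocab_size, embed_size))  # shared weight
b = np.zeros(vocab_size)                                  # decoder bias

token_ids = np.array([2, 17, 105])    # a toy input sequence
x = E[token_ids]                      # embedding lookup: (3, embed_size)

h = rng.normal(size=(3, embed_size))  # stand-in for the LSTM's top hidden states
logits = h @ E.T + b                  # decoder reuses E: (3, vocab_size)
print(logits.shape)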

[1] awd_lstm_lm_1150_wikitext-2 (Val PPL 68.71 Test PPL 65.62)

$ python word_language_model.py --gpu 0 --tied --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save awd_lstm_lm_1150_wikitext-2

[2] awd_lstm_lm_600_wikitext-2 (Val PPL 84.89 Test PPL 80.67)

$ python word_language_model.py --gpu 0 --emsize 200 --nhid 600 --epochs 750 --dropout 0.2 --dropout_h 0.1 --dropout_i 0.3 --dropout_e 0.05 --weight_drop 0.2 --tied --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save awd_lstm_lm_600_wikitext-2

[3] standard_lstm_lm_1500_wikitext-2 (Val PPL 86.51 Test PPL 82.29)

$ python word_language_model.py --gpu 0 --emsize 1500 --nhid 1500 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.65 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save standard_lstm_lm_1500_wikitext-2

[4] standard_lstm_lm_650_wikitext-2 (Val PPL 90.96 Test PPL 86.91)

$ python word_language_model.py --gpu 0 --emsize 650 --nhid 650 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.5 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save standard_lstm_lm_650_wikitext-2

[5] standard_lstm_lm_200_wikitext-2 (Val PPL 107.59 Test PPL 101.64)

$ python word_language_model.py --gpu 0 --emsize 200 --nhid 200 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.2 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save standard_lstm_lm_200_wikitext-2
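Weight_drop in the table above applies DropConnect to the hidden-to-hidden LSTM weights (Merity et al., 2018): one mask is sampled per forward pass and applied to the recurrent weight matrix itself rather than to activations. A minimal numpy sketch of the idea, with illustrative shapes:

# Weight drop (DropConnect on the recurrent weights), sized for a 1150-unit LSTM.
import numpy as np

hidden, p_drop = 1150, 0.5
rng = np.random.default_rng(0)

W_hh = rng.normal(scale=0.02, size=(hidden, 4 * hidden))  # recurrent weights for all 4 gates

# A single mask per forward pass (not per time step), applied to the weights
# and rescaled as in inverted dropout.
mask = rng.random(W_hh.shape) >= p_drop
W_hh_dropped = (W_hh * mask) / (1.0 - p_drop)

h_prev = rng.normal(size=hidden)
gates_recurrent = h_prev @ W_hh_dropped   # reused inside every LSTM step of the pass
print(gates_recurrent.shape)              # (4600,)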

Cache Language Model

Reference: Grave, E., et al. “Improving neural language models with a continuous cache”. ICLR 2017

The key features used to reproduce the results based on the corresponding pre-trained models are listed in the following table.

The dataset used for training the models is WikiText-2.

| Model | cache_awd_lstm_lm_1150_wikitext-2 | cache_awd_lstm_lm_600_wikitext-2 | cache_standard_lstm_lm_1500_wikitext-2 | cache_standard_lstm_lm_650_wikitext-2 | cache_standard_lstm_lm_200_wikitext-2 |
|---|---|---|---|---|---|
| Pre-trained setting | Refer to: awd_lstm_lm_1150_wikitext-2 | Refer to: awd_lstm_lm_600_wikitext-2 | Refer to: standard_lstm_lm_1500_wikitext-2 | Refer to: standard_lstm_lm_650_wikitext-2 | Refer to: standard_lstm_lm_200_wikitext-2 |
| Val PPL | 53.41 | 64.51 | 65.54 | 68.47 | 77.51 |
| Test PPL | 51.46 | 62.19 | 62.79 | 65.85 | 73.74 |
| Command | [1] | [2] | [3] | [4] | [5] |
| Training logs | log | log | log | log | log |

For all the above model settings, we set lambdas = 0.1279, theta = 0.662, window = 2000, and bptt = 2000.
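The cache model of Grave et al. (2017) interpolates the network's softmax distribution with a cache distribution built from the most recent window hidden states: theta scales the dot products between the current hidden state and the cached ones, and lambdas is the interpolation weight. A minimal numpy sketch of that mixture using the settings above (array sizes are toy values, not the script's internals):

# Continuous-cache mixture: p = (1 - lambda) * p_vocab + lambda * p_cache.
import numpy as np

theta, lam = 0.662, 0.1279
vocab_size, hidden_size, window = 50, 8, 20
rng = np.random.default_rng(0)

# History kept by the cache: hidden states h_i and the words that followed them.
cache_h = rng.normal(size=(window, hidden_size))
cache_w = rng.integers(vocab_size, size=window)

# Current step: the model's softmax distribution and its hidden state.
p_vocab = np.full(vocab_size, 1.0 / vocab_size)
h = rng.normal(size=hidden_size)

# Cache distribution: softmax over theta * <h, h_i>, mass assigned to word w_i.
scores = np.exp(theta * cache_h @ h)
p_cache = np.zeros(vocab_size)
np.add.at(p_cache, cache_w, scores)
p_cache /= p_cache.sum()

# Linear interpolation of the two distributions.
p = (1.0 - lam) * p_vocab + lam * p_cache
assert np.isclose(p.sum(), 1.0)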

[1] cache_awd_lstm_lm_1150_wikitext-2 (Val PPL 53.41 Test PPL 51.46)

$ python cache_language_model.py --gpus 0 --model_name awd_lstm_lm_1150

[2] cache_awd_lstm_lm_600_wikitext-2 (Val PPL 64.51 Test PPL 62.19)

$ python cache_language_model.py --gpus 0 --model_name awd_lstm_lm_600

[3] cache_standard_lstm_lm_1500_wikitext-2 (Val PPL 65.54 Test PPL 62.79)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_1500

[4] cache_standard_lstm_lm_650_wikitext-2 (Val PPL 68.47 Test PPL 65.85)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_650

[5] cache_standard_lstm_lm_200_wikitext-2 (Val PPL 77.51 Test PPL 73.74)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_200

Large Scale Word Language Model

Reference: Jozefowicz, R., et al. “Exploring the limits of language modeling”. arXiv preprint arXiv:1602.02410 (2016)

The key features used to reproduce the results for the pre-trained models are listed in the following table.

The dataset used for training the models is Google’s 1 Billion Word dataset.

| Model | LSTM-2048-512 |
|---|---|
| Mode | LSTMP |
| Num layers | 1 |
| Embed size | 512 |
| Hidden size | 2048 |
| Projection size | 512 |
| Dropout | 0.1 |
| Learning rate | 0.2 |
| Num samples | 8192 |
| Batch size | 128 |
| Gradient clip | 10.0 |
| Test perplexity | 43.62 |
| Num epochs | 50 |
| Training logs | log |
| Evaluation logs | log |

[1] LSTM-2048-512 (Test PPL 43.62)

$ python large_word_language_model.py --gpus 0,1,2,3 --clip=10
$ python large_word_language_model.py --gpus 4 --eval-only --batch-size=1
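LSTMP in the table above denotes an LSTM with a projection layer: the 2048-dimensional hidden state is projected down to 512 dimensions before being fed back into the recurrence and into the softmax, as in Jozefowicz et al. A minimal numpy sketch of one LSTMP step (weight names and initializations are illustrative, not the script's internals):

# One LSTMP step: cell and hidden states are 2048-dim, the recurrent/output
# state r is the 512-dim projection of the hidden state.
import numpy as np

embed, hidden, proj = 512, 2048, 512
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate over the concatenated [input, projected state].
W = {g: rng.normal(scale=0.02, size=(embed + proj, hidden)) for g in "ifoc"}
b = {g: np.zeros(hidden) for g in "ifoc"}
W_proj = rng.normal(scale=0.02, size=(hidden, proj))

def lstmp_step(x, r_prev, c_prev):
    z = np.concatenate([x, r_prev])
    i = sigmoid(z @ W["i"] + b["i"])
    f = sigmoid(z @ W["f"] + b["f"])
    o = sigmoid(z @ W["o"] + b["o"])
    c_tilde = np.tanh(z @ W["c"] + b["c"])
    c = f * c_prev + i * c_tilde      # cell state stays 2048-dim
    h = o * np.tanh(c)
    r = h @ W_proj                    # projected state is 512-dim
    return r, c

r, c = lstmp_step(rng.normal(size=embed), np.zeros(proj), np.zeros(hidden))
print(r.shape, c.shape)               # (512,) (2048,)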