Language Model

Word Language Model

Reference: Merity, S., et al. “Regularizing and optimizing LSTM language models”. ICLR 2018

The key features used to reproduce the results for the pre-trained models are listed in the following table.

The dataset used for training the models is WikiText-2.

| Model | awd_lstm_lm_1150_wikitext-2 | awd_lstm_lm_600_wikitext-2 | standard_lstm_lm_1500_wikitext-2 | standard_lstm_lm_650_wikitext-2 | standard_lstm_lm_200_wikitext-2 |
|---|---|---|---|---|---|
| Mode | LSTM | LSTM | LSTM | LSTM | LSTM |
| Num layers | 3 | 3 | 2 | 2 | 2 |
| Embed size | 400 | 200 | 1500 | 650 | 200 |
| Hidden size | 1150 | 600 | 1500 | 650 | 200 |
| Dropout | 0.4 | 0.2 | 0.65 | 0.5 | 0.2 |
| Dropout_h | 0.2 | 0.1 | 0 | 0 | 0 |
| Dropout_i | 0.65 | 0.3 | 0 | 0 | 0 |
| Dropout_e | 0.1 | 0.05 | 0 | 0 | 0 |
| Weight_drop | 0.5 | 0.2 | 0 | 0 | 0 |
| Val PPL | 68.71 | 84.89 | 86.51 | 90.96 | 107.59 |
| Test PPL | 65.62 | 80.67 | 82.29 | 86.91 | 101.64 |
| Command | [1] | [2] | [3] | [4] | [5] |
| Training logs | log | log | log | log | log |

For all the above model settings, we set Tied = True and NTASGD = True.
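Tied = True shares a single weight matrix between the input embedding and the output softmax layer. Below is a minimal numpy sketch of the tying idea only, not the script's implementation; the vocabulary size and sample shapes are illustrative.

# Weight tying ("Tied = True"): the decoder matrix is the transpose of the
# (vocab_size, embed_size) embedding table, so both layers share one parameter.
# Sizes loosely mirror awd_lstm_lm_1150_wikitext-2 (embed size 400).
import numpy as np

vocab_size, embed_size = 33278, 400   # illustrative WikiText-2-sized vocab
rng = np.random.default_rng(0)

E = rng.normal(scale=0.1, size=(vocab_size, embed_size))  # shared weight
b = np.zeros(vocab_size)                                  # decoder bias

token_ids = np.array([2, 17, 105])    # a toy input sequence
x = E[token_ids]                      # embedding lookup: (3, embed_size)

h = rng.normal(size=(3, embed_size))  # stand-in for the LSTM's top hidden states
logits = h @ E.T + b                  # decoder reuses E: (3, vocab_size)
print(logits.shape)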

[1] awd_lstm_lm_1150_wikitext-2 (Val PPL 68.71 Test PPL 65.62)

$ python word_language_model.py --gpu 0 --tied --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save awd_lstm_lm_1150_wikitext-2

[2] awd_lstm_lm_600_wikitext-2 (Val PPL 84.89 Test PPL 80.67)

$ python word_language_model.py --gpu 0 --emsize 200 --nhid 600 --epochs 750 --dropout 0.2 --dropout_h 0.1 --dropout_i 0.3 --dropout_e 0.05 --weight_drop 0.2 --tied --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save awd_lstm_lm_600_wikitext-2

[3] standard_lstm_lm_1500_wikitext-2 (Val PPL 86.51 Test PPL 82.29)

$ python word_language_model.py --gpu 0 --emsize 1500 --nhid 1500 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.65 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save standard_lstm_lm_1500_wikitext-2

[4] standard_lstm_lm_650_wikitext-2 (Val PPL 90.96 Test PPL 86.91)

$ python word_language_model.py --gpu 0 --emsize 650 --nhid 650 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.5 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save standard_lstm_lm_650_wikitext-2

[5] standard_lstm_lm_200_wikitext-2 (Val PPL 107.59 Test PPL 101.64)

$ python word_language_model.py --gpu 0 --emsize 200 --nhid 200 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.2 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save standard_lstm_lm_200_wikitext-2
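Weight_drop in the table above applies DropConnect to the hidden-to-hidden LSTM weights (Merity et al., 2018): one mask is sampled per forward pass and applied to the recurrent weight matrix itself rather than to activations. A minimal numpy sketch of the idea, with illustrative shapes:

# Weight drop (DropConnect on the recurrent weights), sized for a 1150-unit LSTM.
import numpy as np

hidden, p_drop = 1150, 0.5
rng = np.random.default_rng(0)

W_hh = rng.normal(scale=0.02, size=(hidden, 4 * hidden))  # recurrent weights for all 4 gates

# A single mask per forward pass (not per time step), applied to the weights
# and rescaled as in inverted dropout.
mask = rng.random(W_hh.shape) >= p_drop
W_hh_dropped = (W_hh * mask) / (1.0 - p_drop)

h_prev = rng.normal(size=hidden)
gates_recurrent = h_prev @ W_hh_dropped   # reused inside every LSTM step of the pass
print(gates_recurrent.shape)              # (4600,)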

Cache Language Model

Reference: Grave, E., et al. “Improving neural language models with a continuous cache”. ICLR 2017

The key features used to reproduce the results based on the corresponding pre-trained models are listed in the following table.

The dataset used for training the models is WikiText-2.

| Model | cache_awd_lstm_lm_1150_wikitext-2 | cache_awd_lstm_lm_600_wikitext-2 | cache_standard_lstm_lm_1500_wikitext-2 | cache_standard_lstm_lm_650_wikitext-2 | cache_standard_lstm_lm_200_wikitext-2 |
|---|---|---|---|---|---|
| Pre-trained setting | Refer to: awd_lstm_lm_1150_wikitext-2 | Refer to: awd_lstm_lm_600_wikitext-2 | Refer to: standard_lstm_lm_1500_wikitext-2 | Refer to: standard_lstm_lm_650_wikitext-2 | Refer to: standard_lstm_lm_200_wikitext-2 |
| Val PPL | 53.41 | 64.51 | 65.54 | 68.47 | 77.51 |
| Test PPL | 51.46 | 62.19 | 62.79 | 65.85 | 73.74 |
| Command | [1] | [2] | [3] | [4] | [5] |
| Training logs | log | log | log | log | log |

For all the above model settings, we set lambdas = 0.1279, theta = 0.662, window = 2000, and bptt = 2000.
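The cache model of Grave et al. (2017) interpolates the network's softmax distribution with a cache distribution built from the most recent window hidden states: theta scales the dot products between the current hidden state and the cached ones, and lambdas is the interpolation weight. A minimal numpy sketch of that mixture using the settings above (array sizes are toy values, not the script's internals):

# Continuous-cache mixture: p = (1 - lambda) * p_vocab + lambda * p_cache.
import numpy as np

theta, lam = 0.662, 0.1279
vocab_size, hidden_size, window = 50, 8, 20
rng = np.random.default_rng(0)

# History kept by the cache: hidden states h_i and the words that followed them.
cache_h = rng.normal(size=(window, hidden_size))
cache_w = rng.integers(vocab_size, size=window)

# Current step: the model's softmax distribution and its hidden state.
p_vocab = np.full(vocab_size, 1.0 / vocab_size)
h = rng.normal(size=hidden_size)

# Cache distribution: softmax over theta * <h, h_i>, mass assigned to word w_i.
scores = np.exp(theta * cache_h @ h)
p_cache = np.zeros(vocab_size)
np.add.at(p_cache, cache_w, scores)
p_cache /= p_cache.sum()

# Linear interpolation of the two distributions.
p = (1.0 - lam) * p_vocab + lam * p_cache
assert np.isclose(p.sum(), 1.0)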

[1] cache_awd_lstm_lm_1150_wikitext-2 (Val PPL 53.41 Test PPL 51.46)

$ python cache_language_model.py --gpus 0 --model_name awd_lstm_lm_1150

[2] cache_awd_lstm_lm_600_wikitext-2 (Val PPL 64.51 Test PPL 62.19)

$ python cache_language_model.py --gpus 0 --model_name awd_lstm_lm_600

[3] cache_standard_lstm_lm_1500_wikitext-2 (Val PPL 65.54 Test PPL 62.79)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_1500

[4] cache_standard_lstm_lm_650_wikitext-2 (Val PPL 68.47 Test PPL 65.85)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_650

[5] cache_standard_lstm_lm_200_wikitext-2 (Val PPL 77.51 Test PPL 73.74)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_200

Large Scale Word Language Model

Reference: Jozefowicz, R., et al. “Exploring the limits of language modeling”. arXiv preprint arXiv:1602.02410 (2016)

The key features used to reproduce the results for the pre-trained models are listed in the following table.

The dataset used for training the models is Google’s 1 Billion Word dataset.

| Model | LSTM-2048-512 |
|---|---|
| Mode | LSTMP |
| Num layers | 1 |
| Embed size | 512 |
| Hidden size | 2048 |
| Projection size | 512 |
| Dropout | 0.1 |
| Learning rate | 0.2 |
| Num samples | 8192 |
| Batch size | 128 |
| Gradient clip | 10.0 |
| Test perplexity | 43.62 |
| Num epochs | 50 |
| Training logs | log |
| Evaluation logs | log |

[1] LSTM-2048-512 (Test PPL 43.62)

$ python large_word_language_model.py --gpus 0,1,2,3 --clip=10
$ python large_word_language_model.py --gpus 4 --eval-only --batch-size=1
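LSTMP in the table above denotes an LSTM with a projection layer: the 2048-dimensional hidden state is projected down to 512 dimensions before being fed back into the recurrence and into the softmax, as in Jozefowicz et al. A minimal numpy sketch of one LSTMP step (weight names and initializations are illustrative, not the script's internals):

# One LSTMP step: cell and hidden states are 2048-dim, the recurrent/output
# state r is the 512-dim projection of the hidden state.
import numpy as np

embed, hidden, proj = 512, 2048, 512
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate over the concatenated [input, projected state].
W = {g: rng.normal(scale=0.02, size=(embed + proj, hidden)) for g in "ifoc"}
b = {g: np.zeros(hidden) for g in "ifoc"}
W_proj = rng.normal(scale=0.02, size=(hidden, proj))

def lstmp_step(x, r_prev, c_prev):
    z = np.concatenate([x, r_prev])
    i = sigmoid(z @ W["i"] + b["i"])
    f = sigmoid(z @ W["f"] + b["f"])
    o = sigmoid(z @ W["o"] + b["o"])
    c_tilde = np.tanh(z @ W["c"] + b["c"])
    c = f * c_prev + i * c_tilde      # cell state stays 2048-dim
    h = o * np.tanh(c)
    r = h @ W_proj                    # projected state is 512-dim
    return r, c

r, c = lstmp_step(rng.normal(size=embed), np.zeros(proj), np.zeros(hidden))
print(r.shape, c.shape)               # (512,) (2048,)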