Language Model

Download scripts

Word Language Model

Reference: Merity, S., et al. “Regularizing and optimizing LSTM language models”. ICLR 2018

The key features used to reproduce the results for the pre-trained models are listed in the following table.

The dataset used for training the models is WikiText-2.
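The pre-trained weights can be loaded through GluonNLP's model zoo. Below is a minimal sketch, assuming an MXNet/GluonNLP installation and the gluonnlp.model.get_model API with the model names from the table; exact output shapes and signatures may differ across GluonNLP versions.

```python
import mxnet as mx
import gluonnlp as nlp

# Load pre-trained AWD-LSTM weights together with the WikiText-2 vocabulary.
# Sketch only: assumes the model zoo exposes these names via
# gluonnlp.model.get_model (any name from the table below should work).
model, vocab = nlp.model.get_model('awd_lstm_lm_1150',
                                   dataset_name='wikitext-2',
                                   pretrained=True)

# Score a short token sequence: the model takes token ids of shape
# (sequence_length, batch_size) plus a recurrent state, and returns
# logits over the vocabulary together with the updated state.
tokens = ['the', 'quick', 'brown', 'fox']
inputs = mx.nd.array(vocab[tokens]).reshape(-1, 1)
hidden = model.begin_state(batch_size=1, func=mx.nd.zeros)
logits, hidden = model(inputs, hidden)
print(logits.shape)  # (sequence_length, batch_size, vocab_size)
```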

| Model | awd_lstm_lm_1150_wikitext-2 | awd_lstm_lm_600_wikitext-2 | standard_lstm_lm_1500_wikitext-2 | standard_lstm_lm_650_wikitext-2 | standard_lstm_lm_200_wikitext-2 |
|---|---|---|---|---|---|
| Mode | LSTM | LSTM | LSTM | LSTM | LSTM |
| Num layers | 3 | 3 | 2 | 2 | 2 |
| Embed size | 400 | 200 | 1500 | 650 | 200 |
| Hidden size | 1150 | 600 | 1500 | 650 | 200 |
| Dropout | 0.4 | 0.2 | 0.65 | 0.5 | 0.2 |
| Dropout_h | 0.2 | 0.1 | 0 | 0 | 0 |
| Dropout_i | 0.65 | 0.3 | 0 | 0 | 0 |
| Dropout_e | 0.1 | 0.05 | 0 | 0 | 0 |
| Weight_drop | 0.5 | 0.2 | 0 | 0 | 0 |
| Val PPL | 68.71 | 84.89 | 86.51 | 90.96 | 107.59 |
| Test PPL | 65.62 | 80.67 | 82.29 | 86.91 | 101.64 |
| Command | [1] | [2] | [3] | [4] | [5] |
| Training logs | log | log | log | log | log |

For all the above model settings, we set Tied = True and NTASGD = True.
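The dropout rows mirror the AWD-LSTM regularizers of Merity et al.: Dropout is applied to the LSTM outputs before the decoder, Dropout_h between LSTM layers, Dropout_i to the embedded input sequence, Dropout_e to whole embedding rows (word-level embedding dropout), and Weight_drop applies DropConnect to the recurrent hidden-to-hidden weight matrices. A minimal NumPy sketch of the weight-drop idea (illustrative only, not the script's implementation):

```python
import numpy as np

def weight_drop(w_hh, p, rng=np.random.default_rng(0)):
    """DropConnect on recurrent weights: zero each entry of the
    hidden-to-hidden matrix with probability p and rescale the rest.
    In AWD-LSTM the mask is typically resampled once per forward pass."""
    mask = rng.random(w_hh.shape) >= p
    return w_hh * mask / (1.0 - p)

w_hh = np.ones((4, 4))
print(weight_drop(w_hh, p=0.5))  # roughly half the weights zeroed, the rest scaled to 2.0
```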

[1] awd_lstm_lm_1150_wikitext-2 (Val PPL 68.71 Test PPL 65.62)

$ python word_language_model.py --gpu 0 --tied --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save awd_lstm_lm_1150_wikitext-2

[2] awd_lstm_lm_600_wikitext-2 (Val PPL 84.89 Test PPL 80.67)

$ python word_language_model.py --gpu 0 --emsize 200 --nhid 600 --epochs 750 --dropout 0.2 --dropout_h 0.1 --dropout_i 0.3 --dropout_e 0.05 --weight_drop 0.2 --tied --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save awd_lstm_lm_600_wikitext-2

[3] standard_lstm_lm_1500_wikitext-2 (Val PPL 86.51 Test PPL 82.29)

$ python word_language_model.py --gpu 0 --emsize 1500 --nhid 1500 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.65 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save standard_lstm_lm_1500_wikitext-2

[4] standard_lstm_lm_650_wikitext-2 (Val PPL 90.96 Test PPL 86.91)

$ python word_language_model.py --gpu 0 --emsize 650 --nhid 650 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.5 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save standard_lstm_lm_650_wikitext-2

[5] standard_lstm_lm_200_wikitext-2 (Val PPL 107.59 Test PPL 101.64)

$ python word_language_model.py --gpu 0 --emsize 200 --nhid 200 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.2 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save standard_lstm_lm_200_wikitext-2
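The Val PPL and Test PPL numbers above are word-level perplexities, i.e. the exponential of the average per-word negative log-likelihood (natural log) on the corresponding split. A minimal conversion helper:

```python
import math

def perplexity(total_nll, num_words):
    """Turn a summed negative log-likelihood (in nats) into perplexity."""
    return math.exp(total_nll / num_words)

# An average loss of about 4.18 nats per word corresponds to a perplexity near 65.
print(perplexity(total_nll=4.18 * 1000, num_words=1000))  # ~65.4
```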

Cache Language Model

Reference: Grave, E., et al. “Improving neural language models with a continuous cache”. ICLR 2017

The key features used to reproduce the results with the corresponding pre-trained models are listed in the following table.

The dataset used for training the models is WikiText-2.

| Model | cache_awd_lstm_lm_1150_wikitext-2 | cache_awd_lstm_lm_600_wikitext-2 | cache_standard_lstm_lm_1500_wikitext-2 | cache_standard_lstm_lm_650_wikitext-2 | cache_standard_lstm_lm_200_wikitext-2 |
|---|---|---|---|---|---|
| Pre-trained setting | Refer to: awd_lstm_lm_1150_wikitext-2 | Refer to: awd_lstm_lm_600_wikitext-2 | Refer to: standard_lstm_lm_1500_wikitext-2 | Refer to: standard_lstm_lm_650_wikitext-2 | Refer to: standard_lstm_lm_200_wikitext-2 |
| Val PPL | 53.41 | 64.51 | 65.54 | 68.47 | 77.51 |
| Test PPL | 51.46 | 62.19 | 62.79 | 65.85 | 73.74 |
| Command | [1] | [2] | [3] | [4] | [5] |
| Training logs | log | log | log | log | log |

For all the above model settings, we set lambdas = 0.1279, theta = 0.662, window = 2000, and bptt = 2000.
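These four values parameterize the continuous cache of Grave et al.: the final next-word distribution is a linear interpolation of the model's softmax and a cache distribution built from the hidden states of the last `window` time steps, with `lambdas` as the interpolation weight and `theta` scaling the cache scores. A minimal NumPy sketch of that mixing step (illustrative only, not the cache_language_model.py implementation):

```python
import numpy as np

def cache_distribution(h_t, past_h, past_targets, vocab_size, theta, lambdas, p_vocab):
    """Mix the model softmax with a continuous-cache distribution.

    h_t:          current hidden state, shape (d,)
    past_h:       hidden states inside the cache window, shape (window, d)
    past_targets: the word id that followed each cached state, shape (window,)
    p_vocab:      the LSTM's softmax over the vocabulary, shape (vocab_size,)
    """
    scores = theta * past_h.dot(h_t)            # similarity to each cached state
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    p_cache = np.zeros(vocab_size)
    np.add.at(p_cache, past_targets, weights)   # scatter weights onto word ids
    return (1.0 - lambdas) * p_vocab + lambdas * p_cache

# Toy demo: vocabulary of 5 words, cache window of 3 past states.
rng = np.random.default_rng(0)
p = cache_distribution(h_t=rng.standard_normal(4),
                       past_h=rng.standard_normal((3, 4)),
                       past_targets=np.array([2, 4, 2]),
                       vocab_size=5, theta=0.662, lambdas=0.1279,
                       p_vocab=np.full(5, 0.2))
print(p.sum())  # ~1.0
```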

[1] cache_awd_lstm_lm_1150_wikitext-2 (Val PPL 53.41 Test PPL 51.46)

$ python cache_language_model.py --gpus 0 --model_name awd_lstm_lm_1150

[2] cache_awd_lstm_lm_600_wikitext-2 (Val PPL 64.51 Test PPL 62.19)

$ python cache_language_model.py --gpus 0 --model_name awd_lstm_lm_600

[3] cache_standard_lstm_lm_1500_wikitext-2 (Val PPL 65.54 Test PPL 62.79)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_1500

[4] cache_standard_lstm_lm_650_wikitext-2 (Val PPL 68.47 Test PPL 65.85)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_650

[5] cache_standard_lstm_lm_200_wikitext-2 (Val PPL 77.51 Test PPL 73.74)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_200

Large Scale Word Language Model

Reference: Jozefowicz, R., et al. “Exploring the limits of language modeling”. arXiv preprint arXiv:1602.02410 (2016)

The key features used to reproduce the results for the pre-trained model are listed in the following table.

The dataset used for training the model is Google's 1 Billion Word dataset.

| Model | LSTM-2048-512 |
|---|---|
| Mode | LSTMP |
| Num layers | 1 |
| Embed size | 512 |
| Hidden size | 2048 |
| Projection size | 512 |
| Dropout | 0.1 |
| Learning rate | 0.2 |
| Num samples | 8192 |
| Batch size | 128 |
| Gradient clip | 10.0 |
| Test perplexity | 43.62 |
| Num epochs | 50 |
| Training logs | log |
| Evaluation logs | log |

[1] LSTM-2048-512 (Test PPL 43.62)

$ python large_word_language_model.py --gpus 0,1,2,3 --clip=10
$ python large_word_language_model.py --gpus 4 --eval-only --batch-size=1
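In the table above, Mode LSTMP means each LSTM layer's 2048-unit output is passed through a learned projection down to 512 units before being fed back recurrently and into the softmax, and Num samples = 8192 is the number of candidate words sampled per step to approximate the full softmax during training, as in Jozefowicz et al. A minimal NumPy sketch of the projection step, under those assumptions:

```python
import numpy as np

hidden_size, projection_size = 2048, 512
rng = np.random.default_rng(0)

# One time step of an LSTM-with-projection (LSTMP) output path:
# the raw cell output is reduced to 512 dimensions, and that projected
# vector is what the next time step and the (sampled) softmax consume.
h_raw = rng.standard_normal(hidden_size)                      # LSTM cell output
w_proj = rng.standard_normal((projection_size, hidden_size)) * 0.01
h_t = w_proj @ h_raw                                          # shape (512,)
print(h_t.shape)
```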