Scripts

Word Embedding Toolkit

[Download]

Gluon NLP makes it easy to evaluate and train word embeddings. This folder includes examples to evaluate the pre-trained embeddings included in the Gluon NLP toolkit as well as example scripts for training embeddings on custom datasets.

Word Embedding Evaluation

To evaluate a specific embedding on one or multiple datasets, you can use the included evaluate_pretrained.py as follows:

$ python evaluate_pretrained.py

Call the script with the --help option to get an overview of the supported options. We include a run_all.sh script to run the evaluation for the pretrained English GloVe and fastText embeddings included in GluonNLP.

$ run_all.sh

The resulting logs and a notebook containing a ranking for the different evaluation tasks are available here.
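
The evaluation can also be driven programmatically. The following is a minimal sketch of scoring a pretrained embedding on a single similarity dataset with the GluonNLP API; the glove.6B.50d source and the WordSim353 similarity segment are illustrative choices rather than what run_all.sh uses.

import mxnet as mx
import gluonnlp as nlp

# Pretrained embedding and a word similarity dataset (illustrative choices).
embedding = nlp.embedding.create('glove', source='glove.6B.50d')
dataset = nlp.data.WordSim353(segment='similarity')  # rows: [word1, word2, human score]

# Restrict the vocabulary to the words occurring in the dataset and attach the embedding.
counter = nlp.data.count_tokens(w for row in dataset for w in row[:2])
vocab = nlp.Vocab(counter)
vocab.set_embedding(embedding)

# Cosine similarity between the embedding vectors of each word pair.
evaluator = nlp.embedding.evaluation.WordEmbeddingSimilarity(
    idx_to_vec=vocab.embedding.idx_to_vec,
    similarity_function='CosineSimilarity')
evaluator.initialize()

words1 = mx.nd.array(vocab[[row[0] for row in dataset]])
words2 = mx.nd.array(vocab[[row[1] for row in dataset]])
predicted = evaluator(words1, words2)
print(predicted[:5])  # to be compared against the human scores in the third column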

Word Embedding Training

Besides loading pretrained embeddings, the Gluon NLP toolkit also makes it easy to train embeddings.

train_fasttext.py shows how to use Gluon NLP to train fastText or Word2Vec models. The script and parts of the Gluon NLP library support just-in-time compilation with numba, which is enabled automatically when numba is installed on the system. Please run pip install --upgrade numba to make sure training speed is not needlessly throttled by Python.
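
Training of this kind starts from a vocabulary and a subword function built over the corpus. The following is a small sketch of that data preparation step with the GluonNLP API; the min_freq value, n-gram range, and number of hash buckets are illustrative and not necessarily what train_fasttext.py uses.

import itertools
import gluonnlp as nlp

# Token counts and vocabulary over the Text8 corpus.
text8 = nlp.data.Text8(segment='train')
counter = nlp.data.count_tokens(itertools.chain.from_iterable(text8))
vocab = nlp.Vocab(counter, unknown_token=None, padding_token=None,
                  bos_token=None, eos_token=None, min_freq=5)

# Character n-gram hashes are the subword units that allow fastText to
# build vectors for words never seen during training.
subword_function = nlp.vocab.create_subword_function(
    'NGramHashes', ngrams=[3, 4, 5, 6], num_subwords=2000000)
subword_idxs = list(subword_function(['vector']))[0]
print(list(subword_idxs)[:5])  # a few subword hash ids for "vector"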

Word2Vec models were introduced by

  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.

FastText models were introduced by

  • Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. TACL, 5, 135–146.

We report the results obtained by running the train_fasttext.py script with default parameters. You can reproduce these results by running python train_fasttext.py --gpu 0. For comparison, we also report the results obtained by training fastText with the facebookresearch/fastText implementation. All results are obtained by training for 5 epochs on the Text8 dataset.

Similarity Dataset        | facebookresearch/fasttext | train_fasttext.py
------------------------- | ------------------------- | -----------------
WordSim353-similarity     | 0.65275   | 0.687187
WordSim353-relatedness    | 0.540742  | 0.612768
MEN (test set)            | 0.659031  | 0.679318
RadinskyMTurk             | 0.638946  | 0.619085
RareWords                 | 0.40731   | 0.398834
SimLex999                 | 0.314253  | 0.309361
SimVerb3500               | 0.187372  | 0.190025
SemEval17Task2 (test set) | 0.535899  | 0.533027
BakerVerb143              | 0.419168  | 0.478791
YangPowersVerb130         | 0.429905  | 0.437008

Google Analogy Dataset      | facebookresearch/fasttext | train_fasttext.py
--------------------------- | ------------------------- | -----------------
capital-common-countries    | 0.337945  | 0.405138
capital-world               | 0.0935013 | 0.159151
currency                    | 0.0230947 | 0.0427252
city-in-state               | 0.039319  | 0.06364
family                      | 0.3083    | 0.300395
gram1-adjective-to-adverb   | 0.694556  | 0.699597
gram2-opposite              | 0.76601   | 0.713054
gram3-comparative           | 0.721471  | 0.750751
gram4-superlative           | 0.727273  | 0.574866
gram5-present-participle    | 0.5625    | 0.407197
gram6-nationality-adjective | 0.829268  | 0.826141
gram7-past-tense            | 0.173718  | 0.194872
gram8-plural                | 0.760511  | 0.848348
gram9-plural-verbs          | 0.752874  | 0.736782

Loading of fastText models with subword information

fastText models trained with the facebookresearch implementation are exported in both a text and a binary format. Unlike the text format, the binary format preserves information about subword units and consequently supports computing word vectors for words that were unknown during training (and are not included in the text format). Besides training new fastText embeddings with Gluon NLP, it is also possible to load the binary format into a Block provided by the Gluon NLP toolkit using FasttextEmbeddingModel.load_fasttext_format.
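
A minimal sketch of loading such a binary file is shown below; the wiki.simple.bin path is only an example, and any .bin file produced by the facebookresearch implementation should work.

import gluonnlp as nlp

# Load a fastText binary model, which retains the subword n-gram tables
# (the path is an example; point it at any fastText .bin file).
model = nlp.model.train.FasttextEmbeddingModel.load_fasttext_format('wiki.simple.bin')

# Because the subword information is preserved, vectors can be computed
# even for words that never occurred in the training data.
vectors = model[['hello', 'helloooooo']]
print(vectors.shape)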

Beam Search Generator

[Download]

This script can be used to generate sentences using beam search from a pretrained language model.
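
Under the hood the generation is driven by GluonNLP's beam search sampler. The following is a rough sketch of that API; it assumes the awd_lstm_lm_1150 wikitext-2 model as the language model, so check beam_search_generator.py for the exact model and wiring used by the script.

import mxnet as mx
import gluonnlp as nlp

# Pretrained language model and its vocabulary (assumed choice).
model, vocab = nlp.model.get_model('awd_lstm_lm_1150',
                                   dataset_name='wikitext-2',
                                   pretrained=True)

# The sampler expects a decoder callable that maps (step_input, states)
# to (scores over the vocabulary, new states).
class LMDecoder(object):
    def __init__(self, lm):
        self._lm = lm
    def __call__(self, inputs, states):
        outputs, states = self._lm(mx.nd.expand_dims(inputs, axis=0), states)
        return outputs[0], states
    def state_info(self, *args, **kwargs):
        return self._lm.state_info(*args, **kwargs)

scorer = nlp.model.BeamSearchScorer(alpha=0.0, K=5, from_logits=False)  # matches the log below
sampler = nlp.model.BeamSearchSampler(beam_size=5,
                                      decoder=LMDecoder(model),
                                      eos_id=vocab['.'],
                                      scorer=scorer,
                                      max_length=20)
# To generate, feed the --bos prefix through the model to obtain initial
# states, then call sampler(last_token_ids, states).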

Use the following command to generate sentences:

$ python beam_search_generator.py --bos I love it --beam_size 5

Output is

Beam Seach Parameters: beam_size=5, alpha=0.0, K=5
Generation Result:
['I love it , but it is not clear that it will be difficult to do it , but it is not a .', 243.20294]
['I love it , but it is not clear that it will be difficult to do it , so it is not a .', 242.4809]
['I love it , but it is not clear that it will be difficult to do so , but it is not a .', 242.45113]

You can also try a larger beam size.

$ python beam_search_generator.py --bos I love it --beam_size 10

Output is

Beam Seach Parameters: beam_size=10, alpha=0.0, K=5
Generation Result:
['I love it , but it is not possible to do it , but it is not impossible to do it , but .', 246.26108]
['I love it , but it is not possible to do it , but it is not impossible to do it , and .', 245.80142]
["I love it , but it is not possible to do it , but I 'm not going to do it , but .", 245.55646]

Try a beam size of 15:

$ python beam_search_generator.py --bos I love it --beam_size 15

Output is

Beam Seach Parameters: beam_size=15, alpha=0.0, K=5
Generation Result:
["I love it , and I don 't know how to do it , but I don ’ t think it would be .", 274.9892]
["I love it , and I don 't know how to do it , but I don ’ t think it will be .", 274.63895]
["I love it , and I don 't know how to do it , but I don ’ t want to do it .", 274.61063]

Language Model

Word Language Model

Reference: Merity, S., et al. “Regularizing and optimizing LSTM language models”. ICLR 2018

[Download]

The key features used to reproduce the results for the pre-trained models are listed in the following table.

The dataset used for training the models is wikitext-2.

Model         | awd_lstm_lm_1150_wikitext-2 | awd_lstm_lm_600_wikitext-2 | standard_lstm_lm_1500_wikitext-2 | standard_lstm_lm_650_wikitext-2 | standard_lstm_lm_200_wikitext-2
------------- | --------------------------- | -------------------------- | -------------------------------- | ------------------------------- | -------------------------------
Mode          | LSTM  | LSTM  | LSTM  | LSTM  | LSTM
Num_layers    | 3     | 3     | 2     | 2     | 2
Embed size    | 400   | 200   | 1500  | 650   | 200
Hidden size   | 1150  | 600   | 1500  | 650   | 200
Dropout       | 0.4   | 0.2   | 0.65  | 0.5   | 0.2
Dropout_h     | 0.2   | 0.1   | 0     | 0     | 0
Dropout_i     | 0.65  | 0.3   | 0     | 0     | 0
Dropout_e     | 0.1   | 0.05  | 0     | 0     | 0
Weight_drop   | 0.5   | 0.2   | 0     | 0     | 0
Tied          | True  | True  | True  | True  | True
Val PPL       | 73.32 | 84.61 | 98.29 | 98.96 | 108.25
Test PPL      | 69.74 | 80.96 | 92.83 | 93.90 | 102.26
Command       | [1]   | [2]   | [3]   | [4]   | [5]
Training logs | log   | log   | log   | log   | log

[1] awd_lstm_lm_1150_wikitext-2 (Val PPL 73.32 Test PPL 69.74)

$ python -u word_language_model.py --gpus 0 --tied --save awd_lstm_lm_1150_wikitext-2

[2] awd_lstm_lm_600_wikitext-2 (Val PPL 84.61 Test PPL 80.96)

$ python -u word_language_model.py --gpus 0 --emsize 200 --nhid 600 --epochs 750 --dropout 0.2 --dropout_h 0.1 --dropout_i 0.3 --dropout_e 0.05 --weight_drop 0.2 --tied --save awd_lstm_lm_600_wikitext-2

[3] standard_lstm_lm_1500_wikitext-2 (Val PPL 98.29 Test PPL 92.83)

$ python -u word_language_model.py --gpus 0 --emsize 1500 --nhid 1500 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.65 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --save standard_lstm_lm_1500_wikitext-2

[4] standard_lstm_lm_650_wikitext-2 (Val PPL 98.96 Test PPL 93.90)

$ python -u word_language_model.py --gpus 0 --emsize 650 --nhid 650 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.5 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --save standard_lstm_lm_650_wikitext-2

[5] standard_lstm_lm_200_wikitext-2 (Val PPL 108.25 Test PPL 102.26)

$ python -u word_language_model.py --gpus 0 --emsize 200 --nhid 200 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.2 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --save standard_lstm_lm_200_wikitext-2
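
The pre-trained models above are also available from the GluonNLP model zoo, so they can be loaded directly for inference. A minimal sketch of loading one of them and running a forward pass (the token sequence is arbitrary):

import mxnet as mx
import gluonnlp as nlp

# Pretrained AWD-LSTM together with the matching wikitext-2 vocabulary.
model, vocab = nlp.model.get_model('awd_lstm_lm_1150',
                                   dataset_name='wikitext-2',
                                   pretrained=True)

# Forward pass: the input shape is (sequence_length, batch_size).
tokens = ['the', 'quick', 'brown', 'fox']
inputs = mx.nd.array(vocab[tokens]).reshape(-1, 1)
hidden = model.begin_state(batch_size=1, func=mx.nd.zeros)
output, hidden = model(inputs, hidden)
print(output.shape)  # (4, 1, vocab_size): a score over the vocabulary per position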

Cache Language Model

Reference: Grave, E., et al. “Improving neural language models with a continuous cache”. ICLR 2017

[Download]

The key features used to reproduce the results for the corresponding pre-trained models are listed in the following table.

The dataset used for training the models is wikitext-2.

Model              | cache_awd_lstm_lm_1150_wikitext-2 | cache_awd_lstm_lm_600_wikitext-2 | cache_standard_lstm_lm_1500_wikitext-2 | cache_standard_lstm_lm_650_wikitext-2 | cache_standard_lstm_lm_200_wikitext-2
------------------ | --------------------------------- | -------------------------------- | -------------------------------------- | ------------------------------------- | -------------------------------------
Pretrained setting | Refer to: awd_lstm_lm_1150_wikitext-2 | Refer to: awd_lstm_lm_600_wikitext-2 | Refer to: standard_lstm_lm_1500_wikitext-2 | Refer to: standard_lstm_lm_650_wikitext-2 | Refer to: standard_lstm_lm_200_wikitext-2
lambdas            | 0.1279 | 0.1279 | 0.1279 | 0.1279 | 0.1279
theta              | 0.662  | 0.662  | 0.662  | 0.662  | 0.662
window             | 2000   | 2000   | 2000   | 2000   | 2000
bptt               | 2000   | 2000   | 2000   | 2000   | 2000
Val PPL            | 56.67  | 64.51  | 71.92  | 69.57  | 77.51
Test PPL           | 54.51  | 62.19  | 68.71  | 66.52  | 73.74
Command            | [1]    | [2]    | [3]    | [4]    | [5]
Training logs      | log    | log    | log    | log    | log

[1] cache_awd_lstm_lm_1150_wikitext-2 (Val PPL 56.67 Test PPL 54.51)

$ python -u cache_language_model.py --gpus 0 --save awd_lstm_lm_1150

[2] cache_awd_lstm_lm_600_wikitext-2 (Val PPL 64.51 Test PPL 62.19)

$ python -u cache_language_model.py --gpus 0 --save awd_lstm_lm_600

[3] cache_standard_lstm_lm_1500_wikitext-2 (Val PPL 71.92 Test PPL 68.71)

$ python -u cache_language_model.py --gpus 0 --save standard_lstm_lm_1500

[4] cache_standard_lstm_lm_650_wikitext-2 (Val PPL 69.57 Test PPL 66.52)

$ python -u cache_language_model.py --gpus 0 --save standard_lstm_lm_650

[5] cache_standard_lstm_lm_200_wikitext-2 (Val PPL 77.51 Test PPL 73.74)

$ python -u cache_language_model.py --gpus 0 --save standard_lstm_lm_200
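
GluonNLP also exposes the cache wrapper programmatically. The following is a rough sketch of constructing a cache cell around the pretrained AWD-LSTM with the hyperparameters from the table above; treat the exact helper signature as an assumption and use cache_language_model.py as the authoritative reference.

import mxnet as mx
import gluonnlp as nlp

# Wrap a pretrained language model with a continuous cache, using the
# window/theta/lambdas values from the table above.
cache_cell = nlp.model.train.get_cache_model(name='awd_lstm_lm_1150',
                                             dataset_name='wikitext-2',
                                             window=2000,
                                             theta=0.662,
                                             lambdas=0.1279,
                                             ctx=mx.cpu())
print(cache_cell)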

Large Scale Word Language Model

Reference: Jozefowicz, Rafal, et al. “Exploring the limits of language modeling”. arXiv preprint arXiv:1602.02410 (2016).

The key features used to reproduce the results for the pre-trained model are listed in the following table.

The dataset used for training the models is Google’s 1 billion words dataset.

Model           | LSTM-2048-512
--------------- | -------------
Num layers      | 1
Embed size      | 512
Hidden size     | 2048
Batch size      | 256
Gradient clip   | 10.0
Projection size | 512
Dropout         | 0.1
Learning rate   | 0.2
Num samples     | 8192
Test perplexity | 44.05
Num epochs      | 48

[1] LSTM-2048-512 (Test PPL 44.05)

$ pip install cython
$ make
$ python large_word_language_model.py --gpus 0,1,2,3 --epochs=48 --batch-size=256 --clip=10
$ python large_word_language_model.py --gpus 0 --eval-only --batch-size=32 --log-interval=1

Sentiment Analysis through Fine-tuning, w/ Bucketing

[Download]

This script can be used to train a sentiment analysis model from scratch, or to fine-tune a pre-trained language model. The pre-trained language models are loaded from the Gluon NLP Toolkit model zoo. It also showcases how to use different bucketing strategies to speed up training.
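
The bucketing itself is handled by GluonNLP samplers. Below is a small, self-contained sketch of the fixed bucketing strategy on made-up review lengths, just to illustrate how samples of similar length end up in the same batch.

import gluonnlp as nlp

# Toy illustration of fixed bucketing: sequences of similar length are
# grouped so that the padding inside each batch stays small.
lengths = [32, 55, 120, 64, 500, 250, 73, 81]  # made-up token counts per review
sampler = nlp.data.FixedBucketSampler(lengths=lengths,
                                      batch_size=2,
                                      num_buckets=3,
                                      shuffle=True)
print(sampler.stats())          # bucket sizes and padding efficiency
for batch_indices in sampler:   # sample indices grouped by similar length
    print(batch_indices)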

Use the following command to run without using a pretrained model:

$ python sentiment_analysis.py --gpu 0 --batch_size 16 --bucket_type fixed --epochs 3 --dropout 0 --no_pretrained --lr 0.005 --valid_ratio 0.1 --save-prefix imdb_lstm_200  # Test Accuracy 85.36

Use the following command to run with a pretrained model:

$ python sentiment_analysis.py --gpu 0 --batch_size 16 --bucket_type fixed --epochs 3 --dropout 0 --lr 0.005 --valid_ratio 0.1 --save-prefix imdb_lstm_200  # Test Accuracy 87.41

Machine Translation

[Download]

Use the following command to train the GNMT model on the IWSLT2015 dataset.

$ python train_gnmt.py --src_lang en --tgt_lang vi --batch_size 128 \
                --optimizer adam --lr 0.001 --lr_update_factor 0.5 --beam_size 10 --bucket_scheme exp \
                --num_hidden 512 --save_dir gnmt_en_vi_l2_h512_beam10 --epochs 12 --gpu 0

It achieves a test BLEU score of 26.20.
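
The dataset itself can be inspected through the GluonNLP data API. A short sketch, assuming the IWSLT2015 loader with the same language pair as the command above:

import gluonnlp as nlp

# IWSLT2015 English-Vietnamese training split, as used by the GNMT command above.
train = nlp.data.IWSLT2015(segment='train', src_lang='en', tgt_lang='vi')
print(len(train))       # number of sentence pairs
print(train[0])         # (English sentence, Vietnamese sentence)
print(len(train.src_vocab), len(train.tgt_vocab))  # bundled vocabularies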

Use the following commands to train the Transformer model on the WMT14 dataset for English to German translation.

$ python train_transformer.py --dataset WMT2014BPE --src_lang en --tgt_lang de --batch_size 4096 \
                       --optimizer adam --num_accumulated 8 --lr 1.0 --warmup_steps 8000 \
                       --save_dir transformer_en_de_u512 --epochs 40 --gpus 0,1,2,3 --scaled \
                       --average_start 5 --num_buckets 20 --bucket_scheme exp --bleu 13a

It achieves an official mteval-v13a BLEU score of 26.95 on newstest2014 (http://statmt.org/wmt14/test-filtered.tgz). This result is obtained by using averaged SGD over the last 5 epochs. If we use international tokenization (i.e., --bleu intl), we obtain a BLEU score of 27.75. If we use --bleu tweaked, we obtain a test BLEU score of 28.81. The latter result is computed against a tweaked reference, where the tokenized reference text is put in ATAT format for historical reasons and the following preprocessing pipeline is applied:

mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l de
mosesdecoder/scripts/tokenizer/remove-non-printing-char.perl
mosesdecoder/scripts/tokenizer/tokenizer.perl -q -no-escape -protected mosesdecoder/scripts/tokenizer/basic-protected-patterns -l de

If we turn on --full, the testing is performed on newstest2014 (http://statmt.org/wmt14/test-full.tgz). Then we obtain BLEU=26.89 with --bleu 13a, BLEU=27.66 with --bleu intl, and BLEU=28.63 with --bleu tweaked.