Scripts

Word Embedding Toolkit

Gluon NLP makes it easy to evaluate and train word embeddings. This folder includes examples to evaluate the pre-trained embeddings included in the Gluon NLP toolkit as well as example scripts for training embeddings on custom datasets.

Word Embedding Evaluation

To evaluate a specific embedding on one or multiple datasets you can use the included evaluate_pretrained.py as follows.

$ python evaluate_pretrained.py

Call the script with the --help option to get an overview of the supported options. We include a run_all.sh script to run the evaluation for the pre-trained English GloVe and fastText embeddings included in GluonNLP.

$ run_all.sh

The resulting logs and a notebook containing a ranking for the different evaluation tasks are available here.
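
Word similarity datasets pair words with human ratings, and a common way to score an embedding on them is the Spearman rank correlation between cosine similarities and those ratings. The following is a minimal sketch of that idea on toy data. It is not the code in evaluate_pretrained.py, and the vectors, pairs and function names are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy embedding table and toy human ratings (hypothetical values).
toy_vectors = {
    'cat': [1.0, 0.1, 0.0],
    'dog': [0.9, 0.2, 0.1],
    'car': [0.0, 1.0, 0.2],
    'bus': [0.3, 0.8, 0.4],
}
toy_pairs = [('cat', 'dog', 9.0), ('car', 'bus', 8.5),
             ('cat', 'car', 2.0), ('dog', 'bus', 2.5)]

predicted = [cosine(toy_vectors[a], toy_vectors[b]) for a, b, _ in toy_pairs]
gold = [rating for _, _, rating in toy_pairs]
score = spearman(predicted, gold)
```

Rank correlation only compares orderings, so the absolute scale of the human ratings does not matter.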

Word Embedding Training

Besides loading pre-trained embeddings, the Gluon NLP toolkit also makes it easy to train embeddings.

The following code block shows how to use Gluon NLP to train fastText or Word2Vec models. The script and parts of the Gluon NLP library support just-in-time compilation with numba, which is enabled automatically when numba is installed on the system. Run pip install --upgrade numba to make sure training speed is not needlessly throttled by Python.

$ python train_fasttext.py

Word2Vec models were introduced by

  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.

FastText models were introduced by

  • Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. TACL, 5, 135–146.

We report the results obtained by running the train_fasttext.py script with default parameters. You can reproduce these results by running python train_fasttext.py --gpu 0. For comparison, we also report the results obtained by training fastText with the facebookresearch/fastText implementation. All results are obtained by training for 5 epochs on the Text8 dataset.

Similarity Dataset            facebookresearch/fasttext   train_fasttext.py
WordSim353-similarity         0.670                       0.685
WordSim353-relatedness        0.557                       0.592
MEN (test set)                0.665                       0.629
RadinskyMTurk                 0.640                       0.609
RareWords                     0.400                       0.429
SimLex999                     0.300                       0.323
SimVerb3500                   0.170                       0.191
SemEval17Task2 (test set)     0.540                       0.566
BakerVerb143                  0.390                       0.363
YangPowersVerb130             0.424                       0.366

Google Analogy Dataset        facebookresearch/fasttext   train_fasttext.py
capital-common-countries      0.581                       0.470
capital-world                 0.176                       0.148
currency                      0.046                       0.043
city-in-state                 0.100                       0.076
family                        0.375                       0.342
gram1-adjective-to-adverb     0.695                       0.663
gram2-opposite                0.539                       0.700
gram3-comparative             0.523                       0.740
gram4-superlative             0.523                       0.535
gram5-present-participle      0.480                       0.399
gram6-nationality-adjective   0.830                       0.830
gram7-past-tense              0.269                       0.200
gram8-plural                  0.703                       0.860
gram9-plural-verbs            0.575                       0.800
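
The Google Analogy rows above score questions of the form "a is to b as c is to ?". One widely used way to answer such questions is the 3CosAdd rule, which picks the word whose vector is closest in cosine similarity to b - a + c. The sketch below illustrates that rule on toy vectors. Whether the evaluation script uses exactly this rule is an assumption, and all names and values here are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def solve_analogy(vectors, a, b, c):
    """Return the word d maximizing cos(d, b - a + c), excluding a, b and c."""
    va, vb, vc = (normalize(vectors[w]) for w in (a, b, c))
    target = normalize([y - x + z for x, y, z in zip(va, vb, vc)])
    best_word, best_score = None, -float('inf')
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the query words themselves are not valid answers
        score = sum(p * q for p, q in zip(normalize(vec), target))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Toy vectors (hypothetical): dimension 1 loosely encodes gender.
toy = {
    'man': [1.0, 0.0, 0.2],
    'woman': [1.0, 1.0, 0.2],
    'king': [0.2, 0.0, 1.0],
    'queen': [0.2, 1.0, 1.0],
    'apple': [-1.0, 0.1, -0.5],
}
answer = solve_analogy(toy, 'man', 'woman', 'king')
```

Accuracy on a category is then the fraction of questions for which the returned word matches the expected one.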

Loading of fastText models with subword information

fastText models trained with the facebookresearch library are exported in both a text and a binary format. Unlike the text format, the binary format preserves information about subword units and consequently supports computing word vectors for words that were unknown during training (and are therefore not included in the text format). Besides training new fastText embeddings with Gluon NLP, it is also possible to load the binary format into a Block provided by the Gluon NLP toolkit using FasttextEmbeddingModel.load_fasttext_format.
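
To illustrate why subword information allows vectors for unseen words, the sketch below builds a vector for an out-of-vocabulary word by averaging the vectors of its character n-grams, which are looked up through a hash into a fixed number of buckets. This is a simplified illustration and not the Gluon NLP or fastText implementation: the hash function, the n-gram range of 3 to 6, the bucket count and the random table are all assumptions.

```python
import random

def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = '<' + word + '>'
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

def bucket(ngram, num_buckets):
    """Deterministic FNV-1a style toy hash; real implementations may differ."""
    h = 2166136261
    for byte in ngram.encode('utf-8'):
        h = ((h ^ byte) * 16777619) % (1 << 32)
    return h % num_buckets

def oov_vector(word, table, num_buckets):
    """Average the bucket vectors of all character n-grams of the word."""
    idx = [bucket(g, num_buckets) for g in char_ngrams(word)]
    dim = len(table[0])
    return [sum(table[i][d] for i in idx) / len(idx) for d in range(dim)]

# Hypothetical subword table; a trained model would supply learned vectors.
rng = random.Random(0)
NUM_BUCKETS, DIM = 1000, 4
subword_table = [[rng.uniform(-1, 1) for _ in range(DIM)]
                 for _ in range(NUM_BUCKETS)]

vec = oov_vector('gluonnlp', subword_table, NUM_BUCKETS)
```

The text format stores only one vector per vocabulary word, so this lookup is not possible from it.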

Sequence Sampling

This script can be used to generate sentences using beam search from a pre-trained language model.

Beam Search Generator

Use the following command to decode using beam search.

$ python sequence_sampling.py --use-beam-search --bos I love it --beam_size 5 --print_num 5

Output is

Beam Seach Parameters: beam_size=5, alpha=0.0, K=5
Generation Result:
[u'I love it .', -1.1241297]
[u'I love it " .', -4.001592]
[u'I love it , but it is not a <unk> .', -15.624882]
[u'I love it , but it is not a <unk> , but it is not a <unk> .', -28.37084]
[u'I love it , but it is not a <unk> , and it is not a <unk> .', -28.826918]

You can also try a larger beam size, such as 15.

$ python sequence_sampling.py --use-beam-search --bos I love it --beam_size 15 --print_num 15

Output is

Beam Seach Parameters: beam_size=15, alpha=0.0, K=5
Generation Result:
[u'I love it .', -1.1241297]
[u'I love it " .', -4.001592]
[u'I love it as a <unk> .', -8.038588]
[u"I love it , and I don 't know how to do it .", -15.407309]
[u"I love it , and I don 't want to do it .", -15.887625]
[u"I love it , and I don 't know what it is .", -15.91673]
[u"I love it , and I don 't know how to do so .", -16.780586]
[u"I love it , and I don 't know how to do that .", -16.98329]
[u"I love it , and I don 't think it is a <unk> .", -17.490877]
[u"I love it , and I don 't think it would be a <unk> .", -19.416945]
[u"I love it , and I don 't know how to do it , but I don 't know how to do it .", -28.04979]
[u"I love it , and I don 't know how to do it , but I don 't think it is a <unk> .", -29.397102]
[u"I love it , and I don 't know how to do it , but I don 't think it 's a good .", -29.406847]
[u"I love it , and I don 't know how to do it , but I don 't think it is a good .", -29.413773]
[u"I love it , and I don 't know how to do it , but I don 't think it 's a lot .", -29.43183]
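
The idea behind beam search can be sketched in a few lines: at every step each kept prefix is extended by every possible next token, and only the beam_size highest-scoring candidates survive. The example below is an illustrative toy and not the code in sequence_sampling.py. The step function, its probability table and the lack of length normalization are assumptions made for the sketch.

```python
import math

def beam_search(step_fn, bos, eos, beam_size, max_len):
    """Keep the beam_size best prefixes per step; scores are summed log-probs."""
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in step_fn(tokens).items():
                candidates.append((tokens + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            # Hypotheses ending in eos are set aside, the rest stay in the beam.
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]

def toy_model(tokens):
    """Hypothetical next-token log-probabilities that depend on the last token."""
    table = {
        'I': {'love': math.log(0.9), 'like': math.log(0.1)},
        'love': {'it': math.log(0.6), 'you': math.log(0.4)},
        'like': {'it': math.log(1.0)},
        'it': {'.': math.log(0.7), ',': math.log(0.3)},
        'you': {'.': math.log(1.0)},
        ',': {'it': math.log(1.0)},
    }
    return table[tokens[-1]]

results = beam_search(toy_model, bos='I', eos='.', beam_size=2, max_len=6)
```

A larger beam keeps more low-probability prefixes alive, which is why the beam size 15 run above returns a wider range of sentences.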

Sequence Sampler

Use the following command to sample from the multinomial distribution produced by a softmax with temperature 1.0.

$ python sequence_sampling.py --use-sampling --bos I love it --beam_size 5 --print_num 5 --temperature 1.0

Output is

Sampling Parameters: beam_size=5, temperature=1.0
Generation Result:
[u'I love it and martial arts , history , and communism ; it is seems to be probably a date .', -76.772766]
[u'I love it in all @-@ bodied households but like those who got part in the concept of refugee peoples , and had .', -96.42722]
[u'I love it for adult people .', -17.899687]
[u"I love it I think it 's through the side that we are going to mean the world it else .", -69.61122]
[u'I love it in late arrangement .', -22.287495]

You can also try a lower temperature such as 0.95, which results in a sharper distribution.

$ python sequence_sampling.py --use-sampling --bos I love it --beam_size 5 --print_num 5 --temperature 0.95

Output is

Sampling Parameters: beam_size=5, temperature=0.95
Generation Result:
[u'I love it .', -1.1241297]
[u'I love it and then it pays me serious from what he writes .', -45.79579]
[u"I love it as if this was from now <unk> , good as to the grounds of ' Hoyt ' where it had .", -91.47732]
[u'I love it be an action .', -19.657116]
[u'I love it and now leads to his best resulted in a shift between the two were announced in 2006 .', -71.7838]
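
The effect of the temperature can be seen in a small standalone sketch: the logits are divided by the temperature before the softmax, so values below 1.0 concentrate probability on the most likely tokens. This is an illustration with made-up logits, not the sampler used by sequence_sampling.py.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, rng):
    """Draw one index from a categorical distribution by inverting the CDF."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off

logits = [2.0, 1.0, 0.0]  # hypothetical scores for a three-token vocabulary
p_base = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.5)

rng = random.Random(0)
draws = [sample_index(p_sharp, rng) for _ in range(5)]
```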

Language Model

Word Language Model

Reference: Merity, S., et al. “Regularizing and optimizing LSTM language models”. ICLR 2018

The key features used to reproduce the results for pre-trained models are listed in the following tables.

The dataset used for training the models is wikitext-2.

Model          awd_lstm_lm_1150_wikitext-2  awd_lstm_lm_600_wikitext-2  standard_lstm_lm_1500_wikitext-2  standard_lstm_lm_650_wikitext-2  standard_lstm_lm_200_wikitext-2
Mode           LSTM                         LSTM                        LSTM                              LSTM                             LSTM
Num_layers     3                            3                           2                                 2                                2
Embed size     400                          200                         1500                              650                              200
Hidden size    1150                         600                         1500                              650                              200
Dropout        0.4                          0.2                         0.65                              0.5                              0.2
Dropout_h      0.2                          0.1                         0                                 0                                0
Dropout_i      0.65                         0.3                         0                                 0                                0
Dropout_e      0.1                          0.05                        0                                 0                                0
Weight_drop    0.5                          0.2                         0                                 0                                0
Val PPL        68.71                        84.89                       86.51                             90.96                            107.59
Test PPL       65.62                        80.67                       82.29                             86.91                            101.64
Command        [1]                          [2]                         [3]                               [4]                              [5]
Training logs  log                          log                         log                               log                              log

For all the above model settings, we set Tied = True and NTASGD = True.

[1] awd_lstm_lm_1150_wikitext-2 (Val PPL 68.71 Test PPL 65.62)
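
The Val PPL and Test PPL rows report perplexity, which is the exponential of the average negative log-likelihood per token. A minimal sketch of that computation follows. The function name and the toy log-probabilities are illustrative and are not taken from word_language_model.py.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that spreads probability uniformly over 50 tokens has perplexity 50.
uniform = [math.log(1.0 / 50)] * 10
ppl = perplexity(uniform)
```

Lower perplexity means the model assigns higher probability to the held-out text.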

$ python word_language_model.py --gpu 0 --tied --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save awd_lstm_lm_1150_wikitext-2

[2] awd_lstm_lm_600_wikitext-2 (Val PPL 84.89 Test PPL 80.67)

$ python word_language_model.py --gpu 0 --emsize 200 --nhid 600 --epochs 750 --dropout 0.2 --dropout_h 0.1 --dropout_i 0.3 --dropout_e 0.05 --weight_drop 0.2 --tied --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save awd_lstm_lm_600_wikitext-2

[3] standard_lstm_lm_1500_wikitext-2 (Val PPL 86.51 Test PPL 82.29)

$ python word_language_model.py --gpu 0 --emsize 1500 --nhid 1500 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.65 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save standard_lstm_lm_1500_wikitext-2

[4] standard_lstm_lm_650_wikitext-2 (Val PPL 90.96 Test PPL 86.91)

$ python word_language_model.py --gpu 0 --emsize 650 --nhid 650 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.5 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save standard_lstm_lm_650_wikitext-2

[5] standard_lstm_lm_200_wikitext-2 (Val PPL 107.59 Test PPL 101.64)

$ python word_language_model.py --gpu 0 --emsize 200 --nhid 200 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.2 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 50 --lr_update_factor 0.5 --save standard_lstm_lm_200_wikitext-2

Cache Language Model

Reference: Grave, E., et al. “Improving neural language models with a continuous cache”. ICLR 2017

The key features used to reproduce the results based on the corresponding pre-trained models are listed in the following tables.

The dataset used for training the models is wikitext-2.

Model                cache_awd_lstm_lm_1150_wikitext-2      cache_awd_lstm_lm_600_wikitext-2      cache_standard_lstm_lm_1500_wikitext-2      cache_standard_lstm_lm_650_wikitext-2      cache_standard_lstm_lm_200_wikitext-2
Pre-trained setting  Refer to: awd_lstm_lm_1150_wikitext-2  Refer to: awd_lstm_lm_600_wikitext-2  Refer to: standard_lstm_lm_1500_wikitext-2  Refer to: standard_lstm_lm_650_wikitext-2  Refer to: standard_lstm_lm_200_wikitext-2
Val PPL              53.41                                  64.51                                 65.54                                       68.47                                      77.51
Test PPL             51.46                                  62.19                                 62.79                                       65.85                                      73.74
Command              [1]                                    [2]                                   [3]                                         [4]                                        [5]
Training logs        log                                    log                                   log                                         log                                        log

For all the above model settings, we set lambdas = 0.1279, theta = 0.662, window = 2000 and bptt = 2000.
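
The roles of these settings can be sketched following the continuous cache idea of Grave et al.: past hidden states within the window are compared with the current hidden state, the similarities (scaled by theta) are turned into a distribution over the tokens that followed those states, and that cache distribution is mixed with the model distribution using weight lambda. The sketch below is a toy illustration on hand-made vectors. It is not the code in cache_language_model.py, and the mapping of the settings to these symbols is an assumption.

```python
import math

def cache_distribution(history, query, theta, vocab_size):
    """history holds (hidden_vector, next_token_id) pairs from the cache window."""
    weights = [0.0] * vocab_size
    if not history:
        return weights
    scores = [theta * sum(a * b for a, b in zip(h, query)) for h, _ in history]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    for e, (_, token) in zip(exps, history):
        weights[token] += e / total  # mass goes to the token that followed h
    return weights

def interpolate(p_model, p_cache, lam):
    """Linear interpolation: (1 - lam) * p_model + lam * p_cache."""
    return [(1.0 - lam) * pm + lam * pc for pm, pc in zip(p_model, p_cache)]

# Toy data: token 2 followed hidden states similar to the current one.
history = [([1.0, 0.0], 2), ([0.0, 1.0], 1), ([0.9, 0.1], 2)]
p_model = [0.7, 0.2, 0.1]
p_cache = cache_distribution(history, query=[1.0, 0.0], theta=0.662, vocab_size=3)
p_final = interpolate(p_model, p_cache, lam=0.1279)
```

In this toy case the cache raises the probability of token 2, which the base model alone considered unlikely.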

[1] cache_awd_lstm_lm_1150_wikitext-2 (Val PPL 53.41 Test PPL 51.46)

$ python cache_language_model.py --gpus 0 --model_name awd_lstm_lm_1150

[2] cache_awd_lstm_lm_600_wikitext-2 (Val PPL 64.51 Test PPL 62.19)

$ python cache_language_model.py --gpus 0 --model_name awd_lstm_lm_600

[3] cache_standard_lstm_lm_1500_wikitext-2 (Val PPL 65.54 Test PPL 62.79)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_1500

[4] cache_standard_lstm_lm_650_wikitext-2 (Val PPL 68.47 Test PPL 65.85)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_650

[5] cache_standard_lstm_lm_200_wikitext-2 (Val PPL 77.51 Test PPL 73.74)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_200

Large Scale Word Language Model

Reference: Jozefowicz, Rafal, et al. “Exploring the limits of language modeling”. arXiv preprint arXiv:1602.02410 (2016).

The key features used to reproduce the results for pre-trained models are listed in the following tables.

The dataset used for training the models is Google’s 1 billion words dataset.

Model LSTM-2048-512
Mode LSTMP
Num layers 1
Embed size 512
Hidden size 2048
Projection size 512
Dropout 0.1
Learning rate 0.2
Num samples 8192
Batch size 128
Gradient clip 10.0
Test perplexity 43.62
Num epochs 50
Training logs log
Evaluation logs log

[1] LSTM-2048-512 (Test PPL 43.62)

$ python large_word_language_model.py --gpus 0,1,2,3 --clip=10
$ python large_word_language_model.py --gpus 4 --eval-only --batch-size=1

Sentiment Analysis through Fine-tuning, w/ Bucketing

This script can be used to train a sentiment analysis model from scratch, or to fine-tune a pre-trained language model. The pre-trained language models are loaded from the Gluon NLP toolkit model zoo. The script also showcases how to use different bucketing strategies to speed up training.
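
Bucketing speeds up training because samples of similar length are batched together, so less computation is spent on padding. The sketch below shows one simple strategy, equal-width length buckets, and compares its padding cost with batching in the original order. It is an illustration only: the function names and toy lengths are hypothetical, and the strategies implemented by the script may differ.

```python
def fixed_buckets(lengths, num_buckets):
    """Split [min_len, max_len] into equal-width ranges; return a bucket id per sample."""
    lo, hi = min(lengths), max(lengths)
    width = max(1, -(-(hi - lo + 1) // num_buckets))  # ceiling division
    return [min((n - lo) // width, num_buckets - 1) for n in lengths]

def batches_by_bucket(lengths, num_buckets, batch_size):
    """Group sample indices by bucket, then cut each bucket into batches."""
    groups = {}
    for idx, b in enumerate(fixed_buckets(lengths, num_buckets)):
        groups.setdefault(b, []).append(idx)
    batches = []
    for b in sorted(groups):
        members = groups[b]
        for i in range(0, len(members), batch_size):
            batches.append(members[i:i + batch_size])
    return batches

def padding_cost(lengths, batches):
    """Padding tokens needed when each batch is padded to its longest sample."""
    return sum(max(lengths[i] for i in batch) * len(batch)
               - sum(lengths[i] for i in batch)
               for batch in batches)

# Deterministic toy sequence lengths between 5 and 200.
lengths = [5 + (i * 37) % 196 for i in range(64)]
bucketed = batches_by_bucket(lengths, num_buckets=8, batch_size=8)
naive = [list(range(i, i + 8)) for i in range(0, 64, 8)]
saved = padding_cost(lengths, naive) - padding_cost(lengths, bucketed)
```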

Use the following command to run without a pre-trained model (log)

$ python sentiment_analysis.py --gpu 0 --batch_size 16 --bucket_type fixed --epochs 3 --dropout 0 --no_pretrained --lr 0.005 --valid_ratio 0.1 --save-prefix imdb_lstm_200  # Test Accuracy 85.60

Use the following command to run with a pre-trained model (log)

$ python sentiment_analysis.py --gpu 0 --batch_size 16 --bucket_type fixed --epochs 3 --dropout 0 --lr 0.005 --valid_ratio 0.1 --save-prefix imdb_lstm_200  # Test Accuracy 86.46

Machine Translation

Use the following command to train the GNMT model on the IWSLT2015 dataset.

$ MXNET_GPU_MEM_POOL_TYPE=Round python train_gnmt.py --src_lang en --tgt_lang vi --batch_size 128 \
                --optimizer adam --lr 0.001 --lr_update_factor 0.5 --beam_size 10 --bucket_scheme exp \
                --num_hidden 512 --save_dir gnmt_en_vi_l2_h512_beam10 --epochs 12 --gpu 0

It achieves a test BLEU score of 26.20.

Use the following commands to train the Transformer model on the WMT14 dataset for English to German translation.

$ MXNET_GPU_MEM_POOL_TYPE=Round python train_transformer.py --dataset WMT2014BPE \
                       --src_lang en --tgt_lang de --batch_size 2700 \
                       --optimizer adam --num_accumulated 16 --lr 2.0 --warmup_steps 4000 \
                       --save_dir transformer_en_de_u512 --epochs 30 --gpus 0,1,2,3,4,5,6,7 --scaled \
                       --average_start 5 --num_buckets 20 --bucket_scheme exp --bleu 13a --log_interval 10

It achieves an official mteval-v13a BLEU score of 27.09 on newstest2014 (http://statmt.org/wmt14/test-filtered.tgz). This result is obtained by using averaged SGD in the last 5 epochs. With international tokenization (i.e., --bleu intl), the BLEU score is 27.89. With --bleu tweaked, the test BLEU score is 28.96. This result is obtained on a tweaked reference, where the tokenized reference text is put in ATAT format for historical reasons and the following preprocessing pipeline is applied:

mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l de
mosesdecoder/scripts/tokenizer/remove-non-printing-char.perl
mosesdecoder/scripts/tokenizer/tokenizer.perl -q -no-escape -protected mosesdecoder/scripts/tokenizer/basic-protected-patterns -l de

If we turn on --full, testing is performed on the full newstest2014 set (http://statmt.org/wmt14/test-full.tgz). We then obtain BLEU=27.05 with --bleu 13a, BLEU=27.81 with --bleu intl, and BLEU=28.80 with --bleu tweaked.
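
The results above rely on averaging parameters over the later part of training. The basic mechanism can be sketched as a running mean over parameter snapshots, as below. This is a generic illustration with made-up parameter values and function names, not the averaging code in train_transformer.py.

```python
def update_running_average(average, params, num_averaged):
    """Fold one more parameter snapshot into a running mean, in place."""
    alpha = 1.0 / (num_averaged + 1)
    for name, value in params.items():
        average[name] = [a + alpha * (v - a) for a, v in zip(average[name], value)]
    return num_averaged + 1

# Hypothetical snapshots of a single weight vector taken at three points in training.
snapshots = [{'w': [1.0, 2.0]}, {'w': [3.0, 4.0]}, {'w': [5.0, 6.0]}]
average = {'w': [0.0, 0.0]}
count = 0
for snap in snapshots:
    count = update_running_average(average, snap, count)
```

The running form avoids storing every snapshot, since only the current mean and the number of snapshots seen so far are kept.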

The pre-trained model can be downloaded from http://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/transformer_en_de_512_WMT2014-14bd361b.zip.

For users in China, this link might be faster: https://apache-mxnet.s3.cn-north-1.amazonaws.com.cn/gluon/models/transformer_en_de_512_WMT2014-14bd361b.zip.

Document Classification

Use the following command to train the FastText classification model on the Yelp review dataset. The model we have implemented is a slight variant of :

We have added dropout to the final layer and changed the optimizer from ‘sgd’ to ‘adam’. These changes are made for demo purposes; we can get very good numbers with the original settings too, but fully asynchronous SGD with batch size = 1 might be very slow when training on a GPU.

The datasets used in this script can be obtained with this script from fasttext.

$ python train_classification_fasttext.py --input yelp_review_polarity.train \
                                          --output yelp_review_polarity.gluon \
                                          --validation yelp_review_polarity.test \
                                          --ngrams 1 --epochs 25 --lr 0.1 --emsize 100 --gpu 0

It achieves a validation accuracy of 93.96%. Yelp review is a binary classification dataset (it has 2 classes). Training logs: log

We can call the script for multiclass classification as well without any change. It automatically determines the number of classes and chooses a sigmoid or softmax loss accordingly.
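
The selection logic described here can be sketched as follows: count the distinct labels in the training data, then use a sigmoid (binary cross-entropy) loss for two classes and a softmax cross-entropy loss otherwise. This is a standalone illustration, not the script's code. The `__label__` prefix follows the fastText data convention and is assumed here, and all function names are hypothetical.

```python
import math

LABEL_PREFIX = '__label__'  # assumed fastText-style label marker

def count_classes(lines):
    """Number of distinct label tokens found in the given lines."""
    labels = set()
    for line in lines:
        for token in line.split():
            if token.startswith(LABEL_PREFIX):
                labels.add(token)
    return len(labels)

def choose_loss(num_classes):
    """Two classes need a single logit with sigmoid; more need softmax."""
    return 'sigmoid' if num_classes == 2 else 'softmax'

def sigmoid_loss(logit, label):
    """Numerically stable binary cross-entropy; label is 0 or 1."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def softmax_loss(logits, label):
    """Cross-entropy of a softmax over the logits; label is a class index."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[label]

toy_lines = ['__label__1 bad food', '__label__2 great place',
             '__label__1 slow service']
num_classes = count_classes(toy_lines)
loss_name = choose_loss(num_classes)
```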

Use the following command to train a classification model on the dbpedia dataset, which has 14 labels.

$ python train_classification_fasttext.py --input dbpedia.train \
                                          --output dbpedia.gluon \
                                          --validation dbpedia.test \
                                          --ngrams 1 --epochs 25 --lr 0.1 --emsize 100 --gpu 0

It gives a validation accuracy of 98%. Try setting --ngrams to 2 or 3 for improved accuracy. Training logs: log

Use the following command to train a Classification model on the ag_news dataset:

$ python train_classification_fasttext.py --input ag_news.train \
                                             --output ag_news.gluon \
                                              --validation ag_news.test \
                                              --ngrams 1 --epochs 25 --lr 0.1 --emsize 100 --gpu 0

It gives a validation accuracy of 91%. Training logs: log

Note: It is not advised to try higher-order n-grams with large datasets, since this can cause out-of-memory (OOM) errors on the GPU. You can try running without the --gpu option, as many AWS EC2 instances provide more than 64GB of RAM. In general, a larger learning rate and higher-order n-grams yield better accuracy, but too high a learning rate might cause large oscillations in accuracy during training.
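
To see why higher-order n-grams increase memory use, the sketch below generates word n-gram features for a sentence. Each additional order adds almost as many features per sentence as there are tokens, and far more distinct features across a corpus. The function is an illustrative helper and not part of train_classification_fasttext.py.

```python
def word_ngrams(tokens, n):
    """All contiguous word n-grams of order 1 up to n, joined by spaces."""
    grams = []
    for order in range(1, n + 1):
        for i in range(len(tokens) - order + 1):
            grams.append(' '.join(tokens[i:i + order]))
    return grams

tokens = 'the food was really good'.split()
unigrams = word_ngrams(tokens, 1)
up_to_bigrams = word_ngrams(tokens, 2)
up_to_trigrams = word_ngrams(tokens, 3)
```

Here 5 tokens yield 5, 9 and 12 features for n-gram orders 1, 2 and 3.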

Custom Datasets:

Training can benefit from preprocessing the dataset to lower-case all text and separate punctuation from words. The following Linux command does both:

cat input.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > input.preprocessed.txt
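
For platforms without sed and tr, roughly the same preprocessing can be sketched in Python: the punctuation characters matched by the sed expression are padded with spaces, then the text is lower-cased. This is an approximate equivalent written for illustration. The function name is hypothetical, and Python's lower() also converts non-ASCII letters, which tr may not do depending on the locale.

```python
import re

# Same character set as the sed bracket expression: . \ ! ? , ' / ( )
_PUNCT = re.compile(r"([.\\!?,'/()])")

def preprocess(line):
    """Pad punctuation with spaces, then lower-case the line."""
    return _PUNCT.sub(r" \1 ", line).lower()

example = preprocess("Great food, really!")
```

Applying preprocess to every line of the input file and writing the results to a new file mirrors the shell pipeline.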