
Word Embeddings Training and Evaluation

In [1]:
import warnings

import itertools
import time
import math
import logging
import random

import mxnet as mx
import gluonnlp as nlp
import numpy as np
from scipy import stats

# context = mx.cpu()  # Enable this to run on CPU
context = mx.gpu(0)  # Enable this to run on GPU


Here we use the Text8 corpus from the Large Text Compression Benchmark which includes the first 100 MB of cleaned text from the English Wikipedia.

In [2]:
text8 = nlp.data.Text8()
print('# sentences:', len(text8))
for sentence in text8[:3]:
    print('# tokens:', len(sentence), sentence[:5])
Downloading /var/lib/jenkins/workspace/gluon-nlp-gpu-py3@2/tests/data/datasets/text8/ from
# sentences: 1701
# tokens: 10000 ['anarchism', 'originated', 'as', 'a', 'term']
# tokens: 10000 ['reciprocity', 'qualitative', 'impairments', 'in', 'communication']
# tokens: 10000 ['with', 'the', 'aegis', 'of', 'zeus']

Given the tokenized data, we first count all tokens and then construct a vocabulary of all tokens that occur at least 5 times in the dataset. The vocabulary contains a one-to-one mapping between tokens and integers (also called indices, or idx for short).

We also store the frequency count of each token in the vocabulary, as we will need this information later for sampling random negative (or noise) words. Finally, we replace all tokens with their integer representation based on the vocabulary.

In [3]:
counter = nlp.data.count_tokens(itertools.chain.from_iterable(text8))
vocab = nlp.Vocab(counter, unknown_token=None, padding_token=None,
                  bos_token=None, eos_token=None, min_freq=5)
idx_to_counts = [counter[w] for w in vocab.idx_to_token]

def code(sentence):
    return [vocab[token] for token in sentence if token in vocab]

text8 = text8.transform(code, lazy=False)

print('# sentences:', len(text8))
for sentence in text8[:3]:
    print('# tokens:', len(sentence), sentence[:5])
# sentences: 1701
# tokens: 9895 [5233, 3083, 11, 5, 194]
# tokens: 9858 [18214, 17356, 36672, 4, 1753]
# tokens: 9926 [23, 0, 19754, 1, 4829]
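
The counting, thresholding and encoding steps above can be sketched without GluonNLP. This is only an illustration: the helper names `build_vocab` and `encode` are hypothetical, and the ordering of indices by descending frequency is an assumption of the sketch rather than a statement about how nlp.Vocab is implemented.

```python
from collections import Counter

def build_vocab(sentences, min_freq=5):
    """Return (idx_to_token, token_to_idx, idx_to_counts) for frequent tokens."""
    counter = Counter(token for sentence in sentences for token in sentence)
    # Assumption: frequent tokens get small indices; ties are broken alphabetically.
    kept = sorted(
        (tok for tok, count in counter.items() if count >= min_freq),
        key=lambda tok: (-counter[tok], tok))
    token_to_idx = {tok: i for i, tok in enumerate(kept)}
    idx_to_counts = [counter[tok] for tok in kept]
    return kept, token_to_idx, idx_to_counts

def encode(sentence, token_to_idx):
    # Tokens below the frequency threshold are dropped rather than mapped
    # to an unknown token, mirroring the code() function above.
    return [token_to_idx[tok] for tok in sentence if tok in token_to_idx]

toy = [["a", "b", "a", "c"], ["a", "b", "d"]]
idx_to_token, token_to_idx, idx_to_counts = build_vocab(toy, min_freq=2)
print(idx_to_token, idx_to_counts)            # ['a', 'b'] [3, 2]
print(encode(["d", "a", "b"], token_to_idx))  # [0, 1]
```

Dropping rare tokens shortens the sentences, which is why the token counts printed above fall slightly below 10000 after encoding.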

Next we need to transform the coded Text8 dataset into batches useful for training an embedding model. In this tutorial we train with the SkipGram objective made popular by [1].

For SkipGram, we sample pairs of co-occurring words from the corpus. Two words are said to co-occur if they occur with distance less than a specified window size. The window size is usually chosen around 5.

To obtain the samples from the corpus, we can shuffle the sentences and then proceed linearly through each sentence, considering each word as well as all the words in its window. In this case, we call the current word in focus the center word, and the words in its window the context words. GluonNLP contains a batchify transformation that takes a corpus, such as the coded Text8 we have here, and returns a DataStream of batches of center and context words.
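
As a rough illustration of the pair sampling just described, the following sketch enumerates center and context words for one coded sentence. The function name `center_context_pairs` is hypothetical, and treating words at most `window` positions apart as co-occurring is an assumption of the sketch. It does not reproduce the batching or shuffling done by the GluonNLP transformation.

```python
def center_context_pairs(sentence, window=5):
    """Yield (center, context) pairs for all words within `window` positions."""
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, sentence[j]

pairs = list(center_context_pairs([10, 11, 12, 13], window=1))
print(pairs)
# [(10, 11), (11, 10), (11, 12), (12, 11), (12, 13), (13, 12)]
```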

To obtain good results, each sentence is further subsampled, meaning that words are deleted with a probability that grows with their frequency. [1] proposes to discard individual occurrences of words from the dataset with probability

\[P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}\]

where \(f(w_i)\) is the frequency with which a word is observed in a dataset and \(t\) is a subsampling constant typically chosen around \(10^{-5}\). [1] has also shown that the final performance is improved if the window size is chosen uniformly at random for each center word from the range [1, window].
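
To make the formula concrete, here is a small sketch of the discard probability and the random window size. The function names are hypothetical, and clipping the probability at zero for words rarer than \(t\) is an assumption of this sketch (the formula alone would give a negative value there).

```python
import math
import random

def discard_probability(freq, t=1e-5):
    """P(w) = 1 - sqrt(t / f(w)), clipped at 0 for words rarer than t."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

def subsample(sentence, idx_to_freq, t=1e-5, rng=random):
    """Drop each occurrence independently with its discard probability."""
    return [w for w in sentence
            if rng.random() >= discard_probability(idx_to_freq[w], t)]

def sample_window(max_window, rng=random):
    # Window size drawn uniformly from {1, ..., max_window}.
    return rng.randint(1, max_window)

# A word that makes up 0.1% of the corpus is discarded about 90% of the time.
print(round(discard_probability(1e-3), 3))  # 0.9
```

With \(t = 10^{-5}\), a word whose relative frequency is exactly \(t\) is never discarded, while very frequent function words are removed almost always.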

For this notebook, we are interested in training a fastText embedding model [2]. A fastText model associates an embedding vector not only with each token in the vocabulary, but also with a pre-specified number of subwords. Commonly 2 million subword vectors are obtained, and each subword vector is associated with zero, one or multiple character-ngrams. The mapping between character-ngrams and subwords is based on a hash function. The final embedding vector of a token is the mean of the vectors associated with the token and all character-ngrams occurring in the string representation of the token. Thereby a fastText embedding model can compute meaningful embedding vectors for tokens that were not seen during training.
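
The averaging step can be sketched in plain Python. The function `fasttext_vector` and the toy matrices are hypothetical, and the sketch assumes that every token has at least one subword.

```python
def fasttext_vector(token_idx, subword_idxs, token_vecs, subword_vecs):
    """Average the token vector (if the token is known) with its subword vectors."""
    rows = [subword_vecs[i] for i in subword_idxs]
    if token_idx is not None:  # None marks a token that was not seen in training
        rows.append(token_vecs[token_idx])
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

token_vecs = [[1.0, 3.0]]                 # one known token, 2 dimensions
subword_vecs = [[0.0, 0.0], [2.0, 0.0]]   # two subword buckets

print(fasttext_vector(0, [0, 1], token_vecs, subword_vecs))   # [1.0, 1.0]
print(fasttext_vector(None, [1], token_vecs, subword_vecs))   # [2.0, 0.0]
```

The second call shows why unseen tokens still receive a vector: only the subword rows contribute.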

For this notebook, we have prepared a helper function, transform_data_fasttext, which builds a series of transformations of the text8 Dataset created above and applies the “tricks” mentioned before. It returns three things: a DataStream over batches; a batchify_fn function that, applied to a batch, looks up and includes the fastText subwords associated with the center words; and the subword function, which can be used to obtain the subwords of a given string representation of a token. We will take a closer look at the subword function shortly.

Note that the number of subwords is potentially different for every word. Therefore the batchify_fn represents a word with its subwords as a row in a compressed sparse row (CSR) matrix. If you are not familiar with the CSR format, it is worth reading up on it before continuing. Separating the batchify_fn from the previous word-pair sampling is useful, as it allows the CSR matrix construction to be parallelized over multiple CPU cores for separate batches.
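
For readers new to CSR, the sketch below packs variable-length index lists into the three CSR arrays. The function `to_csr` is hypothetical. Storing 1 / len(row) as the values, so that a sparse-dense product with the weight matrix yields a mean, is an assumption about one reasonable design and not a description of the helper's internals.

```python
def to_csr(rows):
    """Pack variable-length index lists into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col in row:
            indices.append(col)
            data.append(1.0 / len(row))  # averaging weight for this row
        indptr.append(len(indices))      # end offset of the current row
    return data, indices, indptr

# Two words: the first with three word / subword indices, the second with two.
data, indices, indptr = to_csr([[0, 5, 7], [1, 6]])
print(indices)  # [0, 5, 7, 1, 6]
print(indptr)   # [0, 3, 5]

# Row i is recovered as indices[indptr[i]:indptr[i + 1]].
print(indices[indptr[1]:indptr[2]])  # [1, 6]
```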

You can find transform_data_fasttext in the archive that can be downloaded via the Download button at the top of this page.

- [1] Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” Advances in Neural Information Processing Systems. 2013.
- [2] Bojanowski et al. “Enriching Word Vectors with Subword Information.” Transactions of the Association for Computational Linguistics. 2017.

In [4]:
from data import transform_data_fasttext

batch_size = 4096  # assumption: batch_size is not defined in this listing; adjust to your hardware
# The input is a stream of datasets, here just one. This allows scaling to
# larger corpora that don't fit in memory.
data = nlp.data.SimpleDataStream([text8])
data, batchify_fn, subword_function = transform_data_fasttext(
    data, vocab, idx_to_counts, cbow=False, ngrams=[3, 4, 5, 6],
    ngram_buckets=100000, batch_size=batch_size, window_size=5)
In [5]:
batches = data.transform(batchify_fn)


gluonnlp provides the concept of a SubwordFunction, which maps words to a list of indices representing their subwords. Possible SubwordFunctions include mapping a word to the sequence of its characters/bytes or to hashes of all its ngrams.

FastText models use a hash function to map each ngram of a word to a number in the range [0, num_subwords). We include the same hash function. Above, transform_data_fasttext also returned a subword_function object. Let’s try it with a few words:

In [6]:
idx_to_subwordidxs = subword_function(vocab.idx_to_token)
for word, subwords in zip(vocab.idx_to_token[:3], idx_to_subwordidxs[:3]):
    print('<'+word+'>', subwords, sep = '\t')
<the>     [51151, 9726, 48960, 61980, 60934, 16280]
<of>      [97102, 64528, 28930]
<and>     [78080, 35020, 30390, 95046, 19624, 25443]
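
To show where such lists of indices come from, the following sketch extracts the character n-grams of a word wrapped in `<` and `>` and hashes each into a fixed number of buckets. The helper names are hypothetical, and the 32-bit FNV-1a hash is only a stand-in chosen for illustration, so the resulting indices should not be expected to match the output above.

```python
def char_ngrams(word, ns=(3, 4, 5, 6)):
    """All character n-grams of '<word>' for the given n-gram lengths."""
    s = '<' + word + '>'
    return [s[i:i + n] for n in ns for i in range(len(s) - n + 1)]

def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes (illustrative choice of hash)."""
    h = 2166136261
    for byte in text.encode('utf-8'):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h

def subword_indices(word, num_buckets=100000):
    # Each n-gram is mapped to a bucket in [0, num_buckets).
    return [fnv1a_32(gram) % num_buckets for gram in char_ngrams(word)]

print(char_ngrams('the'))  # ['<th', 'the', 'he>', '<the', 'the>', '<the>']
print(char_ngrams('of'))   # ['<of', 'of>', '<of>']
```

The n-gram counts (six for "the", three for "of") agree with the lengths of the index lists printed above. Because different n-grams can land in the same bucket, several n-grams may share one subword vector.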


Here we define a SkipGram model for training fastText embeddings. For SkipGram, the model consists of two independent embedding networks: one for the center words and one for the context words. For center words, subwords are taken into account, while for context words only the token itself is taken into account.

GluonNLP provides a nlp.model.train.FasttextEmbeddingModel Block which defines the fastText style embedding with subword support. It can be used for training, but also supports loading models trained with the original C++ fastText library from .bin files. After training, vectors for arbitrary words can be looked up via embedding[['a', 'list', 'of', 'potentially', 'unknown', 'words']] where embedding is a nlp.model.train.FasttextEmbeddingModel.

In the script we provide a definition of the fastText model for the SkipGram objective. The model definition is a Gluon HybridBlock, meaning that the complete forward / backward pass is compiled and executed directly in the MXNet backend. The Block not only includes the FasttextEmbeddingModel for the center words and a simple embedding matrix for the context words, but also takes care of sampling a specified number of noise words for each center-context pair. These noise words are called negatives, as the resulting center-negative pair is unlikely to occur in the dataset. The model must then learn which word pairs are negatives and which ones are real, and thereby obtains meaningful word and subword vectors for all considered tokens. The negatives are sampled from the smoothed unigram frequency distribution.
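
The smoothed unigram distribution can be illustrated as follows. The function names are hypothetical, the exponent 0.75 follows the smoothing value used in this notebook, and rejecting excluded indices is a simplification of this sketch (the forward-pass documentation further down describes masking instead).

```python
import random

def smoothed_unigram(counts, smoothing=0.75):
    """Normalize counts ** smoothing into a probability distribution."""
    weights = [count ** smoothing for count in counts]
    total = sum(weights)
    return [w / total for w in weights]

def sample_negatives(probs, num_negatives, exclude=(), rng=random):
    """Draw noise word indices, rejecting any index listed in `exclude`.

    Assumes at least one index with non-zero probability is not excluded.
    """
    negatives = []
    while len(negatives) < num_negatives:
        idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if idx not in exclude:
            negatives.append(idx)
    return negatives

counts = [1000, 10]
probs = smoothed_unigram(counts)
# Smoothing raises the share of the rare word from about 1% to about 3%.
print(round(counts[1] / sum(counts), 3), round(probs[1], 3))  # 0.01 0.031
```

Raising the counts to a power below one flattens the distribution, so rare words are drawn as negatives more often than their raw frequency would suggest.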

Let’s instantiate and initialize the model. We also create a trainer object for updating the parameters with AdaGrad. Finally we print a summary of the model.

In [7]:
from model import SG as SkipGramNet

emsize = 300
num_negatives = 5

negatives_weights = mx.nd.array(idx_to_counts)
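embedding = SkipGramNet(
    vocab.token_to_idx, emsize, batch_size, negatives_weights, subword_function,
    num_negatives=num_negatives, smoothing=0.75)
embedding.initialize(ctx=context)  # allocate and initialize the parameters on the chosen device
embedding.hybridize()  # let MXNet compile the forward / backward pass
trainer = mx.gluon.Trainer(embedding.collect_params(), 'adagrad', dict(learning_rate=0.05))

print(embedding)  # summary of the model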
  (embedding): FasttextEmbeddingModel(71290 + 100000 -> 300, float32)
  (embedding_out): Embedding(71290 -> 300, float32)
  (negatives_sampler): UnigramCandidateSampler(71290, int64)

Let’s take a look at the documentation of the forward pass.

In [8]:
print(SkipGramNet.hybrid_forward.__doc__)
SkipGram forward pass.

        center : mxnet.nd.NDArray or mxnet.sym.Symbol
            Sparse CSR array of word / subword indices of shape (batch_size,
            len(token_to_idx) + num_subwords). Embedding for center words are
            computed via between the CSR center array and the
            weight matrix.
        context : mxnet.nd.NDArray or mxnet.sym.Symbol
            Dense array of context words of shape (batch_size, ). Also used for
            row-wise independently masking negatives equal to one of context.
        center_words : mxnet.nd.NDArray or mxnet.sym.Symbol
            Dense array of center words of shape (batch_size, ). Only used for
            row-wise independently masking negatives equal to one of

Before we start training, let’s examine the quality of our randomly initialized embeddings:

In [9]:
def norm_vecs_by_row(x):
    return x / (mx.nd.sum(x * x, axis=1) + 1e-10).sqrt().reshape((-1, 1))

def get_k_closest_tokens(vocab, embedding, k, word):
    word_vec = norm_vecs_by_row(embedding[[word]])
    vocab_vecs = norm_vecs_by_row(embedding[vocab.idx_to_token])
    dot_prod = mx.nd.dot(vocab_vecs, word_vec.T)
    indices = mx.nd.topk(
        dot_prod.reshape((len(vocab.idx_to_token), )),
        k=k + 1,
        ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    result = [vocab.idx_to_token[i] for i in indices[1:]]
    print('closest tokens to "%s": %s' % (word, ", ".join(result)))
In [10]:
example_token = "vector"
get_k_closest_tokens(vocab, embedding, 10, example_token)
closest tokens to "vector": vectors, vectoring, bivector, sector, rector, lector, spector, director, vectorborne, hector

We can see that in the randomly initialized fastText model the closest tokens to “vector” are based on overlapping ngrams.
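
The nearest-neighbour lookup used here is a cosine similarity ranking. A dependency-free sketch on a tiny, made-up set of vectors (the dictionary `toy` and the function names are hypothetical) looks like this:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-10)  # small constant avoids division by zero

def k_closest(word, vectors, k):
    """Return the k tokens most similar to `word`, excluding `word` itself."""
    query = vectors[word]
    ranked = sorted((tok for tok in vectors if tok != word),
                    key=lambda tok: cosine(query, vectors[tok]), reverse=True)
    return ranked[:k]

toy = {"vector": [1.0, 0.0], "vectors": [0.9, 0.1],
       "sector": [0.5, 0.5], "zebra": [-1.0, 0.0]}
print(k_closest("vector", toy, 2))  # ['vectors', 'sector']
```

get_k_closest_tokens above does the same thing in batched form: it normalizes all rows, takes one matrix-vector product, and skips the first hit because it is the query word itself.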


Thanks to the Gluon data pipeline and the HybridBlock handling all the complexity, our training code is very simple. We iterate over all batches, move them to the appropriate context (GPU), run the forward pass, backward pass and parameter update, and finally include some helpful print statements for following the training process.

In [11]:
log_interval = 500

def train_embedding(num_epochs):
    for epoch in range(1, num_epochs + 1):
        start_time = time.time()
        l_avg = 0
        log_wc = 0

        print('Beginning epoch %d and resampling data.' % epoch)
        for i, batch in enumerate(batches):
            batch = [array.as_in_context(context) for array in batch]
            with mx.autograd.record():
                l = embedding(*batch)
            l.backward()     # backward pass
            trainer.step(1)  # parameter update with AdaGrad

            l_avg += l.mean()
            log_wc += l.shape[0]
            if i % log_interval == 0:
                wps = log_wc / (time.time() - start_time)
                l_avg = l_avg.asscalar() / log_interval
                print('epoch %d, iteration %d, loss %.2f, throughput=%.2fK wps'
                      % (epoch, i, l_avg, wps / 1000))
                start_time = time.time()
                log_wc = 0
                l_avg = 0

        get_k_closest_tokens(vocab, embedding, 10, example_token)
In [12]:
train_embedding(num_epochs=1)
Beginning epoch 1 and resampling data.
epoch 1, iteration 0, loss 0.00, throughput=0.55K wps
epoch 1, iteration 500, loss 0.54, throughput=171.19K wps
epoch 1, iteration 1000, loss 0.48, throughput=207.05K wps
epoch 1, iteration 1500, loss 0.46, throughput=198.94K wps
epoch 1, iteration 2000, loss 0.45, throughput=239.00K wps
epoch 1, iteration 2500, loss 0.44, throughput=176.87K wps
epoch 1, iteration 3000, loss 0.43, throughput=253.77K wps
epoch 1, iteration 3500, loss 0.43, throughput=248.42K wps
epoch 1, iteration 4000, loss 0.42, throughput=246.81K wps
epoch 1, iteration 4500, loss 0.42, throughput=248.00K wps
epoch 1, iteration 5000, loss 0.42, throughput=246.81K wps
epoch 1, iteration 5500, loss 0.41, throughput=245.28K wps
epoch 1, iteration 6000, loss 0.41, throughput=237.65K wps
epoch 1, iteration 6500, loss 0.41, throughput=238.05K wps
epoch 1, iteration 7000, loss 0.41, throughput=182.17K wps
epoch 1, iteration 7500, loss 0.41, throughput=246.51K wps
epoch 1, iteration 8000, loss 0.41, throughput=190.47K wps
epoch 1, iteration 8500, loss 0.41, throughput=158.99K wps
epoch 1, iteration 9000, loss 0.41, throughput=126.27K wps
epoch 1, iteration 9500, loss 0.40, throughput=143.27K wps
epoch 1, iteration 10000, loss 0.41, throughput=175.93K wps
epoch 1, iteration 10500, loss 0.41, throughput=175.93K wps
epoch 1, iteration 11000, loss 0.40, throughput=166.16K wps
epoch 1, iteration 11500, loss 0.40, throughput=179.53K wps
epoch 1, iteration 12000, loss 0.40, throughput=165.69K wps
closest tokens to "vector": bivector, eigenvector, vectors, vectoring, polynomials, parametric, polynomial, symmetric, eigenvectors, functor
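
For intuition about the quantity being minimized, here is a sketch of the negative-sampling objective from [1] for a single center-context pair. It assumes a logistic loss on dot products. The model in the script may differ in details such as how the loss is averaged, so the numbers are not directly comparable to the log above. All names are hypothetical.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(center, context, negatives):
    """-log s(c.o) - sum over negatives n of log s(-c.n), with s the sigmoid."""
    loss = -log_sigmoid(dot(center, context))
    for neg in negatives:
        loss -= log_sigmoid(-dot(center, neg))
    return loss

center = [1.0, 0.0]
# The loss is small when the real context aligns with the center word and
# the negatives point away from it, and larger in the opposite case.
good = sgns_loss(center, [3.0, 0.0], [[-3.0, 0.0]] * 5)
bad = sgns_loss(center, [-3.0, 0.0], [[3.0, 0.0]] * 5)
print(good < bad)  # True
```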

Word Similarity and Relatedness Task

Word embeddings should capture the relationship between words in natural language. In the Word Similarity and Relatedness Task word embeddings are evaluated by comparing word similarity scores computed from a pair of words with human labels for the similarity or relatedness of the pair.

gluonnlp includes a number of common datasets for the Word Similarity and Relatedness Task. The included datasets are listed in the API documentation. We use the RareWords dataset in the evaluation example below and first show a few of its samples to get an overall feeling for the dataset structure.


Thanks to the subword support of the FasttextEmbeddingModel we can evaluate on all words in the evaluation dataset, not only the ones that we observed during training.

We first compute a list of tokens in our evaluation dataset and then create an embedding matrix for them based on the fastText model.

In [13]:
rw = nlp.data.RareWords()
rw_tokens = list(set(itertools.chain.from_iterable((d[0], d[1]) for d in rw)))

rw_token_embedding = nlp.embedding.TokenEmbedding(unknown_token=None, allow_extend=True)
rw_token_embedding[rw_tokens] = embedding[rw_tokens]

print('There are', len(rw_tokens), 'unique tokens in the RareWords dataset. Examples are:')
for i in range(5):
    print('\t', rw[i])
print('The imputed TokenEmbedding has shape', rw_token_embedding.idx_to_vec.shape)
Downloading /var/lib/jenkins/workspace/gluon-nlp-gpu-py3@2/tests/data/datasets/rarewords/ from
There are 2951 unique tokens in the RareWords dataset. Examples are:
         ['squishing', 'squirt', 5.88]
         ['undated', 'undatable', 5.83]
         ['circumvents', 'beat', 5.33]
         ['circumvents', 'ebb', 3.25]
         ['dispossess', 'deprive', 6.83]
The imputed TokenEmbedding has shape (2951, 300)
In [14]:
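evaluator = nlp.embedding.evaluation.WordEmbeddingSimilarity(
    idx_to_vec=rw_token_embedding.idx_to_vec,
    similarity_function="CosineSimilarity")
evaluator.initialize(ctx=context)
evaluator.hybridize()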
In [15]:
words1, words2, scores = zip(*([rw_token_embedding.token_to_idx[d[0]],
                                rw_token_embedding.token_to_idx[d[1]],
                                d[2]] for d in rw))
words1 = mx.nd.array(words1, ctx=context)
words2 = mx.nd.array(words2, ctx=context)
In [16]:
pred_similarity = evaluator(words1, words2)
sr = stats.spearmanr(pred_similarity.asnumpy(), np.array(scores))
print('Spearman rank correlation on {} pairs of {}: {}'.format(
    len(words1), rw.__class__.__name__, sr.correlation.round(3)))
Spearman rank correlation on 2034 pairs of RareWords: 0.346
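
The Spearman rank correlation reported above compares only the ordering of predicted similarities and human scores, not their absolute values. A dependency-free sketch (hypothetical helper names; scipy.stats.spearmanr is what the notebook actually uses) computes it as the Pearson correlation of average ranks:

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # mean rank of the tie group
        i = j + 1
    return ranks

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

predicted = [0.1, 0.4, 0.9]   # made-up model similarities
human = [1.0, 5.0, 6.0]       # made-up human scores
print(round(spearman(predicted, human), 3))  # 1.0
```

A value of 1.0 means the model ranks the word pairs exactly as the human annotators do, even though the two scales differ.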

Further information

For further information and examples on training and evaluating word embeddings with GluonNLP, take a look at the Word Embedding section on the Scripts / Model Zoo page. There you will find more thorough evaluation techniques and other embedding models. In fact, the data and model modules imported in this example are the same as the ones used in the script.