[Download]

Word Embeddings Training and Evaluation

In [1]:
import warnings
warnings.filterwarnings('ignore')

import itertools
import time
import math
import logging
import random

import mxnet as mx
import gluonnlp as nlp
import numpy as np
from scipy import stats

# context = mx.cpu()  # Enable this to run on CPU
context = mx.gpu(0)  # Enable this to run on GPU

Data

Here we use the Text8 corpus from the Large Text Compression Benchmark which includes the first 100 MB of cleaned text from the English Wikipedia.

In [2]:
text8 = nlp.data.Text8()
print('# sentences:', len(text8))
for sentence in text8[:3]:
    print('# tokens:', len(sentence), sentence[:5])
# sentences: 1701
# tokens: 10000 ['anarchism', 'originated', 'as', 'a', 'term']
# tokens: 10000 ['reciprocity', 'qualitative', 'impairments', 'in', 'communication']
# tokens: 10000 ['with', 'the', 'aegis', 'of', 'zeus']

Given the tokenized data, we first count all tokens and then construct a vocabulary of all tokens that occur at least 5 times in the dataset. The vocabulary contains a one-to-one mapping between tokens and integers (also called indices, or idx for short).

We further store the frequency count of each token in the vocabulary as we will require this information later on for sampling random negative (or noise) words. Finally we replace all tokens with their integer representation based on the vocabulary.

In [3]:
counter = nlp.data.count_tokens(itertools.chain.from_iterable(text8))
vocab = nlp.Vocab(counter, unknown_token=None, padding_token=None,
                  bos_token=None, eos_token=None, min_freq=5)
idx_to_counts = [counter[w] for w in vocab.idx_to_token]

def code(sentence):
    return [vocab[token] for token in sentence if token in vocab]

text8 = text8.transform(code, lazy=False)

print('# sentences:', len(text8))
for sentence in text8[:3]:
    print('# tokens:', len(sentence), sentence[:5])
# sentences: 1701
# tokens: 9895 [5233, 3083, 11, 5, 194]
# tokens: 9858 [18214, 17356, 36672, 4, 1753]
# tokens: 9926 [23, 0, 19754, 1, 4829]

Next we need to transform the coded Text8 dataset into batches useful for training an embedding model. In this tutorial we train with the SkipGram objective made popular by [1].

For SkipGram, we sample pairs of co-occurring words from the corpus. Two words are said to co-occur if they occur within a distance less than a specified window size. The window size is usually chosen around 5.

To obtain the samples from the corpus, we can shuffle the sentences and then proceed linearly through each sentence, considering each word as well as all the words in its window. In this case, we call the current word in focus the center word, and the words in its window the context words. GluonNLP contains the gluonnlp.data.EmbeddingCenterContextBatchify transformation, which takes a corpus, such as the coded Text8 we have here, and returns a DataStream of batches of center and context words.
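
As an illustration, here is a minimal sketch of the sampling scheme itself (plain Python, not the GluonNLP batchify transformation), applied to the first tokens of the corpus:

def center_context_pairs(sentence, window=5):
    # For every position, the token is the center word and all tokens
    # within `window` positions to either side are its context words.
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                yield center, sentence[j]

toy_sentence = ['anarchism', 'originated', 'as', 'a', 'term']
print(list(center_context_pairs(toy_sentence, window=2)))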

To obtain good results, each sentence is further subsampled, meaning that words are deleted with a probability proportional to their frequency. [1] proposes to discard individual occurrences of words from the dataset with probability

\[P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}\]

where \(f(w_i)\) is the frequency with which a word is observed in the dataset and \(t\) is a subsampling constant typically chosen around \(10^{-5}\). [1] has also shown that the final performance is improved if the window size is chosen uniformly at random for each center word from the range [1, window].
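
As a quick sanity check (a sketch only, not part of the training pipeline), the discard probability can be computed directly from the counts collected above, for example for the three most frequent tokens:

# Sketch: discard probabilities according to the formula above.
t = 1e-5
total = sum(idx_to_counts)
for token, count in zip(vocab.idx_to_token[:3], idx_to_counts[:3]):
    f = count / total                          # relative frequency f(w_i)
    p_discard = max(0.0, 1 - math.sqrt(t / f))  # rare words are never discarded
    print('%s\tf(w)=%.5f\tP(discard)=%.3f' % (token, f, p_discard))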

For this notebook, we are interested in training a fastText embedding model [2]. A fastText model not only associates an embedding vector with each token in the vocabulary, but also with a pre-specified number of subwords. Commonly 2 million subword vectors are obtained and each subword vector is associated with zero, one, or multiple character n-grams. The mapping between character n-grams and subwords is based on a hash function. The final embedding vector of a token is the mean of the vectors associated with the token and all character n-grams occurring in the string representation of the token. Thereby a fastText embedding model can compute meaningful embedding vectors for tokens that were not seen during training.
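
To make the subword idea concrete, here is a minimal sketch of extracting the character n-grams of a token, using the '<' and '>' boundary markers from the fastText paper; the mapping from n-grams to subword indices is handled by the subword_function introduced below:

def character_ngrams(token, ngrams=(3, 4, 5, 6)):
    # Wrap the token in the '<' and '>' boundary markers used by fastText
    # and enumerate all character n-grams of the requested lengths.
    word = '<' + token + '>'
    return [word[i:i + n]
            for n in ngrams
            for i in range(len(word) - n + 1)]

print(character_ngrams('vector')[:10])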

For this notebook, we have prepared a helper function transform_data which builds a series of transformations of the text8 Dataset created above, applying the “tricks” mentioned before. It returns a DataStream over batches, as well as a batchify_fn function that, applied to a batch, looks up and includes the fastText subwords associated with the center words, and finally a subword_function that can be used to obtain the subwords of a given string representation of a token. We will take a closer look at the subword_function shortly.

Note that the number of subwords is potentially different for every word. Therefore the batchify_fn represents a word with its subwords as a row in a compressed sparse row (CSR) matrix. Take a look at https://mxnet.incubator.apache.org/tutorials/sparse/csr.html if you are not familiar with CSR. Separating the batchify_fn from the preceding word-pair sampling is useful, as it allows the CSR matrix construction to be parallelized over multiple CPU cores for separate batches.
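
As a small, self-contained illustration (not the actual batchify_fn), a batch of two hypothetical words with two and three subword indices respectively can be packed into a CSR matrix as follows; the column indices and values are made up for this sketch:

# Sketch: each row holds the word index plus its subword indices as columns.
# Values are set to 1 here for illustration; in practice they may be weighted
# so that the sparse dot product with the weight matrix yields the mean of
# the word and subword vectors.
data = [1, 1, 1, 1, 1]
indices = [3, 71295, 7, 71301, 71310]   # word idx and subword idxs per row
indptr = [0, 2, 5]                      # row 0 uses entries [0, 2), row 1 uses [2, 5)
csr = mx.nd.sparse.csr_matrix((data, indices, indptr),
                              shape=(2, len(vocab) + 100000))
print(csr)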

You can find it in data.py in the archive that can be downloaded via the Download button at the top of this page.

- [1] Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” Advances in Neural Information Processing Systems. 2013.
- [2] Bojanowski, Piotr, et al. “Enriching Word Vectors with Subword Information.” Transactions of the Association for Computational Linguistics. 2017.

In [4]:
from data import transform_data

batch_size=4096
text8 = nlp.data.SimpleDataStream([text8])  # input is a stream of datasets, here just 1. Allows scaling to larger corpora that don't fit in memory
data, batchify_fn, subword_function = transform_data(
    text8, vocab, idx_to_counts, cbow=False, ngrams=[3,4,5,6], ngram_buckets=100000, batch_size=batch_size, window_size=5)
In [5]:
batches = data.transform(batchify_fn)

Subwords

gluonnlp provides the concept of a SubwordFunction which maps words to a list of indices representing their subwords. Possible SubwordFunctions include mapping a word to the sequence of its characters/bytes, or to hashes of all its n-grams.

FastText models use a hash function to map each n-gram of a word to a number in the range [0, num_subwords). We include the same hash function. transform_data above has also returned a subword_function object. Let’s try it with a few words:

In [6]:
idx_to_subwordidxs = subword_function(vocab.idx_to_token)
for word, subwords in zip(vocab.idx_to_token[:3], idx_to_subwordidxs[:3]):
    print('<'+word+'>', subwords, sep = '\t')
<the>     [51151, 9726, 48960, 61980, 60934, 16280]
<of>      [97102, 64528, 28930]
<and>     [78080, 35020, 30390, 95046, 19624, 25443]
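
For reference, here is a sketch of the hashing trick itself, assuming the 32-bit FNV-1a hash used by the original fastText implementation, reduced modulo the number of subword buckets; the indices printed above are not necessarily reproduced by this simplified version:

def ngram_bucket(ngram, num_buckets=100000):
    # Assumed hash: 32-bit FNV-1a over the UTF-8 bytes of the n-gram,
    # reduced to one of `num_buckets` subword slots.
    h = 2166136261
    for byte in ngram.encode('utf-8'):
        h = h ^ byte
        h = (h * 16777619) & 0xFFFFFFFF  # emulate 32-bit overflow
    return h % num_buckets

print([ngram_bucket(ng) for ng in ['<th', 'the', 'he>', '<the', 'the>', '<the>']])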

Model

Here we define a SkipGram model for training fastText embeddings. For SkipGram, the model consists of two independent embedding networks: one for the center words, and one for the context words. For center words, subwords are taken into account, while for context words only the token itself is taken into account.

GluonNLP provides a nlp.model.train.FasttextEmbeddingModel Block which defines the fastText style embedding with subword support. It can be used for training, but also supports loading models trained with the original C++ fastText library from .bin files. After training, vectors for arbitrary words can be looked up via embedding[['a', 'list', 'of', 'potentially', 'unknown', 'words']] where embedding is a nlp.model.train.FasttextEmbeddingModel.

In the model.py script we provide a definition of the fastText model for the SkipGram objective. The model definition is a Gluon HybridBlock, meaning that the complete forward and backward passes are compiled and executed directly in the MXNet backend. Not only does the Block include the FasttextEmbeddingModel for the center words and a simple embedding matrix for the context words, but it also takes care of sampling a specified number of noise words for each center-context pair. These noise words are called negatives, as the resulting center-negative pair is unlikely to occur in the dataset. The model then must learn which word pairs are negatives and which ones are real. Thereby it obtains meaningful word and subword vectors for all considered tokens. The negatives are sampled from the smoothed unigram frequency distribution.
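
The effect of the smoothing can be sketched directly from the counts, using the same exponent 0.75 that is passed to the model below (an illustration only; the actual sampling happens inside the UnigramCandidateSampler):

# Sketch: negatives are drawn from the unigram counts raised to the power
# 0.75 and renormalized, which up-weights rare words relative to raw frequency.
counts = np.array(idx_to_counts, dtype=np.float64)
unigram = counts / counts.sum()
smoothed = counts ** 0.75
smoothed /= smoothed.sum()
for token, p_raw, p_smooth in zip(vocab.idx_to_token[:3], unigram, smoothed):
    print('%s\traw=%.4f\tsmoothed=%.4f' % (token, p_raw, p_smooth))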

Let’s instantiate and initialize the model. We also create a trainer object for updating the parameters with AdaGrad. Finally we print a summary of the model.

In [7]:
from model import SG as SkipGramNet

emsize = 300
num_negatives = 5

negatives_weights = mx.nd.array(idx_to_counts)
embedding = SkipGramNet(
    vocab.token_to_idx, emsize, batch_size, negatives_weights, subword_function, num_negatives=num_negatives, smoothing=0.75)
embedding.initialize(ctx=context)
embedding.hybridize()
trainer = mx.gluon.Trainer(embedding.collect_params(), 'adagrad', dict(learning_rate=0.05))

print(embedding)
SG(
  (embedding): FasttextEmbeddingModel(71290 + 100000 -> 300, float32)
  (embedding_out): Embedding(71290 -> 300, float32)
  (negatives_sampler): UnigramCandidateSampler(71290, <class 'numpy.int64'>)
)

Let’s take a look at the documentation of the forward pass.

In [8]:
print(SkipGramNet.hybrid_forward.__doc__)
SkipGram forward pass.

        Parameters
        ----------
        center : mxnet.nd.NDArray or mxnet.sym.Symbol
            Sparse CSR array of word / subword indices of shape (batch_size,
            len(token_to_idx) + num_subwords). Embedding for center words are
            computed via F.sparse.dot between the CSR center array and the
            weight matrix.
        context : mxnet.nd.NDArray or mxnet.sym.Symbol
            Dense array of context words of shape (batch_size, ). Also used for
            row-wise independently masking negatives equal to one of context.
        center_words : mxnet.nd.NDArray or mxnet.sym.Symbol
            Dense array of center words of shape (batch_size, ). Only used for
            row-wise independently masking negatives equal to one of
            center_words.

Before we start training, let’s examine the quality of our randomly initialized embeddings:

In [9]:
def norm_vecs_by_row(x):
    return x / (mx.nd.sum(x * x, axis=1) + 1e-10).sqrt().reshape((-1, 1))


def get_k_closest_tokens(vocab, embedding, k, word):
    word_vec = norm_vecs_by_row(embedding[[word]])
    vocab_vecs = norm_vecs_by_row(embedding[vocab.idx_to_token])
    dot_prod = mx.nd.dot(vocab_vecs, word_vec.T)
    indices = mx.nd.topk(
        dot_prod.reshape((len(vocab.idx_to_token), )),
        k=k + 1,
        ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    result = [vocab.idx_to_token[i] for i in indices[1:]]
    print('closest tokens to "%s": %s' % (word, ", ".join(result)))
In [10]:
example_token = "vector"
get_k_closest_tokens(vocab, embedding, 10, example_token)
closest tokens to "vector": vectors, vectoring, bivector, sector, rector, lector, spector, director, vectorborne, hector

We can see that in the randomly initialized fastText model the closest tokens to “vector” are based on overlapping ngrams.

Training

Thanks to the Gluon data pipeline and the HybridBlock handling all the complexity, our training code is very simple. We iterate over all batches, move them to the appropriate context (GPU), perform the forward and backward passes and the parameter update, and finally include some helpful print statements for following the training progress.

In [11]:
log_interval = 500

def train_embedding(num_epochs):
    for epoch in range(1, num_epochs + 1):
        start_time = time.time()
        l_avg = 0
        log_wc = 0

        print('Beginning epoch %d and resampling data.' % epoch)
        for i, batch in enumerate(batches):
            batch = [array.as_in_context(context) for array in batch]
            with mx.autograd.record():
                l = embedding(*batch)
            l.backward()
            trainer.step(1)

            l_avg += l.mean()
            log_wc += l.shape[0]
            if i % log_interval == 0:
                mx.nd.waitall()
                wps = log_wc / (time.time() - start_time)
                l_avg = l_avg.asscalar() / log_interval
                print('epoch %d, iteration %d, loss %.2f, throughput=%.2fK wps'
                      % (epoch, i, l_avg, wps / 1000))
                start_time = time.time()
                log_wc = 0
                l_avg = 0

        get_k_closest_tokens(vocab, embedding, 10, example_token)
        print("")
In [12]:
train_embedding(num_epochs=1)
Beginning epoch 1 and resampling data.
epoch 1, iteration 0, loss 0.00, throughput=0.56K wps
epoch 1, iteration 500, loss 0.53, throughput=223.89K wps
epoch 1, iteration 1000, loss 0.48, throughput=217.82K wps
epoch 1, iteration 1500, loss 0.46, throughput=235.90K wps
epoch 1, iteration 2000, loss 0.45, throughput=217.48K wps
epoch 1, iteration 2500, loss 0.44, throughput=204.12K wps
epoch 1, iteration 3000, loss 0.43, throughput=196.32K wps
epoch 1, iteration 3500, loss 0.43, throughput=190.44K wps
epoch 1, iteration 4000, loss 0.43, throughput=190.83K wps
epoch 1, iteration 4500, loss 0.42, throughput=192.61K wps
epoch 1, iteration 5000, loss 0.42, throughput=191.21K wps
epoch 1, iteration 5500, loss 0.42, throughput=188.95K wps
epoch 1, iteration 6000, loss 0.41, throughput=190.74K wps
epoch 1, iteration 6500, loss 0.41, throughput=194.47K wps
epoch 1, iteration 7000, loss 0.41, throughput=200.57K wps
epoch 1, iteration 7500, loss 0.41, throughput=206.25K wps
epoch 1, iteration 8000, loss 0.41, throughput=205.80K wps
epoch 1, iteration 8500, loss 0.41, throughput=207.05K wps
epoch 1, iteration 9000, loss 0.41, throughput=181.66K wps
epoch 1, iteration 9500, loss 0.40, throughput=181.63K wps
epoch 1, iteration 10000, loss 0.40, throughput=168.69K wps
epoch 1, iteration 10500, loss 0.40, throughput=166.63K wps
epoch 1, iteration 11000, loss 0.40, throughput=133.92K wps
epoch 1, iteration 11500, loss 0.40, throughput=172.69K wps
epoch 1, iteration 12000, loss 0.40, throughput=160.53K wps
closest tokens to "vector": eigenvector, bivector, vectoring, functor, vectors, inverse, vectra, euclidean, eigenvectors, coalgebra

Word Similarity and Relatedness Task

Word embeddings should capture the relationship between words in natural language. In the Word Similarity and Relatedness Task, word embeddings are evaluated by comparing word similarity scores computed from a pair of words with human labels for the similarity or relatedness of the pair.

gluonnlp includes a number of common datasets for the Word Similarity and Relatedness Task. The included datasets are listed in the API documentation. We use the RareWords dataset in the evaluation example below. To get an overall feeling for the dataset structure, we first show a few samples from the WordSim353 dataset.
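
A few example pairs, assuming the nlp.data.WordSim353 dataset class included in gluonnlp (each sample consists of two words and a human-assigned similarity score):

ws353 = nlp.data.WordSim353()
for i in range(5):
    print('\t', ws353[i])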

Evaluation

Thanks to the subword support of the FasttextEmbeddingModel we can evaluate on all words in the evaluation dataset, not only the ones that we observed during training.

We first compute a list of tokens in our evaluation dataset and then create an embedding matrix for them based on the fastText model.

In [13]:
rw = nlp.data.RareWords()
rw_tokens = list(set(itertools.chain.from_iterable((d[0], d[1]) for d in rw)))

rw_token_embedding = nlp.embedding.TokenEmbedding(unknown_token=None, allow_extend=True)
rw_token_embedding[rw_tokens] = embedding[rw_tokens]

print('There are', len(rw_tokens), 'unique tokens in the RareWords dataset. Examples are:')
for i in range(5):
    print('\t', rw[i])
print('The imputed TokenEmbedding has shape', rw_token_embedding.idx_to_vec.shape)
There are 2951 unique tokens in the RareWords dataset. Examples are:
         ['squishing', 'squirt', 5.88]
         ['undated', 'undatable', 5.83]
         ['circumvents', 'beat', 5.33]
         ['circumvents', 'ebb', 3.25]
         ['dispossess', 'deprive', 6.83]
The imputed TokenEmbedding has shape (2951, 300)
In [14]:
evaluator = nlp.embedding.evaluation.WordEmbeddingSimilarity(
    idx_to_vec=rw_token_embedding.idx_to_vec,
    similarity_function="CosineSimilarity")
evaluator.initialize(ctx=context)
evaluator.hybridize()
In [15]:
words1, words2, scores = zip(*([rw_token_embedding.token_to_idx[d[0]],
                                rw_token_embedding.token_to_idx[d[1]],
                                d[2]] for d in rw))
words1 = mx.nd.array(words1, ctx=context)
words2 = mx.nd.array(words2, ctx=context)
In [16]:
pred_similarity = evaluator(words1, words2)
sr = stats.spearmanr(pred_similarity.asnumpy(), np.array(scores))
print('Spearman rank correlation on {} pairs of {}: {}'.format(
    len(words1), rw.__class__.__name__, sr.correlation.round(3)))
Spearman rank correlation on 2034 pairs of RareWords: 0.349

Further information

For further information and examples on training and evaluating word embeddings with GluonNLP take a look at the Word Embedding section on the Scripts / Model Zoo page. There you will find more thorough evaluation techniques and other embedding models. In fact, the data.py and model.py files used in this example are the same as the ones used in the script.