gluonnlp.embedding

This page describes the gluonnlp APIs for text embedding: loading pre-trained embedding vectors for text tokens, storing them in the mxnet.ndarray.NDArray format, and utilities for the intrinsic evaluation of text embeddings.

gluonnlp.embedding: Word embeddings.
  • register: Registers a new token embedding.
  • create: Creates an instance of token embedding.
  • list_sources: Get valid token embedding names and their pre-trained file names.
  • TokenEmbedding: Token embedding base class.
  • GloVe: The GloVe word embedding.
  • FastText: The fastText word embedding.
  • Word2Vec: The Word2Vec word embedding.

API Reference

Word embeddings.

gluonnlp.embedding.register(embedding_cls)[source]

Registers a new token embedding.

Once an embedding is registered, we can create an instance of this embedding with create().

Examples

>>> @gluonnlp.embedding.register
... class MyTextEmbed(gluonnlp.embedding.TokenEmbedding):
...     def __init__(self, source='my_pretrain_file'):
...         pass
>>> embed = gluonnlp.embedding.create('MyTextEmbed')
>>> print(type(embed))
<class '__main__.MyTextEmbed'>
gluonnlp.embedding.create(embedding_name, **kwargs)[source]

Creates an instance of token embedding.

Creates a token embedding instance by loading embedding vectors from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText. To get all valid values of embedding_name and source, use gluonnlp.embedding.list_sources().

Parameters:
  • embedding_name (str) – The token embedding name (case-insensitive).
  • kwargs (dict) – All other keyword arguments are passed to the initializer of the token embedding class. For example, create(embedding_name='fasttext', source='wiki.simple', load_ngrams=True) returns FastText(source='wiki.simple', load_ngrams=True).
Returns:

A token embedding instance that loads embedding vectors from an externally hosted pre-trained token embedding file.

Return type:

An instance of gluonnlp.embedding.TokenEmbedding
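For example, a minimal sketch (this downloads the externally hosted 'wiki.simple' file on first use; the shape below assumes the 300-dimensional wiki.simple fastText vectors):

>>> import gluonnlp
>>> fasttext = gluonnlp.embedding.create('fasttext', source='wiki.simple')
>>> fasttext['hello'].shape
(300,)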

gluonnlp.embedding.list_sources(embedding_name=None)[source]

Get valid token embedding names and their pre-trained file names.

To load token embedding vectors from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText, one should use gluonnlp.embedding.create(embedding_name, source). This method returns all the valid names of source for the specified embedding_name. If embedding_name is set to None, this method returns all the valid names of embedding_name with their associated source.

Parameters:embedding_name (str or None, default None) – The pre-trained token embedding name.
Returns:A list of all the valid pre-trained token embedding file names (source) for the specified token embedding name (embedding_name). If embedding_name is set to None, returns a dict mapping each valid token embedding name to a list of its valid pre-trained file names (source). These can be plugged into gluonnlp.embedding.create(embedding_name, source).
Return type:dict or list
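For example, a minimal sketch (the exact contents of the returned list depend on the installed gluonnlp version):

>>> import gluonnlp
>>> sources = gluonnlp.embedding.list_sources('glove')
>>> 'glove.6B.50d' in sources
True
>>> all_sources = gluonnlp.embedding.list_sources()
>>> isinstance(all_sources, dict)
True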
class gluonnlp.embedding.TokenEmbedding(unknown_token='<unk>', init_unknown_vec=<function zeros>, allow_extend=False, unknown_lookup=None, unknown_autoextend=True)[source]

Token embedding base class.

To load a token embedding from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText, use gluonnlp.embedding.create(). To get all available values of embedding_name and source, use gluonnlp.embedding.list_sources().

Alternatively, to load embedding vectors from a custom pre-trained token embedding file, use gluonnlp.embedding.from_file().

If unknown_token is None, looking up unknown tokens results in KeyError. Otherwise, for every unknown token, if its representation self.unknown_token is encountered in the pre-trained token embedding file, index 0 of self.idx_to_vec maps to the pre-trained token embedding vector loaded from the file; otherwise, index 0 of self.idx_to_vec maps to the token embedding vector initialized by init_unknown_vec.

If a token is encountered multiple times in the pre-trained token embedding file, only the first-encountered token embedding vector will be loaded and the rest will be skipped.

Parameters:
  • unknown_token (hashable object or None, default '<unk>') – Any unknown token will be replaced by unknown_token and consequently will be indexed with the same representation.
  • init_unknown_vec (callback) – The callback used to initialize the embedding vector for the unknown token. Only used if unknown_token is not None.
  • allow_extend (bool, default False) – If True, embedding vectors for previously unknown words can be added via token_embedding[tokens] = vecs. If False, only vectors for known tokens can be updated.
  • unknown_lookup (object subscriptable with a list of tokens, returning an mxnet.ndarray.NDArray, default None) – If not None, unknown_lookup[tokens] is called for any unknown tokens. The result is cached if unknown_autoextend is True.
  • unknown_autoextend (bool, default True) – If True, any unknown token for which a vector was looked up in unknown_lookup together with the resulting vector will be added to token_to_idx, idx_to_token and idx_to_vec, adding a new index. This option is ignored if allow_extend is False.
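A minimal sketch of the unknown-token behavior described above (assuming the default init_unknown_vec, which zero-initializes the unknown vector):

>>> import gluonnlp
>>> emb = gluonnlp.embedding.create('glove', source='glove.6B.50d')
>>> emb['some-token-not-in-glove'].sum().asscalar()
0.0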
__contains__(token)[source]

Check if token is known.

Parameters:token (str) – A token.
Returns:Return True if the token is known. A token is known if it has been assigned an index and vector.
Return type:bool
__getitem__(tokens)[source]

Looks up embedding vectors of text tokens.

Parameters:tokens (str or list of strs) – A token or a list of tokens.
Returns:The embedding vector(s) of the token(s). According to numpy conventions, if tokens is a string, returns a 1-D NDArray (vector); if tokens is a list of strings, returns a 2-D NDArray (matrix) of shape=(len(tokens), vec_len).
Return type:mxnet.ndarray.NDArray
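For example, a minimal sketch illustrating the shape conventions above (the 50-dimensional vectors follow from the 'glove.6B.50d' source):

>>> import gluonnlp
>>> emb = gluonnlp.embedding.create('glove', source='glove.6B.50d')
>>> emb['hello'].shape
(50,)
>>> emb[['hello', 'world']].shape
(2, 50)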
__setitem__(tokens, new_embedding)[source]

Updates embedding vectors for tokens.

If self.allow_extend is True, vectors for previously unknown tokens can be introduced.

Parameters:
  • tokens (hashable object or a list or tuple of hashable objects) – A token or a list of tokens whose embedding vectors are to be updated.
  • new_embedding (mxnet.ndarray.NDArray) – An NDArray to be assigned to the embedding vectors of tokens. Its length must equal the number of tokens and its width must equal the embedding dimension of this TokenEmbedding. If tokens is a singleton, it must be 1-D or 2-D. If tokens is a list of multiple strings, it must be 2-D.
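A minimal sketch (assuming the embedding was created with allow_extend=True, which create() forwards to the TokenEmbedding initializer):

>>> import mxnet as mx
>>> import gluonnlp
>>> emb = gluonnlp.embedding.create('glove', source='glove.6B.50d', allow_extend=True)
>>> emb['my-new-token'] = mx.nd.ones(50)
>>> emb['my-new-token'].sum().asscalar()
50.0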
allow_extend

Allow extension of the TokenEmbedding with new tokens.

If True, TokenEmbedding[tokens] = vec can introduce new tokens that were previously unknown. New indices will be assigned to the newly introduced tokens. If False, only known tokens can be updated.

Returns:Extension of the TokenEmbedding is allowed.
Return type:bool
classmethod deserialize(file_path, **kwargs)[source]

Create a new TokenEmbedding from a serialized one.

TokenEmbedding is serialized by converting the list of tokens, the array of word embeddings and other metadata to numpy arrays, saving all in a single (optionally compressed) Zipfile. See https://docs.scipy.org/doc/numpy/neps/npy-format.html for more information on the format.

Parameters:
  • file_path (str or file) – The path to a file that holds the serialized TokenEmbedding.
  • kwargs (dict) – Keyword arguments are passed to the TokenEmbedding initializer. Useful for attaching unknown_lookup.
static from_file(file_path, elem_delim=' ', encoding='utf8', **kwargs)[source]

Creates a user-defined token embedding from a pre-trained embedding file.

This method loads embedding vectors from a user-defined pre-trained token embedding file. For example, if elem_delim = ' ', the expected format of a custom pre-trained token embedding file may look like:

'hello 0.1 0.2 0.3 0.4 0.5\nworld 1.1 1.2 1.3 1.4 1.5\n'

where embedding vectors of words hello and world are [0.1, 0.2, 0.3, 0.4, 0.5] and [1.1, 1.2, 1.3, 1.4, 1.5] respectively.

Parameters:
  • file_path (str) – The path to the user-defined pre-trained token embedding file.
  • elem_delim (str, default ' ') – The delimiter for splitting a token and every embedding vector element value on the same line of the custom pre-trained token embedding file.
  • encoding (str, default 'utf8') – The encoding scheme for reading the custom pre-trained token embedding file.
  • kwargs (dict) – All other keyword arguments are passed to the TokenEmbedding initializer.
Returns:

The user-defined token embedding instance.

Return type:

instance of gluonnlp.embedding.TokenEmbedding
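For example, a minimal sketch that loads the two-line file shown above (the file name 'my_embedding.txt' is hypothetical; its contents are assumed to match the format example):

>>> import gluonnlp
>>> my_emb = gluonnlp.embedding.TokenEmbedding.from_file('my_embedding.txt')
>>> my_emb['hello'].shape
(5,)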

idx_to_token

Index to token mapping.

Returns:A list of indexed tokens where the list indices and the token indices are aligned.
Return type:list of str
idx_to_vec

Index to vector mapping.

Returns:For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.
Return type:mxnet.ndarray.NDArray
serialize(file_path, compress=True)[source]

Serializes the TokenEmbedding to a file specified by file_path.

TokenEmbedding is serialized by converting the list of tokens, the array of word embeddings and other metadata to numpy arrays, saving all in a single (optionally compressed) Zipfile. See https://docs.scipy.org/doc/numpy/neps/npy-format.html for more information on the format.

Parameters:
  • file_path (str or file) – The path at which to create the file holding the serialized TokenEmbedding. If file is a string or a Path, the .npz extension will be appended to the file name if it is not already there.
  • compress (bool, default True) – Compress the Zipfile or leave it uncompressed.
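A minimal round-trip sketch combining serialize() with deserialize() (the file name 'my_emb.npz' is hypothetical):

>>> import gluonnlp
>>> emb = gluonnlp.embedding.create('glove', source='glove.6B.50d')
>>> emb.serialize('my_emb.npz')
>>> restored = gluonnlp.embedding.TokenEmbedding.deserialize('my_emb.npz')
>>> bool((restored['hello'] == emb['hello']).asnumpy().all())
True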
token_to_idx

Token to index mapping.

Returns:A dictionary of tokens with their corresponding index numbers; inverse vocab.
Return type:dict of str to int
unknown_autoextend

Autoextension behavior for unknown token lookup.

If True, any unknown token for which a vector was looked up in unknown_lookup together with the resulting vector will be added to token_to_idx, idx_to_token and idx_to_vec, adding a new index. Applies only if unknown_lookup is not None.

Returns:Autoextension behavior
Return type:bool
unknown_lookup

Vector lookup for unknown tokens.

If not None, unknown_lookup[tokens] is called for any unknown tokens. The result is cached if unknown_autoextend is True.

Returns:Vector lookup mapping from tokens to vectors.
Return type:Mapping[List[str], mxnet.ndarray.NDArray]
unknown_token

Unknown token representation.

Any token that is unknown will be indexed using the representation of unknown_token.

Returns:Unknown token representation
Return type:hashable object or None
class gluonnlp.embedding.GloVe(source='glove.6B.50d', embedding_root='$MXNET_HOME/embedding', **kwargs)[source]

The GloVe word embedding.

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. (Source from https://nlp.stanford.edu/projects/glove/)

Reference:

GloVe: Global Vectors for Word Representation. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. https://nlp.stanford.edu/pubs/glove.pdf

Website:

https://nlp.stanford.edu/projects/glove/

To get the updated URLs to the externally hosted pre-trained token embedding files, visit https://nlp.stanford.edu/projects/glove/

License for pre-trained embedding:

https://opendatacommons.org/licenses/pddl/

Parameters:
  • source (str, default 'glove.6B.50d') – The name of the pre-trained token embedding file.
  • embedding_root (str, default '$MXNET_HOME/embedding') – The root directory for storing embedding-related files. MXNET_HOME defaults to '~/.mxnet'.
  • kwargs – All other keyword arguments are passed to gluonnlp.embedding.TokenEmbedding.
Variables:
  • idx_to_vec (mxnet.ndarray.NDArray) – For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.
  • unknown_token (hashable object) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.
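A minimal usage sketch (the 50-dimensional shape follows from the default 'glove.6B.50d' source):

>>> import gluonnlp
>>> glove = gluonnlp.embedding.GloVe(source='glove.6B.50d')
>>> glove['beautiful'].shape
(50,)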
class gluonnlp.embedding.FastText(source='wiki.simple', embedding_root='$MXNET_HOME/embedding', load_ngrams=False, ctx=cpu(0), **kwargs)[source]

The fastText word embedding.

FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. (Source from https://fasttext.cc/)

References:

Enriching Word Vectors with Subword Information. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. https://arxiv.org/abs/1607.04606

Bag of Tricks for Efficient Text Classification. Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. https://arxiv.org/abs/1607.01759

FastText.zip: Compressing text classification models. Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Herve Jegou, and Tomas Mikolov. https://arxiv.org/abs/1612.03651

For the 'wiki.multi' embeddings: Word Translation Without Parallel Data. Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. https://arxiv.org/abs/1710.04087

Website:

https://fasttext.cc/

To get the updated URLs to the externally hosted pre-trained token embedding files, visit https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

License for pre-trained embedding:

https://creativecommons.org/licenses/by-sa/3.0/

Parameters:
  • source (str, default 'wiki.simple') – The name of the pre-trained token embedding file.
  • embedding_root (str, default '$MXNET_HOME/embedding') – The root directory for storing embedding-related files. MXNET_HOME defaults to '~/.mxnet'.
  • load_ngrams (bool, default False) – Load vectors for ngrams so that computing vectors for OOV words is possible. This is disabled by default, as it requires downloading an additional 2GB file containing the ngram vectors. Note that facebookresearch did not publish ngram vectors for all of their models. If load_ngrams is True but no ngram vectors are available for the chosen source, a RuntimeError is raised. The ngram vectors are passed to the resulting TokenEmbedding as unknown_lookup (see the sketch after the Variables list below).
  • ctx (mx.Context, default mxnet.cpu()) – Context to load the FasttextEmbeddingModel for ngram vectors to. This parameter is ignored if load_ngrams is False.
  • kwargs – All other keyword arguments are passed to gluonnlp.embedding.TokenEmbedding.
Variables:
  • idx_to_vec (mxnet.ndarray.NDArray) – For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.
  • unknown_token (hashable object) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.
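A minimal usage sketch with ngram vectors enabled (note that this downloads the large additional ngram file described above):

>>> import gluonnlp
>>> fasttext = gluonnlp.embedding.FastText(source='wiki.simple', load_ngrams=True)
>>> fasttext.unknown_lookup is not None
True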
class gluonnlp.embedding.Word2Vec(source='GoogleNews-vectors-negative300', embedding_root='$MXNET_HOME/embedding', **kwargs)[source]

The Word2Vec word embedding.

Word2Vec is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed with continuous bag-of-words or skip-gram architecture for computing vector representations of words.

References:

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.

Website:

https://code.google.com/archive/p/word2vec/

License for pre-trained embedding:

Unspecified

Parameters:
  • source (str, default 'GoogleNews-vectors-negative300') – The name of the pre-trained token embedding file.
  • embedding_root (str, default '$MXNET_HOME/embedding') – The root directory for storing embedding-related files. MXNET_HOME defaults to '~/.mxnet'.
  • kwargs – All other keyword arguments are passed to gluonnlp.embedding.TokenEmbedding.
Variables:
  • idx_to_vec (mxnet.ndarray.NDArray) – For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.
  • unknown_token (hashable object) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.
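A minimal usage sketch (the 300-dimensional shape follows from the default 'GoogleNews-vectors-negative300' source):

>>> import gluonnlp
>>> w2v = gluonnlp.embedding.Word2Vec(source='GoogleNews-vectors-negative300')
>>> w2v['computer'].shape
(300,)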