gluonnlp.embedding

GluonNLP Toolkit provides tools for working with embeddings.

This page describes the gluonnlp APIs for text embedding: loading pre-trained embedding vectors for text tokens and storing them in the mxnet.ndarray.NDArray format, as well as utilities for intrinsic evaluation of text embeddings.

Pre-trained Embeddings

register – Registers a new token embedding.
create – Creates an instance of token embedding.
list_sources – Get valid token embedding names and their pre-trained file names.
TokenEmbedding – Token embedding base class.
GloVe – The GloVe word embedding.
FastText – The fastText word embedding.
Word2Vec – The Word2Vec word embedding.

Intrinsic evaluation

register – Registers a new word embedding evaluation function.
create – Creates an instance of a registered word embedding evaluation function.
list_evaluation_functions – Get valid word embedding evaluation function names.
WordEmbeddingSimilarityFunction – Base class for word embedding similarity functions.
WordEmbeddingAnalogyFunction – Base class for word embedding analogy functions.
CosineSimilarity – Computes the cosine similarity.
ThreeCosAdd – The 3CosAdd analogy function.
ThreeCosMul – The 3CosMul analogy function.
WordEmbeddingSimilarity – Word embeddings similarity task evaluator.
WordEmbeddingAnalogy – Word embeddings analogy task evaluator.

API Reference

Word embeddings.

gluonnlp.embedding.register(embedding_cls)[source]

Registers a new token embedding.

Once an embedding is registered, we can create an instance of this embedding with create().

Examples

>>> @gluonnlp.embedding.register
... class MyTextEmbed(gluonnlp.embedding.TokenEmbedding):
...     def __init__(self, source='my_pretrain_file'):
...         pass
>>> embed = gluonnlp.embedding.create('MyTextEmbed')
>>> print(type(embed))
<class 'MyTextEmbed'>
gluonnlp.embedding.create(embedding_name, **kwargs)[source]

Creates an instance of token embedding.

Creates a token embedding instance by loading embedding vectors from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText. To get all the valid embedding_name and source, use gluonnlp.embedding.list_sources().

Parameters:
  • embedding_name (str) – The token embedding name (case-insensitive).
  • kwargs (dict) – All other keyword arguments are passed to the initializer of the token embedding class. For example, create(embedding_name='fasttext', source='wiki.simple', load_ngrams=True) will return FastText(source='wiki.simple', load_ngrams=True).
Returns:

A token embedding instance that loads embedding vectors from an externally hosted pre-trained token embedding file.

Return type:

An instance of gluonnlp.embedding.TokenEmbedding
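For illustration, a minimal sketch of create() (names are case-insensitive; the pre-trained file is downloaded on first use, so the exact vocabulary depends on the chosen source):

>>> import gluonnlp as nlp
>>> glove = nlp.embedding.create('glove', source='glove.6B.50d')
>>> glove.idx_to_vec.shape[1]
50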

gluonnlp.embedding.list_sources(embedding_name=None)[source]

Get valid token embedding names and their pre-trained file names.

To load token embedding vectors from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText, one should use gluonnlp.embedding.create(embedding_name, source). This method returns all the valid names of source for the specified embedding_name. If embedding_name is set to None, this method returns all the valid names of embedding_name with their associated source.

Parameters:embedding_name (str or None, default None) – The pre-trained token embedding name.
Returns:A list of all the valid pre-trained token embedding file names (source) for the specified token embedding name (embedding_name). If embedding_name is set to None, returns a dict mapping each valid token embedding name to a list of valid pre-trained files (source). They can be plugged into gluonnlp.embedding.create(embedding_name, source).
Return type:dict or list
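A minimal sketch of both calling conventions (the exact contents of the lists depend on the toolkit version):

>>> import gluonnlp as nlp
>>> all_sources = nlp.embedding.list_sources()              # dict: embedding name -> list of sources
>>> 'glove.6B.50d' in nlp.embedding.list_sources('GloVe')   # list of sources for one embedding name
True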
class gluonnlp.embedding.TokenEmbedding(unknown_token='<unk>', init_unknown_vec=<function zeros>, allow_extend=False, unknown_lookup=None, unknown_autoextend=True)[source]

Token embedding base class.

To load token embedding from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText, use gluonnlp.embedding.create(). To get all the available embedding_name and source, use gluonnlp.embedding.list_sources().

Alternatively, to load embedding vectors from a custom pre-trained token embedding file, use gluonnlp.embedding.from_file().

If unknown_token is None, looking up unknown tokens results in KeyError. Otherwise, for every unknown token, if its representation self.unknown_token is encountered in the pre-trained token embedding file, index 0 of self.idx_to_vec maps to the pre-trained token embedding vector loaded from the file; otherwise, index 0 of self.idx_to_vec maps to the token embedding vector initialized by init_unknown_vec.

If a token is encountered multiple times in the pre-trained token embedding file, only the first-encountered token embedding vector will be loaded and the rest will be skipped.

Parameters:
  • unknown_token (hashable object or None, default '<unk>') – Any unknown token will be replaced by unknown_token and consequently will be indexed as the same representation.
  • init_unknown_vec (callback) – The callback used to initialize the embedding vector for the unknown token. Only used if unknown_token is not None.
  • allow_extend (bool, default False) – If True, embedding vectors for previously unknown words can be added via token_embedding[tokens] = vecs. If False, only vectors for known tokens can be updated.
  • unknown_lookup (object subscriptable with list of tokens returning mxnet.ndarray.NDArray, default None) – If not None, unknown_lookup[tokens] is called for any unknown tokens. The result is cached if unknown_autoextend is True.
  • unknown_autoextend (bool, default True) – If True, any unknown token for which a vector was looked up in unknown_lookup together with the resulting vector will be added to token_to_idx, idx_to_token and idx_to_vec, adding a new index. This option is ignored if allow_extend is False.
__contains__(token)[source]

Check if token is known.

Parameters:token (str) – A token.
Returns:Return True if the token is known. A token is known if it has been assigned an index and vector.
Return type:bool
__getitem__(tokens)[source]

Looks up embedding vectors of text tokens.

Parameters:tokens (str or list of strs) – A token or a list of tokens.
Returns:The embedding vector(s) of the token(s). According to numpy conventions, if tokens is a string, returns a 1-D NDArray (vector); if tokens is a list of strings, returns a 2-D NDArray (matrix) of shape=(len(tokens), vec_len).
Return type:mxnet.ndarray.NDArray
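Continuing the glove sketch under create() above, lookups follow the numpy conventions described here:

>>> glove['hello'].shape              # single token -> 1-D vector
(50,)
>>> glove[['hello', 'world']].shape   # list of tokens -> 2-D matrix
(2, 50)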
__setitem__(tokens, new_embedding)[source]

Updates embedding vectors for tokens.

If self.allow_extend is True, vectors for previously unknown tokens can be introduced.

Parameters:
  • tokens (hashable object or a list or tuple of hashable objects) – A token or a list of tokens whose embedding vectors are to be updated.
  • new_embedding (mxnet.ndarray.NDArray) – An NDArray to be assigned to the embedding vectors of tokens. Its length must be equal to the number of tokens and its width must be equal to the embedding dimension. If tokens is a singleton, it must be 1-D or 2-D. If tokens is a list of multiple strings, it must be 2-D.
allow_extend

Allow extension of the TokenEmbedding with new tokens.

If True, TokenEmbedding[tokens] = vec can introduce new tokens that were previously unknown. New indices will be assigned to the newly introduced tokens. If False, only known tokens can be updated.

Returns:Extension of the TokenEmbedding is allowed.
Return type:bool
classmethod deserialize(file_path, **kwargs)[source]

Create a new TokenEmbedding from a serialized one.

TokenEmbedding is serialized by converting the list of tokens, the array of word embeddings and other metadata to numpy arrays, saving all in a single (optionally compressed) Zipfile. See https://docs.scipy.org/doc/numpy-1.14.2/neps/npy-format.html for more information on the format.

Parameters:
  • file_path (str or file) – The path to a file that holds the serialized TokenEmbedding.
  • kwargs (dict) – Keyword arguments are passed to the TokenEmbedding initializer. Useful for attaching unknown_lookup.
static from_file(file_path, elem_delim=' ', encoding='utf8', **kwargs)[source]

Creates a user-defined token embedding from a pre-trained embedding file.

This is to load embedding vectors from a user-defined pre-trained token embedding file. For example, if elem_delim = ' ', the expected format of a custom pre-trained token embedding file may look like:

'hello 0.1 0.2 0.3 0.4 0.5\nworld 1.1 1.2 1.3 1.4 1.5\n'

where embedding vectors of words hello and world are [0.1, 0.2, 0.3, 0.4, 0.5] and [1.1, 1.2, 1.3, 1.4, 1.5] respectively.

Parameters:
  • file_path (str) – The path to the user-defined pre-trained token embedding file.
  • elem_delim (str, default ' ') – The delimiter for splitting a token and every embedding vector element value on the same line of the custom pre-trained token embedding file.
  • encoding (str, default 'utf8') – The encoding scheme for reading the custom pre-trained token embedding file.
  • kwargs (dict) – All other keyword arguments are passed to the TokenEmbedding initializer.
Returns:

The user-defined token embedding instance.

Return type:

instance of gluonnlp.embedding.TokenEmbedding
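A minimal, self-contained sketch (the file name and its two made-up vectors are for illustration only):

>>> with open('my_pretrain_file.txt', 'w') as f:
...     _ = f.write('hello 0.1 0.2 0.3 0.4 0.5\nworld 1.1 1.2 1.3 1.4 1.5\n')
>>> my_embedding = nlp.embedding.TokenEmbedding.from_file('my_pretrain_file.txt')
>>> my_embedding['world'].shape
(5,)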

idx_to_token

Index to token mapping.

Returns:A list of indexed tokens where the list indices and the token indices are aligned.
Return type:list of str
idx_to_vec

Index to vector mapping.

Returns:For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.
Return type:mxnet.ndarray.NDArray
serialize(file_path, compress=True)[source]

Serializes the TokenEmbedding to a file specified by file_path.

TokenEmbedding is serialized by converting the list of tokens, the array of word embeddings and other metadata to numpy arrays, saving all in a single (optionally compressed) Zipfile. See https://docs.scipy.org/doc/numpy-1.14.2/neps/npy-format.html for more information on the format.

Parameters:
  • file_path (str or file) – The path at which to create the file holding the serialized TokenEmbedding. If file is a string or a Path, the .npz extension will be appended to the file name if it is not already there.
  • compress (bool, default True) – Compress the Zipfile or leave it uncompressed.
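Continuing the from_file sketch above, a round trip through serialize() and deserialize() might look like:

>>> my_embedding.serialize('my_pretrain_file.npz')
>>> restored = nlp.embedding.TokenEmbedding.deserialize('my_pretrain_file.npz')
>>> restored.idx_to_vec.shape == my_embedding.idx_to_vec.shape
True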
token_to_idx

Token to index mapping.

Returns:A dictionary of tokens with their corresponding index numbers; inverse vocab.
Return type:dict of str to int
unknown_autoextend

Autoextension behavior for unknown token lookup.

If True, any unknown token for which a vector was looked up in unknown_lookup together with the resulting vector will be added to token_to_idx, idx_to_token and idx_to_vec, adding a new index. Applies only if unknown_lookup is not None.

Returns:Autoextension behavior
Return type:bool
unknown_lookup

Vector lookup for unknown tokens.

If not None, unknown_lookup[tokens] is called for any unknown tokens. The result is cached if unknown_autoextend is True.

Returns:Vector lookup mapping from tokens to vectors.
Return type:Mapping[List[str], mxnet.ndarray.NDArray]
unknown_token

Unknown token representation.

Any token that is unknown will be indexed using the representation of unknown_token.

Returns:Unknown token representation
Return type:hashable object or None
class gluonnlp.embedding.GloVe(source='glove.6B.50d', embedding_root='~/.mxnet/embedding', **kwargs)[source]

The GloVe word embedding.

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. (Source from https://nlp.stanford.edu/projects/glove/)

Reference:

GloVe: Global Vectors for Word Representation. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. https://nlp.stanford.edu/pubs/glove.pdf

Website: https://nlp.stanford.edu/projects/glove/

To get the updated URLs to the externally hosted pre-trained token embedding files, visit https://nlp.stanford.edu/projects/glove/

License for pre-trained embedding: https://opendatacommons.org/licenses/pddl/

Available sources


>>> import warnings; warnings.filterwarnings('ignore');
>>> import gluonnlp as nlp
>>> nlp.embedding.list_sources('GloVe')
['glove.42B.300d', 'glove.6B.100d', 'glove.6B.200d', 'glove.6B.300d', 'glove.6B.50d', 'glove.840B.300d', 'glove.twitter.27B.100d', 'glove.twitter.27B.200d', 'glove.twitter.27B.25d', 'glove.twitter.27B.50d']
Parameters:
  • source (str, default 'glove.6B.50d') – The name of the pre-trained token embedding file.
  • embedding_root (str, default '$MXNET_HOME/embedding') – The root directory for storing embedding-related files. MXNET_HOME defaults to ‘~/.mxnet’.
  • kwargs – All other keyword arguments are passed to gluonnlp.embedding.TokenEmbedding.
Variables:
  • idx_to_vec (mxnet.ndarray.NDArray) – For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.
  • unknown_token (hashable object) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.
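A typical pattern, sketched under the assumption that the default 'glove.6B.50d' source is acceptable (it is downloaded on first use), is to attach the embedding to a Vocab:

>>> vocab = nlp.Vocab(nlp.data.count_tokens(['hello', 'world']))
>>> vocab.set_embedding(nlp.embedding.GloVe(source='glove.6B.50d'))
>>> vocab.embedding['hello'].shape
(50,)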
class gluonnlp.embedding.FastText(source='wiki.simple', embedding_root='~/.mxnet/embedding', load_ngrams=False, ctx=cpu(0), **kwargs)[source]

The fastText word embedding.

FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. (Source from https://fasttext.cc/)

References:

Enriching Word Vectors with Subword Information. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. https://arxiv.org/abs/1607.04606

Bag of Tricks for Efficient Text Classification. Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. https://arxiv.org/abs/1607.01759

FastText.zip: Compressing text classification models. Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Herve Jegou, and Tomas Mikolov. https://arxiv.org/abs/1612.03651

For 'wiki.multi' embeddings: Word Translation Without Parallel Data. Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. https://arxiv.org/abs/1710.04087

Website: https://fasttext.cc/

To get the updated URLs to the externally hosted pre-trained token embedding files, visit https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

License for pre-trained embedding: https://creativecommons.org/licenses/by-sa/3.0/

Available sources


>>> import warnings; warnings.filterwarnings('ignore');
>>> import gluonnlp as nlp
>>> nlp.embedding.list_sources('FastText')
['crawl-300d-2M', 'crawl-300d-2M-subword', 'wiki.aa', 'wiki.ab', 'wiki.ace', 'wiki.ady', 'wiki.af', 'wiki.ak', 'wiki.als', 'wiki.am', 'wiki.ang', 'wiki.an', 'wiki.arc', 'wiki.ar', 'wiki.arz', 'wiki.as', 'wiki.ast', 'wiki.av', 'wiki.ay', 'wiki.azb', 'wiki.az', 'wiki.ba', 'wiki.bar', 'wiki.bat_smg', 'wiki.bcl', 'wiki.be', 'wiki.bg', 'wiki.bh', 'wiki.bi', 'wiki.bjn', 'wiki.bm', 'wiki.bn', 'wiki.bo', 'wiki.bpy', 'wiki.br', 'wiki.bs', 'wiki.bug', 'wiki.bxr', 'wiki.ca', 'wiki.cbk_zam', 'wiki.cdo', 'wiki.ceb', 'wiki.ce', 'wiki.ch', 'wiki.cho', 'wiki.chr', 'wiki.chy', 'wiki.ckb', 'wiki.co', 'wiki.crh', 'wiki.cr', 'wiki.csb', 'wiki.cs', 'wiki.cu', 'wiki.cv', 'wiki.cy', 'wiki.da', 'wiki.de', 'wiki.diq', 'wiki.dsb', 'wiki.dv', 'wiki.dz', 'wiki.ee', 'wiki.el', 'wiki.eml', 'wiki.en', 'wiki.eo', 'wiki.es', 'wiki.et', 'wiki.eu', 'wiki.ext', 'wiki.fa', 'wiki.ff', 'wiki.fi', 'wiki.fiu_vro', 'wiki.fj', 'wiki.fo', 'wiki.fr', 'wiki.frp', 'wiki.frr', 'wiki.fur', 'wiki.fy', 'wiki.gag', 'wiki.gan', 'wiki.ga', 'wiki.gd', 'wiki.glk', 'wiki.gl', 'wiki.gn', 'wiki.gom', 'wiki.got', 'wiki.gu', 'wiki.gv', 'wiki.hak', 'wiki.ha', 'wiki.haw', 'wiki.he', 'wiki.hif', 'wiki.hi', 'wiki.ho', 'wiki.hr', 'wiki.hsb', 'wiki.ht', 'wiki.hu', 'wiki.hy', 'wiki.hz', 'wiki.ia', 'wiki.id', 'wiki.ie', 'wiki.ig', 'wiki.ii', 'wiki.ik', 'wiki.ilo', 'wiki.io', 'wiki.is', 'wiki.it', 'wiki.iu', 'wiki.jam', 'wiki.ja', 'wiki.jbo', 'wiki.jv', 'wiki.kaa', 'wiki.kab', 'wiki.ka', 'wiki.kbd', 'wiki.kg', 'wiki.ki', 'wiki.kj', 'wiki.kk', 'wiki.kl', 'wiki.km', 'wiki.kn', 'wiki.koi', 'wiki.ko', 'wiki.krc', 'wiki.kr', 'wiki.ksh', 'wiki.ks', 'wiki.ku', 'wiki.kv', 'wiki.kw', 'wiki.ky', 'wiki.lad', 'wiki.la', 'wiki.lbe', 'wiki.lb', 'wiki.lez', 'wiki.lg', 'wiki.lij', 'wiki.li', 'wiki.lmo', 'wiki.ln', 'wiki.lo', 'wiki.lrc', 'wiki.ltg', 'wiki.lt', 'wiki.lv', 'wiki.mai', 'wiki.map_bms', 'wiki.mdf', 'wiki.mg', 'wiki.mh', 'wiki.mhr', 'wiki.min', 'wiki.mi', 'wiki.mk', 'wiki.ml', 'wiki.mn', 'wiki.mo', 'wiki.mrj', 'wiki.mr', 'wiki.ms', 'wiki.mt', 'wiki.multi.ar', 'wiki.multi.bg', 'wiki.multi.ca', 'wiki.multi.cs', 'wiki.multi.da', 'wiki.multi.de', 'wiki.multi.el', 'wiki.multi.en', 'wiki.multi.es', 'wiki.multi.et', 'wiki.multi.fi', 'wiki.multi.fr', 'wiki.multi.he', 'wiki.multi.hr', 'wiki.multi.hu', 'wiki.multi.id', 'wiki.multi.it', 'wiki.multi.mk', 'wiki.multi.nl', 'wiki.multi.no', 'wiki.multi.pl', 'wiki.multi.pt', 'wiki.multi.ro', 'wiki.multi.ru', 'wiki.multi.sk', 'wiki.multi.sl', 'wiki.multi.sv', 'wiki.multi.tr', 'wiki.multi.uk', 'wiki.multi.vi', 'wiki.mus', 'wiki.mwl', 'wiki.my', 'wiki.myv', 'wiki.mzn', 'wiki.nah', 'wiki.na', 'wiki.nap', 'wiki.nds_nl', 'wiki.nds', 'wiki.ne', 'wiki.new', 'wiki-news-300d-1M', 'wiki-news-300d-1M-subword', 'wiki.ng', 'wiki.nl', 'wiki.nn', 'wiki.no', 'wiki.nov', 'wiki.vec', 'wiki.nrm', 'wiki.nso', 'wiki.nv', 'wiki.ny', 'wiki.oc', 'wiki.olo', 'wiki.om', 'wiki.or', 'wiki.os', 'wiki.pag', 'wiki.pam', 'wiki.pa', 'wiki.pap', 'wiki.pcd', 'wiki.pdc', 'wiki.pfl', 'wiki.pih', 'wiki.pi', 'wiki.pl', 'wiki.pms', 'wiki.pnb', 'wiki.pnt', 'wiki.ps', 'wiki.pt', 'wiki.qu', 'wiki.rm', 'wiki.rmy', 'wiki.rn', 'wiki.roa_rup', 'wiki.roa_tara', 'wiki.ro', 'wiki.rue', 'wiki.ru', 'wiki.rw', 'wiki.sah', 'wiki.sa', 'wiki.scn', 'wiki.sc', 'wiki.sco', 'wiki.sd', 'wiki.se', 'wiki.sg', 'wiki.sh', 'wiki.simple', 'wiki.si', 'wiki.sk', 'wiki.sl', 'wiki.sm', 'wiki.sn', 'wiki.so', 'wiki.sq', 'wiki.srn', 'wiki.sr', 'wiki.ss', 'wiki.st', 'wiki.stq', 'wiki.su', 'wiki.sv', 'wiki.sw', 'wiki.szl', 'wiki.ta', 'wiki.tcy', 'wiki.te', 'wiki.tet', 'wiki.tg', 'wiki.th', 'wiki.ti', 
'wiki.tk', 'wiki.tl', 'wiki.tn', 'wiki.to', 'wiki.tpi', 'wiki.tr', 'wiki.ts', 'wiki.tt', 'wiki.tum', 'wiki.tw', 'wiki.ty', 'wiki.tyv', 'wiki.udm', 'wiki.ug', 'wiki.uk', 'wiki.ur', 'wiki.uz', 'wiki.ve', 'wiki.vep', 'wiki.vi', 'wiki.vls', 'wiki.vo', 'wiki.wa', 'wiki.war', 'wiki.wo', 'wiki.wuu', 'wiki.xal', 'wiki.xh', 'wiki.xmf', 'wiki.yi', 'wiki.yo', 'wiki.za', 'wiki.zea', 'wiki.zh_classical', 'wiki.zh_min_nan', 'wiki.zh', 'wiki.zh_yue', 'wiki.zu', 'cc.af.300', 'cc.als.300', 'cc.am.300', 'cc.an.300', 'cc.ar.300', 'cc.arz.300', 'cc.as.300', 'cc.ast.300', 'cc.az.300', 'cc.azb.300', 'cc.ba.300', 'cc.bar.300', 'cc.bcl.300', 'cc.be.300', 'cc.bg.300', 'cc.bh.300', 'cc.bn.300', 'cc.bo.300', 'cc.bpy.300', 'cc.br.300', 'cc.bs.300', 'cc.ca.300', 'cc.ce.300', 'cc.ceb.300', 'cc.ckb.300', 'cc.co.300', 'cc.cs.300', 'cc.cv.300', 'cc.cy.300', 'cc.da.300', 'cc.de.300', 'cc.diq.300', 'cc.dv.300', 'cc.el.300', 'cc.eml.300', 'cc.en.300', 'cc.eo.300', 'cc.es.300', 'cc.et.300', 'cc.eu.300', 'cc.fa.300', 'cc.fi.300', 'cc.fr.300', 'cc.frr.300', 'cc.fy.300', 'cc.ga.300', 'cc.gd.300', 'cc.gl.300', 'cc.gom.300', 'cc.gu.300', 'cc.gv.300', 'cc.he.300', 'cc.hi.300', 'cc.hif.300', 'cc.hr.300', 'cc.hsb.300', 'cc.ht.300', 'cc.hu.300', 'cc.hy.300', 'cc.ia.300', 'cc.id.300', 'cc.ilo.300', 'cc.io.300', 'cc.is.300', 'cc.it.300', 'cc.ja.300', 'cc.jv.300', 'cc.ka.300', 'cc.kk.300', 'cc.km.300', 'cc.kn.300', 'cc.ko.300', 'cc.ku.300', 'cc.ky.300', 'cc.la.300', 'cc.lb.300', 'cc.li.300', 'cc.lmo.300', 'cc.lt.300', 'cc.lv.300', 'cc.mai.300', 'cc.mg.300', 'cc.mhr.300', 'cc.min.300', 'cc.mk.300', 'cc.ml.300', 'cc.mn.300', 'cc.mr.300', 'cc.mrj.300', 'cc.ms.300', 'cc.mt.300', 'cc.mwl.300', 'cc.my.300', 'cc.myv.300', 'cc.mzn.300', 'cc.nah.300', 'cc.nap.300', 'cc.nds.300', 'cc.ne.300', 'cc.new.300', 'cc.nl.300', 'cc.nn.300', 'cc.no.300', 'cc.nso.300', 'cc.oc.300', 'cc.or.300', 'cc.os.300', 'cc.pa.300', 'cc.pam.300', 'cc.pfl.300', 'cc.pl.300', 'cc.pms.300', 'cc.pnb.300', 'cc.ps.300', 'cc.pt.300', 'cc.qu.300', 'cc.rm.300', 'cc.ro.300', 'cc.ru.300', 'cc.sa.300', 'cc.sah.300', 'cc.sc.300', 'cc.scn.300', 'cc.sco.300', 'cc.sd.300', 'cc.sh.300', 'cc.si.300', 'cc.sk.300', 'cc.sl.300', 'cc.so.300', 'cc.sq.300', 'cc.sr.300', 'cc.su.300', 'cc.sv.300', 'cc.sw.300', 'cc.ta.300', 'cc.te.300', 'cc.tg.300', 'cc.th.300', 'cc.tk.300', 'cc.tl.300', 'cc.tr.300', 'cc.tt.300', 'cc.ug.300', 'cc.uk.300', 'cc.ur.300', 'cc.uz.300', 'cc.vec.300', 'cc.vi.300', 'cc.vls.300', 'cc.vo.300', 'cc.wa.300', 'cc.war.300', 'cc.xmf.300', 'cc.yi.300', 'cc.yo.300', 'cc.zea.300', 'cc.zh.300']
Parameters:
  • source (str, default 'wiki.simple') – The name of the pre-trained token embedding file.
  • embedding_root (str, default '$MXNET_HOME/embedding') – The root directory for storing embedding-related files. MXNET_HOME defaults to ‘~/.mxnet’.
  • load_ngrams (bool, default False) – Load vectors for ngrams so that computing vectors for OOV words is possible. This is disabled by default as it requires downloading an additional 2GB file containing the vectors for ngrams. Note that facebookresearch did not publish ngram vectors for all their models. If load_ngrams is True but no ngram vectors are available for the chosen source, a RuntimeError is thrown. The ngram vectors are passed to the resulting TokenEmbedding as unknown_lookup.
  • ctx (mx.Context, default mxnet.cpu()) – Context to load the FasttextEmbeddingModel for ngram vectors to. This parameter is ignored if load_ngrams is False.
  • kwargs – All other keyword arguments are passed to gluonnlp.embedding.TokenEmbedding.
Variables:
  • idx_to_vec (mxnet.ndarray.NDArray) – For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.
  • unknown_token (hashable object) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.
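With load_ngrams=True, vectors for out-of-vocabulary tokens can be computed from subword ngrams through unknown_lookup; a minimal sketch (note the additional, sizeable ngram file that is downloaded):

>>> fasttext = nlp.embedding.FastText(source='wiki.simple', load_ngrams=True)
>>> 'hellooo' in fasttext              # misspelling, presumably not in the vocabulary
False
>>> fasttext['hellooo'].shape          # vector assembled from its ngram vectors
(300,)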
class gluonnlp.embedding.Word2Vec(source='GoogleNews-vectors-negative300', embedding_root='~/.mxnet/embedding', **kwargs)[source]

The Word2Vec word embedding.

Word2Vec is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed with continuous bag-of-words or skip-gram architecture for computing vector representations of words.

References:

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.

Website: https://code.google.com/archive/p/word2vec/

License for pre-trained embedding: Unspecified

Available sources


>>> import warnings; warnings.filterwarnings('ignore');
>>> import gluonnlp as nlp
>>> nlp.embedding.list_sources('Word2Vec')
['GoogleNews-vectors-negative300', 'freebase-vectors-skipgram1000-en', 'freebase-vectors-skipgram1000']
Parameters:
  • source (str, default 'GoogleNews-vectors-negative300') – The name of the pre-trained token embedding file.
  • embedding_root (str, default '$MXNET_HOME/embedding') – The root directory for storing embedding-related files. MXNET_HOME defaults to ‘~/.mxnet’.
  • kwargs – All other keyword arguments are passed to gluonnlp.embedding.TokenEmbedding.
Variables:
  • idx_to_vec (mxnet.ndarray.NDArray) – For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.
  • unknown_token (hashable object) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.
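A minimal sketch (the GoogleNews file is large and is downloaded on first use; the token choice is arbitrary):

>>> w2v = nlp.embedding.Word2Vec(source='GoogleNews-vectors-negative300')
>>> w2v['computer'].shape
(300,)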

Models for intrinsic and extrinsic word embedding evaluation

gluonnlp.embedding.evaluation.register(class_)[source]

Registers a new word embedding evaluation function.

Once registered, we can create an instance with create().

Examples

>>> @gluonnlp.embedding.evaluation.register
... class MySimilarityFunction(gluonnlp.embedding.evaluation.WordEmbeddingSimilarityFunction):
...     def __init__(self, eps=1e-10):
...         pass
>>> similarity_function = gluonnlp.embedding.evaluation.create('similarity',
...                                                            'MySimilarityFunction')
>>> print(type(similarity_function))
<class 'MySimilarityFunction'>
>>> @gluonnlp.embedding.evaluation.register
... class MyAnalogyFunction(gluonnlp.embedding.evaluation.WordEmbeddingAnalogyFunction):
...     def __init__(self, k=1, eps=1E-10):
...         pass
>>> analogy_function = gluonnlp.embedding.evaluation.create('analogy', 'MyAnalogyFunction')
>>> print(type(analogy_function))
<class 'MyAnalogyFunction'>
gluonnlp.embedding.evaluation.create(kind, name, **kwargs)[source]

Creates an instance of a registered word embedding evaluation function.

Parameters:
  • kind (str) – The kind of evaluation function to create, either 'similarity' or 'analogy'.
  • name (str) – The evaluation function name (case-insensitive).
Returns:

An instance of the registered word embedding evaluation function named name.

gluonnlp.embedding.evaluation.list_evaluation_functions(kind=None)[source]

Get valid word embedding evaluation function names.

Parameters:kind (['similarity', 'analogy', None]) – Return only valid names for similarity, analogy or both kinds of functions.
Returns:A list of all the valid evaluation function names for the specified kind. If kind is set to None, returns a dict mapping each kind to its list of valid function names. The valid names can be plugged into gluonnlp.embedding.evaluation.create(kind, name).
Return type:dict or list
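A minimal sketch of listing and creating evaluation functions (the registered names may differ between toolkit versions):

>>> import gluonnlp as nlp
>>> 'CosineSimilarity' in nlp.embedding.evaluation.list_evaluation_functions('similarity')
True
>>> sim_fn = nlp.embedding.evaluation.create('similarity', 'CosineSimilarity')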
class gluonnlp.embedding.evaluation.WordEmbeddingSimilarityFunction(prefix=None, params=None)[source]

Base class for word embedding similarity functions.

class gluonnlp.embedding.evaluation.WordEmbeddingAnalogyFunction(prefix=None, params=None)[source]

Base class for word embedding analogy functions.

Parameters:
  • idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.
  • k (int, default 1) – Number of analogies to predict per input triple.
  • eps (float, optional, default=1e-10) – A small constant for numerical stability.
class gluonnlp.embedding.evaluation.CosineSimilarity(eps=1e-10, **kwargs)[source]

Computes the cosine similarity.

Parameters:eps (float, optional, default=1e-10) – A small constant for numerical stability.
hybrid_forward(F, x, y)[source]

Compute the cosine similarity between two batches of vectors.

The cosine similarity is the dot product between the L2 normalized vectors.

Parameters:
  • x (Symbol or NDArray) –
  • y (Symbol or NDArray) –
Returns:

similarity – The cosine similarity between the corresponding vectors in x and y.

Return type:

Symbol or NDArray
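A minimal sketch with two toy batches of vectors (the values are made up; the block has no parameters, so no explicit initialization should be needed):

>>> import mxnet as mx
>>> cos = nlp.embedding.evaluation.CosineSimilarity()
>>> x = mx.nd.array([[1., 0., 0.], [0., 1., 0.]])
>>> y = mx.nd.array([[1., 0., 0.], [1., 0., 0.]])
>>> cos(x, y).shape                    # one similarity per row, roughly [1., 0.] here
(2,)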

class gluonnlp.embedding.evaluation.ThreeCosMul(idx_to_vec, k=1, eps=1e-10, exclude_question_words=True, **kwargs)[source]

The 3CosMul analogy function.

The 3CosMul analogy function is defined as

\[\arg\max_{b^* \in V}\frac{\cos(b^*, b)\cos(b^*, a)}{\cos(b^*, a^*) + \epsilon}\]

See the following paper for more details:

  • Levy, O., & Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In R. Morante & W. Yih (Eds.), Proceedings of the Eighteenth Conference on Computational Natural Language Learning, CoNLL 2014, Baltimore, Maryland, USA, June 26-27, 2014 (pp. 171–180). ACL.
Parameters:
  • idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.
  • k (int, default 1) – Number of analogies to predict per input triple.
  • exclude_question_words (bool, default True) – Exclude the 3 question words from being a valid answer.
  • eps (float, optional, default=1e-10) – A small constant for numerical stability.
hybrid_forward(F, words1, words2, words3, weight)[source]

Compute ThreeCosMul for given question words.

Parameters:
  • words1 (Symbol or NDArray) – Question words at first position. Shape (batch_size, )
  • words2 (Symbol or NDArray) – Question words at second position. Shape (batch_size, )
  • words3 (Symbol or NDArray) – Question words at third position. Shape (batch_size, )
Returns:

Predicted answer words. Shape (batch_size, k).

Return type:

Symbol or NDArray
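A minimal sketch with a made-up 10 x 5 embedding matrix and arbitrary question-word indices (only the output shape is asserted here):

>>> import mxnet as mx
>>> idx_to_vec = mx.nd.random.uniform(shape=(10, 5))
>>> tcm = nlp.embedding.evaluation.ThreeCosMul(idx_to_vec, k=2)
>>> tcm.initialize()
>>> tcm(mx.nd.array([1]), mx.nd.array([2]), mx.nd.array([3])).shape
(1, 2)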

class gluonnlp.embedding.evaluation.ThreeCosAdd(idx_to_vec, normalize=True, k=1, eps=1e-10, exclude_question_words=True, **kwargs)[source]

The 3CosAdd analogy function.

The 3CosAdd analogy function is defined as

\[\arg\max_{b^* \in V}[\cos(b^*, b - a + a^*)]\]

See the following paper for more details:

  • Levy, O., & Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In R. Morante & W. Yih (Eds.), Proceedings of the Eighteenth Conference on Computational Natural Language Learning, CoNLL 2014, Baltimore, Maryland, USA, June 26-27, 2014 (pp. 171–180). ACL.
Parameters:
  • idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.
  • normalize (bool, default True) – Normalize all word embeddings before computing the analogy.
  • k (int, default 1) – Number of analogies to predict per input triple.
  • exclude_question_words (bool, default True) – Exclude the 3 question words from being a valid answer.
  • eps (float, optional, default=1e-10) – A small constant for numerical stability.
hybrid_forward(F, words1, words2, words3, weight)[source]

Compute ThreeCosAdd for given question words.

Parameters:
  • words1 (Symbol or NDArray) – Question words at first position. Shape (batch_size, )
  • words2 (Symbol or NDArray) – Question words at second position. Shape (batch_size, )
  • words3 (Symbol or NDArray) – Question words at third position. Shape (batch_size, )
Returns:

Predicted answer words. Shape (batch_size, k).

Return type:

Symbol or NDArray
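ThreeCosAdd is used the same way; a minimal variant reusing the made-up idx_to_vec from the ThreeCosMul sketch above:

>>> tca = nlp.embedding.evaluation.ThreeCosAdd(idx_to_vec, normalize=True, k=2)
>>> tca.initialize()
>>> tca(mx.nd.array([1]), mx.nd.array([2]), mx.nd.array([3])).shape
(1, 2)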

class gluonnlp.embedding.evaluation.WordEmbeddingSimilarity(idx_to_vec, similarity_function='CosineSimilarity', eps=1e-10, **kwargs)[source]

Word embeddings similarity task evaluator.

Parameters:
  • idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.
  • similarity_function (str, default 'CosineSimilarity') – Name of a registered WordEmbeddingSimilarityFunction.
  • eps (float, optional, default=1e-10) – A small constant for numerical stability.
hybrid_forward(F, words1, words2, weight)[source]

Predict the similarity of words1 and words2.

Parameters:
  • words1 (Symbol or NDArray) – The indices of the words we wish to compare to the words in words2.
  • words2 (Symbol or NDArray) – The indices of the words we wish to compare to the words in words1.
Returns:

similarity – The similarity computed by WordEmbeddingSimilarity.similarity_function.

Return type:

Symbol or NDArray
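A minimal sketch, assuming GloVe vectors attached to a small vocabulary as in the GloVe section above (the word pairs are arbitrary; the pre-trained file is downloaded on first use):

>>> import mxnet as mx
>>> vocab = nlp.Vocab(nlp.data.count_tokens(['man', 'woman', 'king', 'queen', 'car']))
>>> vocab.set_embedding(nlp.embedding.GloVe(source='glove.6B.50d'))
>>> evaluator = nlp.embedding.evaluation.WordEmbeddingSimilarity(vocab.embedding.idx_to_vec)
>>> evaluator.initialize()
>>> words1 = mx.nd.array(vocab[['king', 'car']])
>>> words2 = mx.nd.array(vocab[['queen', 'queen']])
>>> evaluator(words1, words2).shape
(2,)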

class gluonnlp.embedding.evaluation.WordEmbeddingAnalogy(idx_to_vec, analogy_function='ThreeCosMul', k=1, exclude_question_words=True, **kwargs)[source]

Word embeddings analogy task evaluator.

Parameters:
  • idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.
  • analogy_function (str, default 'ThreeCosMul') – Name of a registered WordEmbeddingAnalogyFunction.
  • k (int, default 1) – Number of analogies to predict per input triple.
  • exclude_question_words (bool, default True) – Exclude the 3 question words from being a valid answer.
hybrid_forward(F, words1, words2, words3)[source]

Compute analogies for given question words.

Parameters:
  • words1 (Symbol or NDArray) – Word indices of first question words. Shape (batch_size, ).
  • words2 (Symbol or NDArray) – Word indices of second question words. Shape (batch_size, ).
  • words3 (Symbol or NDArray) – Word indices of third question words. Shape (batch_size, ).
Returns:

predicted_indices – Indices of predicted analogies of shape (batch_size, k)

Return type:

Symbol or NDArray
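Continuing the WordEmbeddingSimilarity sketch above, an analogy query man : woman :: king : ? might look like the following (only the output shape is asserted; the actual prediction depends on the embedding):

>>> analogy = nlp.embedding.evaluation.WordEmbeddingAnalogy(vocab.embedding.idx_to_vec, k=1)
>>> analogy.initialize()
>>> w1, w2, w3 = (mx.nd.array(vocab[[t]]) for t in ('man', 'woman', 'king'))
>>> analogy(w1, w2, w3).shape
(1, 1)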