gluonnlp.embedding

GluonNLP Toolkit provides tools for working with embeddings.

This page describes the gluonnlp APIs for text embedding, such as loading pre-trained embedding vectors for text tokens and storing them in the mxnet.ndarray.NDArray format, as well as utilities for intrinsic evaluation of text embeddings.

Pre-trained Embeddings

register

Registers a new token embedding.

create

Creates an instance of token embedding.

list_sources

Get valid token embedding names and their pre-trained file names.

TokenEmbedding

Token embedding base class.

GloVe

The GloVe word embedding.

FastText

The fastText word embedding.

Word2Vec

The Word2Vec word embedding.

Intrinsic evaluation

register

Registers a new word embedding evaluation function.

create

Creates an instance of a registered word embedding evaluation function.

list_evaluation_functions

Get valid word embedding evaluation function names.

WordEmbeddingSimilarityFunction

Base class for word embedding similarity functions.

WordEmbeddingAnalogyFunction

Base class for word embedding analogy functions.

CosineSimilarity

Computes the cosine similarity.

ThreeCosAdd

The 3CosAdd analogy function.

ThreeCosMul

The 3CosMul analogy function.

WordEmbeddingSimilarity

Word embeddings similarity task evaluator.

WordEmbeddingAnalogy

Word embeddings analogy task evaluator.

API Reference

Word embeddings.

gluonnlp.embedding.register(embedding_cls)[source]

Registers a new token embedding.

Once an embedding is registered, we can create an instance of this embedding with create().

Examples

>>> @gluonnlp.embedding.register
... class MyTextEmbed(gluonnlp.embedding.TokenEmbedding):
...     def __init__(self, source='my_pretrain_file'):
...         pass
>>> embed = gluonnlp.embedding.create('MyTextEmbed')
>>> print(type(embed))
<class 'gluonnlp.embedding.token_embedding.MyTextEmbed'>
gluonnlp.embedding.create(embedding_name, **kwargs)[source]

Creates an instance of token embedding.

Creates a token embedding instance by loading embedding vectors from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText. To get all the valid embedding_name and source, use gluonnlp.embedding.list_sources().

Parameters
  • embedding_name (str) – The token embedding name (case-insensitive).

  • kwargs (dict) – All other keyword arguments are passed to the initializer of the token embedding class. For example, create(embedding_name='fasttext', source='wiki.simple', load_ngrams=True) will return FastText(source='wiki.simple', load_ngrams=True).

Returns

A token embedding instance that loads embedding vectors from an externally hosted pre-trained token embedding file.

Return type

An instance of gluonnlp.embedding.TokenEmbedding
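The register/create pattern above can be sketched in plain Python. The following is a hypothetical minimal registry, not the actual gluonnlp implementation:

```python
# Minimal sketch of a register/create registry (hypothetical; the real
# gluonnlp registry also validates the registered class).
_REGISTRY = {}

def register(cls):
    """Register an embedding class under its lowercased name."""
    _REGISTRY[cls.__name__.lower()] = cls
    return cls

def create(name, **kwargs):
    """Instantiate a registered class; the lookup is case-insensitive."""
    return _REGISTRY[name.lower()](**kwargs)

@register
class MyTextEmbed:
    def __init__(self, source='my_pretrain_file'):
        self.source = source

embed = create('MyTextEmbed', source='custom_file')
print(type(embed).__name__)  # MyTextEmbed
```

Keying the registry by lowercased class name is what makes embedding_name case-insensitive in create().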

gluonnlp.embedding.list_sources(embedding_name=None)[source]

Get valid token embedding names and their pre-trained file names.

To load token embedding vectors from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText, one should use gluonnlp.embedding.create(embedding_name, source). This method returns all the valid names of source for the specified embedding_name. If embedding_name is set to None, this method returns all the valid names of embedding_name with their associated source.

Parameters

embedding_name (str or None, default None) – The pre-trained token embedding name.

Returns

A list of all the valid pre-trained token embedding file names (source) for the specified token embedding name (embedding_name). If the text embedding name is set to None, returns a dict mapping each valid token embedding name to a list of valid pre-trained files (source). They can be plugged into gluonnlp.embedding.create(embedding_name, source).

Return type

dict or list

class gluonnlp.embedding.TokenEmbedding(unknown_token='<unk>', init_unknown_vec=<function zeros>, allow_extend=False, unknown_lookup=None, idx_to_token=None, idx_to_vec=None)[source]

Token embedding base class.

To load token embedding from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText, use gluonnlp.embedding.create(). To get all the available embedding_name and source, use gluonnlp.embedding.list_sources().

Alternatively, to load embedding vectors from a custom pre-trained token embedding file, use gluonnlp.embedding.TokenEmbedding.from_file().

If unknown_token is None, looking up unknown tokens results in KeyError. Otherwise, for every unknown token, if its representation self.unknown_token is encountered in the pre-trained token embedding file, index 0 of self.idx_to_vec maps to the pre-trained token embedding vector loaded from the file; otherwise, index 0 of self.idx_to_vec maps to the token embedding vector initialized by init_unknown_vec.

If a token is encountered multiple times in the pre-trained token embedding file, only the first-encountered token embedding vector will be loaded and the rest will be skipped.

Parameters
  • unknown_token (hashable object or None, default '<unk>') – Any unknown token will be replaced by unknown_token and consequently will be indexed with the same representation.

  • init_unknown_vec (callback, default nd.zeros) – The callback used to initialize the embedding vector for the unknown token. Only used if unknown_token is not None, idx_to_token is not None, and idx_to_token does not contain unknown_token.

  • allow_extend (bool, default False) – If True, embedding vectors for previously unknown words can be added via token_embedding[tokens] = vecs. If False, only vectors for known tokens can be updated.

  • unknown_lookup (object subscriptable with list of tokens returning nd.NDArray, default None) – If not None, the TokenEmbedding obtains embeddings for unknown tokens automatically from unknown_lookup[unknown_tokens]. For example, in a FastText model, embeddings for unknown tokens can be computed from the subword information.

  • idx_to_token (list of str or None, default None) – If not None, a list of tokens for which the idx_to_vec argument provides embeddings. The list indices and the indices of idx_to_vec must be aligned. If idx_to_token is not None, idx_to_vec must not be None either. If idx_to_token is None, an empty TokenEmbedding object is created. If allow_extend is True, tokens and their embeddings can be added to the TokenEmbedding at a later stage.

  • idx_to_vec (mxnet.ndarray.NDArray or None, default None) – If not None, a NDArray containing embeddings for the tokens specified in idx_to_token. The first dimension of idx_to_vec must be aligned with idx_to_token. If idx_to_vec is not None, idx_to_token must not be None either. If idx_to_vec is None, an empty TokenEmbedding object is created. If allow_extend is True, tokens and their embeddings can be added to the TokenEmbedding at a later stage. No copy of the idx_to_vec array is made as long as unknown_token is None or an embedding for unknown_token is specified in idx_to_vec.
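The unknown-token semantics described above (index 0 reserved for unknown_token, KeyError lookups when unknown_token is None) can be sketched in plain Python. This is a hypothetical simplification, not the actual implementation:

```python
# Hypothetical, simplified sketch of TokenEmbedding lookup semantics.
class SimpleTokenEmbedding:
    def __init__(self, tokens, vectors, unknown_token='<unk>', dim=3):
        # Index 0 is reserved for the unknown token; its vector is
        # initialized to zeros, mirroring the nd.zeros default.
        self.unknown_token = unknown_token
        self.idx_to_token = [unknown_token] + list(tokens)
        self.idx_to_vec = [[0.0] * dim] + [list(v) for v in vectors]
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}

    def __contains__(self, token):
        return token in self.token_to_idx

    def __getitem__(self, token):
        if token not in self.token_to_idx and self.unknown_token is None:
            raise KeyError(token)  # no unknown token configured
        # Unknown tokens fall back to index 0.
        return self.idx_to_vec[self.token_to_idx.get(token, 0)]

embed = SimpleTokenEmbedding(['hello'], [[0.1, 0.2, 0.3]])
print(embed['hello'])    # known token -> its vector
print(embed['missing'])  # unknown token -> index 0 vector (zeros)
```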

__contains__(token)[source]

Check if token is known.

Parameters

token (str) – A token.

Returns

True if the token is known. A token is known if it has been assigned an index and a vector.

Return type

bool

__getitem__(tokens)[source]

Looks up embedding vectors of text tokens.

Parameters

tokens (str or list of strs) – A token or a list of tokens.

Returns

The embedding vector(s) of the token(s). According to numpy conventions, if tokens is a string, returns a 1-D NDArray (vector); if tokens is a list of strings, returns a 2-D NDArray (matrix) of shape=(len(tokens), vec_len).

Return type

mxnet.ndarray.NDArray

__setitem__(tokens, new_embedding)[source]

Updates embedding vectors for tokens.

If self.allow_extend is True, vectors for previously unknown tokens can be introduced.

Parameters
  • tokens (hashable object or a list or tuple of hashable objects) – A token or a list of tokens whose embedding vectors are to be updated.

  • new_embedding (mxnet.ndarray.NDArray) – An NDArray to be assigned to the embedding vectors of tokens. Its length must be equal to the number of tokens and its width must be equal to the dimension of embedding of the glossary. If tokens is a singleton, it must be 1-D or 2-D. If tokens is a list of multiple strings, it must be 2-D.

property allow_extend

Allow extension of the TokenEmbedding with new tokens.

If True, TokenEmbedding[tokens] = vec can introduce new tokens that were previously unknown. New indices will be assigned to the newly introduced tokens. If False, only known tokens can be updated.

Returns

Extension of the TokenEmbedding is allowed.

Return type

bool

static deserialize(file_path, **kwargs)[source]

Create a new TokenEmbedding from a serialized one.

TokenEmbedding is serialized by converting the list of tokens, the array of word embeddings and other metadata to numpy arrays, saving all in a single (optionally compressed) Zipfile. See https://docs.scipy.org/doc/numpy-1.14.2/neps/npy-format.html for more information on the format.

Parameters
  • file_path (str or file) – The path to a file that holds the serialized TokenEmbedding.

  • kwargs (dict) – Keyword arguments are passed to the TokenEmbedding initializer. Useful for attaching unknown_lookup.

static from_file(file_path, elem_delim=' ', encoding='utf8', **kwargs)[source]

Creates a user-defined token embedding from a pre-trained embedding file.

This loads embedding vectors from a user-defined pre-trained token embedding file. For example, if elem_delim = ' ', the expected format of a custom pre-trained token embedding file may look like:

'hello 0.1 0.2 0.3 0.4 0.5\nworld 1.1 1.2 1.3 1.4 1.5\n'

where embedding vectors of words hello and world are [0.1, 0.2, 0.3, 0.4, 0.5] and [1.1, 1.2, 1.3, 1.4, 1.5] respectively.

Parameters
  • file_path (str) – The path to the user-defined pre-trained token embedding file.

  • elem_delim (str, default ' ') – The delimiter for splitting a token and every embedding vector element value on the same line of the custom pre-trained token embedding file.

  • encoding (str, default 'utf8') – The encoding scheme for reading the custom pre-trained token embedding file.

  • kwargs (dict) – All other keyword arguments are passed to the TokenEmbedding initializer.

Returns

The user-defined token embedding instance.

Return type

instance of gluonnlp.embedding.TokenEmbedding
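Assuming the text format shown above, a minimal parser for such files might look like this. It is a hypothetical sketch of the parsing step only; from_file additionally builds a full TokenEmbedding from the result:

```python
# Hypothetical sketch: parse a text-format embedding file into aligned
# token and vector lists.
def parse_embedding_text(text, elem_delim=' '):
    idx_to_token, idx_to_vec = [], []
    seen = set()
    for line in text.strip().split('\n'):
        parts = line.rstrip().split(elem_delim)
        token, elems = parts[0], [float(x) for x in parts[1:]]
        # Only the first occurrence of a token is kept; repeats are skipped.
        if token in seen:
            continue
        seen.add(token)
        idx_to_token.append(token)
        idx_to_vec.append(elems)
    return idx_to_token, idx_to_vec

tokens, vecs = parse_embedding_text(
    'hello 0.1 0.2 0.3 0.4 0.5\nworld 1.1 1.2 1.3 1.4 1.5\n')
print(tokens)  # ['hello', 'world']
```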

property idx_to_token

Index to token mapping.

Returns

A list of indexed tokens where the list indices and the token indices are aligned.

Return type

list of str

property idx_to_vec

Index to vector mapping.

Returns

For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.

Return type

mxnet.ndarray.NDArray

serialize(file_path, compress=True)[source]

Serializes the TokenEmbedding to a file specified by file_path.

TokenEmbedding is serialized by converting the list of tokens, the array of word embeddings and other metadata to numpy arrays, saving all in a single (optionally compressed) Zipfile. See https://docs.scipy.org/doc/numpy-1.14.2/neps/npy-format.html for more information on the format.

Parameters
  • file_path (str or file) – The path at which to create the file holding the serialized TokenEmbedding. If file is a string or a Path, the .npz extension will be appended to the file name if it is not already there.

  • compress (bool, default True) – Compress the Zipfile or leave it uncompressed.
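The npz round-trip behind serialize/deserialize can be illustrated with plain numpy. The array names used here are an assumption for illustration; only the round-trip idea matches the description above:

```python
# Sketch of saving aligned token/vector arrays to a (compressed) npz
# Zipfile and loading them back.
import os
import tempfile

import numpy as np

idx_to_token = np.array(['<unk>', 'hello', 'world'])
idx_to_vec = np.array([[0.0, 0.0], [0.1, 0.2], [1.1, 1.2]])

path = os.path.join(tempfile.mkdtemp(), 'embedding.npz')
np.savez_compressed(path, idx_to_token=idx_to_token, idx_to_vec=idx_to_vec)

with np.load(path) as data:
    restored_tokens = data['idx_to_token']
    restored_vecs = data['idx_to_vec']
print(restored_tokens.tolist())  # ['<unk>', 'hello', 'world']
```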

property token_to_idx

Token to index mapping.

Returns

A dictionary mapping each token to its index; the inverse of idx_to_token.

Return type

dict of str to int

property unknown_lookup

Vector lookup for unknown tokens.

If not None, unknown_lookup[tokens] is automatically called for any unknown tokens.

Returns

Vector lookup mapping from tokens to vectors.

Return type

Mapping[List[str], mxnet.ndarray.NDArray]

property unknown_token

Unknown token representation.

Any token that is unknown will be indexed using the representation of unknown_token.

Returns

Unknown token representation

Return type

hashable object or None

class gluonnlp.embedding.GloVe(source='glove.6B.50d', embedding_root='~/.mxnet/embedding', **kwargs)[source]

The GloVe word embedding.

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. (Source from https://nlp.stanford.edu/projects/glove/)

Reference:

GloVe: Global Vectors for Word Representation. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. https://nlp.stanford.edu/pubs/glove.pdf

Website: https://nlp.stanford.edu/projects/glove/

To get the updated URLs to the externally hosted pre-trained token embedding files, visit https://nlp.stanford.edu/projects/glove/

License for pre-trained embedding: https://opendatacommons.org/licenses/pddl/

Available sources

>>> import gluonnlp as nlp
>>> nlp.embedding.list_sources('GloVe')
['glove.42B.300d', 'glove.6B.100d', 'glove.6B.200d', 'glove.6B.300d', 'glove.6B.50d', 'glove.840B.300d', 'glove.twitter.27B.100d', 'glove.twitter.27B.200d', 'glove.twitter.27B.25d', 'glove.twitter.27B.50d']
Parameters
  • source (str, default 'glove.6B.50d') – The name of the pre-trained token embedding file.

  • embedding_root (str, default '$MXNET_HOME/embedding') – The root directory for storing embedding-related files. MXNET_HOME defaults to ‘~/.mxnet’.

  • kwargs – All other keyword arguments are passed to gluonnlp.embedding.TokenEmbedding.

Variables
  • idx_to_vec (mxnet.ndarray.NDArray) – For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.

  • unknown_token (hashable object) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.

class gluonnlp.embedding.FastText(source='wiki.simple', embedding_root='~/.mxnet/embedding', load_ngrams=False, ctx=cpu(0), **kwargs)[source]

The fastText word embedding.

FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. (Source from https://fasttext.cc/)

References:

Enriching Word Vectors with Subword Information. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. https://arxiv.org/abs/1607.04606

Bag of Tricks for Efficient Text Classification. Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. https://arxiv.org/abs/1607.01759

FastText.zip: Compressing text classification models. Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Herve Jegou, and Tomas Mikolov. https://arxiv.org/abs/1612.03651

For ‘wiki.multi’ embedding: Word Translation Without Parallel Data Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. https://arxiv.org/abs/1710.04087

Website: https://fasttext.cc/

To get the updated URLs to the externally hosted pre-trained token embedding files, visit https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md

License for pre-trained embedding: https://creativecommons.org/licenses/by-sa/3.0/

Available sources

>>> import gluonnlp as nlp
>>> nlp.embedding.list_sources('FastText')
['crawl-300d-2M', 'crawl-300d-2M-subword', 'wiki.aa', 'wiki.ab', 'wiki.ace', 'wiki.ady', 'wiki.af', 'wiki.ak', 'wiki.als', 'wiki.am', 'wiki.ang', 'wiki.an', 'wiki.arc', 'wiki.ar', 'wiki.arz', 'wiki.as', 'wiki.ast', 'wiki.av', 'wiki.ay', 'wiki.azb', 'wiki.az', 'wiki.ba', 'wiki.bar', 'wiki.bat_smg', 'wiki.bcl', 'wiki.be', 'wiki.bg', 'wiki.bh', 'wiki.bi', 'wiki.bjn', 'wiki.bm', 'wiki.bn', 'wiki.bo', 'wiki.bpy', 'wiki.br', 'wiki.bs', 'wiki.bug', 'wiki.bxr', 'wiki.ca', 'wiki.cbk_zam', 'wiki.cdo', 'wiki.ceb', 'wiki.ce', 'wiki.ch', 'wiki.cho', 'wiki.chr', 'wiki.chy', 'wiki.ckb', 'wiki.co', 'wiki.crh', 'wiki.cr', 'wiki.csb', 'wiki.cs', 'wiki.cu', 'wiki.cv', 'wiki.cy', 'wiki.da', 'wiki.de', 'wiki.diq', 'wiki.dsb', 'wiki.dv', 'wiki.dz', 'wiki.ee', 'wiki.el', 'wiki.eml', 'wiki.en', 'wiki.eo', 'wiki.es', 'wiki.et', 'wiki.eu', 'wiki.ext', 'wiki.fa', 'wiki.ff', 'wiki.fi', 'wiki.fiu_vro', 'wiki.fj', 'wiki.fo', 'wiki.fr', 'wiki.frp', 'wiki.frr', 'wiki.fur', 'wiki.fy', 'wiki.gag', 'wiki.gan', 'wiki.ga', 'wiki.gd', 'wiki.glk', 'wiki.gl', 'wiki.gn', 'wiki.gom', 'wiki.got', 'wiki.gu', 'wiki.gv', 'wiki.hak', 'wiki.ha', 'wiki.haw', 'wiki.he', 'wiki.hif', 'wiki.hi', 'wiki.ho', 'wiki.hr', 'wiki.hsb', 'wiki.ht', 'wiki.hu', 'wiki.hy', 'wiki.hz', 'wiki.ia', 'wiki.id', 'wiki.ie', 'wiki.ig', 'wiki.ii', 'wiki.ik', 'wiki.ilo', 'wiki.io', 'wiki.is', 'wiki.it', 'wiki.iu', 'wiki.jam', 'wiki.ja', 'wiki.jbo', 'wiki.jv', 'wiki.kaa', 'wiki.kab', 'wiki.ka', 'wiki.kbd', 'wiki.kg', 'wiki.ki', 'wiki.kj', 'wiki.kk', 'wiki.kl', 'wiki.km', 'wiki.kn', 'wiki.koi', 'wiki.ko', 'wiki.krc', 'wiki.kr', 'wiki.ksh', 'wiki.ks', 'wiki.ku', 'wiki.kv', 'wiki.kw', 'wiki.ky', 'wiki.lad', 'wiki.la', 'wiki.lbe', 'wiki.lb', 'wiki.lez', 'wiki.lg', 'wiki.lij', 'wiki.li', 'wiki.lmo', 'wiki.ln', 'wiki.lo', 'wiki.lrc', 'wiki.ltg', 'wiki.lt', 'wiki.lv', 'wiki.mai', 'wiki.map_bms', 'wiki.mdf', 'wiki.mg', 'wiki.mh', 'wiki.mhr', 'wiki.min', 'wiki.mi', 'wiki.mk', 'wiki.ml', 'wiki.mn', 'wiki.mo', 'wiki.mrj', 'wiki.mr', 'wiki.ms', 
'wiki.mt', 'wiki.multi.ar', 'wiki.multi.bg', 'wiki.multi.ca', 'wiki.multi.cs', 'wiki.multi.da', 'wiki.multi.de', 'wiki.multi.el', 'wiki.multi.en', 'wiki.multi.es', 'wiki.multi.et', 'wiki.multi.fi', 'wiki.multi.fr', 'wiki.multi.he', 'wiki.multi.hr', 'wiki.multi.hu', 'wiki.multi.id', 'wiki.multi.it', 'wiki.multi.mk', 'wiki.multi.nl', 'wiki.multi.no', 'wiki.multi.pl', 'wiki.multi.pt', 'wiki.multi.ro', 'wiki.multi.ru', 'wiki.multi.sk', 'wiki.multi.sl', 'wiki.multi.sv', 'wiki.multi.tr', 'wiki.multi.uk', 'wiki.multi.vi', 'wiki.mus', 'wiki.mwl', 'wiki.my', 'wiki.myv', 'wiki.mzn', 'wiki.nah', 'wiki.na', 'wiki.nap', 'wiki.nds_nl', 'wiki.nds', 'wiki.ne', 'wiki.new', 'wiki-news-300d-1M', 'wiki-news-300d-1M-subword', 'wiki.ng', 'wiki.nl', 'wiki.nn', 'wiki.no', 'wiki.nov', 'wiki.vec', 'wiki.nrm', 'wiki.nso', 'wiki.nv', 'wiki.ny', 'wiki.oc', 'wiki.olo', 'wiki.om', 'wiki.or', 'wiki.os', 'wiki.pag', 'wiki.pam', 'wiki.pa', 'wiki.pap', 'wiki.pcd', 'wiki.pdc', 'wiki.pfl', 'wiki.pih', 'wiki.pi', 'wiki.pl', 'wiki.pms', 'wiki.pnb', 'wiki.pnt', 'wiki.ps', 'wiki.pt', 'wiki.qu', 'wiki.rm', 'wiki.rmy', 'wiki.rn', 'wiki.roa_rup', 'wiki.roa_tara', 'wiki.ro', 'wiki.rue', 'wiki.ru', 'wiki.rw', 'wiki.sah', 'wiki.sa', 'wiki.scn', 'wiki.sc', 'wiki.sco', 'wiki.sd', 'wiki.se', 'wiki.sg', 'wiki.sh', 'wiki.simple', 'wiki.si', 'wiki.sk', 'wiki.sl', 'wiki.sm', 'wiki.sn', 'wiki.so', 'wiki.sq', 'wiki.srn', 'wiki.sr', 'wiki.ss', 'wiki.st', 'wiki.stq', 'wiki.su', 'wiki.sv', 'wiki.sw', 'wiki.szl', 'wiki.ta', 'wiki.tcy', 'wiki.te', 'wiki.tet', 'wiki.tg', 'wiki.th', 'wiki.ti', 'wiki.tk', 'wiki.tl', 'wiki.tn', 'wiki.to', 'wiki.tpi', 'wiki.tr', 'wiki.ts', 'wiki.tt', 'wiki.tum', 'wiki.tw', 'wiki.ty', 'wiki.tyv', 'wiki.udm', 'wiki.ug', 'wiki.uk', 'wiki.ur', 'wiki.uz', 'wiki.ve', 'wiki.vep', 'wiki.vi', 'wiki.vls', 'wiki.vo', 'wiki.wa', 'wiki.war', 'wiki.wo', 'wiki.wuu', 'wiki.xal', 'wiki.xh', 'wiki.xmf', 'wiki.yi', 'wiki.yo', 'wiki.za', 'wiki.zea', 'wiki.zh_classical', 'wiki.zh_min_nan', 'wiki.zh', 'wiki.zh_yue', 
'wiki.zu', 'cc.af.300', 'cc.als.300', 'cc.am.300', 'cc.an.300', 'cc.ar.300', 'cc.arz.300', 'cc.as.300', 'cc.ast.300', 'cc.az.300', 'cc.azb.300', 'cc.ba.300', 'cc.bar.300', 'cc.bcl.300', 'cc.be.300', 'cc.bg.300', 'cc.bh.300', 'cc.bn.300', 'cc.bo.300', 'cc.bpy.300', 'cc.br.300', 'cc.bs.300', 'cc.ca.300', 'cc.ce.300', 'cc.ceb.300', 'cc.ckb.300', 'cc.co.300', 'cc.cs.300', 'cc.cv.300', 'cc.cy.300', 'cc.da.300', 'cc.de.300', 'cc.diq.300', 'cc.dv.300', 'cc.el.300', 'cc.eml.300', 'cc.en.300', 'cc.eo.300', 'cc.es.300', 'cc.et.300', 'cc.eu.300', 'cc.fa.300', 'cc.fi.300', 'cc.fr.300', 'cc.frr.300', 'cc.fy.300', 'cc.ga.300', 'cc.gd.300', 'cc.gl.300', 'cc.gom.300', 'cc.gu.300', 'cc.gv.300', 'cc.he.300', 'cc.hi.300', 'cc.hif.300', 'cc.hr.300', 'cc.hsb.300', 'cc.ht.300', 'cc.hu.300', 'cc.hy.300', 'cc.ia.300', 'cc.id.300', 'cc.ilo.300', 'cc.io.300', 'cc.is.300', 'cc.it.300', 'cc.ja.300', 'cc.jv.300', 'cc.ka.300', 'cc.kk.300', 'cc.km.300', 'cc.kn.300', 'cc.ko.300', 'cc.ku.300', 'cc.ky.300', 'cc.la.300', 'cc.lb.300', 'cc.li.300', 'cc.lmo.300', 'cc.lt.300', 'cc.lv.300', 'cc.mai.300', 'cc.mg.300', 'cc.mhr.300', 'cc.min.300', 'cc.mk.300', 'cc.ml.300', 'cc.mn.300', 'cc.mr.300', 'cc.mrj.300', 'cc.ms.300', 'cc.mt.300', 'cc.mwl.300', 'cc.my.300', 'cc.myv.300', 'cc.mzn.300', 'cc.nah.300', 'cc.nap.300', 'cc.nds.300', 'cc.ne.300', 'cc.new.300', 'cc.nl.300', 'cc.nn.300', 'cc.no.300', 'cc.nso.300', 'cc.oc.300', 'cc.or.300', 'cc.os.300', 'cc.pa.300', 'cc.pam.300', 'cc.pfl.300', 'cc.pl.300', 'cc.pms.300', 'cc.pnb.300', 'cc.ps.300', 'cc.pt.300', 'cc.qu.300', 'cc.rm.300', 'cc.ro.300', 'cc.ru.300', 'cc.sa.300', 'cc.sah.300', 'cc.sc.300', 'cc.scn.300', 'cc.sco.300', 'cc.sd.300', 'cc.sh.300', 'cc.si.300', 'cc.sk.300', 'cc.sl.300', 'cc.so.300', 'cc.sq.300', 'cc.sr.300', 'cc.su.300', 'cc.sv.300', 'cc.sw.300', 'cc.ta.300', 'cc.te.300', 'cc.tg.300', 'cc.th.300', 'cc.tk.300', 'cc.tl.300', 'cc.tr.300', 'cc.tt.300', 'cc.ug.300', 'cc.uk.300', 'cc.ur.300', 'cc.uz.300', 'cc.vec.300', 'cc.vi.300', 'cc.vls.300', 
'cc.vo.300', 'cc.wa.300', 'cc.war.300', 'cc.xmf.300', 'cc.yi.300', 'cc.yo.300', 'cc.zea.300', 'cc.zh.300']
Parameters
  • source (str, default 'wiki.simple') – The name of the pre-trained token embedding file.

  • embedding_root (str, default '$MXNET_HOME/embedding') – The root directory for storing embedding-related files. MXNET_HOME defaults to ‘~/.mxnet’.

  • load_ngrams (bool, default False) – Load vectors for ngrams so that computing vectors for OOV words is possible. This is disabled by default as it requires downloading an additional 2GB file containing the ngram vectors. Note that facebookresearch did not publish ngram vectors for all their models. If load_ngrams is True but no ngram vectors are available for the chosen source, a RuntimeError is thrown. The ngram vectors are passed to the resulting TokenEmbedding as unknown_lookup.

  • ctx (mx.Context, default mxnet.cpu()) – The context on which to load the FasttextEmbeddingModel holding the ngram vectors. This parameter is ignored if load_ngrams is False.

  • kwargs – All other keyword arguments are passed to gluonnlp.embedding.TokenEmbedding.

Variables
  • idx_to_vec (mxnet.ndarray.NDArray) – For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.

  • unknown_token (hashable object) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.

class gluonnlp.embedding.Word2Vec(source='GoogleNews-vectors-negative300', embedding_root='~/.mxnet/embedding', encoding='utf8', **kwargs)[source]

The Word2Vec word embedding.

Word2Vec is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed with continuous bag-of-words or skip-gram architecture for computing vector representations of words.

References:

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.

Website: https://code.google.com/archive/p/word2vec/

License for pre-trained embedding: Unspecified

Available sources

>>> import gluonnlp as nlp
>>> nlp.embedding.list_sources('Word2Vec')
['GoogleNews-vectors-negative300', 'freebase-vectors-skipgram1000-en', 'freebase-vectors-skipgram1000']
Parameters
  • source (str, default 'GoogleNews-vectors-negative300') – The name of the pre-trained token embedding file. Alternatively, the path to a binary pre-trained file outside the source list can be passed, provided it ends with the .bin file extension.

  • embedding_root (str, default '$MXNET_HOME/embedding') – The root directory for storing embedding-related files. MXNET_HOME defaults to ‘~/.mxnet’.

  • kwargs – All other keyword arguments are passed to gluonnlp.embedding.TokenEmbedding.

Variables
  • idx_to_vec (mxnet.ndarray.NDArray) – For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.

  • unknown_token (hashable object) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.

classmethod from_w2v_binary(pretrained_file_path, encoding='utf8')[source]

Load embedding vectors from a binary pre-trained token embedding file.

Parameters
  • pretrained_file_path (str) – The path to a binary pre-trained token embedding file ending with the .bin file extension.

  • encoding (str) – The encoding type of the file.
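For illustration, here is a simplified reader and writer for the common word2vec binary layout: an ASCII header 'vocab_size dim', then each token followed by a space and dim little-endian float32 values. This is a sketch of the format, not the gluonnlp loader, and some writers additionally emit a trailing newline after each vector:

```python
# Hypothetical sketch of the word2vec .bin layout.
import io
import struct

def write_w2v_binary(vocab_vecs, dim):
    buf = io.BytesIO()
    buf.write(f'{len(vocab_vecs)} {dim}\n'.encode('utf8'))
    for token, vec in vocab_vecs:
        buf.write(token.encode('utf8') + b' ')
        buf.write(struct.pack(f'<{dim}f', *vec))  # dim float32 values
    return buf.getvalue()

def read_w2v_binary(data, encoding='utf8'):
    buf = io.BytesIO(data)
    vocab_size, dim = (int(x) for x in buf.readline().decode(encoding).split())
    result = {}
    for _ in range(vocab_size):
        token = bytearray()
        while (ch := buf.read(1)) != b' ':  # token bytes end at the space
            token.extend(ch)
        result[token.decode(encoding)] = struct.unpack(f'<{dim}f', buf.read(4 * dim))
    return result

data = write_w2v_binary([('hello', [0.5, 1.0]), ('world', [2.0, 3.0])], dim=2)
vectors = read_w2v_binary(data)
print(sorted(vectors))  # ['hello', 'world']
```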

Models for intrinsic and extrinsic word embedding evaluation

gluonnlp.embedding.evaluation.register(class_)[source]

Registers a new word embedding evaluation function.

Once registered, we can create an instance with create().

Examples

>>> @gluonnlp.embedding.evaluation.register
... class MySimilarityFunction(gluonnlp.embedding.evaluation.WordEmbeddingSimilarityFunction):
...     def __init__(self, eps=1e-10):
...         pass
>>> similarity_function = gluonnlp.embedding.evaluation.create('similarity',
...                                                            'MySimilarityFunction')
>>> print(type(similarity_function))
<class 'gluonnlp.embedding.evaluation.MySimilarityFunction'>
>>> @gluonnlp.embedding.evaluation.register
... class MyAnalogyFunction(gluonnlp.embedding.evaluation.WordEmbeddingAnalogyFunction):
...     def __init__(self, k=1, eps=1E-10):
...         pass
>>> analogy_function = gluonnlp.embedding.evaluation.create('analogy', 'MyAnalogyFunction')
>>> print(type(analogy_function))
<class 'gluonnlp.embedding.evaluation.MyAnalogyFunction'>
gluonnlp.embedding.evaluation.create(kind, name, **kwargs)[source]

Creates an instance of a registered word embedding evaluation function.

Parameters
  • kind (['similarity', 'analogy']) – The kind of evaluation function to create.

  • name (str) – The evaluation function name (case-insensitive).

Returns

An instance of the specified word embedding evaluation function.

gluonnlp.embedding.evaluation.list_evaluation_functions(kind=None)[source]

Get valid word embedding evaluation function names.

Parameters

kind (['similarity', 'analogy', None]) – Return only valid names for similarity, analogy or both kinds of functions.

Returns

A list of all the valid evaluation function names for the specified kind. If kind is set to None, returns a dict mapping each valid kind to its respective list of names. The valid names can be plugged into gluonnlp.embedding.evaluation.create(kind, name).

Return type

dict or list

class gluonnlp.embedding.evaluation.WordEmbeddingSimilarityFunction(prefix=None, params=None)[source]

Base class for word embedding similarity functions.

class gluonnlp.embedding.evaluation.WordEmbeddingAnalogyFunction(prefix=None, params=None)[source]

Base class for word embedding analogy functions.

Parameters
  • idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.

  • k (int, default 1) – Number of analogies to predict per input triple.

  • eps (float, optional, default=1e-10) – A small constant for numerical stability.

class gluonnlp.embedding.evaluation.CosineSimilarity(eps=1e-10, **kwargs)[source]

Computes the cosine similarity.

Parameters

eps (float, optional, default=1e-10) – A small constant for numerical stability.

hybrid_forward(F, x, y)[source]

Compute the cosine similarity between two batches of vectors.

The cosine similarity is the dot product between the L2 normalized vectors.

Parameters
  • x (Symbol or NDArray) –

  • y (Symbol or NDArray) –

Returns

similarity – The cosine similarity between the vectors in x and y.

Return type

Symbol or NDArray
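The computation can be illustrated with numpy; this is a sketch of the math, not the mxnet implementation:

```python
# Cosine similarity between corresponding rows of two batches: the dot
# product of L2-normalized vectors, with eps for numerical stability.
import numpy as np

def cosine_similarity(x, y, eps=1e-10):
    x_norm = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    y_norm = y / (np.linalg.norm(y, axis=1, keepdims=True) + eps)
    return np.sum(x_norm * y_norm, axis=1)

x = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([[1.0, 0.0], [-1.0, -1.0]])
print(cosine_similarity(x, y))  # approx [1.0, -1.0]
```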

class gluonnlp.embedding.evaluation.ThreeCosMul(idx_to_vec, k=1, eps=1e-10, exclude_question_words=True, **kwargs)[source]

The 3CosMul analogy function.

The 3CosMul analogy function is defined as

\[\arg\max_{b^* \in V}\frac{\cos(b^*, b)\cos(b^*, a^*)}{\cos(b^*, a) + \epsilon}\]

See the following paper for more details:

  • Levy, O., & Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In R. Morante, & W. Yih, Proceedings of the Eighteenth Conference on Computational Natural Language Learning, CoNLL 2014, Baltimore, Maryland, USA, June 26-27, 2014 (pp. 171–180). : ACL.

Parameters
  • idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.

  • k (int, default 1) – Number of analogies to predict per input triple.

  • exclude_question_words (bool, default True) – Exclude the 3 question words from being a valid answer.

  • eps (float, optional, default=1e-10) – A small constant for numerical stability.

hybrid_forward(F, words1, words2, words3, weight)[source]

Compute ThreeCosMul for given question words.

Parameters
  • words1 (Symbol or NDArray) – Question words at first position. Shape (batch_size, )

  • words2 (Symbol or NDArray) – Question words at second position. Shape (batch_size, )

  • words3 (Symbol or NDArray) – Question words at third position. Shape (batch_size, )

Returns

Predicted answer words. Shape (batch_size, k).

Return type

Symbol or NDArray

class gluonnlp.embedding.evaluation.ThreeCosAdd(idx_to_vec, normalize=True, k=1, eps=1e-10, exclude_question_words=True, **kwargs)[source]

The 3CosAdd analogy function.

The 3CosAdd analogy function is defined as

\[\arg\max_{b^* \in V}\left[\cos(b^*, b - a + a^*)\right]\]

See the following paper for more details:

  • Levy, O., & Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In R. Morante, & W. Yih, Proceedings of the Eighteenth Conference on Computational Natural Language Learning, CoNLL 2014, Baltimore, Maryland, USA, June 26-27, 2014 (pp. 171–180). : ACL.

Parameters
  • idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.

  • normalize (bool, default True) – Normalize all word embeddings before computing the analogy.

  • k (int, default 1) – Number of analogies to predict per input triple.

  • exclude_question_words (bool, default True) – Exclude the 3 question words from being a valid answer.

  • eps (float, optional, default=1e-10) – A small constant for numerical stability.

hybrid_forward(F, words1, words2, words3, weight)[source]

Compute ThreeCosAdd for given question words.

Parameters
  • words1 (Symbol or NDArray) – Question words at first position. Shape (batch_size, )

  • words2 (Symbol or NDArray) – Question words at second position. Shape (batch_size, )

  • words3 (Symbol or NDArray) – Question words at third position. Shape (batch_size, )

Returns

Predicted answer words. Shape (batch_size, k).

Return type

Symbol or NDArray
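Both analogy functions can be illustrated with a small numpy sketch. These are hypothetical simplified versions (single question, top-1 answer, cosines shifted to [0, 1] for 3CosMul as in Levy & Goldberg), not the mxnet implementations. The question reads words1 : words2 :: words3 : ?

```python
import numpy as np

def _normalize(mat, eps=1e-10):
    return mat / (np.linalg.norm(mat, axis=1, keepdims=True) + eps)

def three_cos_add(idx_to_vec, i1, i2, i3, exclude_question_words=True):
    emb = _normalize(np.asarray(idx_to_vec, dtype=float))
    # cos(b*, b - a + a*) up to a constant factor, since emb rows are unit.
    scores = emb @ (emb[i2] - emb[i1] + emb[i3])
    if exclude_question_words:
        scores[[i1, i2, i3]] = -np.inf
    return int(np.argmax(scores))

def three_cos_mul(idx_to_vec, i1, i2, i3, eps=1e-10, exclude_question_words=True):
    emb = _normalize(np.asarray(idx_to_vec, dtype=float))
    cos = (1 + emb @ emb[[i1, i2, i3]].T) / 2  # cosines shifted to [0, 1]
    # cos(b*, b) * cos(b*, a*) / (cos(b*, a) + eps)
    scores = cos[:, 2] * cos[:, 1] / (cos[:, 0] + eps)
    if exclude_question_words:
        scores[[i1, i2, i3]] = -np.inf
    return int(np.argmax(scores))

# Toy embedding: 0 man, 1 woman, 2 king, 3 queen, 4 distractor.
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.1, 1.4], [-1.0, -0.5]]
print(three_cos_add(vecs, 0, 1, 2))  # 3 (man : woman :: king : queen)
print(three_cos_mul(vecs, 0, 1, 2))  # 3
```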

class gluonnlp.embedding.evaluation.WordEmbeddingSimilarity(idx_to_vec, similarity_function='CosineSimilarity', eps=1e-10, **kwargs)[source]

Word embeddings similarity task evaluator.

Parameters
  • idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.

  • similarity_function (str, default 'CosineSimilarity') – Name of a registered WordEmbeddingSimilarityFunction.

  • eps (float, optional, default=1e-10) – A small constant for numerical stability.

hybrid_forward(F, words1, words2, weight)[source]

Predict the similarity of words1 and words2.

Parameters
  • words1 (Symbol or NDArray) – The indices of the words that we wish to compare to the words in words2.

  • words2 (Symbol or NDArray) – The indices of the words that we wish to compare to the words in words1.

Returns

similarity – The similarity computed by WordEmbeddingSimilarity.similarity_function.

Return type

Symbol or NDArray

class gluonnlp.embedding.evaluation.WordEmbeddingAnalogy(idx_to_vec, analogy_function='ThreeCosMul', k=1, exclude_question_words=True, **kwargs)[source]

Word embeddings analogy task evaluator.

Parameters
  • idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.

  • analogy_function (str, default 'ThreeCosMul') – Name of a registered WordEmbeddingAnalogyFunction.

  • k (int, default 1) – Number of analogies to predict per input triple.

  • exclude_question_words (bool, default True) – Exclude the 3 question words from being a valid answer.

hybrid_forward(F, words1, words2, words3)[source]

Compute analogies for given question words.

Parameters
  • words1 (Symbol or NDArray) – Word indices of first question words. Shape (batch_size, ).

  • words2 (Symbol or NDArray) – Word indices of second question words. Shape (batch_size, ).

  • words3 (Symbol or NDArray) – Word indices of third question words. Shape (batch_size, ).

Returns

predicted_indices – Indices of predicted analogies of shape (batch_size, k)

Return type

Symbol or NDArray