gluonnlp.vocab¶

This page describes the gluonnlp.Vocab class for text data numericalization and the subword functionality provided in gluonnlp.vocab.

Vocabulary¶

The vocabulary builds indices for text tokens and can be attached with token embeddings. The input counter, whose keys are the candidate tokens to index, may be obtained via gluonnlp.data.count_tokens().

Vocab – Indexing and embedding attachment for text tokens.

Subword functionality¶

When using a vocabulary of fixed size, out-of-vocabulary words may be encountered. However, words are composed of characters, allowing intelligent fallbacks for out-of-vocabulary words based on subword units such as the characters or n-grams in a word. gluonnlp.vocab.SubwordFunction provides an API to map words to their subword units. gluonnlp.model.train contains models that make use of subword information to train word embeddings.

SubwordFunction – A SubwordFunction maps words to lists of subword indices.
ByteSubwords – Map words to a list of bytes.
NGramHashes – Map words to a list of hashes in a restricted domain.
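The character n-gram fallback described above can be sketched in plain Python. char_ngrams below is a hypothetical helper, not part of gluonnlp; like fastText (and hence NGramHashes), it wraps each word in '<' and '>' boundary markers before extracting n-grams:

```python
def char_ngrams(word, ns=(3, 4, 5, 6)):
    """Extract character n-grams from a word, using '<' and '>'
    as word-boundary markers in the fastText style."""
    w = "<" + word + ">"
    return [w[i:i + n] for n in ns for i in range(len(w) - n + 1)]

print(char_ngrams("hi", ns=(3,)))  # ['<hi', 'hi>']
```

Even if "hi" itself is out of vocabulary, its n-grams may overlap with those of in-vocabulary words, which is what makes a subword-based fallback possible.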

API Reference¶

NLP toolkit.

class gluonnlp.Vocab(counter=None, max_size=None, min_freq=1, unknown_token='<unk>', padding_token='<pad>', bos_token='<bos>', eos_token='<eos>', reserved_tokens=None)[source]

Indexing and embedding attachment for text tokens.

Parameters:
counter (Counter or None, default None) – Counts text token frequencies in the text data. Its keys will be indexed according to frequency thresholds such as max_size and min_freq. Keys of counter, unknown_token, and values of reserved_tokens must be of the same hashable type. Examples: str, int, and tuple.
max_size (None or int, default None) – The maximum possible number of the most frequent tokens in the keys of counter that can be indexed. Note that this argument does not count any token from reserved_tokens. Suppose there are different keys of counter whose frequencies are the same; if indexing all of them would exceed this argument's value, such keys will be indexed one by one according to their __cmp__() order until the frequency threshold is met. If this argument is None or larger than its largest possible value as restricted by counter and reserved_tokens, it has no effect.
min_freq (int, default 1) – The minimum frequency required for a token in the keys of counter to be indexed.
unknown_token (hashable object or None, default '<unk>') – The representation for any unknown token. In other words, any unknown token will be indexed with the same representation. If None, looking up an unknown token will result in KeyError.
padding_token (hashable object or None, default '<pad>') – The representation for the special padding token.
bos_token (hashable object or None, default '<bos>') – The representation for the special beginning-of-sequence token.
eos_token (hashable object or None, default '<eos>') – The representation for the special end-of-sequence token.
reserved_tokens (list of hashable objects or None, default None) – A list of reserved tokens (excluding unknown_token) that will always be indexed, such as special symbols representing padding, beginning of sentence, and end of sentence. It cannot contain unknown_token or duplicate reserved tokens.
Attributes:
embedding (instance of gluonnlp.embedding.TokenEmbedding) – The embedding of the indexed tokens.
idx_to_token (list of strs) – A list of indexed tokens where the list indices and the token indices are aligned.
reserved_tokens (list of strs or None) – A list of reserved tokens that will always be indexed.
token_to_idx (dict mapping str to int) – A dict mapping each token to its index integer.
unknown_token (hashable object or None) – The representation for any unknown token. In other words, any unknown token will be indexed with the same representation.
padding_token (hashable object or None) – The representation for the padding token.
bos_token (hashable object or None) – The representation for the beginning-of-sequence token.
eos_token (hashable object or None) – The representation for the end-of-sequence token.
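The interaction of max_size and min_freq can be sketched in plain Python. select_tokens is a hypothetical helper that mirrors only the frequency-based candidate selection; the real Vocab additionally prepends unknown_token and the other special tokens before these candidates:

```python
from collections import Counter

def select_tokens(counter, max_size=None, min_freq=1):
    # Keep tokens meeting min_freq, most frequent first; frequency ties
    # are broken deterministically by comparing the tokens themselves.
    eligible = [t for t in counter if counter[t] >= min_freq]
    eligible.sort(key=lambda t: (-counter[t], t))
    return eligible if max_size is None else eligible[:max_size]

counter = Counter("hello world hello nice world hi world".split())
print(select_tokens(counter, max_size=2))  # ['world', 'hello']
print(select_tokens(counter, min_freq=2)) # ['world', 'hello']
```

With the four default special tokens occupying indices 0-3, 'world' and 'hello' would then receive indices 4 and 5, matching the doctest below.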

Examples

>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = gluonnlp.data.count_tokens(text_data)
>>> my_vocab = gluonnlp.Vocab(counter)
>>> fasttext = gluonnlp.embedding.create('fasttext', source='wiki.simple.vec')
>>> my_vocab.set_embedding(fasttext)
>>> my_vocab.embedding[['hello', 'world']]
[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
...
-7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
[  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
...
-3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

>>> my_vocab[['hello', 'world']]
[5, 4]

>>> input_dim, output_dim = my_vocab.embedding.idx_to_vec.shape
>>> layer = gluon.nn.Embedding(input_dim, output_dim)
>>> layer.initialize()
>>> layer.weight.set_data(my_vocab.embedding.idx_to_vec)
>>> layer(nd.array([5, 4]))
[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
...
-7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
[  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
...
-3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

>>> glove = gluonnlp.embedding.create('glove', source='glove.6B.50d.txt')
>>> my_vocab.set_embedding(glove)
>>> my_vocab.embedding[['hello', 'world']]
[[  -0.38497001  0.80092001
...
0.048833    0.67203999]
[  -0.41486001  0.71847999
...
-0.37639001 -0.67541999]]
<NDArray 2x50 @cpu(0)>

__call__(tokens)[source]

Looks up indices of text tokens according to the vocabulary.

Parameters: tokens (str or list of strs) – A source token or tokens to be converted.
Returns: A token index or a list of token indices according to the vocabulary.
Return type: int or list of ints
static from_json(json_str)[source]

Deserialize Vocab object from json string.

Parameters: json_str (str) – Serialized json string of a Vocab object.
Return type: Vocab
set_embedding(*embeddings)[source]

Attaches one or more embeddings to the indexed text tokens.

Parameters: embeddings (None or tuple of gluonnlp.embedding.TokenEmbedding instances) – The embedding(s) to be attached to the indexed tokens. If multiple embeddings are provided, their embedding vectors will be concatenated for the same token.
to_indices(tokens)[source]

Looks up indices of text tokens according to the vocabulary.

Parameters: tokens (str or list of strs) – A source token or tokens to be converted.
Returns: A token index or a list of token indices according to the vocabulary.
Return type: int or list of ints
to_json()[source]

Serialize Vocab object to json string.

This method does not serialize the underlying embedding.

to_tokens(indices)[source]

Converts token indices to tokens according to the vocabulary.

Parameters: indices (int or list of ints) – A source token index or token indices to be converted.
Returns: A token or a list of tokens according to the vocabulary.
Return type: str or list of strs
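The relationship between idx_to_token and token_to_idx that these lookups rely on can be sketched with a minimal pure-Python analogue. The helpers below are illustrative, not gluonnlp's implementation; the indices assume the example vocabulary built from counter above:

```python
# idx_to_token is position-aligned; token_to_idx is its inverse mapping.
idx_to_token = ['<unk>', '<pad>', '<bos>', '<eos>', 'world', 'hello']
token_to_idx = {tok: idx for idx, tok in enumerate(idx_to_token)}

def to_indices(tokens):
    # Unknown tokens fall back to the index of '<unk>'.
    return [token_to_idx.get(t, token_to_idx['<unk>']) for t in tokens]

def to_tokens(indices):
    return [idx_to_token[i] for i in indices]

print(to_indices(['hello', 'world', 'xyz']))  # [5, 4, 0]
print(to_tokens([5, 4]))                      # ['hello', 'world']
```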

Vocabulary.

class gluonnlp.vocab.SubwordFunction[source]

A SubwordFunction maps words to lists of subword indices.

This class is abstract and to be subclassed. Use gluonnlp.vocab.list_subword_functions to list all available subword functions.

A SubwordFunction object is callable and returns a list of ndarrays of subword indices for the given words in a call.

__call__(words)[source]

Return a list of ndarrays of subword indices for the given words.

__len__()[source]

Return the number of subwords modeled.

indices_to_subwords(indices)[source]

Return a list of subwords associated with subword indices.

This may raise RuntimeError if the subword function is not invertible.

Parameters: indices (iterable of int) – Subword indices to look up.
Returns: Iterable of str.
subwords_to_indices(subwords)[source]

Return a list of subword indices associated with subwords.

Parameters: subwords (iterable of str) – Subwords to replace by indices.
Returns: Iterable of int.
class gluonnlp.vocab.ByteSubwords(encoding='utf-8')[source]

Map words to a list of bytes.

Parameters: encoding (str, default 'utf-8') – Encoding to use for obtaining bytes.
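The mapping is simply the byte values of the encoded word, which can be sketched in plain Python (byte_subwords is an illustrative stand-in for calling a ByteSubwords instance on a single word):

```python
def byte_subwords(word, encoding="utf-8"):
    # The subword "indices" are just the byte values of the encoded word,
    # so the index domain is always 0..255.
    return list(word.encode(encoding))

print(byte_subwords("hi"))  # [104, 105]
print(byte_subwords("é"))   # [195, 169]
```

Multi-byte characters such as 'é' produce several byte indices, so the sequence length depends on the encoding.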
__call__(words)[source]

Return a list of ndarrays of subword indices for the given words.

__len__()[source]

Return the number of subwords modeled.

indices_to_subwords(indices)[source]

Return a list of subwords associated with subword indices.

This may raise RuntimeError if the subword function is not invertible.

Parameters: indices (iterable of int) – Subword indices to look up.
Returns: Iterable of str.
subwords_to_indices(subwords)[source]

Return a list of subword indices associated with subwords.

Parameters: subwords (iterable of str) – Subwords to replace by indices.
Returns: Iterable of int.
class gluonnlp.vocab.NGramHashes(num_subwords, ngrams=(3, 4, 5, 6), special_tokens=None)[source]

Map words to a list of hashes in a restricted domain.

The hash function is the same as in https://github.com/facebookresearch/fastText

Parameters:
num_subwords (int) – Size of the target set for the hash function.
ngrams (list of int, default [3, 4, 5, 6]) – The lengths n of the n-grams to hash.
special_tokens (set of str, default None) – Set of words for which not to look up subwords.
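fastText's subword hash is 32-bit FNV-1a over the UTF-8 bytes of each n-gram, reduced modulo the table size; the following pure-Python sketch (fasttext_hash and ngram_index are illustrative names, not gluonnlp's API) shows the idea, with bytes treated as signed to match the int8_t cast in fastText's C++ source:

```python
def fasttext_hash(s: str) -> int:
    """32-bit FNV-1a over the UTF-8 bytes of s."""
    h = 2166136261                    # FNV-1a 32-bit offset basis
    for b in s.encode("utf-8"):
        signed = b - 256 if b >= 128 else b  # reinterpret as signed char
        h ^= signed & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF      # FNV prime, mod 2**32
    return h

def ngram_index(ngram: str, num_subwords: int) -> int:
    # Hashes land in a restricted domain of size num_subwords,
    # so distinct n-grams may collide.
    return fasttext_hash(ngram) % num_subwords

print(fasttext_hash("a") == 0xe40c292c)  # True (standard FNV-1a test value)
```

Because the domain is restricted, num_subwords trades off memory against the collision rate between distinct n-grams.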
__call__(words)[source]

Return a list of ndarrays of subword indices for the given words.

__len__()[source]

Return the number of subwords modeled.

indices_to_subwords(indices)[source]

Return a list of subwords associated with subword indices.

This may raise RuntimeError if the subword function is not invertible.

Parameters: indices (iterable of int) – Subword indices to look up.
Returns: Iterable of str.
subwords_to_indices(subwords)[source]

Return a list of subword indices associated with subwords.

Parameters: subwords (iterable of str) – Subwords to replace by indices.
Returns: Iterable of int.
gluonnlp.vocab.register_subword_function(subword_cls)[source]

Registers a new subword function.

gluonnlp.vocab.create_subword_function(subword_function_name, **kwargs)[source]

Creates an instance of a subword function.

gluonnlp.vocab.list_subword_functions()[source]

Get valid subword function names.
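Together these three helpers form a simple name-based registry pattern, which can be sketched as follows (the registry internals here are illustrative, not gluonnlp's actual implementation):

```python
# Registered classes are stored by name so they can later be
# instantiated from a string, e.g. from a config file.
_registry = {}

def register_subword_function(subword_cls):
    _registry[subword_cls.__name__] = subword_cls
    return subword_cls  # usable as a class decorator

def create_subword_function(subword_function_name, **kwargs):
    return _registry[subword_function_name](**kwargs)

def list_subword_functions():
    return sorted(_registry)

@register_subword_function
class ByteSubwords:
    def __init__(self, encoding="utf-8"):
        self.encoding = encoding

print(list_subword_functions())                          # ['ByteSubwords']
print(create_subword_function("ByteSubwords").encoding)  # utf-8
```

Third-party subword functions registered this way become creatable by name without changing the toolkit's code.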