gluonnlp.vocab

This page describes the gluonnlp.Vocab class for text data numericalization and the subword functionality provided in gluonnlp.vocab.

Vocabulary

The vocabulary builds indices for text tokens, and token embeddings can be attached to it. The input counter, whose keys are the candidate tokens to be indexed, may be obtained via gluonnlp.data.count_tokens().
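
For example, a vocabulary can be built directly from such a counter. A minimal sketch (the indices shown assume the default special tokens <unk>, <pad>, <bos> and <eos> occupy indices 0-3):

>>> import gluonnlp
>>> counter = gluonnlp.data.count_tokens(['hello', 'world', 'hello'])
>>> vocab = gluonnlp.Vocab(counter)
>>> vocab['hello']
4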

Vocab

Indexing and embedding attachment for text tokens.

Subword functionality

When using a vocabulary of fixed size, out-of-vocabulary words may be encountered. However, words are composed of characters, allowing intelligent fallbacks for out-of-vocabulary words based on subword units such as the characters or ngrams in a word. gluonnlp.vocab.SubwordFunction provides an API to map words to their subword units. gluonnlp.model.train contains models that make use of subword information to compute word embeddings.
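
For instance, an ngram-hash based subword function in the style of fastText can be created and applied as follows. This is an illustrative sketch; the call returns one array of subword indices per input word:

>>> import gluonnlp
>>> sf = gluonnlp.vocab.create_subword_function('NGramHashes', num_subwords=100000)
>>> subword_indices = sf(['hello'])  # one array of ngram-hash indices for 'hello'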

SubwordFunction

A SubwordFunction maps words to lists of subword indices.

ByteSubwords

Map words to a list of bytes.

NGramHashes

Map words to a list of hashes in a restricted domain.

ELMo Character-level Vocabulary

In the original ELMo pre-trained models, the character-level vocabulary relies on UTF-8 encoding in a specific setting. We provide the following vocabulary class to stay consistent with the ELMo pre-trained models.

ELMoCharVocab

ELMo special character vocabulary

BERT Vocabulary

The vocabulary for BERT, inherited from gluonnlp.Vocab, provides some additional special tokens for ease of use.

BERTVocab

Specialization of gluonnlp.Vocab for BERT models.

API Reference

NLP toolkit.

class gluonnlp.Vocab(counter=None, max_size=None, min_freq=1, unknown_token='<unk>', padding_token='<pad>', bos_token='<bos>', eos_token='<eos>', reserved_tokens=None, token_to_idx=None, **kwargs)[source]

Indexing and embedding attachment for text tokens.

Parameters
  • counter (Optional[Counter]) – Counts text token frequencies in the text data. Its keys will be indexed according to frequency thresholds such as max_size and min_freq. Keys of counter, unknown_token, and values of reserved_tokens must be of the same hashable type. Examples: str, int, and tuple.

  • max_size (Optional[int]) – The maximum possible number of the most frequent tokens in the keys of counter that can be indexed. Note that this argument does not count any token from reserved_tokens. If several keys of counter have the same frequency and indexing all of them would exceed this value, those keys are indexed one by one according to their __cmp__() order until the limit is reached. If this argument is None or larger than its largest possible value as restricted by counter and reserved_tokens, it has no effect.

  • min_freq (int) – The minimum frequency required for a token in the keys of counter to be indexed.

  • unknown_token (Hashable) – The representation for any unknown token. If unknown_token is not None, looking up any token that is not part of the vocabulary and thus considered unknown will return the index of unknown_token. If None, looking up an unknown token will result in KeyError.

  • padding_token (Hashable) – The representation for the padding token.

  • bos_token (Hashable) – The representation for the beginning-of-sequence token.

  • eos_token (Hashable) – The representation for the end-of-sequence token.

  • reserved_tokens (Optional[List[Hashable]]) – A list specifying additional tokens to be added to the vocabulary. reserved_tokens must not contain the value of unknown_token or duplicate tokens, nor may it contain special tokens specified via keyword arguments.

  • token_to_idx (Optional[Dict[Hashable, int]]) – If not None, specifies the indices of tokens to be used by the vocabulary. Each token in token_to_idx must be part of the Vocab, and each index can only be associated with a single token. token_to_idx is not required to contain a mapping for all tokens. For example, it is valid to only set the unknown_token index to 10 (instead of the default of 0) with token_to_idx = {'<unk>': 10}, assuming that there are at least 10 tokens in the vocabulary.

  • **kwargs – Keyword arguments of the format xxx_token can be used to specify further special tokens that will be exposed as an attribute of the vocabulary and associated with an index. For example, specifying mask_token='<mask>' as an additional keyword argument when constructing a vocabulary v leads to v.mask_token exposing the value of the special token: <mask>. If the specified token is not part of the vocabulary, it will be added, just as if it had been listed in the reserved_tokens argument. The specified tokens are listed together with reserved tokens in the reserved_tokens attribute of the vocabulary object.

Variables
  • embedding (instance of gluonnlp.embedding.TokenEmbedding) – The embedding of the indexed tokens.

  • idx_to_token (list of strs) – A list of indexed tokens where the list indices and the token indices are aligned.

  • reserved_tokens (list of strs or None) – A list of reserved tokens that will always be indexed.

  • token_to_idx (dict mapping str to int) – A dict mapping each token to its index integer.

  • unknown_token (hashable object or None) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.

  • padding_token (hashable object or None) – The representation for the padding token.

  • bos_token (hashable object or None) – The representation for the beginning-of-sentence token.

  • eos_token (hashable object or None) – The representation for the end-of-sentence token.

Examples

>>> text_data = ['hello', 'world', 'hello', 'nice', 'world', 'hi', 'world']
>>> counter = gluonnlp.data.count_tokens(text_data)
>>> my_vocab = gluonnlp.Vocab(counter)
>>> fasttext = gluonnlp.embedding.create('fasttext', source='wiki.simple')
-etc-
>>> my_vocab.set_embedding(fasttext)
>>> my_vocab.embedding[['hello', 'world']][:, :5]
<BLANKLINE>
[[ 0.39567   0.21454  -0.035389 -0.24299  -0.095645]
 [ 0.10444  -0.10858   0.27212   0.13299  -0.33165 ]]
<NDArray 2x5 @cpu(0)>
>>> my_vocab[['hello', 'world']]
[5, 4]
>>> input_dim, output_dim = my_vocab.embedding.idx_to_vec.shape
>>> layer = gluon.nn.Embedding(input_dim, output_dim)
>>> layer.initialize()
>>> layer.weight.set_data(my_vocab.embedding.idx_to_vec)
>>> layer(mx.nd.array([5, 4]))[:, :5]
<BLANKLINE>
[[ 0.39567   0.21454  -0.035389 -0.24299  -0.095645]
 [ 0.10444  -0.10858   0.27212   0.13299  -0.33165 ]]
<NDArray 2x5 @cpu(0)>
>>> glove = gluonnlp.embedding.create('glove', source='glove.6B.50d')
-etc-
>>> my_vocab.set_embedding(glove)
>>> my_vocab.embedding[['hello', 'world']][:, :5]
<BLANKLINE>
[[-0.38497   0.80092   0.064106 -0.28355  -0.026759]
 [-0.41486   0.71848  -0.3045    0.87445   0.22441 ]]
<NDArray 2x5 @cpu(0)>

Extra keyword arguments of the format xxx_token are used to expose specified tokens as attributes.

>>> my_vocab2 = gluonnlp.Vocab(counter, special_token='hi')
>>> my_vocab2.special_token
'hi'

With the token_to_idx argument, the indices assigned by the Vocab can be adapted. For example, Vocab assigns the index 0 to the unknown_token by default. With the token_to_idx argument, this default can be overwritten. Here we assign index 3 to the unknown token representation <unk>.

>>> tok2idx = {'<unk>': 3}
>>> my_vocab3 = gluonnlp.Vocab(counter, token_to_idx=tok2idx)
>>> my_vocab3.unknown_token
'<unk>'
>>> my_vocab3[my_vocab3.unknown_token]
3
>>> my_vocab[my_vocab.unknown_token]
0
__call__(tokens)[source]

Looks up indices of text tokens according to the vocabulary.

Parameters

tokens (str or list of strs) – A source token or tokens to be converted.

Returns

A token index or a list of token indices according to the vocabulary.

Return type

int or list of ints
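
For example, with the vocabulary built in the examples above:

>>> my_vocab(['hello', 'world'])
[5, 4]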

classmethod from_json(json_str)[source]

Deserialize Vocab object from json string.

Parameters

json_str (str) – Serialized json string of a Vocab object.

Returns

Return type

Vocab
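
A round trip through to_json() (documented below), sketched with a fresh vocabulary that has no attached embedding:

>>> plain_vocab = gluonnlp.Vocab(counter)
>>> restored = gluonnlp.Vocab.from_json(plain_vocab.to_json())
>>> restored['hello'] == plain_vocab['hello']
True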

set_embedding(*embeddings)[source]

Attaches one or more embeddings to the indexed text tokens.

Parameters

embeddings (None or tuple of gluonnlp.embedding.TokenEmbedding instances) – The embedding(s) to be attached to the indexed tokens. If multiple embeddings are provided, their embedding vectors are concatenated for the same token.
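
For example, attaching the fastText and GloVe embeddings from the examples above concatenates their vectors. A sketch assuming 300-dimensional wiki.simple fastText vectors and 50-dimensional glove.6B.50d vectors:

>>> my_vocab.set_embedding(fasttext, glove)
>>> my_vocab.embedding.idx_to_vec.shape[1]  # 300 + 50 under the above assumption
350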

to_indices(tokens)[source]

Looks up indices of text tokens according to the vocabulary.

Parameters

tokens (str or list of strs) – A source token or tokens to be converted.

Returns

A token index or a list of token indices according to the vocabulary.

Return type

int or list of ints

to_json()[source]

Serialize Vocab object to json string.

This method does not serialize the underlying embedding.

to_tokens(indices)[source]

Converts token indices to tokens according to the vocabulary.

Parameters

indices (int or list of ints) – A source token index or token indices to be converted.

Returns

A token or a list of tokens according to the vocabulary.

Return type

str or list of strs
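
Continuing the examples above, indices map back to their tokens:

>>> my_vocab.to_tokens([5, 4])
['hello', 'world']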

Vocabulary.

class gluonnlp.vocab.SubwordFunction[source]

A SubwordFunction maps words to lists of subword indices.

This class is abstract and to be subclassed. Use gluonnlp.vocab.list_subword_functions to list all available subword functions.

A SubwordFunction object is callable and, when called, returns a list of ndarrays of subword indices for the given words.

__call__(words)[source]

Return a list of ndarrays of subword indices for the given words.

__len__()[source]

Return the number of subwords modeled.

indices_to_subwords(subwordindices)[source]

Return list of subwords associated with subword indices.

This may raise RuntimeError if the subword function is not invertible.

Parameters

subwordindices (iterable of int) – Subword indices to look up.

Returns

Return type

Iterable of str.

subwords_to_indices(subwords)[source]

Return a list of subword indices associated with the given subwords.

Parameters

subwords (iterable of str) – Subwords to replace by indices.

Returns

Return type

Iterable of int.

class gluonnlp.vocab.ByteSubwords(encoding='utf-8')[source]

Map words to a list of bytes.

Parameters

encoding (str, default 'utf-8') – Encoding to use for obtaining the bytes.

__call__(words)[source]

Return a list of ndarrays of subword indices for the given words.

__len__()[source]

Return the number of subwords modeled.

indices_to_subwords(subwordindices)[source]

Return list of subwords associated with subword indices.

Parameters

subwordindices (iterable of int) – Subword indices to look up.

Returns

Return type

Iterable of str.

subwords_to_indices(subwords)[source]

Return a list of subword indices associated with the given subwords.

Parameters

subwords (iterable of str) – Subwords to replace by indices.

Returns

Return type

Iterable of int.
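
A brief sketch of the byte mapping; 104 and 105 are the UTF-8 byte values of 'h' and 'i', though the exact container type of the result may vary:

>>> bs = gluonnlp.vocab.ByteSubwords()
>>> indices = bs(['hi'])  # e.g. [[104, 105]]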

class gluonnlp.vocab.NGramHashes(num_subwords, ngrams=(3, 4, 5, 6), special_tokens=None)[source]

Map words to a list of hashes in a restricted domain.

The hash function is the same as in https://github.com/facebookresearch/fastText

Parameters
  • num_subwords (int) – Size of target set for the hash function.

  • ngrams (list of int, default [3, 4, 5, 6]) – The ngram lengths n for which to hash ngrams.

  • special_tokens (set of str, default None) – Set of words for which not to look up subwords.

__call__(words)[source]

Return a list of ndarrays of subword indices for the given words.

__len__()[source]

Return the number of subwords modeled.

indices_to_subwords(subwordindices)[source]

This raises RuntimeError because the subword function is not invertible.

Parameters

subwordindices (iterable of int) – Subword indices to look up.

Returns

Return type

Iterable of str.

subwords_to_indices(subwords)[source]

Return a list of subword indices associated with the given subwords.

Parameters

subwords (iterable of str) – Subwords to replace by indices.

Returns

Return type

Iterable of int.
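
A usage sketch; the resulting indices are hashes of the word's ngrams restricted to [0, num_subwords):

>>> ngh = gluonnlp.vocab.NGramHashes(num_subwords=100000)
>>> indices = ngh(['hello'])  # ngram-hash indices for 'hello'
>>> len(ngh)
100000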

gluonnlp.vocab.register_subword_function(subword_cls)[source]

Registers a new subword function.

gluonnlp.vocab.create_subword_function(subword_function_name, **kwargs)[source]

Creates an instance of a subword function.

gluonnlp.vocab.list_subword_functions()[source]

Get valid subword function names.
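
For example (the registry contents shown are illustrative):

>>> names = gluonnlp.vocab.list_subword_functions()  # e.g. includes 'ByteSubwords' and 'NGramHashes'
>>> sf = gluonnlp.vocab.create_subword_function('ByteSubwords', encoding='utf-8')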

class gluonnlp.vocab.ELMoCharVocab(bos_token='<bos>', eos_token='<eos>')[source]

ELMo special character vocabulary

The vocabulary maps individual tokens to sequences of character ids, compatible with ELMo. To be consistent with previously trained models, we include it here.

Specifically, char ids 0-255 come from the UTF-8 encoding of the token, while ids 256 and above are reserved for special tokens.

Parameters
  • bos_token (hashable object or None, default '<bos>') – The representation for the beginning-of-sequence token.

  • eos_token (hashable object or None, default '<eos>') – The representation for the end-of-sequence token.

Variables
  • max_word_length (50) – The maximum number of characters a word may contain is 50 in ELMo.

  • bos_id (256) – The index of the beginning-of-sentence character is 256 in ELMo.

  • eos_id (257) – The index of the end-of-sentence character is 257 in ELMo.

  • bow_id (258) – The index of the beginning-of-word character is 258 in ELMo.

  • eow_id (259) – The index of the end-of-word character is 259 in ELMo.

  • pad_id (260) – The index of the padding character is 260 in ELMo.

__call__(tokens)[source]

Looks up indices of text tokens according to the vocabulary.

Parameters

tokens (str or list of strs) – A source token or tokens to be converted.

Returns

A list of char indices or a list of lists of char indices according to the vocabulary.

Return type

list of ints or list of lists of ints
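
A sketch of usage, assuming each token is padded to max_word_length as the attributes above suggest:

>>> elmo_vocab = gluonnlp.vocab.ELMoCharVocab()
>>> char_ids = elmo_vocab(['hello'])  # one fixed-length sequence of char ids per token
>>> len(char_ids[0])  # max_word_length, assuming padding to 50
50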

class gluonnlp.vocab.BERTVocab(counter=None, max_size=None, min_freq=1, unknown_token='[UNK]', padding_token='[PAD]', bos_token=None, eos_token=None, mask_token='[MASK]', sep_token='[SEP]', cls_token='[CLS]', reserved_tokens=None, token_to_idx=None)[source]

Specialization of gluonnlp.Vocab for BERT models.

BERTVocab changes the default token representations of unknown and other special tokens of gluonnlp.Vocab, and adds convenience parameters to specify the mask, sep and cls tokens typically used by BERT models.

Parameters
  • counter (Counter or None, default None) – Counts text token frequencies in the text data. Its keys will be indexed according to frequency thresholds such as max_size and min_freq. Keys of counter, unknown_token, and values of reserved_tokens must be of the same hashable type. Examples: str, int, and tuple.

  • max_size (None or int, default None) – The maximum possible number of the most frequent tokens in the keys of counter that can be indexed. Note that this argument does not count any token from reserved_tokens. If several keys of counter have the same frequency and indexing all of them would exceed this value, those keys are indexed one by one according to their __cmp__() order until the limit is reached. If this argument is None or larger than its largest possible value as restricted by counter and reserved_tokens, it has no effect.

  • min_freq (int, default 1) – The minimum frequency required for a token in the keys of counter to be indexed.

  • unknown_token (hashable object or None, default '[UNK]') – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation. If None, looking up an unknown token will result in KeyError.

  • padding_token (hashable object or None, default '[PAD]') – The representation for the padding token.

  • bos_token (hashable object or None, default None) – The representation for the beginning-of-sequence token.

  • eos_token (hashable object or None, default None) – The representation for the end-of-sequence token.

  • mask_token (hashable object or None, default '[MASK]') – The representation for the mask token for BERT.

  • sep_token (hashable object or None, default '[SEP]') – A token used to separate sentence pairs for BERT.

  • cls_token (hashable object or None, default '[CLS]') – Classification symbol for BERT.

  • reserved_tokens (list of hashable objects or None, default None) – A list specifying additional tokens to be added to the vocabulary. reserved_tokens cannot contain unknown_token or duplicate reserved tokens. Keys of counter, unknown_token, and values of reserved_tokens must be of the same hashable type. Examples of hashable types are str, int, and tuple.

  • token_to_idx (dict mapping tokens (hashable objects) to int or None, default None) – Optionally specifies the indices of tokens to be used by the vocabulary. Each token in token_to_idx must be part of the Vocab, and each index can only be associated with a single token. token_to_idx is not required to contain a mapping for all tokens. For example, it is valid to only set the unknown_token index to 10 (instead of the default of 0) with token_to_idx = {'<unk>': 10}.

Variables
  • embedding (instance of gluonnlp.embedding.TokenEmbedding) – The embedding of the indexed tokens.

  • idx_to_token (list of strs) – A list of indexed tokens where the list indices and the token indices are aligned.

  • reserved_tokens (list of strs or None) – A list of reserved tokens that will always be indexed.

  • token_to_idx (dict mapping str to int) – A dict mapping each token to its index integer.

  • unknown_token (hashable object or None, default '[UNK]') – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.

  • padding_token (hashable object or None, default '[PAD]') – The representation for the padding token.

  • bos_token (hashable object or None, default None) – The representation for the beginning-of-sentence token.

  • eos_token (hashable object or None, default None) – The representation for the end-of-sentence token.

  • mask_token (hashable object or None, default '[MASK]') – The representation for the mask token for BERT.

  • sep_token (hashable object or None, default '[SEP]') – A token used to separate sentence pairs for BERT.

  • cls_token (hashable object or None, default '[CLS]') – Classification symbol for BERT.
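
For example, reusing the counter from the gluonnlp.Vocab examples above:

>>> bert_vocab = gluonnlp.vocab.BERTVocab(counter)
>>> bert_vocab.mask_token, bert_vocab.sep_token, bert_vocab.cls_token
('[MASK]', '[SEP]', '[CLS]')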

classmethod from_json(json_str)[source]

Deserialize BERTVocab object from json string.

Parameters

json_str (str) – Serialized json string of a BERTVocab object.

Returns

Return type

BERTVocab

classmethod from_sentencepiece(path, mask_token='[MASK]', sep_token='[SEP]', cls_token='[CLS]', unknown_token=None, padding_token=None, bos_token=None, eos_token=None, reserved_tokens=None)[source]

BERTVocab from a pre-trained sentencepiece tokenizer.

Parameters
  • path (str) – Path to the pre-trained subword tokenization model.

  • mask_token (hashable object or None, default '[MASK]') – The representation for the mask token for BERT.

  • sep_token (hashable object or None, default '[SEP]') – A token used to separate sentence pairs for BERT.

  • cls_token (hashable object or None, default '[CLS]') – Classification symbol for BERT.

  • unknown_token (hashable object or None, default None) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation. If set to None, it is set to the token corresponding to the unk_id() in the loaded sentencepiece model.

  • padding_token (hashable object or None, default None) – The representation for the padding token.

  • bos_token (hashable object or None, default None) – The representation for the beginning-of-sentence token. If set to None, it is set to the token corresponding to the bos_id() in the loaded sentencepiece model.

  • eos_token (hashable object or None, default None) – The representation for the end-of-sentence token. If set to None, it is set to the token corresponding to the eos_id() in the loaded sentencepiece model.

  • reserved_tokens (list of strs or None, optional) – A list of reserved tokens that will always be indexed.

Returns

Return type

BERTVocab
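
A usage sketch; the model path below is hypothetical:

>>> bert_vocab = gluonnlp.vocab.BERTVocab.from_sentencepiece('my_sentencepiece.model')  # hypothetical path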