gluonnlp.data

Gluon NLP Toolkit provides tools for building efficient data pipelines for NLP tasks.

Public Datasets

Popular datasets for NLP tasks are provided in gluonnlp. By default, all built-in datasets are automatically downloaded from public repositories and stored in ~/.mxnet/datasets/.

Language modeling: WikiText

WikiText is a popular language modeling dataset from Salesforce. It is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

WikiText2 WikiText-2 word-level dataset for language modeling, from Salesforce research.
WikiText103 WikiText-103 word-level dataset for language modeling, from Salesforce research.
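
For example, a minimal sketch of loading the word-level training and validation splits (the data is downloaded automatically on first use):

>>> import gluonnlp as nlp
>>> train = nlp.data.WikiText2(segment='train')  # stored under ~/.mxnet/datasets/wikitext-2
>>> val = nlp.data.WikiText2(segment='val')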

Language modeling: Google 1 Billion Words

Google 1 Billion Words is a popular language modeling dataset. It is a collection of over 0.8 billion tokens extracted from the WMT11 website. The dataset is available under the Apache License.

GBWStream 1-Billion-Word word-level dataset for language modeling, from Google.

Sentiment Analysis: IMDB

IMDB is a popular dataset for binary sentiment classification. It provides a set of 25,000 highly polar movie reviews for training, 25,000 for testing, and additional unlabeled data.

IMDB IMDB reviews for sentiment analysis.
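
A minimal sketch of loading the training split; based on the IMDB class documented in the API reference below, each sample is expected to be a (review text, score) pair:

>>> import gluonnlp as nlp
>>> train = nlp.data.IMDB(segment='train')
>>> review, score = train[0]  # raw review text and its rating label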

Word Embedding Evaluation Datasets

There are a number of commonly used datasets for the intrinsic evaluation of word embeddings.

The similarity-based evaluation datasets include:

WordSim353 WordSim353 dataset.
MEN MEN dataset for word-similarity and relatedness.
RadinskyMTurk MTurk dataset for word-similarity and relatedness by Radinsky et al.
RareWords Rare words dataset for word-similarity and relatedness.
SimLex999 SimLex999 dataset for word-similarity.
SimVerb3500 SimVerb3500 dataset for word-similarity.
SemEval17Task2 SemEval17Task2 dataset for word-similarity.
BakerVerb143 Verb143 dataset.
YangPowersVerb130 Verb-130 dataset.
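
As a sketch, the word-similarity datasets yield (word1, word2, score) samples, following the WordSimilarityEvaluationDataset format described in the API reference below:

>>> import gluonnlp as nlp
>>> ws353 = nlp.data.WordSim353()   # segment='all' by default
>>> word1, word2, score = ws353[0]  # a word pair and its human rating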

Analogy-based evaluation datasets include:

GoogleAnalogyTestSet Google analogy test set
BiggerAnalogyTestSet Bigger analogy test set

CoNLL Datasets

The CoNLL datasets are from a series of annual competitions held at the top-tier conference of the same name. The conference is organized by SIGNLL.

These datasets include data for the shared tasks, such as part-of-speech (POS) tagging, chunking, named entity recognition (NER), semantic role labeling (SRL), etc.

We provide built-in support for CoNLL 2000–2002 and 2004, as well as the Universal Dependencies dataset, which is used in the 2017 and 2018 competitions.

CoNLL2000 CoNLL2000 Part-of-speech (POS) tagging and chunking joint task dataset.
CoNLL2001 CoNLL2001 Clause Identification dataset.
CoNLL2002 CoNLL2002 Named Entity Recognition (NER) task dataset.
CoNLL2004 CoNLL2004 Semantic Role Labeling (SRL) task dataset.
UniversalDependencies21 Universal dependencies tree banks.
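
A minimal sketch of loading the CoNLL-2000 training split; each sample is expected to hold the per-sentence field columns (words, POS tags, chunk labels) as documented in the API reference below:

>>> import gluonnlp as nlp
>>> train = nlp.data.CoNLL2000(segment='train')
>>> words, pos_tags, chunk_labels = train[0]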

Machine Translation Datasets

We provide several standard datasets for machine translation.

IWSLT2015 Preprocessed IWSLT English-Vietnamese Translation Dataset.
WMT2014 Translation Corpus of the WMT2014 Evaluation Campaign.
WMT2014BPE Preprocessed Translation Corpus of the WMT2014 Evaluation Campaign.
WMT2016 Translation Corpus of the WMT2016 Evaluation Campaign.
WMT2016BPE Preprocessed Translation Corpus of the WMT2016 Evaluation Campaign.
SQuAD Stanford Question Answering Dataset (SQuAD) - reading comprehension dataset.
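
A minimal sketch of loading the IWSLT2015 training split; each sample is expected to be a (source sentence, target sentence) pair:

>>> import gluonnlp as nlp
>>> train = nlp.data.IWSLT2015(segment='train', src_lang='en', tgt_lang='vi')
>>> src_sentence, tgt_sentence = train[0]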

Datasets

Dataset API for processing common text formats. The following classes can be used or subclassed to load custom datasets.

TextLineDataset Dataset that comprises lines in a file.
CorpusDataset Common text dataset that reads a whole corpus based on provided sample splitter and word tokenizer.
LanguageModelDataset Reads a whole corpus and produces a language modeling dataset given the provided sample splitter and word tokenizer.

DataStreams

DataStream API for streaming and processing common text formats. The following classes can be used or subclassed to stream large custom data.

DataStream Abstract Data Stream Interface.
SimpleDataStream Simple DataStream wrapper for a stream.
CorpusStream Common text data stream that streams a corpus consisting of multiple text files that match provided file_pattern.
LanguageModelStream Streams a corpus consisting of multiple text files that match provided file_pattern, and produces a language modeling stream given the provided sample splitter and word tokenizer.
PrefetchingStream Performs pre-fetch for other data iterators.

Transforms

Text data transformation functions. They can be used for processing text sequences in conjunction with the Dataset.transform method.

ClipSequence Clip the sequence to have length no more than length.
PadSequence Pad the sequence.
NLTKMosesTokenizer Apply the Moses Tokenizer implemented in NLTK.
SpacyTokenizer Apply the Spacy Tokenizer.
NLTKMosesDetokenizer Apply the Moses Detokenizer implemented in NLTK.
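
For instance, a sketch of chaining a simple whitespace tokenizer with PadSequence through Dataset.transform (outputs shown for illustration):

>>> from mxnet.gluon.data import SimpleDataset
>>> import gluonnlp as nlp
>>> data = SimpleDataset(['hello world', 'gluon nlp toolkit'])
>>> processed = data.transform(str.split).transform(nlp.data.PadSequence(4, pad_val='<pad>'))
>>> list(processed)
[['hello', 'world', '<pad>', '<pad>'], ['gluon', 'nlp', 'toolkit', '<pad>']]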

Samplers

Samplers determine how to iterate through datasets. The samplers and batch samplers below help iterate through sequence data.

SortedSampler Sort the samples based on the sort key and then sample sequentially.
FixedBucketSampler Assign each data sample to a fixed bucket based on its length.
SortedBucketSampler Batches are sampled from sorted buckets of data.

The FixedBucketSampler uses the following bucket scheme classes to generate bucket keys.

ConstWidthBucket Buckets with constant width.
LinearWidthBucket Buckets with linearly increasing width: \(w_i = \alpha * i + 1\) for all \(i \geq 1\).
ExpWidthBucket Buckets with exponentially increasing width: \(w_i = \text{bucket\_len\_step} \cdot w_{i-1}\) for all \(i \geq 2\).
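
These batch samplers plug into mxnet.gluon.data.DataLoader through its batch_sampler argument. A small sketch, where the dataset and the padding-based batchify function are toy placeholders:

>>> import numpy as np
>>> from mxnet import gluon
>>> import gluonnlp as nlp
>>> samples = [[0] * np.random.randint(5, 50) for _ in range(1000)]  # toy variable-length sequences
>>> dataset = gluon.data.SimpleDataset(samples)
>>> sampler = nlp.data.FixedBucketSampler([len(s) for s in samples], batch_size=32, num_buckets=5)
>>> pad = nlp.data.PadSequence(50)
>>> loader = gluon.data.DataLoader(dataset, batch_sampler=sampler,
...                                batchify_fn=lambda batch: [pad(s) for s in batch])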

Utilities

Miscellaneous utility classes and functions for processing text and sequence data.

Counter Counter class for keeping token frequencies.
count_tokens Counts tokens in the specified list of tokens.
concat_sequence Concatenate sequences of tokens into a single flattened list of tokens.
slice_sequence Slice a flat sequence of tokens into sequences of tokens, with each inner sequence’s length equal to the specified length, taking into account the requested sequence overlap.
train_valid_split Split the dataset into training and validation sets.
register Registers a dataset with segment-specific hyperparameters.
create Creates an instance of a registered dataset.
list_datasets Get valid datasets and registered parameters.
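
A short sketch of the typical flow from raw tokens to a Counter and then a gluonnlp.Vocab:

>>> import gluonnlp as nlp
>>> counter = nlp.data.count_tokens('hello world hello gluon'.split())
>>> vocab = nlp.Vocab(counter)           # reserved tokens such as '<unk>' are added automatically
>>> indices = vocab[['hello', 'world']]  # map tokens to integer indices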

API Reference

This module includes common utilities such as data readers and counters.

class gluonnlp.data.Counter(**kwds)[source]

Counter class for keeping token frequencies.

discard(min_freq, unknown_token)[source]

Discards tokens with frequency below min_freq and represents them as unknown_token.

Parameters:
  • min_freq (int) – Tokens whose frequency is under min_freq are counted as unknown_token in the returned Counter.
  • unknown_token (str) – The representation for any unknown token.
Returns:The Counter instance with tokens whose frequency is below min_freq counted as unknown_token.
Return type:Counter

Examples

>>> a = Counter({'a': 10, 'b': 1, 'c': 1})
>>> a.discard(3, '<unk>')
Counter({'a': 10, '<unk>': 2})
gluonnlp.data.count_tokens(tokens, to_lower=False, counter=None)[source]

Counts tokens in the specified list of tokens.

For example, before tokenization, a raw string containing two sequences of tokens with token_delim=’(td)’ and seq_delim=’(sd)’ may look like:

(td)token1(td)token2(td)token3(td)(sd)(td)token4(td)token5(td)(sd)
Parameters:
  • tokens (list of str) – A source list of tokens.
  • to_lower (bool, default False) – Whether to convert the tokens to lower case.
  • counter (Counter or None, default None) – The Counter instance to be updated with the counts of tokens. If None, return a new Counter instance counting tokens from tokens.
Returns:The counter instance updated with the counts of the provided tokens. If counter is None, a new Counter instance counting the provided tokens is returned.
Return type:Counter

Examples

>>> source_str = ' Life is great ! \\n life is good . \\n'
>>> source_str_tokens = filter(None, re.split(' |\n', source_str))
>>> count_tokens(source_str_tokens)
Counter({'!': 1, '.': 1, 'good': 1, 'great': 1, 'is': 2, 'life': 2})
gluonnlp.data.concat_sequence(sequences)[source]

Concatenate sequences of tokens into a single flattened list of tokens.

Parameters:sequences (list of list of object) – Sequences of tokens, each of which is an iterable of tokens.
Returns:Flattened list of tokens.
Return type:list
gluonnlp.data.slice_sequence(sequence, length, pad_last=False, pad_val='<pad>', overlap=0)[source]

Slice a flat sequence of tokens into sequences of tokens, with each inner sequence’s length equal to the specified length, taking into account the requested sequence overlap.

Parameters:
  • sequence (list of object) – A flat list of tokens.
  • length (int) – The length of each of the samples.
  • pad_last (bool, default False) – Whether to pad the last sequence when its length doesn’t align. If the last sequence’s length doesn’t align and pad_last is False, it will be dropped.
  • pad_val (object, default '<pad>') – The padding value to use when the padding of the last sequence is enabled. In general, the type of pad_val should be the same as the tokens.
  • overlap (int, default 0) – The extra number of items in current sample that should overlap with the next sample.
Returns:List of lists of tokens, with the length of each inner list equal to length.
Return type:list of list of object
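
Examples

A sketch of the slicing behaviour (outputs shown for illustration):

>>> from gluonnlp.data import slice_sequence
>>> slice_sequence(['a', 'b', 'c', 'd', 'e'], 2, pad_last=True, pad_val='<pad>')
[['a', 'b'], ['c', 'd'], ['e', '<pad>']]
>>> slice_sequence(['a', 'b', 'c', 'd', 'e'], 3, overlap=1)
[['a', 'b', 'c'], ['c', 'd', 'e']]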

gluonnlp.data.train_valid_split(dataset, valid_ratio=0.05)[source]

Split the dataset into training and validation sets.

Parameters:
  • dataset (list) – A list of samples to split.
  • valid_ratio (float, default 0.05) – Proportion of samples to use for the validation set, in the range [0, 1].
Returns:

  • train (SimpleDataset)
  • valid (SimpleDataset)
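
Examples

A sketch of splitting a toy list of samples (split sizes follow valid_ratio):

>>> from gluonnlp.data import train_valid_split
>>> data = list(range(100))
>>> train, valid = train_valid_split(data, valid_ratio=0.1)
>>> len(train), len(valid)
(90, 10)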

class gluonnlp.data.ClipSequence(length)[source]

Clip the sequence to have length no more than length.

Parameters:length (int) – Maximum length of the sequence

Examples

>>> from mxnet.gluon.data import SimpleDataset
>>> datasets = SimpleDataset([[1, 3, 5, 7], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8]])
>>> list(datasets.transform(ClipSequence(4)))
[[1, 3, 5, 7], [1, 2, 3], [1, 2, 3, 4]]
>>> datasets = SimpleDataset([np.array([[1, 3], [5, 7], [7, 5], [3, 1]]),
...                           np.array([[1, 2], [3, 4], [5, 6], [6, 5], [4, 3], [2, 1]]),
...                           np.array([[2, 4], [4, 2]])])
>>> list(datasets.transform(ClipSequence(3)))
[array([[1, 3],
        [5, 7],
        [7, 5]]), array([[1, 2],
        [3, 4],
        [5, 6]]), array([[2, 4],
        [4, 2]])]
class gluonnlp.data.PadSequence(length, pad_val=0, clip=True)[source]

Pad the sequence.

Pad the sequence to the given length by inserting pad_val. If clip is set, sequences longer than length will be clipped.

Parameters:
  • length (int) – The maximum length to pad/clip the sequence
  • pad_val (number, default 0) – The pad value.
  • clip (bool, default True) – Whether to clip sequences longer than length.

Examples

>>> from mxnet.gluon.data import SimpleDataset
>>> datasets = SimpleDataset([[1, 3, 5, 7], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8]])
>>> list(datasets.transform(PadSequence(6)))
[[1, 3, 5, 7, 0, 0], [1, 2, 3, 0, 0, 0], [1, 2, 3, 4, 5, 6]]
>>> list(datasets.transform(PadSequence(6, clip=False)))
[[1, 3, 5, 7, 0, 0], [1, 2, 3, 0, 0, 0], [1, 2, 3, 4, 5, 6, 7, 8]]
>>> list(datasets.transform(PadSequence(6, pad_val=-1, clip=False)))
[[1, 3, 5, 7, -1, -1], [1, 2, 3, -1, -1, -1], [1, 2, 3, 4, 5, 6, 7, 8]]
class gluonnlp.data.NLTKMosesTokenizer[source]

Apply the Moses Tokenizer implemented in NLTK.

Users of this class are required to install NLTK and download the relevant NLTK data packages, for example:

python -m nltk.downloader perluniprops nonbreaking_prefixes

Examples

>>> tokenizer = NLTKMosesTokenizer()
>>> tokenizer("Gluon NLP toolkit provides a suite of text processing tools.")
['Gluon',
 'NLP',
 'toolkit',
 'provides',
 'a',
 'suite',
 'of',
 'text',
 'processing',
 'tools',
 '.']
>>> tokenizer("Das Gluon NLP-Toolkit stellt eine Reihe von Textverarbeitungstools "
...           "zur Verfügung.")
['Das',
 'Gluon',
 'NLP-Toolkit',
 'stellt',
 'eine',
 'Reihe',
 'von',
 'Textverarbeitungstools',
 'zur',
 'Verfügung',
 '.']
class gluonnlp.data.SpacyTokenizer(lang='en')[source]

Apply the Spacy Tokenizer.

Users of this class are required to install spaCy and download corresponding NLP models, such as:

python -m spacy download en

Only spacy>=2.0.0 is supported.

Parameters:lang (str) – The language to tokenize. Default is “en”, i.e., English. You may refer to https://spacy.io/usage/models for supported languages.

Examples

>>> tokenizer = SpacyTokenizer()
>>> tokenizer(u"Gluon NLP toolkit provides a suite of text processing tools.")
['Gluon',
 'NLP',
 'toolkit',
 'provides',
 'a',
 'suite',
 'of',
 'text',
 'processing',
 'tools',
 '.']
>>> tokenizer = SpacyTokenizer('de')
>>> tokenizer(u"Das Gluon NLP-Toolkit stellt eine Reihe von Textverarbeitungstools"
...            " zur Verfügung.")
['Das',
 'Gluon',
 'NLP-Toolkit',
 'stellt',
 'eine',
 'Reihe',
 'von',
 'Textverarbeitungstools',
 'zur',
 'Verfügung',
 '.']
class gluonnlp.data.NLTKMosesDetokenizer[source]

Apply the Moses Detokenizer implemented in NLTK.

Users of this class are required to install NLTK and download the relevant NLTK data packages, for example:

python -m nltk.downloader perluniprops nonbreaking_prefixes

Examples

>>> detokenizer = NLTKMosesDetokenizer()
>>> detokenizer(['Gluon', 'NLP', 'toolkit', 'provides', 'a', 'suite', \
 'of', 'text', 'processing', 'tools', '.'], return_str=True)
"Gluon NLP toolkit provides a suite of text processing tools."
>>> detokenizer(['Das', 'Gluon','NLP-Toolkit','stellt','eine','Reihe','von', \
 'Textverarbeitungstools','zur','Verfügung','.'], return_str=True)
'Das Gluon NLP-Toolkit stellt eine Reihe von Textverarbeitungstools zur Verfügung.'
class gluonnlp.data.JiebaTokenizer[source]

Apply the jieba Tokenizer.

Users of this class are required to install jieba.

Parameters:lang (str) – The language to tokenize. Default is “zh”, i.e., Chinese.

Examples

>>> tokenizer = JiebaTokenizer()
>>> tokenizer(u"我来到北京清华大学")
['我',
 '来到',
 '北京',
 '清华大学']
>>> tokenizer(u"小明硕士毕业于中国科学院计算所,后在日本京都大学深造")
['小明',
 '硕士',
 '毕业',
 '于',
 '中国科学院',
 '计算所',
 ',',
 '后',
 '在',
 '日本京都大学',
 '深造']
class gluonnlp.data.NLTKStanfordSegmenter(segmenter_root='/var/lib/jenkins/.mxnet/stanford-segmenter', slf4j_root='/var/lib/jenkins/.mxnet/slf4j', java_class='edu.stanford.nlp.ie.crf.CRFClassifier')[source]

Apply the Stanford Chinese Word Segmenter implemented in NLTK.

Users of this class are required to install Java and NLTK, and to download the Stanford Word Segmenter.

Parameters:
  • segmenter_root (str, default '$MXNET_HOME/stanford-segmenter') – Path to folder for storing the Stanford Segmenter. MXNET_HOME defaults to ‘~/.mxnet’.
  • slf4j_root (str, default '$MXNET_HOME/slf4j') – Path to folder for storing slf4j. MXNET_HOME defaults to ‘~/.mxnet’.
  • java_class (str, default 'edu.stanford.nlp.ie.crf.CRFClassifier') – The learning algorithm used for segmentation

Examples

>>> tokenizer = NLTKStanfordSegmenter()
>>> tokenizer(u"我来到北京清华大学")
['我',
 '来到',
 '北京',
 '清华大学']
>>> tokenizer(u"小明硕士毕业于中国科学院计算所,后在日本京都大学深造")
['小明',
 '硕士',
 '毕业',
 '于',
 '中国',
 '科学院',
 '计算所',
 ',',
 '后',
 '在',
 '日本',
 '京都大学',
 '深造']
class gluonnlp.data.ConstWidthBucket[source]

Buckets with constant width.

class gluonnlp.data.LinearWidthBucket[source]

Buckets with linearly increasing width: \(w_i = \alpha * i + 1\) for all \(i \geq 1\).

class gluonnlp.data.ExpWidthBucket(bucket_len_step=1.1)[source]

Buckets with exponentially increasing width: \(w_i = \text{bucket\_len\_step} \cdot w_{i-1}\) for all \(i \geq 2\).

Parameters:bucket_len_step (float, default 1.1) – This is the increasing factor for the bucket width.
class gluonnlp.data.SortedSampler(sort_keys, reverse=True)[source]

Sort the samples based on the sort key and then sample sequentially.

Parameters:
  • sort_keys (list-like object) – List of the sort keys.
  • reverse (bool, default True) – Whether to sort in descending order.
class gluonnlp.data.FixedBucketSampler(lengths, batch_size, num_buckets=10, bucket_keys=None, ratio=0, shuffle=False, use_average_length=False, bucket_scheme=<gluonnlp.data.sampler.ConstWidthBucket object>)[source]

Assign each data sample to a fixed bucket based on its length. The bucket keys are either given or generated from the input sequence lengths.

Parameters:
  • lengths (list of int or list of tuple/list of int) – The length of the sequences in the input data sample.
  • batch_size (int) – The batch size of the sampler.
  • num_buckets (int or None, default 10) – The number of buckets. This will not be used if bucket_keys is set.
  • bucket_keys (None or list of int or list of tuple, default None) – The keys that will be used to create the buckets. It should usually be the lengths of the sequences. If it is None, the bucket_keys will be generated based on the maximum lengths of the data.
  • ratio (float, default 0) –

    Ratio to scale up the batch size of smaller buckets. Assume the \(i\)-th key is \(K_i\), the default batch size is \(B\), the ratio to scale the batch size is \(\alpha\), and the batch size corresponding to the \(i\)-th bucket is \(B_i\). We have:

    \[B_i = \max\left(\alpha B \times \frac{\max_j \text{sum}(K_j)}{\text{sum}(K_i)}, B\right)\]

    Thus, setting this to a value larger than 0, like 0.5, will scale up the batch size of the smaller buckets.

  • shuffle (bool, default False) – Whether to shuffle the batches.
  • use_average_length (bool, default False) – False: each batch contains batch_size sequences, number of sequence elements varies. True: each batch contains batch_size elements, number of sequences varies. In this case, ratio option is ignored.
  • bucket_scheme (BucketScheme, default ConstWidthBucket) – It is used to generate bucket keys. It supports: ConstWidthBucket (all the buckets have the same width), LinearWidthBucket (the width of the \(i\)-th bucket follows \(w_i = \alpha * i + 1\)), and ExpWidthBucket (the width of the \(i\)-th bucket follows \(w_i = \text{bucket\_len\_step} \cdot w_{i-1}\)).

Examples

>>> from gluonnlp.data import FixedBucketSampler
>>> import numpy as np
>>> lengths = [np.random.randint(1, 100) for _ in range(1000)]
>>> sampler = FixedBucketSampler(lengths, 8)
>>> print(sampler.stats())
FixedBucketSampler:
  sample_num=1000, batch_num=128
  key=[9, 19, 29, 39, 49, 59, 69, 79, 89, 99]
  cnt=[95, 103, 91, 97, 86, 79, 102, 100, 128, 119]
  batch_size=[8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
>>> sampler = FixedBucketSampler(lengths, 8, ratio=0.5)
>>> print(sampler.stats())
FixedBucketSampler:
  sample_num=1000, batch_num=104
  key=[9, 19, 29, 39, 49, 59, 69, 79, 89, 99]
  cnt=[95, 103, 91, 97, 86, 79, 102, 100, 128, 119]
  batch_size=[44, 20, 13, 10, 8, 8, 8, 8, 8, 8]
stats()[source]

Return a string representing the statistics of the bucketing sampler.

Returns:ret – String representing the statistics of the buckets.
Return type:str
class gluonnlp.data.SortedBucketSampler(sort_keys, batch_size, mult=100, reverse=True, shuffle=False)[source]

Batches are sampled from sorted buckets of data.

First, the data is partitioned into buckets of size batch_size * mult. The samples inside each bucket are then sorted based on sort_key and batched.

Parameters:
  • sort_keys (list-like object) – The keys to sort the samples.
  • batch_size (int) – Batch size of the sampler.
  • mult (int or float, default 100) – The multiplier to determine the bucket size. Each bucket will have size mult * batch_size.
  • reverse (bool, default True) – Whether to sort in descending order.
  • shuffle (bool, default False) – Whether to shuffle the data.

Examples

>>> from gluonnlp.data import SortedBucketSampler
>>> import numpy as np
>>> lengths = [np.random.randint(1, 1000) for _ in range(1000)]
>>> sampler = SortedBucketSampler(lengths, 16)
>>> # The sequence lengths within the batch will be sorted
>>> for i, indices in enumerate(sampler):
...     if i == 0:
...         print([lengths[ele] for ele in indices])
[999, 999, 999, 997, 997, 996, 995, 993, 991, 991, 989, 989, 987, 987, 986, 985]
class gluonnlp.data.ContextSampler(coded, batch_size, window=5)[source]

Sample batches of contexts (and their masks) from a corpus.

The context size is chosen uniformly at random for every sample from [1, window]. The mask is used to mask entries that lie outside of the randomly chosen context size. Contexts do not cross sentence boundaries.

Batches are created lazily. To avoid generating all batches for shuffling before training, simply shuffle the dataset before passing it to the ContextSampler.

Parameters:
  • coded (list of lists of int) – List of coded sentences. A coded sentence itself is a list of token indices. Context samples do not cross sentence boundaries.
  • batch_size (int) – Maximum size of batches. Actual batch returned can be smaller when running out of samples.
  • window (int, default 5) – The maximum context size.
Variables:

num_samples (int) – Overall number of samples that are iterated over in batches. This is the total number of token indices in coded.

class gluonnlp.data.TextLineDataset(filename, encoding='utf8')[source]

Dataset that comprises lines in a file. Each line will be stripped.

Parameters:
  • filename (str) – Path to the input text file.
  • encoding (str, default 'utf8') – File encoding format.
class gluonnlp.data.CorpusDataset(filename, encoding='utf8', flatten=False, skip_empty=True, sample_splitter=<function CorpusDataset.<lambda>>, tokenizer=<function CorpusDataset.<lambda>>, bos=None, eos=None)[source]

Common text dataset that reads a whole corpus based on provided sample splitter and word tokenizer.

The returned dataset includes samples, each of which can either be a list of tokens if tokenizer is specified, or otherwise a single string segment produced by the sample_splitter.

Parameters:
  • filename (str or list of str) – Path to the input text file or list of paths to the input text files.
  • encoding (str, default 'utf8') – File encoding format.
  • flatten (bool, default False) – Whether to return all samples as flattened tokens. If True, each sample is a token.
  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
  • sample_splitter (function, default str.splitlines) – A function that splits the dataset string into samples.
  • tokenizer (function or None, default str.split) – A function that splits each sample string into list of tokens. If None, raw samples are returned according to sample_splitter.
  • bos (str or None, default None) – The token to add at the beginning of each sequence. If None, or if tokenizer is not specified, then nothing is added.
  • eos (str or None, default None) – The token to add at the end of each sequence. If None, or if tokenizer is not specified, then nothing is added.
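
Examples

A sketch of reading a local corpus; 'corpus.txt' is a hypothetical file with one sentence per line:

>>> import gluonnlp as nlp
>>> dataset = nlp.data.CorpusDataset('corpus.txt')
>>> first_sample = dataset[0]  # a list of whitespace-separated tokens from the first line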
class gluonnlp.data.LanguageModelDataset(filename, encoding='utf8', skip_empty=True, sample_splitter=<function LanguageModelDataset.<lambda>>, tokenizer=<function LanguageModelDataset.<lambda>>, bos=None, eos=None)[source]

Reads a whole corpus and produces a language modeling dataset given the provided sample splitter and word tokenizer.

Parameters:
  • filename (str) – Path to the input text file.
  • encoding (str, default 'utf8') – File encoding format.
  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
  • sample_splitter (function, default str.splitlines) – A function that splits the dataset string into samples.
  • tokenizer (function, default str.split) – A function that splits each sample string into list of tokens.
  • bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.
  • eos (str or None, default None) – The token to add at the end of each sentence. If None, nothing is added.
batchify(vocab, batch_size)[source]

Transform the dataset into N independent sequences, where N is the batch size.

Parameters:
  • vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.
  • batch_size (int) – The number of samples in each batch.
Returns:NDArray of shape (num_tokens // N, N). Excessive tokens that don’t align along the batches are discarded.

bptt_batchify(vocab, seq_len, batch_size, last_batch='keep')[source]

Transform the dataset into batches of numericalized samples, such that the recurrent states from the last batch connect with the current batch for each sample.

Each sample is of shape (seq_len, batch_size). When last_batch=’keep’, the first dimension of the last sample may be shorter than seq_len.

Parameters:
  • vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.
  • seq_len (int) – The length of each of the samples for truncated back-propagation-through-time (TBPTT).
  • batch_size (int) – The number of samples in each batch.
  • last_batch ({'keep', 'discard'}) –

    How to handle the last batch if the remaining length is less than seq_len.

    • keep: A batch with fewer samples than previous batches is returned. vocab.padding_token is used to pad the last batch based on batch size.
    • discard: The last batch is discarded if it’s smaller than (seq_len, batch_size).
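
Examples

A sketch of building batches for truncated BPTT; 'corpus.txt' is a hypothetical text file, and the vocabulary is built by iterating the flattened dataset (assuming one token per sample, as described for CorpusDataset above):

>>> import gluonnlp as nlp
>>> corpus = nlp.data.LanguageModelDataset('corpus.txt')
>>> vocab = nlp.Vocab(nlp.data.count_tokens(list(corpus)))
>>> bptt_data = corpus.bptt_batchify(vocab, seq_len=35, batch_size=20, last_batch='discard')
>>> data, target = bptt_data[0]  # expected: NDArrays of shape (seq_len, batch_size), target shifted by one token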
class gluonnlp.data.WikiText2(segment='train', skip_empty=True, tokenizer=<function WikiText2.<lambda>>, bos=None, eos='<eos>', root='/var/lib/jenkins/.mxnet/datasets/wikitext-2', **kwargs)[source]

WikiText-2 word-level dataset for language modeling, from Salesforce research.

From https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset

License: Creative Commons Attribution-ShareAlike

Parameters:
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
  • tokenizer (function, default str.split) – A function that splits each sample string into list of tokens.
  • bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.
  • eos (str or None, default '<eos>') – The token to add at the end of each sentence. If None, nothing is added.
  • root (str, default '$MXNET_HOME/datasets/wikitext-2') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.WikiText103(segment='train', skip_empty=True, tokenizer=<function WikiText103.<lambda>>, bos=None, eos='<eos>', root='/var/lib/jenkins/.mxnet/datasets/wikitext-103', **kwargs)[source]

WikiText-103 word-level dataset for language modeling, from Salesforce research.

From https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset

License: Creative Commons Attribution-ShareAlike

Parameters:
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
  • tokenizer (function, default str.split) – A function that splits each sample string into list of tokens.
  • bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.
  • eos (str or None, default '<eos>') – The token to add at the end of each sentence. If None, nothing is added.
  • root (str, default '$MXNET_HOME/datasets/wikitext-103') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.WikiText2Raw(segment='train', skip_empty=True, bos=None, eos=None, tokenizer=<function WikiText2Raw.<lambda>>, root='/var/lib/jenkins/.mxnet/datasets/wikitext-2', **kwargs)[source]

WikiText-2 character-level dataset for language modeling

From Salesforce research: https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset

License: Creative Commons Attribution-ShareAlike

Parameters:
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
  • tokenizer (function, default s.encode('utf-8')) – A function that splits each sample string into list of tokens. The tokenizer can also be used to convert everything to lowercase. E.g. with tokenizer=lambda s: s.lower().encode(‘utf-8’)
  • bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.
  • eos (str or None, default None) – The token to add at the end of each sentence. If None, nothing is added.
  • root (str, default '$MXNET_HOME/datasets/wikitext-2') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.WikiText103Raw(segment='train', skip_empty=True, tokenizer=<function WikiText103Raw.<lambda>>, bos=None, eos=None, root='/var/lib/jenkins/.mxnet/datasets/wikitext-103', **kwargs)[source]

WikiText-103 character-level dataset for language modeling

From Salesforce research: https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset

License: Creative Commons Attribution-ShareAlike

Parameters:
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
  • tokenizer (function, default s.encode('utf-8')) – A function that splits each sample string into list of tokens. The tokenizer can also be used to convert everything to lowercase. E.g. with tokenizer=lambda s: s.lower().encode(‘utf-8’)
  • bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.
  • eos (str or None, default None) – The token to add at the end of each sentence. If None, nothing is added.
  • root (str, default '$MXNET_HOME/datasets/wikitext-103') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.GBWStream(segment='train', skip_empty=True, bos='<bos>', eos='<eos>', root='/var/lib/jenkins/.mxnet/datasets/gbw')[source]

1-Billion-Word word-level dataset for language modeling, from Google.

From http://www.statmt.org/lm-benchmark

License: Apache

Parameters:
  • segment ({'train', 'test'}, default 'train') – Dataset segment.
  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
  • bos (str or None, default '<bos>') – The token to add at the beginning of each sentence. If None, nothing is added.
  • eos (str or None, default '<eos>') – The token to add at the end of each sentence. If None, nothing is added.
  • root (str, default '$MXNET_HOME/datasets/gbw') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.IMDB(segment='train', root='/var/lib/jenkins/.mxnet/datasets/imdb')[source]

IMDB reviews for sentiment analysis.

From http://ai.stanford.edu/~amaas/data/sentiment/

Parameters:
  • segment (str, default 'train') – Dataset segment. Options are ‘train’, ‘test’, and ‘unsup’ for unsupervised.
  • root (str, default '$MXNET_HOME/datasets/imdb') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.WordSimilarityEvaluationDataset(root)[source]

Base class for word similarity or relatedness task datasets.

Inheriting classes are assumed to implement datasets of the form [‘word1’, ‘word2’, score] where score is a numerical similarity or relatedness score with respect to ‘word1’ and ‘word2’.

class gluonnlp.data.WordAnalogyEvaluationDataset(root)[source]

Base class for word analogy task datasets.

Inheriting classes are assumed to implement datasets of the form [‘word1’, ‘word2’, ‘word3’, ‘word4’] or [‘word1’, [ ‘word2a’, ‘word2b’, … ], ‘word3’, [ ‘word4a’, ‘word4b’, … ]].

class gluonnlp.data.WordSim353(segment='all', root='/var/lib/jenkins/.mxnet/datasets/wordsim353')[source]

WordSim353 dataset.

The dataset was collected by Finkelstein et al. (http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/). Agirre et al. proposed to split the collection into two datasets, one focused on measuring similarity, and the other one on relatedness (http://alfonseca.org/eng/research/wordsim353.html).

  • Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing search in context: the concept revisited. ACM Trans. Inf. Syst., 20(1), 116–131. http://dx.doi.org/10.1145/503104.503110
  • Agirre, E., Alfonseca, E., Hall, K. B., Kravalova, J., Pasca, M., & Soroa, A. (2009). A study on similarity and relatedness using distributional and wordnet-based approaches. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31 - June 5, 2009, Boulder, Colorado, USA (pp. 19–27). The Association for Computational Linguistics.

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Parameters:
  • segment (str) – ‘relatedness’, ‘similarity’ or ‘all’
  • root (str, default '$MXNET_HOME/datasets/wordsim353') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.MEN(segment='dev', root='/var/lib/jenkins/.mxnet/datasets/men')[source]

MEN dataset for word-similarity and relatedness.

The dataset was collected by Bruni et al. (http://clic.cimec.unitn.it/~elia.bruni/MEN.html).

  • Bruni, E., Boleda, G., Baroni, M., & Nam-Khanh Tran (2012). Distributional semantics in technicolor. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, July 8-14, 2012, Jeju Island, Korea - Volume 1: Long Papers (pp. 136–145). The Association for Computer Linguistics.

License: Creative Commons Attribution 2.0 Generic (CC BY 2.0)

Parameters:
  • root (str, default '$MXNET_HOME/datasets/men') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
  • segment (str, default 'dev') – Dataset segment. Options are ‘train’, ‘dev’, ‘test’.
class gluonnlp.data.RadinskyMTurk(root='/var/lib/jenkins/.mxnet/datasets/radinskymturk')[source]

MTurk dataset for word-similarity and relatedness by Radinsky et al.

  • Radinsky, K., Agichtein, E., Gabrilovich, E., & Markovitch, S. (2011). A word at a time: computing word relatedness using temporal semantic analysis. In S. Srinivasan, K. Ramamritham, A. Kumar, M. P. Ravindra, E. Bertino, & R. Kumar, Proceedings of the 20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, March 28 - April 1, 2011 (pp. 337–346). ACM.

License: Unspecified

Parameters:root (str, default '$MXNET_HOME/datasets/radinskymturk') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.RareWords(root='/var/lib/jenkins/.mxnet/datasets/rarewords')[source]

Rare words dataset for word-similarity and relatedness.

  • Luong, T., Socher, R., & Manning, C. D. (2013). Better word representations with recursive neural networks for morphology. In J. Hockenmaier, & S. Riedel, Proceedings of the Seventeenth Conference on Computational Natural Language Learning, CoNLL 2013, Sofia, Bulgaria, August 8-9, 2013 (pp. 104–113). ACL.

License: Unspecified

Parameters:root (str, default '$MXNET_HOME/datasets/rarewords') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.SimLex999(root='/var/lib/jenkins/.mxnet/datasets/simlex999')[source]

SimLex999 dataset for word-similarity.

  • Hill, F., Reichart, R., & Korhonen, A. (2015). Simlex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4), 665–695. http://dx.doi.org/10.1162/COLI_a_00237

License: Unspecified

The dataset contains

  • word1: The first concept in the pair.
  • word2: The second concept in the pair. Note that the order is only relevant to the column Assoc(USF). These values (free association scores) are asymmetric. All other values are symmetric properties independent of the ordering word1, word2.
  • POS: The majority part-of-speech of the concept words, as determined by occurrence in the POS-tagged British National Corpus. Only pairs of matching POS are included in SimLex-999.
  • SimLex999: The SimLex999 similarity rating. Note that average annotator scores have been (linearly) mapped from the range [0,6] to the range [0,10] to match other datasets such as WordSim-353.
  • conc(w1): The concreteness rating of word1 on a scale of 1-7. Taken from the University of South Florida Free Association Norms database.
  • conc(w2): The concreteness rating of word2 on a scale of 1-7. Taken from the University of South Florida Free Association Norms database.
  • concQ: The quartile the pair occupies based on the two concreteness ratings. Used for some analyses in the above paper.
  • Assoc(USF): The strength of free association from word1 to word2. Values are taken from the University of South Florida Free Association Dataset.
  • SimAssoc333: Binary indicator of whether the pair is one of the 333 most associated in the dataset (according to Assoc(USF)). This subset of SimLex999 is often the hardest for computational models to capture because the noise from high association can confound the similarity rating. See the paper for more details.
  • SD(SimLex): The standard deviation of annotator scores when rating this pair. Low values indicate good agreement between the 15+ annotators on the similarity value SimLex999. Higher scores indicate less certainty.
Parameters:root (str, default '$MXNET_HOME/datasets/simlex999') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.SimVerb3500(segment='full', root='/var/lib/jenkins/.mxnet/datasets/simverb3500')[source]

SimVerb3500 dataset for word-similarity.

  • Gerz, D., Vulić, I., Hill, F., Reichart, R., & Korhonen, A. (2016). SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

License: Unspecified

The dataset contains

  • word1: The first verb of the pair.
  • word2: The second verb of the pair.
  • POS: The part-of-speech tag. Note that it is ‘V’ for all pairs, since the dataset exclusively contains verbs. We decided to include it nevertheless to make it compatible with SimLex-999.
  • score: The SimVerb-3500 similarity rating. Note that average annotator scores have been linearly mapped from the range [0,6] to the range [0,10] to match other datasets.
  • relation: the lexical relation of the pair. Possible values: ‘SYNONYMS’, ‘ANTONYMS’, ‘HYPER/HYPONYMS’, ‘COHYPONYMS’, ‘NONE’.
Parameters:root (str, default '$MXNET_HOME/datasets/simverb3500') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.SemEval17Task2(segment='trial', language='en', root='/var/lib/jenkins/.mxnet/datasets/semeval17task2')[source]

SemEval17Task2 dataset for word-similarity.

The dataset was collected for SemEval-2017 Task 2 (Multilingual and Cross-lingual Semantic Word Similarity).

  • Camacho-Collados, J., Pilehvar, M. T., Collier, N., & Navigli, R. (2017). SemEval-2017 Task 2: Multilingual and Cross-lingual Semantic Word Similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics.

License: Unspecified

Parameters:
  • root (str, default '$MXNET_HOME/datasets/semeval17task2') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
  • segment (str, default 'trial') – Dataset segment. Options are ‘trial’, ‘test’.
  • language (str, default 'en') – Dataset language.
class gluonnlp.data.BakerVerb143(root='/var/lib/jenkins/.mxnet/datasets/verb143')[source]

Verb143 dataset.

  • Baker, S., Reichart, R., & Korhonen, A. (2014). An unsupervised model for instance level subcategorization acquisition. In A. Moschitti, B. Pang, & W. Daelemans, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, a meeting of SIGDAT, a Special Interest Group of the ACL (pp. 278–289). ACL.

144 pairs of verbs annotated by 10 annotators following the WS-353 guidelines.

License: unspecified

Parameters:root (str, default '$MXNET_HOME/datasets/verb143') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.YangPowersVerb130(root='~/.mxnet/datasets/verb130')[source]

Verb-130 dataset.

  • Yang, D., & Powers, D. M. (2006). Verb similarity on the taxonomy of wordnet. In The Third International WordNet Conference: GWC 2006

License: Unspecified

Parameters:root (str, default '$MXNET_HOME/datasets/verb130') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.GoogleAnalogyTestSet(group=None, category=None, lowercase=True, root='/var/lib/jenkins/.mxnet/datasets/google_analogy')[source]

Google analogy test set

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR).

License: Unspecified

class gluonnlp.data.BiggerAnalogyTestSet(category=None, form_analogy_pairs=True, drop_alternative_solutions=True, root='/var/lib/jenkins/.mxnet/datasets/bigger_analogy')[source]

Bigger analogy test set

  • Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. In Proceedings of the NAACL-HLT SRW (pp. 47–54). San Diego, California, June 12-17, 2016: ACL. Retrieved from https://www.aclweb.org/anthology/N/N16/N16-2002.pdf

License: Unspecified

Parameters:root (str, default '$MXNET_HOME/datasets/bigger_analogy') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.DataStream[source]

Abstract Data Stream Interface.

transform(fn)[source]
Returns:The data stream that lazily transforms the data while streaming.
Return type:DataStream
class gluonnlp.data.CorpusStream(file_pattern, encoding='utf8', flatten=False, skip_empty=True, sample_splitter=<function CorpusStream.<lambda>>, tokenizer=<function CorpusStream.<lambda>>, bos=None, eos=None, sampler='random', file_sampler='random')[source]

Common text data stream that streams a corpus consisting of multiple text files that match provided file_pattern. One file is read at a time.

The returned dataset includes samples, each of which can either be a list of tokens if tokenizer is specified, or otherwise a single string segment produced by the sample_splitter.

Parameters:
  • file_pattern (str) – Path to the input text files.
  • encoding (str, default 'utf8') – File encoding format.
  • flatten (bool, default False) – Whether to return all samples as flattened tokens. If True, each sample is a token.
  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
  • sample_splitter (function, default str.splitlines) – A function that splits the dataset string into samples.
  • tokenizer (function or None, default str.split) – A function that splits each sample string into list of tokens. If None, raw samples are returned according to sample_splitter.
  • bos (str or None, default None) – The token to add at the beginning of each sequence. If None, or if tokenizer is not specified, then nothing is added.
  • eos (str or None, default None) – The token to add at the end of each sequence. If None, or if tokenizer is not specified, then nothing is added.
  • sampler (str, {'sequential', 'random'}, defaults to 'random') –

    The sampler used to sample texts within a file.

    • ’sequential’: SequentialSampler
    • ’random’: RandomSampler
  • file_sampler (str, {'sequential', 'random'}, defaults to 'random') –

    The sampler used to sample a file.

    • ’sequential’: SequentialSampler
    • ’random’: RandomSampler
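
Examples

A sketch of streaming a corpus spread over several files; the glob pattern 'data/*.txt' is a hypothetical placeholder:

>>> import gluonnlp as nlp
>>> stream = nlp.data.CorpusStream('data/*.txt', sampler='sequential', file_sampler='sequential')
>>> for tokens in stream:
...     pass  # each item is a list of tokens from one sample of one file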
class gluonnlp.data.LanguageModelStream(file_pattern, encoding='utf8', skip_empty=True, sample_splitter=<function LanguageModelStream.<lambda>>, tokenizer=<function LanguageModelStream.<lambda>>, bos=None, eos=None, sampler='random', file_sampler='random')[source]

Streams a corpus consisting of multiple text files that match provided file_pattern, and produces a language modeling stream given the provided sample splitter and word tokenizer.

Parameters:
  • file_pattern (str) – Path to the input text files.
  • encoding (str, default 'utf8') – File encoding format.
  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
  • sample_splitter (function, default str.splitlines) – A function that splits the dataset string into samples.
  • tokenizer (function or None, default str.split) – A function that splits each sample string into list of tokens. If None, raw samples are returned according to sample_splitter.
  • bos (str or None, default None) – The token to add at the beginning of each sequence. If None, or if tokenizer is not specified, then nothing is added.
  • eos (str or None, default None) – The token to add at the end of each sequence. If None, or if tokenizer is not specified, then nothing is added.
  • sampler (str, {'sequential', 'random'}, defaults to 'random') –

    The sampler used to sample texts within a file.

    • ’sequential’: SequentialSampler
    • ’random’: RandomSampler
  • file_sampler (str, {'sequential', 'random'}, defaults to 'random') –

    The sampler used to sample a file.

    • ’sequential’: SequentialSampler
    • ’random’: RandomSampler
bptt_batchify(vocab, seq_len, batch_size, last_batch='keep')[source]

The corpus is transformed into batches of numericalized samples, such that the recurrent states from the last batch connect with the current batch for each sample.

Each sample is of shape (seq_len, batch_size).

For example, the following 4 sequences:

<bos> a b c d <eos>
<bos> e f g h i j <eos>
<bos> k l m n <eos>
<bos> o <eos>

will generate 2 batches with seq_len = 5, batch_size = 2 as follow (transposed):

batch_0.data.T:

<bos> a b c d
<bos> e f g h

batch_0.target.T:

a b c d <eos>
e f g h i

batch_0.mask.T:

1 1 1 1 1
1 1 1 1 1

batch_1.data.T:

<bos> k l m n
i j <bos> o <padding>

batch_1.target.T:

k l m n <eos>
j <bos> o <eos> <padding>

batch_1.mask.T:

1 1 1 1 1
1 1 1 1 0
Parameters:
  • vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.
  • seq_len (int) – The length of each of the samples for truncated back-propagation-through-time (TBPTT).
  • batch_size (int) – The number of samples in each batch.
  • last_batch ({'keep', 'discard'}) –

    How to handle the last batch if the remaining length is less than seq_len.

    • keep: A batch with fewer samples than previous batches is returned.
    • discard: The last batch is discarded if it’s smaller than (seq_len, batch_size).
class gluonnlp.data.SimpleDataStream(stream)[source]

Simple DataStream wrapper for a stream.

class gluonnlp.data.PrefetchingStream(streams)[source]

Performs pre-fetch for other data iterators. This iterator will create another thread to perform iter_next and then store the data in memory. It potentially accelerates the data read, at the cost of more memory usage.

Parameters:streams (DataStream or list of DataStream) – The data streams to be pre-fetched.
class gluonnlp.data.Text8(root='/var/lib/jenkins/.mxnet/datasets/text8', segment='train', max_sentence_length=10000)[source]

Text8 corpus

http://mattmahoney.net/dc/textdata.html

Part of the test data for the Large Text Compression Benchmark http://mattmahoney.net/dc/text.html. The first 10**8 bytes of the English Wikipedia dump on Mar. 3, 2006.

License: https://en.wikipedia.org/wiki/Wikipedia:Copyrights

Parameters:root (str, default '$MXNET_HOME/datasets/text8') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.CoNLL2000(segment='train', root='/var/lib/jenkins/.mxnet/datasets/conll2000')[source]

CoNLL2000 Part-of-speech (POS) tagging and chunking joint task dataset.

Each sample has three fields: word, POS tag, chunk label.

From https://www.clips.uantwerpen.be/conll2000/chunking/

Parameters:
  • segment ({'train', 'test'}, default 'train') – Dataset segment.
  • root (str, default '$MXNET_HOME/datasets/conll2000') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.CoNLL2001(part, segment='train', root='/var/lib/jenkins/.mxnet/datasets/conll2001')[source]

CoNLL2001 Clause Identification dataset.

Each sample has four fields: word, POS tag, chunk label, clause tag.

From https://www.clips.uantwerpen.be/conll2001/clauses/

Parameters:
  • part (int, {1, 2, 3}) – Part number of the dataset.
  • segment ({'train', 'testa', 'testb'}, default 'train') – Dataset segment.
  • root (str, default '$MXNET_HOME/datasets/conll2001') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.CoNLL2002(lang, segment='train', root='/var/lib/jenkins/.mxnet/datasets/conll2002')[source]

CoNLL2002 Named Entity Recognition (NER) task dataset.

For ‘esp’, each sample has two fields: word, NER label.

For ‘ned’, each sample has three fields: word, POS tag, NER label.

From https://www.clips.uantwerpen.be/conll2002/ner/

Parameters:
  • lang (str, {'esp', 'ned'}) – Dataset language.
  • segment ({'train', 'testa', 'testb'}, default 'train') – Dataset segment.
  • root (str, default '$MXNET_HOME/datasets/conll2002') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.CoNLL2004(segment='train', root='/var/lib/jenkins/.mxnet/datasets/conll2004')[source]

CoNLL2004 Semantic Role Labeling (SRL) task dataset.

Each sample has seven or more fields: word, POS tag, chunk label, clause tag, NER label, target verbs, and sense labels (of variable number per sample).

From http://www.cs.upc.edu/~srlconll/st04/st04.html

Parameters:
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.
  • root (str, default '$MXNET_HOME/datasets/conll2004') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.UniversalDependencies21(lang='en', segment='train', root='/var/lib/jenkins/.mxnet/datasets/ud2.1')[source]

Universal dependencies tree banks.

Each sample has 8 or more fields as described in http://universaldependencies.org/docs/format.html

From https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2515

Parameters:
  • lang (str, default 'en') – Dataset language.
  • segment (str, default 'train') – Dataset segment.
  • root (str, default '$MXNET_HOME/datasets/ud2.1') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.IWSLT2015(segment='train', src_lang='en', tgt_lang='vi', root='/var/lib/jenkins/.mxnet/datasets/iwslt2015')[source]

Preprocessed IWSLT English-Vietnamese Translation Dataset.

We use the preprocessed version provided in https://nlp.stanford.edu/projects/nmt/

Parameters:
  • segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘val’, ‘test’ or their combinations.
  • src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘vi’.
  • tgt_lang (str, default 'vi') – The target language. Options for source and target languages are ‘en’ <-> ‘vi’.
  • root (str, default '$MXNET_HOME/datasets/iwslt2015') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.WMT2014(segment='train', src_lang='en', tgt_lang='de', full=False, root='/var/lib/jenkins/.mxnet/datasets/wmt2014')[source]

Translation Corpus of the WMT2014 Evaluation Campaign.

Parameters:
  • segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘newstest2009’, ‘newstest2010’, ‘newstest2011’, ‘newstest2012’, ‘newstest2013’, ‘newstest2014’ or their combinations
  • src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘de’.
  • tgt_lang (str, default 'de') – The target language. Options for source and target languages are ‘en’ <-> ‘de’.
  • full (bool, default False) – By default, we use the test dataset in http://statmt.org/wmt14/test-filtered.tgz. When full is True, we use the test dataset in http://statmt.org/wmt14/test-full.tgz
  • root (str, default '$MXNET_HOME/datasets/wmt2014') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.WMT2014BPE(segment='train', src_lang='en', tgt_lang='de', full=False, root='/var/lib/jenkins/.mxnet/datasets/wmt2014')[source]

Preprocessed Translation Corpus of the WMT2014 Evaluation Campaign.

We preprocess the dataset by adapting https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh

Parameters:
  • segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘newstest2009’, ‘newstest2010’, ‘newstest2011’, ‘newstest2012’, ‘newstest2013’, ‘newstest2014’ or their combinations
  • src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘de’.
  • tgt_lang (str, default 'de') – The target language. Options for source and target languages are ‘en’ <-> ‘de’.
  • full (bool, default False) – By default, we use the test dataset in http://statmt.org/wmt14/test-filtered.tgz. When full is True, we use the test dataset in http://statmt.org/wmt14/test-full.tgz
  • root (str, default '$MXNET_HOME/datasets/wmt2014') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.WMT2016(segment='train', src_lang='en', tgt_lang='de', root='/var/lib/jenkins/.mxnet/datasets/wmt2016')[source]

Translation Corpus of the WMT2016 Evaluation Campaign.

Parameters:
  • segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘newstest2012’, ‘newstest2013’, ‘newstest2014’, ‘newstest2015’, ‘newstest2016’ or their combinations
  • src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘de’.
  • tgt_lang (str, default 'de') – The target language. Options for source and target languages are ‘en’ <-> ‘de’.
  • root (str, default '$MXNET_HOME/datasets/wmt2016') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
class gluonnlp.data.WMT2016BPE(segment='train', src_lang='en', tgt_lang='de', root='/var/lib/jenkins/.mxnet/datasets/wmt2016')[source]

Preprocessed Translation Corpus of the WMT2016 Evaluation Campaign.

We use the preprocessing script in https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh

Parameters:
  • segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘newstest2012’, ‘newstest2013’, ‘newstest2014’, ‘newstest2015’, ‘newstest2016’ or their combinations
  • src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘de’.
  • tgt_lang (str, default 'de') – The target language. Options for source and target languages are ‘en’ <-> ‘de’.
  • root (str, default '$MXNET_HOME/datasets/wmt2016') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
gluonnlp.data.register(class_=None, **kwargs)[source]

Registers a dataset with segment-specific hyperparameters.

When passing keyword arguments to register, they are checked to be valid keyword arguments for the registered Dataset class constructor and are saved in the registry. Registered keyword arguments can be retrieved with the list_datasets function.

All arguments that result in creation of separate datasets should be registered. Examples are datasets divided in different segments or categories, or datasets containing multiple languages.

Once registered, an instance can be created by calling create() with the class name.

Parameters:**kwargs (list or tuple of allowed argument values) – For each keyword argument, its value must be a list or tuple of the allowed argument values.

Examples

>>> @gluonnlp.data.register(segment=['train', 'test', 'dev'])
... class MyDataset(Dataset):
...     def __init__(self, segment='train'):
...         pass
>>> my_dataset = gluonnlp.data.create('MyDataset')
>>> print(type(my_dataset))
<class '__main__.MyDataset'>
gluonnlp.data.create(name, **kwargs)[source]

Creates an instance of a registered dataset.

Parameters:name (str) – The dataset name (case-insensitive).
Returns:An instance of the specified dataset.
Return type:Dataset
gluonnlp.data.list_datasets(name=None)[source]

Get valid datasets and registered parameters.

Parameters:name (str or None, default None) – Return names and registered parameters of registered datasets. If name is specified, only registered parameters of the respective dataset are returned.
Returns:A dict of all the valid keyword parameter names for the specified dataset. If name is set to None, returns a dict mapping each valid name to its respective keyword parameter dict. The valid names can be plugged into gluonnlp.data.create(name).
Return type:dict
class gluonnlp.data.SQuAD(segment='train', root='~/.mxnet/datasets/squad')[source]

Stanford Question Answering Dataset (SQuAD) - reading comprehension dataset.

From https://rajpurkar.github.io/SQuAD-explorer/

License: CreativeCommons BY-SA 4.0

The original data format is JSON, which has multiple contexts (a context is a paragraph of text from which questions are drawn). For each context there are multiple questions, and for each of these questions there are multiple (usually 3) answers.

This class loads the JSON file and flattens it to a table view. Each record is a single question. Since there is more than one question per context in the original dataset, some records share the same context. The number of records in the dataset is equal to the number of questions in the JSON file.

The format of each record of the dataset is as follows:

  • record_index: An index of the record, generated on the fly (0 … to # of last question)
  • question_id: Question Id. It is a string and taken from the original json file as-is
  • question: Question text, taken from the original json file as-is
  • context: Context text. Will be the same for questions from the same context
  • answer_list: All answers for this question. Stored as python list
  • start_indices: All answers’ starting indices. Stored as python list. The position in this list is the same as the position of an answer in answer_list
Parameters:
  • segment (str, default 'train') – Dataset segment. Options are ‘train’ and ‘dev’.
  • root (str, default '~/.mxnet/datasets/squad') – Path to temp folder for storing data.
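
Examples

A minimal sketch of loading the development split and unpacking one record (field layout as described above):

>>> import gluonnlp as nlp
>>> dev = nlp.data.SQuAD(segment='dev')
>>> record = dev[0]
>>> # record layout: [record_index, question_id, question, context, answer_list, start_indices]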