gluonnlp.data.batchify

Batchify functions can be used to transform a dataset into mini-batches that can be processed efficiently.
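
A minimal usage sketch (not part of the original reference): the helpers documented below are typically combined and passed as the batchify_fn argument of mxnet.gluon.data.DataLoader. The toy dataset of (sequence, label) pairs is assumed for illustration.

>>> import mxnet as mx
>>> from gluonnlp.data import batchify
>>> # Toy dataset: variable-length sequences paired with integer labels
>>> dataset = [([1, 2, 3, 4], 0), ([5, 7], 1), ([1, 2, 3, 4, 5, 6, 7], 0)]
>>> # Pad the sequences and stack the labels within each mini-batch
>>> batchify_fn = batchify.Tuple(batchify.Pad(), batchify.Stack())
>>> loader = mx.gluon.data.DataLoader(dataset, batch_size=2, batchify_fn=batchify_fn)
>>> for data, label in loader:
...     print(data.shape, label.shape)  # e.g. (2, 4) (2,) for the first mini-batch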

Batch Loaders

Stack Stack the input data samples to construct the batch.
Pad Return a callable that pads and stacks data.
Tuple Wrap multiple batchify functions together.

Language Modeling

CorpusBatchify Transform the dataset into N independent sequences, where N is the batch size.
CorpusBPTTBatchify Transform the dataset into batches of numericalized samples, such that the recurrent states from the last batch connect with the current batch for each sample.
StreamBPTTBatchify Transform a Stream of CorpusDataset to BPTT batches.

Embedding Training

EmbeddingCenterContextBatchify Batches of center and context words (and their masks).

API Reference

Batchify helpers.

class gluonnlp.data.batchify.Stack(dtype=None)[source]

Stack the input data samples to construct the batch.

The N input samples must have the same shape/length and will be stacked to construct a batch.

Parameters:dtype (str or numpy.dtype, default None) – The value type of the output. If it is set to None, the input data type is used.

Examples

>>> from gluonnlp.data import batchify
>>> # Stack multiple lists
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6, 8]
>>> c = [8, 9, 1, 2]
>>> batchify.Stack()([a, b, c])
[[1. 2. 3. 4.]
 [4. 5. 6. 8.]
 [8. 9. 1. 2.]]
<NDArray 3x4 @cpu(0)>
>>> # Stack multiple numpy.ndarrays
>>> import numpy as np
>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = np.array([[5, 6, 7, 8], [1, 2, 3, 4]])
>>> batchify.Stack()([a, b])
[[[1. 2. 3. 4.]
  [5. 6. 7. 8.]]
 [[5. 6. 7. 8.]
  [1. 2. 3. 4.]]]
<NDArray 2x2x4 @cpu(0)>
>>> # Stack multiple NDArrays
>>> import mxnet as mx
>>> a = mx.nd.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = mx.nd.array([[5, 6, 7, 8], [1, 2, 3, 4]])
>>> batchify.Stack()([a, b])
[[[1. 2. 3. 4.]
  [5. 6. 7. 8.]]
 [[5. 6. 7. 8.]
  [1. 2. 3. 4.]]]
<NDArray 2x2x4 @cpu(0)>
__call__(data)[source]

Batchify the input data.

Parameters:data (list) – The input data samples
Returns:batch_data
Return type:NDArray
class gluonnlp.data.batchify.Pad(axis=0, pad_val=0, ret_length=False, dtype=None)[source]

Return a callable that pads and stacks data.

Parameters:
  • axis (int, default 0) – The axis to pad the arrays. The arrays will be padded to the largest dimension at axis. For example, assume the input arrays have shape (10, 8, 5), (6, 8, 5), (3, 8, 5) and the axis is 0. Each input will be padded into (10, 8, 5) and then stacked to form the final output, which has shape (3, 10, 8, 5).
  • pad_val (float or int, default 0) – The padding value.
  • ret_length (bool, default False) – Whether to return the valid length in the output.
  • dtype (str or numpy.dtype, default None) – The value type of the output. If it is set to None, the input data type is used.

Examples

>>> from gluonnlp.data import batchify
>>> # Inputs are multiple lists
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6]
>>> c = [8, 2]
>>> batchify.Pad()([a, b, c])
[[ 1  2  3  4]
 [ 4  5  6  0]
 [ 8  2  0  0]]
<NDArray 3x4 @cpu(0)>
>>> # Also output the lengths
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6]
>>> c = [8, 2]
>>> batchify.Pad(ret_length=True)([a, b, c])
(
 [[1 2 3 4]
  [4 5 6 0]
  [8 2 0 0]]
 <NDArray 3x4 @cpu(0)>,
 [4 3 2]
 <NDArray 3 @cpu(0)>)
>>> # Inputs are multiple ndarrays
>>> import numpy as np
>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = np.array([[5, 8], [1, 2]])
>>> batchify.Pad(axis=1, pad_val=-1)([a, b])
[[[ 1  2  3  4]
  [ 5  6  7  8]]
 [[ 5  8 -1 -1]
  [ 1  2 -1 -1]]]
<NDArray 2x2x4 @cpu(0)>
__call__(data)[source]

Batchify the input data.

The input can be a list of numpy.ndarray, a list of numbers, or a list of mxnet.nd.NDArray. Inputting mxnet.nd.NDArray is discouraged as each array needs to be converted to numpy for efficient padding.

The arrays will be padded to the largest dimension at axis and then stacked to form the final output. In addition, the function will output the original dimensions at the axis if ret_length is turned on.

Parameters:data (List[np.ndarray] or List[List[dtype]] or List[mx.nd.NDArray]) – List of samples to pad and stack.
Returns:
  • batch_data (NDArray) – Data in the minibatch. Shape is (N, …)
  • valid_length (NDArray, optional) – The sequences’ original lengths at the padded axis. Shape is (N,). This will only be returned if ret_length is True.
class gluonnlp.data.batchify.Tuple(fn, *args)[source]

Wrap multiple batchify functions together. The input functions will be applied to the corresponding input fields.

Each data sample should be a list or tuple containing multiple attributes. The i-th batchify function stored in Tuple will be applied to the i-th attribute. For example, if each data sample is (nd_data, label), you can wrap two batchify functions using Tuple(DataBatchify, LabelBatchify) to batchify nd_data and label correspondingly.

Parameters:
  • fn (list or tuple or callable) – The batchify functions to wrap.
  • *args (tuple of callable) – The additional batchify functions to wrap.

Examples

>>> from gluonnlp.data import batchify
>>> a = ([1, 2, 3, 4], 0)
>>> b = ([5, 7], 1)
>>> c = ([1, 2, 3, 4, 5, 6, 7], 0)
>>> batchify.Tuple(batchify.Pad(), batchify.Stack())([a, b])
(
 [[1 2 3 4]
  [5 7 0 0]]
 <NDArray 2x4 @cpu(0)>,
 [0. 1.]
 <NDArray 2 @cpu(0)>)
>>> # Input can also be a list
>>> batchify.Tuple([batchify.Pad(), batchify.Stack()])([a, b])
(
 [[1 2 3 4]
  [5 7 0 0]]
 <NDArray 2x4 @cpu(0)>,
 [0. 1.]
 <NDArray 2 @cpu(0)>)
>>> # Another example
>>> a = ([1, 2, 3, 4], [5, 6], 1)
>>> b = ([1, 2], [3, 4, 5, 6], 0)
>>> c = ([1], [2, 3, 4, 5, 6], 0)
>>> batchify.Tuple(batchify.Pad(), batchify.Pad(), batchify.Stack())([a, b, c])
(
 [[1 2 3 4]
  [1 2 0 0]
  [1 0 0 0]]
 <NDArray 3x4 @cpu(0)>,
 [[5 6 0 0 0]
  [3 4 5 6 0]
  [2 3 4 5 6]]
 <NDArray 3x5 @cpu(0)>,
 [1. 0. 0.]
 <NDArray 3 @cpu(0)>)
__call__(data)[source]

Batchify the input data.

Parameters:data (list) – The samples to batchify. Each sample should contain N attributes.
Returns:ret – A tuple of length N. Contains the batchified result of each attribute in the input.
Return type:tuple
class gluonnlp.data.batchify.CorpusBatchify(vocab, batch_size)[source]

Transform the dataset into N independent sequences, where N is the batch size.

Parameters:
  • vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.
  • batch_size (int) – The number of samples in each batch.
__call__(data)[source]

Batchify a dataset.

Parameters:data (mxnet.gluon.data.Dataset) – A flat dataset to be batchified.
Returns:NDArray of shape (len(data) // N, N), where N is the batch_size, wrapped by a mxnet.gluon.data.SimpleDataset. Excess tokens that do not fit evenly into the batches are discarded.
Return type:mxnet.gluon.data.Dataset
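
The reference gives no example for this class; the following minimal sketch assumes a toy token list and a vocabulary built with gluonnlp.data.count_tokens.

>>> import mxnet as mx
>>> import gluonnlp as nlp
>>> tokens = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
>>> vocab = nlp.Vocab(nlp.data.count_tokens(tokens))
>>> dataset = mx.gluon.data.SimpleDataset(tokens)
>>> batchify_fn = nlp.data.batchify.CorpusBatchify(vocab, batch_size=2)
>>> batches = batchify_fn(dataset)
>>> len(batches), batches[0].shape  # 4 rows of shape (2,); the 9th token is discarded
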
class gluonnlp.data.batchify.CorpusBPTTBatchify(vocab, seq_len, batch_size, last_batch='keep')[source]

Transform the dataset into batches of numericalized samples, such that the recurrent states from the last batch connect with the current batch for each sample.

Each sample is of shape (seq_len, batch_size). When last_batch='keep', the first dimension of the last sample may be shorter than seq_len.

Parameters:
  • vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.
  • seq_len (int) – The length of each of the samples for truncated back-propagation-through-time (TBPTT).
  • batch_size (int) – The number of samples in each batch.
  • last_batch ({'keep', 'discard'}) –

    How to handle the last batch if the remaining length is less than seq_len.

    • keep: A batch with fewer samples than previous batches is returned. vocab.padding_token is used to pad the last batch up to the batch size.
    • discard: The last batch is discarded if it’s smaller than (seq_len, batch_size).
__call__(corpus)[source]

Batchify a dataset.

Parameters:corpus (mxnet.gluon.data.Dataset) – A flat dataset to be batchified.
Returns:Batches of numericalized samples such that the recurrent states from the last batch connect with the current batch for each sample. Each element of the Dataset is a tuple of data and label arrays for BPTT. Both are of shape (seq_len, batch_size).
Return type:mxnet.gluon.data.Dataset
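
A minimal sketch of the call pattern (assumed toy corpus, not part of the original reference):

>>> import gluonnlp as nlp
>>> corpus = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', '<eos>']
>>> vocab = nlp.Vocab(nlp.data.count_tokens(corpus))
>>> bptt_batchify = nlp.data.batchify.CorpusBPTTBatchify(
...     vocab, seq_len=3, batch_size=2, last_batch='discard')
>>> for data, target in bptt_batchify(corpus):
...     print(data.shape, target.shape)  # every full batch is (3, 2)
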
class gluonnlp.data.batchify.StreamBPTTBatchify(vocab, seq_len, batch_size, sampler='random', last_batch='keep')[source]

Transform a Stream of CorpusDataset to BPTT batches.

The corpus is transformed into batches of numericalized samples, such that the recurrent states from the last batch connect with the current batch for each sample.

Each sample is of shape (seq_len, batch_size).

For example, the following 4 sequences:

a b c d <eos>
e f g h i j <eos>
k l m n <eos>
o <eos>

will generate 2 batches with seq_len = 5, batch_size = 2 as follows (transposed):

batch_0.data.T:

a b c d <eos>
e f g h i

batch_0.target.T:

b c d <eos> k
f g h i j

batch_1.data.T:

k l m n <eos>
j <eos> o <eos> <padding>

batch_1.target.T:

l m n <eos> <padding>
<eos> o <eos> <padding> <padding>
Parameters:
  • vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.
  • seq_len (int) – The length of each of the samples for truncated back-propagation-through-time (TBPTT).
  • batch_size (int) – The number of samples in each batch.
  • sampler (str, {'sequential', 'random'}, defaults to 'random') –

    The sampler used to sample texts within a file.

    • ’sequential’: SequentialSampler
    • ’random’: RandomSampler
  • last_batch ({'keep', 'discard'}) –

    How to handle the last batch if the remaining length is less than seq_len.

    • keep: A batch with fewer samples than previous batches is returned.
    • discard: The last batch is discarded if it’s smaller than (seq_len, batch_size).
__call__(corpus)[source]

Batchify a stream.

Parameters:corpus (nlp.data.DatasetStream) – A stream of un-flattened CorpusDataset.
Returns:Batches of numericalized samples such that the recurrent states from the last batch connect with the current batch for each sample. Each element of the Dataset is a tuple of data and label arrays for BPTT. Both are of shape (seq_len, batch_size).
Return type:nlp.data.DataStream
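
A minimal sketch of the call pattern (not part of the original reference; it assumes two hypothetical temporary text files and, for brevity, builds the vocabulary from a known token list rather than from the stream itself):

>>> import os, tempfile
>>> import gluonnlp as nlp
>>> tmpdir = tempfile.mkdtemp()
>>> with open(os.path.join(tmpdir, 'part0.txt'), 'w') as f:
...     _ = f.write('a b c d\ne f g h i j\n')
>>> with open(os.path.join(tmpdir, 'part1.txt'), 'w') as f:
...     _ = f.write('k l m n\no\n')
>>> # One CorpusDataset per file, streamed lazily
>>> stream = nlp.data.SimpleDatasetStream(
...     nlp.data.CorpusDataset, os.path.join(tmpdir, '*.txt'))
>>> vocab = nlp.Vocab(nlp.data.count_tokens('a b c d e f g h i j k l m n o'.split()))
>>> bptt_stream = nlp.data.batchify.StreamBPTTBatchify(
...     vocab, seq_len=5, batch_size=2, last_batch='keep')
>>> for data, target in bptt_stream(stream):
...     print(data.shape, target.shape)  # (5, 2), except possibly the last batch
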
class gluonnlp.data.batchify.EmbeddingCenterContextBatchify(batch_size, window_size=5, reduce_window_size_randomly=True, shuffle=True)[source]

Batches of center and context words (and their masks).

The context size is chosen uniformly at random for every sample from [1, window_size] if reduce_window_size_randomly is True. The mask is used to mask entries that lie outside of the randomly chosen context size. Contexts do not cross sentence boundaries.

Batches are created lazily on an optionally shuffled version of the Dataset.

Parameters:
  • batch_size (int) – Maximum size of the batches returned. The actual batch returned can be smaller when running out of samples.
  • window_size (int, default 5) – The maximum number of context elements to consider to the left and right of each center element. Fewer elements may be considered if there are not sufficient elements to the left or right of the center element, or if a reduced window size was drawn.
  • reduce_window_size_randomly (bool, default True) – If True, randomly draw a reduced window size for every center element uniformly from [1, window_size].
  • shuffle (bool, default True) – If True, shuffle the sentences before lazily generating batches.
__call__(corpus)[source]

Batchify a dataset.

Parameters:corpus (list of lists of int) – List of coded sentences. A coded sentence itself is a list of token indices. Context samples do not cross sentence boundaries.
Returns:Each element of the DataStream is a tuple of 3 NDArrays (center, context, mask). The center array has shape (batch_size, 1). The context and mask arrays have shape (batch_size, 2*window_size). The center and context arrays contain the center and corresponding context words respectively. The mask array masks invalid elements in the context array. Elements in the context array can be invalid due to insufficient context elements at a certain position in a sentence or a randomly reduced context size.
Return type:DataStream
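
A minimal sketch following the return description above (the toy coded corpus is assumed for illustration):

>>> import gluonnlp as nlp
>>> # Two hypothetical coded sentences, i.e. lists of token indices
>>> coded = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9]]
>>> batchify_fn = nlp.data.batchify.EmbeddingCenterContextBatchify(
...     batch_size=4, window_size=2)
>>> for center, context, mask in batchify_fn(coded):
...     # center: (4, 1); context and mask: (4, 4), except possibly the last batch
...     print(center.shape, context.shape, mask.shape)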