gluonnlp.data.batchify

Batchify functions can be used to transform a dataset into mini-batches that can be processed efficiently.

Batch Loaders

Stack – Stack the input data samples to construct the batch.
Pad – Pad the input ndarrays along the specific padding axis and stack them to get the output.
Tuple – Wrap multiple batchify functions together.

API Reference

Batchify functions. They can be used in Gluon data loader to help combine individual samples into batches for fast processing.
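
To illustrate where a batchify function sits in a data pipeline, here is a minimal NumPy-only sketch. It does not use Gluon's data loader: `iter_batches` and `stack_batchify` are hypothetical helpers invented for this illustration, standing in for the loader and for a batchify function respectively.

```python
import numpy as np


def stack_batchify(samples):
    """Hypothetical stand-in for a batchify function: stack equal-shape samples."""
    return np.stack([np.asarray(s) for s in samples])


def iter_batches(dataset, batch_size, batchify_fn):
    """Hypothetical stand-in for a data loader: slice out samples, then batchify them."""
    for start in range(0, len(dataset), batch_size):
        yield batchify_fn(dataset[start:start + batch_size])


# Five samples of length 2, grouped into mini-batches of at most 2 samples.
dataset = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
batches = list(iter_batches(dataset, batch_size=2, batchify_fn=stack_batchify))
batch_shapes = [b.shape for b in batches]
```

In Gluon itself, the batchify function is typically handed to `mxnet.gluon.data.DataLoader` through its `batchify_fn` argument rather than called by hand.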

class gluonnlp.data.batchify.Stack(dtype=None)

Stack the input data samples to construct the batch.

The N input samples must have the same shape/length and will be stacked to construct a batch.

Parameters:
  • dtype (str or numpy.dtype, default None) – The value type of the output. If it is set to None, the input data type is used.
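
The stacking rule and the role of dtype can be sketched with NumPy alone. This is an illustration of the described semantics, not the library's implementation: `stack_sketch` is a hypothetical name, it returns a NumPy array rather than an NDArray, and the dtype NumPy infers when dtype is None may differ from what the library chooses for Python lists.

```python
import numpy as np


def stack_sketch(samples, dtype=None):
    """Stack N equal-shape samples along a new leading axis (illustrative only)."""
    arrays = [np.asarray(s) for s in samples]
    shapes = {a.shape for a in arrays}
    if len(shapes) != 1:
        raise ValueError("all samples must have the same shape, got %s" % sorted(shapes))
    batch = np.stack(arrays)
    # With dtype=None the inferred input type is kept; otherwise cast the batch.
    return batch if dtype is None else batch.astype(dtype)


# Three samples of length 4 become one batch of shape (3, 4).
batch = stack_sketch([[1, 2, 3, 4], [4, 5, 6, 8], [8, 9, 1, 2]], dtype="float32")
```

Samples of different lengths are rejected here with a ValueError; padding such samples to a common length is what Pad is for.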

Examples

>>> from gluonnlp.data import batchify as bf
>>> # Stack multiple lists
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6, 8]
>>> c = [8, 9, 1, 2]
>>> bf.Stack()([a, b, c])
[[1. 2. 3. 4.]
 [4. 5. 6. 8.]
 [8. 9. 1. 2.]]
<NDArray 3x4 @cpu(0)>
>>> # Stack multiple numpy.ndarrays
>>> import numpy as np
>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = np.array([[5, 6, 7, 8], [1, 2, 3, 4]])
>>> bf.Stack()([a, b])
[[[1. 2. 3. 4.]
  [5. 6. 7. 8.]]
 [[5. 6. 7. 8.]
  [1. 2. 3. 4.]]]
<NDArray 2x2x4 @cpu(0)>
>>> # Stack multiple NDArrays
>>> import mxnet as mx
>>> a = mx.nd.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = mx.nd.array([[5, 6, 7, 8], [1, 2, 3, 4]])
>>> bf.Stack()([a, b])
[[[1. 2. 3. 4.]
  [5. 6. 7. 8.]]
 [[5. 6. 7. 8.]
  [1. 2. 3. 4.]]]
<NDArray 2x2x4 @cpu(0)>

class gluonnlp.data.batchify.Pad(axis=0, pad_val=0, ret_length=False, dtype=None)

Pad the input ndarrays along the specific padding axis and stack them to get the output.

The input to the function is a list of N samples. Each sample should be a single element of one of the following types: 1) numpy.ndarray, 2) mxnet.nd.NDArray, or 3) a list of numbers.

The arrays are padded to the largest dimension at axis and then stacked to form the final output. If ret_length is True, the function also returns the original size of each sample along axis.

Parameters:
  • axis (int, default 0) – The axis along which to pad the arrays. The arrays will be padded to the largest dimension at axis. For example, assume the input arrays have shapes (10, 8, 5), (6, 8, 5), (3, 8, 5) and the axis is 0. Each input will be padded to shape (10, 8, 5), and the inputs are then stacked to form the final output, which has shape (3, 10, 8, 5).
  • pad_val (float or int, default 0) – The padding value.
  • ret_length (bool, default False) – Whether to return the valid length in the output.
  • dtype (str or numpy.dtype, default None) – The value type of the output. If it is set to None, the input data type is used.
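
The shape arithmetic described for axis, and the extra output controlled by ret_length, can be checked with a NumPy-only sketch. `pad_sketch` is a hypothetical function written for this illustration: it mirrors the documented parameters but returns NumPy arrays and makes no claim about how the library implements padding.

```python
import numpy as np


def pad_sketch(samples, axis=0, pad_val=0, ret_length=False, dtype=None):
    """Pad each sample along `axis` to the largest size, then stack (illustrative only)."""
    arrays = [np.asarray(s) for s in samples]
    lengths = [a.shape[axis] for a in arrays]
    max_len = max(lengths)
    padded = []
    for a in arrays:
        # Pad only at the end of the chosen axis; leave every other axis untouched.
        pad_width = [(0, 0)] * a.ndim
        pad_width[axis] = (0, max_len - a.shape[axis])
        padded.append(np.pad(a, pad_width, mode="constant", constant_values=pad_val))
    batch = np.stack(padded)
    if dtype is not None:
        batch = batch.astype(dtype)
    if ret_length:
        # The original sizes along `axis`, one entry per sample.
        return batch, np.array(lengths)
    return batch


# The shapes from the description of `axis`: each sample is padded to (10, 8, 5).
shapes = [(10, 8, 5), (6, 8, 5), (3, 8, 5)]
batch, lengths = pad_sketch([np.ones(s) for s in shapes], axis=0, ret_length=True)
```

Here `batch` has one leading batch axis in front of the padded sample shape, and `lengths` records how much of each row along the padded axis holds real data.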

Examples

>>> from gluonnlp.data import batchify as bf
>>> # Inputs are multiple lists
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6]
>>> c = [8, 2]
>>> bf.Pad()([a, b, c])
[[ 1  2  3  4]
 [ 4  5  6  0]
 [ 8  2  0  0]]
<NDArray 3x4 @cpu(0)>
>>> # Also output the lengths
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6]
>>> c = [8, 2]
>>> bf.Pad(ret_length=True)([a, b, c])
(
 [[1 2 3 4]
  [4 5 6 0]
  [8 2 0 0]]
 <NDArray 3x4 @cpu(0)>,
 [4 3 2]
 <NDArray 3 @cpu(0)>)
>>> # Inputs are multiple ndarrays
>>> import numpy as np
>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = np.array([[5, 8], [1, 2]])
>>> bf.Pad(axis=1, pad_val=-1)([a, b])
[[[ 1  2  3  4]
  [ 5  6  7  8]]
 [[ 5  8 -1 -1]
  [ 1  2 -1 -1]]]
<NDArray 2x2x4 @cpu(0)>
>>> # Inputs are multiple NDArrays
>>> import mxnet as mx
>>> a = mx.nd.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = mx.nd.array([[5, 8], [1, 2]])
>>> bf.Pad(axis=1, pad_val=-1)([a, b])
[[[ 1.  2.  3.  4.]
  [ 5.  6.  7.  8.]]
 [[ 5.  8. -1. -1.]
  [ 1.  2. -1. -1.]]]
<NDArray 2x2x4 @cpu(0)>

class gluonnlp.data.batchify.Tuple(fn, *args)

Wrap multiple batchify functions together. The input functions will be applied to the corresponding input fields.

Each data sample should be a list or tuple containing multiple attributes. The i-th batchify function stored in Tuple will be applied to the i-th attribute. For example, if each data sample is (nd_data, label), you can wrap two batchify functions as Tuple(DataBatchify, LabelBatchify) to batchify nd_data and label respectively.

Parameters:
  • fn (list or tuple or callable) – The batchify functions to wrap.
  • *args (tuple of callable) – The additional batchify functions to wrap.
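
The field-wise dispatch can be sketched in plain Python. `tuple_sketch` and `pad_lists` are hypothetical names used only for this illustration, and the argument checks shown are assumptions about reasonable behavior rather than a description of the library's code.

```python
import numpy as np


def tuple_sketch(fn, *args):
    """Return a callable applying the i-th function to the i-th field (illustrative only)."""
    if isinstance(fn, (list, tuple)):
        if args:
            raise ValueError("pass either one list/tuple of functions or separate functions")
        fns = tuple(fn)
    else:
        fns = (fn,) + args

    def batchify(samples):
        for sample in samples:
            if len(sample) != len(fns):
                raise ValueError("each sample needs exactly %d fields" % len(fns))
        # zip(*samples) regroups [(x0, y0), (x1, y1)] into [(x0, x1), (y0, y1)].
        fields = zip(*samples)
        return tuple(f(list(field)) for f, field in zip(fns, fields))

    return batchify


def pad_lists(seqs):
    """Hypothetical stand-in for a padding batchify function: right-pad lists with 0."""
    max_len = max(len(s) for s in seqs)
    return np.array([list(s) + [0] * (max_len - len(s)) for s in seqs])


a = ([1, 2, 3, 4], 0)
b = ([5, 7], 1)
# The first field (sequence) is padded, the second field (label) is stacked as-is.
data, labels = tuple_sketch(pad_lists, np.array)([a, b])
```

Passing the functions as one list, `tuple_sketch([pad_lists, np.array])`, builds the same callable, matching the two calling conventions described for fn and *args.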

Examples

>>> from gluonnlp.data import batchify as bf
>>> a = ([1, 2, 3, 4], 0)
>>> b = ([5, 7], 1)
>>> c = ([1, 2, 3, 4, 5, 6, 7], 0)
>>> bf.Tuple(bf.Pad(), bf.Stack())([a, b])
(
 [[1 2 3 4]
  [5 7 0 0]]
 <NDArray 2x4 @cpu(0)>,
 [0. 1.]
 <NDArray 2 @cpu(0)>)
>>> # Input can also be a list
>>> bf.Tuple([bf.Pad(), bf.Stack()])([a, b])
(
 [[1 2 3 4]
  [5 7 0 0]]
 <NDArray 2x4 @cpu(0)>,
 [0. 1.]
 <NDArray 2 @cpu(0)>)
>>> # Another example
>>> a = ([1, 2, 3, 4], [5, 6], 1)
>>> b = ([1, 2], [3, 4, 5, 6], 0)
>>> c = ([1], [2, 3, 4, 5, 6], 0)
>>> bf.Tuple(bf.Pad(), bf.Pad(), bf.Stack())([a, b, c])
(
 [[1 2 3 4]
  [1 2 0 0]
  [1 0 0 0]]
 <NDArray 3x4 @cpu(0)>,
 [[5 6 0 0 0]
  [3 4 5 6 0]
  [2 3 4 5 6]]
 <NDArray 3x5 @cpu(0)>,
 [1. 0. 0.]
 <NDArray 3 @cpu(0)>)