grain.experimental.FirstFitPackIterDataset


class grain.experimental.FirstFitPackIterDataset(parent, *, length_struct, num_packing_bins, seed=0, shuffle_bins=True, shuffle_bins_group_by_feature=None, meta_features=())

Implements first-fit packing of sequences.

Unlike concat-and-split, packing never splits a sequence; it pads the remaining space instead. A larger number of packing bins reduces the amount of padding. However, if the number of bins is large, this can cause epoch leakage (data points from multiple epochs getting packed together).
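To make the contrast concrete, here is a minimal pure-Python sketch of the two strategies. The function names `concat_and_split` and `pack_with_padding` are hypothetical and not part of Grain. The packing helper uses a single bin and a padding value of 0, which is an assumption made for brevity.

```python
def concat_and_split(seqs, length):
    """Concatenates all tokens, then cuts rows of `length` (may split sequences)."""
    flat = [tok for seq in seqs for tok in seq]
    return [flat[i:i + length] for i in range(0, len(flat), length)]


def pack_with_padding(seqs, length, pad=0):
    """Keeps each sequence whole and pads the rest of the row (single-bin sketch).

    Assumes every sequence has at most `length` tokens.
    """
    rows, current = [], []
    for seq in seqs:
        if len(current) + len(seq) > length:
            # The sequence does not fit: pad and close the current row.
            rows.append(current + [pad] * (length - len(current)))
            current = []
        current = current + list(seq)
    if current:
        rows.append(current + [pad] * (length - len(current)))
    return rows


seqs = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]
split_rows = concat_and_split(seqs, length=5)
packed_rows = pack_with_padding(seqs, length=5)
# split_rows  -> [[1, 2, 3, 4, 5], [6, 7, 8, 9]]   ([4, 5, 6, 7] is split across rows)
# packed_rows -> [[1, 2, 3, 0, 0], [4, 5, 6, 7, 0], [8, 9, 0, 0, 0]]
```

In this sketch, packing produces more rows and some padding, but no sequence crosses a row boundary.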

This uses a simple first-fit packing algorithm that:
  1. Creates N bins.
  2. Adds elements (in the order coming from the parent) to the first bin that has enough space.
  3. Once an element doesn’t fit, emits all N bins as elements.
  4. (optional) Shuffles bins.
  5. Loops back to 1 and starts with the element that didn’t fit.
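The steps above can be sketched in pure Python as follows. This is an illustration under simplifying assumptions, not Grain's implementation: sequences are represented only by their lengths, there is a single feature with one capacity, the optional shuffle (step 4) is omitted, and the hypothetical `first_fit_pack` helper drops empty bins when flushing the final, partially filled group.

```python
from typing import Iterable, Iterator


def first_fit_pack(
    lengths: Iterable[int], capacity: int, num_bins: int
) -> Iterator[list[list[int]]]:
    """Yields groups of bins; each bin is a list of sequence lengths."""
    bins = [[] for _ in range(num_bins)]  # Step 1: create N bins.
    free = [capacity] * num_bins
    for length in lengths:
        if length > capacity:
            raise ValueError(f"sequence of length {length} exceeds {capacity}")
        # Step 2: find the first bin with enough space.
        target = next(
            (i for i, space in enumerate(free) if space >= length), None
        )
        if target is None:
            # Step 3: nothing fits, so emit all bins.
            yield bins
            # Step 5: start over with the element that did not fit.
            bins = [[] for _ in range(num_bins)]
            free = [capacity] * num_bins
            target = 0
        bins[target].append(length)
        free[target] -= length
    if any(bins):
        # Assumption: flush the last group, skipping bins that stayed empty.
        yield [b for b in bins if b]


groups = list(first_fit_pack([5, 4, 3, 6, 2, 7], capacity=8, num_bins=2))
# groups -> [[[5, 3], [4]], [[6, 2], [7]]]
```

In the first group, bin 0 is full (5 + 3 = 8) while bin 1 holds a single sequence of length 4 and would need 4 positions of padding.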

This iterator is easy to make deterministic, but it has the downside that some bins (usually the bottom bins) have a lot of padding. To avoid this pattern, the shuffle_bins option shuffles the bins before emitting them.

Parameters:
  • parent (IterDataset)

  • length_struct (Any)

  • num_packing_bins (int)

  • seed (int)

  • shuffle_bins (bool)

  • shuffle_bins_group_by_feature (str | None)

  • meta_features (Sequence[str])

__init__(parent, *, length_struct, num_packing_bins, seed=0, shuffle_bins=True, shuffle_bins_group_by_feature=None, meta_features=())

Creates a dataset that packs sequences from the parent dataset.

Parameters:
  • parent (IterDataset) – Parent dataset with variable-length sequences. Sequences cannot be longer than their length_struct value.

  • length_struct (Any) – Target sequence length for each feature.

  • num_packing_bins (int) – Number of bins to pack sequences into.

  • seed (int) – Random seed for shuffling bins, if shuffling is enabled.

  • shuffle_bins (bool) – Whether to shuffle bins after packing.

  • shuffle_bins_group_by_feature (str | None) – Ignored if shuffle_bins is False. If shuffle_bins is True and this is set to a feature name, bins are grouped by that feature and shuffled only within each group. If None, all bins in the batch are shuffled together without regard to any feature. The primary use case is to shuffle only within each epoch to avoid epoch leakage.

  • meta_features (Sequence[str]) – Meta features that do not need *_segment_ids and *_positions features.
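The interaction between seed, shuffle_bins, and shuffle_bins_group_by_feature can be illustrated with a small pure-Python sketch. The `shuffle_bins` function below is hypothetical and does not reflect Grain's internals. It assumes each bin is a dict carrying a single value for the grouping feature (here an `epoch` key, chosen for illustration) and that bins with the same value are adjacent.

```python
import random
from itertools import groupby


def shuffle_bins(bins, seed, group_key=None):
    """Shuffles bins, optionally only within runs that share `group_key`."""
    rng = random.Random(seed)  # A fixed seed makes the order reproducible.
    if group_key is None:
        shuffled = list(bins)
        rng.shuffle(shuffled)  # Shuffle the whole batch of bins together.
        return shuffled
    out = []
    for _, group in groupby(bins, key=lambda b: b[group_key]):
        group = list(group)
        rng.shuffle(group)  # Shuffle only inside this group.
        out.extend(group)
    return out


bins = [{"epoch": 0, "id": i} for i in range(4)]
bins += [{"epoch": 1, "id": i} for i in range(4, 8)]

grouped = shuffle_bins(bins, seed=0, group_key="epoch")
# Bins from epoch 0 still come before bins from epoch 1;
# only the order inside each epoch may change.
ungrouped = shuffle_bins(bins, seed=0)
# Without a group key, bins from both epochs can interleave.
```

Grouping keeps the epoch order intact, so shuffling does not move a bin from one epoch in between bins of another.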

Methods

__init__(parent, *, length_struct, ...[, ...])

Creates a dataset that packs sequences from the parent dataset.

apply(transformations)

Returns a dataset with the given transformation(s) applied.

batch(batch_size, *[, drop_remainder, batch_fn])

Returns a dataset of elements batched along a new first dimension.

filter(transform)

Returns a dataset containing only the elements that match the filter.

map(transform)

Returns a dataset containing the elements transformed by transform.

map_with_index(transform)

Returns a dataset of the elements transformed by the transform.

mp_prefetch([options, worker_init_fn])

Returns a dataset prefetching elements in multiple processes.

pipe(func, /, *args, **kwargs)

Syntactic sugar for applying a callable to this dataset.

prefetch(multiprocessing_options)

Deprecated, use mp_prefetch instead.

random_map(transform, *[, seed])

Returns a dataset containing the elements transformed by transform.

seed(seed)

Returns a dataset that uses the seed for default seed generation.

Attributes

parents