Reading from AWS S3#

This document outlines how to read data from an Amazon S3 bucket and construct a Grain pipeline. We will use Mountpoint for Amazon S3, an open-source file client provided by AWS. Mountpoint lets you mount your S3 bucket as a local file system, so you can access and read data as if it were stored locally.

Install Mountpoint for Amazon S3#

!wget https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.deb
!sudo apt-get install -y ./mount-s3.deb

Configure AWS credentials#

!pip install awscli
!aws configure
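
`aws configure` prompts for values interactively, which can be awkward in a hosted notebook. As a hedged alternative, the sketch below exports credentials through the standard `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` environment variables that the AWS CLI reads. It assumes Mountpoint picks up the same variables through the usual AWS credential chain. The helper name `export_aws_credentials` and the placeholder values are hypothetical.

```python
import os


def export_aws_credentials(access_key_id, secret_access_key, region=None):
    """Set AWS credentials as environment variables for this process.

    Hypothetical helper: child processes started from the notebook
    (for example `!aws ...` commands) inherit these variables.
    """
    os.environ["AWS_ACCESS_KEY_ID"] = access_key_id
    os.environ["AWS_SECRET_ACCESS_KEY"] = secret_access_key
    if region is not None:
        # AWS_DEFAULT_REGION is the region variable read by the AWS CLI.
        os.environ["AWS_DEFAULT_REGION"] = region


# Placeholder values for illustration only; never commit real keys.
export_aws_credentials("<your-access-key-id>", "<your-secret-access-key>", region="us-east-1")
```

Environment variables set this way last only for the current session, so nothing is written to disk.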

Mount your S3 bucket to your local filepath#

!mkdir -p /path/to/mount/files
!mount-s3 <your-s3-bucket> /path/to/mount/files
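
Before writing any data, it can help to confirm that the mount actually succeeded. The following is a minimal sketch using only the Python standard library. The helper name `is_mounted` is hypothetical, and the path is the same placeholder used in the mount command.

```python
import os


def is_mounted(mount_dir):
    """Return True if mount_dir exists and is a mount point."""
    return os.path.isdir(mount_dir) and os.path.ismount(mount_dir)


mount_dir = "/path/to/mount/files"  # same placeholder path as the mount command
if not is_mounted(mount_dir):
    print(f"{mount_dir} is not mounted; check the mount-s3 output for errors.")
```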

Install Grain and other dependencies#

!pip install grain
!pip install array_record

Write temp ArrayRecord files to the bucket#

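from array_record.python import array_record_module

# Five small byte records to use as sample data.
digits = [b"1", b"2", b"3", b"4", b"5"]

# Write the records to a new file under the mounted bucket path.
writer = array_record_module.ArrayRecordWriter("/path/to/mount/files/data.array_record")
for record in digits:
  writer.write(record)
writer.close()  # Finalizes the file; call this before reading it back.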
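
Once the writer is closed, the new file should appear under the mounted directory. As a quick sanity check, the sketch below lists the `.array_record` files found there using `pathlib`. The helper name `list_array_record_files` is hypothetical.

```python
from pathlib import Path


def list_array_record_files(mount_dir):
    """Return sorted paths of *.array_record files directly under mount_dir."""
    return sorted(str(p) for p in Path(mount_dir).glob("*.array_record"))


# Placeholder path; prints an empty list if nothing has been written yet.
print(list_array_record_files("/path/to/mount/files"))
```

If the bucket holds several files, the resulting list could be passed as `paths` to the data source, assuming it accepts a sequence of paths as well as a single string.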

Read ArrayRecord files using Grain#

import grain
from pprint import pprint

source = grain.sources.ArrayRecordDataSource(paths="/path/to/mount/files/data.array_record")

dataset = (
    grain.MapDataset.source(source)
    .shuffle(seed=10)  # Shuffles globally.
    .batch(batch_size=2)  # Batches consecutive elements.
)

pprint(list(dataset))
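
To make the effect of the two transformations concrete, here is a plain-Python analogy of "shuffle globally, then batch consecutive elements". It is only an illustration and not Grain's implementation: the function `shuffle_then_batch` is hypothetical, the shuffled order will differ from what Grain produces for the same seed, and it returns Python lists rather than stacked arrays. It assumes the final partial batch is kept rather than dropped.

```python
import random


def shuffle_then_batch(records, seed, batch_size):
    """Shuffle all records with a fixed seed, then group consecutive ones."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # global shuffle over every record
    # Consecutive slices; the last batch may be shorter than batch_size.
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


digits = [b"1", b"2", b"3", b"4", b"5"]
batches = shuffle_then_batch(digits, seed=10, batch_size=2)
print([len(b) for b in batches])  # [2, 2, 1]
```

With five records and a batch size of 2, the result is two full batches and one batch holding the single leftover record. Fixing the seed makes the order reproducible across runs.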