Reading from AWS S3
This document shows how to read data from an Amazon S3 bucket and build a Grain pipeline over it. It uses Mountpoint for Amazon S3, an open-source file client from AWS that mounts an S3 bucket as a local file system, so objects in the bucket can be read through ordinary file paths as if they were stored locally.
Install Mountpoint for Amazon S3
!wget https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.deb
!sudo apt-get install -y ./mount-s3.deb
Configure AWS credentials
!pip install awscli
!aws configure
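Before mounting, it can help to confirm that credentials are visible to the current process. The following is a minimal sketch, not part of any AWS library: `find_aws_credentials` is a hypothetical helper, and it only checks two common locations (the standard environment variables and the shared credentials file that `aws configure` writes). It does not cover IAM roles, SSO, or other credential providers, so a `None` result does not necessarily mean that mounting will fail.

```python
import os
from pathlib import Path


def find_aws_credentials(env=None, home=None):
    """Return a label for where static credentials were found, or None.

    Hypothetical helper for illustration; checks only two common locations.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Static credentials exported as environment variables.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"

    # Shared credentials file; AWS_SHARED_CREDENTIALS_FILE overrides the default path.
    default_file = home / ".aws" / "credentials"
    cred_file = Path(env.get("AWS_SHARED_CREDENTIALS_FILE", default_file))
    if cred_file.is_file():
        return "shared-credentials-file"

    return None


print(find_aws_credentials())
```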
Mount your S3 bucket at a local path
!mkdir -p /path/to/mount/files
!mount-s3 <your-s3-bucket> /path/to/mount/files
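If later cells fail with missing-file errors, a quick first check is whether the bucket is actually mounted. The sketch below uses only the Python standard library; `check_mount` is a hypothetical helper name, and it assumes that `os.path.ismount` recognizes the mount created by `mount-s3`, which may not hold in every environment.

```python
import os


def check_mount(path):
    """Return the sorted top-level entries of `path` if something is mounted there.

    Hypothetical helper for illustration; raises a descriptive error otherwise.
    """
    if not os.path.isdir(path):
        raise FileNotFoundError(f"{path} does not exist or is not a directory")
    if not os.path.ismount(path):
        raise RuntimeError(f"{path} exists, but nothing is mounted there")
    return sorted(os.listdir(path))
```

For example, `check_mount("/path/to/mount/files")` should return the top-level entries of the bucket once the mount has succeeded.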
Install Grain and other dependencies
!pip install grain
!pip install array_record
Write temporary ArrayRecord files to the bucket
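from array_record.python import array_record_module
# Five small byte-string records to use as sample data.
digits = [b"1", b"2", b"3", b"4", b"5"]
writer = array_record_module.ArrayRecordWriter("/path/to/mount/files/data.array_record")
for record in digits:
    writer.write(record)
# Close the writer so the records are flushed and the file is finalized.
writer.close()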
Read ArrayRecord files using Grain
import grain
from pprint import pprint
source = grain.sources.ArrayRecordDataSource(paths="/path/to/mount/files/data.array_record")
dataset = (
    grain.MapDataset.source(source)
    .shuffle(seed=10)  # Shuffles globally.
    .batch(batch_size=2)  # Batches consecutive elements.
)
pprint(list(dataset))
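The example above reads a single file, but a bucket often holds many shards. `ArrayRecordDataSource` also accepts a sequence of paths, so one option is to collect the files under the mount point first. This is a minimal sketch using only the standard library: `list_array_record_files` is a hypothetical helper, and the `.array_record` suffix is an assumption taken from the file name used in this document. A recursive listing of a large bucket through the mount can be slow.

```python
from pathlib import Path


def list_array_record_files(root, suffix=".array_record"):
    """Return a sorted list of file paths under `root` that end with `suffix`.

    Hypothetical helper for illustration; sorting keeps the order stable across runs.
    """
    return sorted(str(p) for p in Path(root).rglob(f"*{suffix}") if p.is_file())


# Usage with the mount point from this document (requires the mounted bucket):
# paths = list_array_record_files("/path/to/mount/files")
# source = grain.sources.ArrayRecordDataSource(paths=paths)
```

An empty list means no file matched, so it is worth checking the result before constructing the data source.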