Reading from GCS

This document demonstrates how to access and load data from Google Cloud Storage (GCS) using Grain. To do this, we'll use Cloud Storage FUSE, an adapter that mounts GCS buckets as local file systems, so you can access data in Cloud Storage just like local files.

Mount a Cloud Storage location into the local filesystem

# Authenticate.
from google.colab import auth
auth.authenticate_user()
# Install Cloud Storage FUSE.
!echo "deb https://packages.cloud.google.com/apt gcsfuse-`lsb_release -c -s` main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
!apt -qq update && apt -qq install gcsfuse

The gcsfuse CLI offers various configurable options, detailed at https://cloud.google.com/storage/docs/gcsfuse-cli. Some of them, such as the caching features described at https://cloud.google.com/storage/docs/cloud-storage-fuse/caching, can improve read performance and lower costs. For instance, MaxText sets gcsfuse flags (see the MaxText gcsfuse settings) to reduce data loading time during training. We advise users to consider adopting similar settings or customizing their own gcsfuse options.

# Mount a Cloud Storage bucket or location, without the gs:// prefix.
mount_path = "my-bucket"  # or a location like "my-bucket/path/to/mount"
local_path = f"./mnt/gs/{mount_path}"

!mkdir -p {local_path}
# The flags below are set to improve GCS data loading performance. You are encouraged to explore other settings; the Grain team would greatly appreciate any feedback or insights.
!gcsfuse --implicit-dirs --type-cache-max-size-mb=-1 --stat-cache-max-size-mb=-1 --kernel-list-cache-ttl-secs=-1 --metadata-cache-ttl-secs=-1 {mount_path} {local_path}
# Then you can access it like a local path.
!ls -lh {local_path}
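
Outside a notebook, the same mount can be scripted. The sketch below is one possible way to assemble the gcsfuse command from Python, reusing only the flags from the mount command above. The helper names `build_gcsfuse_command` and `mount_bucket` are hypothetical (they are not part of gcsfuse or Grain), and the sketch assumes `gcsfuse` is already installed and on the `PATH`.

```python
import os
import subprocess

# Flags copied verbatim from the mount command above.
CACHE_FLAGS = (
    "--implicit-dirs",
    "--type-cache-max-size-mb=-1",
    "--stat-cache-max-size-mb=-1",
    "--kernel-list-cache-ttl-secs=-1",
    "--metadata-cache-ttl-secs=-1",
)


def build_gcsfuse_command(bucket, local_dir, flags=CACHE_FLAGS):
    """Returns the argument list for mounting `bucket` at `local_dir`."""
    return ["gcsfuse", *flags, bucket, local_dir]


def mount_bucket(bucket, local_dir):
    """Creates the mount point and mounts the bucket unless already mounted."""
    os.makedirs(local_dir, exist_ok=True)
    if not os.path.ismount(local_dir):
        # check=True raises CalledProcessError if gcsfuse exits with an error.
        subprocess.run(build_gcsfuse_command(bucket, local_dir), check=True)
    return local_dir
```

Calling `mount_bucket("my-bucket", "./mnt/gs/my-bucket")` would then mirror the notebook cell above. Keeping the command construction in its own function makes it easy to inspect or log the exact flags before anything is mounted.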

Read files using Grain

If your data is in an ArrayRecord file, you can directly load it using grain.sources.ArrayRecordDataSource. For information on handling other file formats, please see the Grain data sources documentation at: https://google-grain.readthedocs.io/en/latest/data_sources.html

# Install Grain.
!pip install grain
import grain

# Replace "local_file_name" with the path of an ArrayRecord file inside the mounted bucket.
source = grain.sources.ArrayRecordDataSource(f"{local_path}/local_file_name")

# Create a dataset from the data source then process the data.
dataset = (
    grain.MapDataset.source(source)
    .shuffle(seed=10)  # Shuffles globally.
    .batch(batch_size=2)  # Batches consecutive elements.
)
# Print the batch at index 10 of the shuffled, batched dataset.
print(dataset[10])
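
Datasets are often split across several ArrayRecord files. As a sketch, the mounted directory can be searched with the standard library and the resulting list passed to `grain.sources.ArrayRecordDataSource`, which accepts a sequence of paths as well as a single path. The helper name `list_shards` and the `*.array_record` suffix are assumptions made for illustration; adjust the pattern to match how your files are named.

```python
from pathlib import Path


def list_shards(root, pattern="*.array_record"):
    """Returns matching file paths under `root`, sorted for a stable order."""
    return sorted(str(p) for p in Path(root).rglob(pattern) if p.is_file())


# Hypothetical usage with the mounted bucket (requires Grain to be installed):
#   source = grain.sources.ArrayRecordDataSource(list_shards(local_path))
```

Sorting the paths keeps the record order of the data source the same from run to run, so a fixed shuffle seed produces the same sequence each time.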