AWS Storage Blog

lakeFS and Amazon S3 Express One Zone: Highly performant data version control for ML/AI

Machine learning presents a number of new challenges to data teams, calling for technology solutions that can support training and fine-tuning of performance-critical workloads. Data version control is one facet of high-performing ML pipelines, as it allows efficient experimentation and full pipeline reproducibility at scale.

lakeFS by Treeverse, an AWS technology partner, is a scalable data version control system for data lakes, available as an OSS project as well as a cloud offering on AWS.

In this blog, we dive into data versioning on Amazon S3 with lakeFS, showing how lakeFS uses Amazon S3 Express One Zone to deliver up to 10x faster versioning operations: metadata operations performed by lakeFS complete in under 30 ms, and merge, diff, and commit operations run up to 5x faster. With the combination of lakeFS and Amazon S3 Express One Zone, teams can establish versioned data repositories that contain both structured and unstructured data while delivering the high performance ML applications require.

What is lakeFS?

lakeFS is a scalable data version control system for data lakes. By using Git-like semantics such as branch, commit, and merge to manage the data when building and running ML pipelines, it allows you to manage your data the way you manage your code. Since its inception, lakeFS has worked with Amazon S3 in two ways. First, S3 serves as a storage backend for hosting and serving very large data repositories, typically spanning billions of objects and many petabytes of data. Second, lakeFS supports Amazon S3 as an API, allowing users to configure their lakeFS installation URL as an S3 endpoint and continue using all the tools and libraries they are used to: boto3, the AWS SDKs, the AWS CLI, and many more.
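
For example, because lakeFS exposes an S3-compatible endpoint, an existing boto3 client only needs a different endpoint URL and credentials to work against a lakeFS repository. In the sketch below, the endpoint, credentials, repository, and branch are placeholders; with the lakeFS S3 gateway, the bucket name maps to the repository and the object key is prefixed by the branch:

import boto3

# Point an ordinary S3 client at the lakeFS S3 gateway instead of Amazon S3.
# The endpoint URL and credentials below are placeholders for your lakeFS installation.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # your lakeFS server URL (placeholder)
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",   # lakeFS access key ID (placeholder)
    aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",  # placeholder
)

# With the lakeFS gateway, the "bucket" is the repository and the key is prefixed
# by the branch name, e.g. example-repo / main/datasets/hello.txt.
s3.put_object(Bucket="example-repo", Key="main/datasets/hello.txt", Body=b"hello lakeFS")
obj = s3.get_object(Bucket="example-repo", Key="main/datasets/hello.txt")
print(obj["Body"].read())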

Machine learning poses new challenges to data teams, and one of the key solutions is data version control. Thanks to the integration of lakeFS and Amazon S3, teams can create versioned data repositories hosting both structured and unstructured data, with the performance required for ML workloads.

How lakeFS enables data version control on Amazon S3

By using Amazon S3 and lakeFS together, users can create versioned, Git-like repositories hosting both structured and unstructured data of any type. These repositories allow teams to apply the following software engineering best practices to data and ML engineering:

  1. Isolated branches: Allow users to modify and transform data without interfering with other people’s work. This enables users to test changes to schema, metadata, and data in isolation, without having to create multiple, complex copy environments.
  2. Commit, rollback, explore history: Amazon S3 users can encapsulate changes to the data with informative commit messages. If anything goes wrong, they can roll back to a previous commit, dramatically reducing the cost of human (or machine) errors.
  3. CI/CD for automated data and metadata quality checks: By leveraging known patterns from software development lifecycle management, data practitioners can conditionally apply lakeFS hooks, enforce schema compatibility rules and validate data quality.
  4. Reproducible machine learning and AI: By keeping all components in a versioned, immutable store, AI applications and models can be reproduced, making it easier to improve them iteratively and to debug and troubleshoot problems when they occur.

By hosting a lakeFS repository on Amazon S3, these capabilities are available at any scale, helping users to fully benefit from Amazon S3’s industry-leading horizontal scalability. lakeFS-managed repositories organize data inside Amazon S3 buckets in an optimal way, allowing users to take full advantage of the bandwidth, throughput, and consistency guarantees offered by Amazon S3.

Figure 1: Bringing together lakeFS and Amazon S3
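
To make these practices concrete, here is a brief sketch of the branch-commit-merge flow using the lakeFS Python SDK (the same client used in the example later in this post). The repository, branch, and object names are placeholders, and the create/upload/commit/merge helpers are assumed to follow the SDK’s documented high-level API:

import lakefs

repo = lakefs.Repository("example-repo")  # placeholder repository name
main = repo.branch("main")

# Isolated branch: experiment without touching anyone else's data.
experiment = repo.branch("schema-experiment").create(source_reference="main")

# Write a transformed object to the experiment branch only.
experiment.object("tables/events/part-0001.parquet").upload(
    data=b"...transformed parquet bytes...", mode="wb"
)

# Commit with an informative message so the change can be audited or rolled back.
experiment.commit(message="Recompute events table with new schema",
                  metadata={"job": "nightly-etl"})

# Merge back into main once validations pass.
experiment.merge_into(main)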

Introduction to Amazon S3 Express One Zone

Recent advancements in ML and AI technologies have brought soaring demand for both Amazon S3 and lakeFS to support training and fine-tuning of performance-critical workloads. These workloads require not only high throughput and bandwidth but also very low latency. A common example is deep learning: training a deep neural network requires fast access to the training datasets in order to fully utilize the compute environment – typically, higher-cost GPU clusters.

The Amazon S3 Express One Zone storage class is purpose-built for this task. S3 Express One Zone can improve data access speeds by 10x and reduce request costs by 50% compared to S3 Standard, and it scales to process millions of requests per minute for your most frequently accessed datasets. With S3 Express One Zone, you can specify the Availability Zone your data is stored in, giving you the option to co-locate your storage and compute resources for even lower latency.
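
As a point of reference, a directory bucket’s name embeds the Availability Zone ID it lives in, and recent AWS SDKs handle the S3 Express session authentication transparently. A minimal boto3 read might look like the sketch below; the bucket name, AZ ID, Region, and key are placeholders:

import boto3

# Directory bucket names follow the pattern <base-name>--<az-id>--x-s3,
# so the Availability Zone is explicit in the name itself (placeholders below).
bucket = "training-data--usw2-az1--x-s3"

# A sufficiently recent boto3/botocore handles the s3express:CreateSession auth automatically.
s3 = boto3.client("s3", region_name="us-west-2")
obj = s3.get_object(Bucket=bucket, Key="images/cats/burmese.jpg")
image_bytes = obj["Body"].read()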

How lakeFS uses Amazon S3 Express One Zone

For lakeFS users, Amazon S3 Express One Zone support provides two benefits:

  1. Data versioned in a lakeFS repository built on S3 Express One Zone benefits from very low latency and up to 10x better performance with no overhead. This is possible thanks to lakeFS’ support for pre-signed URLs, which we’ll break down later, allowing users to access the storage layer directly with no additional network hops between the compute clusters and the object store.

Figure 2: Comparison of access time, Amazon S3 Standard vs. S3 Express One Zone

  2. Teams that use lakeFS and S3 Express One Zone together enjoy up to 5x faster merge, diff, and commit operations. Since lakeFS stores its own metadata in the underlying object store, any lakeFS repository running on top of S3 Express One Zone automatically enjoys these performance enhancements. Data version control at scale just got even faster!

Figure 3: Comparison of time to merge, Amazon S3 Standard vs. S3 Express One Zone

lakeFS + S3 Express One Zone: Under the hood

Let’s explore how this integration works. We’ll look at the two common access patterns between lakeFS and its underlying data in an S3 directory bucket (the new bucket type introduced with S3 Express One Zone) to understand how these performance gains are achieved: the metadata path and the data path.

Metadata path

The most important primitive in lakeFS is a commit, which is a snapshot describing a set of immutable objects. This is essentially a mapping between a logical path, like images/cats/burmese.jpg, and metadata about the actual object in the object store (its location but also other metadata such as size in bytes, content type, and other optional user-supplied metadata).

To store this information, lakeFS writes a set of RocksDB-compatible SSTables that make up a tree structure. Since each commit in lakeFS is immutable, the tree and its nodes (named “ranges” in lakeFS) are also immutable. Each node is addressed by the hash of its content, very similar to a Merkle tree.

A single entry in the tree represents a single object in the snapshot (that is, “ValueRecord”), with a single node holding a few thousand such records.

Figure 4: lakeFS writes a set of RocksDB-compatible SSTables that make up a tree structure

The root of the tree (named “metarange”) has the exact same structure as other ranges, but its ValueRecords point to a series of range files, with metadata describing their minimum and maximum values.

Figure 5: lakeFS metarange

This makes lookups very efficient for three reasons:

  1. SSTables are highly optimized for random access.
  2. Ranges and metaranges are immutable, thus highly cacheable.
  3. Most analytical and AI training applications adhere to the spatial locality principle. This means that they typically won’t read a completely random set of objects from the object store, but rather a set of lexicographically adjacent objects (most commonly, a set of objects under a common prefix or directory).

Here’s how lakeFS would translate the location of images/cats/burmese.jpg for a commit ID 1abc23def:

Figure 6: lakeFS location translation mechanism
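
To make the translation in Figure 6 concrete, here is a deliberately simplified Python sketch of the lookup: pick the range whose key interval covers the logical path, then find the matching ValueRecord inside it. The data structures here are illustrative only and are not lakeFS’s actual on-disk format (which, as noted above, is RocksDB-compatible SSTables):

from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ValueRecord:           # illustrative: one object in the commit snapshot
    path: str                # logical path, e.g. "images/cats/burmese.jpg"
    physical_address: str    # where the bytes actually live in the object store
    size_bytes: int

@dataclass
class Range:                 # illustrative: one tree node ("range"), content-addressed
    min_key: str
    max_key: str
    records: List[ValueRecord]   # sorted by path; a few thousand per range

def lookup(metarange: List[Range], path: str) -> Optional[ValueRecord]:
    """Translate a logical path into its ValueRecord for a given commit."""
    # 1. Find the range whose [min_key, max_key] interval covers the path
    #    (in lakeFS, a read from the metarange).
    for rng in metarange:
        if rng.min_key <= path <= rng.max_key:
            # 2. Binary-search the sorted records inside that range
            #    (in lakeFS, a read from the range SSTable).
            keys = [r.path for r in rng.records]
            i = bisect_right(keys, path) - 1
            if i >= 0 and keys[i] == path:
                return rng.records[i]
    return None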

Since S3 lookups were typically the ‘slowest’ part of this request flow, the overall impact of running on top of S3 Express One Zone is very significant (see Amdahl’s Law).

For the most part, both the metarange and the range file already exist in the lakeFS server’s local cache, so the server avoids the ‘slow’ round trips to S3 (see the time-to-first-byte benchmark above). With S3 Express One Zone, access to S3 is much faster, so any metadata operation performed by lakeFS, regardless of the state of its local cache, typically completes in under 30 ms: a 50% improvement over Amazon S3 Standard!

This is even more noticeable for more complex requests that, by nature, visit a lot of different ranges and metaranges. For example, diffing between multiple commits, listing large parts of a committed reference, and merging branches together all require fetching and potentially writing back several ranges. For those we can expect up to 10x improvement in overall performance!

Running deep learning jobs at scale requires listing and accessing thousands of smaller files. With S3 Express One Zone, these operations are now up to 10x faster, allowing expensive resources such as GPUs to spend far less time waiting for data, maximizing their utilization.

Figure 7: Deep learning pipeline, metadata operations (in green) are up to 10x faster when using S3 Express One Zone, resulting in better GPU utilization
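
As an illustration, a training job that walks a prefix of a committed branch and streams each object straight from S3 Express One Zone might look roughly like the sketch below. The repository, branch, and prefix are placeholders, the objects() listing helper is assumed from the SDK’s high-level API, and pre_sign=True is explained in the data path section that follows:

import lakefs

branch = lakefs.Repository("example-repo").branch("main")  # placeholder names

# List committed objects under a prefix, then stream each one via a pre-signed URL
# so the bytes come directly from S3 Express One Zone rather than the lakeFS server.
for info in branch.objects(prefix="images/cats/"):
    with branch.object(info.path).reader(pre_sign=True) as r:
        data = r.read()
        # ...hand `data` to the data loader feeding the GPUs...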

Data path

While the metadata path is the core of lakeFS, this metadata isn’t very useful if we don’t have an efficient way of accessing the underlying data. One of the design goals for lakeFS is to stay out of the data path as much as possible. Amazon S3 is great at scaling bandwidth and throughput horizontally.

To allow lakeFS users to directly access the underlying data, we had to design a mechanism that on the one hand allows lakeFS to authorize requests – deciding which users can access which paths and repositories – but at the same time, avoid having the data itself go through the lakeFS server. This is especially important with S3 Express One Zone: given the low latency provided, any network hop could potentially bottleneck these great performance gains!

To achieve this, lakeFS uses one of our favorite features provided by S3: pre-signed URLs. According to the S3 documentation, a pre-signed URL uses security credentials to allow time-limited access to objects for download. To download the object, the URL can be entered in a browser or used by a program. The pre-signed URL uses the credentials of the AWS user who generated it.
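
lakeFS generates these URLs server-side after authorizing the request, but the underlying mechanism is standard S3. As a minimal illustration with boto3 (bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")

# Generate a time-limited URL for a single object; whoever holds the URL can
# GET the object directly from S3 for the next hour, with no further credentials.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "training-data--usw2-az1--x-s3", "Key": "images/cats/burmese.jpg"},
    ExpiresIn=3600,
)
print(url)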

Here’s how a lakeFS user would read data from S3 directly:

Figure 8: How lakeFS clients read data from S3 Express One Zone

As you can see, the data path itself doesn’t involve the lakeFS server at all. Data is directly requested by the user from S3, with a request that has been pre-authorized by lakeFS.

Getting started with lakeFS on S3 Express One Zone

To get started with a lakeFS-managed repository built on Amazon S3 Express One Zone, follow the Amazon S3 User Guide to create an S3 directory bucket.
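
If you prefer to script the bucket creation, the sketch below uses boto3; the bucket base name, Region, and Availability Zone ID are placeholders, a recent boto3 release with directory bucket support is assumed, and the S3 User Guide remains the authoritative reference for the exact parameters:

import boto3

region = "us-west-2"   # placeholder Region
az_id = "usw2-az1"     # placeholder Availability Zone ID
bucket = f"my-lakefs-data--{az_id}--x-s3"   # directory bucket naming convention

s3 = boto3.client("s3", region_name=region)
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={
        "Location": {"Type": "AvailabilityZone", "Name": az_id},
        "Bucket": {"DataRedundancy": "SingleAvailabilityZone", "Type": "Directory"},
    },
)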

Once created, follow the instructions in the lakeFS documentation to prepare your storage for lakeFS to access and manage, by creating the following IAM policy and attaching it to the IAM user or role provided to lakeFS:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "lakeFSPolicy",
            "Effect": "Allow",
            "Action": [
                "s3express:DeleteBucket",
                "s3express:DeleteBucketPolicy",
                "s3express:CreateBucket",
                "s3express:PutBucketPolicy",
                "s3express:GetBucketPolicy",
                "s3express:ListAllMyDirectoryBuckets"
            ],
            "Resource": "arn:aws:s3express:region:account_id:bucket-base-name--azid--x-s3/*"
        },
        {
            "Sid": "AllowCreateSession",
            "Effect": "Allow",
            "Action": "s3express:CreateSession",
            "Resource": "*"
        }
    ]
}

Once configured, we can go ahead and create an S3 Express One Zone-backed repository on lakeFS:

Figure 9: Create a new repository
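
If you’d rather create the repository from code than from the UI, a sketch using the lakeFS Python SDK might look like this; the repository name and storage namespace are placeholders, and the create() helper is assumed to follow the SDK’s documented Repository API:

import lakefs

# The storage namespace points at the S3 Express One Zone directory bucket
# (names below are placeholders).
repo = lakefs.Repository("example-repo").create(
    storage_namespace="s3://my-lakefs-data--usw2-az1--x-s3/example-repo",
    default_branch="main",
    exist_ok=True,
)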

Once our repository is created, no further changes are required! We can start reading and writing to the repository as normal.

We highly recommend that lakeFS users use the pre-signed URL option when reading and writing, so that their code accesses S3 Express One Zone directly.

Example: Using the lakeFS Python SDK to read data from S3 Express One Zone

Let’s see how we can read versioned data from lakeFS running on top of S3 Express One Zone. Using the code below, we will be able to do the following:

  1. Install and configure a lakeFS Python SDK client.
  2. Read an image directly from an S3 Express One Zone-backed repository by leveraging S3 Express One Zone pre-signed URLs.
  3. Stream data from S3 Express One Zone directly into a processing library (Pillow).
  4. Process the image.
  5. Write it back, streaming directly to S3 Express One Zone.

First, we’ll need to install the lakeFS Python client:

$ pip install lakefs

In our Python script or notebook, let’s import our dependencies:

import lakefs 
from PIL import Image  # Optional: requires Pillow (pip install Pillow)

Now, let’s read an image file from our S3 Express One Zone-backed repository, making sure to pass pre_sign=True to our reader request:

branch = lakefs.Repository('example-repo').branch('main')

with branch.object('data/image.png').reader(pre_sign=True) as r:
    img = Image.open(r)  # Stream directly from S3 Express One Zone!

# convert to grayscale and resize to 300x300
img = img.convert('L').resize((300, 300))

with branch.object('training/image.png').writer(pre_sign=True) as f:
    img.save(f, format='png')  # Stream directly to S3 Express One Zone!

In this example, example-repo is a repository created by passing an S3 Express One Zone directory bucket as its storage namespace.

Since the lakeFS Python client returns a file-like object, it is consumable by most Python libraries – in this example, passed into Pillow for image processing or manipulation.

Conclusion

In this blog, we explored data version control on Amazon S3 with lakeFS, demonstrating how lakeFS leverages Amazon S3 Express One Zone to offer up to 10x faster versioning operations, with metadata operations completing in under 30 ms, and up to 5x faster merge, diff, and commit operations.

By combining lakeFS with Amazon S3 Express One Zone, data practitioners can create versioned data repositories that incorporate both structured and unstructured data while offering the high performance ML applications demand. lakeFS Cloud on AWS customers can start using S3 Express One Zone today, at no additional lakeFS cost. For S3 Express One Zone pricing information, visit the S3 pricing page.

We can’t wait to see how lakeFS and AWS users will leverage these new and exciting capabilities, paving the way for more robust and resilient AI applications at any scale.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.