Snowflake streamlines data management and improves processing times using Amazon S3 Lifecycle

APN Partner Snowflake enables organizations to transform, integrate, and analyze their data. Running on AWS has helped Snowflake keep up with a rapidly scaling customer base since 2012 (22% YoY total customer growth as of January 2024). With more customers comes more data, and managing that data efficiently to control operational overhead and cost is a top priority at Snowflake.

For data storage, Snowflake uses Amazon S3, an object storage service offering industry-leading scalability, data availability, security, and performance. Snowflake developed an internal data management service to delete unwanted data, but it required constant maintenance, incurred compute costs, and didn’t scale easily. Ultimately, Snowflake turned to Amazon S3 Lifecycle, which lets S3 customers set rules that automatically transition data to other storage classes or expire objects, to manage the process and control storage cost instead.

In this post, we discuss how Snowflake uses Amazon S3 Lifecycle with object tags to automate the expiration of temporary objects it no longer needs. With fewer unneeded objects, Snowflake was able to improve processing times by 80% and free up teams to work on projects that have a greater impact on their customers.

Data management at scale

Snowflake manages data in Amazon S3 on behalf of its customers and processes billions of objects per day. When Snowflake customers run a query that requires a large amount of memory, temporary data is often created in S3. Once the query is complete, the temporary objects are no longer required, so they are purged to optimize storage cost. At scale, these operations generate a large volume of temporary data that is stored in Amazon S3 alongside persistent table data.

To manage this process, Snowflake developed a service, called the temporary data manager, to detect and delete temporary objects in its S3 buckets. This service would list objects in Snowflake buckets, identify the temporary ones, and delete them using the Amazon S3 Multi-Object Delete API (DeleteObjects).
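
As a rough sketch of this pattern (the bucket name and keys here are illustrative, not Snowflake's actual naming), one pass of such a service could list candidate keys and then remove them in batches of up to 1,000 with a single DeleteObjects call:

# List keys under a prefix to identify candidate temporary objects
aws s3api list-objects-v2 --bucket example-snowflake-bucket --prefix dir-1/ --query "Contents[].Key" --output text

# Batch-delete the identified temporary objects (up to 1,000 keys per DeleteObjects request)
aws s3api delete-objects --bucket example-snowflake-bucket --delete '{"Objects":[{"Key":"dir-1/temp-part-0001"},{"Key":"dir-1/temp-part-0002"}],"Quiet":true}'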

This approach presented the following challenges:

  • Maintenance: Snowflake assumed the undifferentiated heavy lifting and engineering effort of maintaining the temporary data manager service. This included patching, application updates, and maintenance of the underlying Amazon EC2 infrastructure. In addition, the business logic needed to identify temporary objects to be deleted required careful implementation.
  • Compute cost: Temporary data manager’s list and deletion logic required compute resources, and therefore increased the cost of running large queries.
  • Scaling: An influx of customer queries required the temporary data manager service to scale rapidly, which created a risk of processing delays and unexpected storage costs.

Choosing Amazon S3 Lifecycle

Snowflake looked for a more efficient way to manage the expiration of these temporary objects. They chose Amazon S3 Lifecycle because it manages objects so that they are stored cost-effectively throughout their lifecycle, and its automated rules for expiring objects or transitioning them to another storage class remove the undifferentiated heavy lifting of maintaining a custom application.

Solution overview

An Amazon S3 Lifecycle rule can apply to all or a subset of objects in a bucket. Customers can choose to apply filters to their Lifecycle rules based on the prefix that their objects are in, or based on tags that are applied to their objects. Since Snowflake only wanted to delete temporary objects, and those temporary objects are stored in the same prefixes as permanent data, they decided to use object tags for filtering.

To implement this change, Snowflake performed the following two tasks:

  1. Add object tags to the PUT operation for all temporary objects so that they could be identified with an S3 Lifecycle filter.
  2. Update the Amazon S3 Lifecycle configuration to include object tags as filters.

Tagging objects during PUT operations

To automate tag creation during PUT operations, Snowflake began creating temporary objects with tags already applied, using the x-amz-tagging request header. Tagging is free of charge when specified as part of the PutObject request in this way. The following is an example PUT operation:

aws s3api put-object --bucket temporary-data-manager --key dir-1/my_images.tar.bz2 --body ./my_images.tar.bz2 --tagging "ObjectType=temp"

Amazon S3 Lifecycle configuration using tag-based filters

Then, Snowflake updated their S3 Lifecycle configuration to include object tags as filters. In the following example, the Lifecycle rule specifies a filter based on a tag key (ObjectType) and value (temp), so the rule applies only to the subset of objects in the bucket that carry this tag. Expiration is set to 7 days, which means that S3 expires matching objects 7 days after they are created. The following is an example S3 Lifecycle configuration:

<?xml version="1.0" encoding="UTF-8"?>
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <Rule>
        <ID>delete_object_withTag</ID>
        <Filter>
            <Tag>
                <Key>ObjectType</Key>
                <Value>temp</Value>
            </Tag>
        </Filter>
        <Status>Enabled</Status>
        <Expiration>
            <Days>7</Days>
        </Expiration>
    </Rule>
</LifecycleConfiguration>
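
Note that while the S3 REST API accepts this XML document directly, the AWS CLI expects the same configuration as JSON. As an illustration (the bucket name is hypothetical), the rule above could be applied with put-bucket-lifecycle-configuration:

aws s3api put-bucket-lifecycle-configuration --bucket temporary-data-manager --lifecycle-configuration '{
    "Rules": [
        {
            "ID": "delete_object_withTag",
            "Filter": {"Tag": {"Key": "ObjectType", "Value": "temp"}},
            "Status": "Enabled",
            "Expiration": {"Days": 7}
        }
    ]
}'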

Tag values are just one of many filters that you can apply in your S3 Lifecycle policies. You can learn more about lifecycle filters in the Amazon S3 documentation.
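
For instance, a rule can scope expiration to tagged objects under a particular prefix by combining both conditions in an And element (the prefix below is illustrative):

<Filter>
    <And>
        <Prefix>dir-1/</Prefix>
        <Tag><Key>ObjectType</Key><Value>temp</Value></Tag>
    </And>
</Filter>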

Conclusion

In this post, we discussed Snowflake’s implementation of an efficient storage lifecycle strategy using Amazon S3 Lifecycle rules and S3 object tagging. This implementation simplified their temporary data manager service and freed up Snowflake engineers for higher-value work. The S3 Lifecycle-based system improved temporary object processing time by up to 80%, expiring billions of objects (tens of petabytes) without any manual intervention or management overhead.

We are here to help, and if you need further assistance with S3 Lifecycle strategy, reach out to AWS Support and your AWS account team.

Wesley Pereira

Wesley is a senior software engineer at Snowflake, working on the storage team. He's passionate about building and optimizing large-scale distributed systems. He works in the Bellevue Snowflake office in the Seattle area.

Frank Dallezotte

Frank Dallezotte is a Senior Solutions Architect at AWS and is passionate about working with independent software vendors to design and build scalable applications on AWS. He has experience creating software, implementing build pipelines, and deploying these solutions in the cloud.

Pino Suliman

Pino Suliman is a Senior Technical Product Manager on the Amazon S3 team at AWS, focusing on storage insights products. Pino loves engaging with customers to understand their S3 usage and gather valuable feedback. He is passionate about storage management, analytics, and data security in the cloud. Pino is based in Seattle and enjoys exploring nature with his wife and kids.

Archana Srinivasan

Archana Srinivasan is a Senior Technical Account Manager within Enterprise Support at Amazon Web Services (AWS). Archana provides strategic technical guidance for independent software vendors (ISVs) to innovate and operate their workloads efficiently on AWS.