AWS Storage Blog

Cost-optimized log aggregation and archival in Amazon S3 using s3tar

According to a study by the International Data Corporation (IDC), the global datasphere is expected to grow from 33 zettabytes (ZB) in 2018 to 175 ZB by 2025, a staggering five-fold increase. Organizations that leverage distributed architectures generate a significant portion of their data footprint from observability data, including application logs, metrics, and traces, which is critical for compliance and long-term retention, especially in regulated sectors. As a result, these organizations are seeking scalable, cost-effective solutions with a solid data foundation and retention strategy.

Amazon S3 is a popular choice for handling such large volumes of observability data. Centralized logging architectures in AWS typically ingest logs from various AWS sources (AWS CloudTrail, AWS Config, Amazon CloudWatch, S3 Access Logs, Elastic Load Balancing (ELB), VPC Flow Logs), custom applications, and third-party logging services into a central account and S3 bucket. This is outlined in the prescriptive guidance on centralized logging and monitoring. Amazon S3 offers options for storage tiering and data lifecycle management that can help you manage voluminous amounts of data in a cost-effective manner.

In this post, we show some of the challenges and cost considerations enterprises face while archiving massive volumes of observability data, such as logs and metrics, to the Amazon S3 Glacier storage classes. We also present a reference solution architecture that adopts the s3tar tool to provide a cost-effective approach for log aggregation and seamlessly transition to S3 Glacier archival storage classes. The s3tar tool efficiently handles objects of varying sizes, aggregating them through multipart copy operations. It also provides enhanced error handling and simplified deployment options.

Note: This solution serves as an alternative to the previous Java-based approach for aggregating and compressing logs for archiving in Amazon S3. The previous solution did not use the s3tar tool; instead, it used a Java-based method for aggregation and archiving. Customers with a Java development background can customize that solution to fit their specific needs and deployment methods. In addition, for further reading on aggregation to improve query performance, read this blog.

Data lifecycle management without aggregation

This post assumes a hypothetical organization with specific log data retention and retrieval requirements. For the first 60 days, the data is frequently accessed. Between 60 and 180 days, the data is accessed infrequently, but must still be immediately available. Between 180 and 365 days, the data must be retained and retrievable to meet compliance requirements, but with less stringent retrieval time requirements. Anything older than 365 days can be deleted. The challenge is finding a cost-efficient and scalable solution to archive these large volumes of data. The organization handles approximately 10 million objects per day, with file sizes ranging from as small as 2 KB to 5 MB.

Amazon S3 storage classes provide different storage options based on the performance, data access patterns, resiliency, and cost requirements of the workloads. Amazon S3 Lifecycle provides a mechanism to define custom object transition and expiration rules to move objects to cost-effective storage classes or expire them based on object age. You can transfer less frequently accessed objects to lower-cost storage classes to achieve cost savings without sacrificing availability or performance. To learn more about the storage class options, watch this video overview of Amazon S3 data lifecycle management.

To meet the previously listed data retention and retrieval requirements, organizations can set up S3 Lifecycle rules on the logging S3 bucket to transition objects from S3 Standard to S3 Glacier Instant Retrieval after 60 days, move objects to S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive after 180 days, and permanently delete objects after 365 days.
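The retention schedule above could be expressed as a lifecycle configuration along these lines. This is a minimal sketch: the bucket name and prefix are placeholders, and the rule structure should be validated against the S3 API reference before use.

```python
# Hypothetical lifecycle configuration matching the retention schedule described
# above: S3 Glacier Instant Retrieval at 60 days, S3 Glacier Deep Archive at
# 180 days, permanent deletion at 365 days. Bucket name and prefix are placeholders.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-and-expire-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 60, "StorageClass": "GLACIER_IR"},
                {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

# The configuration could then be applied with boto3 (requires AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="central-logging-bucket",
#     LifecycleConfiguration=lifecycle_configuration,
# )
```

Note that `GLACIER_IR`, `GLACIER`, and `DEEP_ARCHIVE` are the API storage class names for S3 Glacier Instant Retrieval, Flexible Retrieval, and Deep Archive respectively.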


Figure 1: Sample S3 Lifecycle configuration, with no aggregation of objects before archiving

For the purposes of this blog, we focus on Amazon S3 Glacier Instant Retrieval, a storage class designed for rarely accessed data that still requires millisecond retrieval times.

Challenges associated with this approach

  1. S3 Lifecycle transition costs are proportional to the number of objects transitioned by the S3 Lifecycle configuration. In a hypothetical enterprise-scale setup involving daily log volumes of approximately 10 million objects and billions of existing objects in the S3 bucket, the volume of object lifecycle transitions can be substantial.
  2. S3 Lifecycle does not transition objects smaller than 128 KB by default. While customers can use object size filters with S3 Lifecycle rules to transition objects smaller than 128 KB, it’s generally not recommended. The transition costs for these small objects can sometimes outweigh the savings from lower storage costs. For more information, refer to the constraints section in this Amazon S3 User Guide.
  3. Customers can take one or more of the following actions with the small objects that have not been transitioned by default. One option is to retain them in the S3 Standard storage class until they expire. This is ideal if you only have a small number of small objects, say tens or hundreds. Another option is to run a custom aggregation solution to aggregate millions to billions of small objects. While this option reduces transition costs, it is time-consuming, resource-intensive, and can incur additional compute and request costs.

Data lifecycle management with aggregation

When millions of files are transitioning daily to Amazon S3 Glacier storage classes, even an efficient S3 Lifecycle policy can result in unexpectedly high transition costs. A primary objective of the solution is to optimize costs by minimizing the overall number of S3 Lifecycle transition requests needed to move to the archival storage classes. This is achieved by using a log aggregation mechanism. In the following solution, we perform a pre-processing step using a utility tool that aggregates the objects from the S3 Standard storage class to build a tar archive and upload it directly to S3 Glacier Instant Retrieval in a destination S3 bucket. The following figure shows a conceptual flow for this implementation.


Figure 2: Sample S3 Lifecycle configuration with aggregation and archival transition using s3tar

The amazon-s3-tar-tool (s3tar) is a community-maintained open-source tool designed to efficiently create tar archives of S3 objects. s3tar optimizes for cost and performance across the steps involved in downloading objects, aggregating them into a tar, and putting the final tar in a specified Amazon S3 storage class, with a configurable "--concat-in-memory" flag. You can read this section of the README to learn more about the optimization details. The tool also offers the flexibility to upload directly to a preferred storage class, or to store the tar object in S3 Standard storage and seamlessly transition it to specific archival classes using S3 Lifecycle policies. You can customize s3tar to use GNU headers as the tar format or use the default Portable Archive Interchange (PAX) format. To retrieve objects included in a tar created by s3tar, download the archive object to a local filesystem and extract it with tar. You can review the archive extraction methods described in the GNU tar manual for more guidance.
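An s3tar invocation for this workflow might look like the following sketch, here assembled as a command line from Python. The bucket names, prefixes, and region are placeholders, and flag spellings should be checked against the s3tar README before use.

```python
import subprocess

# Hypothetical s3tar invocation: aggregate one day's worth of log objects into
# a single tar, uploaded directly to S3 Glacier Instant Retrieval. All bucket
# names and prefixes are placeholders; verify flags against the s3tar README.
cmd = [
    "s3tar",
    "--region", "us-east-1",
    "--storage-class", "GLACIER_IR",   # write the tar directly to Glacier Instant Retrieval
    "--concat-in-memory",              # aggregate parts in memory (see README optimizations)
    "-cvf", "s3://archive-bucket/logs/2024-06-01.tar",
    "s3://central-logging-bucket/logs/2024-06-01/",
]
print(" ".join(cmd))

# Uncomment to execute in an environment where s3tar is installed and
# AWS credentials are configured:
# subprocess.run(cmd, check=True)
```

In the architecture below, this command runs inside a container on AWS Fargate rather than on a local machine.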

The reference solution architecture uses a serverless approach to automate the following steps for tar creation and object lifecycle management, error handling, and monitoring. An example implementation of this solution approach is available on GitHub. For a full walkthrough and deployment instructions, follow the steps detailed in the README. Figure 3 shows the proposed solution architecture with s3tar.

1. First, we configure Amazon S3 Inventory at a bucket level or prefix level where the tar and S3 Lifecycle transitions have to be run.

a. When S3 Versioning is enabled, the Inventory report is configured to list only the current active version for this implementation. Non-current versions of objects are deleted using S3 Lifecycle rule actions.

b. Include the following object metadata attributes in the S3 Inventory report configuration: last modified, size, ETag, and storage class. This metadata is used later in the process to filter objects based on their age and size to create a tar.

c. You can configure the S3 Inventory CSV file to be stored in a destination bucket under a specific prefix.
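Steps 1a through 1c could be captured in an inventory configuration like the following sketch. The configuration `Id`, bucket ARN, and prefix are placeholders; the field names follow the shape expected by the S3 `PutBucketInventoryConfiguration` API.

```python
# Hypothetical S3 Inventory configuration for this solution. It lists only
# current object versions (step 1a), includes the metadata needed for age/size
# filtering (step 1b), and delivers a daily CSV to a destination prefix (step 1c).
# All names and ARNs are placeholders.
inventory_configuration = {
    "Id": "daily-log-inventory",
    "IsEnabled": True,
    "IncludedObjectVersions": "Current",
    "Schedule": {"Frequency": "Daily"},
    "OptionalFields": ["Size", "LastModifiedDate", "ETag", "StorageClass"],
    "Destination": {
        "S3BucketDestination": {
            "Bucket": "arn:aws:s3:::inventory-destination-bucket",
            "Prefix": "inventory/central-logging-bucket",
            "Format": "CSV",
        }
    },
}

# Applied with boto3 (requires credentials and a bucket policy that allows
# S3 to deliver inventory reports to the destination bucket):
# import boto3
# boto3.client("s3").put_bucket_inventory_configuration(
#     Bucket="central-logging-bucket",
#     Id=inventory_configuration["Id"],
#     InventoryConfiguration=inventory_configuration,
# )
```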

2. Amazon EventBridge Scheduler is used to schedule the orchestration of the tar creation steps on a daily basis, based on the Amazon S3 Inventory schedule. An AWS Step Functions state machine workflow is used as a serverless mechanism for orchestrating the stages.
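A daily schedule targeting the state machine could be defined along these lines. The schedule name, cron expression, and ARNs are placeholders; in practice the invocation time should trail the expected S3 Inventory delivery time.

```python
# Hypothetical EventBridge Scheduler definition that starts the Step Functions
# state machine once a day, after the daily S3 Inventory report is delivered.
# The name, cron expression, and ARNs are placeholders.
schedule = {
    "Name": "daily-s3tar-aggregation",
    "ScheduleExpression": "cron(0 6 * * ? *)",  # 06:00 UTC daily
    "FlexibleTimeWindow": {"Mode": "OFF"},
    "Target": {
        "Arn": "arn:aws:states:us-east-1:111122223333:stateMachine:LogAggregation",
        # Role that grants EventBridge Scheduler permission to start executions:
        "RoleArn": "arn:aws:iam::111122223333:role/SchedulerInvokeStepFunctions",
    },
}

# Created with boto3 (requires credentials):
# import boto3
# boto3.client("scheduler").create_schedule(**schedule)
```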

3. The s3tar tool is packaged as a Docker container and hosted on a serverless AWS Fargate infrastructure managed by Amazon Elastic Container Service (Amazon ECS). To learn how to create a Docker container image and host it on Amazon ECS and Fargate, refer to the getting started section of Amazon ECS developer guide. The Amazon ECS task is right-sized for CPU and memory settings based on assessing the average number of objects processed and the time taken by s3tar to prepare the tar archive.

a. When the Step Functions workflow is triggered on its schedule, the initial stage of the Amazon ECS task generates an input manifest file for the s3tar tool, as explained in the s3tar README. For this implementation, this stage filters objects older than 60 days using the object metadata available in the Amazon S3 Inventory report and makes the manifest available to s3tar for further processing.

b. s3tar processes the filtered objects from the manifest and prepares a tar archive. The tar archive is placed in a non-versioned destination S3 bucket in the S3 Glacier Instant Retrieval storage class.

c. The final stage of the Amazon ECS task deletes objects and transitions applicable objects based on S3 Lifecycle transition rules. If the source bucket has S3 Versioning enabled, then this step invokes the S3 DeleteObject API to place delete markers on the objects that were included in the tar. S3 Lifecycle rules are configured to expire and permanently delete those objects next. S3 Lifecycle rules are also configured to permanently delete the inventory files and incomplete multipart uploads generated during s3tar processing.
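The filtering stage described in step 3a can be sketched in a few lines of Python. The inventory column order below matches the fields configured earlier, and the output format (one `bucket,key` row per object) is an assumption to be checked against the manifest layout documented in the s3tar README.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

def build_manifest(inventory_csv: str, min_age_days: int = 60, now=None) -> str:
    """Filter an S3 Inventory CSV down to an s3tar input manifest.

    Assumed inventory column order: bucket, key, size, last_modified, etag,
    storage_class. Only S3 Standard objects older than min_age_days are kept.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for bucket, key, size, last_modified, etag, storage_class in csv.reader(io.StringIO(inventory_csv)):
        modified = datetime.fromisoformat(last_modified.replace("Z", "+00:00"))
        if modified <= cutoff and storage_class == "STANDARD":
            writer.writerow([bucket, key])
    return out.getvalue()

# Example with two fabricated inventory rows: only the object older than
# 60 days relative to the chosen reference date ends up in the manifest.
sample = (
    "central-logging-bucket,logs/app1/old.log,2048,2024-01-01T00:00:00.000Z,abc123,STANDARD\n"
    "central-logging-bucket,logs/app1/new.log,4096,2024-06-20T00:00:00.000Z,def456,STANDARD\n"
)
print(build_manifest(sample, now=datetime(2024, 6, 30, tzinfo=timezone.utc)))
# -> central-logging-bucket,logs/app1/old.log
```

In the actual solution this logic runs in the ECS task, reading the inventory report from the destination bucket rather than from an in-memory string.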

4. Error handling and retry logic are included in the Step Functions workflow using AWS Lambda functions, with success and failure notifications sent to subscribers using Amazon Simple Notification Service (Amazon SNS). To learn how to set up error handling, retries, and alerting in Step Functions state machines, refer to the blog "Handling Errors, Retries, and adding Alerting to Step Function State Machine Executions."
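The retry and failure-notification behavior in step 4 can be expressed directly in a task state's Amazon States Language (ASL) definition, shown here as a Python dict. The resource ARN and state names are placeholders, not the solution's actual definition.

```python
# Sketch of an ASL task state that runs the s3tar ECS task with retries and
# routes any unrecovered failure to an SNS notification state. The resource
# ARN and the state names are placeholders.
run_s3tar_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::ecs:runTask.sync",  # run the ECS task and wait for completion
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed", "States.Timeout"],
            "IntervalSeconds": 30,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,   # 30s, 60s, 120s between attempts
        }
    ],
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],
            "Next": "NotifyFailureViaSNS",   # publishes the error to an SNS topic
        }
    ],
    "Next": "NotifySuccessViaSNS",
}
```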


Figure 3: Proposed serverless event driven data lifecycle management architecture using s3tar

Cost efficiency is a top priority when managing large datasets with millions of objects with varying sizes. The following shows a direct comparative pricing illustration using Amazon S3 storage class pricing in the US East (N. Virginia) AWS Region. The illustration considers an S3 bucket with a daily log volume of 10 million objects in the S3 Standard storage class that are transitioned to the S3 Glacier Instant Retrieval storage class.

1. Daily transitions of 10 million objects from the S3 Standard storage class to the S3 Glacier Instant Retrieval storage class, at $0.02 per 1,000 requests, amount to 10 million transition requests, or $200 per day.

2. In contrast, our aggregation method with s3tar costs $8 per day, broken down into $4 for Amazon S3 API calls and $4 for the serverless compute used for s3tar processing. Furthermore, when evaluating storage costs, the daily storage cost for 3 TB of data is approximately $2.36 in the S3 Standard storage class. When transitioned to the S3 Glacier Instant Retrieval storage class, the storage cost drops to $0.41 per day. This means the reduced storage rate would offset the $8 aggregation cost in under five days. Aggregation not only provides an immediate cost saving, but also proves more economical in the long run.
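The arithmetic behind this illustration can be reproduced as follows. The per-GB-month rates are the published US East (N. Virginia) list prices at the time of writing and are an assumption; the post itself only quotes the derived daily figures.

```python
# Reproducing the cost illustration above. Per-GB-month prices are assumed
# US East (N. Virginia) list prices; a 30-day month is used for daily rates.
objects_per_day = 10_000_000
transition_price_per_1000 = 0.02           # USD per 1,000 Glacier IR transition requests
lifecycle_cost_per_day = objects_per_day / 1000 * transition_price_per_1000

data_gb = 3 * 1024                         # 3 TB of log data
standard_per_gb_month = 0.023              # USD, S3 Standard (assumed list price)
glacier_ir_per_gb_month = 0.004            # USD, S3 Glacier Instant Retrieval (assumed)
standard_per_day = data_gb * standard_per_gb_month / 30
glacier_ir_per_day = data_gb * glacier_ir_per_gb_month / 30

aggregation_cost_per_day = 8.0             # USD: $4 S3 API calls + $4 serverless compute
breakeven_days = aggregation_cost_per_day / (standard_per_day - glacier_ir_per_day)

print(f"Lifecycle transitions: ${lifecycle_cost_per_day:.0f}/day")  # $200/day
print(f"S3 Standard storage:   ${standard_per_day:.2f}/day")        # ~$2.36/day
print(f"Glacier IR storage:    ${glacier_ir_per_day:.2f}/day")      # ~$0.41/day
print(f"Aggregation pays off in {breakeven_days:.1f} days")         # under 5 days
```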

The overall comparison from the preceding illustration shows that using the s3tar option for object aggregation combined with optimized S3 Lifecycle transition rules provides up to 80% savings on Amazon S3 costs. The number of S3 Lifecycle transition requests was reduced from 10 million per day down to just one per day, a significant reduction. The preceding illustration excluded some other variable costs, such as retrieval costs, S3 API costs, and costs associated with variable object sizes. For organizations with a similar setup across multiple environments spanning different AWS Regions, the s3tar solution can offer an optimized approach for data retention while achieving cost savings on Amazon S3 storage and Lifecycle transitions.

Considerations and best practices

Be sure to think through the following considerations and best practices:

  • By implementing a well-designed data retention strategy, organizations can balance the need to retain data for business, legal, and regulatory purposes while optimizing storage, improving data protection, and supporting overall data governance and compliance efforts.
  • Monitoring and analyzing access patterns can help organizations make more informed decisions about data management, storage, and processing, ultimately enhancing the overall efficiencies of data management. Use Amazon S3 Storage Lens to gain organization-wide visibility into object storage and activity. S3 Storage Lens groups help with object analysis by using metadata such as object size and age.
  • By understanding the costs associated with Amazon S3 storage classes and implementing well-designed S3 Lifecycle policies, organizations can effectively manage their massive data lakes built using Amazon S3.
  • When planning restoration from Amazon S3 Glacier storage classes, consider the associated restoration costs and applicable tiers. Refer to the S3 Glacier archival storage documentation to understand the cost implications of restore operations from these storage classes.
  • Implementing robust object aggregation strategies and S3 Lifecycle configurations provides further cost and performance gains.
  • For a small number of objects, it’s more practical and cost-effective to use S3 storage classes with S3 Lifecycle rules rather than implementing an aggregation solution.
  • Consider the trade-offs of maintaining a custom aggregation solution, including the complexity of restoring individual objects from archived aggregated files.

Conclusion

In this post, we demonstrated how to reduce the number of Amazon S3 Lifecycle transitions to S3 Glacier storage classes. We achieved this using a serverless, event-driven solution that leverages the s3tar tool to minimize the number of objects transitioned. We discussed how enterprises with large Amazon S3 footprints can optimize storage and performance with efficient storage lifecycle management. For this case of log management, the solution presented in this post automates the process of consolidating objects in Amazon S3 and moving them to the S3 Glacier storage classes. This approach utilizes optimizations available in the s3tar tool, such as improved error handling, multipart copying for larger objects, and avoiding object downloads during processing. These features help make the overall archiving process more efficient and cost-effective.

With this approach, you can achieve up to 80% cost savings, reducing transition costs while effectively handling varying object sizes. A comprehensive strategy is necessary when migrating data to archival storage classes, including awareness of the associated costs and object access patterns. The proposed data aggregation approach using a tar archival format is particularly useful when dealing with large volumes of data that must be retained for extended periods, such as for regulatory compliance and historical data analysis.

To get started, review the sample code on GitHub, which shows an implementation of this pattern. This example includes instructions on generating some test data and trying out the solution in your environment.

Thank you for reading this post. Feel free to leave your thoughts in the comments section.

Krishna Prasad


Krishna Prasad is a Senior Solutions Architect in Strategic Accounts Solutions Architecture team at AWS. He works with customers to help solve their unique business and technical challenges, providing guidance in areas like distributed compute, security, storage, containers, serverless, artificial intelligence (AI), and machine learning (ML). In his free time, he enjoys music, playing tennis, travel and hiking nature trails.

Yanko Bolanos


Yanko Bolanos is a Senior Solutions Architect helping customers run production workloads on AWS. With over 16 years of experience in media & entertainment, telco, gaming, and data analytics, he is passionate about driving innovation in cloud and technology solutions. Prior to AWS, Yanko applied his cross-disciplinary tech and media expertise as a Director leading R&D Engineering and Ad Engineering teams.