AWS Storage Blog
Optimizing storage costs and query performance by compacting small objects
Applications produce log files that must be reliably stored for ad hoc reporting, compliance, or auditing purposes. Over time, these collections of relatively small log files grow in volume, and cost-effective storage and data management become crucial. Being able to access and query these files efficiently is equally important for extracting insight from the data.
An example of a service that generates event logs is AWS CloudTrail. CloudTrail tracks API calls and user activity within an AWS account. These log files are useful for security monitoring, change tracking, and troubleshooting. However, CloudTrail logs are stored as individual files in Amazon S3 buckets, with each file typically being less than 128 KB in size. Over weeks and months of activity, the number of CloudTrail log files can grow into the thousands or millions, and storage costs rise proportionally. Although Amazon S3 offers storage classes such as S3 Standard-Infrequent Access and S3 Glacier Instant Retrieval to reduce storage costs, they have a minimum billable object size of 128 KB, and Amazon S3 Lifecycle charges transition requests per object. For S3 Intelligent-Tiering, objects smaller than 128 KB can be stored, but they are always charged at the Frequent Access tier rates. Transitioning large numbers of small files to infrequent access tiers can therefore be cost-prohibitive.
In this post, we explore a pattern for compacting (or combining) large collections of small files into fewer, larger objects using AWS Step Functions. Compaction offers an alternative to archival-based solutions such as compressing and archiving logs stored in Amazon S3 and Amazon S3 Tar. By compacting many small objects together, they become large enough to exceed the 128 KB minimum billable size discussed previously. Additionally, ad-hoc query performance is improved, and queries are executed in place without changes to existing code or tools.
Comparing compaction and archival
We use the term compaction to refer to multiple files being concatenated, aggregated, or otherwise combined into a single larger file without changing the format or structure of the files. This is different from archival, where an archive file format (such as .zip or .tar) is used to store the data.
When deciding which method to use, consider the following:
- Compaction: Retains the analytical value of the data and improves query performance. It is beneficial when you need to perform frequent queries or analysis on data, as it consolidates the small files into fewer, larger objects that can be queried more efficiently in place. We demonstrate how these larger objects can reduce storage costs in Amazon S3. Compaction works best when the objects being combined are semantically equivalent. For example, streams of log entries or individual JSON lines can be aggregated without changing the file structure. Note that any detail stored in the object key or metadata is lost.
- Archival: Suitable for long-term storage, when querying is infrequent or not required. It has the potential to provide additional cost optimization through compression. Archival is more suitable if there is an overarching structure to the file, such as a JSON object per file or heterogeneous file types. Archive formats can also store directory structure, and metadata about the original files.
The compaction method presented here uses a lightweight serverless pattern to orchestrate the concatenation of files. The resulting aggregated files remain queryable in place. This retains the analytical value of the data while optimizing storage costs and improving query performance, helping you get to insights more quickly and cost-effectively.
Using Amazon S3 prefixes as compaction criteria
Although the 128 KB minimum billable object size on S3 Standard-Infrequent Access and S3 Glacier Instant Retrieval is meant to safeguard users from excessive lifecycle transition fees for small objects, it can be a challenge when you want to write high volumes of smaller log files. The compaction pattern provides a way to work around this challenge. The pattern also avoids the S3 Lifecycle transition costs of moving objects between storage classes.
For example:
- A bucket using the S3 Standard storage class contains 10,000,000 objects of 1 KB each.
- The cost to store these in S3 Standard is $0.23 per month.
- The cost to store these as-is in S3 Standard-IA is $16 per month (128 KB minimum billable size * 10,000,000 objects * $0.0125 per GB-month).
- If these are compacted into 1,000 objects, then the cost to transition them becomes $0.01, and the storage cost (assuming an even distribution across the 1,000 output files) becomes $0.13 per month. A worked version of this arithmetic follows the list.
- These pricing details are based on US-East-1 and accurate as of the time of writing. Check the Amazon S3 pricing documentation for the latest information.
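To make the arithmetic explicit, the following sketch reproduces the calculation above in Python. The prices are the US East (N. Virginia) figures implied by the numbers in the list, and 1 GB is treated as 1,000,000 KB to keep the example simple; check current pricing before relying on these values.

```python
# A worked sketch of the cost comparison above. Prices reflect US East
# (N. Virginia) at the time of writing; 1 GB is treated as 1,000,000 KB
# to keep the arithmetic simple.
num_objects = 10_000_000
object_size_kb = 1
total_gb = num_objects * object_size_kb / 1_000_000         # 10 GB of data

standard_per_gb_month = 0.023       # S3 Standard
standard_ia_per_gb_month = 0.0125   # S3 Standard-IA
ia_min_billable_kb = 128            # minimum billable object size in S3 Standard-IA
transition_cost_per_1000 = 0.01     # S3 Lifecycle transition requests

# Store as-is in S3 Standard
print(total_gb * standard_per_gb_month)                      # ~$0.23 per month

# Store as-is in S3 Standard-IA: each 1 KB object is billed as 128 KB
billable_gb = num_objects * ia_min_billable_kb / 1_000_000
print(billable_gb * standard_ia_per_gb_month)                # ~$16 per month

# Compact into 1,000 objects, then transition those to S3 Standard-IA
print((1_000 / 1_000) * transition_cost_per_1000)            # ~$0.01 to transition
print(total_gb * standard_ia_per_gb_month)                   # ~$0.13 per month
```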
The trade-off of this is reduced granularity for reads. Using the compaction pattern, you can only retrieve objects in their compacted form rather than at the same granularity as writes (for example, each day’s data rather than each individual event). For more details on reads, see the Performance gains when querying compacted objects with Amazon Athena section of this post.
The compaction ratio depends on the granularity of the prefixes in the source S3 bucket. For example, if the prefixes are split by date down to a daily granularity, then one compacted output file is produced for each day. By maintaining the same granularity between the input and output, efficient querying of the compacted object is possible using the same schema as the source data. An example of this pattern is shown in the following diagram:
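To complement the diagram, here is a minimal sketch of the key mapping at daily granularity. The key names are hypothetical; the point is that every object under a day-level prefix maps to a single compacted object for that day.

```python
# A minimal sketch of daily-granularity compaction with hypothetical key names:
# all objects under one day-level prefix collapse into one compacted object.
from collections import defaultdict

source_keys = [
    "logs/2024/02/01/event-0001.json",
    "logs/2024/02/01/event-0002.json",
    "logs/2024/02/02/event-0003.json",
]

groups = defaultdict(list)
for key in source_keys:
    day_prefix = "/".join(key.split("/")[:4])   # e.g. "logs/2024/02/01"
    groups[day_prefix].append(key)

for day_prefix, keys in groups.items():
    print(f"{len(keys)} source objects -> {day_prefix}/compacted.json")
```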
You can transition these fewer, larger objects to S3 Standard-IA or S3 Glacier storage classes for reduced storage costs. You can then expire the old, uncompacted data using lifecycle policies without additional charges.
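As an illustration, the transition and expiration could be expressed in a lifecycle configuration along the following lines. This is a minimal sketch using boto3; the bucket name, prefixes, and day counts are assumptions, not values from the sample implementation.

```python
import boto3

s3 = boto3.client("s3")

# A minimal sketch: "example-log-bucket", the prefixes, and the day counts
# are placeholders. Adjust them to match your own layout and retention needs.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                # Move compacted objects to S3 Standard-IA after 30 days
                "ID": "transition-compacted-objects",
                "Filter": {"Prefix": "compacted/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            },
            {
                # Expire the original small objects once they have been compacted
                "ID": "expire-uncompacted-objects",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Expiration": {"Days": 60},
            },
        ]
    },
)
```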
Compacting objects using AWS Lambda and AWS Step Functions Distributed Map
A sample implementation of this pattern is available on GitHub. For a full walkthrough and deployment instructions, follow the steps detailed in the README. This implementation uses an AWS Lambda function to iterate through multiple Amazon S3 prefixes, read and merge the contents of small files in each, and aggregate them into a single file written to a destination bucket. With Lambda you pay only for the compute time you consume, so it is a fitting solution for the compaction process, as you only need to run it on a specified schedule.
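The repository contains the full implementation; the following is only a rough sketch of the core idea behind the per-prefix compaction function. The event shape, bucket names, and output key convention are assumptions for illustration and are not taken from the sample code.

```python
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Concatenate every object under one prefix into a single larger object.

    A minimal sketch: the event fields and output key convention are
    assumptions, not the sample repository's exact contract.
    """
    source_bucket = event["source_bucket"]
    destination_bucket = event["destination_bucket"]
    prefix = event["prefix"]                        # e.g. "logs/2024/02/01/"

    # Read and merge the contents of every small object under the prefix.
    parts = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=source_bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=source_bucket, Key=obj["Key"])["Body"]
            parts.append(body.read())

    # Write one compacted object back under the same prefix structure.
    compacted_key = prefix.rstrip("/") + "/compacted.log"
    s3.put_object(Bucket=destination_bucket, Key=compacted_key,
                  Body=b"\n".join(parts))

    return {"prefix": prefix, "objects_compacted": len(parts)}
```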
AWS Step Functions is a fully managed service that makes it easier to orchestrate the execution of application components. Using Step Functions, you can coordinate the invocation of the file compaction Lambda function within a workflow that is easy to manage. Given that there are multiple prefix hierarchies holding small files, it is beneficial to run the compaction in parallel, because the file merging under a specific prefix is completely independent of other prefixes. Distributed Map for Step Functions is a feature for orchestrating parallel data processing, and the example implementation uses a distributed map to parallelize the file compaction.
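To give a feel for the orchestration, a Distributed Map state in the workflow definition could look roughly like the following, expressed here as a Python dictionary for readability. The state names, paths, concurrency value, and function name are assumptions, not the sample repository's definition.

```python
# A rough sketch of a Distributed Map state (Amazon States Language expressed
# as a Python dictionary). Names, paths, and the concurrency limit are
# illustrative assumptions.
compact_prefixes_state = {
    "Type": "Map",
    "ItemsPath": "$.prefixes",          # list produced by the listing Lambda function
    "MaxConcurrency": 100,              # tune to your Lambda concurrency limits
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
        "StartAt": "CompactPrefix",
        "States": {
            "CompactPrefix": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": "compact-prefix-function",
                    "Payload.$": "$",
                },
                "End": True,
            }
        },
    },
    "End": True,
}
```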
An example illustrating this pattern is shown in the following diagram:
The execution steps are:
- An Amazon EventBridge scheduled rule is used to trigger the workflow on a recurring schedule.
- The Step Functions workflow is started at the desired interval. Ideally, the interval (measured in days) between invocations matches the period during which the small files due for compaction were created. For example, the full solution can be deployed with a scheduled rule that invokes the distributed map state every 30 days, compacting (in parallel) the small files created in the 30 days preceding the invocation. A minimal scheduling sketch follows this list.
- A Lambda function is invoked to list all the files in the requested prefixes. This list is used as an input to the distributed map step.
- For each prefix in the source list, a parallel Lambda function is invoked to compact the objects in that prefix together.
- The resulting compacted objects are written to the same prefix structure in the destination bucket (in practice, this can be the same bucket as the source).
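As referenced in step 2 above, a scheduled rule along these lines could start the workflow every 30 days. This is a minimal sketch using boto3; the rule name, state machine ARN, and IAM role ARN are placeholders, and the role must allow EventBridge to start the state machine.

```python
import boto3

events = boto3.client("events")

# A minimal sketch: the names and ARNs below are placeholders, and the IAM
# role must permit EventBridge to start executions of the state machine.
events.put_rule(
    Name="compaction-schedule",
    ScheduleExpression="rate(30 days)",
    State="ENABLED",
)
events.put_targets(
    Rule="compaction-schedule",
    Targets=[
        {
            "Id": "compaction-state-machine",
            "Arn": "arn:aws:states:us-east-1:111122223333:stateMachine:compaction",
            "RoleArn": "arn:aws:iam::111122223333:role/eventbridge-start-compaction",
        }
    ],
)
```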
Note that this example implementation assumes that there is no identifying information contained in the key itself. If this is not the case (for example if the key contains the ID ranges of the records within the object, such as 101-200.json, 201-300.json, etc.), then modify the solution to output a file with the correct name or store the identifying information in a lookup table.
Solution performance and costs
In a test run, a single Lambda function execution sequentially compacted over 22,000 small log files in about 12 minutes. The log files were a few KB each, totaling around 250 MB. When invoking the distributed map state to run compaction against multiple prefixes simultaneously, the same total number of files was compacted in about 50 seconds (roughly 14 times faster). This shows the benefit of using the Step Functions distributed map to minimize the execution time of an individual Lambda function. This is important because Lambda has a maximum execution timeout of 15 minutes: without parallel processing, you must be increasingly mindful of how many files a single Lambda function can compact in sequence within that limit.
The solution scales efficiently for environments with large volumes of small data files. As a serverless solution, there is minimal operational overhead as the underlying compute is managed by AWS, and you only pay for the resources that are consumed while the compaction runs. Step Functions is billed based on the state transitions required to go from identifying the relevant list of prefixes to concluding the compaction. For the previously described parallel invocation test run, the overall cost is covered by the AWS Free Tier. Outside of the free tier, this solution would have cost less than $0.22 per execution for 1,000 prefixes. If the compaction is run monthly, then the total is $2.64 per year. For more details, review the Lambda and Step Functions pricing pages.
The efficiency of this pattern depends not only on the size and quantity of the small files, but also on how they are spread across the partitions in the source bucket. For example, if there are only a few small files per prefix, then the cost and performance gains of compaction are far less pronounced than when the resulting compacted object is several orders of magnitude larger than its inputs. Before adopting this solution, it is important to review your partitioning strategy.
Performance gains when querying compacted objects with Amazon Athena
The reason for compacting the small files rather than compressing and archiving them is to retain direct, in-place access to the data. Archival-based approaches add the overhead of decompressing the data before querying. This access could be for as-needed analysis of CloudTrail logs or for less frequently accessed data in a data lake. The choice between compaction and archival is not mutually exclusive: you can compact compressed data (for example, gzip files) and you can compress compacted data for archiving, based on your requirements and access patterns.
Amazon Athena is a serverless SQL engine that runs on structured and semi-structured data stored within S3 buckets. Taking the log analytics example, Athena is a good fit for as-needed querying and analysis. Athena is priced based on the GB of data scanned. One of the main performance factors of Athena is the partition structure of the objects in Amazon S3.
When using the compaction pattern, it’s important to aggregate files only up to the level that matches the time granularity of your source data (for example, daily if your Amazon S3 partition structure is year/month/day). This makes sure that you don’t over-select or discard data when querying. For example, if your source data is partitioned by day (such as 2024/02/01), then the most granular level to which you should compact files is a single day. If you compact files at a coarser level than the source data’s granularity, then you lose the ability to filter data effectively, which leads to over-selecting, increased costs, and degraded query performance for large datasets. For more information, see organizing objects using prefixes.
Another factor in Athena performance is file size. Many small files add overhead when computing the result, which is where compaction can reduce query execution times. Generally, aiming for files in the hundreds of MB range is recommended. For more details on optimizing Athena query performance, see “Top 10 Performance Tuning Tips for Amazon Athena.”
In the following example, there is 1.4 GB of data, split across 10,000 files (0.14 MB per file), partitioned by day in an S3 bucket. There could be hundreds of individual files per day. When querying this raw data with Athena, the query performance is as follows:
When running the same query against the compacted version of that data, the query runs around 66% faster, and query concurrency is improved due to the faster query completion time. Compaction has reduced the number of files to 293 (4.8 MB per file). The following screenshot shows that the volume of data scanned is the same, as the data is not compressed or converted to another format:
As the volume of data scales, the performance improvement holds. In the following example, ~582k objects, totaling ~83 GB (0.14 MB per file), are compacted down to 336 files (247 MB per file). When queried, the same performance improvement is observed. The first of the following screenshots shows the raw data query completing in 40 seconds:
The next screenshot shows the compacted data and improved query run time of 9.7 seconds:
The same execution results are presented in the following tables for comparison:
Dataset 1:
| Format | Total objects | Total size (GB) | Average object size (MB) | Execution time (s) |
|---|---|---|---|---|
| Raw | 10,000 | 1.4 | 0.14 | 16 |
| Compacted | 293 | 1.4 | 4.8 | 6.8 |
Dataset 2:
| Format | Total objects | Total size (GB) | Average object size (MB) | Execution time (s) |
|---|---|---|---|---|
| Raw | 582,000 | 83 | 0.14 | 40.7 |
| Compacted | 336 | 83 | 247 | 9.7 |
Another benefit of the compaction approach is that the implementation is transparent to the query engine. Because the schema and format are the same, you can query across both the “hot” raw data in the source bucket and the “cold” compacted data in the destination bucket, joining or unioning the two. This allows the aggregation process to run periodically or infrequently while queries still see up-to-date data. An example query performing a union across the raw and compacted data is shown in the following screenshot:
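As an illustration of this kind of query, the following sketch submits a union across a raw table and a compacted table through the Athena API. The database, table, column, and bucket names are hypothetical and assume both tables share the same schema and partitioning.

```python
import boto3

athena = boto3.client("athena")

# A minimal sketch with hypothetical database, table, column, and bucket
# names; both tables are assumed to share the same schema and partitions.
query = """
SELECT eventtime, eventname
FROM logs_db.raw_logs
WHERE year = '2024' AND month = '02'
UNION ALL
SELECT eventtime, eventname
FROM logs_db.compacted_logs
WHERE year = '2024' AND month = '02'
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```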
For more details on querying archive data using Athena, see the blog “Simplify querying your archive data in Amazon S3 with Amazon Athena.”
Cleaning up
If you deployed the solution from the GitHub repository, then make sure to follow the clean-up steps to avoid unexpected costs.
Conclusion
In this post, we explored an efficient way to compact small objects in Amazon S3 and showed how it can optimize storage costs for log data. By using AWS Step Functions, you can compact thousands of small objects quickly and efficiently, and by using AWS Lambda to perform the compaction, you can achieve lower costs and reduce the operational overhead of your data compaction solution.
When aggregating many small files into larger objects in a destination bucket with a matching prefix hierarchy, the ease of querying the data is preserved. Query times in Amazon Athena against compacted data can be 50-70% faster, as there is less overhead in scanning fewer, larger files. Additionally, by transitioning compacted objects to the S3 Standard-IA or S3 Glacier storage classes, you can save up to 80% on storage costs by running this solution periodically.
To get started, review the sample code on GitHub, which shows an implementation of this pattern. This example includes instructions on generating some test data and trying out the solution in your environment.