AWS Big Data Blog

Building and Maintaining an Amazon S3 Metadata Index without Servers

Mike Deck is a Solutions Architect with AWS

Amazon S3 is a simple key-based object store whose scalability and low cost make it ideal for storing large datasets. Its design enables S3 to provide excellent performance for storing and retrieving objects based on a known key. Finding objects based on other attributes, however, requires a linear search using the LIST operation. Because a single listing returns at most 1,000 keys, it may take many requests to find a given object. These additional requests make it challenging to implement attribute-based queries against S3 alone.

A common solution is to build an external index that maps queryable attributes to the S3 object key. This index can leverage data repositories that are built for fast lookups but might not be great at storing large data blobs. These types of indexes provide an entry point to your data that can be used by a variety of systems. For instance, the AWS Lambda search function described in “Building Scalable and Responsive Big Data Interfaces with AWS Lambda” could leverage an index instead of listing keys directly, to dramatically reduce the search space and improve performance.

In this post, I walk through an approach for building such an index using Amazon DynamoDB and AWS Lambda. With these technologies, you can create a high performance, low-cost index that scales and remains highly available without the need to maintain traditional servers.

Example use case

For the purposes of illustration, this post focuses on a common use case in which S3 is used as the primary data store for a fleet of data ingestion servers. For this example, assume you have a large number of Amazon EC2 instances that receive data sent by customers via a public API. These servers batch the data in one-minute increments and add an object per customer to S3 with the raw data items received in that minute. Because of the distributed nature of the instances, there’s no way to know which servers might store data for a given customer at any minute.

Assume that the servers upload objects with the following key structure:

[4-digit hash]/[server id]/[year]-[month]-[day]-[hour]-[minute]/[customer id]-[epoch timestamp].data

Example: a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data

This key structure supports sustained high request rates to S3, but it makes it difficult to find all of the keys for a given customer or server using S3 LIST operations. For instance, to list all the data objects for a given customer uploaded within the last 24 hours, you would have to iterate over every key in the bucket and inspect each one's customer ID separately.
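
To make that pain point concrete, here is a minimal sketch, using Python and boto3 with a hypothetical bucket name, of what the lookup looks like without an index: every key in the bucket must be listed and parsed.

    import boto3

    s3 = boto3.client("s3")

    def find_customer_keys(bucket, customer_id):
        """Naive search: page through every key and filter client-side."""
        matches = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                # Key layout: [hash]/[server id]/[time]/[customer id]-[epoch].data
                file_name = obj["Key"].split("/")[-1]
                if file_name.split("-")[0] == customer_id:
                    matches.append(obj["Key"])
        return matches

    print(find_customer_keys("example-ingest-bucket", "87423"))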

In addition to the information encoded in the key, each object has a user-defined metadata field that specifies whether a transaction record is present in the data. A very small percentage of these objects contain transaction records. However, these records are particularly important for certain analyses.

For data collected in this manner there are a number of analyses you could run. This post focuses on building a metadata index to facilitate four specific reports and queries:

  1. Find all objects for a given customer collected during a time range.
  2. Calculate the total storage used for a given customer.
  3. List all objects for a given customer that contain a transaction record.
  4. Find all objects uploaded by a given server during a time range.

Architecture

In addition to fulfilling the functional requirements outlined above, the system has the following primary architectural goals:

  • Zero administration cost – This system should not require the creation or administration of any servers.
  • Scalable and elastic – The index should accommodate a growing number of entries seamlessly, and scale up and down to handle changing rates of insertions and queries.
  • Automatic – Adding objects to the index should not require any additional operations beyond adding the object to S3.

DynamoDB is a NoSQL data store that can be used for storing the index itself, and AWS Lambda is a compute service that can run code to add index entries. Both of these services are fully managed, providing scalable and highly available components without the need to administer servers directly.

To update the index automatically when new objects are created, the AWS Lambda function that creates the index entries can be configured to execute in response to S3 object creation events.
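
A minimal sketch of that wiring with Python and boto3 (the bucket name and function ARN are placeholders, and the function must also grant S3 permission to invoke it):

    import boto3

    s3 = boto3.client("s3")

    # Invoke the indexing function for every object created in the bucket.
    s3.put_bucket_notification_configuration(
        Bucket="example-ingest-bucket",
        NotificationConfiguration={
            "LambdaFunctionConfigurations": [
                {
                    "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:s3-index",
                    "Events": ["s3:ObjectCreated:*"],
                }
            ]
        },
    )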

The process is illustrated below.

Note: The example code in this post only handles object creation, but the same approach can also be used to remove entries from the index when objects are deleted from the bucket.

DynamoDB table design

The heart of the S3 object index is a DynamoDB table with one item per object, which associates various attributes with the object’s S3 key. Each item contains the S3 key, the size of the object, and any additional attributes to use for lookups.

Because DynamoDB tables are schema-less, the only things you need to define explicitly are the primary key and any additional indexes to support your queries. When selecting a primary key and indexes, you need to consider how the table will be queried. The following sections look at each of the four queries from the example and show an index optimized for each one.

In this example, you define all of your indexes up front. In a more iterative development context, you could define only the primary key to begin with and then add secondary indexes as your requirements demand.

1. Find all objects for a given customer collected during a time range

For this query type, use a hash and range primary key. By making the customer ID the hash key, you can find all the objects for a given customer. If the range key is the timestamp, you can also narrow the results to a specific time range. The timestamp alone isn't enough, however, because there's no guarantee that two different servers won't upload an object for the same customer at the same time, which would violate the uniqueness requirement of the primary key. To guarantee uniqueness while preserving the ability to query on a time range, append the server ID to the timestamp for the range key. The resulting key layout is shown below.
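
As an illustration (the attribute names here are assumptions used throughout this post's sketches, not necessarily those in the sample code), the index item for the example object shown earlier might carry these keys:

    CustomerId (hash key):   "87423"
    TsServerId (range key):  "1436055953839-i-31cc02"
    S3Key:                   "a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data"
    Size:                    <object size in bytes>

Because the range key begins with the timestamp, a query can still constrain results to a time range with a BETWEEN condition on that prefix.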

2. Calculate the total storage used for a given customer

Because a query on the primary key returns all of the attributes of each matching item, you can also use this index to track the storage consumed by each customer: retrieve all of the items for a given customer ID and sum their size attribute.
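
A sketch of that calculation with boto3, reusing the illustrative table and attribute names from above:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("example-ingest-bucket-index")

    def total_storage_bytes(customer_id):
        """Sum the Size attribute across all of a customer's index items."""
        total = 0
        kwargs = {"KeyConditionExpression": Key("CustomerId").eq(customer_id)}
        while True:
            page = table.query(**kwargs)
            total += sum(item["Size"] for item in page["Items"])
            if "LastEvaluatedKey" not in page:
                break
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
        return total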

3. List all objects for a given customer that contain a transaction record

Because most of the objects won't contain a transaction, you can use a sparse secondary index to enable fast lookups of the objects that do contain transactions for a given customer. To create a sparse index, you need a "HasTransaction" attribute that is present only when a transaction exists in the object; when no transaction is present, omit the attribute entirely.

For this index, use the same customer ID hash key and set the range key to the "HasTransaction" attribute. Because this index's hash key is the same as the table's, you can define it as a local secondary index.
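
A sketch of querying that sparse index (the index name is illustrative):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("example-ingest-bucket-index")

    # Only items that carry the HasTransaction attribute appear in the sparse
    # index, so this returns just the objects that contain a transaction.
    response = table.query(
        IndexName="CustomerId-HasTransaction-index",
        KeyConditionExpression=Key("CustomerId").eq("87423"),
    )
    transaction_keys = [item["S3Key"] for item in response["Items"]]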

4. Find all objects uploaded by a given server during a time range

This query requires a global secondary index because the lookup uses a different hash key than the primary key. Use the server ID as the hash key and reuse the concatenated timestamp and server ID attribute as the range key. Because global secondary indexes do not have the same uniqueness constraint as primary keys, you don't need to include the customer ID in this index.
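
Putting the four requirements together, a sketch of the table definition with boto3 follows; the table, attribute, and index names are illustrative, and the throughput values are placeholders.

    import boto3

    boto3.client("dynamodb").create_table(
        TableName="example-ingest-bucket-index",
        AttributeDefinitions=[
            {"AttributeName": "CustomerId", "AttributeType": "S"},
            {"AttributeName": "TsServerId", "AttributeType": "S"},
            {"AttributeName": "HasTransaction", "AttributeType": "S"},
            {"AttributeName": "ServerId", "AttributeType": "S"},
        ],
        # Queries 1 and 2: by customer, optionally narrowed by time range.
        KeySchema=[
            {"AttributeName": "CustomerId", "KeyType": "HASH"},
            {"AttributeName": "TsServerId", "KeyType": "RANGE"},
        ],
        # Query 3: sparse local secondary index on the optional attribute.
        LocalSecondaryIndexes=[
            {
                "IndexName": "CustomerId-HasTransaction-index",
                "KeySchema": [
                    {"AttributeName": "CustomerId", "KeyType": "HASH"},
                    {"AttributeName": "HasTransaction", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        ],
        # Query 4: global secondary index keyed on the server ID.
        GlobalSecondaryIndexes=[
            {
                "IndexName": "ServerId-TsServerId-index",
                "KeySchema": [
                    {"AttributeName": "ServerId", "KeyType": "HASH"},
                    {"AttributeName": "TsServerId", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
                "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
            }
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    )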

Lambda function overview

Now that you have your DynamoDB table defined, you can build the Lambda function that handles the object creation events fired by S3. The event handler needs to complete the following tasks for each object added to your bucket:

  1. Extract the key and object size from the event data.
  2. Request the user-defined metadata fields for the object from S3.
  3. Determine the name of the index DynamoDB table.
  4. Put an item into the table.

Steps 1, 2, and 4 are very straightforward, and are shown in the example code that accompanies this post.

The name of the DynamoDB table to use can be determined in several ways. For simplicity, the code example uses a naming convention in which an "-index" suffix is appended to the bucket name; this way, the same Lambda function can be reused on multiple buckets. Alternatives to this strategy include hard-coding the index table name in the function, or using the event notification configuration ID to encode the table name in the S3 event itself.
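
A minimal sketch of such a handler in Python (the attribute names match the earlier illustrations, and the metadata key that marks a transaction is an assumption):

    import urllib.parse

    import boto3

    s3 = boto3.client("s3")
    dynamodb = boto3.resource("dynamodb")

    def handler(event, context):
        for record in event["Records"]:
            # Step 1: extract the key and object size from the event data.
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            size = record["s3"]["object"]["size"]

            # Step 2: request the user-defined metadata for the new object.
            metadata = s3.head_object(Bucket=bucket, Key=key)["Metadata"]

            # Key layout: [hash]/[server id]/[time]/[customer id]-[epoch].data
            _, server_id, _, file_name = key.split("/")
            customer_id, epoch = file_name.rsplit(".", 1)[0].split("-")

            # Step 3: derive the index table name from the bucket name.
            table = dynamodb.Table(bucket + "-index")

            # Step 4: put an item into the table.
            item = {
                "CustomerId": customer_id,
                "TsServerId": "{}-{}".format(epoch, server_id),
                "ServerId": server_id,
                "S3Key": key,
                "Size": size,
            }
            if metadata.get("hastransaction") == "true":  # assumed metadata key
                item["HasTransaction"] = "x"  # sparse attribute, omitted otherwise
            table.put_item(Item=item)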

Practical considerations

The approach described in this post is an effective way to build and maintain an index for S3 buckets across a variety of usage patterns, but there are some issues you should consider before using this architecture in production.

Error Handling

While all of the services used for this index are designed to be highly available, there's always the potential that the indexing function could encounter an error. You can write the function defensively and handle many scenarios gracefully, but you also need a mechanism for dealing with unrecoverable failures.

Fortunately, Lambda functions create and write to Amazon CloudWatch log streams by default. Each invocation of the function is logged, so if one does not complete successfully there is a record of what caused the failure. You can also create a CloudWatch alarm that notifies a human whenever an error occurs that the automated process couldn't deal with, so that the problem can be investigated and remedied.
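
For example, an alarm on the indexing function's Errors metric could publish to an SNS topic (the function name and topic ARN are placeholders):

    import boto3

    boto3.client("cloudwatch").put_metric_alarm(
        AlarmName="s3-index-function-errors",
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": "s3-index"}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:index-failures"],
    )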

Object creation rate

When configuring your index, consider the rate at which objects will be created in S3 to properly set the provisioned throughput for the DynamoDB table as well as the concurrency rates for the Lambda function. This style of index generally requires DynamoDB write capacity equivalent to the maximum object creation rate. For more information about provisioning throughput, see the Use Burst Capacity Sparingly section in the Guidelines for Working with Tables topic.
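
For example, if the ingestion fleet peaks at roughly 400 new objects per second and each index item is well under 1 KB (so each write consumes a single write capacity unit), the table needs on the order of 400 provisioned write capacity units to keep up at peak; these numbers are purely illustrative.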

You should also test your Lambda function under various loads to determine its concurrency requirements. After you’ve determined the maximum request rate and concurrent invocations needed to support your usage patterns, you can request an appropriate increase to the default limits if necessary.

Performance Tuning

Depending on your AWS Lambda function’s complexity, you may need to adjust the available resources (memory, CPU, and network). You can adjust the memory allocated to your function at any time and AWS Lambda assigns proportional CPU and network resources based on that value.
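
For instance, a minimal sketch of changing the memory setting with boto3 (the function name and value are placeholders):

    import boto3

    # Raising MemorySize also raises the CPU and network share Lambda allocates.
    boto3.client("lambda").update_function_configuration(
        FunctionName="s3-index",
        MemorySize=512,
    )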

Sample code and query examples

The AWS Big Data Blog’s GitHub repository contains sample code and instructions for deploying this system.

I’ve also created a video that demonstrates deploying the sample code.

Conclusion

By leveraging S3’s integration with other fully managed AWS services, you can build extremely useful extensions with minimal development and ongoing administrative cost. Because Lambda and DynamoDB provide highly flexible platforms for executing arbitrary code and storing schema-less data, respectively, you can use the approach described in this post to build sophisticated solutions without the operational burden of provisioning and maintaining traditional servers.