AWS Architecture Blog

Swiftly Search Metadata with an Amazon S3 Serverless Architecture

As you increase the number of objects in Amazon Simple Storage Service (Amazon S3), you’ll need the ability to search through them and quickly find the information you need.

In this blog post, we offer you a cost-effective solution that uses a serverless architecture to search through your metadata. Using a serverless architecture helps you reduce operational costs because you only pay for what you use.

Our solution is built with Amazon S3 event notifications, AWS Lambda, AWS Glue Catalog, and Amazon Athena. These services allow you to search thousands of objects in an S3 bucket by filenames, object metadata, and object keys. This solution maintains an index in an Apache Parquet file, which optimizes Athena queries to search Amazon S3 metadata. Using this approach makes it straightforward to run queries as needed without the need to ingest data or manage any servers.

Using Athena to search Amazon S3 objects

Amazon S3 stores and retrieves objects for a range of use cases, such as data lakes, websites, cloud-native applications, backups, archive, machine learning, and analytics. When you have an S3 bucket with thousands of files in it, how do you search for and find what you need? The object search box within the Amazon S3 user interface allows you to search by prefix, or you can search using Amazon S3 API’s LIST operation, which only returns 1,000 objects at a time.

A common solution to this issue is to build an external index and search for Amazon S3 objects using the external index. The Building and Maintaining an Amazon S3 Metadata Index without Servers blog post shows you how to build this solution with Amazon OpenSearch Service or Amazon DynamoDB.

Our solution stores the external index in Amazon S3 and uses Athena to search the index. Athena makes it straightforward to search Amazon S3 objects without the need to manage servers or introduce another data repository. In the next section, we’ll talk about a few use cases where you can apply this solution.

Metadata search use cases

When your clients upload files to their S3 buckets, you’ll sometimes need to verify the files that were uploaded. You must validate whether you have received all the required information, including metadata such as customer identifier, category, received date, etc. The following examples will make use of this metadata:

  • Searching for a file from a specific date (or) date range
  • Finding all objects uploaded by a given customer identifier
  • Reviewing files for a particular category

The next sections outline how to build a serverless architecture to apply to use cases like these.

Building a serverless file metadata search on AWS

Let’s go through layers that are involved in our serverless architecture solution:

  1. Data ingestion: Set object metadata when objects are uploaded into Amazon S3. This layer uploads objects using the Amazon S3 console, AWS SDK, REST API, and AWS CLI.
  2. Data processing: Integrate Amazon S3 event notifications with Lambda to process S3 events. The AWS Data Wrangler library within Lambda will transform and build the metadata index file.
  3. Data catalog: Use AWS Glue Data Catalog as a central repository to store table definition and add/update business-relevant attributes of the metadata index. AWS Data Wrangler API creates the Apache Parquet files to maintain the AWS Glue Catalog.
  4. Metadata search: Define tables for your metadata and run queries using standard SQL with Athena to get started faster.

Reference architecture

Figure 1 illustrates our approach to implementing serverless file metadata search, which consists of the following steps:

  1. When you create new objects/files in an S3 bucket, the source bucket is configured with Amazon S3 Event Notification events (put, post, copy, etc.). Amazon S3 events provide the metadata information required for further processing and building the metadata index file on the destination bucket.
  2. The S3 event is sent to a Lambda function with necessary permissions on Amazon S3 using a resource-based policy. The Lambda function processes the event with metadata and converts it into an Apache Parquet file, which is then written into a target bucket. The AWS Data Wrangler API transforms and builds the metadata index file. The Lambda layer configures AWS Data Wrangler library for necessary transformations.
  3. AWS Data Wrangler also creates and stores metadata in the AWS Glue Data Catalog. DataFrames are then written into target S3 buckets in Apache Parquet format. The AWS Glue Data Catalog is then updated with necessary metadata. The following example code snippet writes into an example table with columns for year, month, and date for an S3 object.

wr.s3.to_parquet(df=df, path=path, dataset=True, mode="append", partition_cols=["year","month","date"],database="example_database", table="example_table")

  1. With the AWS Glue Data Catalog built, Athena will use AWS Glue crawlers to automatically infer schemas and partitions of the metadata search index. Athena makes it easy to run interactive SQL queries directly into Amazon S3 by using the schema-on-read approach.
Serverless S3 metadata search

Figure 1. Serverless S3 metadata search

Athena charges based on the amount of data scanned for the query. The data being in columnar format and data partitioning will save costs as well as improve performance. Figure 2 provides a sample metadata query result from Athena.

Athena sample metadata query results

Figure 2. Athena sample metadata query results

Conclusion

This blog post shows you how to create a robust metadata index using serverless components. This solution allows you to search files in an S3 bucket by filenames, metadata, and keys.

We showed you how to set up Amazon S3 Event Notifications, Lambda, AWS Glue Catalog, and Athena. You can use this approach to maintain an index in an Apache Parquet file, store it in Amazon S3, and use Athena queries to search S3 metadata.

Our solution requires minimal administration effort. It does not require administration and maintenance of Amazon Elastic Compute Cloud (Amazon EC2) instances, DynamoDB tables, or Amazon OpenSearch Service clusters. Amazon S3 provides scalable storage, high durability, and availability at a low cost. Plus, this solution does not require in-depth knowledge of AWS services. When not in use, it will only incur cost for Amazon S3 and possibly for AWS Glue Data Catalog storage. When needed, this solution will scale out effortlessly.

Ready to get started? Read more and get started on building Amazon S3 Serverless file metadata search:

Jiwan Panjiker

Jiwan Panjiker

Jiwan Panjiker is a Solutions Architect at Amazon Web Services, based in the Greater New York City area. He works with AWS enterprise customers, helping them in their cloud journey to solve complex business problems by making effective use of AWS services. Outside of work, he likes to spend time with his friends and family, going for long drives, and exploring local cuisine.

Haider Naqvi

Haider Naqvi

Haider Naqvi is a Solutions Architect at AWS. He has extensive Software Development and Enterprise Architecture experience. He focuses on enabling customers re:Invent and achieve business outcome with AWS. He is based out of New York.

Muthuvelan Swaminathan

Muthuvelan Swaminathan

Muthuvelan Swaminathan is an Enterprise Solutions Architect based out of New York. He works with enterprise customers providing architectural guidance in building resilient, cost-effective and innovative solutions that address business needs.