Indexing and Querying Amazon S3 Metadata with Amazon SimpleDB

Details

Submitted By: Michael@AWS
AWS Products Used: Amazon SimpleDB
Language(s): Java
License: Apache License 2.0
Created On: May 7, 2008 9:03 PM GMT
Last Updated: May 8, 2009 2:40 AM GMT

About the Sample

Curious about how to use the Amazon SimpleDB service at real scale? Read on to walk through a real-world application using Amazon SimpleDB to index and search the metadata of your Amazon S3 objects in an efficient manner.

The application is a Java console application that demonstrates a few basic tasks: indexing data (putting Amazon S3 metadata into Amazon SimpleDB), running single queries, and running bulk queries to assess performance. In each of these areas the app demonstrates best practices for using Amazon Web Services.

What's New?

  • 2009-05-01 version 0.2:
    • Updated the tool to use the Select function instead of Query and Get.
    • Renamed package from AmazonSimpleDBApp to S3Indexer.
    • Use the Download button to get this version.
  • 2008-05-07 version 0.1:
    • Initial release.
    • This tool was initially released before SimpleDB added the QueryWithAttributes and Select functions. Thus it was necessary for the tool to perform Query calls to obtain item names and then retrieve the attributes of each item using separate Get calls.
    • Download this old version: AmazonSimpleDBApp-0.1.zip

Prerequisites

  1. An Amazon Web Services account with access to Amazon SimpleDB, Amazon S3, and Amazon Simple Queue Service (and Amazon EC2, if you wish to easily distribute the indexing task among several machines).
  2. Java 1.5 JRE or greater.
  3. If you wish to build the app yourself (a prebuilt jar is included), you will need a JDK and Ant (>1.6).

Running the sample

  1. Click the Download button on this page to download the project .zip file, then unzip it to a working directory.
  2. If you don't want to have to pass your AWS access identifiers as arguments each time you run a command, set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to your identifiers.
  3. Read the README for instructions on how to run the app.
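
The fallback described in step 2 (command-line arguments first, environment variables second) can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the class name CredentialResolver and its methods are hypothetical, and only the two environment variable names come from the steps above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of S3Indexer): resolves AWS identifiers. */
final class CredentialResolver {
    static final String ACCESS_KEY_VAR = "AWS_ACCESS_KEY_ID";
    static final String SECRET_KEY_VAR = "AWS_SECRET_ACCESS_KEY";

    /** Returns {accessKeyId, secretKey}; explicit arguments win over the environment. */
    static String[] resolve(String argAccessKey, String argSecretKey, Map<String, String> env) {
        String accessKey = firstNonEmpty(argAccessKey, env.get(ACCESS_KEY_VAR));
        String secretKey = firstNonEmpty(argSecretKey, env.get(SECRET_KEY_VAR));
        if (accessKey == null || secretKey == null) {
            throw new IllegalStateException("Missing AWS identifiers: pass them as arguments or set "
                    + ACCESS_KEY_VAR + " and " + SECRET_KEY_VAR);
        }
        return new String[] {accessKey, secretKey};
    }

    /** Returns the preferred value if present, else the fallback, else null. */
    private static String firstNonEmpty(String preferred, String fallback) {
        if (preferred != null && !preferred.isEmpty()) {
            return preferred;
        }
        if (fallback != null && !fallback.isEmpty()) {
            return fallback;
        }
        return null;
    }

    public static void main(String[] args) {
        // Demo with a stand-in environment; pass System.getenv() in real use.
        Map<String, String> env = new HashMap<>();
        env.put(ACCESS_KEY_VAR, "EXAMPLE_ID");
        env.put(SECRET_KEY_VAR, "EXAMPLE_SECRET");
        String[] ids = resolve(null, null, env);
        System.out.println("Resolved access key ID: " + ids[0]);
    }
}
```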

Indexing

The app indexes some or all of your Amazon S3 buckets, as specified on the command line. It goes through each selected bucket and indexes every object in that bucket. For each object, it stores the following attributes:

  • bucket
  • filename
  • size
  • eTag
  • lastModified

In this section we'll look at simple indexing, threading, and distributed applications.

To begin indexing, run the following command from the dist directory:

java -jar S3Indexer-0.2.jar -action index -allBuckets

You may see this exception if you are running Java 1.6:

Exception in thread "main" java.lang.LinkageError: JAXB 2.0 API is being
loaded from the bootstrap classloader, but this RI 
(from jar:file:S3Indexer-0.2/lib/jaxb-impl.jar!/com/sun/xml/bind/v2/model
/impl/ModelBuilder.class) needs 2.1 API. Use the endorsed directory
mechanism to place jaxb-api.jar in the bootstrap classloader. (See
http://java.sun.com/j2se/1.5.0/docs/guide/standards/)
        at com.sun.xml.bind.v2.model.impl.ModelBuilder.(ModelBuilder.java:135)
        at com.sun.xml.bind.v2.runtime.JAXBContextImpl.getTypeInfoSet(JAXBContextImpl.java:389)
        at com.sun.xml.bind.v2.runtime.JAXBContextImpl.(JAXBContextImpl.java:253)
        at com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:84)
        at com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:66)
        at com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:132)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at javax.xml.bind.ContextFinder.newInstance(Unknown Source)
        at javax.xml.bind.ContextFinder.find(Unknown Source)
        at javax.xml.bind.JAXBContext.newInstance(Unknown Source)
        at javax.xml.bind.JAXBContext.newInstance(Unknown Source)
        at com.amazonaws.queue.AmazonSQSClient.(AmazonSQSClient.java:92)
        at com.amazonaws.referenceapps.s3indexer.Main.(Main.java:72)
        at com.amazonaws.referenceapps.s3indexer.Main.main(Main.java:215)

Resolve the problem by specifying java.endorsed.dirs as follows:

java -Djava.endorsed.dirs=../lib -jar S3Indexer-0.2.jar -action index -allBuckets

To see more examples of options to use while indexing, see the README or the usage output.

Note - If you have a large set of data that changes frequently, you will want to think about how to keep your index up to date. Re-indexing everything every time is unlikely to be efficient, because it costs you both time and money. Ideally, you want to index only those items that are new, deleted, or changed since the last run. To keep this application simple and small, we leave this as an exercise for the developer. One solution may be to update the index whenever you modify an object in Amazon S3.
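
One possible shape for such an incremental update is sketched below. It is not part of the app. It assumes you can obtain two maps from object key to eTag, one from the current Amazon S3 listing and one from the existing index, and that an object's eTag changes whenever its content changes. The class IndexDiff is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical diff between a fresh S3 listing and the existing index. */
final class IndexDiff {
    final Set<String> toIndex = new TreeSet<>();   // new or changed keys
    final Set<String> toDelete = new TreeSet<>();  // keys no longer in S3

    /** Both maps go from object key to eTag. */
    static IndexDiff compute(Map<String, String> current, Map<String, String> indexed) {
        IndexDiff diff = new IndexDiff();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String indexedTag = indexed.get(entry.getKey());
            // Missing from the index (new) or eTag differs (changed).
            if (indexedTag == null || !indexedTag.equals(entry.getValue())) {
                diff.toIndex.add(entry.getKey());
            }
        }
        for (String key : indexed.keySet()) {
            if (!current.containsKey(key)) {
                diff.toDelete.add(key);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        Map<String, String> current = new HashMap<>();
        current.put("a.txt", "tag1");
        current.put("b.txt", "tag2-new");
        Map<String, String> indexed = new HashMap<>();
        indexed.put("a.txt", "tag1");
        indexed.put("b.txt", "tag2-old");
        indexed.put("gone.txt", "tag3");
        IndexDiff diff = compute(current, indexed);
        System.out.println("index: " + diff.toIndex + ", delete: " + diff.toDelete);
    }
}
```

This still lists every object, but it limits writes to Amazon SimpleDB to the items that actually changed.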

Threading

To get the maximum performance from a single machine, indexing must be multithreaded. The app gets a list of objects from Amazon S3, and then hands each object off to a thread so that objects are indexed in Amazon SimpleDB concurrently. The app uses a standard Java concurrency tool, the ExecutorService thread pool. This service lets you create a pool of threads that is reused across objects, avoiding the overhead of spinning up a new thread for each one. Read more at http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/package-summary.html.
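
The pattern can be sketched as follows, using only java.util.concurrent. The ThreadedIndexer class and the ObjectIndexer callback are hypothetical stand-ins for the app's own classes; in the real tool the callback would write one object's attributes to Amazon SimpleDB.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: index a list of keys using a fixed-size thread pool. */
final class ThreadedIndexer {

    /** Stand-in for the call that writes one object's metadata to SimpleDB. */
    interface ObjectIndexer {
        void index(String key) throws Exception;
    }

    /** Returns the number of keys indexed without error. */
    static int indexAll(List<String> keys, int poolSize, final ObjectIndexer indexer) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        final AtomicInteger indexed = new AtomicInteger();
        for (final String key : keys) {
            pool.execute(() -> {
                try {
                    indexer.index(key);
                    indexed.incrementAndGet();
                } catch (Exception e) {
                    System.err.println("Failed to index " + key + ": " + e);
                }
            });
        }
        pool.shutdown(); // accept no new tasks; queued tasks still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return indexed.get();
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a.txt", "b.txt", "c.txt");
        int done = indexAll(keys, 2, key -> System.out.println("indexing " + key));
        System.out.println(done + " objects indexed");
    }
}
```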

The app allows you to specify the size of the thread pool used for indexing using the "-indexPoolSize N" option. The optimal setting depends primarily on the hardware your app is running on, and also to some degree on your data set. It's useful to do some quick tests starting at a low number of threads and increasing the number until you reach a point of diminishing returns. For example, you might find that moving from 4 threads to 8 halves the time it takes to index a portion of your data, but moving from 8 to 16 has little to no effect. Test and run accordingly.
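
A quick way to run such a test outside the app is to time a fixed number of simulated tasks at several pool sizes. The sketch below is illustrative only: PoolSizeBenchmark is a hypothetical class, and the 20 ms sleep stands in for the latency of one write to Amazon SimpleDB. Real measurements should be taken with the app itself and the "-indexPoolSize N" option.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical timing harness for comparing thread pool sizes. */
final class PoolSizeBenchmark {

    /** Runs taskCount copies of task on a pool of poolSize threads; returns elapsed milliseconds. */
    static long timeMillis(int poolSize, int taskCount, Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        long start = System.nanoTime();
        for (int i = 0; i < taskCount; i++) {
            pool.execute(task);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        // Assumption: each indexing call is dominated by about 20 ms of network latency.
        Runnable simulatedPut = () -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        // Double the pool size each round and watch for diminishing returns.
        for (int size = 1; size <= 16; size *= 2) {
            System.out.println(size + " threads: " + timeMillis(size, 64, simulatedPut) + " ms");
        }
    }
}
```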

Distributed applications

If you have maximized the usage of one machine and indexing still takes too long, it is time to make your application distributed. Amazon's web services make this surprisingly easy with Amazon EC2 and the Amazon Simple Queue Service. The app uses Amazon SQS to pass off the task of indexing chunks of an Amazon S3 bucket to individual workers that can be running on other machines. It puts a message on the queue to index a particular set of data, and any machine can pick up that message and do the work. To start a second machine working on the indexing, run:

java -jar S3Indexer-0.2.jar -action index_worker

This raises the question of how granularly you should distribute work in such an application. The app could distribute by bucket alone, i.e. put one message on the queue per bucket and have a distributed worker process that entire bucket. What happens, however, when one bucket contains a million items and another contains 10? At that point your application is effectively not distributed. The 10-item bucket gets processed almost instantly, and then a single machine is left working on the big job while the other worker sits idle.

You want to distribute work evenly, so that over the lifetime of a job all your workers are contributing roughly equally. The app does this by taking advantage of the ability to pull objects from Amazon S3 in batches. It distributes the work by processing buckets 1,000 items at a time. If a bucket has more items to process after a particular batch of 1,000 has been fetched, the app puts a message on the queue to process the next 1,000. Any of the distributed workers is then free to pick up that task.
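
The chaining of batches can be sketched with an in-memory queue standing in for Amazon SQS. Everything here is hypothetical: the BatchChainingDemo class, the IndexTask message, and the use of an integer offset in place of the marker that a real Amazon S3 listing would return. Only the batch size of 1,000 comes from the text above.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch of batch chaining; a local queue stands in for SQS. */
final class BatchChainingDemo {
    static final int BATCH_SIZE = 1000;

    /** One queue message: "index this bucket starting at this position". */
    static final class IndexTask {
        final String bucket;
        final int startOffset; // stand-in for an S3 list marker

        IndexTask(String bucket, int startOffset) {
            this.bucket = bucket;
            this.startOffset = startOffset;
        }
    }

    /** Handles one message; enqueues a follow-up message if the bucket has more items. */
    static int processOne(IndexTask task, Map<String, Integer> bucketSizes, Queue<IndexTask> queue) {
        int total = bucketSizes.get(task.bucket);
        int end = Math.min(task.startOffset + BATCH_SIZE, total);
        // A real worker would index items [startOffset, end) in SimpleDB here.
        if (end < total) {
            queue.add(new IndexTask(task.bucket, end));
        }
        return end - task.startOffset;
    }

    /** Seeds one message per bucket, drains the queue, and returns the message count. */
    static int drain(Map<String, Integer> bucketSizes) {
        Queue<IndexTask> queue = new ArrayDeque<>();
        for (String bucket : bucketSizes.keySet()) {
            queue.add(new IndexTask(bucket, 0));
        }
        int messages = 0;
        while (!queue.isEmpty()) {
            processOne(queue.poll(), bucketSizes, queue);
            messages++;
        }
        return messages;
    }

    public static void main(String[] args) {
        Map<String, Integer> sizes = new LinkedHashMap<>();
        sizes.put("big-bucket", 2500);
        sizes.put("small-bucket", 10);
        System.out.println("Messages processed: " + drain(sizes));
    }
}
```

In this model the large bucket is split across several messages, so any idle worker can take the next batch instead of waiting for one machine to finish the whole bucket.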

It may be tempting to go to the other extreme and be very granular with your distributed jobs. You might want to put a message on the queue to process 10 items from a bucket, or put a message to index an individual object. However, distributing at this level has a cost. In our case, it takes time both to put a message on the queue and to read it off, which could lead to inefficient throughput. So choose with care how distributed you wish to be.

In our case, the application is expected to index many (~1 million) objects, so working in batches of 1,000 objects is granular enough.
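
To make the trade-off concrete, the following sketch counts how many queue messages a job needs at different batch sizes. It assumes all objects sit in a single bucket and one message per batch; the QueueOverhead class is hypothetical.

```java
/** Hypothetical calculation: queue messages needed for a given batch size. */
final class QueueOverhead {

    /** Ceiling of objectCount / batchSize, using integer arithmetic. */
    static long messagesNeeded(long objectCount, int batchSize) {
        return (objectCount + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        long objects = 1_000_000L;
        for (int batchSize : new int[] {1, 10, 1000}) {
            System.out.println("batch size " + batchSize + ": "
                    + messagesNeeded(objects, batchSize) + " messages");
        }
    }
}
```

For one million objects, batches of 1,000 need 1,000 messages, while batches of 10 need 100,000, each of which pays the cost of a queue write and a queue read.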

Using the Index

The second goal of this app is to demonstrate how to retrieve data from SimpleDB. SimpleDB's Select function will return items that match the provided select expression. You can retrieve data from the index by running the app like this:

java -jar S3Indexer-0.2.jar -action select -selectExpression "select * from s3indexerDemo"

You should see output like:

2009-05-08 02:39:33,207 [main] INFO  com.amazonaws.referenceapps.s3indexer.select.SelectRunner - Running select: 'select * from s3indexerDemo'
2009-05-08 02:39:34,343 [main] INFO  com.amazonaws.referenceapps.s3indexer.select.SelectRunner - Got 100 items with 1 requests.
2009-05-08 02:39:34,343 [main] INFO  com.amazonaws.referenceapps.s3indexer.Main - 100 results found.
2009-05-08 02:39:34,344 [main] INFO  com.amazonaws.referenceapps.s3indexer.Main - Done!

If you want to see more information about what is going on under the hood, you can turn up the logging level. At the default level, for example, the app prints only how many results were found, without displaying the results themselves.

To see the results, run the following:

java -jar S3Indexer-0.2.jar -action select -selectExpression "select * from s3indexerDemo" -logLevel debug

To print out the attributes as well as the keys, add the '-attributes' parameter:

java -jar S3Indexer-0.2.jar -action select -selectExpression "select * from s3indexerDemo" -logLevel debug -attributes

Note - The Amazon SimpleDB service returns at most 250 items at a time for a given select request, even if there are more items that match the select expression. By default, the application makes a single request and retrieves only the first batch of 250 results. If you want the application to make multiple requests and get all of the results, add the '-all' parameter to the command line.
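
Fetching every result means repeating the select with the token returned by the previous response until no token comes back. The sketch below shows that loop against a hypothetical SelectClient interface rather than the real client library; the SelectPager class, the Page type, and the method names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of following next tokens until all results are read. */
final class SelectPager {

    /** One response: a batch of item names plus a token, or null on the last page. */
    static final class Page {
        final List<String> items;
        final String nextToken;

        Page(List<String> items, String nextToken) {
            this.items = items;
            this.nextToken = nextToken;
        }
    }

    /** Stand-in for the SimpleDB client; token is null on the first request. */
    interface SelectClient {
        Page select(String expression, String token);
    }

    /** Issues requests until the service stops returning a next token. */
    static List<String> selectAll(SelectClient client, String expression) {
        List<String> all = new ArrayList<>();
        String token = null;
        do {
            Page page = client.select(expression, token);
            all.addAll(page.items);
            token = page.nextToken;
        } while (token != null);
        return all;
    }
}
```

A caller would pass an implementation of SelectClient that wraps the actual service call, for example selectAll(client, "select * from s3indexerDemo").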

Bulk Selects

The application has a Bulk Select mode where it will make many select requests. It uses a thread pool to execute multiple select calls in parallel. Each select call uses a randomly generated select expression.
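
A rough sketch of this mode is shown below. It is not the app's implementation: the BulkSelectSketch class, the SelectClient interface, and the form of the random expressions (a random one-letter prefix match on filename) are assumptions. The domain name s3indexerDemo is taken from the examples above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of running many random selects in parallel. */
final class BulkSelectSketch {

    /** Stand-in for the SimpleDB client; returns the number of items matched. */
    interface SelectClient {
        int select(String expression);
    }

    /** Assumption: vary the expression by matching a random filename prefix. */
    static String randomExpression(Random random) {
        char prefix = (char) ('a' + random.nextInt(26));
        return "select * from s3indexerDemo where filename like '" + prefix + "%'";
    }

    /** Runs the given number of selects on a thread pool; returns the total items matched. */
    static long runBulk(final SelectClient client, int requests, int poolSize, long seed) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Random random = new Random(seed);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < requests; i++) {
            final String expression = randomExpression(random);
            Callable<Integer> task = () -> client.select(expression);
            futures.add(pool.submit(task));
        }
        long total = 0;
        try {
            for (Future<Integer> future : futures) {
                total += future.get(); // blocks until that select finishes
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdownNow();
        }
        return total;
    }
}
```

Timing a call to runBulk with different pool sizes gives a simple measure of select throughput.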

Conclusions

This working application serves as a demonstration of best practices for using Amazon Web Services, and is also a useful application in its own right. The app shows how you can use several Amazon Web Services in concert to create a useful, scalable, high-performing application. Hopefully it helps you do the same.

Further Reading

Would you like to know more about best practices with Amazon Web Services and how to get the last bit of performance out of your code? Read Building for Performance and Reliability with Amazon SimpleDB on the AWS Developer Resource Center.

©2014, Amazon Web Services, Inc. or its affiliates. All rights reserved.