Building GrepTheWeb in the Cloud, Part 2: Best Practices

The second part in the two-part series on building GrepTheWeb on Amazon Web Services discusses lessons along the way.


Submitted By: Jinesh@AWS
AWS Products Used: Amazon EC2, Amazon S3, Amazon SQS, Amazon SimpleDB
Created On: July 15, 2008 11:37 PM GMT
Last Updated: September 21, 2008 9:56 PM GMT

By Jinesh Varia, Amazon Web Services

In the previous section, Building GrepTheWeb in the Cloud, Part 1: Cloud Architectures, we discussed Cloud Architectures and some tips for designing a Cloud Architecture. We also discussed one such Cloud Architecture, the GrepTheWeb application, which uses Hadoop and all of the Amazon infrastructure services (Amazon EC2, Amazon S3, Amazon SQS, Amazon SimpleDB).

In this section, we highlight some of the best practices from the lessons learned during the implementation of GrepTheWeb. For reference, you might want to refer to Figure 2 and Figure 4 from the first part.

Best Practices for Amazon S3

Upload Large Files, Retrieve Small Offsets

End-to-end transfer data rates in Amazon S3 are best when large files are stored instead of many tiny files (sizes in the low KBs). So instead of storing individual files on Amazon S3, multiple files were bundled and compressed (gzip) into a blob (.gz) and then stored on Amazon S3 as a single object. The files were later retrieved (streamed) using a standard HTTP GET request by providing a URL (bucket and key), an offset (byte-range), and a size (byte-length). As a result, the overall cost of storage was reduced: compression shrank the overall size of the dataset, and bundling meant far fewer PUT requests were required than otherwise.
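As a sketch of this technique, the snippet below fetches one small slice out of a large bundled object using an HTTP Range header. The bucket, key, and offsets are hypothetical, and the example assumes boto3 (a client library that post-dates this article) with valid AWS credentials.

```python
def byte_range(offset, length):
    """Build the HTTP Range header value for the slice [offset, offset+length)."""
    return "bytes=%d-%d" % (offset, offset + length - 1)

def fetch_slice(bucket, key, offset, length):
    """Retrieve only `length` bytes of a large S3 object, starting at `offset`.
    Hypothetical wrapper; requires boto3 and AWS credentials to actually run."""
    import boto3
    s3 = boto3.client("s3")
    resp = s3.get_object(Bucket=bucket, Key=key, Range=byte_range(offset, length))
    return resp["Body"].read()
```

For example, `fetch_slice("my-bundles", "batch-42.gz", 1048576, 65536)` would stream a 64 KB slice out of a multi-gigabyte bundle without downloading the whole object.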

Sort the Keys and Then Upload Your Dataset

Amazon S3 reconcilers show performance improvement if the keys are pre-sorted before upload. By running a small script, the keys (URL pointers) were sorted and then uploaded in sorted order to Amazon S3.

Use Multi-threaded Fetching

Instead of fetching objects one by one from Amazon S3, multiple concurrent fetch threads were started within each map task to fetch the objects. However, it is not advisable to spawn 100s of threads because every node has bandwidth constraints. Ideally, users should try slowly ramping up their number of concurrent parallel threads until they find the point where adding additional threads offers no further speed improvement.
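The ramp-up approach above can be sketched with a thread pool whose worker count is a tunable parameter. The `fetch_one` callable is a stand-in for whatever S3 GET wrapper your application uses; only the concurrency pattern is shown here.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(fetch_one, keys, max_workers=8):
    """Fetch many objects concurrently instead of one by one.

    fetch_one(key) -> object body; max_workers is the knob to ramp up
    gradually until adding threads no longer improves throughput."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of `keys` in the results.
        return list(pool.map(fetch_one, keys))
```

In a map task you would call `fetch_all` once per batch of keys, benchmark the elapsed time, and raise `max_workers` only while throughput keeps improving.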

Use Exponential Back-off and Then Retry

A reasonable approach for any application is to retry every failed web service request. What is not obvious is what strategy to use to determine the retry interval. Our recommended approach is to use truncated binary exponential back-off. In this approach, the maximum possible delay doubles after each failed attempt (up to a truncation cap), and the exact time to sleep before the next retry is chosen randomly from that range.

We recommend that you build the exponential back-off, sleep, and retry logic into the error handling code of your client. Exponential back-off reduces the number of requests made to Amazon S3 and thereby reduces the overall cost, while not overloading any part of the system.

Best Practices for Amazon SQS

Store Reference Information in the Message

Amazon SQS is ideal for small short-lived messages in workflows and processing pipelines. To stay within the message size limits it is advisable to store reference information as a part of the message and to store the actual input file on Amazon S3.

In GrepTheWeb, the launch queue message contains the URL of the input file (.dat.gz) which is a small subset of a result set (Million Search results that can have up to 10 million links). Likewise, the shutdown queue message contains the URL of the output file (.dat.gz), which is a filtered result set containing the links which match the regular expression.
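A launch-queue message of this shape can be sketched as a small JSON document that carries only pointers, never the payload itself. The field names and queue URL here are hypothetical, and the boto3 call assumes credentials and a real queue.

```python
import json

def make_launch_message(job_id, input_url):
    """Build a small SQS message body holding only a reference (S3 URL)
    to the real input file, keeping the message well under size limits."""
    return json.dumps({"jobId": job_id, "inputFileUrl": input_url})

def send_launch_message(queue_url, job_id, input_url):
    """Hypothetical sender; requires boto3 and an existing SQS queue."""
    import boto3
    sqs = boto3.client("sqs")
    sqs.send_message(QueueUrl=queue_url,
                     MessageBody=make_launch_message(job_id, input_url))
```

The shutdown-queue message would look the same with an `outputFileUrl` field pointing at the filtered result set on Amazon S3.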

The tables in the original article showed the message format of each queue and its associated user-visible status. For example, a message in the launch queue corresponds to the status "Your request has been queued," while a message in the shutdown queue corresponds to "Results are now available for download from DownloadUrl."

Use Process-oriented Messaging and Document-oriented Messaging

There are two messaging approaches that have worked effectively for us: process-oriented and document-oriented messaging. Process-oriented messaging is defined by the processes or actions in the workflow. The typical approach is to delete the old message from the "from" queue, and then to add a new message with new attributes to the new "to" queue.

Document-oriented messaging happens when one message per user/job thread is passed through the entire system, accumulating different message attributes along the way. This is often implemented using XML/JSON because of its extensible model. Under this approach the message can evolve as it flows through the system, and each component only needs to understand the parts of the message that are relevant to it.

For GrepTheWeb, we decided to use the process-oriented approach.

Take Advantage Of Visibility Timeout Feature

Amazon SQS has a special functionality that is not present in many other messaging systems: when a message is read from the queue it is not automatically deleted; instead, it becomes invisible to other readers of the queue for a period of time. The consumer needs to explicitly delete the message from the queue. If this hasn't happened within a certain period after the message was read, the consumer is considered to have failed and the message re-appears in the queue to be consumed again. This period is the so-called visibility timeout, set when creating the queue. In GrepTheWeb, the visibility timeout is very important because certain processes (such as the shutdown controller) might fail and not respond (e.g., instances would stay up). With the visibility timeout set to a certain number of minutes, another controller thread would pick up the old message and resume the task (of shutting down).
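The semantics above can be illustrated with a toy in-memory model (this is an illustration of the behavior, not the Amazon SQS API): a received message becomes invisible for the timeout window, and reappears unless it is explicitly deleted.

```python
import time

class VisibleQueue:
    """Toy in-memory model of SQS visibility-timeout semantics."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # body -> timestamp until which it is invisible

    def send(self, body):
        self.messages[body] = 0.0  # immediately visible

    def receive(self):
        now = time.monotonic()
        for body, invisible_until in self.messages.items():
            if invisible_until <= now:
                # Reading hides the message but does NOT delete it.
                self.messages[body] = now + self.visibility_timeout
                return body
        return None

    def delete(self, body):
        # Only an explicit delete removes the message for good.
        self.messages.pop(body, None)
```

If the consumer crashes after `receive()` but before `delete()`, the message simply becomes visible again after the timeout, and another controller thread picks it up and resumes the task.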

Best Practices for Amazon SimpleDB

Multithread GetAttributes() and PutAttributes()

In Amazon SimpleDB, domains have items, and items have attributes. Querying Amazon SimpleDB returns a set of items. But often, attribute values are needed to perform a particular task. In that case, a query call is followed by a series of GetAttributes calls to get the attributes of each item in the list. Executed serially, this is slow. To address this, it is highly recommended to multi-thread your GetAttributes calls and run them in parallel. The overall performance increases dramatically (up to 50 times) when run in parallel. In the GrepTheWeb application, this approach helped create more dynamic monthly activity reports.
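The parallel-fan-out pattern can be sketched as below. The `get_attributes` callable is a stand-in for a single SimpleDB GetAttributes call per item (Amazon SimpleDB itself is no longer generally available, so no real client call is shown); only the concurrency structure is the point.

```python
from concurrent.futures import ThreadPoolExecutor

def get_attributes_parallel(get_attributes, item_names, max_workers=32):
    """Issue one get_attributes(item_name) call per item concurrently,
    instead of serially after the query, and return {item_name: attrs}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(item_names, pool.map(get_attributes, item_names)))
```

A query returning 1,000 item names followed by 1,000 serial round-trips pays 1,000 network latencies; with 32 workers the wall-clock cost drops to roughly the latency of 1,000/32 batches.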

Use Amazon SimpleDB in Conjunction With Other Services

Build frameworks, libraries and utilities that use the functionality of two or more services together. For GrepTheWeb, we built a small framework that uses Amazon SQS and Amazon SimpleDB together to externalize appropriate state. For example, all controllers inherit from the BaseController class. The BaseController class's main responsibility is to dequeue the message from the "from" queue, validate the statuses from a particular Amazon SimpleDB domain, execute the function, update the statuses with a new timestamp and status, and put a new message in the "to" queue. The advantage of such a setup is that in the event of a hardware failure, or when a controller instance dies, a new node can be brought up almost immediately and resume the state of operation: on reboot it gets the pending messages back from the Amazon SQS queue and their statuses from Amazon SimpleDB. Externalizing state this way makes the overall system more resilient.
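The BaseController responsibility described above can be sketched as follows. The queue and status-store objects are hypothetical stand-ins for Amazon SQS and Amazon SimpleDB (any objects with `receive`/`send`/`delete` and dict-like access would do); the class name BaseController comes from the article, but this body is an illustrative reconstruction, not the original code.

```python
class BaseController:
    """Dequeue, validate/update status, execute, enqueue.

    `from_queue` and `to_queue` stand in for Amazon SQS queues;
    `status_store` stands in for an Amazon SimpleDB domain."""

    def __init__(self, from_queue, to_queue, status_store):
        self.from_queue = from_queue
        self.to_queue = to_queue
        self.status_store = status_store

    def execute(self, message):
        """Subclasses implement the actual work for this phase."""
        raise NotImplementedError

    def run_once(self):
        message = self.from_queue.receive()
        if message is None:
            return False
        self.status_store[message["jobId"]] = "running"   # externalized state
        result = self.execute(message)
        self.status_store[message["jobId"]] = "done"
        self.to_queue.send(result)
        self.from_queue.delete(message)  # delete only after success
        return True
```

Because the message is deleted only after the status update and the "to" message are in place, a controller that dies mid-task leaves the message to reappear (via the visibility timeout) for a replacement node to pick up.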

Although not used in this design, a common practice is to store actual files as objects on Amazon S3 and to store all the metadata related to the object on Amazon SimpleDB. Also, using an Amazon S3 key to the object as item name in Amazon SimpleDB is a common practice.

Figure 7: Controller Architecture and Workflow

Best Practices for Amazon EC2

Launch Multiple Instances All At Once

Instead of waiting for your EC2 instances to boot up one by one, we recommend that you start all of them at once with a simple run-instances command that specifies the number of instances of each type.
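A single-call launch can be sketched as below; the AMI ID and instance type are placeholders, and the boto3 client (which post-dates this article) requires real credentials to execute.

```python
def fleet_params(image_id, count, instance_type):
    """Parameters for one RunInstances call that starts `count` instances.
    MinCount == MaxCount means: start all of them, or fail the request."""
    return {
        "ImageId": image_id,
        "MinCount": count,
        "MaxCount": count,
        "InstanceType": instance_type,
    }

def launch_fleet(image_id, count, instance_type):
    """Hypothetical launcher; requires boto3 and AWS credentials."""
    import boto3
    ec2 = boto3.client("ec2")
    return ec2.run_instances(**fleet_params(image_id, count, instance_type))
```

One call such as `launch_fleet("ami-12345678", 20, "m1.small")` replaces twenty sequential boot requests, so the whole fleet provisions in parallel.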

Automate As Much As Possible

This is applicable in everything we do and deserves special mention because automation of Amazon EC2 is often ignored. One of the biggest features of Amazon EC2 is that you can provision any number of compute instances by making a simple web service call. Automation empowers the developer to run a dynamic programmable datacenter that expands and contracts based on its needs. For example, automating your build-test-deploy cycle in the form of an Amazon Machine Image (AMI) and then running it automatically on Amazon EC2 every night (using a CRON job) will save a lot of time. By automating the AMI creation process, one can save a lot of time in configuration and optimization.

Add Compute Instances On-The-Fly ("Auto-scale")

With Amazon EC2, we can fire up a node within minutes. Hadoop supports the dynamic addition of new nodes and task tracker nodes to a running cluster. One can simply launch new compute instances and start Hadoop processes on them, point them to the master and dynamically grow (and shrink) the cluster in real-time to speed up the overall process.

Safeguard Your AWS credentials When Bundling an AMI

If your AMI is running processes that need to communicate with other AWS web services (for polling the Amazon SQS queue or for reading objects from Amazon S3), one common design mistake is embedding the AWS credentials in the AMI. Instead of embedding the credentials, they should be passed in as arguments using the parameterized launch feature and encrypted before being sent over the wire. General steps are:

  1. Generate a new RSA keypair (use OpenSSL tools).
  2. Copy the private key onto the image before you bundle it (so it will be embedded in the final AMI).
  3. Post the public key along with the image details, so users can use it.
  4. When a user launches the image, they must first encrypt their AWS secret key (or private key, if you want to use SOAP) with the public key you gave them in step 3. This encrypted data should be injected via user-data at launch (i.e., the parameterized launch feature).
  5. Your image can then decrypt this at boot time and use it to decrypt the data required to contact Amazon S3. Also be sure to delete this private key upon reboot, before installing the SSH key (i.e., before users can log into the machine). If users won't have root access then you don't have to delete the private key; just make sure it's not readable by users other than root.


In the first section, we discussed the business advantages of Cloud Architectures and different types of Cloud Architectures.

Best practices are subject to the application requirements. In this article, we saw some of the general best practices that might help you if you are building an application similar to GrepTheWeb.

Further Reading

Amazon SimpleDB White Papers
Amazon SQS White paper
Hadoop Wiki
Hadoop Website
Distributed Grep Examples
Map Reduce Paper

©2017, Amazon Web Services, Inc. or its affiliates. All rights reserved.