Plat_Forms Contest in Germany – Not Your Typical Hackathon

We're excited to work with the Plat_Forms programming contest, an effort organized at the Freie Universität in Berlin this spring.


Focus on Scalability & Cloud Architectures
The Plat_Forms contest has been running in Germany since 2007. Its hallmark is celebrating the diversity and strength of various development languages (Java EE, .NET, PHP, Perl, Python, Ruby, etc.). This year, the focus of the contest is entirely on cloud computing and scalability. Ulrich Stärk, the program organizer, explained: "AWS is technology agnostic, thus allowing for a fair comparison of the individual platforms. Apart from that, AWS is the market leader in IaaS services, thus making it a perfect choice for us since we can expect that a lot of developers have either used AWS already or are interested in trying it now."

The Coding Challenge
Unlike typical hackathons or programming contests where developers can enter with radically different apps, Plat_Forms is all about giving everyone the same coding challenge. The organizers have established a set of requirements for the contest task and plan to evaluate all entries holistically against criteria such as application usability and robustness.

Some of the requirements for the submission include:

  • It will be a web-based application, with a simple RESTful web service interface.
  • It will have challenging Service Level Agreement (SLA) requirements such as number of concurrent users it needs to support, guaranteed response times, fault tolerance, etc.
  • It will be some kind of messaging service, where users can send each other messages.
  • It will require persistent storage of data.
  • It may require integration with external systems or data sources, but using simple and standard kinds of mechanisms only (such as HTTP/REST).
  • It will require scaling by operating on multiple nodes (computers), which must use a stateless operation mode (i.e., if a single node fails, the application as a whole must not fail and must not lose data).
  • It must run completely on the Amazon Web Services infrastructure.

Bonus!
For Germany-based programmers entering the contest, there is a bonus prize of $1,000 in AWS credits for the winning teams, on top of the prestige and prizes provided by the contest itself. The deadline for entering has just been extended to March 16, 2012, and the actual coding challenge will take place in April.

To learn more (guidelines, judging criteria, process etc.), please visit the Plat_Forms website and let us know if you enter! :)

-rodica

Running Esri Applications in the AWS Cloud

Esri is a leading provider of Geographic Information Systems (GIS) software and geo-database management applications. Their powerful mapping solutions have been used by Governments, industry, academics, and NGOs for nearly 30 years (read their history to learn more).

Over the years, they have ported their code from mainframes to minicomputers (anyone else remember DEC and Prime?), and then to personal computers. Now they are moving their applications to the AWS cloud to provide their customers with applications that can be launched quickly and then scaled as needed, all at a cost savings compared to traditional on-premises hosting.

Watch this video to learn more about this potent combination:

We will be participating in the Esri Federal GIS Conference in Washington, DC this month; please visit the AWS Federal page for more information about this and other events.

On that page you will also find case studies from several AWS users in the Federal Government including the Recovery Accountability and Transparency Board, the US Department of Treasury, the DOE’s National Renewable Energy Laboratory, the US Department of State, the US Department of Agriculture, the NASA Jet Propulsion Laboratory, and the European Space Agency.

On March 21, we will run a free Esri on the Cloud webinar at noon EST. Attend the webinar to learn how to use AWS to process GIS jobs faster and at a lower cost than an on-premises solution. Our special guest will be Shawn Kingsberry, CIO of the US Government's Recovery Accountability and Transparency Board.

— Jeff;

Pulse – Using Big Data Analytics to Drive Rich User Features

It's always exciting to find out that an app that has changed how I consume news and blog content on my mobile devices is using AWS to power some of its most engaging features. Such is the case with Pulse, a visual news reading app for iPhone, iPad, and Android. Pulse uses Amazon Elastic MapReduce, our hosted Hadoop product, to analyze data from over 11 million users and to deliver the best news stories from a variety of content publishers. Born out of a Stanford launchpad class and recognized by Apple for its elegant design at WWDC 2011, the Pulse app blends a strong high-tech backend with great visual appeal to win over mobile news readers everywhere.

Pulse backend team members from left to right: Simon, Lili, Greg, Leonard

The December 2011 update included a new feature called Smart Dock, which uses Hadoop and a tool called mrjob, developed by Yelp, to analyze users' reading preferences and continuously recommend other articles or sources they might enjoy.
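Pulse hasn't shared the Smart Dock job itself, but to give a flavor of what an mrjob analysis over client event logs looks like, here is a minimal, hypothetical sketch. The log layout, field names, and logic are our own illustration, not Pulse's code; it simply counts article reads per user and source, the kind of aggregate a recommender could build on.

# Hypothetical mrjob sketch: count article reads per (user, source) from
# tab-separated client event logs. The field layout is assumed for illustration.
from mrjob.job import MRJob

class SourceReadCounts(MRJob):

    def mapper(self, _, line):
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: user_id <TAB> event_type <TAB> source_id ...
        if len(fields) >= 3 and fields[1] == "article_read":
            user_id, _event, source_id = fields[:3]
            yield (user_id, source_id), 1

    def reducer(self, key, counts):
        yield key, sum(counts)

if __name__ == "__main__":
    SourceReadCounts.run()

You could test a job like this locally on a sample log file and then run it at scale with mrjob's -r emr runner.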

To understand the engineering that goes into such rich customer features, I spoke to Greg Bayer, Backend Engineering Lead at Pulse:

How big is the big data that Pulse analyzes every day? 

Our application relies on accurately analyzing client event logs (as opposed to web logs) to extract trends and enable other rich features for our users. To give you a sense of the scale at which we run these analyses, we go through millions of events per hour, which translates to as many as 250+ Amazon Elastic MapReduce nodes on any given day. Since we are dealing with event logs generated by our users from the various platforms on which they access our app (Android, iPhone, iPad, etc.), our logs grow in proportion to our user base. For example, the recent influx of new users from the Kindle Fire (Android) means we now have a lot more logs coming in from those devices. Also, since the logs are big, we've found that it is very efficient to write them to disk as fast as possible, directly from devices to Amazon EC2 (see my tandem article on the logging architecture we use and the graph below, which highlights some of our numbers).

For more Pulse numbers, check out the full infographic.

Powering Rich Features for Our Users

Much of our backend is built on industry-standard systems such as Hadoop. The innovation happens in how we leverage these systems to create value. For us, it's all about how we can make the app more fun to use and provide rich features that our users will love. Techies can read about many of these features, in full detail, in the backend section of the Pulse engineering blog.

The Right Choice for Big Data

I joined the team here pretty early on as the first backend engineer. I came to Pulse after working at Sandia National Labs, where I built and managed an in-house 70-node Hadoop cluster. That was an investment of over $100,000, plus ongoing operational support and more than six months of work to get it fully tuned. Needless to say, I was fully aware of the cost and resources needed to run something at the scale that Pulse would need to accommodate.

AWS was and still is the only feasible solution for us. I love the flexibility to quickly stand up a cluster of hundreds of nodes and the added flexibility of choosing the pricing scheme that's right for a job. If I need a job done faster, I can always spin up a very large cluster and get results in minutes, or take advantage of smaller instances and the Spot marketplace for Amazon Elastic MapReduce if I'm looking to complete a job that's not time-sensitive. Since an Amazon Elastic MapReduce cluster can simply be turned off when we are done, the cost to run big queries is usually quite reasonable. Consider a cluster of 100 m1.large machines: a set of queries that takes 45 minutes to run on this cluster could cost us approximately $11 to $34, depending on whether we bid on Spot Instances or use regular On-Demand Instances.
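To see roughly how an estimate like that comes together, here is our own back-of-the-envelope calculation; the per-hour prices below are placeholder values chosen only to land near the quoted range, not Pulse's actual rates.

# Back-of-the-envelope EMR cost estimate with placeholder prices.
nodes = 100
billed_hours = 1                   # a 45-minute run is billed as a full instance-hour
emr_surcharge = 0.06               # assumed EMR fee per m1.large instance-hour
spot_rate = 0.05                   # assumed m1.large Spot price per hour
on_demand_rate = 0.28              # assumed m1.large On-Demand price per hour

spot_total = nodes * billed_hours * (spot_rate + emr_surcharge)            # ~$11
on_demand_total = nodes * billed_hours * (on_demand_rate + emr_surcharge)  # ~$34
print(f"Spot: ~${spot_total:.0f}, On-Demand: ~${on_demand_total:.0f}")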

Lessons Learned (the bold formatting below is our doing :) )

It is important to consider the trade-offs and choose the right tool for the job. In our experience, AWS provides an exceptional capability to build systems as close to the metal as you like, while still avoiding the burden and inelasticity of owning your own hardware. It also provides some useful abstraction layers and services above the machine level.

By allowing virtual machines (Amazon EC2 instances) to be provisioned quickly and inexpensively, a small engineering team can stay more focused on the development of key product features. Since stopping and starting these instances is painless, it's easy to adapt quickly to changing engineering or business needs, whether scaling up to support 10x more users or shutting down a feature after pivoting a business model.

AWS also provides many other useful services that help save engineering time. Many standard systems, such as load balancers or Hadoop clusters, that normally require significant time and specialized knowledge to deploy, can be deployed automatically on Amazon EC2 for almost no setup or maintenance cost.

Simple, but powerful services like Amazon S3 and the newly released Amazon DynamoDB make building complex features on AWS even easier. Because bandwidth is fast and free between all AWS services, plugging together several of these services is a great way to bootstrap a scalable infrastructure.

Thanks for your time, Greg & best of luck to the Pulse team! 

-rodica

Related: Pulse Engineering – Scaling to 10M on AWS

Be Careful When Comparing AWS Costs…

Earlier today, GigaOM published a cost comparison of self-hosting vs. hosting on AWS. I wanted to bring to your attention a few quick issues that we saw with this analysis:

Lower Costs in Other AWS Regions – The comparison used the AWS costs for the US West (Northern California) Region, ignoring the fact that EC2 pricing in the US East (Northern Virginia) and US West (Oregon) Regions is lower ($0.76 vs. $0.68 for On-Demand Extra Large Instances).

Three Year Reserved Instances – The comparison used one year Reserved Instances, but a three year amortization schedule for the self-hosted hardware. Using three year Reserved Instances instead of one year Reserved Instances saves over 22% and makes the comparison closer to apples-to-apples.

Heavy Utilization Reserved Instances – The comparison used a combination of Medium Utilization Reserved Instances and On-Demand Instances. Given the predictable traffic pattern in the original post, a blend of Heavy and Light Utilization Reserved Instances would reduce your costs, and still give you the flexibility to easily scale up and scale down that you don’t get with traditional hosting.

Load Balancer (and other Networking) Costs – The self-hosted column does not include the cost of a redundant set of load balancers. They also need top-of-rack switches (to handle what is probably 5 racks worth of servers) and a router.

No Administrative Costs – Although the self-hosted model specifically excludes maintenance and administrative costs, it is not reasonable to assume that none of the self-hosted hardware will fail in the course of the three year period. It is also dangerous to assume that labor costs will be the same in both cases, as labor can be a significant expense when you are self-hosting.

Data Transfer Costs – The self-hosted example assumes a commit of over 4 Gbps of bandwidth capacity. If you have ever contracted for bandwidth & connectivity at this scale, you undoubtedly know that you must actually commit to a certain amount of data transfer, and that your costs will change significantly if you are over or under your commitment.

We did our own calculations, taking into account only the first four issues listed above, and came up with a monthly cost for AWS of $56,043 (vs. the $70,854 quoted in the article). Obviously, every workload is different, depending on which resources are used most heavily.

These analyses are always tricky to do; you need to make apples-to-apples cost comparisons and weigh the benefits associated with each approach. We're always happy to work with those wanting to get into the details of these analyses; we continue to focus on lowering infrastructure costs, and we're far from being done.

— Jeff;

New Elastic MapReduce Features: Metrics, Updates, VPC, and Cluster Compute Support (Guest Post)

Today’s guest blogger is Adam Gray. Adam is a Product Manager on the Elastic MapReduce Team.

— Jeff;


We're always excited when we can bring our customers features that make it easier for them to derive value from their data, so it's been a fun month for the EMR team. Here is a sampling of the things we've been working on.

Free CloudWatch Metrics
Starting today, customers can view graphs of 23 job flow metrics within the EMR Console by selecting the Monitoring tab on the Job Flow Details page. These metrics are pushed to CloudWatch every five minutes at no cost to you and include information on:

  • Job flow progress including metrics on the number of map and reduce tasks running and remaining in your job flow and the number of bytes read and written to S3 and HDFS.
  • Job flow contention including metrics on HDFS utilization, map and reduce slots open, jobs running, and the ratio between map tasks remaining and map slots.
  • Job flow health including metrics on whether your job flow is idle, if there are missing data blocks, and if there are any dead nodes.

Please watch this video to see how to view CloudWatch graphs in the EMR Console:

You can also learn more from the Viewing CloudWatch Metrics section of the EMR Developer Guide.

You can view the new metrics in the AWS Management Console:

Further, through the CloudWatch Console, API, or SDK you can set alarms to be notified via SNS if any of these metrics go outside of specified thresholds. For example, you can receive an email notification whenever a job flow is idle for more than 30 minutes, HDFS Utilization goes above 80%, or there are five times as many remaining map tasks as there are map slots, indicating that you may want to expand your cluster size.

Please watch this video to see how to set EMR alarms through the CloudWatch Console:
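If you prefer to script your alarms rather than use the console, here is a rough sketch using boto3, the AWS SDK for Python, of the 30-minute idle alarm described above; the job flow ID and SNS topic ARN are placeholders.

# Illustrative only: alarm when an EMR job flow has been idle for about 30 minutes.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="emr-job-flow-idle",
    Namespace="AWS/ElasticMapReduce",
    MetricName="IsIdle",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXX"}],   # placeholder
    Statistic="Average",
    Period=300,                       # metrics arrive every five minutes
    EvaluationPeriods=6,              # six five-minute periods, roughly 30 minutes
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:emr-alerts"],  # placeholder SNS topic
)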

Hadoop 0.20.205, Pig 0.9.1, and AMI Versioning
EMR now supports running your job flows using Hadoop 0.20.205 and Pig 0.9.1. To simplify the upgrade process, we have also introduced the concept of AMI versions. You can now provide a specific AMI version to use at job flow launch or specify that you would like to use our latest AMI, ensuring that you are always using our most up-to-date features. The following AMI versions are now available:

  • Version 2.0.x: Hadoop 0.20.205, Hive 0.7.1, Pig 0.9.1, Debian 6.0.2 (Squeeze)
  • Version 1.0.x: Hadoop 0.18.3 and 0.20.2, Hive 0.5 and 0.7.1, Pig 0.3 and 0.6, Debian 5.0 (Lenny)

You can specify an AMI version when launching a job flow in the Ruby CLI using the --ami-version argument (note that you will have to download the latest version of the Ruby CLI):

$ ./elastic-mapreduce --create --alive --name "Test AMI Versioning" --ami-version latest --num-instances 5 --instance-type m1.small

Please visit the AMI Versioning section of the Elastic MapReduce Developer Guide for more information.

S3DistCp for Efficient Copy between S3 and HDFS
We have also made available S3DistCp, an extension of the open source Apache DistCp tool for distributed data copy, that has been optimized to work with Amazon S3. Using S3DistCp, you can efficiently copy large amounts of data between Amazon S3 and HDFS on your Amazon EMR job flow or copy files between Amazon S3 buckets. During data copy you can also optimize your files for Hadoop processing. This includes modifying compression schemes, concatenating small files, and creating partitions.

For example, you can load Amazon CloudFront logs from S3 into HDFS for processing while simultaneously modifying the compression format from Gzip (the Amazon CloudFront default) to LZO and combining all the logs for a given hour into a single file. As Hadoop jobs are more efficient processing a few, large, LZO-compressed files than processing many, small, Gzip-compressed files, this can improve performance significantly.
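As a rough illustration of how that CloudFront example might be submitted as a step on a running job flow using boto3, the AWS SDK for Python: the jar location, job flow ID, bucket names, and groupBy pattern below are placeholders, so check the S3DistCp documentation for the exact arguments.

# Illustrative only: add an S3DistCp step to an existing job flow.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXX",                      # placeholder job flow ID
    Steps=[{
        "Name": "CloudFront logs to hourly LZO files in HDFS",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar",  # assumed jar location
            "Args": [
                "--src", "s3://my-log-bucket/cloudfront/",
                "--dest", "hdfs:///local/cloudfront/",
                "--groupBy", ".*(2012-..-..-..).*",   # group objects from the same hour into one file
                "--outputCodec", "lzo",
            ],
        },
    }],
)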

Please see Distributed Copy Using S3DistCp in the Amazon Elastic MapReduce documentation for more details and code examples.

cc2.8xlarge Support
Amazon Elastic MapReduce also now supports the new Amazon EC2 Cluster Compute instance, Cluster Compute Eight Extra Large (cc2.8xlarge). Like other Cluster Compute instances, cc2.8xlarge instances are optimized for high performance computing, giving customers very high CPU capabilities and the ability to launch instances within a high bandwidth, low latency, full bisection bandwidth network. cc2.8xlarge instances provide customers with more than 2.5 times the CPU performance of the first Cluster Compute instance (cc1.4xlarge), more memory, and more local storage at a very compelling cost. Please visit the Instance Types section of the Amazon Elastic MapReduce detail page for more details.

In addition, we are pleased to announce an 18% reduction in Amazon Elastic MapReduce pricing for cc1.4xlarge instances, dropping the total per hour cost to $1.57. Please visit the Amazon Elastic MapReduce Pricing Page for more details.

VPC Support
Finally, we are excited to announce support for running job flows in an Amazon Virtual Private Cloud (Amazon VPC), making it easier for customers to:

  • Process sensitive data – Launching a job flow on Amazon VPC is similar to launching the job flow on a private network and provides additional tools, such as routing tables and Network ACLs, for defining who has access to the network. If you are processing sensitive data in your job flow, you may find these additional access control tools useful.
  • Access resources on an internal network – If your data is located on a private network, it may be impractical or undesirable to regularly upload that data into AWS for import into Amazon Elastic MapReduce, either because of the volume of data or because of its sensitive nature. Now you can launch your job flow on an Amazon VPC and connect to your data center directly through a VPN connection.

You can launch Amazon Elastic MapReduce job flows into your VPC through the Ruby CLI by using the --subnet argument and specifying the subnet identifier (note that you will have to download the latest version of the Ruby CLI):

$ ./elastic-mapreduce --create --alive --subnet "subnet-identifier"

Please visit the Running Job Flows on an Amazon VPC section in the Elastic MapReduce Developer Guide for more information.

— Adam Gray, Product Manager, Amazon Elastic MapReduce.

New Tagging for Auto Scaling Groups

You can now add up to 10 tags to any of your Auto Scaling Groups. You can also, if you’d like, propagate the tags to the EC2 instances launched from your groups.

Adding tags to your Auto Scaling groups will make it easier for you to identify and distinguish them.

Each tag has a name, a value, and an optional propagation flag. If the flag is set, the corresponding tag will be applied to EC2 instances launched from the group. You can use this feature to label or distinguish instances created by distinct Auto Scaling groups. You might be using multiple groups to support multiple scalable applications, or multiple scalable tiers or components of a single application. Either way, the tags can help you keep your instances straight.
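Here is a rough sketch of tagging a group and propagating the tag to its instances using boto3, the AWS SDK for Python; the group name, key, and value are placeholders.

# Illustrative only: add a propagating tag to an existing Auto Scaling group.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_or_update_tags(
    Tags=[{
        "ResourceId": "web-tier-group",          # placeholder Auto Scaling group name
        "ResourceType": "auto-scaling-group",
        "Key": "application",
        "Value": "storefront",
        "PropagateAtLaunch": True,               # copy the tag to EC2 instances launched from the group
    }]
)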

Read more in the newest version of the Auto Scaling Developer Guide.

— Jeff;

AWS HowTo: Using Amazon Elastic MapReduce with DynamoDB (Guest Post)

Today’s guest blogger is Adam Gray. Adam is a Product Manager on the Elastic MapReduce Team.

— Jeff;


Apache Hadoop and NoSQL databases are complementary technologies that together provide a powerful toolbox for managing, analyzing, and monetizing Big Data. That's why we were so excited to provide out-of-the-box Amazon Elastic MapReduce (Amazon EMR) integration with Amazon DynamoDB, giving customers an integrated solution that eliminates the often prohibitive costs of administration, maintenance, and upfront hardware. Customers can now move vast amounts of data into and out of DynamoDB, as well as perform sophisticated analytics on that data, using EMR's highly parallelized environment to distribute the work across the number of servers of their choice. Further, as EMR uses a SQL-based engine for Hadoop called Hive, you need to know only basic SQL while we handle distributed application complexities such as estimating ideal data splits based on hash keys, pushing appropriate filters down to DynamoDB, and distributing tasks across all the instances in your EMR cluster.

In this article, I'll demonstrate how EMR can be used to efficiently export DynamoDB tables to S3, import S3 data into DynamoDB, and perform sophisticated queries across tables stored in both DynamoDB and other storage services such as S3.

We will also use sample product order data stored in S3 to demonstrate how you can keep current data in DynamoDB while storing older, less frequently accessed data in S3. By exporting your rarely used data to Amazon S3 you can reduce your storage costs while preserving the low latency access required for high velocity data. Further, exported data in S3 is still directly queryable via EMR (and you can even join your exported tables with current DynamoDB tables).

The sample order data uses the schema below. This includes Order ID as its primary key, a Customer ID field, an Order Date stored as the number of seconds since the epoch, and a Total representing the total amount spent by the customer on that order. The data also has folder-based partitioning by both year and month; you'll see why in a bit.

Creating a DynamoDB Table
Let's create a DynamoDB table for the month of January 2012 named Orders-2012-01. We will specify Order ID as the Primary Key. Using a table for each month makes it much easier to export data and delete tables over time once they no longer require low latency access.

For this sample, a read capacity and a write capacity of 100 units should be more than sufficient. When setting these values you should keep in mind that the larger the EMR cluster the more capacity it will be able to take advantage of. Further, you will be sharing this capacity with any other applications utilizing your DynamoDB table.
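If you would rather script the table creation than click through the console, here is an equivalent sketch using boto3, the AWS SDK for Python; it is shown for illustration only and the walkthrough does not depend on it.

# Create the monthly orders table with 100 read and 100 write capacity units.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="Orders-2012-01",
    AttributeDefinitions=[{"AttributeName": "Order ID", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "Order ID", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 100},
)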

Launching an EMR Cluster
Please follow Steps 1-3 in the EMR for DynamoDB section of the Elastic MapReduce Developer Guide to launch an interactive EMR cluster and SSH to its Master Node to begin submitting SQL-based queries. Note that we recommend you use at least three instances of m1.large size for this sample.

At the hadoop command prompt for the current master node, type hive. You should see a hive prompt: hive>

As no other applications will be using our DynamoDB table, let's tell EMR to use 100% of the available read throughput (by default it will use 50%). Note that this can adversely affect the performance of other applications simultaneously using your DynamoDB table and should be set cautiously.

SET dynamodb.throughput.read.percent=1.0;

Creating Hive Tables
Outside data sources are referenced in your Hive cluster by creating an EXTERNAL TABLE. First, let's create an EXTERNAL TABLE for the exported order data in S3. Note that this simply creates a reference to the data; no data is moved yet.

CREATE EXTERNAL TABLE orders_s3_export ( order_id string, customer_id string, order_date int, total double )
PARTITIONED BY (year string, month string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://elastic-mapreduce/samples/ddb-orders';

You can see that we specified the data location, the ordered data fields, and the folder-based partitioning scheme.

Now let's create an EXTERNAL TABLE for our DynamoDB table.

CREATE EXTERNAL TABLE orders_ddb_2012_01 ( order_id string, customer_id string, order_date bigint, total double )
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES (
"dynamodb.table.name" = "Orders-2012-01",
"dynamodb.column.mapping" = "order_id:Order ID,customer_id:Customer ID,order_date:Order Date,total:Total"
);

This is a bit more complex. We need to specify the DynamoDB table name, the DynamoDB storage handler, the ordered fields, and a mapping between the EXTERNAL TABLE fields (which can't include spaces) and the actual DynamoDB fields.

Now we're ready to start moving some data!

Importing Data into DynamoDB
In order to access the data in our S3 EXTERNAL TABLE, we first need to specify which partitions we want in our working set via the ADD PARTITION command. Let's start with the data for January 2012.

ALTER TABLE orders_s3_export ADD PARTITION (year='2012', month='01');

Now if we query our S3 EXTERNAL TABLE, only this partition will be included in the results. Let's load all of the January 2012 order data into our external DynamoDB table. Note that this may take several minutes.

INSERT OVERWRITE TABLE orders_ddb_2012_01
SELECT order_id, customer_id, order_date, total
FROM orders_s3_export ;

Looks a lot like standard SQL, doesn't it?

Querying Data in DynamoDB Using SQL
Now let's find the top 5 customers by spend over the first week of January. Note the use of unix_timestamp, as order_date is stored as the number of seconds since the epoch.

SELECT customer_id, sum(total) spend, count(*) order_count
FROM orders_ddb_2012_01
WHERE order_date >= unix_timestamp('2012-01-01', 'yyyy-MM-dd')
AND order_date < unix_timestamp('2012-01-08', 'yyyy-MM-dd')
GROUP BY customer_id
ORDER BY spend desc
LIMIT 5 ;

Querying Exported Data in S3
It looks like customer c-2cC5fF1bB was the biggest spender that week. Now let's query our historical data in S3 to see what that customer spent in each of the final 6 months of 2011. First, though, we have to bring the additional data into our working set. The RECOVER PARTITIONS command makes it easy to add all of the partitions available at the table's S3 location:

ALTER TABLE orders_s3_export RECOVER PARTITIONS;

We will now query the 2011 exported data for customer c-2cC5fF1bB from S3. Note that the partition fields, both month and year, can be used in your Hive query.

SELECT year, month, customer_id, sum(total) spend, count(*) order_count
FROM orders_s3_export
WHERE customer_id = 'c-2cC5fF1bB'
AND month >= 6
AND year = 2011
GROUP BY customer_id, year, month
ORDER by month desc;

Exporting Data to S3
Now let's export the January 2012 DynamoDB table data to a different S3 bucket owned by you (denoted by YOUR BUCKET in the command). We'll first need to create an EXTERNAL TABLE for that S3 bucket. Note that we again partition the data by year and month.

CREATE EXTERNAL TABLE orders_s3_new_export ( order_id string, customer_id string, order_date int, total double )
PARTITIONED BY (year string, month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://YOUR BUCKET';

Now export the data from DynamoDB to S3, specifying the appropriate partition values for that table's month and year.

INSERT OVERWRITE TABLE orders_s3_new_export
PARTITION (year='2012', month='01')
SELECT * from orders_ddb_2012_01;

Note that if this were the end of a month and you no longer needed low latency access to that table's data, you could also delete the table in DynamoDB. You may also now want to terminate your job flow from the EMR console to ensure you do not continue to be charged.

That's it for now. Please visit our documentation for more examples, including how to specify the format and compression scheme for your exported files.

— Adam Gray, Product Manager, Amazon Elastic MapReduce.

The AWS Storage Gateway – Integrate Your Existing On-Premises Applications with AWS Cloud Storage

Warning: If you don’t have a data center, or if all of your IT infrastructure is already in the cloud, you may not need to read this post! But feel free to pass it along to your friends and colleagues.

The Storage Gateway
Our new AWS Storage Gateway service connects an on-premises software appliance with cloud-based storage to integrate your existing on-premises applications with the AWS storage infrastructure in a seamless, secure, and transparent fashion. Watch this video for an introduction:

Data stored in your current data center can be backed up to Amazon S3, where it is stored as Amazon EBS snapshots. Once there, you will benefit from S3’s low cost and intrinsic redundancy. In the event you need to retrieve a backup of your data, you can easily restore these snapshots locally to your on-premises hardware. You can also access them as Amazon EBS volumes, enabling you to easily mirror data between your on-premises and Amazon EC2-based applications.

You can install the AWS Storage Gateway’s software appliance on a host machine in your data center. Here’s how all of the pieces fit together:

 

The AWS Storage Gateway allows you to create storage volumes and attach these volumes as iSCSI devices to your on-premises application servers. The volumes can be Gateway-Stored (right now) or Gateway-Cached (soon) volumes. Gateway-Stored volumes retain a complete copy of the volume on the local storage attached to the on-premises host, while uploading backup snapshots to Amazon S3. This provides low-latency access to your entire data set while providing durable off-site backups. Gateway-Cached volumes will use the local storage as a cache for frequently-accessed data; the definitive copy of the data will live in the cloud. This will allow you to offload your storage to Amazon S3 while preserving low-latency access to your active data.

Gateways can connect to AWS directly or through a local proxy. You can connect through AWS Direct Connect if you would like, and you can also control the amount of inbound and outbound bandwidth consumed by each gateway. All data is compressed prior to upload.

Each gateway can support up to 12 volumes and a total of 12 TB of storage. You can have multiple gateways per account and you can choose to store data in our US East (Northern Virginia), US West (Northern California), US West (Oregon), EU (Ireland), Asia Pacific (Singapore), or Asia Pacific (Tokyo) Regions.

The first release of the AWS Storage Gateway takes the form of a VM image for VMware ESXi 4.1 (we plan on supporting other virtual environments in the future). Adequate local disk storage, either Direct Attached or SAN (Storage Area Network), is needed for your application storage (used by your iSCSI storage volumes) and working storage (data queued up for writing to AWS). We currently support mounting of our iSCSI storage volumes using the Microsoft Windows and Red Hat iSCSI Initiators.

Up and Running
During the installation and configuration process you will be able to create up to 12 iSCSI storage volumes per gateway. Once installed, each gateway will automatically download, install, and deploy updates and patches. This activity takes place during a maintenance window that you can set on a per-gateway basis.

The AWS Management Console includes complete support for the AWS Storage Gateway. You can create volumes, create and restore snapshots, and establish a schedule for snapshots. Snapshots can be scheduled at 1, 2, 4, 8, 12, or 24 hour intervals. Each gateway reports a number of metrics to Amazon CloudWatch for monitoring.
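For those who want to automate the snapshot schedule instead of setting it in the console, here is a rough sketch using boto3, the AWS SDK for Python; the volume ARN is a placeholder and the schedule must use one of the supported intervals.

# Illustrative only: snapshot a gateway volume every 12 hours, starting at 02:00.
import boto3

storagegateway = boto3.client("storagegateway", region_name="us-east-1")

storagegateway.update_snapshot_schedule(
    VolumeARN="arn:aws:storagegateway:us-east-1:123456789012:gateway/sgw-12345678/volume/vol-12345678",
    StartAt=2,               # hour of day, in the gateway's time zone
    RecurrenceInHours=12,    # supported intervals: 1, 2, 4, 8, 12, or 24
)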

The snapshots are stored as Amazon EBS (Elastic Block Store) snapshots. You can create an EBS volume using a snapshot of one of your local gateway volumes, or vice versa. Does this give you any interesting ideas?

The Gateway in Action
I expect the AWS Storage Gateway will be put to use in all sorts of ways. Some that come to mind are:

  • Disaster Recovery and Business Continuity – You can reduce your investment in hardware set aside for Disaster Recovery using a cloud-based approach. You can send snapshots of your precious data to the cloud on a regular and frequent basis and you can use our VM Import service to move your virtual machine images to the cloud.
  • Backup – You can back up local data to the cloud without worrying about running out of storage space. It is easy to schedule the backups, and you don’t have to arrange to ship tapes off-site or manage your own infrastructure in a second data center.
  • Data Migration – You can now move data from your data center to the cloud, and back, with ease.

Security Considerations
We believe that the AWS Storage Gateway will be at home in the enterprise, so I’ll cover the inevitable security questions up front. Here are the facts:

  • Data traveling between AWS and each gateway is protected via SSL.
  • Data at rest (stored in Amazon S3) is encrypted using AES-256.
  • The iSCSI initiator authenticates itself to the target using CHAP (Challenge-Handshake Authentication protocol).

Costs
All AWS users are eligible for a free trial of the AWS Storage Gateway. After that, there is a charge of $125 per month for each activated gateway. The usual EBS snapshot storage rates apply ($0.14 per Gigabyte-month in the US-East Region), as do the usual AWS prices for outbound data transfer (there’s no charge for inbound data transfer). More pricing information can be found on the Storage Gateway Home Page. If you are eligible for the AWS Free Usage Tier, you get up to 1 GB of free EBS snapshot storage per month as well as 15 GB of outbound data transfer.

On the Horizon
As I mentioned earlier, the first release of the AWS Storage Gateway supports Gateway-Stored volumes. We plan to add support for Gateway-Cached volumes in the coming months.

We’ll add more features to our roadmap as soon as our users (this means you) start to use the AWS Storage Gateway and send feedback our way.

Learn More
You can visit the Storage Gateway Home Page or read the Storage Gateway User Guide to learn more.

We will be hosting a Storage Gateway webinar on Thursday, February 23rd. Please attend if you would like to learn more about the Storage Gateway and how it can be used for backup, disaster recovery, and data mirroring scenarios. The webinar is free and open to all, but space is limited and you need to register!

— Jeff;

Launch Relational Database Service Instances in the Virtual Private Cloud

You can now launch Amazon Relational Database Service (RDS) DB instances inside of a Virtual Private Cloud (VPC).

Some Background
The Relational Database Service takes care of all of the messiness associated with running a relational database. You don’t have to worry about finding and configuring hardware, installing an operating system or a database engine, setting up backups, arranging for fault detection and failover, or scaling compute or storage as your needs change.

The Virtual Private Cloud lets you create a private, isolated section of the AWS Cloud. You have complete control over IP address ranges, subnetting, routing tables, and network gateways to your own data center and to the Internet.

Here We Go
Before you launch an RDS DB Instance inside of a VPC, you must first create the VPC and partition its IP address range into the desired subnets. You can do this using the VPC wizard in the AWS Management Console, the VPC command line tools, or the VPC APIs.

Then you need to create a DB Subnet Group. The Subnet Group should have at least one subnet in each Availability Zone of the target Region; it identifies the subnets (and the corresponding IP address ranges) where you would like to be able to run DB Instances within the VPC. This will allow a Multi-AZ deployment of RDS to create a new standby in another Availability Zone should the need arise. You need to do this even for Single-AZ deployments, just in case you want to convert them to Multi-AZ at some point.

You can create a DB Security Group, or you can use the default. The DB Security Group gives you control over access to your DB Instances; you can allow access from EC2 instances with membership in specific EC2 Security Groups or VPC Security Groups, or from designated ranges of IP addresses. You can also use VPC subnets and the associated network Access Control Lists (ACLs) if you'd like. You have a lot of control and a lot of flexibility.

The next step is to launch a DB Instance within the VPC while referencing the DB Subnet Group and a DB Security Group. With this release you are able to use the MySQL DB engine (we plan to add additional engine options over time). The DB Instance will have an Elastic Network Interface using an IP address selected from your DB Subnet Group. You can use the IP address to reach the instance if you'd like, but we recommend that you use the instance's DNS name instead, since the IP address can change during failover of a Multi-AZ deployment.
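As a rough illustration of those two steps using boto3, the AWS SDK for Python: the subnet IDs, security group ID, names, and password below are placeholders.

# Illustrative only: create a DB Subnet Group, then launch a MySQL DB Instance into the VPC.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# The subnet group needs at least one subnet in each Availability Zone you plan to use.
rds.create_db_subnet_group(
    DBSubnetGroupName="app-db-subnets",
    DBSubnetGroupDescription="Private subnets for RDS",
    SubnetIds=["subnet-11111111", "subnet-22222222"],   # placeholder subnet IDs
)

rds.create_db_instance(
    DBInstanceIdentifier="app-db",
    DBInstanceClass="db.m1.large",                      # placeholder instance class
    Engine="mysql",
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="change-me-please",              # placeholder password
    DBSubnetGroupName="app-db-subnets",
    VpcSecurityGroupIds=["sg-33333333"],                # placeholder VPC security group
    MultiAZ=True,
)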

Upgrading to VPC
If you are running an RDS DB Instance outside of a VPC, you can snapshot the DB Instance and then restore the snapshot into the DB Subnet Group of your choice. You cannot, however, access or use snapshots taken from within a VPC outside of that VPC. This is a restriction that we have put in place for security reasons.

Use Cases and Access Options
You can put this new combination (RDS + VPC) to use in a variety of ways. Here are some suggestions:

  • Private DB Instances Within a VPC – This is the most obvious and straightforward use case, and is a perfect way to run corporate applications that are not intended to be accessed from the Internet.
  • Public-facing Web Application with Private Database – Host the web site on a public-facing subnet and the DB Instances on a private subnet that has no Internet access. The application server and the RDS DB Instances will not have public IP addresses.

Your Turn
You can launch RDS instances in your VPCs today in all of the AWS Regions except AWS GovCloud (US). What are you waiting for?

— Jeff;

 

AWS Free Usage Tier now Includes Microsoft Windows on EC2

The AWS Free Usage Tier now allows you to run Microsoft Windows Server 2008 R2 on an EC2 t1.micro instance for up to 750 hours per month. This benefit is open to new AWS customers and to those who are already participating in the Free Usage Tier, and is available in all AWS Regions with the exception of GovCloud. This is an easy way for Windows users to start learning about and enjoying the benefits of cloud computing with AWS.

The micro instances provide a small amount of consistent processing power and the ability to burst to a higher level of usage from time to time. You can use this instance to learn about Amazon EC2, support a development and test environment, build an AWS application, or host a web site (or all of the above). We’ve fine-tuned the micro instances to make them even better at running Microsoft Windows Server.

You can launch your instance from the AWS Management Console:
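If you would rather script the launch, here is a rough sketch using boto3, the AWS SDK for Python; the AMI ID and key pair name are placeholders, so look up a current Windows Server 2008 R2 AMI for your Region first.

# Illustrative only: launch a free-tier eligible Windows micro instance.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",      # placeholder: a Windows Server 2008 R2 AMI in your Region
    InstanceType="t1.micro",     # free-tier eligible instance type
    KeyName="my-keypair",        # placeholder key pair, used to retrieve the Windows password
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])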

We have lots of helpful resources to get you started:

Along with 750 instance hours of Windows Server 2008 R2 per month, the Free Usage Tier also provides another 750 instance hours to run Linux (also on a t1.micro), Elastic Load Balancer time and bandwidth, Elastic Block Storage, Amazon S3 Storage, and SimpleDB storage, a bunch of Simple Queue Service and Simple Notification Service requests, and some CloudWatch metrics and alarms (see the AWS Free Usage Tier page for details). We’ve also boosted the amount of EBS storage space offered in the Free Usage Tier to 30GB, and we’ve doubled the I/O requests in the Free Usage Tier, to 2 million.

I look forward to hearing more about your experience with this new offering. Please feel free to leave a comment!

— Jeff;

PS – If you want to learn more about what’s next in the AWS Cloud, please sign up for our live event.