Category: Amazon EC2


Surprise! The EC2 CC2 Instance Type uses a Sandy Bridge Processor…

We like to distinguish our Cluster Compute instances from the other instance types by providing details about the actual processor (CPU) inside. When we launched the CC2 (Cluster Compute Eight Extra Large) instance type last year we were less specific than usual, stating only that each instance contained a pair of 8-core Xeon processors, each Hyper-Threaded, for a total of 32 parallel execution units.

If you are a student of the Intel architectures, you probably realized pretty quickly that Intel didn’t actually have a processor on the market with these particular specs and wondered what was going on.

Well, therein lies a tale. We work very closely with Intel and were able to obtain enough pre-release Xeon E5 (“Sandy Bridge“) chips last fall to build, test, and deploy the CC2 instance type. We didn’t publicize or expose this information and simply announced the capabilities of the instance.

Earlier today, Intel announced that the Xeon E5 is now in production and that you can now buy the same chips that all EC2 users have had access to since last November. You can write and make use of code that takes advantage of Intel’s Advanced Vector Extensions (AVX), including vector and scalar operations on 256-bit integer and floating-point values. These capabilities were key to a cluster of 1,064 cc2.8xlarge instances reaching the 42nd position on last November's Top500 supercomputer list, clocking in at over 240 teraflops.

I am happy to say that we now have plenty of chips, and there is no longer any special limit on the number of CC2 instances that you can use (just ask if you need more). Customers like InhibOx are using cc2.8xlarge instances to build extremely large customized virtual libraries for their customers, supporting computational chemistry in drug discovery. In addition to computational chemistry, customers have been using this instance type for a variety of applications ranging from image processing to in-memory databases.

On a personal note, my son Stephen is working on a large-scale dynamic problem solver as part of his PhD research. He recently ported his code from Python to C++ to take advantage of the Intel Math Kernel Library (MKL) and some other parallel programming tools. I was helping him to debug an issue that prevented him from fully exploiting all of the threads. Once we had fixed it, it was pretty cool to see his CC2 instance making use of all 32 threads (image via htop):

And what are you doing with your CC2? Leave us a comment and share….

— Jeff;

Dropping Prices Again: EC2, RDS, EMR and ElastiCache

AWS works hard to lower our costs so that we can pass those savings back to our customers. We look to reduce hardware costs, improve operational efficiencies, lower power consumption, and innovate in many other areas of our business so we can be more efficient. The history of AWS bears this out: in the past six years, we've lowered pricing 18 times, and today we're doing it again. We're lowering pricing for the 19th time with a significant price decrease for Amazon EC2, Amazon RDS, Amazon ElastiCache, and Amazon Elastic MapReduce.

Amazon EC2 Price Drop
First, a quick refresher.  You can buy EC2 instances by the hour. You have no commitment beyond an hour and can come or go as you please. That is our On-Demand model.

If you have predictable, steady-state workloads, you can save a significant amount of money by buying EC2 instances for a term (one year or three years). In this model, you purchase your instance for a set period of time and get a lower price. These are called Reserved Instances, and this model is the equivalent of buying or leasing servers, as folks have done for years, except that EC2 passes the benefit of its substantial scale on to its customers in the form of low prices. When people try to compare EC2 costs to doing it themselves, the apples-to-apples comparison is to Reserved Instances (although with EC2, you don't have to staff all the people to build, grow, and manage the infrastructure, and instead get to focus your scarce resources on what really differentiates your business or mission).

Today's Amazon EC2 price reduction varies by instance type and by Region, with Reserved Instance prices dropping by as much as 37% and On-Demand instance prices dropping by up to 10%. In 2006, the cost of running a small website with Amazon EC2 on an m1.small instance was $876 per year. Today, with a High Utilization Reserved Instance, you can run that same website for less than one third of that cost, at just $250 per year, an effective price of less than 3 cents per hour. As you can see below, we are lowering both On-Demand and Reserved Instance prices for our Standard, High-Memory, and High-CPU instance families. The chart below highlights the price decreases for Linux instances in our US East Region, but we are lowering prices in nearly every Region for both Linux and Windows instances.

For a full list of our new prices, go to the Amazon EC2 pricing page.

We have a few flavors of Reserved Instances that allow you to optimize your cost for the usage profile of your application. If you run your instances steady state, Heavy Utilization Reserved Instances are the least expensive on a per-hour basis. Other variants cost a little more per hour in exchange for the flexibility of being able to turn them off and save on the usage costs when you are not using them. This can save you money if you don't need to run your instance all of the time. For more details on which type of Reserved Instance is best for you, see the EC2 Reserved Instances page.

Save Even More on EC2 as You Get Bigger
One misperception we sometimes hear is that while EC2 is a phenomenal deal for smaller businesses, the cost benefit may diminish for large customers who achieve scale.  We have lots of customers of all sizes, and those who take the time to rigorously run the numbers see significant cost advantages in using EC2 regardless of the size of their operations.

Today, we're enabling customers to save even more as they scale by introducing Reserved Instance volume tiers. To determine which tier you qualify for, add up the upfront payments for all of the Reserved Instances that you own. If you own more than $250,000 of Reserved Instances, you qualify for a 10% discount on any additional Reserved Instances you buy (the discount applies to both the upfront and the usage prices). If you own more than $2 million of Reserved Instances, you qualify for a 20% discount on any new Reserved Instances you buy. Once you cross $5 million in Reserved Instance purchases, give us a call and we will see what we can do to reduce prices for you even further; we look forward to speaking with you!
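
To make the tier arithmetic concrete, here is a minimal Python sketch of how the discount on a new Reserved Instance purchase could be computed from the tiers described above; the function and variable names are illustrative, not part of any AWS tool.

def volume_discount(existing_upfront_ri_spend):
    # Tier thresholds from the paragraph above: over $250,000 earns 10%,
    # over $2 million earns 20%. Above $5 million, pricing is negotiated
    # directly, so no rate is returned here.
    if existing_upfront_ri_spend > 2000000:
        return 0.20
    if existing_upfront_ri_spend > 250000:
        return 0.10
    return 0.0

# Example: a customer who already owns $300,000 of Reserved Instances
# buys another $10,000 Reserved Instance.
discount = volume_discount(300000)
new_ri_cost = 10000 * (1 - discount)
print("discount: %d%%, new RI cost: $%.2f" % (discount * 100, new_ri_cost))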

Price Reductions for Amazon RDS, Amazon Elastic MapReduce and Amazon ElastiCache
These price reductions don't just apply to EC2: Amazon Elastic MapReduce customers will also benefit from lower prices on the EC2 instances they use. In addition, we are lowering prices for Amazon Relational Database Service (Amazon RDS). Prices for new RDS Reserved Instances will decrease by up to 42%, with On-Demand Instances for RDS and ElastiCache decreasing by up to 10%.

Here's a quick example of how these price reductions will help customers save money. If you are a game developer using a Quadruple Extra Large RDS MySQL 1-year Heavy Utilization Reserved Instance to power a new game, the new pricing will save you over $550 per month (or 39%) for each new database instance you run. If you run an e-commerce application on AWS using an Extra Large Multi-AZ RDS MySQL instance for your always-on database, you will save more than $445 per month (or 37%) by using a 3-year Heavy Utilization Reserved Database Instance. If you added a two-node Extra Large ElastiCache cluster for better performance, you would save an additional $80 per month (or 10%). For a full list of the new prices, go to the Amazon RDS pricing page, Amazon ElastiCache pricing page, and the Amazon EMR pricing page.

Real Customer Savings
Let's put these cost savings into context. One of our fast-growing customers was primarily running Amazon EC2 On-Demand instances, racking up 360,000 instance hours last month using a mix of m1.xlarge, m1.large, m2.2xlarge, and m2.4xlarge instances. Without this customer changing a thing, our new EC2 pricing will drop their bill by over $25,000 next month, or $300,000 per year, an 8.6% savings on their On-Demand spend. This customer was already in the process of switching to 3-year Heavy Utilization Reserved Instances (since most of their instances run steady state) for a whopping savings of 55%. Now, with the new EC2 price drop we're announcing today, this customer will save another 37% on these Reserved Instances. Additionally, with the introduction of our new volume tiers, this customer will add another 10% discount on top of all that. In all, this price reduction, the new volume discount tiers, and the move to Reserved Instances will save the customer over $215,000 per month, or $2.6 million per year, over what they are paying today, reducing their bill by 76%!

Many of our customers were already saving significant amounts of money before this price drop, simply by running on AWS. Samsung uses AWS to run its Smart Hub application, which drives the apps you can use through their TVs, and they recently shared with us that by using AWS they are saving $34 million in capital expenses over two years and reducing their operating expenses by 85%. According to their team, with AWS they met reliability and performance objectives at a fraction of the cost they would have otherwise incurred.

Another customer example is foursquare Labs, Inc. They use AWS to perform analytics across more than 5 million daily check-ins. foursquare runs Amazon Elastic MapReduce clusters for its data analytics platform, using a mix of High-Memory and High-CPU instances. Previously, this EMR analytics cluster ran on On-Demand EC2 Instances, but just recently foursquare decided to buy over $1 million of 1-year Heavy Utilization Reserved Instances, reducing their costs by 35% while still using some On-Demand instances for the flexibility to scale up or shed instances as needed. However, the new EC2 price drop lowers their costs even further. This price reduction will help foursquare save another 22%, and their overall EC2 Reserved Instance usage for their EMR cluster qualifies them for the additional 10% volume tier discount on top of that. This price drop, combined with the move to Reserved Instances, will help foursquare reduce their EC2 instance costs by over 53% from last month without sacrificing any of the scaling provided by EC2 and Elastic MapReduce.

As we continue to find ways to lower our own cost structure, we will continue to pass these savings back to our customers in the form of lower prices. Some companies work hard to lower their costs so they can pocket more margin. That's a strategy that a lot of the traditional technology companies have employed for years, and it's a reasonable business model. It's just not ours. We want customers of all sizes, from start-ups to enterprises to government agencies, to be able to use AWS to lower their technology infrastructure costs and focus their scarce engineering resources on work that actually differentiates their businesses and moves their missions forward. We hope this is another helpful step in that direction.

You can use the AWS Simple Monthly Calculator to see the savings first-hand.

— Jeff;

Running Esri Applications in the AWS Cloud

Esri is a leading provider of Geographic Information Systems (GIS) software and geo-database management applications. Their powerful mapping solutions have been used by governments, industry, academia, and NGOs for nearly 30 years (read their history to learn more).

Over the years, they have ported their code from mainframes to minicomputers (anyone else remember DEC and Prime?), and then to personal computers. Now they are moving their applications to the AWS cloud to provide their customers with applications that can be launched quickly and then scaled as needed, all at a cost savings compared to traditional on-premises hosting.

Watch this video to learn more about this potent combination:

We will be participating in the Esri Federal GIS Conference in Washington, DC this month; please visit the AWS Federal page for more information about this and other events.

On that page you will also find case studies from several AWS users in the Federal Government including the Recovery Accountability and Transparency Board, the US Department of Treasury, the DOE’s National Renewable Energy Laboratory, the US Department of State, the US Department of Agriculture, the NASA Jet Propulsion Laboratory, and the European Space Agency.

On March 21, we will run a free Esri on the Cloud webinar at noon EST. Attend the webinar to learn how to use AWS to process GIS jobs faster and at a lower cost than an on-premises solution. Our special guest will be Shawn Kingsberry, CIO of the US Government's Recovery Accountability and Transparency Board.

— Jeff;

Pulse – Using Big Data Analytics to Drive Rich User Features

It's always exciting to find out that an app that has changed how I consume news and blog content on my mobile devices is using AWS to power some of their most engaging features. Such is the case with Pulse, a visual news reading app for iPhone, iPad and Android. Pulse uses Amazon Elastic MapReduce, our hosted Hadoop product, to analyze data from over 11 million users and to deliver the best news stories from a variety of different content publishers. Born out of a Stanford launchpad class and awarded for its elegant design by Apple at WWDC 2011, the Pulse app blends a strong high-tech backend with great visual appeal to conquer the eyes of mobile news readers everywhere.

Pulse backend team members from left to right: Simon, Lili, Greg, Leonard

The December 2011 update included a new feature called Smart Dock, which uses Hadoop and a tool called mrjob, developed by Yelp, to analyze users' reading preferences and continuously recommend other articles or sources they might enjoy.
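
Pulse hasn't published the code for these jobs, but to illustrate the mrjob pattern mentioned above, here is a minimal, hypothetical sketch that counts article reads per news source from tab-separated event logs; the log format and field positions are assumptions made for the example.

from mrjob.job import MRJob

class SourceReadCounts(MRJob):
    # Hypothetical job: count 'article_read' events per news source.

    def mapper(self, _, line):
        # Assumed log format: user_id <tab> event_type <tab> source <tab> article_id
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 3 and fields[1] == 'article_read':
            yield fields[2], 1

    def reducer(self, source, counts):
        yield source, sum(counts)

if __name__ == '__main__':
    SourceReadCounts.run()

A job like this can be tested locally and then pointed at Elastic MapReduce using mrjob's -r emr runner.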

To understand the level of engineering that goes behind such rich customer features, I spoke to Greg Bayer, Backend Engineering Lead at Pulse:

How big is the big data that Pulse analyzes every day? 

Our application relies on accurately analyzing client event logs (as opposed to web logs) to extract trends and enable other rich features for our users. To give you a sense of the scale at which we run these analyses, we literally go through millions of events per hour, which translates to as many as 250+ Amazon Elastic MapReduce nodes on any given day. Since we are dealing with event logs generated by our users from the various platforms on which they access our app (Android, iPhone, iPad, etc.), our logs grow in proportion to our user base. For example, the recent influx of new users from the Kindle Fire (Android) means we now have a lot more logs coming in from those devices. Also, since the logs are big, we've found that it is very efficient to write them to disk as fast as possible, directly from devices to Amazon EC2 (see my tandem article on the logging architecture we use and the graph below, which highlights some of our numbers).

For more Pulse numbers, check out the full infographic.

Powering Rich Features for Our Users

Much of our backend is built on industry-standard systems such as Hadoop. The innovation happens in how we leverage these systems to create value. For us, it's all about how we can make the app more fun to use and provide rich features that our users will love. For techies, you can read about many of these features in the backend section of the Pulse engineering blog and learn about all the details.

The Right Choice for Big Data

I joined the team here pretty early on as the first backend engineer. I came to Pulse after working at Sandia National Labs, where I built and managed an in-house 70-node Hadoop cluster. That was an investment of over $100,000, plus ongoing operational support, and it took over six months to get it fully tuned. Needless to say, I was fully aware of the cost and resources needed to run something at the scale that Pulse would need to accommodate.

AWS was and still is the only feasible solution for us. I love the flexibility to quickly stand up a cluster of hundreds of nodes and the added flexibility of choosing the pricing scheme that's needed for a job. If I need a job done faster, I can always spin up a very large cluster and get results in minutes, or take advantage of smaller instances and the spot marketplace for Amazon Elastic MapReduce if I'm looking to complete a job that's not time-sensitive. Since an Amazon Elastic MapReduce cluster can simply be turned off when we are done, the cost to run big queries is usually quite reasonable. Consider a cluster of 100 m1.large machines: a set of queries that takes 45 minutes to run on this cluster could cost us approximately $11 to $34, depending on whether we bid on spot instances or use regular on-demand instances.
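
To see how a figure like that can come together, here is a back-of-the-envelope sketch; the per-hour rates below are illustrative placeholders (actual EC2, EMR, and spot prices vary by region and over time), chosen only to show the shape of the calculation.

# 100 m1.large nodes, one 45-minute job (billed as a full instance-hour each).
NODES = 100
BILLED_HOURS = 1
EMR_FEE = 0.06          # assumed EMR surcharge per instance-hour
ON_DEMAND_RATE = 0.28   # assumed on-demand m1.large rate per hour
SPOT_RATE = 0.05        # assumed spot price per hour

on_demand_cost = NODES * BILLED_HOURS * (ON_DEMAND_RATE + EMR_FEE)
spot_cost = NODES * BILLED_HOURS * (SPOT_RATE + EMR_FEE)
print("spot: $%.0f, on-demand: $%.0f" % (spot_cost, on_demand_cost))
# With these assumed rates, the result brackets the $11 to $34 range quoted above.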

Lessons Learned (the bold formatting below is our doing :) )

It is important to consider the trade-offs and choose the right tool for the job. In our experience, AWS provides an exceptional capability to build systems as close to the metal as you like, while still avoiding the burden and inelasticity of owning your own hardware. It also provides some useful abstraction layers and services above the machine level.

By allowing virtual machines (Amazon EC2 instances) to be provisioned quickly and inexpensively, a small engineering team can stay more focused on the development of key product features. Since stopping and starting these instances is painless, it's easy to quickly adapt to changing engineering or business needs, perhaps scaling up to support 10x more users or shutting down a feature after pivoting a business model.

AWS also provides many other useful services that help save engineering time. Many standard systems, such as load balancers or Hadoop clusters, that normally require significant time and specialized knowledge to deploy, can be deployed automatically on Amazon EC2 for almost no setup or maintenance cost.

Simple, but powerful services like Amazon S3 and the newly released Amazon DynamoDB make building complex features on AWS even easier. Because bandwidth is fast and free between all AWS services, plugging together several of these services is a great way to bootstrap a scalable infrastructure.

Thanks for your time, Greg & best of luck to the Pulse team! 

-rodica

Related: Pulse Engineering – Scaling to 10M on AWS

Be Careful When Comparing AWS Costs…

Earlier today, GigaOM published a cost comparison of self-hosting vs. hosting on AWS. I wanted to bring to your attention a few quick issues that we saw with this analysis:

Lower Costs in Other AWS Regions – The comparison used the AWS costs for the US West (Northern California) Region, ignoring the fact that EC2 pricing in the US East (Northern Virginia) and US West (Oregon) Regions is lower ($0.76 vs. $0.68 for On-Demand Extra Large instances).

Three Year Reserved Instances – The comparison used one-year Reserved Instances but a three-year amortization schedule for the self-hosted hardware. You save over 22% by using three-year Reserved Instances instead of one-year Reserved Instances, and doing so makes the comparison closer to apples-to-apples.

Heavy Utilization Reserved Instances – The comparison used a combination of Medium Utilization Reserved Instances and On-Demand Instances. Given the predictable traffic pattern in the original post, a blend of Heavy and Light Utilization Reserved Instances would reduce your costs, and still give you the flexibility to easily scale up and scale down that you don’t get with traditional hosting.

Load Balancer (and other Networking) Costs – The self-hosted column does not include the cost of a redundant set of load balancers. They also need top-of-rack switches (to handle what is probably 5 racks worth of servers) and a router.

No Administrative Costs – Although the self-hosted model specifically excludes maintenance and administrative costs, it is not reasonable to assume that none of the self-hosted hardware will fail over the course of the three-year period. It is also dangerous to assume that labor costs will be the same in both cases, as labor can be a significant expense when you are self-hosting.

Data Transfer Costs – The self-hosted example assumes a commit of over 4 Gbps of bandwidth capacity. If you have ever contracted for bandwidth & connectivity at this scale, you undoubtedly know that you must actually commit to a certain amount of data transfer, and that your costs will change significantly if you are over or under your commitment.

We did our own calculations, taking into account only the first four issues listed above, and came up with a monthly cost for AWS of $56,043 (vs. the $70,854 quoted in the article). Obviously each workload differs based on which resources are utilized most.
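
As a rough illustration of the Reserved Instance point above, here is a small sketch of how an effective hourly rate falls out of a three-year amortization; the upfront fee and hourly usage rate are hypothetical placeholders, not actual AWS prices.

HOURS_PER_YEAR = 8760
TERM_YEARS = 3

on_demand_rate = 0.68        # $/hour, Extra Large On-Demand rate cited above
upfront_fee = 2900.00        # hypothetical 3-year Reserved Instance upfront fee
usage_rate = 0.16            # hypothetical Reserved Instance hourly usage rate

effective_ri_rate = upfront_fee / (TERM_YEARS * HOURS_PER_YEAR) + usage_rate
print("on-demand: $%.3f/hr, 3-year RI (amortized): $%.3f/hr"
      % (on_demand_rate, effective_ri_rate))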

These analyses are always tricky to do; you need to make apples-to-apples cost comparisons and weigh the benefits associated with each approach. We're always happy to work with those wanting to get into the details of these analyses; we continue to focus on lowering infrastructure costs and we're far from being done.

— Jeff;

New Tagging for Auto Scaling Groups

You can now add up to 10 tags to any of your Auto Scaling Groups. You can also, if you’d like, propagate the tags to the EC2 instances launched from your groups.

Adding tags to your Auto Scaling groups will make it easier for you to identify and distinguish them.

Each tag has a name, a value, and an optional propagation flag. If the flag is set, then the corresponding tag will be applied to EC2 instances launched from the group. You can use this feature to label or distinguish instances created by distinct Auto Scaling groups. You might be using multiple groups to support multiple scalable applications, or multiple scalable tiers or components of a single application. Either way, the tags can help you to keep your instances straight.
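
If you prefer to script the tagging rather than use the console, here is a minimal sketch using the boto3 SDK (a later AWS SDK, shown purely for illustration); the group name and tag values are hypothetical.

import boto3

autoscaling = boto3.client('autoscaling', region_name='us-east-1')

autoscaling.create_or_update_tags(
    Tags=[
        {
            'ResourceId': 'web-tier-asg',       # an existing Auto Scaling group
            'ResourceType': 'auto-scaling-group',
            'Key': 'environment',
            'Value': 'production',
            'PropagateAtLaunch': True,          # copy the tag to launched EC2 instances
        }
    ]
)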

Read more in the newest version of the Auto Scaling Developer Guide.

— Jeff;

AWS HowTo: Using Amazon Elastic MapReduce with DynamoDB (Guest Post)

Today’s guest blogger is Adam Gray. Adam is a Product Manager on the Elastic MapReduce Team.

— Jeff;


Apache Hadoop and NoSQL databases are complementary technologies that together provide a powerful toolbox for managing, analyzing, and monetizing Big Data. That's why we were so excited to provide out-of-the-box Amazon Elastic MapReduce (Amazon EMR) integration with Amazon DynamoDB, giving customers an integrated solution that eliminates the often prohibitive costs of administration, maintenance, and upfront hardware. Customers can now move vast amounts of data into and out of DynamoDB, as well as perform sophisticated analytics on that data, using EMR's highly parallelized environment to distribute the work across the number of servers of their choice. Further, as EMR uses a SQL-based engine for Hadoop called Hive, you need only know basic SQL while we handle distributed application complexities such as estimating ideal data splits based on hash keys, pushing appropriate filters down to DynamoDB, and distributing tasks across all the instances in your EMR cluster.

In this article, I'll demonstrate how EMR can be used to efficiently export DynamoDB tables to S3, import S3 data into DynamoDB, and perform sophisticated queries across tables stored in both DynamoDB and other storage services such as S3.

We will also use sample product order data stored in S3 to demonstrate how you can keep current data in DynamoDB while storing older, less frequently accessed data in S3. By exporting your rarely used data to Amazon S3, you can reduce your storage costs while preserving the low-latency access required for high-velocity data. Further, exported data in S3 is still directly queryable via EMR (and you can even join your exported tables with current DynamoDB tables).

The sample order data uses the schema below. This includes Order ID as its primary key, a Customer ID field, an Order Date stored as the number of seconds since the epoch, and Total representing the total amount spent by the customer on that order. The data also has folder-based partitioning by both year and month, and you'll see why in a bit.

Creating a DynamoDB Table
Let's create a DynamoDB table for the month of January 2012 named Orders-2012-01. We will specify Order ID as the Primary Key. By using a table for each month, it is much easier to export data and delete tables over time when they no longer require low-latency access.

For this sample, a read capacity and a write capacity of 100 units should be more than sufficient. When setting these values, you should keep in mind that the larger the EMR cluster, the more capacity it will be able to take advantage of. Further, you will be sharing this capacity with any other applications utilizing your DynamoDB table.

Launching an EMR Cluster
Please follow Steps 1-3 in the EMR for DynamoDB section of the Elastic MapReduce Developer Guide to launch an interactive EMR cluster and SSH to its Master Node to begin submitting SQL-based queries. Note that we recommend you use at least three instances of m1.large size for this sample.

At the hadoop command prompt for the current master node, type hive. You should see a hive prompt: hive>

As no other applications will be using our DynamoDB table, let's tell EMR to use 100% of the available read throughput (by default it will use 50%). Note that this can adversely affect the performance of other applications simultaneously using your DynamoDB table and should be set cautiously.

SET dynamodb.throughput.read.percent=1.0;

Creating Hive Tables
Outside data sources are referenced in your Hive cluster by creating an EXTERNAL TABLE. First, let's create an EXTERNAL TABLE for the exported order data in S3. Note that this simply creates a reference to the data; no data is moved yet.

CREATE EXTERNAL TABLE orders_s3_export ( order_id string, customer_id string, order_date int, total double )
PARTITIONED BY (year string, month string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://elastic-mapreduce/samples/ddb-orders';

You can see that we specified the data location, the ordered data fields, and the folder-based partitioning scheme.

Now let's create an EXTERNAL TABLE for our DynamoDB table.

CREATE EXTERNAL TABLE orders_ddb_2012_01 ( order_id string, customer_id string, order_date bigint, total double )
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES (
"dynamodb.table.name" = "Orders-2012-01",
"dynamodb.column.mapping" = "order_id:Order ID,customer_id:Customer ID,order_date:Order Date,total:Total"
);

This is a bit more complex. We need to specify the DynamoDB table name, the DynamoDB storage handler, the ordered fields, and a mapping between the EXTERNAL TABLE fields (which can't include spaces) and the actual DynamoDB fields.

Now we're ready to start moving some data!

Importing Data into DynamoDB
In order to access the data in our S3 EXTERNAL TABLE, we first need to specify which partitions we want in our working set via the ADD PARTITION command. Let's start with the data for January 2012.

ALTER TABLE orders_s3_export ADD PARTITION (year='2012', month='01');

Now if we query our S3 EXTERNAL TABLE, only this partition will be included in the results. Let's load all of the January 2012 order data into our external DynamoDB table. Note that this may take several minutes.

INSERT OVERWRITE TABLE orders_ddb_2012_01
SELECT order_id, customer_id, order_date, total
FROM orders_s3_export ;

Looks a lot like standard SQL, doesn't it?

Querying Data in DynamoDB Using SQL
Now let's find the top 5 customers by spend over the first week of January. Note the use of unix_timestamp, as order_date is stored as the number of seconds since the epoch.

SELECT customer_id, sum(total) spend, count(*) order_count
FROM orders_ddb_2012_01
WHERE order_date >= unix_timestamp('2012-01-01', 'yyyy-MM-dd')
AND order_date < unix_timestamp('2012-01-08', 'yyyy-MM-dd')
GROUP BY customer_id
ORDER BY spend desc
LIMIT 5 ;

Querying Exported Data in S3
It looks like customer c-2cC5fF1bB was the biggest spender for that week. Now let's query our historical data in S3 to see what that customer spent in each of the final 6 months of 2011. First, though, we will have to include the additional data in our working set. The RECOVER PARTITIONS command makes it easy to add all of the available partitions to the table at once.

ALTER TABLE orders_s3_export RECOVER PARTITIONS;

We will now query the 2011 exported data for customer c-2cC5fF1bB from S3. Note that the partition fields, both month and year, can be used in your Hive query.

SELECT year, month, customer_id, sum(total) spend, count(*) order_count
FROM orders_s3_export
WHERE customer_id = 'c-2cC5fF1bB'
AND month >= 6
AND year = 2011
GROUP BY customer_id, year, month
ORDER by month desc;

Exporting Data to S3
Now let's export the January 2012 DynamoDB table data to a different S3 bucket owned by you (denoted by YOUR BUCKET in the command). We'll first need to create an EXTERNAL TABLE for that S3 bucket. Note that we again partition the data by year and month.

CREATE EXTERNAL TABLE orders_s3_new_export ( order_id string, customer_id string, order_date int, total double )
PARTITIONED BY (year string, month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://YOUR BUCKET';

Now export the data from DynamoDB to S3, specifying the appropriate partition values for that table's month and year.

INSERT OVERWRITE TABLE orders_s3_new_export
PARTITION (year='2012', month='01')
SELECT * from orders_ddb_2012_01;

Note that if this was the end of a month and you no longer needed low-latency access to that table's data, you could also delete the table in DynamoDB. You may also want to terminate your job flow from the EMR console now to ensure you do not continue being charged.

That's it for now. Please visit our documentation for more examples, including how to specify the format and compression scheme for your exported files.

— Adam Gray, Product Manager, Amazon Elastic MapReduce.

The AWS Storage Gateway – Integrate Your Existing On-Premises Applications with AWS Cloud Storage

Warning: If you don’t have a data center, or if all of your IT infrastructure is already in the cloud, you may not need to read this post! But feel free to pass it along to your friends and colleagues.

The Storage Gateway
Our new AWS Storage Gateway service connects an on-premises software appliance with cloud-based storage, integrating your existing on-premises applications with the AWS storage infrastructure in a seamless, secure, and transparent fashion. Watch this video for an introduction:

Data stored in your current data center can be backed up to Amazon S3, where it is stored as Amazon EBS snapshots. Once there, you will benefit from S3’s low cost and intrinsic redundancy. In the event you need to retrieve a backup of your data, you can easily restore these snapshots locally to your on-premises hardware. You can also access them as Amazon EBS volumes, enabling you to easily mirror data between your on-premises and Amazon EC2-based applications.

You can install the AWS Storage Gateway’s software appliance on a host machine in your data center. Here’s how all of the pieces fit together:

 

The AWS Storage Gateway allows you to create storage volumes and attach these volumes as iSCSI devices to your on-premises application servers. The volumes can be Gateway-Stored (right now) or Gateway-Cached (soon) volumes. Gateway-Stored volumes retain a complete copy of the volume on the local storage attached to the on-premises host, while uploading backup snapshots to Amazon S3. This provides low-latency access to your entire data set while providing durable off-site backups. Gateway-Cached volumes will use the local storage as a cache for frequently-accessed data; the definitive copy of the data will live in the cloud. This will allow you to offload your storage to Amazon S3 while preserving low-latency access to your active data.

Gateways can connect to AWS directly or through a local proxy. You can connect through AWS Direct Connect if you would like, and you can also control the amount of inbound and outbound bandwidth consumed by each gateway. All data is compressed prior to upload.

Each gateway can support up to 12 volumes and a total of 12 TB of storage. You can have multiple gateways per account and you can choose to store data in our US East (Northern Virginia), US West (Northern California), US West (Oregon), EU (Ireland), Asia Pacific (Singapore), or Asia Pacific (Tokyo) Regions.

The first release of the AWS Storage Gateway takes the form of a VM image for VMware ESXi 4.1 (we plan on supporting other virtual environments in the future). Adequate local disk storage, either Direct Attached or SAN (Storage Area Network), is needed for your application storage (used by your iSCSI storage volumes) and working storage (data queued up for writing to AWS). We currently support mounting of our iSCSI storage volumes using the Microsoft Windows and Red Hat iSCSI Initiators.

Up and Running
During the installation and configuration process you will be able to create up to 12 iSCSI storage volumes per gateway. Once installed, each gateway will automatically download, install, and deploy updates and patches. This activity takes place during a maintenance window that you can set on a per-gateway basis.

The AWS Management Console includes complete support for the AWS Storage Gateway. You can create volumes, create and restore snapshots, and establish a schedule for snapshots. Snapshots can be scheduled at 1, 2, 4, 8, 12, or 24 hour intervals. Each gateway reports a number of metrics to Amazon CloudWatch for monitoring.

The snapshots are stored as Amazon EBS (Elastic Block Store) snapshots. You can create an EBS volume using a snapshot of one of your local gateway volumes, or vice versa. Does this give you any interesting ideas?

The Gateway in Action
I expect the AWS Storage Gateway will be put to use in all sorts of ways. Some that come to mind are:

  • Disaster Recovery and Business Continuity – You can reduce your investment in hardware set aside for Disaster Recovery using a cloud-based approach. You can send snapshots of your precious data to the cloud on a regular and frequent basis and you can use our VM Import service to move your virtual machine images to the cloud.
  • Backup – You can back up local data to the cloud without worrying about running out of storage space. It is easy to schedule the backups, and you don’t have to arrange to ship tapes off-site or manage your own infrastructure in a second data center.
  • Data Migration – You can now move data from your data center to the cloud, and back, with ease.

Security Considerations
We believe that the AWS Storage Gateway will be at home in the enterprise, so I’ll cover the inevitable security questions up front. Here are the facts:

  • Data traveling between AWS and each gateway is protected via SSL.
  • Data at rest (stored in Amazon S3) is encrypted using AES-256.
  • The iSCSI initiator authenticates itself to the target using CHAP (Challenge-Handshake Authentication Protocol).

Costs
All AWS users are eligible for a free trial of the AWS Storage Gateway. After that, there is a charge of $125 per month for each activated gateway. The usual EBS snapshot storage rates apply ($0.14 per gigabyte-month in the US East Region), as do the usual AWS prices for outbound data transfer (there's no charge for inbound data transfer). More pricing information can be found on the Storage Gateway Home Page. If you are eligible for the AWS Free Usage Tier, you get up to 1 GB of free EBS snapshot storage per month as well as 15 GB of outbound data transfer.
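
As a quick back-of-the-envelope example, here is how the monthly charge for a single gateway might add up using the rates quoted above; the snapshot footprint is an assumption made for illustration, and data transfer is left out.

GATEWAY_FEE = 125.00      # $ per activated gateway per month
SNAPSHOT_RATE = 0.14      # $ per GB-month of EBS snapshot storage (US East)

snapshot_gb = 500         # assumed compressed snapshot footprint in S3
monthly_cost = GATEWAY_FEE + snapshot_gb * SNAPSHOT_RATE
print("approximately $%.2f per month, before data transfer" % monthly_cost)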

On the Horizon
As I mentioned earlier, the first release of the AWS Storage Gateway supports Gateway-Stored volumes. We plan to add support for Gateway-Cached volumes in the coming months.

We’ll add more features to our roadmap as soon as our users (this means you) start to use the AWS Storage Gateway and send feedback our way.

Learn More
You can visit the Storage Gateway Home Page or read the Storage Gateway User Guide to learn more.

We will be hosting a Storage Gateway webinar on Thursday, February 23rd. Please attend if you would like to learn more about the Storage Gateway and how it can be used for backup, disaster recovery, and data mirroring scenarios. The webinar is free and open to all, but space is limited and you need to register!

— Jeff;

Launch Relational Database Service Instances in the Virtual Private Cloud

You can now launch Amazon Relational Database Service (RDS) DB instances inside of a Virtual Private Cloud (VPC).

Some Background
The Relational Database Service takes care of all of the messiness associated with running a relational database. You don’t have to worry about finding and configuring hardware, installing an operating system or a database engine, setting up backups, arranging for fault detection and failover, or scaling compute or storage as your needs change.

The Virtual Private Cloud lets you create a private, isolated section of the AWS Cloud. You have complete control over IP address ranges, subnetting, routing tables, and network gateways to your own data center and to the Internet.

Here We Go
Before you launch an RDS DB Instance inside of a VPC, you must first create the VPC and partition its IP address range into the desired subnets. You can do this using the VPC wizard pictured above, the VPC command line tools, or the VPC APIs.

Then you need to create a DB Subnet Group. The Subnet Group should have at least one subnet in each Availability Zone of the target Region; it identifies the subnets (and the corresponding IP address ranges) where you would like to be able to run DB Instances within the VPC. This will allow a Multi-AZ deployment of RDS to create a new standby in another Availability Zone should the need arise. You need to do this even for Single-AZ deployments, just in case you want to convert them to Multi-AZ at some point.

You can create a DB Security Group, or you can use the default. The DB Security Group gives you control over access to your DB Instances; you can allow access from EC2 instances that belong to specific EC2 Security Groups or VPC Security Groups, or from designated ranges of IP addresses. You can also use VPC subnets and the associated network Access Control Lists (ACLs) if you'd like. You have a lot of control and a lot of flexibility.

The next step is to launch a DB Instance within the VPC, referencing the DB Subnet Group and a DB Security Group. With this release, you are able to use the MySQL DB engine (we plan to add additional options over time). The DB Instance will have an Elastic Network Interface using an IP address selected from your DB Subnet Group. You can use the IP address to reach the instance if you'd like, but we recommend that you use the instance's DNS name instead, since the IP address can change during failover of a Multi-AZ deployment.
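
If you prefer to script these steps, here is a minimal sketch using the boto3 SDK (a later AWS SDK, shown for illustration); the identifiers, subnet IDs, security group, and credentials below are hypothetical placeholders.

import boto3

rds = boto3.client('rds', region_name='us-east-1')

# DB Subnet Group covering at least one subnet per Availability Zone.
rds.create_db_subnet_group(
    DBSubnetGroupName='my-app-db-subnets',
    DBSubnetGroupDescription='Private subnets for RDS',
    SubnetIds=['subnet-aaaa1111', 'subnet-bbbb2222'],
)

# MySQL DB Instance launched into the VPC via the subnet group.
rds.create_db_instance(
    DBInstanceIdentifier='my-app-db',
    Engine='mysql',
    DBInstanceClass='db.m1.large',
    AllocatedStorage=100,                      # GB
    MasterUsername='admin',
    MasterUserPassword='REPLACE_ME',
    DBSubnetGroupName='my-app-db-subnets',
    VpcSecurityGroupIds=['sg-0123abcd'],
    MultiAZ=True,                              # standby in a second Availability Zone
)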

Upgrading to VPC
If you are running an RDS DB Instance outside of a VPC, you can snapshot the DB Instance and then restore the snapshot into the DB Subnet Group of your choice. You cannot, however, access or use snapshots taken from within a VPC outside of that VPC. This is a restriction that we have put into place for security reasons.

Use Cases and Access Options
You can put this new combination (RDS + VPC) to use in a variety of ways. Here are some suggestions:

  • Private DB Instances Within a VPC – This is the most obvious and straightforward use case, and is a perfect way to run corporate applications that are not intended to be accessed from the Internet.
  • Public-Facing Web Application with Private Database – Host the web site on a public-facing subnet and the DB Instances on a private subnet that has no Internet access. The application server and the RDS DB Instances will not have public IP addresses.

Your Turn
You can launch RDS instances in your VPCs today in all of the AWS Regions except AWS GovCloud (US). What are you waiting for?

— Jeff;

 

AWS Free Usage Tier now Includes Microsoft Windows on EC2

The AWS Free Usage Tier now allows you to run Microsoft Windows Server 2008 R2 on an EC2 t1.micro instance for up to 750 hours per month. This benefit is open to new AWS customers and to those who are already participating in the Free Usage Tier, and is available in all AWS Regions with the exception of GovCloud. This is an easy way for Windows users to start learning about and enjoying the benefits of cloud computing with AWS.

The micro instances provide a small amount of consistent processing power and the ability to burst to a higher level of usage from time to time. You can use this instance to learn about Amazon EC2, support a development and test environment, build an AWS application, or host a web site (or all of the above). We’ve fine-tuned the micro instances to make them even better at running Microsoft Windows Server.
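
If you would rather script the launch than click through the console, here is a minimal sketch using the boto3 SDK (a later AWS SDK, shown for illustration); the AMI ID and key pair name are placeholders, so substitute a Windows Server 2008 R2 AMI from your Region and an existing key pair of your own.

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

ec2.run_instances(
    ImageId='ami-xxxxxxxx',       # placeholder: a Windows Server 2008 R2 AMI
    InstanceType='t1.micro',      # Free Usage Tier eligible instance type
    MinCount=1,
    MaxCount=1,
    KeyName='my-keypair',         # used to retrieve the Windows Administrator password
)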

You can launch your instance from the AWS Management Console:

We have lots of helpful resources to get you started:

Along with 750 instance hours of Windows Server 2008 R2 per month, the Free Usage Tier also provides another 750 instance hours to run Linux (also on a t1.micro), Elastic Load Balancer time and bandwidth, Elastic Block Storage, Amazon S3 Storage, and SimpleDB storage, a bunch of Simple Queue Service and Simple Notification Service requests, and some CloudWatch metrics and alarms (see the AWS Free Usage Tier page for details). We’ve also boosted the amount of EBS storage space offered in the Free Usage Tier to 30GB, and we’ve doubled the I/O requests in the Free Usage Tier, to 2 million.

I look forward to hearing more about your experience with this new offering. Please feel free to leave a comment!

— Jeff;

PS – If you want to learn more about what’s next in the AWS Cloud, please sign up for our live event.