Category: Amazon EC2

New Public Data Set: YRI Trio

The YRI Trio Public Data Set provides complete genome sequence data for three Yoruba individuals from Ibadan, Nigeria. These are the first human genomes sequenced using Illumina's next-generation Sequence-by-Synthesis technology.

These are among the first individual human genomes to be sequenced and peer-reviewed (the full story is here); the linked article contains full information about this remarkable, ground-breaking effort.

The data is described as “containing paired 35-base reads of over 30x average depth.” Basically this means that the data contains a large number of relatively short genome sequences, and that each base is, on average, covered by more than 30 separate reads. I asked my colleague Deepak Singh for a better explanation and this is what he told me:

In order to get better assembly and data accuracy you determine the order of bases n times. With older sequencing technologies you collected longer reads and coverage was typically in the n=4-6 range. The sequencing process also took a very long time (several months) to collect sufficient data. Modern, or next-generation, sequencing technologies yield shorter reads, but you get results much faster (days to weeks) and at much lower cost, so you can repeat the experiment many times to get better coverage. Higher coverage depth gives you the ability to detect low-frequency variations (which are part of how we are differentiated from one another, and can be characteristic of certain diseases) and to produce improved genome assemblies.
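Deepak's point can be made concrete with a little arithmetic. The sketch below uses round illustrative numbers (a 3-billion-base genome, 35-base reads), not the data set's actual figures:

```python
# Average coverage depth: total sequenced bases divided by genome length.
def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    return num_reads * read_length / genome_length

genome = 3_000_000_000           # ~3 billion bases in a human genome
read_length = 35                 # short next-generation reads

# Reads needed to reach ~30x average depth: about 2.6 billion of them.
reads_needed = int(30 * genome / read_length)
print(f"{reads_needed:,} reads -> {coverage(reads_needed, read_length, genome):.1f}x")
```

With the older n=4-6 coverage Deepak mentions, the read count drops proportionally, which is why deep coverage only became routine once per-read costs fell.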

Suggested uses for this data include:

  • The development of alignment algorithms.
  • The development of de novo assembly algorithms.
  • The development of algorithms that define genetic regions of interest, sequence motifs, structural variants, copy number variations, and site-specific polymorphisms.
  • The testing of annotation engines that start with raw sequence data.

By the way, this data set is big (700 GB), but you can create an EBS volume, attach it to an EC2 instance, and start processing it in minutes!
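Here's roughly what that flow looked like with the EC2 API command-line tools. The snapshot, volume, and instance IDs below are placeholders, not the real identifiers for this data set, and the flags should be checked against the tools' `--help` output:

```shell
# Create a volume in your instance's Availability Zone from the data
# set's snapshot (placeholder ID shown).
ec2-create-volume --snapshot snap-XXXXXXXX -z us-east-1a

# Attach the resulting volume to a running instance as /dev/sdf.
ec2-attach-volume vol-XXXXXXXX -i i-XXXXXXXX -d /dev/sdf

# Then, on the instance itself: mount the volume and dig in.
sudo mkdir -p /mnt/yri-trio
sudo mount /dev/sdf /mnt/yri-trio
ls /mnt/yri-trio
```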

— Jeff;

AWS Workshops in Beijing, Bangalore and Chennai

I will be in China and India starting next week. Apart from other meetings and presentations to user groups, this time I will be running 3-hour workshops. These workshops are targeted at architects and technical decision makers, and attendees will get a chance to play with core AWS infrastructure services.

If you are a System Integrator, Independent Software Vendor, Enterprise Architect, or an entrepreneur, this will be a great opportunity to meet and learn more about AWS.

Seats are limited and prior registration is required:

AWS Workshop in Beijing
Oct 24th
(in conjunction with CSDN conference)

AWS Workshop in Bangalore
Nov 4th
(in conjunction with Business Technology Summit)

AWS Workshop in Chennai
Nov 10th
(in conjunction with AWS Chennai Enthusiasts group)

Details of the Workshop


Learn how to create an AWS account, understand the SOAP, REST, and Query APIs, and learn how to use tools like the AWS Management Console.

  • Amazon EC2 – Create, bundle, and launch an AMI; set up Amazon EBS volumes and Elastic IP addresses
  • Amazon S3 – Buckets, objects, and ACLs; Amazon CloudFront distributions
  • Amazon SQS – Queues
  • Amazon SimpleDB – Domains, items, attributes, and querying
  • Amazon Elastic MapReduce – Job flows and Map/Reduce

Architect a Web application in the Cloud (to be discussed in class)

Learn how to build highly scalable applications in the cloud. In this session, you will learn about best practices, tips, tricks, and techniques for leveraging the AWS cloud as a highly scalable infrastructure platform.
Learn a step-by-step approach to migrating your existing applications to the Cloud environment. This blueprint will help enterprise architects perform a cloud assessment, select the right candidate application for a proof-of-concept project, and leverage real Cloud benefits like auto-scaling and low-cost business continuity. Jinesh will discuss migration plans and reference architectures for various examples, scenarios, and use cases.

A laptop is required.
Knowledge of the Java language is preferred.

See you at the workshop!

– Jinesh

SecondTeacher – Scalable Math Homework Help in the Cloud

Despite the fact that I have written over 800 posts for this blog, I never know what to expect in terms of traffic or attention from any given post. Last Friday afternoon I spent an hour or two putting together a relatively quick post to remind everyone that it is possible to use Amazon SimpleDB for free.

Driven by my initial tweet and a discussion on Hacker News, word spread quickly and traffic to the post was quite robust, especially for a Friday evening!

I’m a big believer in the ClueTrain and its perspective-changing observation that “Markets are conversations.” Blog posts are a great way to get conversations started. Several people took the time to leave comments on the post. A comment from Robert Doyle of SecondTeacher caught my eye, and I asked him if he could tell me a bit more about his use of SimpleDB. He sent me a wonderful summary of his experience to date with the site and with AWS. I’ll do my best to summarize what he told me.


Robert is a one-time OS/2 developer who now writes Windows code. When he was designing SecondTeacher he decided to offload the heavy lifting to someone else so that he and his brother could focus on the site instead of on the infrastructure.

Robert and his brother started SecondTeacher a year or so ago, with the idea that they could make math easier to understand for kids in school. They wanted to make the service inexpensive (currently $25 per year) and accessible. They realized this goal by creating a 10-minute instructional video for each chapter of a number of popular textbooks. A certified teacher explains the concepts and works the problems using a whiteboard.

Traffic to the site is variable yet very predictable. Peak traffic occurs between 3 PM and 9 PM on weekdays, with very low load at other times. Robert told me that he keeps one EC2 instance running in off hours and scales to twenty during the daily peak. Over the course of a week he has an average of three instances running.
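As a back-of-the-envelope check, here's what that schedule means for instance-hours. This sketch assumes a simplified step function (an instant jump from 1 to 20 instances), which is not exactly how a real fleet behaves:

```python
# Hypothetical weekly schedule, simplified from the description above:
# 20 instances during the 6-hour weekday peak, 1 instance the rest of the time.
peak_hours = 6 * 5                    # 3 PM - 9 PM, Monday through Friday
off_hours = 24 * 7 - peak_hours       # all remaining hours of the week
instance_hours = 20 * peak_hours + 1 * off_hours
average = instance_hours / (24 * 7)
print(instance_hours, round(average, 1))   # 738 4.4
```

The step-function assumption overstates things; a fleet that ramps up and down gradually, and doesn't always need the full twenty instances, lands closer to the average of three that Robert reports.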

The site uses a number of AWS services! The video content is stored in Amazon S3 and distributed world-wide using Amazon CloudFront. Elastic Load Balancing is used to distribute incoming traffic across a variable number of EC2 instances.

User data, content management information, and session state are all stored in Amazon SimpleDB. The session state (a string) is retrieved on each page request and averages 200 bytes per user. Robert said that all of his usage to date has remained within the free level! He also told me:

We really have used SimpleDB as a traditional database and we have found it to be easily as reliable as roll-your-own databases, with little or no maintenance required. Initially we found the variable number of attributes in a domain a bit off-putting, but we have grown to love it and use it extensively.
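That "variable number of attributes" is the schema-less heart of SimpleDB: two items in the same domain need not share any attributes. The toy model below illustrates the idea in plain Python; it is not the SimpleDB API itself (which at the time was usually driven through a library such as boto), and all the item names and attributes are made up:

```python
# Toy model of a SimpleDB domain: item names map to attribute dicts,
# and different items may carry completely different attribute sets.
domain = {}

def put_attributes(item_name, attrs):
    domain.setdefault(item_name, {}).update(attrs)

def get_attributes(item_name):
    return domain.get(item_name, {})

# A session item and a user item coexist in one domain with different shapes.
put_attributes("session-42", {"state": "chapter=7;page=3", "user": "bob"})
put_attributes("user-bob", {"plan": "annual", "videos_watched": "12"})

print(get_attributes("session-42")["state"])   # chapter=7;page=3
```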

You can read Robert’s recent post, Building a Scalable Website, to learn even more about the site’s architecture.

As a parent who has spent way too much time over the years trying to understand enough “new math” to help my children with their homework, I think this sounds like a really valuable tool.

— Jeff;

Lower Prices for EC2 Windows Instances using Authentication Services

We’ve removed the distinction between Amazon EC2 running Windows and Amazon EC2 running Windows with Authentication Services, allowing all of our Windows instances to make use of Authentication Services such as LDAP, RADIUS, and Kerberos. With this change, any Windows instance can host a Domain Controller or join an existing domain. File sharing services such as SMB between instances will now automatically default to SMB-over-TCP in all cases, and will also be able to negotiate more secure authentication.

Existing Windows with Authentication Services instances will now be charged the same price as Windows instances, a savings of 50% on the hourly rate. All newly launched instances will be charged the new, lower price (starting at 12.5 cents per hour for a 32-bit instance in the US). Applications requiring logins can now be run on the Amazon EC2 running Windows AMIs.

As a result of these changes, our Windows AMI lineup now looks like this:

  • US:
    • Amazon EC2 running Windows (32 bit) – English.
    • Amazon EC2 running Windows (64 bit) – English.
    • Amazon EC2 running Windows With SQL Server (64 bit) – English.
  • Europe:
    • Amazon EC2 running Windows (32 bit) – English, German, French, Spanish, Italian.
    • Amazon EC2 running Windows (64 bit) – English, German, French, Spanish, Italian.
    • Amazon EC2 running Windows With SQL Server (64 bit) – English, German, French, Spanish, Italian.

If you are using Amazon DevPay in conjunction with Amazon EC2 running Windows with Authentication Services you will need to create new AMIs and adjust your pricing plan before November 1, 2009.

We continue to strive for simplicity and cost effectiveness; this is a good example of both!

— Jeff;

PS – I know that a lot of you have been asking us to support Windows Server 2008. I don’t have a release date for you yet, but I can assure you that we’ve prioritized the work needed to properly support it.

Bioinformatics, Genomes, EC2, and Hadoop

I think it is really interesting to see how breakthroughs and process improvements in one scientific or technical discipline can drive that discipline forward while also enabling progress in other seemingly unrelated disciplines.

The Bioinformatics field is rife with examples of this pattern. Declining hardware costs, cloud computing, the ability to do parallel processing, and algorithmic advances have driven down the cost and time of gene sequencing by multiple orders of magnitude in the space of a decade or two. Processing that was once measured in years and megabucks is now denominated in hours and dollars.

My colleague Deepak Singh pointed out a number of recent AWS-related developments in this space:

JCVI Cloud Bio-Linux

Built on top of a 64-bit Ubuntu distribution, the JCVI Cloud Bio-Linux gives scientists the ability to launch EC2 instances chock-full of the latest bioinformatics packages including BLAST (Basic Local Alignment Search Tool), glimmer (Microbial Gene-Finding System), hmmer (Biosequence Analysis Using Profile Hidden Markov Models), phylip (Phylogeny Inference Package), rasmol (Molecular Visualization), genespring (statistical analysis, data mining, and visualization tools), clustalw (general-purpose multiple sequence alignment), the Celera Assembler (de novo whole-genome shotgun DNA sequence assembler), and the EMBOSS utilities. The Celera Assembler can be used to assemble entire bacterial genome sequences on Amazon EC2 today!

There’s a getting-started guide for the JCVI AMI. Graphical and command-line bioinformatics tools can be launched from a shell window connected to a running instance of the AMI.


CloudBurst

CloudBurst is described as a “new parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping, and personal genomics.”

In layman’s terms, CloudBurst uses Hadoop to implement a linearly scalable search tool. Once loaded with a reference genome, it maps “short reads” (snippets of sequenced DNA approximately 30 base pairs long) to a location (or locations) on the reference genome. Think of it as a very advanced form of string matching, with support for partial matches, insertions, deletions, and subtle differences. This is a highly parallelizable operation; CloudBurst reduces operations involving millions of short reads from hours to minutes when run on a large-scale cluster of EC2 instances.
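CloudBurst's real contribution is distributing a seed-and-extend alignment across a Hadoop cluster; the sketch below shows only the core matching idea in miniature. It is a naive scan that reports every position where a read aligns to the reference with at most k substitutions (no indels), which is fine for a toy but hopeless at genome scale without the parallel approach the paper describes:

```python
def map_read(reference: str, read: str, max_mismatches: int = 2):
    """Return all positions where `read` aligns to `reference` with
    at most `max_mismatches` substitutions (insertions/deletions ignored)."""
    hits = []
    for i in range(len(reference) - len(read) + 1):
        window = reference[i:i + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits

ref = "ACGTACGTGGCTAACGT"
print(map_read(ref, "ACGT", 0))   # exact hits: [0, 4, 13]
print(map_read(ref, "ACGA", 1))   # within one substitution: [0, 4, 13]
```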

You can read more about CloudBurst in the research paper. This paper includes benchmarks of CloudBurst on EC2 along with performance and scaling information.



Crossbow

Crossbow was built to do “Whole Genome Resequencing in the Clouds.” It combines Bowtie for ultra-fast short read alignment and SOAPsnp for sequence assembly and high quality SNP calling. The Crossbow home page claims that it can sequence an entire genome in an afternoon on EC2, for less than $250. Crossbow is so new that the papers and the code distribution are still a little ways off. There’s a lot of good information in this poster:

Michael Schatz (the principal author of CloudBurst and Bowtie) wrote a really interesting note on Hadoop for Computational Biology. He states that “CloudBurst is just the beginning of the story, not the end,” and endorses the Map/Reduce model for processing 100+ GB datasets. I will echo Mike’s conclusion to wrap up this somewhat long post:

In short, there is no shortage of opportunities for utilizing MapReduce/Hadoop for computational biology, so if your users are skeptical now, I just ask that they are patient for a little bit longer and reserve judgment on MapReduce/Hadoop until we can publish a few more results.
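To make the Map/Reduce fit concrete: counting k-mers (length-k substrings) across a pile of reads is a classic first job of this kind. Here is a minimal single-process sketch of the map/shuffle/reduce shape; no Hadoop involved, and the reads are made-up toy data:

```python
from collections import defaultdict

def map_phase(read, k=3):
    # Map step: emit a (k-mer, 1) pair for every k-mer in the read.
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def reduce_phase(pairs):
    # Shuffle + reduce step: group pairs by key and sum the counts,
    # which is what the Hadoop framework does for you at scale.
    counts = defaultdict(int)
    for kmer, n in pairs:
        counts[kmer] += n
    return dict(counts)

reads = ["ACGTAC", "GTACGT"]
pairs = [pair for read in reads for pair in map_phase(read)]
print(reduce_phase(pairs))   # {'ACG': 2, 'CGT': 2, 'GTA': 2, 'TAC': 2}
```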

I really learned a lot while putting this post together and I hope that you will learn something by reading it. If you are using EC2 in a bioinformatics context, I’d love to hear from you. Leave a comment or send me some mail.

— Jeff;

New Public Data Set: Wikipedia XML Data

Weighing in at a whopping 500 GB (388 GB of data and 112 GB of free space to allow for some in-place decompression), the Wikipedia XML data is our newest Public Data Set.

This data set contains all of the Wikimedia wikis in the form of wikitext source and metadata embedded in XML. We’ll be updating this data set every month and we’ll keep the sets for the previous three months around.

As you can see from this screen shot of my PuTTY window, there are some pretty beefy files in this data set:

As an example of what can be done with this data, take a look at Cloudera’s blog post on Grouping Related Trends with Hadoop and Hive. This article shows how to create a trend tracking site using a Cloudera Hadoop cluster running on EC2, using Apache Hive queries to process the data.

— Jeff;

New Public Data Set: Daily Global Weather

The folks at Infochimps have just released the Daily Global Weather Public Data Set.

This 20 GB data set incorporates daily weather measurements (temperature, dew point, wind speed, humidity, barometric pressure, and so forth) from over 9000 weather stations around the world. The data was originally collected as part of the Global Surface Summary of the Day (GSOD) by the National Climatic Data Center and is available from 1929 to the present, with the data from 1973 to the present being the most complete.

The map at right contains one yellow dot for each data collection station.

— Jeff;

New Public Data Set: Sloan Digital Sky Survey DR6 Subset

The Sloan Digital Sky Survey, or SDSS, is now available as a Public Data Set.

Weighing in at 180 GB, the SDSS is the most ambitious astronomical survey ever undertaken. The researchers have used a 2.5 meter, 120 megapixel telescope located in Apache Point, New Mexico to capture images of over one quarter of the sky, or about 230 million celestial objects. They have also created 3-dimensional maps containing more than 930,000 galaxies and 120,000 quasars.

This new public data set (which is a subset of the entire SDSS) will be of interest to students, educators, hobby astronomers, and researchers. From a standing start, it is possible to launch an EC2 instance, create an Elastic Block Store volume with this data, attach the volume to the instance and start examining and processing the data in less than ten minutes.

The data set takes the form of a Microsoft SQL Server MDF file. Once you have created your EBS volume and attached it to your Windows EC2 instance, you can access the data using SQL Server Enterprise Manager or SQL Server Management Studio. The SDSS makes use of stored procedures, user defined functions, and a spatial indexing library, so porting it to another database would be a fairly complex undertaking.

I know from experience (my son Andy is studying Astronomy at the University of Washington and is always showing me the “please delete your unnecessary files” emails from the department’s administrator) that storage space is always at a premium in academic settings, due in part to the existence of large scale data sets like this. The combination of EC2, EBS, this public data set, and our AWS in Education program should enable students and educators to analyze, process, display, and study the universe in revolutionary ways.

— Jeff;

Shared Snapshots for EC2’s Elastic Block Store Volumes

Today we are adding a new feature which significantly improves the flexibility of EC2’s Elastic Block Store (EBS) snapshot facility. You now have the ability to share your snapshots with other EC2 customers using a new set of fine-grained access controls. You can keep the snapshot to yourself (the default), share it with a list of EC2 customers, or share it publicly. Here’s a visual overview of the data flow (in this diagram, the word Partner refers to anyone that you choose to share your data with):

The Amazon Elastic Block Store lets you create block storage volumes in sizes ranging from 1 GB to 1 TB. You can create empty volumes or you can pre-populate them using one of our Public Data Sets. Once created, you attach each volume to an EC2 instance and then reference it like any other file system. The new volumes are ready in seconds. Last week I created a 180 GB volume from a Public Data Set, attached it to my instance, and started examining it, all in about 15 seconds.

You can use the AWS Management Console, the command line tools, or the EC2 API to create a snapshot backup of an EBS volume at any time. The snapshots are stored in Amazon S3. Once created, a snapshot can be used to create a new EBS volume in the same AWS region. Sharing these snapshots, as we are now letting you do, makes it possible for other users to create an identical copy of the volume.

The new ModifySnapshotAttribute function gives you the ability to set and change the createPermission attribute on any of your snapshots. We’ve also added the ResetSnapshotAttribute function to clear snapshot attributes and the DescribeSnapshotAttribute function to get the value of a particular attribute.
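The same operations were also exposed through the EC2 API command-line tools of the day. The commands below are a sketch: the snapshot ID and account number are placeholders, and the exact flag names should be checked against the tools' `--help` output:

```shell
# Share snapshot snap-XXXXXXXX (placeholder ID) with a specific AWS account
# by adding a createVolumePermission entry for that account number.
ec2-modify-snapshot-attribute snap-XXXXXXXX -c --add 111122223333

# Inspect the snapshot's current createVolumePermission attribute.
ec2-describe-snapshot-attribute snap-XXXXXXXX -c

# Clear all sharing, returning the snapshot to its private default.
ec2-reset-snapshot-attribute snap-XXXXXXXX -c
```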

The DescribeSnapshots function now lists all of the snapshots that have been shared with you. You can also use this function to retrieve a list of all of our Public Data Sets.

You can also modify snapshot permissions using the AWS Management Console:

How can you use this? Off the top of my head, here are a number of ideas:

  1. If you are a teacher or professor, create and share a volume of reference data for use in a classroom setting (and take a look at the AWS in Education program too).
  2. If you are a researcher, share your data and your results with your colleagues, both within your own organization and at other organizations.
  3. If you are a developer, share your development and test environments with your teammates. Snapshot the environments before each release to make it easy to regenerate the environment later for regression tests.
  4. If you are a business, you can use snapshots to store data internally, with external clients, or with partners. This could be reference data, results of a lengthy and expensive computation, a set of test cases (and expected results) or even a set of pre-populated database tables.

I’m sure you have some ideas of your own; please feel free to share them in a comment!

Update: Shlomo Swidler posted some really good ideas in his Cloud Developer Tips blog.

As is often the case with AWS, we’ll use this new feature as the basis for even more functionality later.

— Jeff;

Now In Europe: Amazon SimpleDB, CloudWatch, Auto Scaling, and Elastic Load Balancing

I’m happy to announce that the following AWS services are now available in Europe:

  • Amazon SimpleDB – Highly available and scalable, low/no administration structured data storage.
  • Amazon CloudWatch – Monitoring for the AWS cloud, starting with providing resource consumption (CPU utilization, network traffic, and disk I/O) for EC2 instances.
  • Elastic Load Balancing – Traffic distribution across multiple EC2 instances.
  • Auto Scaling – Automated scaling of EC2 instances based on rules that you define.

All of the services work just the same way in Europe as they do in the US. Existing applications and management tools should be able to access the services in this region after a simple change of the service endpoint. As is the case with S3 and EC2, these services are independent of their US counterparts.
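For example, a SimpleDB client would switch from the US endpoint to the European one with a one-line change. The hostnames below are the regional endpoints as I understand them; treat them as illustrative and verify against the endpoint documentation:

```python
# Regional service endpoints (illustrative; verify against the AWS docs).
ENDPOINTS = {
    ("simpledb", "us-east-1"): "sdb.amazonaws.com",
    ("simpledb", "eu-west-1"): "sdb.eu-west-1.amazonaws.com",
}

def endpoint(service: str, region: str) -> str:
    return ENDPOINTS[(service, region)]

# Same API calls, same credentials -- only the endpoint changes.
print(endpoint("simpledb", "eu-west-1"))   # sdb.eu-west-1.amazonaws.com
```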

Our full slate of infrastructure services is now available in Europe. With the European debut of these services, developers can now build reliable and scalable applications in both of the AWS regions (US and Europe).

— Jeff;