Category: Amazon S3


Amazon Redshift Spectrum – Exabyte-Scale In-Place Queries of S3 Data

by Jeff Barr | in Amazon Redshift, Amazon S3

Now that we can launch cloud-based compute and storage resources with a couple of clicks, the challenge is to use these resources to go from raw data to actionable results as quickly and efficiently as possible.

Amazon Redshift allows AWS customers to build petabyte-scale data warehouses that unify data from a variety of internal and external sources. Because Redshift is optimized for complex queries (often involving multiple joins) across large tables, it can handle large volumes of retail, inventory, and financial data without breaking a sweat. Once the data is loaded, our customers can make use of a plethora of enterprise reporting and business intelligence tools provided by the Redshift Partners.

One of the most challenging aspects of running a data warehouse involves loading data that is continuously changing and/or arriving at a rapid pace. In order to provide great query performance, loading data into a data warehouse includes compression, normalization, and optimization steps. While these steps can be automated and scaled, the loading process introduces overhead and complexity, and also gets in the way of those all-important actionable results.

Data formats present another interesting challenge. Some applications will process the data in its original form, outside of the data warehouse. Others will issue queries to the data warehouse. This model leads to storage inefficiencies because the data must be stored twice, and can also mean that results from one form of processing may not align with those from another due to delays introduced by the loading process.

Amazon Redshift Spectrum
In order to allow you to process your data as-is, where-is, while taking advantage of the power and flexibility of Amazon Redshift, we are launching Amazon Redshift Spectrum. You can use Spectrum to run complex queries on data stored in Amazon Simple Storage Service (S3), with no need for loading or other data prep.

You simply create a data source and issue your queries to your Redshift cluster as usual. Behind the scenes, Spectrum scales to thousands of instances on a per-query basis, ensuring that you get fast, consistent performance even as your data set grows up to and beyond an exabyte! Being able to query data stored in S3 means that you can scale your compute and your storage independently, with the full power of the Redshift query model and all of the reporting and business intelligence tools at your disposal. Your queries can reference any combination of data stored in Redshift tables and in S3.

When you issue a query, Redshift rips it apart and generates a query plan that minimizes the amount of S3 data that will be read, taking advantage of both column-oriented formats and data that is partitioned by date or another key. Then Redshift requests Spectrum workers from a large, shared pool and directs them to project, filter, and aggregate the S3 data. The final processing is performed within the Redshift cluster and the results are returned to you.

Because Spectrum operates on data that is stored in S3, you can process the data using other AWS services such as Amazon EMR and Amazon Athena. You can also implement hybrid models where frequently queried data is kept in Redshift local storage and the rest in S3, or where dimension tables are in Redshift along with the recent portions of the fact tables, with older data in S3. In order to drive even higher levels of concurrency, you can point multiple Redshift clusters at the same stored data.

Spectrum supports open, common data formats including CSV/TSV, Parquet, SequenceFile, and RCFile. Files can be compressed using GZip or Snappy, with other formats and compression methods in the works.

Spectrum in Action
In order to get some first-hand experience with Spectrum I loaded up a sample data set and ran some queries!

I started by creating an external schema and database:

Then I created an external table within the database:

I ran a simple query to get a feel for the size of the data set (6.1 billion rows):

And then I ran a query that processed all of the rows:
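The console screenshots from this walk-through aren’t reproduced here. As a rough sketch of what those steps look like when driven from Python with psycopg2 (the cluster endpoint, credentials, IAM role, bucket, and table layout below are placeholders, not the actual values from the demo):

import psycopg2

# Connect to the Redshift cluster (endpoint and credentials are placeholders).
conn = psycopg2.connect(host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
                        port=5439, dbname='dev', user='awsuser', password='...')
conn.autocommit = True  # external DDL cannot run inside a transaction block
cur = conn.cursor()

# 1 - External schema and external database in the data catalog (IAM role ARN is hypothetical).
cur.execute("""
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG DATABASE 'spectrumdb'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
""")

# 2 - External table that maps onto comma-delimited files in S3 (columns are illustrative).
cur.execute("""
    CREATE EXTERNAL TABLE spectrum.sales (
        saleid    BIGINT,
        saletime  TIMESTAMP,
        pricepaid DECIMAL(8,2))
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://my-spectrum-demo-bucket/sales/'
""")

# 3 - Queries go to the cluster as usual; Spectrum scans the S3 data behind the scenes.
cur.execute("SELECT COUNT(*) FROM spectrum.sales")
print(cur.fetchone()[0])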

As you can see, Spectrum was able to churn through all 6 billion rows in about 15 seconds. I checked my cluster’s performance metrics and it looked like I had enough CPU power to run many such queries simultaneously:

Available Now
Amazon Redshift Spectrum is available now and you can start using it today!

Spectrum pricing is based on the amount of data pulled from S3 during query processing and is charged at the rate of $5 per terabyte (you can save money by compressing your data and/or storing it in column-oriented form). You pay the usual charges to run your Redshift cluster and to store your data in S3, but there are no Spectrum charges when you are not running queries.

Jeff;

PS – Several people have asked about the relationship between Spectrum and Athena, and the applicability of both tools to different workloads. Fortunately, the newly updated Redshift FAQ addresses this question; see When should I use Amazon Athena vs. Redshift Spectrum? for more info.

 

S3 Storage Management Update – Analytics, Object Tagging, Inventory, and Metrics

by Jeff Barr | in Amazon S3

Today I would like to tell you about four S3 features that will give you detailed insights into your storage and your access patterns. You can see what and how much you are storing, how it is being used, and you can make informed decisions about the use of S3 storage classes as a result. These features will be of value to everyone who uses S3, whether they have tens, thousands, millions, or billions of objects in their buckets. Here’s an overview:

S3 Analytics – You can analyze the storage and retrieval patterns for your objects and use the results to choose the most appropriate storage class. You can inspect the results of the analysis from within the S3 Console, or you can load them into your favorite BI tool and dive deep. Either way, you now have the means to gain a deep understanding of your storage patterns and to see how they relate to usage and growth.

S3 Object Tagging – You can associate multiple key-value pairs (tags) with each of your S3 objects, with the ability to change them at any time. The tags can be used to manage and control access, set up S3 Lifecycle policies, customize the S3 Analytics, and filter the CloudWatch metrics. You can think of the bucket as a data lake, and use tags to create a taxonomy of the objects within the lake. This is more flexible than using the bucket and a prefix, and allows you to make semantic-style changes without renaming, moving, or copying objects.

S3 Inventory – You can speed up your business workflows and your big data jobs using S3 Inventory. This feature provides you with a CSV-formatted flat-file representation of the contents of all or part (as identified by a prefix) of a bucket, on a daily or weekly basis.

S3 CloudWatch Metrics – You can improve the performance of your S3-powered applications by monitoring and alarming on 13 new Amazon CloudWatch metrics.

Let’s dive in…

S3 Analytics
As an S3 user, you have your choice of three storage classes (Standard, Standard – Infrequent Access, and Glacier) and the ability to use S3’s Object Lifecycle Management feature to indicate when objects should be either expired or transitioned to Standard – Infrequent Access or Glacier.

The S3 Analytics feature gives you the information that you need to have in order to choose between Standard and Standard – Infrequent Access for your objects. Because many customers store several different types or categories of information in a single bucket for ease of processing, you have the ability to run the analytics on subsets (defined by a common prefix or tag value) of the objects in a bucket. Each subset is defined by a filter; each bucket can have up to 1000 filters. Here are the filters on my jbarr-public bucket:

Here’s how I define a simple filter that analyzes objects that are prefixed with www:

I could choose to filter by tag name and value instead. Here’s how I would analyze objects that have a tag named type with the value page:

Each filter can have at most one prefix, along with any desired number of tags. I can also choose to have the analytics exported to a file each day:

Once one or more filters are in place, the analytics are run on a daily basis and I can view them by simply clicking on the filter. For example, here’s what I see when I click on my Archival filter:

There’s a lot of good information in this analysis! I can see that:

  • Looking back 127 days, most of my objects that are older than 30 days are accessed infrequently. I can save money by creating a Lifecycle rule that will transition these objects to Standard – Infrequent Access storage.
  • At this point, the majority (6.39 PB) of my storage is in Standard storage; just a tiny fraction (3.24 GB) is in Standard – Infrequent Access storage (I don’t actually have that much data in my bucket! The S3 team was kind enough to load up my account with some test metrics that were, shall we say, very generous).
  • Over the 127 day observation period, between 16% and 30% of the Standard storage was retrieved on a per-day basis.
  • Objects between 30-45 days old, 45-60 days old, 60-90 days old, 90-180 days old, and over 180 days old are all infrequently accessed and good candidates for Standard – Infrequent Access storage.

You can set up storage class analysis from the Console, the CLI, or through the S3 APIs.
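For example, here is a minimal boto3 sketch of creating a prefix-based filter with a daily CSV export (the bucket names and export prefix are assumptions, not values taken from the console screenshots):

import boto3

s3 = boto3.client('s3')

# Analyze only the objects under the www prefix and export the daily results as CSV.
s3.put_bucket_analytics_configuration(
    Bucket='jbarr-public',
    Id='www',
    AnalyticsConfiguration={
        'Id': 'www',
        'Filter': {'Prefix': 'www'},
        'StorageClassAnalysis': {
            'DataExport': {
                'OutputSchemaVersion': 'V_1',
                'Destination': {
                    'S3BucketDestination': {
                        'Format': 'CSV',
                        'Bucket': 'arn:aws:s3:::jbarr-s3-inventory',  # destination bucket ARN (assumed)
                        'Prefix': 'analytics/www/',
                    },
                },
            },
        },
    },
)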

To learn more, read S3 Storage Class Analysis.

S3 Object Tagging
Tags supplement the existing location-based S3 object management model (buckets and prefixes) with the ability to manage storage based on what the object represents. For example, you can tag objects with the name of a department and then construct IAM policies that grant access based on the tag. This gives you a lot of flexibility, including the ability to effect changes by simply altering tags.

You can create, update, or delete tags during any part of the object’s life cycle. Tags can be referenced in IAM policies, S3 Lifecycle policies, and in the Storage Analytics filters that I just showed you. Each object can have up to ten tags and each version of an object has a distinct set of tags. You can use tags to manage the expiration of objects via a lifecycle policy and you can set up CloudWatch metrics that reflect the activity around a particular tag.

The Properties page for each object displays the current set of tags and allows me to edit them:

You can also set up and access tags from the CLI or through the S3 APIs. When you use either of these two methods, you must always work in terms of the full tag set. For example, if an object has four tags and you want to add a fifth one, you must read the current set, add the new one, and then update the entire set.
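A short boto3 sketch of that read-modify-write cycle (the bucket and key are made up for illustration):

import boto3

s3 = boto3.client('s3')
bucket, key = 'jbarr-public', 'www/index.html'  # hypothetical object

# The tagging API always replaces the full tag set, so read the current set first...
tag_set = s3.get_object_tagging(Bucket=bucket, Key=key)['TagSet']

# ...append the new tag locally...
tag_set.append({'Key': 'type', 'Value': 'page'})

# ...and then write the complete set back.
s3.put_object_tagging(Bucket=bucket, Key=key, Tagging={'TagSet': tag_set})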

Tags can be replicated across regions via Cross-Region Replication, but your IAM policy on the source side must enable the s3:GetObjectVersionTagging and s3:ReplicateTags actions.

Tags cost $0.01 per 10,000 tags per month. Requests that add or retrieve tags (PUT and GET, respectively) are charged at the usual rates.

To learn more, read about S3 Object Tagging.

S3 Inventory
S3’s LIST operation is synchronous, and returns up to 1000 objects at a time, along with a key that can be used to initiate a second LIST that picks up where the first one leaves off. While this works well for small to medium-sized buckets and single-threaded programs, it can be challenging to use in conjunction with huge buckets or multiple threads.

The S3 team told me that many of our customers store tens or hundreds of millions of objects in a single bucket, and that buckets with a billion or more objects are not unusual. These buckets are often processed daily or weekly as part of a larger workflow and can benefit from a more direct way to gain access to the list of objects in a bucket.

With S3 Inventory, you can now arrange to receive daily or weekly inventory reports for any of your buckets. You can use a prefix to filter the report and you can choose to include optional fields such as size, storage class, and replication status. Reports can be sent to an S3 bucket in your account or (with proper permission settings) in another account.

Here’s how I request a daily inventory report called WebStuff, for all objects that start with www:
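The console screenshot isn’t shown here; a rough boto3 equivalent of that configuration (the destination bucket, report prefix, and optional fields are assumptions based on the description above) looks something like this:

import boto3

s3 = boto3.client('s3')

s3.put_bucket_inventory_configuration(
    Bucket='jbarr-public',
    Id='WebStuff',
    InventoryConfiguration={
        'Id': 'WebStuff',
        'IsEnabled': True,
        'IncludedObjectVersions': 'Current',
        'Filter': {'Prefix': 'www'},             # inventory only the www objects
        'Schedule': {'Frequency': 'Daily'},
        'OptionalFields': ['Size', 'StorageClass', 'ReplicationStatus'],
        'Destination': {
            'S3BucketDestination': {
                'Bucket': 'arn:aws:s3:::jbarr-s3-inventory',  # destination bucket ARN
                'Format': 'CSV',
                'Prefix': 'DailyInventory',
            },
        },
    },
)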

I also need to give S3 permission to write to the destination bucket (jbarr-s3-inventory). Here’s the policy that I used (see Granting Permission for Amazon S3 Inventory and Amazon S3 Analytics to learn more):

The inventory mechanism creates three files: a manifest, a checksum for the manifest, and a data file. The manifest contains the location of the data file along with a checksum for it:

{
   "sourceBucket":"jbarr-public",
   "destinationBucket":"arn:aws:s3:::jbarr-s3-inventory",
   "version":"2016-11-30",
   "fileFormat":"CSV",
   "fileSchema":"Bucket, Key, Size, LastModifiedDate, ETag, StorageClass",
   "files":[
      {
         "key":"jbarr-public/DailyInventory/data/cf1da322-f52b-4c61-802a-b5c14f4f32e2.csv.gz",
         "size":2270,
         "MD5checksum":"be6d0eb3f9c4f497fe40658baa5a2e7c"
      }
   ]
}

Here’s what the data file looks like after it has been unzipped:

If you are using the inventory reports to power your daily or weekly processing workflow, don’t forget about S3 Notifications! The checksum file is written after the other two, so you can safely use it to get things moving. Also, don’t forget to use a lifecycle policy to manage your collection of inventory files since they will accumulate over time.

As an added bonus, using daily or weekly reports can save you up to 50%, when compared to multiple LIST operations. To learn more about this feature, read about S3 Storage Inventory.

S3 CloudWatch Metrics
S3 can now publish storage, request, and data transfer metrics to CloudWatch. The storage metrics are reported daily and are available at no extra cost. The request and data transfer metrics are available at one minute intervals (after some processing latency) and are billed at the standard CloudWatch rate. As is the case for the S3 Analytics, the CloudWatch metrics can be collected and reported for an entire bucket or for a subset selected via prefix or tags.

Here is the full set of metrics:

Storage
  • Bucket Size Bytes
  • Number Of Objects

Requests
  • ALL
  • GET
  • PUT
  • POST
  • LIST
  • HEAD
  • DELETE
  • 4xx Errors
  • 5xx Errors
  • First Byte Latency
  • Total Request Latency

Data Transfer
  • Bytes Uploaded
  • Bytes Downloaded

The metrics are available within the S3 and CloudWatch consoles. Here’s what the Total Request Latency looks like in the S3 Console for my bucket:

I can also click on View in CloudWatch and set CloudWatch alarms for any desired metric. Perhaps I want to be notified if my bucket grows beyond 100 MB:
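In code, a minimal boto3 sketch of that alarm (the SNS topic ARN is a placeholder) could look like this:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm when the daily BucketSizeBytes metric for Standard storage exceeds 100 MB.
cloudwatch.put_metric_alarm(
    AlarmName='jbarr-public-over-100MB',
    Namespace='AWS/S3',
    MetricName='BucketSizeBytes',
    Dimensions=[
        {'Name': 'BucketName', 'Value': 'jbarr-public'},
        {'Name': 'StorageType', 'Value': 'StandardStorage'},
    ],
    Statistic='Average',
    Period=86400,                    # storage metrics are reported once per day
    EvaluationPeriods=1,
    Threshold=100 * 1024 * 1024,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:bucket-alerts'],  # placeholder topic
)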

To learn more, read How Do I Configure Request Metrics for an S3 Bucket?

Available Now
You have had access to these features since late last year! They are accessible through the updated S3 Console, which also includes many other new features (watch Introduction to the New Amazon S3 Console to learn more).

Jeff;

 

AWS Storage Update – S3 & Glacier Price Reductions + Additional Retrieval Options for Glacier

by Jeff Barr | in Amazon Glacier, Amazon S3, Launch, Price Reduction

Back in 2006, we launched S3 with a revolutionary pay-as-you-go pricing model, with an initial price of 15 cents per GB per month. Over the intervening decade, we reduced the price per GB by 80%, launched S3 in every AWS Region, and enhanced the original one-size-fits-all model with user-driven features such as web site hosting, VPC integration, and IPv6 support, while adding new storage options including S3 Infrequent Access.

Because many AWS customers archive important data for legal, compliance, or other reasons and reference it only infrequently, we launched Glacier in 2012, and then gave you the ability to transition data between S3, S3 Infrequent Access, and Glacier by using lifecycle rules.

Today I have two big pieces of news for you: we are reducing the prices for S3 Standard Storage and for Glacier storage. We are also introducing additional retrieval options for Glacier.

S3 & Glacier Price Reduction
As long-time AWS customers already know, we work relentlessly to reduce our own costs, and to pass the resulting savings along in the form of a steady stream of AWS Price Reductions.

We are reducing the per-GB price for S3 Standard Storage in most AWS regions, effective December 1, 2016. The bill for your December usage will automatically reflect the new, lower prices. Here are the new prices for Standard Storage:

Prices below are $ / GB / Month for the 0-50 TB, 51-500 TB, and 500+ TB tiers, respectively.

  • US East (Northern Virginia)
  • US East (Ohio)
  • US West (Oregon)
  • EU (Ireland)

$0.0230 / $0.0220 / $0.0210 (reductions range from 23.33% to 23.64%)

  • US West (Northern California)

$0.0260 / $0.0250 / $0.0240 (reductions range from 20.53% to 21.21%)

  • EU (Frankfurt)

$0.0245 / $0.0235 / $0.0225 (reductions range from 24.24% to 24.38%)

  • Asia Pacific (Singapore)
  • Asia Pacific (Tokyo)
  • Asia Pacific (Sydney)
  • Asia Pacific (Seoul)
  • Asia Pacific (Mumbai)

$0.0250 / $0.0240 / $0.0230 (reductions range from 16.36% to 28.13%)

As you can see from the table above, we are also simplifying the pricing model by consolidating six pricing tiers into three new tiers.

We are also reducing the price of Glacier storage in most AWS Regions. For example, you can now store 1 GB for 1 month in the US East (Northern Virginia), US West (Oregon), or EU (Ireland) Regions for just $0.004 (less than half a cent) per month, a 43% decrease. For reference purposes, this amount of storage cost $0.010 when we launched Glacier in 2012, and $0.007 after our last Glacier price reduction (a 30% decrease).

The lower pricing is a direct result of the scale that comes about when our customers trust us with trillions of objects, but it is just one of the benefits. Based on the feedback that I get when we add new features, the real value of a cloud storage platform is the rapid, steady evolution. Our customers often tell me that they love the fact that we anticipate their needs and respond with new features accordingly.

New Glacier Retrieval Options
Many AWS customers use Amazon Glacier as the archival component of their tiered storage architecture. Glacier allows them to meet compliance requirements (either organizational or regulatory) while allowing them to use any desired amount of cloud-based compute power to process and extract value from the data.

Today we are enhancing Glacier with two new retrieval options for your Glacier data. You can now pay a little bit more to expedite your data retrieval. Alternatively, you can indicate that speed is not of the essence and pay a lower price for retrieval.

We launched Glacier with a pricing model for data retrieval that was based on the amount of data that you had stored in Glacier and the rate at which you retrieved it. While this was an accurate reflection of our own costs to provide the service, it was somewhat difficult to explain. Today we are replacing the rate-based retrieval fees with simpler per-GB pricing.

Our customers in the Media and Entertainment industry archive their TV footage to Glacier. When an emergent situation calls for them to retrieve a specific piece of footage, minutes count and they want fast, cost-effective access to the footage. Healthcare customers are looking for rapid, “while you wait” access to archived medical imagery and genome data; photo archives and companies selling satellite data turn out to have similar requirements. On the other hand, some customers have the ability to plan their retrievals ahead of time, and are perfectly happy to get their data in 5 to 12 hours.

Taking all of this into account, you can now select one of the following options for retrieving your data from Glacier (the original rate-based retrieval model is no longer applicable):

Standard retrieval is the new name for what Glacier already provides, and is the default for all API-driven retrieval requests. You get your data back in a matter of hours (typically 3 to 5), and pay $0.01 per GB along with $0.05 for every 1,000 requests.

Expedited retrieval addresses the need for “while you wait” access. You can get your data back quickly, with retrieval typically taking 1 to 5 minutes. If you store (or plan to store) more than 100 TB of data in Glacier and need to make infrequent, yet urgent requests for subsets of your data, this is a great model for you (if you have less data, S3’s Infrequent Access storage class can be a better value). Retrievals cost $0.03 per GB and $0.01 per request.

Retrieval generally takes between 1 and 5 minutes, depending on overall demand. If you need to get your data back in this time frame even in rare situations where demand is exceptionally high, you can provision retrieval capacity. Once you have done this, all Expedited retrievals will automatically be served via your Provisioned capacity. Each unit of Provisioned capacity costs $100 per month and ensures that you can perform at least 3 Expedited Retrievals every 5 minutes, with up to 150 MB/second of retrieval throughput.

Bulk retrieval is a great fit for planned or non-urgent use cases, with retrieval typically taking 5 to 12 hours at a cost of $0.0025 per GB (75% less than for Standard Retrieval) along with $0.025 for every 1,000 requests. Bulk retrievals are perfect when you need to retrieve large amounts of data within a day, and are willing to wait a few extra hours in exchange for a very significant discount.

If you do not specify a retrieval option when you call InitiateJob to retrieve an archive, a Standard Retrieval will be initiated. Your existing jobs will continue to work as expected, and will be charged at the new rate.
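For reference, here is a hedged boto3 sketch of requesting an Expedited retrieval; the vault name and archive ID are placeholders:

import boto3

glacier = boto3.client('glacier')

# Omitting 'Tier' gives a Standard retrieval; 'Expedited' and 'Bulk' are the new options.
job = glacier.initiate_job(
    accountId='-',                     # '-' means the account that owns the credentials
    vaultName='my-archive-vault',
    jobParameters={
        'Type': 'archive-retrieval',
        'ArchiveId': 'EXAMPLE_ARCHIVE_ID',
        'Tier': 'Expedited',
    },
)
print(job['jobId'])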

To learn more, read about Data Retrieval in the Glacier FAQ.

As always, I am thrilled to be able to share this news with you, and I hope that you are equally excited!

If you want to learn more, we have a webinar coming up on December 12th. Register here.

Jeff;

 

CloudTrail Update – Capture and Process Amazon S3 Object-Level API Activity

by Jeff Barr | in Amazon CloudTrail, Amazon CloudWatch, Amazon S3, AWS Lambda

I would like to show you how several different AWS services can be used together to address a challenge faced by many of our customers. Along the way I will introduce you to a new AWS CloudTrail feature that launches today and show you how you can use it in conjunction with CloudWatch Events.

The Challenge
Our customers store many different types of mission-critical data in Amazon Simple Storage Service (S3) and want to be able to track object-level activity on their data. While some of this activity is captured and stored in the S3 access logs, the level of detail is limited and log delivery can take several hours. Customers, particularly in financial services and other regulated industries, are asking for additional detail, delivered on a more timely basis. For example, they would like to be able to know when a particular IAM user accesses sensitive information stored in a specific part of an S3 bucket.

In order to meet the needs of these customers, we are now giving CloudTrail the power to capture object-level API activity on S3 objects, which we call Data events (the original CloudTrail events are now called Management events). Data events include “read” operations such as GET, HEAD, and Get Object ACL as well as “write” operations such as PUT and POST. The level of detail captured for these operations is intended to provide support for many types of security, auditing, governance, and compliance use cases. For example, it can be used to scan newly uploaded data for Personally Identifiable Information (PII), audit attempts to access data in a protected bucket, or to verify that the desired access policies are in effect.

Processing Object-Level API Activity
Putting this all together, we can easily set up a Lambda function that will take a custom action whenever an S3 operation takes place on any object within a selected bucket or a selected folder within a bucket.

Before starting on this post, I created a new CloudTrail trail called jbarr-s3-trail:

I want to use this trail to log object-level activity on one of my S3 buckets (jbarr-s3-trail-demo). In order to do this I need to add an event selector to the trail. The selector is specific to S3, and allows me to focus on logging the events that are of interest to me. Event selectors are a new CloudTrail feature and are being introduced as part of today’s launch, in case you were wondering.

I indicate that I want to log both read and write events, and specify the bucket of interest. I can limit the events to part of the bucket by specifying a prefix, and I can also specify multiple buckets. I can also control the logging of Management events:

CloudTrail supports up to 5 event selectors per trail. Each event selector can specify up to 50 S3 buckets and optional bucket prefixes.
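A minimal boto3 sketch of that configuration (using the trail and bucket names from this walk-through) might be:

import boto3

cloudtrail = boto3.client('cloudtrail')

# Log read and write object-level (Data) events for one bucket, plus Management events.
cloudtrail.put_event_selectors(
    TrailName='jbarr-s3-trail',
    EventSelectors=[
        {
            'ReadWriteType': 'All',
            'IncludeManagementEvents': True,
            'DataResources': [
                {
                    'Type': 'AWS::S3::Object',
                    'Values': ['arn:aws:s3:::jbarr-s3-trail-demo/'],  # trailing slash = all objects in the bucket
                },
            ],
        },
    ],
)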

I set this up, opened my bucket in the S3 Console, uploaded a file, and took a look at one of the entries in the trail. Here’s what it looked like:

{
  "eventVersion": "1.05",
  "userIdentity": {
    "type": "Root",
    "principalId": "99999999999",
    "arn": "arn:aws:iam::99999999999:root",
    "accountId": "99999999999",
    "username": "jbarr",
    "sessionContext": {
      "attributes": {
        "creationDate": "2016-11-15T17:55:17Z",
        "mfaAuthenticated": "false"
      }
    }
  },
  "eventTime": "2016-11-15T23:02:12Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "PutObject",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "72.21.196.67",
  "userAgent": "[S3Console/0.4]",
  "requestParameters": {
    "X-Amz-Date": "20161115T230211Z",
    "bucketName": "jbarr-s3-trail-demo",
    "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
    "storageClass": "STANDARD",
    "cannedAcl": "private",
    "X-Amz-SignedHeaders": "Content-Type;Host;x-amz-acl;x-amz-storage-class",
    "X-Amz-Expires": "300",
    "key": "ie_sb_device_4.png"
  }
}

Then I create a simple Lambda function:
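The function itself isn’t reproduced in this copy of the post; a minimal Python handler along these lines (it simply logs the bucket, key, and event name carried in the CloudTrail record that CloudWatch Events delivers) would be enough for a demo like this:

import json

def lambda_handler(event, context):
    # CloudWatch Events wraps the CloudTrail record in the 'detail' field.
    detail = event.get('detail', {})
    params = detail.get('requestParameters', {})
    print('S3 object event: s3://{0}/{1} ({2})'.format(
        params.get('bucketName'), params.get('key'), detail.get('eventName')))
    print(json.dumps(detail))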

Next, I create a CloudWatch Events rule that matches the event name of interest (PutObject) and invokes my Lambda function (S3Watcher):

Now I upload some files to my bucket and check to see that my Lambda function has been invoked as expected:

I can also find the CloudWatch entry that contains the output from my Lambda function:

Pricing and Availability
Data events are recorded only for the S3 buckets that you specify, and are charged at the rate of $0.10 per 100,000 events. This feature is available in all commercial AWS Regions.

Jeff;

Human Longevity, Inc. – Changing Medicine Through Genomics Research

by Jeff Barr | in Amazon EC2, Amazon EMR, Amazon S3, AWS OpsWorks, Customer Success

Human Longevity, Inc. (HLI) is at the forefront of genomics research and wants to build the world’s largest database of human genomes along with related phenotype and clinical data, all in support of preventive healthcare. In today’s guest post, Yaron Turpaz,  Bryan Coon, and Ashley Van Zeeland talk about how they are using AWS to store the massive amount of data that is being generated as part of this effort to revolutionize medicine.

Jeff;


When Human Longevity, Inc. launched in 2013, our founders recognized the challenges ahead. A genome contains all the information needed to build and maintain an organism; in humans, a copy of the entire genome, which contains more than three billion DNA base pairs, is contained in all cells that have a nucleus. Our goal is to sequence one million genomes and deliver that information—along with integrated health records and disease-risk models—to researchers and physicians. They, in turn, can interpret the data to provide targeted, personalized health plans and identify the optimal treatment for cancer and other serious health risks far earlier than has been possible in the past. The intent is to transform medicine by fostering preventive healthcare and risk prevention in place of the traditional “sick care” model, when people wind up seeing their doctors only after symptoms manifest.

Our work in developing and applying large-scale computing and machine learning to genomics research entails the collection, analysis, and storage of immense amounts of data from DNA-sequencing technology provided by companies like Illumina. Raw data from a single genome consumes about 100 gigabytes; that number increases as we align the genomic information with annotation and phenotype sources and analyze it for health insights.

From the beginning, we knew our choice of compute and storage technology would have a direct impact on the success of the company. Using the cloud was clearly the best option. We’re experts in genomics, and don’t want to spend resources building and maintaining an IT infrastructure. We chose to go all in on AWS for the breadth of the platform, the critical scalability we need, and the expertise AWS has developed in big data. We also saw that the pace of innovation at AWS—and its deliberate strategy of keeping costs as low as possible for customers—would be critical in enabling our vision.

Leveraging the Range of AWS Services

Spectral karyotype analysis / Image courtesy of Human Longevity, Inc.

Today, we’re using a broad range of AWS services for all kinds of compute and storage tasks. For example, the HLI Knowledgebase leverages a distributed system infrastructure comprised of Amazon S3 storage and a large number of Amazon EC2 nodes. This helps us achieve resource isolation, scalability, speed of provisioning, and near real-time response time for our petabyte-scale database queries and dynamic cohort builder. The flexibility of AWS services makes it possible for our customized Amazon Machine Images and pre-built, BTRFS-partitioned Amazon EBS volumes to achieve turn-up time in seconds instead of minutes. We use Amazon EMR for executing Spark queries against our data lake at the scale we need. AWS Lambda is a fantastic tool for hooking into Amazon S3 events and communicating with apps, allowing us to simply drop in code with the business logic already taken care of. We use Auto Scaling based on demand, and AWS OpsWorks for managing a Docker pipeline.

We also leverage the cost controls provided by Amazon EC2 Spot and Reserved Instance types. When we first started, we used on-demand instances, but the costs started to grow significantly. With Spot and Reserved Instances, we can allocate compute resources based on specific needs and workflows. The flexibility of AWS services enables us to make extensive use of dockerized containers through the resource-management services provided by Apache Mesos. Hundreds of dynamic Amazon EC2 nodes in both our persistent and spot abstraction layers are dynamically adjusted to scale up or down based on usage demand and the latest AWS pricing information. We achieve substantial savings by sharing this dynamically scaled compute cluster with our Knowledgebase service and the internal genomic and oncology computation pipelines. This flexibility gives us the compute power we need while keeping costs down. We estimate these choices have helped us reduce our compute costs by up to 50 percent from the on-demand model.

We’ve also worked with AWS Professional Services to address a particularly hard data-storage challenge. We have genomics data in hundreds of Amazon S3 buckets, many of them in the petabyte range and containing billions of objects. Within these collections are millions of objects that are unused, or used once or twice and never to be used again. It can be overwhelming to sift through these billions of objects in search of one in particular. It presents an additional challenge when trying to identify what files or file types are candidates for the Amazon S3-Infrequent Access storage class. Professional Services helped us with a solution for indexing Amazon S3 objects that saves us time and money.

Moving Faster at Lower Cost
Our decision to use AWS came at the right time, occurring at the inflection point of two significant technologies: gene sequencing and cloud computing. Not long ago, it took a full year and cost about $100 million to sequence a single genome. Today we can sequence a genome in about three days for a few thousand dollars. This dramatic improvement in speed and lower cost, along with rapidly advancing visualization and analytics tools, allows us to collect and analyze vast amounts of data in close to real time. Users can take that data and test a hypothesis on a disease in a matter of days or hours, compared to months or years. That ultimately benefits patients.

Our business includes HLI Health Nucleus, a genomics-powered clinical research program that uses whole-genome sequence analysis, advanced clinical imaging, machine learning, and curated personal health information to deliver the most complete picture of individual health. We believe this will dramatically enhance the practice of medicine as physicians identify, treat, and prevent diseases, allowing their patients to live longer, healthier lives.

— Yaron Turpaz (Chief Information Officer), Bryan Coon (Head of Enterprise Services), and Ashley Van Zeeland (Chief Technology Officer).

Learn More
Learn more about how AWS supports genomics in the cloud, and see how genomics innovator Illumina uses AWS for accelerated, cost-effective gene sequencing.

 

Amazon QuickSight Now Generally Available – Fast & Easy to Use Business Analytics for Big Data

by Jeff Barr | in Amazon QuickSight, Amazon RDS, Amazon Redshift, Amazon S3, Launch

After a preview period that included participants from over 1,500 AWS customers ranging from startups to global enterprises, I am happy to be able to announce that Amazon QuickSight is now generally available! When I invited you to join the preview last year, I wrote:

In the past, Business Intelligence required an incredible amount of undifferentiated heavy lifting. You had to pay for, set up and run the infrastructure and the software, manage scale (while users fret), and hire consultants at exorbitant rates to model your data. After all that your users were left to struggle with complex user interfaces for data exploration while simultaneously demanding support for their mobile devices. Access to NoSQL and streaming data? Good luck with that!

Amazon QuickSight provides you with very fast, easy to use, cloud-powered business analytics at 1/10th the cost of traditional on-premises solutions. QuickSight lets you get started in minutes. You log in, point to a data source, and begin to visualize your data. Behind the scenes, the SPICE (Super-fast, Parallel, In-Memory Calculation Engine) will run your queries at lightning speed and provide you with highly polished data visualizations.

Deep Dive into Data
Every customer that I speak with wants to get more value from their stored data. They realize that the potential value locked up within the data is growing by the day, but are sometimes disappointed to learn that finding and unlocking that value can be expensive and difficult. On-premises business analytics tools are expensive to license and can place a heavy load on existing infrastructure. Licensing costs and the complexity of the tools can restrict the user base to just a handful of specialists.  Taken together, all of these factors have led many organizations to conclude that they are not ready to make the investment in a true business analytics function.

QuickSight is here to change that! It runs as a service and makes business analytics available to organizations of all shapes and sizes. It is fast and easy to use, does not impose a load on your existing infrastructure, and is available for a monthly fee that starts at just $9 per user.

As you’ll see in a moment, QuickSight allows you to work on data that’s stored in many different services and locations. You can get to your Amazon Redshift data warehouse, your Amazon Relational Database Service (RDS) relational databases, or your flat files in S3. You can also use a set of connectors to access data stored in on-premises MySQL, PostgreSQL, and SQL Server databases, Microsoft Excel spreadsheets, Salesforce and other services.

QuickSight is designed to scale with you. You can add more users, more data sources, and more data without having to purchase more long-term licenses or roll more hardware into your data center.

Take the Tour
Let’s take a tour through QuickSight. The administrator for my organization has already invited me to use QuickSight, so I am ready to log in and get started. Here’s the main screen:

I’d like to start by getting some data from a Redshift cluster. I click on Manage data and review my existing data sets:

I don’t see what I am looking for, so I click on New data set and review my options:

I click on Redshift (manual connect) and enter the credentials so that I can access my data warehouse (if I had a Redshift cluster running within my AWS account it would be available as an auto-discovered source):

QuickSight queries the data warehouse and shows me the schemas (sets of tables) and the tables that are available to me. I’ll select the public schema and the all_flights table to get started:

Now I have two options. I can pull the table in to SPICE for quick analysis or I can query it directly. I’ll pull it in to SPICE:

Again, I have two options! I can click on Edit/Preview data and select the rows and columns to import, or I can click on Visualize to import all of the data and proceed to the fun part! I’ll go for Edit/Preview. I can see the fields (on the left), and I can select only those that are of interest using the checkboxes:

I can also click on New Filter, select a field from the popup menu, and then create a filter:

Both options (selecting fields and filtering on rows) allow me to control the data that I pull in to SPICE. This allows me to control the data that I want to visualize and also helps me to make more efficient use of memory. Once I am ready to proceed, I click on Prepare data & visualize. At this point the data is loaded in to SPICE and I’m ready to start visualizing it. I simply select a field to get started.  For example,  I can select the origin_state_abbr field and see how many flights originate in each state:

The miniaturized view on the right gives me some additional context. I can scroll up or down or select the range of values to display.  I can also click on a second field to learn more. I’ll click on flights, set the sort order to descending, and scroll to the top. Now I can see how many of the flights in my data originated in each state:

QuickSight’s AutoGraph feature automatically generates an appropriate visualization based on the data selected. For example, if I add the fl_date field, I get a state-by-state line chart over time:

Based on my query, the data types, and properties of the data, QuickSight also proposes alternate visualizations:

I also have my choice of many other visual types including vertical & horizontal bar charts, line charts, pivot tables, tree maps, pie charts, and heat maps:

Once I have created some effective visualizations, I can capture them and use the resulting storyboard to tell a data-driven story:

I can also share my visualizations with my colleagues:

Finally, my visualizations are accessible from my mobile device:

Pricing & SPICE Capacity
QuickSight comes with one free user and 1 GB of SPICE capacity for free, perpetually. This allows every AWS user to analyze their data and to gain business insights at no cost. The Standard Edition of Amazon QuickSight starts at $9 per month and includes 10 GB of SPICE capacity (see the QuickSight Pricing page for more info).

It is easy to manage SPICE capacity. I simply click on Manage QuickSight in the menu (I must have the ADMIN role in order to be able to make changes):

Then I can see where I stand:

I can click on Purchase more capacity to do exactly that:

I can also click on Release unused purchased capacity in order to reduce the amount of SPICE capacity that I own:

Get Started Today
Amazon QuickSight is now available in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) regions and you can start using it today.

Despite the length of this blog post I have barely scratched the surface of QuickSight. Given that you can use it at no charge, I would encourage you to sign up, load some of your data, and take QuickSight for a spin!

We have a webinar coming up on January 16th where you can learn even more! Sign up here.

Jeff;

 

Genome Engineering Applications: Early Adopters of the Cloud

by Jeff Barr | in Amazon DynamoDB, Amazon S3, AWS Lambda, Customer Success

Our friends at the Commonwealth Scientific and Industrial Research Organization (CSIRO) in Australia sent along the guest post below to tell us about how AWS powers an important new genome editing technique.

— Jeff


 

Recent developments in molecular engineering technology now enable the accurate editing of genomes. The new technology, called CRISPR-Cas9, can be programmed to recognize and edit specific locations in the genome by pattern-matching unique sequences of DNA. While this is a powerful new tool for researchers, the ability to scan and identify targets across the entire genome has created unprecedented demand for large-scale computation. Earlier this year, the US National Institutes of Health (NIH) approved the use of these technologies for human health. This has the potential to revolutionize cancer treatments and also adds a new time-critical dimension to the compute requirements.

A New Approach to Cancer Treatments
Approximately two in five people will be diagnosed with cancer at some point during their lifetime, and while overall cancer survival has doubled, there are still cancer types with very low survival rates, for example just 1% for pancreatic cancer. This is mainly due to the difficulty of finding therapeutic interventions that kill cancer cells without harming the healthy tissue in the body.

The new NIH-approved trial will leverage breakthroughs in the genome editing technology, CRISPR-Cas9, to develop a different treatment approach. In this approach, the patient’s own immune system is boosted through specific modifications of the cells that natively fight cancer. This has the potential to be effective for a wide range of different tumors, with the current trial including patients with specific blood and solid cancers, as well as melanoma.

Cloud Services for Computationally Guided Genome Engineering
This new application in human health requires an increase in the robustness and efficiency of CRISPR-Cas9 design in order to meet the time constraints of clinical care. To address this issue, researchers in the eHealth program of the Commonwealth Scientific and Industrial Research Organization (CSIRO) in Australia developed GT-Scan2, a novel software tool built on AWS cloud services.

“Compared to other available methods, GT-Scan2 identifies genomic location with higher sensitivity and specificity,” says Dr. Denis Bauer who is leading the transformational bioinformatics team.

GT-Scan2 shows the identified CRISPR target sites at the genomic position and annotates them with high or low activity as well as their off-target potential.

GT-Scan2 improves the effectiveness of the system by finding sites that are unique in the genome. This avoids diluting the effect through “off-targets”, which are other sites in the genome with high sequence similarity. It also optimizes robustness by finding sites that are easier to modify.

“While it was known that the three-dimensional genome organization plays a role in CRISPR binding, GT-Scan2 is the first tool to also leverage other components that are crucial for Cas9 activity,” says Dr. Laurence Wilson whose research focuses on computational genome engineering.

Specifically, the off-target search is a compute-intensive task traditionally reserved for researchers at large institutes with high-performance computing infrastructure, as every location in the 3-billion-letter genomic sequence needs to be investigated. GT-Scan2 democratizes the ability to find optimal sites by offering this complex computation as a cloud service using AWS Lambda functions.

Scaling Instantaneously for Personalized Treatments
GT-Scan2 leverages the instantaneous scalability that the event-driven AWS Lambda service offers. This is crucial for personalized treatment, as complexity of the targeted gene can vary dramatically.

“The off-target search as well as the robustness analysis can be subdivided into independent, modular tasks that can run in parallel,” says Aidan O’Brien, who designed and implemented the system within weeks after the service’s official Asia-Pacific launch at the AWS Summit 2016 in April this year, attesting to the intuitive nature of the service. A typical job takes less than a minute, and the variation between jobs ranges from 1 second to 5 minutes. This fast fluctuation in load over minutes rather than hours ruled out an EC2-based solution, as new instances would come online too slowly to keep the runtime stable.

GT-Scan2 is served directly from S3 making it a static web app without server-side processing. It retrieves the dynamic content (such as job results and parameters) via API calls using API Gateway from a database (DynamoDB) using a JavaScript framework.

When a user submits a job, GT-Scan2 inserts the job parameters as an item into a DynamoDB table via an API call. This allows the solution to be freely scalable without creating a bottleneck. The database entry triggers the first Lambda function, which finds all putative CRISPR targets in the user-specified DNA sequence (fetched automatically upon user submission). Potential CRISPR target sites follow fixed rules and can be found in seconds using a regular expression; the resulting candidates are inserted into a second DynamoDB table.
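As a purely illustrative sketch (none of the names below come from the actual GT-Scan2 code base), the first Lambda function might look something like this: it reacts to new items on the job table’s DynamoDB stream, scans the submitted sequence for 20-nucleotide protospacers followed by an NGG PAM, and writes each candidate to a results table.

import re
import boto3

dynamodb = boto3.resource('dynamodb')
results_table = dynamodb.Table('gtscan2-targets')  # hypothetical results table

# SpCas9-style targets: a 20-nt protospacer immediately followed by an NGG PAM.
TARGET_RE = re.compile(r'(?=([ACGT]{20}[ACGT]GG))')

def lambda_handler(event, context):
    # Triggered by the DynamoDB stream of the job table; each INSERT carries the job parameters.
    for record in event['Records']:
        if record['eventName'] != 'INSERT':
            continue
        image = record['dynamodb']['NewImage']
        job_id = image['job_id']['S']
        sequence = image['sequence']['S']
        for match in TARGET_RE.finditer(sequence):
            results_table.put_item(Item={
                'job_id': job_id,
                'position': match.start(),
                'site': match.group(1),
            })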

Adapting to leverage the power of Lambda-based microservices

All potential targets need to be evaluated for their off-target risk using the efficient string matching tool, Bowtie. Though Bowtie only requires a reduced representation of the 3 billion letter genomic sequence, the sizes of these index files exceed the storage limitation for each Lambda instance. “GT-Scan2 divides the genome into smaller blocks to fit the Lambda specifications” explains Adrian White (Research & Technical Computing, APAC) who supported the CSIRO team during development. For an average run, GT-Scan2 hence triggers 500-1000 individual Lambda functions, which simultaneously update the scores for the different putative targets in DynamoDB. During this process, the frontend is polling this table via API Gateway and updating the webpage as results come in, eliminating the need for server-side compute.

“AWS’s Lambda has given us a great framework to develop a future-ready software package able to support medical genome engineering applications,” says Dr. Bauer. “We are specifically impressed with the ability to instantaneously scale at run time by spawning more Lambda functions to cope with the varying complexity of the different genes.” Other benefits Dr. Bauer quotes include only paying for storage during periods of no use and jobs not competing with web server resources as the website is a static page with dynamic content updated through Angular 2 and the API Gateway, as well as not needing to maintain compute instances (security patches of OS).

“One of the best things about Lambda is that users will be able to easily swap-in different machine learning algorithms that are better suited for specific CRISPR applications” says Dr. Wilson.

The GT-Scan2 Team, from left, Denis Bauer, Laurence Wilson, Aidan O’Brien

“The computational genome engineering community is one of the early adopters of our AWS Lambda technology,” explains Dr. Mia Champion (Technical Business Development Manager, Scientific Computing). “GT-Scan2’s use of API Gateway and DynamoDB is a very neat solution to ensure scalability and their clever use of epigenomics really sets them apart from other recent applications using lambda to perform CRISPR searches. I am looking forward to seeing GT-Scan2 adopted in medical applications.”

Amazon Aurora Update – Call Lambda Functions From Stored Procedures; Load Data From S3

by Jeff Barr | in Amazon Aurora, Amazon S3, AWS Lambda

Many AWS services work just fine by themselves, but even better together! This important aspect of our model allows you to select a single service, learn about it, get some experience with it, and then extend your span to other related services over time. On the other hand, opportunities to make the services work together are ever-present, and we have a number of them on our customer-driven roadmap.

Today I would like to tell you about two new features for Amazon Aurora, our MySQL-compatible relational database:

Lambda Function Invocation – The stored procedures that you create within your Amazon Aurora databases can now invoke AWS Lambda functions.

Load Data From S3 – You can now import data stored in an Amazon Simple Storage Service (S3) bucket into a table in an Amazon Aurora database.

Because both of these features involve Amazon Aurora and another AWS service, you must grant Amazon Aurora permission to access the service by creating an IAM Policy and an IAM Role, and then attaching the Role to your Amazon Aurora database cluster. To learn how to do this, see Authorizing Amazon Aurora to Access Other AWS Services On Your Behalf.

Lambda Function Integration
Relational databases use a combination of triggers and stored procedures to enable the implementation of higher-level functionality. The triggers are activated before or after some operations of interest are performed on a particular database table. For example, because Amazon Aurora is compatible with MySQL, it supports triggers on the INSERT, UPDATE, and DELETE operations. Stored procedures are scripts that can be run in response to the activation of a trigger.

You can now write stored procedures that invoke Lambda functions. This new extensibility mechanism allows you to wire your Aurora-based database to other AWS services. You can send email using Amazon Simple Email Service (SES), issue a notification using Amazon Simple Notification Service (SNS), publish metrics to Amazon CloudWatch, update an Amazon DynamoDB table, and more.

At the application level, you can implement complex ETL jobs and workflows, track and audit actions on database tables, and perform advanced performance monitoring and analysis.

Your stored procedure must call the mysql.lambda_async procedure. This procedure, as the name implies, invokes your desired Lambda function asynchronously, and does not wait for it to complete before proceeding. As usual, you will need to give your Lambda function permission to access any desired AWS services or resources.
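As a sketch (the procedure, function ARN, and connection details below are all made up), a stored procedure that forwards new orders to a Lambda function could be created like this:

import pymysql

conn = pymysql.connect(host='my-aurora.cluster-abc123.us-east-1.rds.amazonaws.com',
                       user='admin', password='...', database='demo')

# Define a procedure that asynchronously invokes a Lambda function with a JSON payload.
ddl = """
CREATE PROCEDURE notify_order(IN order_id INT)
BEGIN
  CALL mysql.lambda_async(
    'arn:aws:lambda:us-east-1:123456789012:function:order-notifier',
    CONCAT('{"order_id": ', order_id, '}'));
END
"""

with conn.cursor() as cur:
    cur.execute(ddl)
conn.commit()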

To learn more, read Invoking a Lambda Function from an Amazon Aurora DB Cluster.

Load Data From S3
As another form of integration, data stored in an S3 bucket can now be imported directly into Aurora (up until now you would have had to copy the data to an EC2 instance and import it from there).

The data can be located in any AWS region that is accessible from your Amazon Aurora cluster and can be in text or XML form.

To import data in text form, use the new LOAD DATA FROM S3 command. This command accepts many of the same options as MySQL’s LOAD DATA INFILE, but does not support compressed data. You can specify the line and field delimiters and the character set, and you can ignore any desired number of lines or rows at the start of the data.
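For example (a sketch; the bucket, file, and table are hypothetical, and the cluster must already have the IAM role described above attached):

import pymysql

conn = pymysql.connect(host='my-aurora.cluster-abc123.us-east-1.rds.amazonaws.com',
                       user='admin', password='...', database='demo')

with conn.cursor() as cur:
    # Comma-separated fields, default line terminator, skip the CSV header line.
    cur.execute("""
        LOAD DATA FROM S3 's3://my-aurora-import-bucket/flights-2016.csv'
        INTO TABLE flights
        FIELDS TERMINATED BY ','
        IGNORE 1 LINES
    """)
conn.commit()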

To import data in XML form, use the new LOAD XML FROM S3 command. Your XML can look like this:

<row column1="value1" column2="value2" />
...
<row column1="value1" column2="value2" />

Or like this:

<row>
  <column1>value1</column1>
  <column2>value2</column2>
</row>
...

Or like this:

<row>
  <field name="column1">value1</field>
  <field name="column2">value2</field>
</row>
...

To learn more, read Loading Data Into a DB Cluster From Text Files in an Amazon S3 Bucket.

Available Now
These new features are available now and you can start using them today!

There is no charge for either feature; you’ll pay the usual charges for the use of Amazon Aurora, Lambda, and S3.

Jeff;

 

IPv6 Support Update – CloudFront, WAF, and S3 Transfer Acceleration

by Jeff Barr | in Amazon CloudFront, Amazon S3, AWS Web Application Firewall

As a follow-up to our recent announcement of IPv6 support for Amazon S3, I am happy to be able to tell you that IPv6 support is now available for Amazon CloudFront, Amazon S3 Transfer Acceleration, and AWS WAF and that all 60+ CloudFront edge locations now support IPv6. We are enabling IPv6 in a phased rollout that starts today and will extend across all of the networks over the next few weeks.

CloudFront IPv6 Support
You can now enable IPv6 support for individual Amazon CloudFront distributions. Viewers and networks that connect to a CloudFront edge location over IPv6 will automatically be served content over IPv6. Those that connect over IPv4 will continue to work as before. Connections to your origin servers will be made using IPv4.

Newly created distributions are automatically enabled for IPv6; you can modify an existing distribution by checking Enable IPv6 in the console or setting it via the CloudFront API:

Here are a couple of important things to know about this new feature:

  • Alias Records – After you enable IPv6  support for a distribution, the DNS entry for the distribution will be updated to include an AAAA record. If you are using Amazon Route 53 and an alias record to map all or part of your domain to the distribution, you will need to add an AAAA alias to the domain.
  • Log Files – If you have enabled CloudFront Access Logs, IPv6 addresses will start to show up in the c-ip field; make sure that your log processing system knows what to do with them.
  • Trusted Signers – If you make use of Trusted Signers in conjunction with an IP address whitelist, we strongly recommend the use of an IPv4-only distribution for Trusted Signer URLs that have an IP whitelist and a separate, IPv4/IPv6 distribution for the actual content. This model sidesteps an issue that would arise if the signing request arrived over an IPv4 address and was signed as such, only to have the request for the content arrive via a different, IPv6 address that is not on the whitelist.
  • CloudFormation – CloudFormation support is in the works. With today’s launch, distributions that are created from a CloudFormation template will not be enabled for IPv6. If you update an existing stack, the setting will remain as-is for any distributions referenced in the stack.
  • AWS WAF – If you use AWS WAF in conjunction with CloudFront, be sure to update your WebACLs and your IP rulesets as appropriate in order to whitelist or blacklist IPv6 addresses.
  • Forwarded Headers – When you enable IPv6 for a distribution, the X-Forwarded-For header that is presented to the origin will contain an IPv6 address. You need to make sure that the origin is able to process headers of this form.

To learn more, read IPv6 Support for Amazon CloudFront.

AWS WAF IPv6 Support
AWS WAF helps you to protect your applications from application-layer attacks (read New – AWS WAF to learn more).

AWS WAF can now inspect requests that arrive via IPv4 or IPv6 addresses. You can create web ACLs that match IPv6 addresses, as described in Working with IP Match Conditions:

All existing WAF features will work with IPv6 and there will be no visible change in performance. IPv6 addresses will appear in the Sampled Requests collected and displayed by WAF:

S3 Transfer Acceleration IPv6 Support
This important new S3 feature (read AWS Storage Update – Amazon S3 Transfer Acceleration + Larger Snowballs in More Regions for more info) now has IPv6 support. To use it, switch to the new dual-stack endpoint for your uploads. Simply change:

https://BUCKET.s3-accelerate.amazonaws.com

to

https://BUCKET.s3-accelerate.dualstack.amazonaws.com

Here’s some code that uses the AWS SDK for Java to create a client object and enable dual-stack transfer:

AmazonS3Client s3 = new AmazonS3Client();
s3.setS3ClientOptions(S3ClientOptions.builder().enableDualstack().setAccelerateModeEnabled(true).build());
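And here is a rough Python equivalent using boto3 (the bucket and file names are placeholders):

import boto3
from botocore.config import Config

# Use the accelerate endpoint and prefer the dual-stack (IPv6-capable) hostname.
s3 = boto3.client('s3', config=Config(s3={
    'use_accelerate_endpoint': True,
    'use_dualstack_endpoint': True,
}))

s3.upload_file('photo.jpg', 'my-bucket', 'photos/photo.jpg')  # placeholder file and bucket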

Most applications and network stacks will prefer IPv6 automatically, so no further configuration should be required. You should, however, review the IAM policies for your buckets to make sure that they will work as expected with IPv6 addresses (for example, policies that allow or deny access based on source IP address ranges).

To learn more, read about Making Requests to Amazon S3 over IPv6.

Don’t Forget to Test
As a reminder, if IPv6 connectivity to any AWS Region is limited or non-existent, IPv4 will be used instead. Also, as I noted in my earlier post, a client system can be configured to support IPv6 yet be connected to a network that is not configured to route IPv6 packets to the Internet. Therefore, we recommend some application-level testing of end-to-end connectivity before you switch to IPv6.

Jeff;


Streaming Real-time Data into an S3 Data Lake at MeetMe

by Jeff Barr | on | in Amazon Kinesis, Amazon S3, Guest Post |

In today’s guest post, Anton Slutsky of MeetMe describes the implementation process for their Data Lake.

Jeff;


Anton Slutsky is an experienced information technologist with nearly two decades of experience in the field. He has an MS in Computer Science from Villanova University and a PhD in Information Science from Drexel University.

Modern Big Data systems often include structures called Data Lakes. In the industry vernacular, a Data Lake is a massive storage and processing subsystem capable of absorbing large volumes of structured and unstructured data and processing a multitude of concurrent analysis jobs. Amazon Simple Storage Service (S3) is a popular choice nowadays for Data Lake infrastructure as it provides a highly scalable, reliable, and low-latency storage solution with little operational overhead. However, while S3 solves a number of problems associated with setting up, configuring and maintaining petabyte-scale storage, data ingestion into S3 is often a challenge as types, volumes, and velocities of source data differ greatly from one organization to another.

In this post, I will discuss our solution, which uses Amazon Kinesis Firehose to optimize and streamline large-scale data ingestion at MeetMe, a popular social discovery platform that caters to more than a million daily active users. The Data Science team at MeetMe needed to collect and store approximately 0.5 TB per day of various types of data in a way that would expose it to data mining tasks, business-facing reporting, and advanced analytics. The team selected Amazon S3 as the target storage facility and faced the challenge of collecting large volumes of live data in a robust, reliable, scalable, and operationally affordable way.

The overall aim of the effort was to set up a process to push large volumes of streaming data into the AWS data infrastructure with as little operational overhead as possible. While many data ingestion tools, such as Flume and Sqoop, are currently available, we chose Amazon Kinesis Firehose because of its automatic scalability and elasticity, ease of configuration and maintenance, and out-of-the-box integration with other Amazon services, including S3, Amazon Redshift, and Amazon Elasticsearch Service.

Business Value / Justification
As is common for many successful startups, MeetMe focuses on delivering the most business value at the lowest possible cost. With that in mind, the Data Lake effort had the following goals:

  • Empowering business users with high-level business intelligence for effective decision making.
  • Enabling the Data Science team with the data needed for revenue-generating insight discovery.

When considering commonly used data ingestion tools such as Sqoop and Flume, we estimated that the Data Science team would need to hire an additional full-time Big Data engineer in order to set up, configure, tune, and maintain the data ingestion process, with additional engineering time required to provide support redundancy. Such operational overhead would increase the cost of the Data Science efforts at MeetMe and would introduce unnecessary scope to the team, affecting overall velocity.

The Amazon Kinesis Firehose service alleviated many of these operational concerns and, therefore, reduced costs. While we still needed to develop some amount of in-house integration, the scaling, maintenance, upgrading, and troubleshooting of the data consumers is handled by Amazon, significantly reducing the required Data Science team size and scope.

Configuring an Amazon Kinesis Firehose Stream
Kinesis Firehose offers the ability to create multiple Firehose streams, each of which can be aimed separately at a different S3 location, Redshift table, or Amazon Elasticsearch Service index. In our case, our primary goal was to store data in S3, with an eye towards the other services mentioned above in the future.

Firehose delivery stream setup is a 3-step process. In Step 1, it is necessary to choose the destination type, which lets you define whether you want your data to end up in an S3 bucket, a Redshift table or an Elasticsearch index. Since we wanted the data in S3, we chose “Amazon S3” as the destination option. If S3 is chosen as the destination, Firehose prompts for other S3 options, such as the S3 bucket name. As described in the Firehose documentation, Firehose will automatically organize the data by date/time and the “S3 prefix” setting serves as the global prefix that will be prepended to all S3 keys for a given Firehose stream object. It is possible to change the prefix at a later date even on a live stream that is in the process of consuming data, so there is little need to overthink the naming convention early on.

The next step of the process is the stream configuration. Here, it is possible to override various defaults and specify other meaningful values. For example, selecting GZIP compression instead of the uncompressed default will greatly reduce the S3 storage footprint and, consequently, the S3 costs. Enabling Data encryption will encrypt the data at rest, which is important for sensitive data.

One important note is that the choice of the compression algorithm affects the resulting filenames (S3 keys) for the stream objects. Therefore, while it is possible to change these settings later on a live stream, it may be prudent to settle on the compression/encryption approach early to avoid possible issues with processing scripts.
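
The console steps above can also be expressed in code. The following is a rough sketch using the AWS SDK for Java that creates a GZIP-compressed S3 delivery stream; the bucket ARN, role ARN, prefix, and buffering values are placeholders, while the stream name matches the one used in the consumer code later in this post:

import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClient;
import com.amazonaws.services.kinesisfirehose.model.BufferingHints;
import com.amazonaws.services.kinesisfirehose.model.CompressionFormat;
import com.amazonaws.services.kinesisfirehose.model.CreateDeliveryStreamRequest;
import com.amazonaws.services.kinesisfirehose.model.S3DestinationConfiguration;

AmazonKinesisFirehoseClient firehose = new AmazonKinesisFirehoseClient();

// Bucket ARN, role ARN, prefix, and buffering hints below are placeholders
firehose.createDeliveryStream(new CreateDeliveryStreamRequest()
    .withDeliveryStreamName("MEETME_STREAM")
    .withS3DestinationConfiguration(new S3DestinationConfiguration()
        .withBucketARN("arn:aws:s3:::my-data-lake-bucket")
        .withRoleARN("arn:aws:iam::123456789012:role/firehose-delivery-role")
        .withPrefix("rabbitmq/")
        .withCompressionFormat(CompressionFormat.GZIP)
        .withBufferingHints(new BufferingHints()
            .withSizeInMBs(64)
            .withIntervalInSeconds(300))));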

As mentioned in Amazon Kinesis Firehose Limits, Kinesis Firehose has a set of default throughput quotas. Once those quotas are exceeded, Firehose will respond with the error message “ServiceUnavailableException: Slow down.” and will drop data. Therefore, in order to avoid data loss, it is important to estimate individual throughput requirements. If those requirements are likely to exceed the default quotas, it is possible to request additional throughput by submitting a limit increase request as described in the limits documentation.

The final step (not shown) is to review the desired configuration and create the stream.

Setting up the Upload Process
At MeetMe, RabbitMQ serves as the service bus for most of the data that flows through the system. Therefore, the task of data collection mostly amounts to consuming large volumes of RabbitMQ messages and uploading them to S3 with the help of Firehose streams. To accomplish this, we developed lightweight RabbitMQ consumers. While RabbitMQ consumers exist elsewhere (for example, in Flume), we opted to develop our own in order to integrate with the Firehose API.

Firehose provides two ways to upload data – single record and bulk. With the single-record approach, each individual record is packaged into an Amazon Firehose API framework object and each object is sent to a Firehose endpoint via HTTP/REST. While this approach may be appropriate for some applications, we achieved better performance by using the bulk API methods. The bulk methods allow up to 500 records to be uploaded to Firehose with a single request.

To upload a batch of messages, the lightweight RabbitMQ consumer maintains a small internal buffer, which gets serialized to Firehose as often as possible by a predefined set of processor threads. Here’s the code:

new Thread(new Runnable()
{
  public void run()
  {
    logger.info("Kinesis writer thread started.  Waiting for records to process...");
    while(true)
    {
      try
      {
        if(!recordsQueue.isEmpty())
        {
          if(logger.isDebugEnabled())
            logger.debug("Uploading current batch to AWS: "+recordsQueue.size());

          // Swap in a fresh queue so producers can keep appending while we upload
          List<MMMessage> records = recordsQueue;
          recordsQueue = new CopyOnWriteArrayList<MMMessage>();

          // The Firehose PutRecordBatch API accepts at most 500 records per call
          final int uploadThreshold = 499;

          List<Record> buffer = new ArrayList<Record>(uploadThreshold);

          for(int i = 0; i < records.size(); i++)
          {
            // Get a proprietary message object from the internal queue
            MMMessage mmmessage = records.get(i);

            // Decode the message body as a UTF-8 string
            String message = new String(mmmessage.body, "UTF-8");

            // Escape newline and tab characters in the data to avoid issues
            // with line-based Hadoop/Spark processing later on
            message = CharMatcher.anyOf("\n").replaceFrom(message, "\\n");
            message = CharMatcher.anyOf("\t").replaceFrom(message, "\\t");

            // Wrap the message bytes with the Amazon Firehose API Record wrapper
            Record record = new Record().withData(ByteBuffer.wrap(message.getBytes("UTF-8")));

            buffer.add(record);

            // If the current buffer is large enough,
            if(buffer.size() == uploadThreshold)
            {
              // send it to Firehose
              uploadBuffer(buffer);
              // and start a new buffer
              buffer = new ArrayList<Record>(uploadThreshold);
            }
          }

          // Don't forget to upload the last, partially filled buffer
          if(!buffer.isEmpty())
            uploadBuffer(buffer);
        }
      }
      catch(Exception e)
      {
        logger.error("Error in sending to Kinesis:"+e.getMessage(), e);
      }
    }
  }
}).start();

The uploadBuffer method is a simple wrapper over the bulk upload Firehose API:

private void uploadBuffer(final List<Record> buffer)
{
  // Make a new request object
  PutRecordBatchRequest request = new PutRecordBatchRequest();
  // Specify the stream name
  request.setDeliveryStreamName("MEETME_STREAM");
        
  // Set the data buffer
  request.setRecords(buffer);
 
  // Attempt to send to Firehose
  PutRecordBatchResult result = getAmazonClient().putRecordBatch(request);
        
  // Always check for failures!
  Integer failed = result.getFailedPutCount();
  if (failed != null && failed.intValue() > 0)
  {
    // If there are failures, find out what caused them
    logger.warn("AWS upload of [" + buffer.size() + "] resulted in " + failed + " failures");
                 
    // Dig into the responses to see if there are various kinds of failures
    List<PutRecordBatchResponseEntry> res = result.getRequestResponses();
    if (res != null)
    {
      for (PutRecordBatchResponseEntry r : res)
      {
        if (r != null)
        {
          logger.warn("Report " + r.getRecordId() + ", " + r.getErrorCode() + ", " + r.getErrorMessage()
                      + ", " + r.toString());
        }
        else
        {
          logger.warn("Report NULL");
        }
      }
    }
    else
    {
      logger.warn("BatchReport NULL");
    }
  }
}
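
Because Firehose rejects individual records when throughput quotas are exceeded (see the “Slow down.” note above), it can be useful to retry just the failed entries rather than the whole batch. The following is a hypothetical sketch, not our production code, that relies on getRequestResponses() returning entries in the same order as the submitted records:

// Hypothetical retry helper: resend only the records that Firehose reported
// as failed, with a simple linear backoff between attempts.
private void uploadWithRetries(List<Record> buffer, int maxAttempts) throws InterruptedException
{
  List<Record> remaining = buffer;
  for (int attempt = 1; attempt <= maxAttempts && !remaining.isEmpty(); attempt++)
  {
    PutRecordBatchResult result = getAmazonClient().putRecordBatch(
        new PutRecordBatchRequest()
            .withDeliveryStreamName("MEETME_STREAM")
            .withRecords(remaining));

    if (result.getFailedPutCount() == null || result.getFailedPutCount() == 0)
      return;

    // Collect the records whose response entries carry an error code
    List<Record> failed = new ArrayList<Record>();
    List<PutRecordBatchResponseEntry> responses = result.getRequestResponses();
    for (int i = 0; i < responses.size(); i++)
    {
      if (responses.get(i).getErrorCode() != null)
        failed.add(remaining.get(i));
    }
    remaining = failed;

    // Back off before retrying (for example, when throttled with "Slow down.")
    Thread.sleep(1000L * attempt);
  }

  if (!remaining.isEmpty())
    logger.warn("Gave up on " + remaining.size() + " records after " + maxAttempts + " attempts");
}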

Monitoring Firehose Streams
Once the Firehose streams are set up and the internal consumer processes begin to send data, a common task is to monitor the data flow. Reasons for paying attention to the data flow include data volume considerations, potential error conditions, and capturing failures, among many others. With Amazon Firehose, monitoring is accomplished with the help of Amazon CloudWatch. Common delivery stream metrics are available under the Monitoring tab in each Firehose stream configuration, with additional metrics available through the CloudWatch Console.
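
As an illustration, the sketch below uses the AWS SDK for Java to pull one of the delivery stream metrics from CloudWatch; the IncomingRecords metric name, the stream name, and the five-minute period are assumptions made for the example:

import java.util.Date;

import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsRequest;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsResult;

AmazonCloudWatchClient cloudWatch = new AmazonCloudWatchClient();

// Sum of records received by the stream over the last hour, in 5-minute buckets
GetMetricStatisticsResult stats = cloudWatch.getMetricStatistics(
    new GetMetricStatisticsRequest()
        .withNamespace("AWS/Firehose")
        .withMetricName("IncomingRecords")
        .withDimensions(new Dimension().withName("DeliveryStreamName").withValue("MEETME_STREAM"))
        .withStartTime(new Date(System.currentTimeMillis() - 3600 * 1000))
        .withEndTime(new Date())
        .withPeriod(300)
        .withStatistics("Sum"));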

While AWS provides an extensive set of monitoring facilities, in our experience it turned out to be important to carefully monitor internal data producer logs for errors. Such close monitoring using the syslog facility, Splunk, and other log monitoring tools allowed us to capture and fix specific errors and reduce the number of individual record failures to tolerable levels. Further, internal log monitoring allowed us to recognize early that our volumes were quickly exceeding default Firehose throughput quotas (see above).

Anton Slutsky, Director of Data Science, MeetMe