
Amazon Kinesis Analytics – Process Streaming Data in Real Time with SQL

As you may know, Amazon Kinesis greatly simplifies the process of working with real-time streaming data in the AWS Cloud. Instead of setting up and running your own processing and short-term storage infrastructure, you simply create a Kinesis Stream or Kinesis Firehose, arrange to pump data into it, and then build an application to process or analyze it.

While it is relatively easy to build streaming data solutions using Kinesis Streams and Kinesis Firehose, we want to make it even easier. We want you, whether you are a procedural developer, a data scientist, or a SQL developer, to be able to process voluminous clickstreams from web applications, telemetry and sensor reports from connected devices, server logs, and more using a standard query language, all in real time!

Amazon Kinesis Analytics
Today I am happy to be able to announce the availability of Amazon Kinesis Analytics. You can now run continuous SQL queries against your streaming data, filtering, transforming, and summarizing the data as it arrives. You can focus on processing the data and extracting business value from it instead of wasting your time on infrastructure. You can build a powerful, end-to-end stream processing pipeline in 5 minutes without having to write anything more complex than a SQL query.

When I think of running a series of SQL queries against a database table, I generally think of the data as staying more or less static while the queries come and go pretty quickly. Rows are added, changed, and deleted all the time, but this does not generally matter when considering a single query that runs at a particular point in time. Running a Kinesis Analytics query against streaming data turns this model sideways.  The queries are long-running and the data changes many times per second as new records, observations, or log entries arrive. Once you wrap your head around this, you will see that the query processing model is very easy to understand: You build persistent queries that process records as they arrive.

In order to control the set of records that will be processed by a given query, you make use of a processing “window.” Kinesis Analytics supports three different types of windows:

Tumbling windows are used for periodic reports. You could use a tumbling window to summarize data over time. Perhaps you get thousands or millions of requests per second, and would like to know how many arrive each minute. When the current tumbling window closes, the next one begins immediately after it, and a new result is generated each time a window closes (I'll sketch a tumbling-window query below).

Sliding windows are used for monitoring and other types of trend detection. For example, you could use a sliding window to compute a real-time moving average for an error rate. Records enter the window, contribute to the result as long as they are within it, and the window advances. A new result is generated each time a new record enters the window. You can adjust the size of the window to control the sensitivity of the results.

Custom windows are used when the appropriate grouping is not strictly based on time. If you are processing clickstream data or server logs, you can use a custom window to perform an action known as sessionization. In other words, you can bound each query by the first and last actions performed by each user, as identified by a session identifier within the incoming data. You can write a query that computes the number of pages visited by each user or the time that they spend on your site.
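
Here's the promised sketch of a tumbling-window query. This is just an illustration: SOURCE_SQL_STREAM_001 is the default name that Kinesis Analytics gives to the in-application input stream, and the TICKER_SYMBOL column comes from the demo stock ticker data that I will use later in this post. The application code itself is plain SQL, held here in a Python string as it would be submitted through the API:

```python
# A hedged sketch (not the exact demo SQL): count records per ticker over
# one-minute tumbling windows. Stream and column names are the defaults
# proposed for the demo data and may differ in your application.
TUMBLING_WINDOW_QUERY = """
SELECT STREAM "TICKER_SYMBOL", COUNT(*) AS ticker_count
FROM "SOURCE_SQL_STREAM_001"
-- FLOOR(ROWTIME TO MINUTE) assigns each record to a one-minute window;
-- one result row per ticker is emitted each time a window closes.
GROUP BY "TICKER_SYMBOL",
         FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE);
"""
```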

While all of this might sound somewhat complicated, it is actually pretty easy to implement. Kinesis Analytics will analyze a sample of the incoming records and then propose a suitable schema. You can use it as-is, or you can fine-tune it to better reflect your actual data model. Once the schema has been defined, you can use the built-in SQL editor (complete with syntax checking and easy testing against live data). You can configure Kinesis Analytics to route the results of the query to up to four destinations, including Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and an Amazon Kinesis Stream.

When you build your first Amazon Kinesis Analytics application you need to write a pair of cooperating SQL statements (more complex applications can use more, but all it takes is two to get up and running):

A statement to create an in-application stream to store intermediate SQL results (a stream is like a SQL table that is continuously updated, which you can select from and insert into).

Your SQL query, which selects from one in-application stream and inserts into another in-application stream.
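
Here's a minimal sketch of that two-statement pattern, submitted through the API with boto3. The application, stream, and column names are illustrative, and the filter itself mirrors the Continuous filter template that I use in the walk-through below:

```python
import boto3

# The two cooperating statements: an in-application stream to hold the
# results, plus a pump that continuously inserts the query's output into it.
APPLICATION_CODE = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker VARCHAR(16), price REAL);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "TICKER_SYMBOL", "PRICE"
    FROM "SOURCE_SQL_STREAM_001"
    WHERE "PRICE" > 50;
"""

kinesisanalytics = boto3.client("kinesisanalytics")
kinesisanalytics.create_application(
    ApplicationName="ticker-filter",
    ApplicationDescription="Continuous filter over the demo stock ticker stream",
    ApplicationCode=APPLICATION_CODE,
)
```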

Your SQL statements can also JOIN the records to reference data that originates in S3. This can be handy when you want to enhance or modify the records to include additional, perhaps more descriptive, information.
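
For example, the enrichment query might look something like this sketch, which assumes a hypothetical reference table named COMPANY that has been mapped to an object in S3:

```python
# Hedged sketch: enrich each streaming record with a company name from a
# reference table loaded from S3. The table and column names are hypothetical.
ENRICHMENT_QUERY = """
SELECT STREAM s."TICKER_SYMBOL", c."COMPANY_NAME", s."PRICE"
FROM "SOURCE_SQL_STREAM_001" AS s
JOIN "COMPANY" AS c
  ON s."TICKER_SYMBOL" = c."TICKER_SYMBOL";
"""
```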

Amazon Kinesis Analytics in Action
Let’s spend a few minutes looking at Amazon Kinesis Analytics in action!

I log in to the Amazon Kinesis Analytics Console and click on Create new application. Then I enter a name and a description for my app:

Now I can manage my data source, my queries, and the destination(s):

I can select one of my existing input streams:

Or I can configure a new one (I’ll do that):

I click on Create demo stream to create a stream that will be populated with sample stock ticker data. This takes 30 to 40 seconds!

Kinesis Analytics peeks at the stream and proposes a schema. I can accept it as-is or fine tune it:

Then I hop over to the SQL editor. It offers to start my app. That seems like a good idea, so I agree  and click on Yes, start application:

Here’s the actual SQL editor:

I can write my query from scratch or I can use a template:

I picked Continuous filter; here’s the SQL:

I inspected it, nodded in agreement, and then clicked on Save and run SQL.  Within seconds, results began to flow in and were visible in the Console:

I used the SQL editor to modify the query to remove the sector and price columns and ran the query again. When I did this I learned that I needed to remove the columns from the CREATE STREAM statement (this is obvious in retrospect but it was the end of a long day).

Here’s the revised result set:

In most cases the next step would be to route the results to a new or existing stream. I can do that from the Console:

With just a couple of clicks and a little bit of typing, I have created an Amazon Kinesis Analytics app that is capable of processing a production-scale stock ticker stream. This “demo” needs no changes whatsoever before being used in production. I think that’s kind of cool.

Learn More & Try it Yourself
As usual, I have barely scratched the surface of this exciting new service!  To learn more, you should read the new post, Writing SQL on Streaming Data with Amazon Kinesis Analytics.

You should be able to replicate my steps above in 5 minutes or less and I strongly recommend that you do so. Create your application, customize the SQL query, and learn how to process streaming data at scale.

Available Now
Amazon Kinesis Analytics is available now and you can start running queries against your streaming data today!


Powerful AWS Platform Features, Now for Containers

Containers are great but they come with their own management challenges. Our customers have been using containers on AWS for quite some time to run workloads ranging from microservices to batch jobs. They told us that managing a cluster, including the state of the EC2 instances and containers, can be tricky, especially as the environment grows. They also told us that integrating the capabilities you get with the AWS platform, such as load balancing, scaling, security, monitoring, and more, with containers is a key requirement. Amazon ECS was designed to meet all of these needs and more.

We created Amazon ECS  to make it easy for customers to run containerized applications in production. There is no container management software to install and operate because it is all provided to you as a service. You just add the EC2 capacity you need to your cluster and upload your container images. Amazon ECS takes care of the rest, deploying your containers across a cluster of EC2 instances and monitoring their health. Customers such as Expedia and Remind have built Amazon ECS into their development workflow, creating PaaS platforms on top of it. Others, such as Prezi and Shippable, are leveraging ECS to eliminate operational complexities of running containers, allowing them to spend more time delivering features for their apps.

AWS has highly reliable and scalable fully-managed services for load balancing, auto scaling, identity and access management, logging, and monitoring. Over the past year, we have continued to natively integrate the capabilities of the AWS platform with your containers through ECS, giving you the same capabilities you are used to on EC2 instances.

Amazon ECS recently delivered container support for application load balancing (Today), IAM roles (July), and Auto Scaling (May). We look forward to bringing more of the AWS platform to containers over time.

Let’s take a look at the new capabilities!

Application Load Balancing
Load balancing and service discovery are essential parts of any microservices architecture. Because Amazon ECS uses Elastic Load Balancing, you don’t need to manage and scale your own load balancing layer. You also get direct access to other AWS services that support ELB such as AWS Certificate Manager (ACM) to automatically manage your service’s certificates and Amazon API Gateway to authenticate callers, among other features.

Today, I am happy to announce that ECS supports the new application load balancer, a high-performance load balancing option that operates at the application layer and allows you to define content-based routing rules. The application load balancer includes two features that simplify running microservices on ECS: dynamic ports and the ability for multiple services to share a single load balancer.

Dynamic ports make it easier to start tasks in your cluster without having to worry about port conflicts. Previously, to use Elastic Load Balancing to route traffic to your applications, you had to define a fixed host port in the ECS task. This added operational complexity, as you had to track the ports each application used, and it reduced cluster efficiency, as only one task could be placed per instance. Now, you can specify a dynamic port in the ECS task definition, which gives the container an unused port when it is scheduled on the EC2 instance. The ECS scheduler automatically adds the task to the application load balancer’s target group using this port. To get started, you can create an application load balancer from the EC2 Console or using the AWS Command Line Interface (CLI). Create a task definition in the ECS console with a container that sets the host port to 0. This container automatically receives a port in the ephemeral port range when it is scheduled.
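
Here's a minimal sketch of such a task definition, registered with boto3; the family, container name, and image are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# Setting hostPort to 0 enables dynamic ports: ECS assigns an unused host
# port when the task is scheduled, and registers that port with the
# application load balancer's target group.
ecs.register_task_definition(
    family="web-service",
    containerDefinitions=[
        {
            "name": "web",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest",
            "memory": 256,
            "portMappings": [
                {"containerPort": 80, "hostPort": 0, "protocol": "tcp"}
            ],
        }
    ],
)
```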

Previously, there was a one-to-one mapping between ECS services and load balancers. Now, a load balancer can be shared by multiple services, using path-based routing. Each service can define its own URI, which can be used to route traffic to that service. In addition, you can create an environment variable with the service’s DNS name, supporting basic service discovery. For example, a stock service and a weather service could each be reached at their own paths, both served from the same load balancer. A news portal could then use the load balancer to access both the stock and weather services.
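
Here's a hedged sketch of two services sharing one load balancer via path-based rules; the ARNs and paths are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2")

# One rule per service; requests whose paths match are forwarded to that
# service's target group. All other traffic follows the listener's default.
rules = [
    ("/stock/*", "arn:aws:elasticloadbalancing:...:targetgroup/stock/..."),
    ("/weather/*", "arn:aws:elasticloadbalancing:...:targetgroup/weather/..."),
]

for priority, (path, target_group_arn) in enumerate(rules, start=1):
    elbv2.create_rule(
        ListenerArn="arn:aws:elasticloadbalancing:...:listener/app/my-alb/...",
        Priority=priority,
        Conditions=[{"Field": "path-pattern", "Values": [path]}],
        Actions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
    )
```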

IAM Roles for ECS Tasks
In Amazon ECS, you have always been able to use IAM roles for your Amazon EC2 container instances to simplify the process of making API requests from your containers. This also allows you to follow AWS best practices by not storing your AWS credentials in your code or configuration files, as well as providing benefits such as automatic key rotation.

With the introduction of the recently launched IAM roles for ECS tasks, you can secure your infrastructure by assigning an IAM role directly to the ECS task rather than to the EC2 container instance. This way, you can have one task that uses a specific IAM role for access to, let’s say, S3 and another task that uses an IAM role to access a DynamoDB table, both running on the same EC2 instance.
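
Here's a short sketch of what that looks like in a task definition; the role and image names are hypothetical:

```python
import boto3

ecs = boto3.client("ecs")

# The taskRoleArn is granted to this task only; another task on the same
# EC2 instance can carry a different role (say, one scoped to DynamoDB).
ecs.register_task_definition(
    family="s3-reader",
    taskRoleArn="arn:aws:iam::123456789012:role/S3ReadOnlyTaskRole",
    containerDefinitions=[
        {
            "name": "reader",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/reader:latest",
            "memory": 128,
        }
    ],
)
```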

Service Auto Scaling
The third feature I want to highlight is Service Auto Scaling. With Service Auto Scaling and Amazon CloudWatch alarms, you can define scaling policies to scale your ECS services in the same way that you scale your EC2 instances up and down. With Service Auto Scaling, you can achieve high availability by scaling up when demand is high, and optimize costs by scaling down your service and the cluster when demand is lower, all automatically and in real time.

You simply choose the desired, minimum, and maximum number of tasks, create one or more scaling policies, and Service Auto Scaling handles the rest. The service scheduler is also Availability Zone–aware, so you don’t have to worry about distributing your ECS tasks across multiple zones.
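
Here's a hedged sketch of those steps using the Application Auto Scaling API; the cluster, service, and role names are placeholders, and the resulting policy would be attached to a CloudWatch alarm:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the service's DesiredCount as a scalable target with min/max
# bounds, then add a step scaling policy to adjust it.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=10,
    RoleARN="arn:aws:iam::123456789012:role/ecsAutoscaleRole",
)

policy = autoscaling.put_scaling_policy(
    PolicyName="scale-out-on-high-cpu",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "Cooldown": 60,
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 2}
        ],
    },
)
# policy["PolicyARN"] is then referenced by the CloudWatch alarm's actions.
```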

Available Now
These features are available now and you can start using them today!


New – AWS Application Load Balancer

We launched Elastic Load Balancing (ELB) for AWS in the spring of 2009 (see New Features for Amazon EC2: Elastic Load Balancing, Auto Scaling, and Amazon CloudWatch to see just how far AWS has come since then). Elastic Load Balancing has become a key architectural component for many AWS-powered applications. In conjunction with Auto Scaling, Elastic Load Balancing greatly simplifies the task of building applications that scale up and down while maintaining high availability.

On the Level
Per the well-known OSI model, load balancers generally run at Layer 4 (transport) or Layer 7 (application).

A Layer 4 load balancer works at the network protocol level and does not look inside of the actual network packets, remaining unaware of the specifics of HTTP and HTTPS. In other words, it balances the load without necessarily knowing a whole lot about it.

A Layer 7 load balancer is more sophisticated and more powerful. It inspects packets, has access to HTTP and HTTPS headers, and (armed with more information) can do a more intelligent job of spreading the load out to the target.

Application Load Balancing for AWS
Today we are launching a new Application Load Balancer option for ELB. This option runs at Layer 7 and supports a number of advanced features. The original option (now called a Classic Load Balancer) is still available to you and continues to offer Layer 4 and Layer 7 functionality.

Application Load Balancers support content-based routing, and support applications that run in containers. They support a pair of industry-standard protocols (WebSocket and HTTP/2) and also provide additional visibility into the health of the target instances and containers. Web sites and mobile apps, running in containers or on EC2 instances, will benefit from the use of Application Load Balancers.

Let’s take a closer look at each of these features and then create a new Application Load Balancer of our very own!

Content-Based Routing
An Application Load Balancer has access to HTTP headers and allows you to route requests to different backend services accordingly. For example, you might want to send requests that include /api in the URL path to one group of servers (we call these target groups) and requests that include /mobile to another. Routing requests in this fashion allows you to build applications that are composed of multiple microservices that can run and be scaled independently.

As you will see in a moment, each Application Load Balancer allows you to define up to 10 URL-based rules to route requests to target groups. Over time, we plan to give you access to other routing methods.

Support for Container-Based Applications
Many AWS customers are packaging up their microservices into containers and hosting them on Amazon EC2 Container Service. This allows a single EC2 instance to run one or more services, but can present some interesting challenges for traditional load balancing with respect to port mapping and health checks.

The Application Load Balancer understands and supports container-based applications. It allows one instance to host several containers that listen on multiple ports behind the same target group and also performs fine-grained, port-level health checks.

Better Metrics
Application Load Balancers can perform and report on health checks on a per-port basis. The health checks can specify a range of acceptable HTTP responses, and are accompanied by detailed error codes.

As a byproduct of the content-based routing, you also have the opportunity to collect metrics on each of your microservices. This is a really nice side effect of the fact that each microservice can run in its own target group, on a specific set of EC2 instances. This increased visibility will allow you to do a better job of scaling up and down in response to the load on individual services.

The Application Load Balancer provides several new CloudWatch metrics including overall traffic (in GB), number of active connections, and the connection rate per hour.

Support for Additional Protocols & Workloads
The Application Load Balancer supports two additional protocols: WebSocket and HTTP/2.

WebSocket allows you to set up long-standing TCP connections between your client and your server. This is a more efficient alternative to the old-school method which involved HTTP connections that were held open with a “heartbeat” for very long periods of time. WebSocket is great for mobile devices and can be used to deliver stock quotes, sports scores, and other dynamic data while minimizing power consumption. ALB provides native support for WebSocket via the ws:// and wss:// protocols.

HTTP/2 is a significant enhancement of the original HTTP 1.1 protocol. The newer protocol supports multiplexed requests across a single connection. This reduces network traffic, as does the binary nature of the protocol.

The Application Load Balancer is designed to handle streaming, real-time, and WebSocket workloads in an optimized fashion. Instead of buffering requests and responses, it handles them in streaming fashion. This reduces latency and increases the perceived performance of your application.

Creating an ALB
Let’s create an Application Load Balancer and get it all set up to process some traffic!

The Elastic Load Balancing Console lets me create either type of load balancer:

I click on Application load balancer, enter a name (MyALB), and choose internet-facing. Then I add an HTTPS listener:

On the same screen, I choose my VPC (this is a VPC-only feature) and one subnet in each desired Availability Zone, tag my Application Load Balancer, and proceed to Configure Security Settings:

Because I created an HTTPS listener, my Application Load Balancer needs a certificate. I can choose an existing certificate that’s already in IAM or AWS Certificate Manager (ACM),  upload a local certificate, or request a new one:

Moving right along, I set up my security group. In this case I decided to create a new one. I could have used one of my existing VPC or EC2 security groups just as easily:

The next step is to create my first target group (main) and to set up its health checks (I’ll take the defaults):

Now I am ready to choose the targets—the set of EC2 instances that will receive traffic through my Application Load Balancer. Here, I choose the targets that are listening on port 80:

The final step is to review my choices and to Create my ALB:

After I click on Create the Application Load Balancer is provisioned and becomes active within a minute or so:

I can create additional target groups:

And then I can add a new rule that routes /api requests to that target:
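
For reference, here's a hedged sketch of the same setup expressed as API calls with boto3; the subnet, security group, certificate, and instance IDs are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Create the load balancer, a target group, and an HTTPS listener, then
# register an instance target -- mirroring the console steps above.
lb = elbv2.create_load_balancer(
    Name="MyALB",
    Scheme="internet-facing",
    Subnets=["subnet-1111aaaa", "subnet-2222bbbb"],
    SecurityGroups=["sg-0123abcd"],
)["LoadBalancers"][0]

main = elbv2.create_target_group(
    Name="main", Protocol="HTTP", Port=80, VpcId="vpc-0123abcd",
)["TargetGroups"][0]

elbv2.register_targets(
    TargetGroupArn=main["TargetGroupArn"],
    Targets=[{"Id": "i-0123456789abcdef0", "Port": 80}],
)

listener = elbv2.create_listener(
    LoadBalancerArn=lb["LoadBalancerArn"],
    Protocol="HTTPS",
    Port=443,
    Certificates=[{"CertificateArn": "arn:aws:acm:us-east-1:123456789012:certificate/..."}],
    DefaultActions=[{"Type": "forward", "TargetGroupArn": main["TargetGroupArn"]}],
)["Listeners"][0]

# The /api rule forwards matching requests to a second target group.
api = elbv2.create_target_group(
    Name="api", Protocol="HTTP", Port=8080, VpcId="vpc-0123abcd",
)["TargetGroups"][0]

elbv2.create_rule(
    ListenerArn=listener["ListenerArn"],
    Priority=1,
    Conditions=[{"Field": "path-pattern", "Values": ["/api*"]}],
    Actions=[{"Type": "forward", "TargetGroupArn": api["TargetGroupArn"]}],
)
```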

Application Load Balancers work with multiple AWS services including Auto Scaling, Amazon ECS, AWS CloudFormation, AWS CodeDeploy, and AWS Certificate Manager (ACM). Support for and within other services is in the works.

Moving on Up
If you are currently using a Classic Load Balancer and would like to migrate to an Application Load Balancer, take a look at our new Load Balancer Copy Utility. This Python tool will help you to create an Application Load Balancer with the same configuration as an existing Classic Load Balancer. It can also register your existing EC2 instances with the new load balancer.

Availability & Pricing
The Application Load Balancer is available now in all commercial AWS regions and you can start using it today!

The hourly rate for the use of an Application Load Balancer is 10% lower than the cost of a Classic Load Balancer.

When you use an Application Load Balancer, you will be billed by the hour and for the use of Load Balancer Capacity Units, also known as LCUs. An LCU measures the number of new connections per second, the number of active connections, and data transfer. We measure on all three dimensions, but bill based on the highest one. One LCU is enough to support either:

  • 25 connections/second with a 2 KB certificate, 3,000 active connections, and 2.22 Mbps of data transfer or
  • 5 connections/second with a 4 KB certificate, 3,000 active connections, and 2.22 Mbps of data transfer.

Billing for LCU usage is fractional, and is charged at $0.008 per LCU per hour. Based on our calculations, we believe that virtually all of our customers can obtain a net reduction in their load balancer costs by switching from a Classic Load Balancer to an Application Load Balancer.
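
To make the math concrete, here's a quick sketch of the billing calculation for a hypothetical workload (the traffic numbers are made up):

```python
# AWS measures all three dimensions but bills on the highest one.
new_connections_per_sec = 100   # at 25 per LCU -> 4.0 LCUs
active_connections = 6000       # at 3,000 per LCU -> 2.0 LCUs
data_transfer_mbps = 10.0       # at 2.22 Mbps per LCU -> ~4.5 LCUs

lcus = max(
    new_connections_per_sec / 25,
    active_connections / 3000,
    data_transfer_mbps / 2.22,
)
print(f"{lcus:.2f} LCUs -> ${lcus * 0.008:.4f} per hour, plus the hourly rate")
```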





Amazon Elastic Block Store (EBS) Update – Snapshot Price Reduction & More PIOPS/GiB

Amazon Elastic Block Store (EBS) makes it easy for you to create persistent block level storage volumes and attach them to your EC2 instances. Among many other features, EBS allows you to provision SSD-backed volumes with the desired level of performance (Provisioned IOPS or PIOPS) and to create snapshot backups either manually or programmatically.

Today I would like to tell you about some improvements and changes that we are making to both of these features. We are increasing the number of IOPS that you can provision per GiB of storage and we are reducing the price of snapshot storage by 47%. Together, these changes make EBS more powerful and even more economical.

More PIOPS per GiB
We introduced the concept of Provisioned IOPS back in 2012 (read my post, Fast Forward – Provisioned IOPS for EBS Volumes, to learn more). With Provisioned IOPS, you can dial in the precise level of performance that you need for each EBS volume. The number of PIOPS that you can configure is a function of the volume size; the larger the volume, the more PIOPS you can configure, up to a per-volume maximum of 20,000 PIOPS.

Until now, you could provision up to 30 IOPS per GiB of SSD-backed storage. Now, you can provision up to 50 IOPS per GiB for new EBS volumes (a 66% increase). As has always been the case, Provisioned IOPS SSD (io1) volumes are designed to deliver within 10% of the provisioned IOPS performance 99.9% of the time in a given year (read the EBS Performance FAQ to learn more).

Some AWS customers create small EBS volumes that are intended to run extremely “hot”, maxing out the available PIOPS either occasionally or continuously. With this change, these small, hot volumes can be considerably smaller while still delivering the same level of performance. For example, if you need 20,000 PIOPS you can create a 400 GiB volume instead of a 667 GiB volume:
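
Here's the arithmetic behind that example as a quick sketch:

```python
import math

# Smallest volume that can support a given PIOPS level, before (30 IOPS/GiB)
# and after (50 IOPS/GiB) this change; 20,000 PIOPS is the per-volume cap.
def min_volume_gib(piops, iops_per_gib):
    return math.ceil(piops / iops_per_gib)

print(min_volume_gib(20000, 30))  # 667 GiB previously
print(min_volume_gib(20000, 50))  # 400 GiB now
```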

This change applies to all newly created SSD-backed volumes in all commercial AWS Regions.

Snapshot Price Reduction
We are reducing the prices for EBS snapshots by 47% for all AWS Regions.

With this change, snapshots are even more economical! As a result, you can take backups more frequently in order to reduce recovery time after human errors. If you are not making backups of your EBS volumes on a regular basis, now is a good time to start!

This price reduction is retroactive to August 1, 2016 and will be applied automatically; it also applies to snapshots of Gateway-Cached volumes that are used with the AWS Storage Gateway.

Join our Team
If you are a developer, development manager, or product manager and would like to build systems like this, please visit the EBS Jobs page.



Amazon CloudFront Expands to Canada

With a long feature list (powered in large part by customer requests) Amazon CloudFront is well-suited to delivering your static, dynamic, and interactive content to users all over the world at high speed and with low latency. As part of the AWS Free Tier, you can handle up to 2 million HTTP and HTTPS requests and transfer up to 50 GB of data each month at no charge.

I am happy to announce that we are adding CloudFront edge locations in Toronto and Montreal in order to better serve our users in the region, bringing the global count up to 59 (full list). This includes a second edge location in São Paulo, Brazil that we recently brought online. Pricing for the locations in Toronto and Montreal is the same as for our US edge locations (see CloudFront Pricing for more info). The edge locations in Canada fall within Price Class 100.

If your application already uses CloudFront you need not do anything special in order to take advantage of the new locations. Your users will enjoy fast, low-latency access to your static, dynamic, or streamed content regardless of their location. As a developer, you will find CloudFront to be simple to use as well as cost-effective. Because it is elastic, you don’t need to over-provision in order to handle unpredictable traffic loads.

Before you ask, these new locations will also support Amazon Route 53 in the future. Again, you won’t need to do anything special in order to take advantage of the new locations!



PS – You can learn more about CloudFront at our monthly Office Hours (register now). The next session will be held at 10 AM PT on August 30th, 2016.


New – Just-in-Time Certificate Registration for AWS IoT

We launched AWS IoT at re:Invent (read AWS IoT – Cloud Services for Connected Devices for an introduction) and made it generally available last December.

Earlier this year my colleague Olawale Oladehin showed you how to Use Your Own Certificate with AWS IoT. Before that, John Renshaw talked about Predictive Maintenance with AWS IoT and Amazon Machine Learning.

Just-in-Time Registration
Today we are making AWS IoT even more flexible by giving you the ability to do Just-in-Time registration of device certificates. This expands on the feature described by Olawale, and simplifies the process of building systems that make use of millions of connected devices. Instead of having to build a separate database to track the certificates and the associated devices, you can now arrange to automatically register new certificates as part of the initial  communication between the device and AWS IoT.

In order to do this, you start with a CA (Certificate Authority) certificate that you later use to sign the per-device certificates (this is a great example of the chain of trust model that is fundamental to the use of digital certificates).

Putting this new feature to use is pretty easy, but you do have to take care of some important details. Here are the principal steps:

  1. Register & activate the CA certificate that will sign the other certificates.
  2. Use the certificate to generate and sign certificates for each device.
  3. Have the device present the certificate to AWS IoT and then activate it.

The final step can be implemented using an AWS Lambda function. The function simply listens on a designated MQTT topic using an AWS IoT Rule Engine Action. A notification will be sent to the topic each time a new certificate is presented to AWS IoT. The function can then activate the device certificate and take care of any other initialization or registration required by your application.
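
Here's a minimal sketch of such a function, assuming that the rule forwards the registration message (which carries the new certificate's ID) to Lambda; the event field name and the application-specific steps are assumptions:

```python
import boto3

iot = boto3.client("iot")

def lambda_handler(event, context):
    # The registration message published when a new certificate is first
    # presented includes the certificate's ID (an assumption about the
    # payload shape; adjust to match your rule's output).
    certificate_id = event["certificateId"]

    # Activate the device certificate so subsequent connections succeed.
    iot.update_certificate(certificateId=certificate_id, newStatus="ACTIVE")

    # Application-specific registration (attach a policy, create a thing,
    # record the device in a database) would go here.
    return {"activated": certificate_id}
```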

Learn More
To learn more about this important new feature and to review all of the necessary steps in detail, read about Just in Time Registration of Device Certificates on AWS IoT on The Internet of Things on AWS Blog.



Amazon EMR 5.0.0 – Major App Updates, UI Improvements, Better Debugging, and More

The Amazon EMR team has been cranking out new releases at a fast and furious pace! Here’s a quick recap of this year’s launches:

  • EMR 4.7.0 – Updates to Apache Tez, Apache Phoenix, Presto, HBase, and Mahout (June).
  • EMR 4.6.0 – HBase for realtime access to massive datasets (April).
  • EMR 4.5.0 – Updates to Hadoop, Presto; addition of Spark and EMRFS (April).
  • EMR 4.4.0 – Sqoop, HCatalog, Java 8, and more (March).
  • EMR 4.3.0 – Updates to Spark, Presto, and Ganglia (January).

Today the team is announcing and releasing EMR 5.0.0. This is a major release that includes support for 16 open source Hadoop ecosystem projects, major version upgrades for Spark and Hive, use of Tez by default for Hive and Pig, user interface improvements to Hue and Zeppelin, and enhanced debugging functionality.

Here’s a map that shows how EMR has progressed over the course of the past couple of releases:

Let’s check out the new features in EMR 5.0.0!

Support for 16 Open Source Hadoop Ecosystem Projects
We started using Apache Bigtop to manage the EMR build and packaging process during the development of EMR 4.0.0. The use of Bigtop helped us to accelerate the release cycle while we continued to add additional packages from the Hadoop ecosystem, with a goal of making the newest GA (generally available) open source versions accessible to you as quickly as possible.

In accord with our goal, EMR 5.0 includes support for 16 Hadoop ecosystem projects including Apache Hadoop, Apache Spark, Presto, Apache Hive, Apache HBase, and Apache Tez. You can choose the desired set of apps when you create a new EMR cluster:

Major Version Upgrade for Spark and Hive
This release of EMR updates Hive (a SQL-like interface for Tez and Hadoop MapReduce) from 1.0 to 2.1, accompanied by a move to Java 8. It also updates Spark (an engine for large-scale data processing) from 1.6.2 to 2.0, with a similar move to Scala 2.11. The Spark and Hive updates are both major releases and include new features, performance enhancements, and bug fixes. For example, Spark now includes a Structured Streaming API, better SQL support, and more. Be aware that the new versions of Spark and Hive are not 100% backward compatible with the old ones; check your code and upgrade to EMR 5.0.0 with care.

With this release, Tez is now the default execution engine for Hive 2.1 and Pig 0.16, replacing Hadoop MapReduce and resulting in better performance, including reduced query latency. With this update, EMR uses MapReduce only when running a Hadoop MapReduce job directly (Hive and Pig now use Tez; Spark has its own framework).

User Interface Improvements
EMR 5.0.0 also updates Apache Zeppelin (a notebook for interactive data analytics) from 0.5.6 to 0.6.1, and Hue (an interface for analyzing data with Hadoop) from 3.7.1 to 3.10. The new versions of both of these web-based tools include new features and lots of smaller improvements.

Zeppelin is often used with Spark; Hue works well with Hive, Pig, and HBase. The new version of Hue includes a notebooks feature that allows you to have multiple queries on the same page:

Hue can also help you to design Oozie workflows:

Enhanced Debugging Functionality
Finally, EMR 5.0.0 includes some better debugging functionality, making it easier for you to figure out why a particular step of your EMR job failed. The console now displays a partial stack trace and links to the log file (stored in Amazon S3) in order to help you to find, troubleshoot, and fix errors:

Launch a Cluster Today
You can launch an EMR 5.0.0 cluster today in any AWS Region! Open up the EMR Console, click on Create cluster, and choose emr-5.0.0 from the Release menu:
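
If you prefer the API, here's a hedged sketch of the equivalent call with boto3 (instance types, counts, and roles are placeholders):

```python
import boto3

emr = boto3.client("emr")

# Launch a small emr-5.0.0 cluster with a few of the supported applications.
emr.run_job_flow(
    Name="my-emr-500-cluster",
    ReleaseLabel="emr-5.0.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```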

Learn More
To learn more about this powerful new release of EMR, plan to attend our webinar on August 23rd, Introducing Amazon EMR Release 5.0: Faster, Easier, Hadoop, Spark, and Presto.