Many Amazon Web Services (AWS) customers use batch data reports to gain strategic insight into long-term business trends, and a growing number of customers also require streaming data to obtain actionable insights from their data in real time. AWS provides many of the building blocks required to build a secure, flexible, cost-effective data-processing architecture in the cloud. These include AWS managed services that help ingest, store, process, and analyze both real-time and batch data.

This webpage introduces an AWS solution that automatically deploys a highly available, cost-effective batch and real-time data analytics architecture on the AWS Cloud that leverages Apache Spark Streaming and Amazon Kinesis. The following section assumes basic knowledge of architecting on the AWS Cloud, streaming data, and data analysis.

Choose a unified framework that combines batch and real-time processing in the cloud. It is easier to develop and maintain a unified framework than it is to integrate bespoke data-processing solutions. Additionally, consider the following best practices for implementing a batch and real-time processing solution on AWS:

  • Secure your data processing resources, and plan how to grant users secure access to your real-time and batch data and processing systems. Consider implementing granular access-control policies and encryption to protect your data.
  • Understand the compute and memory requirements for processing your data. For example, some batch and real-time processing frameworks are memory intensive and benefit from memory-optimized compute resources.
  • Understand your query and processing requirements and timeframes so you can more effectively leverage elastic resources to process your data.
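
As an illustration of the granular access-control practice above, the following Python sketch builds an IAM policy document that grants read-only access to a single Amazon Kinesis stream. The helper name `kinesis_read_policy`, the account ID, and the stream name are hypothetical and are not part of the solution; adapt the actions and resource to your own environment.

```python
import json


def kinesis_read_policy(stream_arn: str) -> dict:
    """Build an IAM policy document allowing read-only access to one stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Only the actions a consumer needs to read records.
                "Action": [
                    "kinesis:DescribeStream",
                    "kinesis:GetShardIterator",
                    "kinesis:GetRecords",
                ],
                # Scope the permission to a single stream, not "*".
                "Resource": stream_arn,
            }
        ],
    }


# Hypothetical ARN, used only for illustration.
policy = kinesis_read_policy(
    "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream"
)
print(json.dumps(policy, indent=2))
```

Scoping `Resource` to one stream ARN rather than a wildcard keeps the policy aligned with least-privilege access.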

AWS offers a solution that automatically configures a batch and real-time data-processing architecture on AWS. The Real-Time Analytics with Spark Streaming solution is designed to support custom Apache Spark Streaming applications, and leverages Amazon EMR for processing vast amounts of data across dynamically scalable Amazon Elastic Compute Cloud (Amazon EC2) instances. The steps below describe the Real-Time Analytics architecture you can deploy in minutes using the solution's implementation guide and accompanying AWS CloudFormation template.

  1. This solution deploys an Amazon Virtual Private Cloud (Amazon VPC) network with one public and one private subnet. The public subnet contains a NAT gateway and a bastion host. The private subnet hosts the Amazon EMR cluster.
  2. Use your custom Spark Streaming application, or deploy the AWS-provided demo application to launch an example data-processing environment. The application is deployed on the Amazon EMR cluster.
  3. Amazon Kinesis Streams collects data from data sources, sends the data through the NAT gateway to the Amazon EMR cluster, and uses an Amazon DynamoDB table for checkpointing.
  4. After the Spark Streaming application processes the data, it stores the data in an Amazon S3 bucket.
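
The data flow in steps 3 and 4 can be sketched in plain Python. This is a conceptual model only, not the solution's code: an in-memory list stands in for a Kinesis shard, one dictionary stands in for the DynamoDB checkpoint table, and another stands in for the Amazon S3 bucket. The function name, key layout, and batch size are assumptions made for illustration.

```python
from collections import defaultdict


def process_stream(records, checkpoints, bucket, shard_id="shard-0", batch_size=3):
    """Consume records after the last checkpoint, aggregate each batch,
    store the result, then advance the checkpoint."""
    # Resume from the record after the last checkpointed position.
    start = checkpoints.get(shard_id, -1) + 1
    for batch_start in range(start, len(records), batch_size):
        batch = records[batch_start:batch_start + batch_size]
        # "Processing" here is a simple count of records per key.
        counts = defaultdict(int)
        for record in batch:
            counts[record["key"]] += 1
        last_seq = batch_start + len(batch) - 1
        # Store the batch output first (stand-in for an S3 object)...
        bucket[f"output/{shard_id}/{last_seq}.json"] = dict(counts)
        # ...then record progress (stand-in for the DynamoDB checkpoint).
        checkpoints[shard_id] = last_seq
    return checkpoints.get(shard_id, -1)


records = [{"key": "a"}, {"key": "b"}, {"key": "a"}, {"key": "c"}]
checkpoints, bucket = {}, {}
process_stream(records, checkpoints, bucket)
# A second run finds nothing new because the checkpoint has advanced.
process_stream(records, checkpoints, bucket)
print(checkpoints, sorted(bucket))
```

Writing the output before advancing the checkpoint means a failure between the two steps causes a batch to be reprocessed rather than skipped, which is the usual reason a streaming consumer keeps checkpoints at all.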

What you'll accomplish:

Deploy Real-Time Analytics with Spark Streaming using AWS CloudFormation. The CloudFormation template will automatically launch and configure the components necessary to process real-time and batch data in a single framework.
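
If you prefer to launch the template programmatically rather than from the console, a minimal sketch using the AWS SDK for Python (boto3) might look like the following. The stack name, template URL, and the `KeyName` parameter are placeholders, not values taken from the solution; consult the implementation guide for the actual template location and parameters. Whether the template requires the `CAPABILITY_IAM` acknowledgment is also an assumption.

```python
def build_stack_request(stack_name, template_url, parameters):
    """Build the keyword arguments for CloudFormation's create_stack call."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        # CloudFormation expects parameters as a list of key/value pairs.
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in sorted(parameters.items())
        ],
        # Assumption: the template creates IAM resources.
        "Capabilities": ["CAPABILITY_IAM"],
    }


def deploy(request, region):
    """Submit the request; requires boto3 and valid AWS credentials."""
    import boto3  # third-party dependency, imported only when deploying

    client = boto3.client("cloudformation", region_name=region)
    return client.create_stack(**request)["StackId"]


# Placeholder values for illustration only.
request = build_stack_request(
    "real-time-analytics-demo",
    "https://example.com/real-time-analytics.template",
    {"KeyName": "my-key-pair"},
)
print(request)
# deploy(request, "us-east-1")  # uncomment to create the stack
```

Separating the request construction from the API call lets you inspect or test the parameters without touching an AWS account.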

Build a unified framework for processing batch and real-time data on AWS with all the components necessary to support a Spark Streaming application for data processing and analysis.

Automatically analyze real-time and batch data with Apache Spark Streaming and Spark SQL on the AWS Cloud.

What you'll need before starting:

An AWS account: You will need an AWS account to begin provisioning resources. Sign up for AWS.

A Spark Streaming application: This solution is designed to use your own application written in Java or Scala, but it also includes a demo application that you can deploy for testing purposes.

Skill level: This solution is intended for IT infrastructure professionals who have practical experience architecting on the AWS Cloud, and are familiar with streaming data and Apache Spark.

Q: Can I use my custom Spark Streaming application with the solution?

Yes. This solution is designed to let you use your own Spark Streaming application written in Java or Scala. We recommend that you use the latest version of Apache Spark for your application. The solution also includes an AWS CloudFormation template that deploys a demo application for testing purposes. You can modify the demo application for your specific needs.

Q: Can I deploy this solution with multiple Spark Streaming applications?

No. The Real-Time Analytics solution is designed to work with only one Spark Streaming application at a time. If you want to change applications, you must first stop the running application and then deploy the solution again with a new application. This also applies if you deploy the demo application: you must stop the running demo application before you can deploy the demo application again.

Q: Can I deploy the Real-Time Analytics solution in any AWS Region?

Customers can deploy the Real-Time Analytics with Spark Streaming CloudFormation template only in AWS Regions where AWS Lambda is available. For more information, please see AWS service offerings by region.
