What does this AWS Solution do?
Many Amazon Web Services (AWS) customers use batch data reports to gain strategic insight into long-term business trends, and a growing number of customers also require streaming data to obtain actionable insights from their data in real time. AWS provides many of the building blocks required to build a secure, flexible, cost-effective data-processing architecture in the cloud. These include AWS managed services that help ingest, store, process, and analyze both real-time and batch data.
This AWS solution automatically deploys a highly available, cost-effective batch and real-time data analytics architecture on the AWS Cloud that leverages Apache Spark Streaming and Amazon Kinesis. The following section assumes basic knowledge of architecting on the AWS Cloud, streaming data, and data analysis.
AWS Solution overview
This solution automatically configures a batch and real-time data-processing architecture on AWS. The Real-Time Analytics with Spark Streaming solution is designed to support custom Apache Spark Streaming applications, and leverages Amazon EMR for processing vast amounts of data across dynamically scalable Amazon Elastic Compute Cloud (Amazon EC2) instances. The diagram below presents the Real-Time Analytics architecture you can deploy in minutes using the solution's implementation guide and accompanying AWS CloudFormation template.
Real-Time Analytics with Spark Streaming solution architecture
This solution deploys an Amazon Virtual Private Cloud (Amazon VPC) network with one public and one private subnet. The public subnet contains a NAT gateway and a bastion host. The private subnet hosts the Amazon EMR cluster.
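The network layout described above can be sketched as a CloudFormation-style resource map built in Python. This is an illustration only, not the solution's actual template. The `10.0.0.0/16` range, the /24 subnet sizing, the logical resource names, and the `build_network_template` helper are all assumptions. The internet gateway, route tables, bastion host, and EMR cluster are left out for brevity.

```python
import ipaddress
import json


def build_network_template(vpc_cidr: str = "10.0.0.0/16") -> dict:
    """Return a CloudFormation-style dict with one public and one private subnet.

    Hypothetical helper: the CIDR ranges and logical names are illustrative.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    # Carve the first two /24 blocks out of the VPC range (assumed sizing).
    subnets = vpc.subnets(new_prefix=24)
    public_cidr = next(subnets)
    private_cidr = next(subnets)

    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "Vpc": {
                "Type": "AWS::EC2::VPC",
                "Properties": {"CidrBlock": str(vpc)},
            },
            # Public subnet: holds the NAT gateway and the bastion host.
            "PublicSubnet": {
                "Type": "AWS::EC2::Subnet",
                "Properties": {
                    "VpcId": {"Ref": "Vpc"},
                    "CidrBlock": str(public_cidr),
                    "MapPublicIpOnLaunch": True,
                },
            },
            # Private subnet: holds the Amazon EMR cluster.
            "PrivateSubnet": {
                "Type": "AWS::EC2::Subnet",
                "Properties": {
                    "VpcId": {"Ref": "Vpc"},
                    "CidrBlock": str(private_cidr),
                },
            },
            # Elastic IP address attached to the NAT gateway.
            "NatEip": {
                "Type": "AWS::EC2::EIP",
                "Properties": {"Domain": "vpc"},
            },
            "NatGateway": {
                "Type": "AWS::EC2::NatGateway",
                "Properties": {
                    "AllocationId": {"Fn::GetAtt": ["NatEip", "AllocationId"]},
                    "SubnetId": {"Ref": "PublicSubnet"},
                },
            },
        },
    }


template = build_network_template()
print(json.dumps(template, indent=2))
```

Placing the NAT gateway in the public subnet gives instances in the private subnet outbound access without exposing them to inbound traffic from the internet.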
Use your custom Spark Streaming application, or deploy the AWS-provided demo application to launch an example data-processing environment. The application is deployed on the Amazon EMR cluster.
Amazon Kinesis Data Streams collects data from data sources, sends the data through the NAT gateway to the Amazon EMR cluster, and uses an Amazon DynamoDB table for checkpointing. After the Spark Streaming application processes the data, it stores the data in an Amazon S3 bucket.
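To show why checkpointing matters in this flow, here is a minimal in-memory sketch of checkpointed stream consumption. It does not call any AWS service. Plain dictionaries stand in for the DynamoDB checkpoint table and the S3 bucket, and the `process_stream` function, shard names, and record payloads are hypothetical. A real consumer typically checkpoints periodically rather than after every record.

```python
def process_stream(records, checkpoints, output):
    """Process records newer than each shard's checkpoint, then advance it.

    records:     iterable of (shard_id, sequence_number, payload) tuples
    checkpoints: dict of shard_id -> last processed sequence number
                 (stand-in for the DynamoDB checkpoint table)
    output:      dict of payload -> count (stand-in for results stored in S3)
    Returns the number of records processed in this pass.
    """
    processed = 0
    for shard_id, seq, payload in sorted(records):
        # Skip anything at or before the checkpoint for this shard.
        if seq <= checkpoints.get(shard_id, -1):
            continue
        output[payload] = output.get(payload, 0) + 1
        checkpoints[shard_id] = seq  # record progress for this shard
        processed += 1
    return processed


records = [
    ("shard-0", 1, "click"),
    ("shard-0", 2, "view"),
    ("shard-1", 1, "click"),
]
checkpoints = {}
output = {}

first_pass = process_stream(records, checkpoints, output)

# Simulate a restart: the same records are replayed along with one new record.
records.append(("shard-0", 3, "click"))
second_pass = process_stream(records, checkpoints, output)

print(first_pass, second_pass)  # 3 1
print(output)                   # {'click': 3, 'view': 1}
print(checkpoints)              # {'shard-0': 3, 'shard-1': 1}
```

Because progress is stored outside the processing function, the second pass picks up only the new record instead of counting the replayed ones again.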
Real-Time Analytics with Spark Streaming reference implementation
Spark Streaming application
Apache Zeppelin support