What does this AWS Solutions Implementation do?

Many organizations use reports built from batch data and real-time streaming data to gain strategic and actionable insights into long-term business trends. In big data use cases, a growing number of customers process streaming data, that is, new and dynamic data generated on a continual basis. The streaming data is used to produce reports, perform actions based on thresholds, or perform more sophisticated forms of data analysis, such as applying machine learning algorithms.

The Real-Time Analytics with Spark Streaming solution automatically configures the AWS services necessary to easily ingest, store, process, and analyze both real-time and batch data using functions from business intelligence architecture and big data architecture. This solution deploys a highly available, secure, flexible, cost-effective streaming data analytics architecture on the AWS Cloud that leverages Apache Spark Streaming and Amazon Kinesis. The following section assumes basic knowledge of architecting on the AWS Cloud, streaming data, and data analysis.

AWS Solutions Implementation overview

This solution automatically configures a batch and real-time data-processing architecture on AWS. The Real-Time Analytics with Spark Streaming solution is designed to support custom Apache Spark Streaming applications, and leverages Amazon EMR for processing vast amounts of data across dynamically scalable Amazon Elastic Compute Cloud (Amazon EC2) instances. The diagram below presents the Real-Time Analytics architecture you can deploy in minutes using the solution's implementation guide and accompanying AWS CloudFormation template.

[Figure: Real-Time Analytics with Spark Streaming architecture diagram]

Real-Time Analytics with Spark Streaming solution architecture

This solution deploys an Amazon Virtual Private Cloud (Amazon VPC) network with one public and one private subnet. The public subnet contains a NAT gateway and a bastion host. The private subnet hosts the Amazon EMR cluster with Apache Zeppelin.

Use your custom Spark Streaming application, or deploy the AWS-provided demo application to launch an example data-processing environment. The application is deployed on the Amazon EMR cluster.

Amazon Kinesis Data Streams collects data from data sources and sends the data through the NAT gateway to the Amazon EMR cluster. After the Spark Streaming application processes the data, it stores the data in an Amazon S3 bucket.
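The ingest, process, and store flow described above can be illustrated with a minimal, standard-library-only sketch. It does not call any AWS or Spark API. It only mimics the micro-batch model that Spark Streaming uses, where incoming records are grouped into fixed time intervals and each interval is processed as a small batch. The function names, the 10-second interval, the event data, and the `output/batch-...` key layout are all assumptions made for illustration, and the in-memory dictionary merely stands in for an Amazon S3 bucket.

```python
from collections import Counter, defaultdict


def micro_batches(records, batch_seconds):
    """Group (timestamp, value) records into fixed-width time batches.

    batch_seconds is assumed to be a positive integer number of seconds.
    Returns a dict mapping each batch start time to its list of values,
    ordered by batch start time.
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        batch_start = int(timestamp // batch_seconds) * batch_seconds
        batches[batch_start].append(value)
    return dict(sorted(batches.items()))


def process_batch(values):
    """Hypothetical processing step: count occurrences of each value."""
    return dict(Counter(values))


def run_pipeline(records, batch_seconds=10):
    """Process every batch and 'store' the result under an object-style key.

    The returned dict is a stand-in for a storage bucket; the key format
    is an assumption, not the layout the solution actually writes.
    """
    store = {}
    for batch_start, values in micro_batches(records, batch_seconds).items():
        key = f"output/batch-{batch_start:06d}.json"
        store[key] = process_batch(values)
    return store


if __name__ == "__main__":
    # Made-up events: (seconds since start, event type).
    events = [
        (0.5, "click"), (3.2, "view"), (9.9, "click"),
        (10.0, "view"), (14.1, "view"), (27.0, "click"),
    ]
    for key, counts in run_pipeline(events).items():
        print(key, counts)
```

In a real deployment, the grouping and scheduling would be handled by Spark Streaming on the Amazon EMR cluster, and only the per-batch logic (here `process_batch`) would correspond to custom application code.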

Real-Time Analytics with Spark Streaming

Version 1.1.0
Last updated: 04/2020
Author: AWS

Estimated deployment time: 15 min

Features

Real-Time Analytics with Spark Streaming reference implementation

The Real-Time Analytics with Spark Streaming solution is an AWS-provided reference implementation that automatically provisions and configures the AWS services necessary to start processing real-time and batch data in minutes.

Spark Streaming application

This solution is designed to use your own application written in Java or Scala, but it also includes a demo application that you can deploy for testing purposes.

Apache Zeppelin support

The solution leverages Apache Zeppelin, a web-based notebook for interactive data analytics, to enable customers to visualize both their real-time and batch data.