AWS Solutions Library

Real-Time Analytics with Spark Streaming

Many organizations use reports built from batch data and real-time streaming data to gain strategic and actionable insights into long-term business trends. A growing number of customers process streaming data, meaning new and dynamic data generated on a continual basis, in big data use cases. The streaming data is used to produce reports, perform actions based on thresholds, or perform more sophisticated forms of data analysis, such as applying machine learning algorithms.
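As an illustration of performing actions based on thresholds, the following is a minimal sketch in plain Python. It is not part of the guidance: the `ThresholdMonitor` class, the `alerts` helper, the window size of 3, and the threshold of 80.0 are hypothetical choices made for the example.

```python
from collections import deque


class ThresholdMonitor:
    """Track the mean of the most recent readings and flag threshold breaches."""

    def __init__(self, window_size, threshold):
        # A bounded deque keeps only the newest `window_size` readings.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def observe(self, value):
        """Add a reading; return True when the windowed mean exceeds the threshold."""
        self.window.append(value)
        mean = sum(self.window) / len(self.window)
        return mean > self.threshold


def alerts(readings, window_size=3, threshold=80.0):
    """Return the indexes of the readings at which an alert would fire."""
    monitor = ThresholdMonitor(window_size, threshold)
    return [i for i, value in enumerate(readings) if monitor.observe(value)]


if __name__ == "__main__":
    # Hypothetical sensor readings arriving one at a time.
    print(alerts([70, 75, 90, 95, 100, 60, 50]))
```

Averaging over a short window rather than reacting to single values is one possible design choice; it avoids firing on an isolated spike at the cost of a slightly delayed alert.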

The Real-Time Analytics with Spark Streaming guidance automatically configures the AWS services needed to ingest, store, process, and analyze both real-time and batch data, drawing on business intelligence and big data architectures. The guidance deploys a highly available, secure, flexible, and cost-effective streaming data analytics architecture on the AWS Cloud that uses Apache Spark Streaming and Amazon Kinesis.

Overview

The diagram below presents the architecture you can build using the example code on GitHub.


Real-Time Analytics with Spark Streaming Guidance architecture

This guidance deploys an Amazon Virtual Private Cloud (Amazon VPC) network with one public and one private subnet. The public subnet contains a NAT gateway and a bastion host. The private subnet hosts the Amazon EMR cluster with Apache Zeppelin.

Amazon Kinesis Data Streams collects data from data sources and sends the data through the NAT gateway to the Amazon EMR cluster. After the Spark Streaming application processes the data, it stores the data in an Amazon S3 bucket.
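The ingest, process, and store flow described above can be sketched without any AWS dependency. The following is a pure-Python simulation of micro-batch processing, not the guidance's Spark Streaming application: the record shape, the `micro_batches`, `process_batch`, and `run_pipeline` helpers, the 2-second batch interval, the object key format, and the in-memory dictionary standing in for an Amazon S3 bucket are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical records as (timestamp_seconds, event_type). In the guidance,
# records would arrive through Amazon Kinesis Data Streams instead.
RECORDS = [
    (0.5, "click"), (1.2, "view"), (1.9, "click"),
    (2.1, "view"), (3.7, "view"), (4.0, "click"),
]

BATCH_INTERVAL = 2.0  # assumed batch interval in seconds


def micro_batches(records, interval):
    """Group records into consecutive fixed-length time buckets."""
    buckets = defaultdict(list)
    for ts, event in records:
        buckets[int(ts // interval)].append(event)
    # Return batches in time order as (batch_start_time, events).
    return [(idx * interval, buckets[idx]) for idx in sorted(buckets)]


def process_batch(events):
    """Count events by type, a stand-in for the per-batch computation."""
    return dict(Counter(events))


def run_pipeline(records, interval):
    """Process every micro-batch and store each result under a time-based key."""
    store = {}  # in-memory stand-in for an Amazon S3 bucket
    for start, events in micro_batches(records, interval):
        store[f"counts/batch-{start:.0f}.json"] = process_batch(events)
    return store


if __name__ == "__main__":
    for key, counts in run_pipeline(RECORDS, BATCH_INTERVAL).items():
        print(key, counts)
```

Writing one output object per batch keeps each result independent, so a failed batch can be reprocessed without touching the others. A real application would replace the in-memory pieces with a Kinesis receiver and an S3 writer.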



Version 1.2.0
Last updated: 12/2021
Author: AWS



Features

Real-Time Analytics with Spark Streaming reference implementation

The Real-Time Analytics with Spark Streaming guidance automatically provisions and configures the AWS services necessary to start processing real-time and batch data in minutes.

Apache Zeppelin support

The guidance leverages Apache Zeppelin, a web-based notebook for interactive data analytics, to enable customers to visualize both their real-time and batch data.

Spark Streaming application

The guidance is designed to run your own Spark Streaming application, written in Java or Scala.
