What does this AWS Solutions Implementation do?

This solution provides declarative data processing support, codeless extract-transform-load (ETL) capabilities, and workflow orchestration automation to help your business users (such as analysts and data scientists) access their data and create meaningful insights without the need for manual IT processes.

Benefits

Build, test, and debug ETL jobs in Jupyter

Use JupyterHub, a web-based interactive integrated development environment (IDE), to simplify your ETL application development.

Use a SQL-first approach

Implement business logic and data quality checks in ETL pipeline development using Spark SQL.
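As a rough illustration of the SQL-first idea, the sketch below expresses a data quality check and a piece of business logic as plain SQL statements. It uses Python's built-in sqlite3 module as a stand-in engine so that it runs anywhere; it is not the solution's own framework. The table name orders, its columns, and the rule that amounts must be non-negative are hypothetical. In a Spark job, similar statements could be passed to spark.sql(...).

```python
import sqlite3

# Data quality check: count rows that break the (hypothetical) rules.
QUALITY_CHECK_SQL = """
SELECT COUNT(*) FROM orders
WHERE amount IS NULL OR amount < 0
"""

# Business logic: total amount per customer, valid rows only.
TRANSFORM_SQL = """
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
WHERE amount IS NOT NULL AND amount >= 0
GROUP BY customer_id
ORDER BY customer_id
"""


def run_etl(rows):
    """Load rows into an in-memory table, then run the check and the transform."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer_id TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        bad_rows = conn.execute(QUALITY_CHECK_SQL).fetchone()[0]
        totals = conn.execute(TRANSFORM_SQL).fetchall()
    finally:
        conn.close()
    return bad_rows, totals


# Hypothetical sample data: one negative amount and one missing amount.
sample_rows = [("c1", 10.0), ("c1", 5.5), ("c2", -3.0), ("c3", None), ("c2", 7.0)]
bad_rows, totals = run_etl(sample_rows)
```

Keeping both the check and the transform in SQL means an analyst can review or change the rules without touching the surrounding Python.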

Orchestrate jobs without code

Use Argo workflows to schedule jobs and manage complex job dependencies without the need to code.
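To make the idea of job dependencies concrete, the sketch below resolves a small dependency graph into a valid run order using Python's standard graphlib module. This is only an illustration of the underlying concept, not Argo's interface; the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs; each job maps to the set of jobs it depends on.
job_dependencies = {
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"quality_check"},
}


def execution_order(dependencies):
    """Return job names in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(job_dependencies)
```

A workflow tool performs this kind of ordering for you, which is why the dependencies only need to be declared rather than coded.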

Auto-deploy Docker images

Set up an AWS continuous integration and continuous delivery (CI/CD) pipeline to securely store the data framework Docker image in Amazon Elastic Container Registry (Amazon ECR).

AWS Solutions Implementation overview

The diagram below presents the architecture you can automatically deploy using the solution's implementation guide and accompanying AWS CloudFormation template.


SQL-Based ETL with Apache Spark on Amazon EKS Solutions Implementation architecture

The AWS CloudFormation template deploys a secure, fault-tolerant, auto-scaling environment to support your ETL workloads. The environment contains the following components:

  1. A customizable and flexible workflow management layer (refer to the Orchestration on Amazon Elastic Kubernetes Service (Amazon EKS) group in the diagram) includes the Argo Workflows plug-in. This plug-in provides a web-based tool to orchestrate your ETL jobs without the need to write code. Optionally, you can use other workflow tools such as Volcano and Apache Airflow.
  2. A secure data processing workspace is configured to unify data workloads in the same Amazon EKS cluster. This workspace contains a second web-based tool, JupyterHub, for interactive job builds and testing. You can either develop Jupyter notebooks using a declarative approach to specify ETL tasks or programmatically write your ETL steps using PySpark. This workspace also provides Spark job automations that are managed by the Argo Workflows tool.
  3. A set of security functions is deployed in the solution. Amazon Elastic Container Registry (Amazon ECR) maintains and secures a data processing framework Docker image. The AWS Identity and Access Management (IAM) roles for service accounts (IRSA) feature on Amazon EKS provides token authorization with fine-grained access control to other AWS services. For example, Amazon EKS integration with Amazon Athena is password-less to mitigate the risk of exposing AWS credentials in a connection string. Jupyter fetches login credentials from AWS Secrets Manager into Amazon EKS on the fly. Amazon CloudWatch monitors applications on Amazon EKS using the activated CloudWatch Container Insights feature.
  4. The analytical workloads on the Amazon EKS cluster output data results to an Amazon Simple Storage Service (Amazon S3) data lake. A data schema entry (metadata) is created in an AWS Glue Data Catalog via Amazon Athena.
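The password-less access described in item 3 can be sketched as follows. With IRSA, Amazon EKS injects a role ARN and a web identity token file path into the pod's environment, and AWS SDKs read them through their default credential chain, so no access keys need to appear in code or connection strings. The helper name uses_irsa and the sample values are hypothetical; this is a minimal check, not part of the solution's code.

```python
import os

# Environment variables that Amazon EKS sets in pods that use IRSA.
IRSA_ENV_VARS = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")


def uses_irsa(env=None):
    """Return True when both IRSA variables are present and non-empty."""
    env = os.environ if env is None else env
    return all(env.get(name) for name in IRSA_ENV_VARS)


# Hypothetical environment, shaped like what a pod with IRSA would see.
sample_env = {
    "AWS_ROLE_ARN": "arn:aws:iam::123456789012:role/example-etl-role",
    "AWS_WEB_IDENTITY_TOKEN_FILE": "/var/run/secrets/eks.amazonaws.com/serviceaccount/token",
}
irsa_enabled = uses_irsa(sample_env)
```

A check like this can be useful at job start-up to fail fast when a pod was launched without the expected service account.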

SQL-Based ETL with Apache Spark on Amazon EKS

Version 1.0.0
Released: 07/2021
Author: AWS

Estimated deployment time: 30 min
