Unlock efficient data workflows and faster insights with a scalable, enterprise-grade extract, transform, and load (ETL) solution
This Guidance helps address the gap between data consumption requirements and the low-level data processing activities performed by common ETL practices. For organizations operating on SQL-based data management systems, adapting to modern data engineering practices can slow the progress of harnessing powerful insights from their data. This Guidance provides a quality-aware design that increases data processing productivity through Arc, an open-source data framework, for a user-centered ETL approach. The Guidance accelerates interaction with ETL practices, fostering simplicity and raising the level of abstraction to unify ETL activities across both batch and streaming data.
We also offer design options that use efficient compute instances (such as AWS Graviton processors) so you can optimize the performance and cost of running ETL jobs at scale on Amazon EKS.
Please note: [Disclaimer]
Architecture Diagram
[Architecture diagram description]
Step 1
Interact with ETL development and orchestration tools through Amazon CloudFront endpoints with Application Load Balancer origins, which provide secure connections between clients and ETL tools’ endpoints.
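For illustration, the sketch below shows how a CloudFront distribution with an Application Load Balancer origin might be created with boto3. The ALB DNS name is a placeholder, and the use of the AWS managed CachingDisabled cache policy is an assumption; your deployment may configure certificates, aliases, and cache behaviors differently.

```python
import time

import boto3

cloudfront = boto3.client("cloudfront")

# Hypothetical ALB DNS name fronting the JupyterHub and Argo Workflows endpoints.
ALB_DNS_NAME = "internal-etl-tools-alb-123456.us-east-1.elb.amazonaws.com"

response = cloudfront.create_distribution(
    DistributionConfig={
        "CallerReference": f"etl-tools-{int(time.time())}",
        "Comment": "Secure entry point for ETL development and orchestration tools",
        "Enabled": True,
        "Origins": {
            "Quantity": 1,
            "Items": [
                {
                    "Id": "etl-alb-origin",
                    "DomainName": ALB_DNS_NAME,
                    "CustomOriginConfig": {
                        "HTTPPort": 80,
                        "HTTPSPort": 443,
                        "OriginProtocolPolicy": "https-only",
                    },
                }
            ],
        },
        "DefaultCacheBehavior": {
            "TargetOriginId": "etl-alb-origin",
            "ViewerProtocolPolicy": "redirect-to-https",
            # AWS managed "CachingDisabled" cache policy, suited to dynamic web tools.
            "CachePolicyId": "4135ea2d-6df8-44a3-9df3-4b5a84be39ad",
        },
    }
)
print(response["Distribution"]["DomainName"])
```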
Step 2
Develop, test, and schedule ETL jobs that process batch and stream data. The data traffic between ETL processes and data stores flows through Amazon Virtual Private Cloud (Amazon VPC) endpoints powered by AWS PrivateLink without leaving the AWS network.
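As a sketch of how this private connectivity could be provisioned, the following boto3 example creates interface endpoints for a few commonly used services plus a gateway endpoint for Amazon S3. The VPC, subnet, security group, and route table IDs and the Region are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical IDs; substitute the VPC, subnets, and security group used by the EKS cluster.
VPC_ID = "vpc-0123456789abcdef0"
SUBNET_IDS = ["subnet-0aaa1111", "subnet-0bbb2222"]
SECURITY_GROUP_ID = "sg-0ccc3333"

# Interface endpoints keep traffic to AWS services (here, Amazon ECR, CloudWatch Logs,
# and AWS STS) on the AWS network through AWS PrivateLink.
for service in ("ecr.api", "ecr.dkr", "logs", "sts"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=VPC_ID,
        ServiceName=f"com.amazonaws.us-east-1.{service}",
        SubnetIds=SUBNET_IDS,
        SecurityGroupIds=[SECURITY_GROUP_ID],
        PrivateDnsEnabled=True,
    )

# Amazon S3 is typically reached through a gateway endpoint attached to the route tables.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId=VPC_ID,
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0ddd4444"],
)
```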
Step 3
JupyterHub Integrated Development Environment (IDE), Argo Workflows, and the Apache Spark Operator run as containers on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. The JupyterHub IDE can integrate with a source code repository (such as GitHub) to track changes that users make to ETL assets. These assets include Jupyter notebook files and SQL scripts to be run with the Arc ETL framework.
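As a quick sanity check that these three tools are running on the cluster, you could list their pods with the Kubernetes Python client. The namespace names below are assumptions; adjust them to match your deployment.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (for example, one generated by
# `aws eks update-kubeconfig`).
config.load_kube_config()
core = client.CoreV1Api()

# Hypothetical namespaces for the JupyterHub IDE, Argo Workflows, and the Spark Operator.
for namespace in ("jupyterhub", "argo", "spark-operator"):
    pods = core.list_namespaced_pod(namespace=namespace)
    for pod in pods.items:
        print(f"{namespace}/{pod.metadata.name}: {pod.status.phase}")
```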
Step 4
Update ETL assets in the source code repository, then upload them to an Amazon Simple Storage Service (Amazon S3) bucket. The synchronization can be implemented by an automated continuous integration and continuous deployment (CI/CD) pipeline initiated by updates in the source code repository, or it can be performed manually.
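A minimal sketch of the manual upload path, assuming a hypothetical bucket name and a local repository folder containing notebook and SQL files; a CI/CD pipeline would typically perform an equivalent sync step on each commit.

```python
from pathlib import Path

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name and local repository layout.
ASSET_BUCKET = "my-etl-asset-bucket"
REPO_ROOT = Path("etl-assets")

# Upload Jupyter notebooks and SQL scripts, preserving the repository's folder structure.
for path in REPO_ROOT.rglob("*"):
    if path.suffix in (".ipynb", ".sql"):
        key = path.relative_to(REPO_ROOT).as_posix()
        s3.upload_file(str(path), ASSET_BUCKET, key)
        print(f"uploaded s3://{ASSET_BUCKET}/{key}")
```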
Step 5
Optionally, change the Docker build source code uploaded from the code repository to the S3 ETL asset bucket. This activates a CI/CD pipeline built with AWS CodeBuild and AWS CodePipeline that automatically rebuilds the Arc ETL Framework container image and pushes it to an Amazon Elastic Container Registry (Amazon ECR) private registry.
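If you need to start the rebuild yourself, a sketch like the following could start the pipeline (or just the image build project) with boto3. The pipeline and project names are hypothetical.

```python
import boto3

codepipeline = boto3.client("codepipeline")
codebuild = boto3.client("codebuild")

# Hypothetical resource names; the pipeline normally starts automatically when the
# Docker build source in the S3 ETL asset bucket changes.
execution = codepipeline.start_pipeline_execution(name="arc-image-build-pipeline")
print("pipeline execution:", execution["pipelineExecutionId"])

# Alternatively, run only the container image build project directly.
build = codebuild.start_build(projectName="arc-image-build")
print("build id:", build["build"]["id"])
```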
Step 6
Schedule ETL jobs through Argo Workflows to run on an Amazon EKS cluster. These jobs automatically pull the Arc container image from Amazon ECR, download ETL assets from the artifact S3 bucket, and send application logs to Amazon CloudWatch. VPC endpoints secure access to all AWS services.
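The following sketch shows one way an ETL job could be submitted as an Argo Workflow through the Kubernetes API using the Python client. The namespace, ECR image URI, bucket name, and container command are assumptions; a production job would typically be templated and scheduled rather than submitted ad hoc.

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# A minimal Workflow manifest expressed as a Python dict. Image, bucket, and command
# are placeholders; a real job would run the Arc driver against the downloaded assets.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "arc-etl-job-"},
    "spec": {
        "entrypoint": "run-etl",
        "templates": [
            {
                "name": "run-etl",
                "container": {
                    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/arc:latest",
                    "command": ["sh", "-c"],
                    "args": [
                        "aws s3 cp s3://my-etl-asset-bucket/jobs/nightly.ipynb . "
                        "&& echo 'run the Arc job here'"
                    ],
                },
            }
        ],
    },
}

# Submit the Workflow custom resource to the Argo Workflows controller.
api.create_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="argo",
    plural="workflows",
    body=workflow,
)
```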
Step 7
As an authenticated user, you can interactively develop and test notebooks as ETL jobs in the JupyterHub IDE, which automatically retrieves login credentials from AWS Secrets Manager to validate user sign-in requests.
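A minimal sketch of the credential lookup, assuming a hypothetical secret name; JupyterHub performs an equivalent call when validating sign-ins.

```python
import json

import boto3

secrets = boto3.client("secretsmanager")

# Hypothetical secret name holding the JupyterHub login credentials as a JSON document.
secret = secrets.get_secret_value(SecretId="etl/jupyterhub/login")
credentials = json.loads(secret["SecretString"])
print("username:", credentials["username"])
```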
Step 8
Access the ETL output data stored in an S3 bucket in a transactional data lake (Delta Lake) format. You can query the Delta Lake tables through Amazon Athena, which integrates with the AWS Glue Data Catalog.
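For example, the ETL output could be queried from Python through the Athena API as sketched below. The database, table, and query results location are assumptions; the table is expected to be registered in the Data Catalog.

```python
import time

import boto3

athena = boto3.client("athena")

# Hypothetical database, table, and results location.
query = athena.start_query_execution(
    QueryString="SELECT * FROM sales_delta LIMIT 10",
    QueryExecutionContext={"Database": "etl_output"},
    ResultConfiguration={"OutputLocation": "s3://my-etl-output-bucket/athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```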
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
Within the Amazon EKS cluster, Amazon Elastic Compute Cloud (Amazon EC2) instances (x86_64 and Graviton Arm64) act as compute nodes that run the Guidance workloads. Spark jobs run on elastically provisioned Amazon EC2 Spot Instances based on workload demand.
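As an illustration of this compute layer, the following boto3 sketch creates a managed node group that uses Graviton (Arm64) Spot Instances and can scale with Spark demand. The cluster name, subnets, node role, and instance type are placeholders.

```python
import boto3

eks = boto3.client("eks")

# Hypothetical cluster, subnet, and IAM role identifiers. The Arm-based instance type
# and AMI provide the Graviton price/performance benefits, capacityType=SPOT requests
# Spot capacity, and scalingConfig lets the group grow and shrink with workload demand.
eks.create_nodegroup(
    clusterName="arc-etl-cluster",
    nodegroupName="spark-graviton-nodes",
    scalingConfig={"minSize": 1, "maxSize": 20, "desiredSize": 2},
    subnets=["subnet-0aaa1111", "subnet-0bbb2222"],
    instanceTypes=["m7g.2xlarge"],
    amiType="AL2_ARM_64",
    capacityType="SPOT",
    nodeRole="arn:aws:iam::123456789012:role/eksNodeRole",
)
```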
CodeBuild and CodePipeline automate the GitOps process, building container images from Git code updates and pushing them to the Amazon ECR private registry. Argo Workflows schedules ETL jobs on Amazon EKS, automatically pulling the Arc Docker image from Amazon ECR, downloading ETL assets from the artifact S3 bucket, and sending application logs to CloudWatch.
This automated deployment and execution of ETL jobs minimizes operational overhead and improves productivity. Further, the CI/CD pipeline using CodeBuild and CodePipeline helps ensure continuous improvement and development while securely storing the Guidance's Arc Docker image in Amazon ECR.
Security
The Amazon EKS cluster resources are deployed within an Amazon VPC, which provides logical network isolation from the public internet. Amazon VPC supports security features such as VPC endpoints (which keep traffic within the AWS network), security groups, network access control lists (ACLs), and AWS Identity and Access Management (IAM) roles and policies for controlling inbound and outbound traffic and authorization. The Amazon ECR image registry offers container-level security features such as vulnerability scanning. Amazon ECR and Amazon EKS follow Open Container Initiative (OCI) registry and Kubernetes API standards, incorporating strict security protocols.
IAM provides access control for Amazon S3 application data, while AWS Key Management Service (AWS KMS) encrypts data at rest in Amazon S3. IAM Roles for Service Accounts (IRSA) on the Amazon EKS cluster enables fine-grained access control for pods, enforcing role-based access control and limiting unauthorized access to Amazon S3 data. Secrets Manager securely stores and manages credentials. CloudFront provides SSL/TLS-encrypted secure entry points for the Jupyter and Argo Workflows web tools.
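A minimal sketch of creating such an IRSA role with boto3, assuming a hypothetical OIDC provider ID, namespace, and service account name; the permissions policies that scope access to specific S3 buckets would be attached separately.

```python
import json

import boto3

iam = boto3.client("iam")

# Assumed values: the EKS cluster's OIDC provider and the service account used by
# the Arc/Spark job pods. IRSA scopes the role to exactly those pods.
ACCOUNT_ID = "123456789012"
OIDC_PROVIDER = "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E"
SERVICE_ACCOUNT = "system:serviceaccount:spark:arc-etl-jobs"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Federated": f"arn:aws:iam::{ACCOUNT_ID}:oidc-provider/{OIDC_PROVIDER}"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {f"{OIDC_PROVIDER}:sub": SERVICE_ACCOUNT}},
        }
    ],
}

iam.create_role(
    RoleName="arc-etl-irsa-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="IRSA role granting Arc ETL pods scoped access to the data buckets",
)
```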
Reliability
Amazon EKS enables highly available topologies by deploying the Kubernetes control plane and compute nodes across multiple Availability Zones (AZs). This helps ensure continuous availability for data applications, even if an AZ experiences an interruption, resulting in a reliable multi-AZ EC2 instance deployment on Amazon EKS.
For data storage, Amazon S3 provides high durability and availability, automatically replicating data objects across multiple AZs within a Region. Additionally, Amazon ECR hosts Docker images in a highly available and scalable architecture, reliably supporting container-based application deployments and updates.
Amazon S3, Amazon EKS, and Amazon ECR are fully managed services designed for high service level agreements (SLAs) with reduced operational costs. They enable deployment of business-critical applications to meet high availability requirements.
Performance Efficiency
The Amazon EKS cluster's Amazon EC2 compute nodes can dynamically scale up and down based on application workload. Graviton-based EC2 instances provide increased performance efficiency through custom-designed Arm-based processors, optimized hardware, and architectural enhancements.
A decoupled compute-storage pattern (with input and output data stored in Amazon S3) enhances dynamic compute scaling efficiency. The Data Catalog streamlines metadata management and integrates seamlessly with Athena to enhance query performance. It automates crawling and maintains technical metadata for efficient data processing and querying. Athena offers fast querying of Amazon S3 data in place, without moving it, further enhancing the efficiency of the analytics workflow.
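As a sketch of that automation, a Glue crawler could be created and started with boto3 as follows. The crawler name, IAM role, database, and S3 path are assumptions, and Delta Lake tables may instead be registered directly or crawled with Delta-specific crawler targets.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names; the crawler keeps the Data Catalog's technical metadata in sync
# with the ETL output data in Amazon S3 so that Athena can query it directly.
glue.create_crawler(
    Name="etl-output-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="etl_output",
    Targets={"S3Targets": [{"Path": "s3://my-etl-output-bucket/curated/"}]},
)
glue.start_crawler(Name="etl-output-crawler")
```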
Cost Optimization
Amazon ECR is a managed service for securing and supporting container applications with a fixed monthly fee for storing and serving container images. Amazon EKS cluster compute nodes can scale up and down based on Spark workloads, offering cost-efficient Graviton and Spot Instance types. The Data Catalog provides a serverless, fully managed metadata repository, eliminating the need to set up and maintain a long-running metadata database and reducing operational overhead and costs. CodeBuild and CodePipeline automate the build and deployment of the Arc ETL Framework's Docker image in a serverless environment, eliminating the need to provision and manage build servers and reducing infrastructure maintenance costs.
Sustainability
This Guidance runs an Amazon EKS cluster with efficient compute types based on Graviton processors. Amazon ECR eliminates the need for custom hardware or physical server management. The Data Catalog and Athena are serverless services, further reducing energy use and environmental impact.
Optimizing the Amazon EKS compute layer for large-scale Apache Spark workloads minimizes the environmental impact of analytics workloads. You have the flexibility to choose Arm-based processors based on performance needs and your sustainability priorities.
Implementation Resources
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content
[Title]
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.