SQL-Based ETL with Apache Spark on Amazon EKS provides declarative data processing support, codeless extract-transform-load (ETL) capabilities, and workflow orchestration automation to help your business users (such as analysts and data scientists) access their data and create meaningful insights without the need for manual IT processes.
Use JupyterHub, a web-based interactive integrated development environment (IDE), to simplify your ETL application development.
Implement business logic and data quality checks in ETL pipeline development using Spark SQL.
Use Argo Workflows to schedule jobs and manage complex job dependencies without writing code.
Set up an AWS continuous integration and continuous delivery (CI/CD) pipeline to securely store the data framework Docker image in Amazon Elastic Container Registry (Amazon ECR).
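As a rough illustration of what a SQL-based data quality check can look like, the sketch below expresses a few rules (null values, negative amounts, duplicate keys) as a single declarative query. It uses Python's built-in `sqlite3` module only so that it runs without a Spark cluster; this is a stand-in, not the solution's framework. The `orders` table, its columns, and the sample rows are hypothetical, and the query uses only standard SQL that should also be valid in Spark SQL.

```python
import sqlite3

# Hypothetical sample data. In the solution this would be a table or view
# registered in a Spark session, not a local SQLite table.
rows = [
    (1, "alice", 120.0),
    (2, None, 35.5),       # missing customer
    (3, "carol", -10.0),   # negative amount
    (3, "carol", -10.0),   # duplicate order_id
]

# Data quality rules expressed declaratively as one SQL query.
QUALITY_CHECK_SQL = """
SELECT
    COUNT(*)                                          AS total_rows,
    SUM(CASE WHEN customer IS NULL THEN 1 ELSE 0 END) AS null_customers,
    SUM(CASE WHEN amount < 0 THEN 1 ELSE 0 END)       AS negative_amounts,
    COUNT(*) - COUNT(DISTINCT order_id)               AS duplicate_ids
FROM orders
"""


def run_quality_check(data):
    """Load the rows into an in-memory table and return the check results."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", data)
        cursor = conn.execute(QUALITY_CHECK_SQL)
        columns = [description[0] for description in cursor.description]
        return dict(zip(columns, cursor.fetchone()))
    finally:
        conn.close()


report = run_quality_check(rows)
print(report)
```

A pipeline step could then fail or route records to a quarantine location when any of the violation counts is non-zero. How that decision is wired up depends on the ETL framework in use.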
The diagram below presents the architecture you can build using the example code on GitHub.
SQL-Based ETL with Apache Spark on Amazon EKS architecture
SQL-Based ETL with Apache Spark on Amazon EKS deploys a secure, fault-tolerant, auto-scaling environment to support your ETL workloads. The environment contains the following components:
- A customizable and flexible workflow management layer (refer to the Orchestration on Amazon Elastic Kubernetes Service (Amazon EKS) group in the diagram) includes the Argo Workflows plug-in. This plug-in provides a web-based tool to orchestrate your ETL jobs without the need to write code. Optionally, you can use other workflow tools such as Volcano and Apache Airflow.
- A secure data processing workspace is configured to unify data workloads in the same Amazon EKS cluster. This workspace contains a second web-based tool, JupyterHub, for interactive job builds and testing. You can either develop Jupyter notebooks using a declarative approach to specify ETL tasks or programmatically write your ETL steps using PySpark. This workspace also provides Spark job automations that are managed by the Argo Workflows tool.
- A set of security functions is deployed in the solution. Amazon Elastic Container Registry (Amazon ECR) maintains and secures a data processing framework Docker image. The AWS Identity and Access Management (IAM) roles for service accounts (IRSA) feature on Amazon EKS provides token authorization with fine-grained access control to other AWS services. For example, Amazon EKS integration with Amazon Athena is passwordless, which mitigates the risk of exposing AWS credentials in a connection string. Jupyter fetches login credentials from AWS Secrets Manager into Amazon EKS on the fly. Amazon CloudWatch monitors applications on Amazon EKS using the activated CloudWatch Container Insights feature.
- The analytical workloads on the Amazon EKS cluster output data results to an Amazon Simple Storage Service (Amazon S3) data lake. A data schema entry (metadata) is created in an AWS Glue Data Catalog via Amazon Athena.
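To make the job-dependency idea in the orchestration layer more concrete, here is a minimal sketch of an Argo Workflows manifest with a three-step DAG, built as a Python dictionary and serialized to JSON (Kubernetes accepts manifests in JSON as well as YAML). The task names, the SQL file names, the container image, and the way the container receives its argument are all hypothetical; the solution's actual job definitions may look different.

```python
import json


def spark_task(name, sql_file, dependencies=()):
    """Build one DAG task that runs the shared (hypothetical) job template."""
    task = {
        "name": name,
        "template": "spark-sql-job",
        "arguments": {"parameters": [{"name": "sql_file", "value": sql_file}]},
    }
    if dependencies:
        # Argo starts this task only after the listed tasks have succeeded.
        task["dependencies"] = list(dependencies)
    return task


workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "sql-etl-"},
    "spec": {
        "entrypoint": "etl-dag",
        "templates": [
            {
                "name": "etl-dag",
                "dag": {
                    "tasks": [
                        spark_task("extract", "extract.sql"),
                        spark_task("validate", "validate.sql", ["extract"]),
                        spark_task("load", "load.sql", ["validate"]),
                    ]
                },
            },
            {
                "name": "spark-sql-job",
                "inputs": {"parameters": [{"name": "sql_file"}]},
                "container": {
                    # Hypothetical image name; the real image would be the
                    # data processing framework image stored in Amazon ECR.
                    "image": "example.com/etl-framework:latest",
                    "args": ["{{inputs.parameters.sql_file}}"],
                },
            },
        ],
    },
}

manifest_json = json.dumps(workflow, indent=2)
print(manifest_json)
```

Because the manifest uses `generateName`, it would typically be submitted with `kubectl create -f` or the Argo CLI rather than `kubectl apply`. The `dependencies` lists are what let the orchestrator order the jobs without any procedural scheduling code.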