Posted On: Nov 29, 2022

Amazon EMR announces Amazon Redshift integration with Apache Spark. This integration helps data engineers build and run Spark applications that read data from and write data to an Amazon Redshift cluster. Starting with Amazon EMR 6.9, this integration is available across all three EMR deployment models: EC2, EKS, and Serverless.

You can use this integration to build applications that write directly to Redshift tables as part of your ETL workflows, or that combine data in Redshift with data in other sources. Developers can load data from Redshift tables into Spark data frames or write data from Spark back to Redshift tables. Developers don’t have to download open source connectors to connect to Redshift.
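
As a rough illustration of loading a Redshift table into a Spark data frame and writing it back, here is a minimal PySpark sketch. It assumes the connector is exposed under the data source name used by the spark-redshift community connector and that it accepts the `url`, `dbtable`, `tempdir`, and `aws_iam_role` options. The helper names `redshift_options` and `copy_table`, and every value you pass to them, are hypothetical; check the EMR documentation for the exact options your release supports.

```python
from typing import Dict

# Assumption: the data source name of the spark-redshift community connector.
REDSHIFT_FORMAT = "io.github.spark_redshift_community.spark.redshift"


def redshift_options(jdbc_url: str, table: str, temp_dir: str,
                     iam_role_arn: str) -> Dict[str, str]:
    """Collect the connector options for one Redshift table (hypothetical helper)."""
    return {
        "url": jdbc_url,               # JDBC URL of the Redshift cluster
        "dbtable": table,              # table to read from or write to
        "tempdir": temp_dir,           # S3 location used for staging data
        "aws_iam_role": iam_role_arn,  # IAM role Redshift uses to access S3
    }


def copy_table(spark, options: Dict[str, str], target_table: str):
    """Read a Redshift table into a data frame, then append it to another table."""
    # Load the source table named in options["dbtable"].
    df = spark.read.format(REDSHIFT_FORMAT).options(**options).load()

    # Reuse the same connection options, pointing dbtable at the target table.
    write_options = {**options, "dbtable": target_table}
    (df.write.format(REDSHIFT_FORMAT)
        .options(**write_options)
        .mode("append")
        .save())
    return df
```

Inside a Spark application on EMR, you would pass the active `SparkSession` as `spark`, along with your own cluster URL, S3 staging path, and IAM role ARN. Nothing in the sketch contacts Redshift until `copy_table` is called.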

Amazon Redshift integration for Apache Spark enables applications on Amazon EMR that access Redshift data to run up to 10x faster than with existing Redshift-Spark connectors. It supports pushing down relational operations such as joins, aggregations, sorts, and scalar functions from Spark to Redshift to improve your query performance. It supports IAM-based roles to enable single sign-on capabilities and integrates with AWS Secrets Manager for securely managing keys.

Amazon Redshift integration for Apache Spark is available in all Regions where Amazon EMR, Amazon EMR on EKS, and Amazon EMR Serverless are available. To get started, refer to our documentation for Amazon EMR, Amazon EMR on EKS, and Amazon EMR Serverless.