Posted On: Nov 29, 2022

Amazon Redshift integration for Apache Spark helps developers seamlessly build and run Apache Spark applications on Amazon Redshift data. If you are using AWS analytics and machine learning (ML) services, such as Amazon EMR, AWS Glue, and Amazon SageMaker, you can now build Apache Spark applications that read from and write to your Amazon Redshift data warehouse without compromising the performance of your applications or the transactional consistency of your data. Amazon Redshift integration for Apache Spark builds on an existing open source connector project and enhances it for performance and security, helping customers gain up to 10x faster application performance. We thank the original contributors on the project who collaborated with us to make this happen. As we make further enhancements, we will continue to contribute back to the open source project.

Amazon Redshift integration for Apache Spark minimizes the cumbersome and often manual process of setting up a spark-redshift open source connector and reduces the time needed to prepare for analytics and ML tasks. You only need to specify the connection to your data warehouse, and you can start working with Amazon Redshift data from your Apache Spark-based applications in seconds. You can use several pushdown capabilities for operations such as sort, aggregate, limit, join, and scalar functions, so that only the relevant data is moved from your Amazon Redshift data warehouse to the consuming Spark application, which improves the performance of your applications. You can also help make your applications more secure by using AWS Identity and Access Management (IAM) credentials to connect to Amazon Redshift.
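As a rough illustration of what specifying the connection can look like, here is a minimal PySpark sketch. It assumes the data source name and option names (`url`, `tempdir`, `aws_iam_role`, `dbtable`) of the open source spark-redshift community connector. The table `public.sales`, its `region` and `amount` columns, and the helper names are hypothetical. Whether the aggregate, sort, and limit are actually pushed down to Amazon Redshift depends on the connector version and its settings.

```python
from typing import Dict

# Data source name of the open source spark-redshift community connector
# (assumed here; confirm it against the release you are running).
REDSHIFT_FORMAT = "io.github.spark_redshift_community.spark.redshift"


def redshift_options(jdbc_url: str, temp_dir: str, iam_role_arn: str) -> Dict[str, str]:
    """Collect the connection settings shared by reads and writes (hypothetical helper)."""
    return {
        "url": jdbc_url,               # JDBC URL of the Redshift data warehouse
        "tempdir": temp_dir,           # S3 location used to stage data
        "aws_iam_role": iam_role_arn,  # IAM role Redshift assumes to access S3
    }


def read_top_regions(spark, options: Dict[str, str]):
    """Read a hypothetical sales table, then aggregate, sort, and limit it."""
    # Imported lazily so the options helper works without PySpark installed.
    from pyspark.sql import functions as F

    sales = (
        spark.read.format(REDSHIFT_FORMAT)
        .options(**options)
        .option("dbtable", "public.sales")  # hypothetical table name
        .load()
    )
    # Aggregate, sort, and limit are among the operations the connector
    # can translate into SQL that runs inside Redshift.
    return (
        sales.groupBy("region")
        .agg(F.sum("amount").alias("total_amount"))
        .orderBy(F.col("total_amount").desc())
        .limit(10)
    )
```

In a Spark job or notebook, you would build the options once with your own JDBC URL, S3 staging path, and IAM role ARN, then pass them together with the active `SparkSession` to `read_top_regions`.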

To get started, go to Amazon EMR 6.9, EMR Serverless, or AWS Glue 4.0. Use DataFrame or Spark SQL code in an Apache Spark job or notebook to connect to your Amazon Redshift data warehouse, and start running queries in minutes. To learn more, see Amazon Redshift or Amazon Redshift Integration for Apache Spark.