AWS Big Data Blog
Get started with the Amazon Redshift Data API
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that enables you to analyze your data at scale. Tens of thousands of customers use Amazon Redshift to process exabytes of data to power their analytical workloads.
|Since you’re reading this post, you may also be interested in the following AWS Online Tech Talks video for more info:|
With the Data API, you can programmatically access data in your Amazon Redshift cluster from different AWS services such as AWS Lambda, Amazon SageMaker notebooks, AWS Cloud9, and also your on-premises applications using the AWS SDK. This allows you to build cloud-native, containerized, serverless, web-based, and event-driven applications on the AWS Cloud.
The Data API also enables you to run analytical queries on Amazon Redshift’s native tables, external tables in your data lake via Amazon Redshift Spectrum, and also across Amazon Redshift clusters, which is known as data sharing. You can also perform federated queries with external data sources such as Amazon Aurora.
In an earlier, post, we shared in great detail on how you can use the Data API to interact with your Amazon Redshift data warehouse. In this post, we learn how to get started with the Data API in different languages and also discuss various use cases in which customers are using this to build modern applications combining modular, serverless, and event-driven architectures. The Data API was launched in September 2020, and thousands of our customers are already using it for a variety of use cases:
- Extract, transform, and load (ETL) orchestration with AWS Step Functions
- Access Amazon Redshift from applications
- Access Amazon Redshift from SageMaker Jupyter notebooks
- Access Amazon Redshift with REST endpoints
- Event-driven extract, load, transformation
- Event-driven web application design
- Serverless data processing workflows
Key features of the Data API
In this section, we discuss the key features of the Data API.
Use different programming language of your choice
Run individual or batch SQL statements
With the Data API, you can run individual queries from your application or submit a batch of SQL statements within a transaction, which is useful to simplify your workload.
Run SQL statements with parameters
With the Data API, you can run parameterized SQL queries, which brings the ability to write reusable code when developing ETL code by passing parameters into a SQL template instead of concatenating parameters into each query on their own. This also makes it easier to migrate code from existing applications that needs parameterization. In addition, parameterization also makes code secure by eliminating malicious SQL injection.
No drivers needed
With the Data API, you can interact with Amazon Redshift without having to configure JDBC or ODBC drivers. The Data API eliminates the need for configuring drivers and managing database connections. You can run SQL commands to your Amazon Redshift cluster by calling a Data API secured API endpoint.
No network or security group setup
The Data API doesn’t need a persistent connection with Amazon Redshift. Instead, it provides a secure HTTP endpoint, which you can use to run SQL statements. Therefore, you don’t need to set up and manage a VPC, security groups, and related infrastructure to access Amazon Redshift with the Data API.
No password management
The Data API provides two options to provide credentials:
- IAM temporary credentials – With this approach, you only need to provide the username and AWS Identity and Access Management (IAM) GetClusterCredentials permission to access Amazon Redshift with no password. The Data API automatically handles retrieving your credentials using IAM temporary credentials.
- AWS Secrets Manager secret – With this approach, you store your username and password credentials in an AWS Secrets Manager secret and allow the Data API to access those credentials from that secret.
You can also use the Data API when working with federated logins through IAM credentials. You don’t have to pass database credentials via API calls when using identity providers such as Okta, Azure Active Directory, or database credentials stored in Secrets Manager. If you’re using Lambda, the Data API provides a secure way to access your database without the additional overhead of launching Lambda functions in Amazon Virtual Private Cloud (Amazon VPC).
The Data API is asynchronous. You can run long-running queries without having to wait for it to complete, which is key in developing a serverless, microservices-based architecture. Each query results in a query ID, and you can use this ID to check the status and response of the query. In addition, query results are stored for 24 hours.
You can monitor Data API events in Amazon EventBridge, which delivers a stream of real-time data from your source application to targets such as Lambda. This option is available when you’re running your SQL statements in the Data API using the
WithEvent parameter set to
true. When a query is complete, the Data API can automatically send event notifications to EventBridge, which you may use to take further actions. This helps you design event-driven applications with Amazon Redshift. For more information, see Monitoring events for the Amazon Redshift Data API in Amazon EventBridge.
Get started with the Data API
The following code is an example using the AWS CLI:
The following example code uses Python:
Use cases with the Data API
You can use the Data API to modernize and simplify your application architectures by creating modular, serverless, event-driven applications with Amazon Redshift. In this section, we discuss some common use cases.
ETL orchestration with Step Functions
When performing ETL workflows, you have to complete a number of steps. As your business scales, the steps and dependencies often become complex and difficult to manage. With the Data API and Step Functions, you can easily orchestrate complex ETL workflows. You can explore the following example use case and AWS CloudFormation template demonstrating ETL orchestration using the Data API and Step Functions.
Access Amazon Redshift from custom applications
If you’re designing your custom application in any programming language that is supported by the AWS SDK, the Data API simplifies data access from your applications, which may be an application hosted on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Elastic Container Service (Amazon ECS) and other compute services or a serverless application built with Lambda. You can explore an example use case and CloudFormation template showcasing how to easily work with the Data API from Amazon EC2 based applications.
Access Amazon Redshift from SageMaker Jupyter notebooks
SageMaker notebooks are very popular among the data science community to analyze and solve machine learning problems. The Data API makes it easy to access and visualize data from your Amazon Redshift data warehouse without troubleshooting issues on password management or VPC or network issues. You can learn more about this use case along with a CloudFormation template showcasing how to use the Data API to interact from a SageMaker Jupyter notebook.
Access Amazon Redshift with REST endpoints
With the AWS SDK, you can use the Data APIs to directly invoke them as REST API calls such as GET or POST methods. For more information, see REST for Redshift Data API.
Event–driven applications are popular with many customers, where applications run in response to events. A primary benefit of this architecture is the decoupling of producer and consumer processes, which allows greater flexibility in application design and building decoupled processes. For more information, see Building an event-driven application with AWS Lambda and the Amazon Redshift Data API. The following CloudFormation template demonstrates the same.
Event-driven web application design
Similar to event-driven ELT applications, event-driven web applications are also becoming popular, especially if you want to avoid long-running database queries, which create bottlenecks for the application servers. For example, you may be running a web application that has a long-running database query taking a minute to complete. Instead of designing that web application with long-running API calls, you can use the Data API and Amazon API Gateway WebSockets, which creates a lightweight websocket connection with the browser and submits the query to Amazon Redshift using the Data API. When the query is finished, the Data API sends a notification to EventBridge about its completion. When the data is available in the Data API, it’s pushed back to this browser session and the end-user can view the dataset. You can explore an example use case along with a CloudFormation template showcasing how to build an event-driven web application using the Data API and API Gateway WebSockets.
Serverless data processing workflows
With the Data API, you can design a serverless data processing workflow, where you can design an end-to-end data processing pipeline orchestrated using serverless AWS components such as Lambda, EventBridge, and the Data API client. Typically, a data pipeline involves multiple steps, for example:
- Load raw sales and customer data to a data warehouse.
- Integrate and transform the raw data.
- Build summary tables or unload this data to a data lake so subsequent steps can consume this data.
The example use case Serverless Data Processing Workflow using Amazon Redshift Data Api demonstrates how to chain multiple Lambda functions in a decoupled fashion and build an end-to-end data pipeline. In that code sample, a Lambda function is run through a scheduled event that loads raw data from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift. On its completion, the Data API generates an event that triggers an event rule in EventBridge to invoke another Lambda function that prepares and transforms raw data. When that process is complete, it generates another event triggering a third EventBridge rule to invoke another Lambda function and unloads the data to Amazon S3. The Data API enables you to chain this multi-step data pipeline in a decoupled fashion.
The Data API offers many additional benefits when integrating Amazon Redshift into your analytical workload. The Data API simplifies and modernizes current analytical workflows and custom applications. You can perform long-running queries without having to pause your application for the queries to complete. This enables you to build event-driven applications as well as fully serverless ETL pipelines.
The Data API functionalities are available in many different programming languages to suit your environment. As mentioned earlier, there are a wide variety of use cases and possibilities where you can use the Data API to improve your analytical workflow. To learn more, see Using the Amazon Redshift Data API.
About the Authors
David Zhang is an AWS Solutions Architect who helps customers design robust, scalable, and data-driven solutions across multiple industries. With a background in software engineering, David is an active leader and contributor to AWS open-source initiatives. He is passionate about solving real-world business problems and continuously strives to work from the customer’s perspective.
Bipin Pandey is a Data Architect at AWS. He loves to build data lake and analytics platform for his customers. He is passionate about automating and simplifying customer problems with the use of cloud solutions.
Manash Deb is a Senior Analytics Specialist Solutions Architect at AWS. He has worked on building end-to-end data-driven solutions in different database and data warehousing technologies for over 15 years. He loves to learn new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS.
Bhanu Pittampally is Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data warehouse – architecture, development and administration. He is in data and analytical field for over 13 years. His Linkedin profile is here.
Chao Duan is a software development manager at Amazon Redshift, where he leads the development team focusing on enabling self-maintenance and self-tuning with comprehensive monitoring for Redshift. Chao is passionate about building high-availability, high-performance, and cost-effective database to empower customers with data-driven decision making.
Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.