Using the Amazon Redshift Data API to interact with Amazon Redshift clusters
This post was updated on July 28, 2021, to include multi-statement and parameterization support.
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.
As a data engineer or application developer, for some use cases, you want to interact with Amazon Redshift to load or query data with a simple API endpoint without having to manage persistent connections. With Amazon Redshift Data API, you can interact with Amazon Redshift without having to configure JDBC or ODBC. This makes it easier and more secure to work with Amazon Redshift and opens up new use cases.
This post explains how to use the Amazon Redshift Data API from the AWS Command Line Interface (AWS CLI) and Python. We also explain how to use AWS Secrets Manager to store and retrieve credentials for the Data API.
Introducing the Data API
The Amazon Redshift Data API enables you to painlessly access data from Amazon Redshift with all types of traditional, cloud-native, and containerized, serverless web service-based applications and event-driven applications. The following diagram illustrates this architecture.
The Amazon Redshift Data API simplifies data access, ingest, and egress from programming languages and platforms supported by the AWS SDK such as Python, Go, Java, Node.js, PHP, Ruby, and C++.
The Data API simplifies access to Amazon Redshift by eliminating the need for configuring drivers and managing database connections. Instead, you can run SQL commands to an Amazon Redshift cluster by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later. Your query results are stored for 24 hours. The Data API federates AWS Identity and Access Management (IAM) credentials so you can use identity providers like Okta or Azure Active Directory or database credentials stored in Secrets Manager without passing database credentials in API calls.
For customers using AWS Lambda, the Data API provides a secure way to access your database without the additional overhead for Lambda functions to be launched in an Amazon Virtual Private Cloud (Amazon VPC). Integration with the AWS SDK provides a programmatic interface to run SQL statements and retrieve results asynchronously.
Relevant use cases
The Amazon Redshift Data API is not a replacement for JDBC and ODBC drivers, and is suitable for use cases where you don’t need a persistent connection to a cluster. It’s applicable in the following use cases:
- Building a serverless data processing workflow.
- Designing asynchronous web dashboards because the Data API lets you run long-running queries without having to wait for it to complete.
- Running your query one time and retrieving the results multiple times without having to run the query again within 24 hours.
- Building your ETL pipelines with AWS Step Functions, Lambda, and stored procedures.
- Having simplified access to Amazon Redshift from Amazon SageMaker and Jupyter notebooks.
- Building event-driven applications with Amazon EventBridge and Lambda.
- Scheduling SQL scripts to simplify data load, unload, and refresh of materialized views.
The Data API GitHub repository provides examples for different use cases.
Create an Amazon Redshift cluster
If you haven’t already created an Amazon Redshift cluster, or want to create a new one, see Step 1: Create an IAM role. In this post, we create a table and load data using the COPY command. Make sure that the IAM role you attach to your cluster has
Prerequisites for using the Data API
You must be authorized to access the Amazon Redshift Data API. Amazon Redshift provides the
RedshiftDataFullAccess managed policy, which offers full access to Data APIs. This policy also allows access to Amazon Redshift clusters, Secrets Manager, and IAM API operations needed to authenticate and access an Amazon Redshift cluster by using temporary credentials. If you want to use temporary credentials with the managed policy
RedshiftDataFullAccess, you have to create one with the user name in the database as
You can also create your own IAM policy that allows access to specific resources by starting with
RedshiftDataFullAccess as a template. For details, refer to Querying a database using the query editor.
The Data API allows you to access your database either using your IAM credentials or secrets stored in Secrets Manager. In this post, we use Secrets Manager.
For instructions on using database credentials for the Data API, see How to rotate Amazon Redshift credentials in AWS Secrets Manager.
Use the Data API from the AWS CLI
You can use the Data API from the AWS CLI to interact with the Amazon Redshift cluster. For instructions on configuring the AWS CLI, see Setting up the Amazon Redshift CLI. The Amazon Redshift CLI (
aws redshift) is a part of AWS CLI that lets you manage Amazon Redshift clusters, such as creating, deleting, and resizing them. The Data API now provides a command line interface to the AWS CLI (
redshift-data) that allows you to interact with the databases in an Amazon Redshift cluster.
Before we get started, ensure that you have the updated AWS SDK configured.
You can invoke help using the following command:
The following table shows you different commands available with the Data API CLI.
||Lists the databases in a cluster.|
||Lists the schemas in a database. You can filter this by a matching schema pattern.|
||Lists the tables in a database. You can filter the tables list by a schema name pattern, a matching table name pattern, or a combination of both.|
||Describes the detailed information about a table including column metadata.|
||Runs a SQL statement, which can be SELECT,DML, DDL, COPY, or UNLOAD.|
||Runs multiple SQL statements in a batch as a part of single transaction. The statements can be SELECT, DML, DDL, COPY, or UNLOAD.|
|Cancels a running query. To be canceled, a query must be in the RUNNING state.|
||Describes the details of a specific SQL statement run. The information includes when the query started, when it finished, the number of rows processed, and the SQL statement.|
||Lists the SQL statements. By default, only finished statements are shown.|
||Fetches the temporarily cached result of the query. The result set contains the complete result set and the column metadata. You can paginate through a set of records to retrieve the entire result as needed.|
If you want to get help on a specific command, run the following command:
Now we look at how you can use these commands. First, get the secret key ARN by navigating to your key on the Secrets Manager console.
Most organizations use a single database in their Amazon Redshift cluster. You can use the following command to list the databases you have in your cluster. This operation requires you to connect to a database and therefore requires database credentials:
Similar to listing databases, you can list your schemas by using the list-schemas command:
You have several schemas that match
demo3, and so on). You can optionally provide a pattern to filter your results matching to that pattern:
The Data API provides a simple command,
list-tables, to list tables in your database. You might have thousands of tables in a schema; the Data API lets you paginate your result set or filter the table list by providing filter conditions.
You can search across your schema with
table-pattern; for example, you can filter the table list by all tables across all your schemas in the database. See the following code:
You can filter your tables list in a specific schema pattern:
Run SQL commands
You can run SELECT, DML, DDL, COPY, or UNLOAD commands for Amazon Redshift with the Data API. You can optionally specify a name for your statement. You can optionally specify a name for your statement, and if you want to send an event to EventBridge after the query runs. The query is asynchronous, and you get a query ID after running a query.
Create a schema
Let’s now use the Data API to see how you can create a schema. The following command lets you create a schema in your database. You don’t have to run this SQL if you have pre-created the schema.
The following shows an example output. We will discuss later how you can check the status of a SQL that you executed with execute-statement
We discuss later how you can check the status of a SQL that you ran with
Create a table
You can use the following command to create a table with the CLI.
Load sample data
The COPY command lets you load bulk data into your table in Amazon Redshift. You can use the following command to load data into the table we created earlier:
The following query uses the table we created earlier:
If you’re fetching a large amount of data, using UNLOAD is recommended. You can unload data into Amazon Simple Storage Service (Amazon S3) either using CSV or Parquet format. UNLOAD uses the MPP capabilities of your Amazon Redshift cluster and is faster than retrieving a large amount of data to the client side.
The following shows an example output:
You can fetch results using the query ID that you receive as an output of
Check the status of a statement
You can check the status of your statement by using
describe-statement. The output for
describe-statement provides additional details such as PID, query duration, number of rows in and size of the result set, and the query ID given by Amazon Redshift. See the following command:
The following is an example output:
The status of a statement can be FINISHED, RUNNING, or FAILED.
Run SQL statements with parameters
You can run SQL statements with parameters. The following example uses two named parameters in the SQL that is specified using a name-value pair:
QueryParameters along with
You can map the name-value pair in the parameters list to one or more parameters in the SQL text, and the name-value parameter can be in random order. You can specify type cast, for example,
:sellerid::BIGINT, with a parameter. You can also specify a comment in the SQL text while using parameters. You can’t specify a NULL value or zero-length value as a parameter.
Cancel a running statement
If your query is still running, you can use
cancel-statement to cancel a SQL query. See the following command:
Fetch results from your query
You can fetch the query results by using
get-statement-result. The query result is stored for 24 hours. See the following command:
The output of the result contains metadata such as the number of records fetched, column metadata, and a token for pagination.
Run multiple SQL statements
You can run multiple SELECT, DML, DDL, COPY, or UNLOAD commands for Amazon Redshift in a batch with the Data API. The
batch-execute-statement enables you to create tables and run multiple COPY commands or create temporary tables as a part of your reporting system and run queries on that temporary table. See the following code:
describe-statement for a multi-statement query shows the status of all sub-statements:
In the preceding example, we had two SQL statements and therefore the output includes the ID for the SQL statements as
23d99d7f-fd13-4686-92c8-e2c279715c21:2. Each sub-statement of a batch SQL statement has a status, and the status of the batch statement is updated with the status of the last sub-statement. For example, if the last statement has status FAILED, then the status of the batch statement shows as FAILED.
You can fetch query results for each statement separately. In our example, the first statement is a a SQL statement to create a temporary table, so there are no results to retrieve for the first statement. You can retrieve the result set for the second statement by providing the statement ID for the sub-statement:
Export the data
Amazon Redshift allows you to export from database tables to a set of files in an S3 bucket by using the UNLOAD command with a SELECT statement. You can unload data in either text or Parquet format. The following command shows you an example of how you can use the data lake export with the Data API:
You can use the
batch-execute-statement if you want to use multiple statements with UNLOAD or combine UNLOAD with other SQL statements.
Use the Data API from the AWS SDK
You can use the Data API in any of the programming languages supported by the AWS SDK. For this post, we use the AWS SDK for Python (Boto3) as an example to illustrate the capabilities of the Data API.
We first import the Boto3 package and establish a session:
Get a client object
You can create a client object from the
boto3.Session object and using
If you don’t want to create a session, your client is as simple as the following code:
Run a statement
The following example code uses the Secrets Manager key to run a statement. For this post, we use the table we created earlier. You can use DDL, DML, COPY, and UNLOAD as a parameter:
As we discussed earlier, running a query is asynchronous; running a statement returns an
ExecuteStatementOutput, which includes the statement ID.
If you want to publish an event to
EventBridge when the statement is complete, you can use the additional parameter
WithEvent set to
Use IAM credentials
Amazon Redshift allows users to get temporary database credentials using
GetClusterCredentials. We recommend scoping the access to a specific cluster and database user if you’re allowing your users to use temporary credentials. The following example code gets temporary IAM credentials. As you can see in the code, we use
redshift_data_api_user. The managed policy
RedshiftDataFullAccess scopes to use temporary credentials only to
Describe a statement
You can use
describe_statement to find the status of the query and number of records retrieved:
Fetch results from your query
You can use
get_statement_result to retrieve results for your query if your query is complete:
command returns a JSON object that includes metadata for the result and the actual result set. You might need to process the data to format the result if you want to display it in a user-friendly format.
Fetch and format results
For this post, we demonstrate how to format the results with the Pandas framework. The
post_process function processes the metadata and results to populate a DataFrame. The query function retrieves the result from a database in an Amazon Redshift cluster. See the following code:
import pandas as pd
In this post, we demonstrated using the Data API with Python. However, you can use the Data API with other programming languages supported by the AWS SDK.
We recommend the following best practices when using the Data API:
- Federate your IAM credentials to the database to connect with Amazon Redshift. Amazon Redshift allows users to get temporary database credentials with
GetClusterCredentials. We recommend scoping the access to a specific cluster and database user if you’re granting your users temporary credentials. For more information, see Example policy for using GetClusterCredentials.
- Use a custom policy to provide fine-grained access to the Data API in the production environment if you don’t want your users to use temporary credentials. You have to use Secrets Manager to manage your credentials in such use cases.
- Ensure that the record size that you retrieve is smaller than 64 KB.
- Don’t retrieve a large amount of data from your client and use the UNLOAD command to export the query results to Amazon S3. You’re limited to retrieving only 100 MB of data with the Data API.
- Don’t forget to retrieve your results within 24 hours; results are stored only for 24 hours.
Datacoral is a fast-growing startup that offers an AWS-native data integration solution for analytics. Datacoral integrates data from databases, APIs, events, and files into Amazon Redshift while providing guarantees on data freshness and data accuracy to ensure meaningful analytics. With the Data API, they can create a completely event-driven and serverless platform that makes data integration and loading easier for our mutual customers.
Founder and CEO Raghu Murthy says, “As an Amazon Redshift Ready Advanced Technology Partner, we have worked with the Redshift team to integrate their Redshift API into our product. The Redshift API provides the asynchronous component needed in our platform to submit and respond to data pipeline queries running on Amazon Redshift. It is the last piece of the puzzle for us to offer our customers a fully event-driven and serverless platform that is robust, cost-effective, and scales automatically. We are thrilled to be part of the launch.”
Zynga Inc. is an American game developer running social video game services, founded in April 2007. Zynga uses Amazon Redshift as its central data warehouse for game event, user, and revenue data. They use the data in the data warehouse for analytics, BI reporting, and AI/ML across all games and departments. Zynga wants to replace any programmatic access clients connected to Amazon Redshift with the new Data API. Currently, Zynga’s services connect using a wide variety of clients and drivers, and they plan to consolidate all of them. This will remove the need for Amazon Redshift credentials and regular password rotations.
Johan Eklund, Senior Software Engineer, Analytics Engineering team in Zynga, who participated in the beta testing, says, “The Data API would be an excellent option for our services that will use Amazon Redshift programmatically. The main improvement would be authentication with IAM roles without having to involve the JDBC/ODBC drivers since they are all AWS hosted. Our most common service client environments are PHP, Python, Go, plus a few more.”
In this post, we introduced you to the newly launched Amazon Redshift Data API. We also demonstrated how to use the Data API from the Amazon Redshift CLI and Python using the AWS SDK. We also provided best practices for using the Data API.
About the Authors
Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry and has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).
Martin Grund is a Principal Engineer working in the Amazon Redshift team on all topics related to data lake (e.g. Redshift Spectrum), AWS platform integration and security.
Chao Duan is a software development manager at Amazon Redshift, where he leads the development team focusing on enabling self-maintenance and self-tuning with comprehensive monitoring for Redshift. Chao is passionate about building high-availability, high-performance, and cost-effective database to empower customers with data-driven decision making.
Daisy Yanrui Zhang is a software Dev Engineer working in the Amazon Redshift team on database monitoring, serverless database and database user experience.