AWS Big Data Blog

Using the Amazon Redshift Data API to interact with Amazon Redshift clusters

June 2023: This post was reviewed and updated for accuracy.
July 2021: This post was reviewed and updated to include multi-statement and parameterization support.

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

As a data engineer or application developer, for some use cases, you want to interact with Amazon Redshift to load or query data with a simple API endpoint without having to manage persistent connections. With Amazon Redshift Data API, you can interact with Amazon Redshift without having to configure JDBC or ODBC. This makes it easier and more secure to work with Amazon Redshift and opens up new use cases.

This post explains how to use the Amazon Redshift Data API from the AWS Command Line Interface (AWS CLI) and Python. We also explain how to use AWS Secrets Manager to store and retrieve credentials for the Data API.

Introducing the Data API

The Amazon Redshift Data API enables you to painlessly access data from Amazon Redshift with all types of traditional, cloud-native, and containerized, serverless web service-based applications and event-driven applications. The following diagram illustrates this architecture.

The Amazon Redshift Data API simplifies data access, ingest, and egress from programming languages and platforms supported by the AWS SDK such as Python, Go, Java, Node.js, PHP, Ruby, and C++.

The Data API simplifies access to Amazon Redshift by eliminating the need for configuring drivers and managing database connections. Instead, you can run SQL commands to an Amazon Redshift cluster by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later. Your query results are stored for 24 hours. The Data API federates AWS Identity and Access Management (IAM) credentials so you can use identity providers like Okta or Azure Active Directory or database credentials stored in Secrets Manager without passing database credentials in API calls.

For customers using AWS Lambda, the Data API provides a secure way to access your database without the additional overhead for Lambda functions to be launched in an Amazon Virtual Private Cloud (Amazon VPC). Integration with the AWS SDK provides a programmatic interface to run SQL statements and retrieve results asynchronously.

Relevant use cases

The Amazon Redshift Data API is not a replacement for JDBC and ODBC drivers, and is suitable for use cases where you don’t need a persistent connection to a cluster. It’s applicable in the following use cases:

  • Accessing Amazon Redshift from custom applications with any programming language supported by the AWS SDK. This enables you to integrate web service-based applications to access data from Amazon Redshift using an API to run SQL statements. For example, you can run SQL from JavaScript.
  • Building a serverless data processing workflow.
  • Designing asynchronous web dashboards because the Data API lets you run long-running queries without having to wait for it to complete.
  • Running your query one time and retrieving the results multiple times without having to run the query again within 24 hours.
  • Building your ETL pipelines with AWS Step Functions, Lambda, and stored procedures.
  • Having simplified access to Amazon Redshift from Amazon SageMaker and Jupyter notebooks.
  • Building event-driven applications with Amazon EventBridge and Lambda.
  • Scheduling SQL scripts to simplify data load, unload, and refresh of materialized views.

The Data API GitHub repository provides examples for different use cases.

Create an Amazon Redshift cluster

If you haven’t already created an Amazon Redshift cluster, or want to create a new one, see Step 1: Create an IAM role. In this post, we create a table and load data using the COPY command. Make sure that the IAM role you attach to your cluster has AmazonS3ReadOnlyAccess permission.

Prerequisites for using the Data API

You must be authorized to access the Amazon Redshift Data API. Amazon Redshift provides the RedshiftDataFullAccess managed policy, which offers full access to Data APIs. This policy also allows access to Amazon Redshift clusters, Secrets Manager, and IAM API operations needed to authenticate and access an Amazon Redshift cluster by using temporary credentials. If you want to use temporary credentials with the managed policy RedshiftDataFullAccess, you have to create one with the user name in the database as redshift_data_api_user.

You can also create your own IAM policy that allows access to specific resources by starting with RedshiftDataFullAccess as a template. For details, refer to Querying a database using the query editor.

The Data API allows you to access your database either using your IAM credentials or secrets stored in Secrets Manager. In this post, we use Secrets Manager.

For instructions on using database credentials for the Data API, see How to rotate Amazon Redshift credentials in AWS Secrets Manager.

Use the Data API from the AWS CLI

You can use the Data API from the AWS CLI to interact with the Amazon Redshift cluster. For instructions on configuring the AWS CLI, see Setting up the Amazon Redshift CLI. The Amazon Redshift CLI (aws redshift) is a part of AWS CLI that lets you manage Amazon Redshift clusters, such as creating, deleting, and resizing them. The Data API now provides a command line interface to the AWS CLI (redshift-data) that allows you to interact with the databases in an Amazon Redshift cluster.

Before we get started, ensure that you have the updated AWS SDK configured.

You can invoke help using the following command:

aws redshift-data help

The following table shows you different commands available with the Data API CLI.

Command Description
list-databases Lists the databases in a cluster.
list-schemas Lists the schemas in a database. You can filter this by a matching schema pattern.
list-tables Lists the tables in a database. You can filter the tables list by a schema name pattern, a matching table name pattern, or a combination of both.
describe-table Describes the detailed information about a table including column metadata.
execute-statement Runs a SQL statement, which can be SELECT,DML, DDL, COPY, or UNLOAD.
batch-execute-statement Runs multiple SQL statements in a batch as a part of single transaction. The statements can be SELECT, DML, DDL, COPY, or UNLOAD.
cancel-statement Cancels a running query. To be canceled, a query must be in the RUNNING state.
describe-statement Describes the details of a specific SQL statement run. The information includes when the query started, when it finished, the number of rows processed, and the SQL statement.
list-statements Lists the SQL statements. By default, only finished statements are shown.
get-statement-result Fetches the temporarily cached result of the query. The result set contains the complete result set and the column metadata. You can paginate through a set of records to retrieve the entire result as needed.

If you want to get help on a specific command, run the following command:

aws redshift-data list-tables help 

Now we look at how you can use these commands. First, get the secret key ARN by navigating to your key on the Secrets Manager console.

List databases

Most organizations use a single database in their Amazon Redshift cluster. You can use the following command to list the databases you have in your cluster. This operation requires you to connect to a database and therefore requires database credentials:

aws redshift-data list-databases  --cluster-identifier <your-cluster-id>--secret-arn  <replace-with-your-secret ARN> --database <your-db-name> --region us-west-2

List schema

Similar to listing databases, you can list your schemas by using the list-schemas command:

aws redshift-data list-schemas  --cluster-identifier <your-cluster-id> --secret-arn <replace-with-your-secret ARN> --database <your-db-name> --region us-west-2 

You have several schemas that match demo (demo, demo2, demo3, and so on). You can optionally provide a pattern to filter your results matching to that pattern:

aws redshift-data list-schemas --database <your-db-name> --cluster-identifier <your-cluster-id> --secret-arn <replace-with-your-secret ARN> --region <your-region> --schema-pattern "demo%"

List tables

The Data API provides a simple command, list-tables, to list tables in your database. You might have thousands of tables in a schema; the Data API lets you paginate your result set or filter the table list by providing filter conditions.

You can search across your schema with table-pattern; for example, you can filter the table list by all tables across all your schemas in the database. See the following code:

aws redshift-data list-tables --database dev --cluster-identifier <your-cluster-id>--secret-arn <your-secret-arn> --database <your-db-name> --region <your-region> --table-pattern "ven%"

You can filter your tables list in a specific schema pattern:

aws redshift-data list-tables --database <your-db-name> --cluster-identifier <your-cluster-id> --secret-arn <your-secret-arn> --region <your-region> --table-pattern "ven%" --schema-pattern demo

Run SQL commands

You can run SELECT, DML, DDL, COPY, or UNLOAD commands for Amazon Redshift with the Data API. You can optionally specify a name for your statement. You can optionally specify a name for your statement, and if you want to send an event to EventBridge after the query runs. The query is asynchronous, and you get a query ID after running a query.

Create a schema

Let’s now use the Data API to see how you can create a schema. The following command lets you create a schema in your database. You don’t have to run this SQL if you have pre-created the schema.

aws redshift-data execute-statement \
     --database <your-db-name>  \
     --cluster-identifier <your-cluster-id> \
     --secret-arn <your-secret-arn> \
     --sql "CREATE SCHEMA demo;" \
     --region <your-region>

The following shows an example output. We will discuss later how you can check the status of a SQL that you executed with execute-statement

{
    "ClusterIdentifier": "redshift-cluster-1",
    "CreatedAt": "2020-09-11T16:06:28.876000-07:00",
    "Database": "dev",
    "Id": "a6e0072b-4641-4e67-9105-aceb7f57266a",
    "SecretArn": "<Your-ARN->"
}

We discuss later how you can check the status of a SQL that you ran with execute-statement.

Create a table

You can use the following command to create a table with the CLI.

aws redshift-data execute-statement \
    --database <your-db-name> \
    --cluster-identifier <your-cluster-id> \
    --secret-arn <your-secret-arn>  \
    --sql "CREATE TABLE demo.green_201601(
  vendorid                VARCHAR(4), \
  pickup_datetime         TIMESTAMP, \
  dropoff_datetime        TIMESTAMP, \
  store_and_fwd_flag      VARCHAR(1), \
  ratecode                INT, \
  pickup_longitude        FLOAT4, \
  pickup_latitude         FLOAT4, \
  dropoff_longitude       FLOAT4, \
  dropoff_latitude        FLOAT4, \
  passenger_count         INT, \
  trip_distance           FLOAT4, \
  fare_amount             FLOAT4, \
  extra                   FLOAT4, \
  mta_tax                 FLOAT4, \
  tip_amount              FLOAT4, \
  tolls_amount            FLOAT4, \
  ehail_fee               FLOAT4, \
  improvement_surcharge   FLOAT4, \
  total_amount            FLOAT4, \
  payment_type            VARCHAR(4),\
  trip_type               VARCHAR(4));" \
    --region <your-region> 

Load sample data

The COPY command lets you load bulk data into your table in Amazon Redshift. You can use the following command to load data into the table we created earlier:

aws redshift-data execute-statement \
    --database <your-db-name> \
    --cluster-identifier <your-cluster-id> \
    --secret-arn <your-secret-arn>  \
    --region <your-region>  
    --sql "COPY demo.green_201601 \
FROM 's3://us-west-2.serverless-analytics/NYC-Pub/green/green_tripdata_2016-01' \
IAM_ROLE 'arn:aws:iam::<Your_ACCOUNT>:role/<YourRole>' \
DATEFORMAT 'auto' \
IGNOREHEADER 1 \
DELIMITER ',' \
IGNOREBLANKLINES \
REGION 'us-west-2';" 

Retrieve Data

The following query uses the table we created earlier:

aws redshift-data execute-statement \
    --database <your-db-name> \
    --cluster-identifier <your-cluster-id> \
    --secret-arn <your-secret-arn>  \
    --region <your-region> \
    --sql "SELECT ratecode, COUNT(*) FROM demo.green_201601 WHERE \
trip_distance > 5 GROUP BY 1 ORDER BY 1;"
    

If you’re fetching a large amount of data, using UNLOAD is recommended. You can unload data into Amazon Simple Storage Service (Amazon S3) either using CSV or Parquet format. UNLOAD uses the MPP capabilities of your Amazon Redshift cluster and is faster than retrieving a large amount of data to the client side.

The following shows an example output:

{
    "ClusterIdentifier": "redshift-cluster-1",
    "CreatedAt": "2020-09-01T09:39:45.693000-07:00",
    "Database": "dev",
    "Id": "fc4111f0-0e01-456a-83cf-d5922a8b100a",
    "SecretArn": "<your-secret-arn>"
}

You can fetch results using the query ID that you receive as an output of execute-statement.

Check the status of a statement

You can check the status of your statement by using describe-statement. The output for describe-statement provides additional details such as PID, query duration, number of rows in and size of the result set, and the query ID given by Amazon Redshift. See the following command:

 aws redshift-data describe-statement \
    --id 76f59b84-34a1-481b-a37d-a7b7e1ea57dc \

The following is an example output:

{
    "ClusterIdentifier": "redshift-cluster-1",
    "CreatedAt": "2020-08-31T15:03:08.852000-07:00",
    "Duration": 82642162,
    "Id": "76f59b84-34a1-481b-a37d-a7b7e1ea57dc",
    "QueryString": " SELECT ratecode, COUNT(*) FROM demo.green_201601 WHERE trip_distance > 5 GROUP BY 1 ORDER BY 1;",
    "RedshiftPid": 27815,
    "RedshiftQueryId": 709972,
    "ResultRows": 6,
    "ResultSize": 186,
    "SecretArn": "<your-secret-arn>",
    "Status": "FINISHED",
    "UpdatedAt": "2020-08-31T15:03:09.505000-07:00"
}

The status of a statement can be FINISHED, RUNNING, or FAILED.

Run SQL statements with parameters

You can run SQL statements with parameters. The following example uses two named parameters in the SQL that is specified using a name-value pair:

aws redshift-data execute-statement --cluster-identifier redshift-debu-cluster-1 --database dev --db-user awsuser --sql "select sellerid,sum(pricepaid) totalsales from sales where eventid >= :eventid and sellerid>:selrid group by sellerid"  --parameters "[{\"name\": \"selrid\", \"value\": \"100\"},{\"name\": \"eventid\", \"value\": \"100\"}]"

The describe-statement returns QueryParameters along with QueryString:

aws redshift-data describe-statement  --id 19de0a3f-e8d3-47b4-a52d-7025579da222 

{
    "ClusterIdentifier": "redshift-debu-cluster-1",
    "CreatedAt": "2021-07-22T10:28:27.974000-07:00",
    "Duration": 340013362,
    "HasResultSet": true,
    "Id": "64325947-fa69-4794-992d-09c28268cf23",
    "QueryParameters": [
        {
            "name": "selrid",
            "value": "100"
        },
        {
            "name": "eventid",
            "value": "100"
        }
    ],
    "QueryString": "select sellerid,sum(pricepaid) totalsales from sales where eventid >= :eventid and sellerid>:selrid group by sellerid",
    "RedshiftPid": 23989,
    "RedshiftQueryId": 3935637,
    "ResultRows": 43211,
    "ResultSize": 764291,
    "Status": "FINISHED",
    "UpdatedAt": "2021-07-22T10:28:28.811000-07:00"
}

You can map the name-value pair in the parameters list to one or more parameters in the SQL text, and the name-value parameter can be in random order. You can specify type cast, for example, :sellerid::BIGINT, with a parameter. You can also specify a comment in the SQL text while using parameters. You can’t specify a NULL value or zero-length value as a parameter.

Cancel a running statement

If your query is still running, you can use cancel-statement to cancel a SQL query. See the following command:

aws redshift-data cancel-statement --id 39a0de2f-e85e-45ff-a0d7-cd074c348120        --region  <your-region> 

Fetch results from your query

You can fetch the query results by using get-statement-result. The query result is stored for 24 hours. See the following command:

aws redshift-data get-statement-result     --id 7b61da88-1b11-4ade-956a-21085a29118d     --region <your-region> 

The output of the result contains metadata such as the number of records fetched, column metadata, and a token for pagination.

Run multiple SQL statements

You can run multiple SELECT, DML, DDL, COPY, or UNLOAD commands for Amazon Redshift in a batch with the Data API. The batch-execute-statement enables you to create tables and run multiple COPY commands or create temporary tables as a part of your reporting system and run queries on that temporary table. See the following code:

aws redshift-data batch-execute-statement  \
--cluster-identifier <your-cluster-id>  \
--database <your-db-name> \
--db-user <your-db-user>  \
--sql "create temporary table mysales \
(firstname, lastname, total_quantity ) as \
SELECT firstname, lastname, total_quantity \
FROM   (SELECT buyerid, sum(qtysold) total_quantity \
        FROM  sales  \
        GROUP BY buyerid \
        ORDER BY total_quantity desc limit 10) Q, users \
WHERE Q.buyerid = userid \ 
ORDER BY Q.total_quantity desc;" "select * from mysales limit 100;"

The describe-statement for a multi-statement query shows the status of all sub-statements:

"ClusterIdentifier": "redshift-cluster-1",
    "CreatedAt": "2021-07-21T12:41:33.586000-07:00",
    "Duration": 3469770288,
    "Id": "23d99d7f-fd13-4686-92c8-e2c279715c21",
    "RedshiftPid": 17292,
    "RedshiftQueryId": 0,
    "ResultRows": -1,
    "ResultSize": -1,
    "Status": "FINISHED",
    "SubStatements": [
        {
            "CreatedAt": "2021-07-21T12:41:33.746000-07:00",
            "Duration": 224302220,
            "HasResultSet": false,
            "Id": "23d99d7f-fd13-4686-92c8-e2c279715c21:1",
            "QueryString": "create temporary table mysales (firstname, lastname, total_quantity ) as \nSELECT firstname, lastname, total_quantity \nFROM   (SELECT buyerid, sum(qtysold) total_quantity\n        FROM  sales\n        GROUP BY buyerid\n        ORDER BY total_quantity desc limit 10) Q, users\nWHERE Q.buyerid = userid\nORDER BY Q.total_quantity desc;",
            "RedshiftQueryId": 3921572,
            "ResultRows": 0,
            "ResultSize": 0,
            "Status": "FINISHED",
            "UpdatedAt": "2021-07-21T12:41:34.443000-07:00"
        },
        {
            "CreatedAt": "2021-07-21T12:41:33.746000-07:00",
            "Duration": 3245468068,
            "HasResultSet": true,
            "Id": "23d99d7f-fd13-4686-92c8-e2c279715c21:2",
            "QueryString": "select *\nfrom mysales limit 100;",
            "RedshiftQueryId": 3921574,
            "ResultRows": 40,
            "ResultSize": 1276,
            "Status": "FINISHED",
            "UpdatedAt": "2021-07-21T12:41:37.836000-07:00"
        }
    ],
    "UpdatedAt": "2021-07-21T12:41:37.857000-07:00"
}

In the preceding example, we had two SQL statements and therefore the output includes the ID for the SQL statements as 23d99d7f-fd13-4686-92c8-e2c279715c21:1 and 23d99d7f-fd13-4686-92c8-e2c279715c21:2. Each sub-statement of a batch SQL statement has a status, and the status of the batch statement is updated with the status of the last sub-statement. For example, if the last statement has status FAILED, then the status of the batch statement shows as FAILED.

You can fetch query results for each statement separately. In our example, the first statement is a a SQL statement to create a temporary table, so there are no results to retrieve for the first statement. You can retrieve the result set for the second statement by providing the statement ID for the sub-statement:

aws redshift-data get-statement-result --id 23d99d7f-fd13-4686-92c8-e2c279715c21:2

Export the data

Amazon Redshift allows you to export from database tables to a set of files in an S3 bucket by using the UNLOAD command with a SELECT statement. You can unload data in either text or Parquet format. The following command shows you an example of how you can use the data lake export with the Data API:

aws redshift-data execute-statement --database <db-name> --cluster-identifier <cluster-name> \
--secret-arn <your-secret-arn> \ 
 --region <your-region> \
--sql "unload ('select * from demo.green_201601') to '<your-S3-bucket>' iam_role '<your-iam-role>'; " 

You can use the batch-execute-statement if you want to use multiple statements with UNLOAD or combine UNLOAD with other SQL statements.

Use the Data API from the AWS SDK

You can use the Data API in any of the programming languages supported by the AWS SDK. For this post, we use the AWS SDK for Python (Boto3) as an example to illustrate the capabilities of the Data API.

We first import the Boto3 package and establish a session:

def get_client(service, endpoint=None, region="us-west-2"):
    import botocore.session as bc
    session = bc.get_session()

    s = boto3.Session(botocore_session=session, region_name=region)
    if endpoint:
        return s.client(service, endpoint_url=endpoint)
    return s.client(service)

Get a client object

You can create a client object from the boto3.Session object and using RedshiftData:

rsd = get_client('redshift-data')

If you don’t want to create a session, your client is as simple as the following code:

import boto3
client = boto3.client('redshift-data')

Run a statement

The following example code uses the Secrets Manager key to run a statement. For this post, we use the table we created earlier. You can use DDL, DML, COPY, and UNLOAD as a parameter:

resp = rsd.execute_statement(
    SecretArn="<replace-with-your-secret-arn>",
    ClusterIdentifier="<replace-with-your-cluster-name>",
    Database="<replace-with-your-db-name>",
    Sql="SELECT ratecode, COUNT(*) totalrides FROM demo.green_201601 WHERE trip_distance > 5 GROUP BY 1 ORDER BY 1;"

As we discussed earlier, running a query is asynchronous; running a statement returns an ExecuteStatementOutput, which includes the statement ID.

If you want to publish an event to EventBridge when the statement is complete, you can use the additional parameter WithEvent set to true:

resp = rsd.execute_statement(
    Database=database,
    ClusterIdentifier=cluster,
    SecretArn=secretarn,
    Sql=sql,
    WithEvent=True
    ) 

Use IAM credentials

Amazon Redshift allows users to get temporary database credentials using GetClusterCredentials. We recommend scoping the access to a specific cluster and database user if you’re allowing your users to use temporary credentials. The following example code gets temporary IAM credentials. As you can see in the code, we use redshift_data_api_user. The managed policy RedshiftDataFullAccess scopes to use temporary credentials only to redshift_data_api_user.

def query(sql, cluster="redshift-cluster-1",database="dev",dbuser="redshift_data_api_user"):
     resp = rsd.execute_statement(
        Database=database,
        ClusterIdentifier=cluster,
        DbUser=dbuser,
        Sql=sql,
        WithEvent=True
     )

Describe a statement

You can use describe_statement to find the status of the query and number of records retrieved:

id=resp['Id']
desc = rsd.describe_statement(Id=id)
if desc["Status"] == "FINISHED":            
    print(desc["ResultRows"])

Fetch results from your query

You can use get_statement_result to retrieve results for your query if your query is complete:

if desc and desc["ResultRows"]  > 0:
    result = rsd.get_statement_result(Id=qid)

The get_statement_result

command returns a JSON object that includes metadata for the result and the actual result set. You might need to process the data to format the result if you want to display it in a user-friendly format.

Fetch and format results

For this post, we demonstrate how to format the results with the Pandas framework. The post_process function processes the metadata and results to populate a DataFrame. The query function retrieves the result from a database in an Amazon Redshift cluster. See the following code:

import pandas as pd

def post_process(meta, records):
    columns = [k["name"] for k in meta]
    rows = []
    for r in records:
        tmp = []
        for c in r:
            tmp.append(c[list(c.keys())[0]])
        rows.append(tmp)
    return pd.DataFrame(rows, columns=columns)

def query(sql, cluster="redshift-cluster-1", user="awsuser", database="dev"):
    resp = rsd.execute_statement(
        Database=database,
        ClusterIdentifier=cluster,
        DbUser=user,
        Sql=sql
    )
    qid = resp["Id"]
    print(qid)
    desc = None
    while True:
        desc = rsd.describe_statement(Id=qid)
        if desc["Status"] == "FINISHED":
            break
            print(desc["ResultRows"])
    if desc and desc["ResultRows"]  > 0:
        result = rsd.get_statement_result(Id=qid)
        rows, meta = result["Records"], result["ColumnMetadata"]
        return post_process(meta, rows)

pf=query("select venueid,venuename from venue  limit 100;")
print(pf)

In this post, we demonstrated using the Data API with Python. However, you can use the Data API with other programming languages supported by the AWS SDK.

Best practices

We recommend the following best practices when using the Data API:

  • Federate your IAM credentials to the database to connect with Amazon Redshift. Amazon Redshift allows users to get temporary database credentials with GetClusterCredentials. We recommend scoping the access to a specific cluster and database user if you’re granting your users temporary credentials. For more information, see Example policy for using GetClusterCredentials.
  • Use a custom policy to provide fine-grained access to the Data API in the production environment if you don’t want your users to use temporary credentials. You have to use Secrets Manager to manage your credentials in such use cases.
  • Ensure that the record size that you retrieve is smaller than 64 KB.
  • Don’t retrieve a large amount of data from your client and use the UNLOAD command to export the query results to Amazon S3. You’re limited to retrieving only 100 MB of data with the Data API.
  • Don’t forget to retrieve your results within 24 hours; results are stored only for 24 hours.

Customer feedback

Datacoral is a fast-growing startup that offers an AWS-native data integration solution for analytics. Datacoral integrates data from databases, APIs, events, and files into Amazon Redshift while providing guarantees on data freshness and data accuracy to ensure meaningful analytics. With the Data API, they can create a completely event-driven and serverless platform that makes data integration and loading easier for our mutual customers.

Founder and CEO Raghu Murthy says, “As an Amazon Redshift Ready Advanced Technology Partner, we have worked with the Redshift team to integrate their Redshift API into our product. The Redshift API provides the asynchronous component needed in our platform to submit and respond to data pipeline queries running on Amazon Redshift. It is the last piece of the puzzle for us to offer our customers a fully event-driven and serverless platform that is robust, cost-effective, and scales automatically. We are thrilled to be part of the launch.”

Zynga Inc. is an American game developer running social video game services, founded in April 2007. Zynga uses Amazon Redshift as its central data warehouse for game event, user, and revenue data. They use the data in the data warehouse for analytics, BI reporting, and AI/ML across all games and departments. Zynga wants to replace any programmatic access clients connected to Amazon Redshift with the new Data API. Currently, Zynga’s services connect using a wide variety of clients and drivers, and they plan to consolidate all of them. This will remove the need for Amazon Redshift credentials and regular password rotations.

Johan Eklund, Senior Software Engineer, Analytics Engineering team in Zynga, who participated in the beta testing, says, “The Data API would be an excellent option for our services that will use Amazon Redshift programmatically. The main improvement would be authentication with IAM roles without having to involve the JDBC/ODBC drivers since they are all AWS hosted. Our most common service client environments are PHP, Python, Go, plus a few more.”

Conclusion

In this post, we introduced you to the newly launched Amazon Redshift Data API. We also demonstrated how to use the Data API from the Amazon Redshift CLI and Python using the AWS SDK. We also provided best practices for using the Data API.

To learn more, see Using the Amazon Redshift Data API or visit the Data API GitHub repository for code examples. For serverless, refer to Use the Amazon Redshift Data API to interact with Amazon Redshift Serverless.


About the Authors

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry and has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Martin Grund is a Principal Engineer working in the Amazon Redshift team on all topics related to data lake (e.g. Redshift Spectrum), AWS platform integration and security.

Chao Duan is a software development manager at Amazon Redshift, where he leads the development team focusing on enabling self-maintenance and self-tuning with comprehensive monitoring for Redshift. Chao is passionate about building high-availability, high-performance, and cost-effective database to empower customers with data-driven decision making.

Daisy Yanrui Zhang is a software Dev Engineer working in the Amazon Redshift team on database monitoring, serverless database and database user experience.